Markov Text Analysis

bwall · October 15, 2012

If you have known me a while, you know I like to work with Markov chains to do various things. In this case, the goal is to match typed text to its author by comparing Markov chains. Here's a brief overview of the concept. We all tend to type a bit differently: making distinguishable typos, phrasing things certain ways, even hitting random characters in a similar way. People who have been typing for a long time mostly do it from muscle memory, which is not very different from a Markov chain.

If we analyze the sequence in which characters are typed, we can use that information in Markov chains to determine how closely a piece of text matches a person's typing style. This idea can be applied to identify who out of a group of people typed something (so it is good for confirming or denying that a profile using the same nickname as another belongs to the same person on more than just the name), to generate the most likely password guesses for a particular user, and even to detect whether a new user connected to a server actually matches the usual user. It has a lot of applications, but the results can be hard to decipher. That is something I am still working on. The Python source for this project can be found here: https://github.com/bwall/markov-analysis

I'm still working on it and trying to make input/output as widely applicable as possible.
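To make the idea concrete, here is a minimal sketch of character-level Markov matching. It is not the code from the linked repository: the function names, the add-one smoothing, and the `alphabet_size` of 256 are assumptions chosen for illustration. Each known author gets a table of character-to-character transition counts, and an unknown text is scored by its average log-probability under each table.

```python
import math
from collections import Counter, defaultdict

def train_chain(text):
    """Count first-order transitions: chain[a][b] = times b followed a."""
    chain = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        chain[a][b] += 1
    return chain

def avg_log_likelihood(chain, text, alphabet_size=256):
    """Average log-probability per transition of `text` under `chain`.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    transitions from scoring as negative infinity.
    """
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for a, b in pairs:
        row = chain.get(a, Counter())
        total += math.log((row[b] + 1) / (sum(row.values()) + alphabet_size))
    return total / len(pairs)

def best_match(samples, unknown):
    """Return the author whose chain gives `unknown` the highest score."""
    chains = {author: train_chain(text) for author, text in samples.items()}
    return max(chains, key=lambda author: avg_log_likelihood(chains[author], unknown))
```

Calling `best_match({"alice": alice_text, "bob": bob_text}, unknown_text)` returns the name with the closest chain. The raw scores are only meaningful relative to each other, which is one reason the results can be hard to interpret.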

bobbyb1980 · October 15, 2012

I have experimented with this idea. I didn't run your script, but it looks like it resembles the Google trigrams method of handwriting/language identification, or comparing text to a predefined set of strings.

In English, I found this method to be very prone to false positives. Just because two texts use a lot of words with '-ing' or '-ly', or even a particular word or vocabulary, doesn't mean they have the same author. The average English speaker uses about 17K base words, which IMO isn't enough to rely only on this method when you're talking about matching possibly billions of words and tens of thousands of authors.

For mine, I had to add more variables to increase the chances of true positives. For example, if the author uses "like/as" in the same line as "a/to", they're probably doing a simile, or if the author uses word patterns like "word1....word2....word2....word1", it is probably a metaphor. Then you can say: OK, text 1 and text 2 not only have similar trigrams, but both authors use hyperboles and similes. That gives you an extra "layer" to weed out false positives based solely on string matches. There are tons of figure-of-speech patterns like this that a script can recognize.
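A rough sketch of what such a simile "layer" could look like with a regular expression. The pattern and the `simile_rate` name are assumptions for illustration, not the script described above, and the heuristic will misfire on sentences such as "I like a good book".

```python
import re

# Heuristic only: "like/as" followed by an article, or "as <word> as".
SIMILE = re.compile(
    r"\b(?:like|as)\s+(?:an?|the)\s+\w+|\bas\s+\w+\s+as\b",
    re.IGNORECASE,
)

def simile_rate(text):
    """Fraction of sentences that contain a simile-like pattern."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if SIMILE.search(s))
    return hits / len(sentences)
```

Comparing `simile_rate` for two texts gives one extra number to set beside the trigram similarity. On its own it says very little about authorship.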

Edited October 15, 2012 by bobbyb1980

digip · October 15, 2012

Do you have links to the Google Trigrams thing? Never heard of it before.

Edited October 15, 2012 by digip

bobbyb1980 · October 15, 2012

Could be wrong, but I think the trigrams were originally designed by Google (but abandoned after they stopped maintaining the translator). I heard that somewhere.

All a "trigram" is, is a massive list of three-character strings like "ing", "and" or "ion" that were originally used to identify what language a text is written in. People sometimes use them to try to ID the author of a text as well, but in my limited experience, using trigrams (or matching character strings) to ID an author is inaccurate and shows a lot of false positives.

The first link describes this concept in detail, and the other two show some Python implementations based on trigrams.

http://www.cavar.me/damir/LID/

http://pypi.python.org/pypi/guess-language

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
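For reference, the trigram approach described above boils down to something like the following sketch. It is not taken from any of the linked implementations; `trigram_profile` and `cosine_similarity` are illustrative names, and cosine similarity is just one common way to compare two frequency profiles.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count overlapping three-character strings, ignoring case and extra whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    """Cosine of the angle between two count profiles (0.0 if either is empty)."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

For language ID the profile of an unknown text is compared against one reference profile per language. For author ID the references are per author, which is where texts in the same language start to look alike and false positives creep in.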

digip · October 15, 2012

I can see how someone like Google would use that in their language tools to auto-detect the language a text was written in (not to mention Unicode), but I think bwall was talking more about something like a handwriting signature. You can put two pieces of paper side by side and do handwriting analysis to tell if they were penned by the same author, and in much the same way that handwriting style can be fingerprinted, so should the way people write on the internet in typed text. It's not so much about identifying which language the text is written in, but whether it was written by the same author. Say you have the guy behind "Fake Steve Jobs", who was also a writer for a magazine or some other online news site. There should be a way to take text written by someone on one site, compare it with that of another author on another site, and determine based on paragraphs, words used, punctuation, and so on, whether it was written by the same person who more or less has multiple personas. At least, I think that's more what bwall is after, vs. just identifying "this is English" or "this is Portuguese" based on these 3 vowels or these three commonly used letters in a specific language.

Edited October 15, 2012 by digip

bobbyb1980 · October 15, 2012

Right, and the Python implementation in my 2nd link does exactly that, or "comparing characteristic footprints of various registers or authors". I understand that he's trying to compare authors and not ID languages. However, when I tried using a different algorithm to do that (but fundamentally the same method as bwall's, since the basis of both is matching strings), it didn't work very well.

IMO, to accurately do this you need to compare speech patterns, and not just word patterns (trigrams should be used, but as a supplementary method, not primary). Look at my paragraphs, vs. your paragraphs. How would you tell the difference between the two based solely on character matches? We both use proper punctuation, spelling, and grammar. We both use similar vocabulary. How could a *program* see the difference? Of course we're both going to use "ing" and "ion" in certain frequencies, as will everyone who writes in English, which is why, for me at least, character matches showed many false positives.

You need to compare whether the authors both use common figures of speech (oxymorons, hyperboles, similes, etc.), or whether the authors commonly use pronouns with or without certain verbs (this method is used to ID slang), or compare how often each author uses pronouns (a program can see if someone talks about themselves a lot if they use "I" often). You can also programmatically compare instances of adjectives, so you can know if a certain author is descriptive. There are many, many examples like this, unfortunately just not open-sourced ones.
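As an illustration of the pronoun idea, here is a small sketch that measures how often a text uses first-person pronouns. The word list, the tokenizing regex, and the `pronoun_rate` name are assumptions made for this example. Counting adjectives would additionally need a part-of-speech tagger, which is left out here.

```python
import re

# Assumed word list for this sketch; extend it as needed.
FIRST_PERSON = frozenset({"i", "me", "my", "mine", "myself"})

def pronoun_rate(text, pronouns=FIRST_PERSON):
    """Fraction of words in `text` that are in `pronouns`.

    Contractions are reduced to the part before the apostrophe,
    so "I'm" counts as "i".
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.split("'")[0] in pronouns)
    return hits / len(words)
```

Two authors can then be compared on this rate alongside other features, rather than on character matches alone.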

digip · October 15, 2012

Yeah, that's more what I was getting at: the phrasing of the same sentences, how they are punctuated, and whether the same spelling mistakes happen for common words in the same ways across different sites. Take "phrase one from site a" compared with "phrase two from site b": they both might say the same thing, but is the way they say it verbatim enough, close enough to the person's writing method and style? Do they reuse a word with relative frequency and in the same context? Even if it's the wrong context, do they repeatedly use it in the same, wrong context, and so on. I think it's a great undertaking of a project for anyone. It requires more than just understanding string comparisons within programming; it also needs someone on the team to throw in ideas from the psychological side, like the way social engineers with years of practice learn to read body language and micro-expressions. It's much harder to know a person's context in print, where it can easily be misinterpreted, than in, say, vocal patterns. And even then, stress, anger, sadness, and all these other emotions come into play in how we write, depending on how our day went.

I think in order to more easily identify two writings as being from the same person, you also have to be able to factor in data about the individuals. If you knew one of the people in real life, identifying traits manually would probably be much easier than static analysis. We know that with programming it's much easier to identify flaws with static analysis, for example, but in text on the web for just normal conversations, like in chat rooms, that can be much harder. For example, Sabu was arrested months ahead of when it was publicly announced, but people responding and talking to FBI-controlled accounts of his on Twitter, email and IRC were none the wiser about whether it was really him or not.

I don't doubt, though, that some high-level government program (not code, but as in a faction of the government with some three-letter acronym) has its own people for statistical analysis, because I would gather that you need some level of actual human comparison and analysis, and software alone would, as you said, lead to a lot of false positives. That is, unless it was worked on by a team of people who already do the physical, manual side of writing analysis and who help the programmers write the software and guide them through that process. Otherwise, as a programmer you would need to also be a psych major, or someone like a field agent who has a background in intel gathering and knows how to fingerprint patterns.

One of my fav movies is the one with Russell Crowe, based somewhat on a true story, about the professor from Princeton who could see patterns in things like newspaper articles and pull out messages. Most of that, we know, was just his illness and delusions, but there are people like that who can read through something and see the patterns or messages between the lines, and who could probably compare two stories written under different aliases and determine they were by the same person. Stephen King, for example, writes under several pen names, but I am sure people could at some point figure out when two books are written by him under different names. I'd be interested to see how they do it, though, and then you would apply that to writing your algorithm for analysis.

bobbyb1980 · October 15, 2012

I used to use it for language ID, but instead of ID'ing a language with trigrams alone, I'd use the trigrams to find unique characteristics like metaphors and whatnot inside of texts to confirm that it's English (because even when the trigrams are used only for language ID they still give false positives, like locating English inside a text in another Germanic language). I could see someone using that method, though, with regex in Python to compare posts on forums.hak5.org and forums.backtrack.com to find who has similar writing patterns.

digip · October 15, 2012

It would be cool to see something like it in action though: to be able to paste in two texts side by side, like in Notepad++ compare mode, and have it automatically grep out matching phrases, where you can specify things like "more than 3 words, more than 5 words, etc.", pull similar matches on things like spelling, case sensitivity, etc., then compile where the matches are and give you some sort of similarity ratio. Someone purposely trying to obfuscate their identity might be wise enough to go all camel case and purposely change their writing style, but someone like, say, anonymous posters who want to hijack a thread under multiple names to reinforce their point of view, or mess with things like poll systems and comment forms on news sites, would probably not be so conscious of their writing and how they spoof themselves.
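A bare-bones sketch of that kind of comparison, under some assumptions: phrases are runs of exactly `n` words, punctuation is ignored, and the "ratio" is the Jaccard similarity (shared phrases divided by all distinct phrases). The function names are made up for this example.

```python
import re

def word_ngrams(text, n, case_sensitive=False):
    """Set of all n-word phrases in `text`, punctuation stripped."""
    if not case_sensitive:
        text = text.lower()
    words = re.findall(r"\w+", text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(a, b, n=3, case_sensitive=False):
    """Return (sorted shared n-word phrases, Jaccard similarity ratio)."""
    grams_a = word_ngrams(a, n, case_sensitive)
    grams_b = word_ngrams(b, n, case_sensitive)
    shared = grams_a & grams_b
    union = grams_a | grams_b
    ratio = len(shared) / len(union) if union else 0.0
    return sorted(" ".join(g) for g in shared), ratio
```

Running it with `n=3` and then `n=5` covers the "more than 3 words, more than 5 words" settings. Matching on spelling mistakes falls out for free, since a misspelled word only matches the same misspelling.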

bwall · October 18, 2012

I agree, bobbyb: the initial PoC with basic character sequence signatures does not seem to be detailed enough. Ideally, I wanted to match on those because it allowed for the smallest databases, but I clearly need to expand to 3-char combos, 4-char combos, and full words as well.
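A sketch of how that expansion might be structured, assuming each signature is simply a set of transition tables keyed by context. The names `train_chain` and `build_signature` and the dictionary layout are hypothetical, not the project's actual format. A context of 2 characters plus the next character gives the 3-char combos, a context of 3 gives the 4-char combos, and the same function works on a list of words.

```python
from collections import Counter, defaultdict

def train_chain(tokens, context_len):
    """Map each context (a tuple of `context_len` tokens) to a Counter of next tokens.

    `tokens` may be a string (characters) or a list of words.
    """
    chain = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        chain[context][tokens[i + context_len]] += 1
    return chain

def build_signature(text):
    """Build character chains of increasing order plus a word-level chain."""
    return {
        "char2": train_chain(text, 1),          # basic character pairs
        "char3": train_chain(text, 2),          # 3-char combos
        "char4": train_chain(text, 3),          # 4-char combos
        "word": train_chain(text.split(), 1),   # word followed by word
    }
```

The trade-off is the one mentioned above: each higher order stores many more distinct contexts, so the database grows quickly as orders are added.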
