Fun with text analysis
Tue 26 Oct 2010 by mskala
Tags used: compsci, linguistics, publishing
I wrote before about the writing style analysis toy; at that time I said the "blogosphere" wasn't ready for such technology, and I still believe that, but I recently did something sort of related that might interest you, and the stakes are a little lower, so I'm going to share it here.
The thing is, in my novel draft, there are 45 chapters, and some of them are deliberately written in different styles from others. I thought it'd be interesting to see if I could detect that statistically. I apologize for not posting the actual text here - you'll have to wait until the book is published - but I'll at least give you the raw numbers to play with and walk you through the analysis.
First of all, I counted the number of paragraphs, sentences, words, and letters in each chapter. This is a little sensitive to definitions - for instance, if I write an open quotation mark, three sentences of dialogue, a closing quotation mark, and then "she said.", is that three sentences? Four? One? Really it's one, with a quoted string inside it that itself contains three more; but it's not at all clear what definition is most likely to give useful stylistic analysis. For my purposes, I simply used the one that was easiest to program: I already have some Perl code that looks for sentence breaks of the kind that should have a double horizontal space in an ASCII manuscript file; so I just declared those to be sentence breaks for the purposes of my count. Similarly, to measure word lengths, I filtered out all punctuation except hyphens, and apostrophes surrounded on both sides by letters with no spaces, and then declared space- or dash-separated (note a dash is not the same as a hyphen) chunks of what was left to be words.
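The counting rules above can be sketched in Python. This is only an illustration, not the Perl code mentioned in the post. It assumes each paragraph sits on one line with blank lines between paragraphs, that a run of two or more spaces marks a sentence break, and that a dash is typed as "--"; the function names are made up for the example.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Assumption: a run of two or more spaces marks a sentence break.
    return [s for s in re.split(r" {2,}", paragraph) if s.strip()]

def split_words(sentence):
    # Assumption: a dash is typed as "--"; it separates words, a hyphen does not.
    s = sentence.replace("--", " ")
    # Drop apostrophes that are not surrounded on both sides by letters.
    s = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", "", s)
    # Drop everything else that is not a letter, apostrophe, hyphen, or space
    # (this sketch also drops digits, which the post does not discuss).
    s = re.sub(r"[^A-Za-z'\-\s]", "", s)
    # Ignore leftover chunks with no letters, such as a stray hyphen.
    return [w for w in s.split() if any(c.isalpha() for c in w)]

def count_letters(word):
    # Hyphens and apostrophes stay in the word but are not letters.
    return sum(c.isalpha() for c in word)

def chapter_counts(text):
    # Returns (paragraphs, sentences, words, letters) for one chapter.
    paragraphs = split_paragraphs(text)
    sentences = [s for p in paragraphs for s in split_sentences(p)]
    words = [w for s in sentences for w in split_words(s)]
    letters = sum(count_letters(w) for w in words)
    return len(paragraphs), len(sentences), len(words), letters
```

Under these rules a quoted run of dialogue followed by "she said." counts as several sentences whenever the manuscript puts double spaces inside the quotation.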
From there I was able to calculate means and histograms of smaller units per larger unit: letters per word, words per sentence, sentences per paragraph. For letters per word, I counted how many one-letter words, how many two-letter words, and so on up to nine, then a count of all "10 letters or more" words. Similarly, I got ten counts in appropriate-sized bins for words per sentence (I used bins of size five for those, because sentences tended to be in the up-to-50-words range) and ten counts of sentences per paragraph. I normalized the histograms to sum to 1 for each chapter, so that longer chapters wouldn't simply have bigger counts across the board. The result was a data file with 38 columns of data (one index, four totals, and three kinds of counts each formatted as one mean plus ten columns of histogram). You can download the data here as a tab-separated ASCII file with a few lines of header; most good spreadsheets should be able to read it.
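Here is a rough Python sketch of the mean-plus-histogram features described above. The post does not give exact bin edges, so the edges here are assumptions: bin 1 starts at the value 1, each bin is `bin_width` wide, and the tenth bin catches everything larger. Width 1 for sentences per paragraph is also a guess, and the function names are invented for illustration.

```python
def normalized_histogram(values, bin_width=1, n_bins=10):
    # Count values into n_bins bins; the last bin is open-ended.
    counts = [0] * n_bins
    for v in values:
        if v < 1:
            continue  # nothing to count for empty units
        index = min((v - 1) // bin_width, n_bins - 1)
        counts[index] += 1
    total = sum(counts)
    # Normalize so the bins sum to 1, regardless of chapter length.
    return [c / total for c in counts] if total else [0.0] * n_bins

def mean(values):
    return sum(values) / len(values) if values else 0.0

def chapter_features(letters_per_word, words_per_sentence, sentences_per_paragraph):
    # Three groups of (one mean + ten histogram bins) = 33 numbers per chapter.
    features = []
    for values, width in ((letters_per_word, 1),
                          (words_per_sentence, 5),       # bins of size five
                          (sentences_per_paragraph, 1)): # assumed width
        features.append(mean(values))
        features.extend(normalized_histogram(values, bin_width=width))
    return features
```

Each argument is a list of per-unit counts for one chapter, for example the length in letters of every word. Stacking the 45 resulting rows gives the 45 points in 33-dimensional space.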
That basically gave me a description of the chapters of my book as a set of 45 points in 33-dimensional space. I'm only considering the means and histograms as the interesting data here; the other numbers in the file are either the location of the chapter within the book, or primarily determined by the size of the chapter, and neither of those captures the "style" information I want to be looking at. Lacking a set of 33-dimensional senses to perceive those points, it was necessary to reduce it to something more comprehensible.
For that I applied a technique called principal component analysis (PCA). This is a standard statistical technique; its function can be described in several different ways, many of which are very difficult to understand, but one of the ones I like is that your high-dimensional data probably doesn't really occupy all the dimensions in which you've expressed it. For instance, points strung out more or less along a line can be thought of as basically one-dimensional, even if that line goes off at an angle across the plane, or even through three-dimensional space. Similarly, all your points could be on, or close to, a two-dimensional plane, even if that plane exists in three-dimensional space, and even if enough points are not exactly on the plane (just close) that you can't say they are really on the plane. The function of PCA is to find that plane, or some type of thing higher-dimensional than a plane that is mathematically analogous to it.
Another way of thinking about PCA is that it rotates the data through the high-dimensional space in such a way as to move interesting information as much as possible into the lowest-numbered dimensions. Then if you keep just the lowest-numbered dimension, you get the best possible one-dimensional approximation; if you keep the first two, you get the best possible two-dimensional approximation, and so on. "Best possible" applies here in a certain mathematical sense (least total squared error), which might well not be what you actually want in a particular application. The intricacies of such issues are a big part of what keeps people like me employed.
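For readers who want to see the mechanics, here is a small pure-Python sketch of PCA. It is not the Octave computation used for this post: it centers the data, builds the covariance matrix, and finds the leading eigenvectors by power iteration with deflation, which is one simple way to do it and is assumed here for illustration. The function names, iteration count, and seed are arbitrary choices.

```python
import math
import random

def center(rows):
    # Subtract each column's mean so the data are centered on the origin.
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    # Sample covariance matrix of already-centered rows.
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, rng, iterations=500):
    # Power iteration: repeatedly multiply a vector by the matrix and
    # renormalize; it converges toward the largest-eigenvalue direction.
    d = len(matrix)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in any direction
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return eigenvalue, v

def pca(rows, n_components=2, seed=0):
    # Returns (variances, component directions, per-row scores).
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    rng = random.Random(seed)
    eigenvalues, components = [], []
    for _ in range(n_components):
        value, vec = top_eigenvector(cov, rng)
        eigenvalues.append(value)
        components.append(vec)
        # Deflation: remove the variance already explained by this direction.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in centered]
    return eigenvalues, components, scores
```

The first two columns of `scores` are the coordinates you would feed to a scatter plot. The sign of each component is arbitrary, so a plot made this way may come out mirrored relative to one made with other software.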
I used GNU Octave, which is more or less a free clone of MATLAB, to do the PCA. The results are also in a file you can download; I added a column of index numbers to make plotting easier, but the rest of it is just the Octave output: a 45x33 matrix, where the columns are the principal components.
Now, the fun part: I ran the first three columns of that (index number and first two principal components) through gnuplot, and here's what came out.
That's a scatter plot; the numbers are chapter numbers (1 to 45), and the axes are the numbers that came out of the principal component analysis. I'd like to emphasize that by the nature of PCA, the components and their units do not necessarily have any meaning at all; it's just that the first component is the direction along which the sample has the most variance, the second component is the direction orthogonal to that one along which the sample has the most remaining variance, and so on, and the units are some mixture of the units of the original data, according to the eigenvectors of the covariance matrix. If the components turn out to be meaningful in some way, that's cool but not guaranteed.
The interesting part for me is that the numbers do seem to have arranged themselves into some kind of structure. There's a sort of blob a little to the left of and below the centre, but some chapters stand out as not being in that blob. So the next step is, are those chapters somehow special?
Well, this next step is a little difficult for you to reproduce without the book in front of you - and by the time you get your copy the chapter numbering will probably have changed anyway (maybe I'll re-do the stats then); but I can tell you that it does seem to make some sense. In particular, there are some characters in my story who are not human; specifically, some are anime catgirls. They don't think like humans; they think more like cats; and that's meant to be reflected in the writing style of the chapters that are written from their point of view. One of the differences is simply that they think in present tense instead of past tense... but this analysis shouldn't be able to detect that, should it? Remember I'm only counting things like words per sentence, absolutely nothing about grammatical issues like past or present tense. Nothing in this analysis does any parsing or syntactic analysis beyond those simple counts. Nonetheless, maybe the catgirl-POV chapters will stand out anyway. There are six of them, and their numbers are 6, 13, 18, 27, 28, and 45.
Take a look at the chart: five of those six chapters do stand out nicely. Number 6 is extreme: it's in the upper right, nowhere near the rest. The analysis definitely detected that that's a special chapter. Numbers 13, 18, 27, and 28 are clearly tending in the same direction as number 6; whatever it has, they have too, though not quite as much. But 45 is not so special. It's much further down, near chapter 11, in the tendril that stretches off to the right. I note that although chapter 45 is through the eyes of a catgirl, it's much more like some of the other chapters in subject matter (she is narrating a different plot line from the other "catgirl" chapters).
Can we learn anything about the other extreme-seeming chapters? Well, chapter 24 is far to the right, and it seems like other chapters are tending toward it. I took a look at that chapter and found it was all dialogue, and the others near it are chapters that contain a lot of dialogue. It appears that that first component is basically measuring amount of dialogue in the chapter; and it seems reasonable that this PCA would key on that, because dialogue tends to have one-sentence paragraphs, and short sentences, much more so than other kinds of text. We could probably confirm this by looking at the eigenvectors that came out of the PCA process (not posted - I didn't save them - but it'd be easy to recreate them). Instead, I took a look at chapter 16, which represents the opposite extreme. Sure enough, that's a chapter with no dialogue at all, and a lot of technical description, written in a very formal tone with a lot of long sentences and such. It stands to reason it would appear as the opposite of the "all dialogue" chapter.
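Checking the eigenvectors would amount to listing which input columns carry the largest weights in a component. A minimal Python sketch, assuming you have one eigenvector as a list of 33 numbers and a matching list of column names (the names used in the usage note are hypothetical, not taken from the data file):

```python
def top_loadings(component, names, k=5):
    # Pair each input column with its weight in the component, then sort
    # by absolute weight so the most influential columns come first.
    pairs = sorted(zip(names, component), key=lambda p: abs(p[1]), reverse=True)
    return pairs[:k]
```

If the first component really tracks dialogue, a call like `top_loadings(first_component, column_names)` should surface columns such as the one-sentence-paragraph bin and the short-sentence bins near the top of the list.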
I'm not sure that much more of interest can be extracted from this very simple analysis; some of the other distinctions I'd been hoping to see don't seem to have appeared, and I think they may require looking at more dimensions of input or output (such as things related to the meanings of the words - at the moment, any two words of the same length are treated as exactly the same). Nonetheless, it's kind of cool that even at this level it seems I can reliably detect dialogue and catgirlishness.
Do note that this is all comparing not only chunks of writing from the same person, but different chapters of the same book. The fact that we still see noticeable clusters corresponding to detectably different styles is part of why author identification by stylistic analysis is hard. If we threw in some chapters written by someone else, we might well find that chapters written by me, from different characters' viewpoints, are more different from each other than chapters written by me in general are different from chapters written by you. This particular exercise isn't designed to test that.
10 comments
Catgirls do use short, to-the-point sentences. They also say "mew" pretty often, and that may be enough to show up as a noticeably higher proportion of three-letter words than in the rest of the text.
Matt - 2010-10-27 15:26
Also, nobody uses double spaces after periods anymore.
Owen - 2010-10-27 15:37
On double spaces at the ends of sentences: it depends. Most books for the popular market these days seem to use equal space between words and sentences. But LaTeX makes a wider space (not necessarily exactly double) after sentences; and LaTeX is standard throughout computer science and mathematics.
Two typewriter spaces after a sentence is also part of the standard typewriter manuscript format, whether the book ends up being typeset that way or not. Editors may or may not complain, but they will definitely notice, if you don't follow the standard - and there's an important reason to follow the standard in that if you don't, the estimating rules based on people following the standard no longer exactly apply, and that annoys editors and typesetters.
It's possible to argue that a wider space after sentences is good because it makes it easier for the reader to know where the sentences end; some people say that it's bad, but I haven't heard any stronger argument against it than your "Nobody does that anymore." Just for my own purposes there's a significant advantage that I can use it to recognize sentence boundaries in software, which would be a lot harder without it.
Matt - 2010-10-27 17:50
You obviously write shorter chapters/chapters of a more varied length. I try to pace out 18 chapters for a 100kword book, 5555 words per chapter. Remember that you're writing for people who have a set period of time in their day (commute, while the compiler runs, after my goes goes home.) Trying to adhere to an artificial structure like 18x5555 forces me to pace it differently than free-form chapter structures, I think. I'm enjoying it, anyway.
Owen - 2010-10-27 18:20
Owen - 2010-10-27 18:21
Owen - 2010-10-27 18:31
Matt - 2010-10-28 01:05
Sometimes periods don't end sentences; for instance in the case of abbreviations like Mr. That's especially hard because in this paragraph you need to know that the first "Mr." occurs at the end of a sentence and I'm not talking about a person named "Mr. That's," whereas the other two are quoted in the middle of a sentence. Knowing where sentences end is a tricky problem.
On the other hand, sometimes sentences end with things other than periods, such as question marks and exclamation points - but those, similarly, do not ALWAYS end sentences. You can't solve it with simple computerish rules, and that's one reason that double spaces are valuable to both computers and typesetters.
Matt - 2010-10-28 01:15
In the meantime, you should totally check out Lacuna Expanse. Warning: It's highly addictive. If you use my referral code below, you'll end up in a sector close to me when you found your fledgling civilization.
https://us1.lacunaexpanse.com/#referral=d495b511-0452-39a3-b134-4ec7a32ce3f9
Seriously, I'm totally addicted. Can't stop!
Owen - 2010-11-08 21:04
I notice that chapters 6, 13, 18, 27, and 28 have high mean sentences per paragraph, relative to the rest of the data set, but not so many mean words per sentence. Is your catgirls' dialogue composed of lots of short, to-the-point sentences? (This comes to mind, too: http://www.nytimes.com/1988/04/18/books/classic-french-novel-is-americanized.html)
trythil - 2010-10-27 15:07