A note on similarity search


Mon 19 Jul 2010 by mskala Tags used: compsci, linguistics, publishing, web

Hi! I'm a scientific researcher. I have a PhD in computer science. My doctoral dissertation is mostly about the mathematical background of "similarity search." That means looking at things to find other things they are similar to. I've travelled the world to present my work on similarity search at scientific conferences - and some very smart people with very limited funds chose to use those funds to pay for me to do that.

Argument from authority has its limitations, but I would like to make very clear: I am an expert in the specific area of how computers can answer questions of the form "Which thing does this thing most resemble?" Gee, why would I mention this right now?

It happens that one of my areas of specialization is computational linguistics. That is the use of computers to study language (and vice versa). I study, among other subjects, the features that make one writer's style different from another's, and how (if at all) a computer can measure those features. It would be a very interesting scientific experiment to take a bunch of samples of different writers' work, do stats on them, and then take other pieces of writing and compare them against the samples to see which writer in the database any given input most resembled the work of. Sounds like a fun project, right?
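
A minimal sketch of what such an experiment could look like, for illustration only. It assumes a tiny hypothetical list of function words and two made-up "writers"; real stylometric systems use far larger samples and richer features, and nothing here is taken from any actual system.

```python
import math
from collections import Counter

# Hypothetical feature list: a handful of common function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(text, samples):
    """Name of the writer in `samples` whose profile is closest to the text."""
    p = profile(text)
    return max(samples, key=lambda name: cosine(p, profile(samples[name])))

# Made-up sample texts standing in for a real corpus.
samples = {
    "writer_a": "the cat sat in the hat and it was the best of the hats",
    "writer_b": "to be or not to be that is a question to put to a friend",
}
print(most_similar("it was the end of the day in the town", samples))
```

Note that the function always returns somebody from the database, however poor the best match is.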

Such a thing could not ever be anywhere near 100% accurate. Depending on what standard you're comparing against, it might be as good as 50% accurate, or it might be lucky to get 10%, or it might be even worse. Sometimes if you put in your own writing, it would produce a silly answer - a writer you think is very much unlike you. For instance, it might say you were like a writer of a different gender or ethnicity from you. It might even do that when there are other writers in the database that you think you should be rated as more like. Sometimes even for writers in the database themselves, it would rate a given piece of writing as more like some other writer's writing than like that of the person who actually wrote it, especially if the system were designed to measure style instead of primarily identity, but even if it were designed to measure identity and nothing else. Those are unavoidable properties of how this kind of classification system works. Classifying writers' styles is especially difficult because writers consciously change their styles, to imitate one another or to avoid doing so, to create different artistic effects in different pieces of writing. Any writer who has a characteristic style noticeable uniformly in all their work is probably doing it wrong, to be blunt.

It also would be absolutely impossible to ensure that the list of writers in such a system's database would be perfectly balanced by any specific criterion - for instance, race, gender, or genre. Attempting to achieve such a balance would absolutely mean sacrificing other important things to some, and probably a very large, degree. One of the many reasons is that putting a writer into such a system requires a large sample of the writer's work, in a form that the training algorithm can process, and those samples are hard to find, especially if one wants to avoid signing away one's soul to an organization that claims to own the data. It's not as simple as putting a paperback in a scanner and walking away.

Designing the training data for machine learning systems in general is difficult, expensive, and a really big problem. Putting in one writer absolutely means not putting in someone else; the number that can be accommodated is limited. Adding more writers may mean requiring a bigger sample from each of them, depending on the behaviour of the training algorithm, and will almost certainly mean more CPU cycles required to do the training. The cross-section of training data available is shaped by whatever happens to be available electronically already for other reasons, because nobody has the funding to create a database completely from scratch.

For instance, in the case of a writing classifier it's much easier to run stats on books from Baen than from other publishers, because they have that free online library thing; but that means getting a database skewed toward science fiction. It's much easier to run stats on books from before Mickey Mouse because they're out of copyright and available through Project Gutenberg; but that means being skewed in the direction of whatever books the Project Gutenberg volunteers like, because they don't make any effort to "balance" by any other criterion than what their volunteers volunteer to do. It also means being skewed toward the writers who were commercially successful enough in earlier times for their work to have survived enough to end up in PG - and commercial success even today, and all the more in the past, is not distributed in an even-handed manner. The uneven distribution of commercial success decades before I was born may be someone's fault but I don't believe it's mine.

Researchers like me spend mindboggling amounts of time and money constructing databases of samples (called "corpora") to use in this kind of project. We can never include all the writers we would like to, and if you want us to include writers you are interested in and we aren't particularly, hey, friend, the university is right over there. You go get the PhD and obtain the grant money, go learn what a hidden Markov model is and how to train one, read a few books on support vector machines, publish a few papers, spend the decade in poverty and celibacy that I spent, run around knocking on doors forever and sacrifice your other dreams, and then maybe you'll have some standing to tell me how to do my job.

Even if the list of authors in the database were balanced by any given criterion, there is no guarantee that the output would represent all those authors with equal or balanced probability, since the output probabilities would be determined by what people submitted for classification. Also, the space of "style resemblance" is multi-dimensional and complicated. It may well be that there are one or a few authors who really are the closest match to a very large number of possible inputs and many others who are the closest match for very few possible inputs; it's essentially a Voronoi diagram in which some of the cells may be much larger than others. (Hey, do you know what a Voronoi diagram is? If not, why the fuck do you think I should listen to you shoot your ignorant mouth off on the subject?)
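
For readers who have not met the idea: the Voronoi cell of a database entry is the set of all inputs that are closer to it than to any other entry. A rough sketch follows, assuming uniformly random points in a 5-dimensional unit cube as stand-ins for writers and inputs. That is an arbitrary choice made for illustration, not a model of real writing, and it shows only that cells come out in different sizes; it says nothing about the shape of the size distribution.

```python
import math
import random
from collections import Counter

def nearest(query, sites):
    """Index of the site closest to the query (Euclidean distance)."""
    return min(range(len(sites)), key=lambda i: math.dist(query, sites[i]))

rng = random.Random(0)   # fixed seed so the run is repeatable
dim = 5                  # hypothetical number of style features

# Eight "writers" in the database and 2000 submitted "inputs".
sites = [[rng.random() for _ in range(dim)] for _ in range(8)]
queries = [[rng.random() for _ in range(dim)] for _ in range(2000)]

# How many inputs fall into each writer's Voronoi cell.
cell_sizes = Counter(nearest(q, sites) for q in queries)
for i in range(len(sites)):
    print(i, cell_sizes[i])
```

Every site gets equal treatment from the code, yet the counts it prints are determined by where the sites happen to sit relative to each other and to the inputs.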

In fact, there's a strong theoretical reason for the Voronoi cell sizes to be uneven: we're looking at a naturally occurring process, it's likely to produce a power-law distribution because that's what such processes do, and in that case there will be a few cells much larger than the others because that's what a power-law distribution means. I have not done experiments to verify the size distribution of the Voronoi cells in the writer-resemblance space; to my knowledge nobody has; but it's well-supported by consensus theory and I have written a peer-reviewed journal article on something somewhat related. You can read that article here.

Another interesting theoretical point has to do with intrinsic dimensionality. With something like writing style, it's reasonable to guess that the intrinsic dimensionality will be high. That means that in general, almost all points (pieces of written work) will be at the maximum distance from almost all other points (other pieces of written work). See Section 1.3 of my dissertation, the work of others cited therein, and Chapter 2. Measuring the distance from one writer to another doesn't usually tell you much useful information then, because it'll just say "no significant resemblance" nearly all the time; and if you're matching against a database, unless you get lucky and find a perfect match, you'll usually find no real match at all. If there is no close match it's quite possible that a model of writer similarity might tend to match the input with the writer in the database who exhibits the highest per-word entropy; and very few lists of famous writers containing James Joyce will contain anyone with higher per-word entropy, so it's easy to predict that he in particular might end up matching a whole lot of inputs. (Now go look up what per-word entropy means, please.)
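
The distance effect can be seen in a small simulation. This is a sketch under a strong assumption: points drawn uniformly from a unit cube, which is a standard toy distribution and not a model of writing style. It compares the nearest and the farthest distance from one query point to a set of random points as the dimension grows.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from one random query
    to n_points random points in the unit cube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

A ratio close to 1 means the best match in the database is barely closer than the worst one, which is the "no significant resemblance" situation described above.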

Most writers are unique and unlike all other writers; and even if you can answer the question of which writer is the closest thing vaguely resembling a match, it will almost certainly be a very bad match. That is an unavoidable property of high-dimensional distributions in metric spaces, and a major topic of my dissertation and my multiple follow-up publications. In a few words, it's because high-dimensional spheres are almost all surface with almost no interior. A Koosh ball, if you remember those, is a good tactile model for how high-dimensional spheres behave.
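
The "almost all surface" remark can be checked with one line of arithmetic. The volume of a ball in d dimensions scales as the d-th power of its radius, so the fraction of the volume lying within a thin outer shell of relative thickness eps is 1 - (1 - eps)^d. The function name and the 5% shell thickness below are choices made for this illustration.

```python
def shell_fraction(dim, eps):
    """Fraction of a dim-dimensional ball's volume that lies within
    eps * radius of its surface."""
    return 1.0 - (1.0 - eps) ** dim

# Share of the volume in the outermost 5% of the radius.
for dim in (2, 3, 10, 100, 1000):
    print(dim, shell_fraction(dim, 0.05))
```

In 2 dimensions the outer 5% shell holds a bit under a tenth of the area; by 100 dimensions it holds more than 99% of the volume.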

So: any "which author does this resemble?" system will often produce results you think are "wrong." That's how machine learning systems work (not very well), at least when applied to intrinsically high-dimensional problems like writing-style classification, and the mathematics of the underlying metric space do not imply anything bad about the people who build classification systems. It may tend to fail by matching almost everybody with James Joyce, and the only person about whom that implies anything interesting is Joyce himself, and it doesn't necessarily imply anything bad about him. Any "which author's work does this text resemble?" system will be forced, by lack of training data as well as by many other considerations, to have a very limited list of possible classification results, a list that doesn't match the population and doesn't match anybody's idea of a "balanced" list. That limitation does not indicate anything bad about the people who built the system. A "balanced" list is neither a reasonable nor even a remotely possible goal to demand of such a system, and even if it could be achieved it would not result in measurably better classification results, and that also does not indicate anything bad about the people who built the system.

If I ever did research in this direction, I sure had better not let anyone except my fellow scientists look at the results, eh? It's pretty clear that the Web community cannot accept or comprehend the limitations of this kind of science; and I'd be told how to do my job by people who think maybe egalitarianism would be nice, but don't have the slightest conception of the issues involved and don't have the most remote inclination to learn or listen; and then I'd be fucking crucified if my system failed to live up to unreasonable expectations.

15 comments

For what it's worth, I agree with you entirely that there are many valid reasons for such a system not to attempt a "balanced" selection of authors, and that a reasonable software author would reasonably become defensive when confronted with "why aren't there any writers of color in your database?"

You seem to be implying that it doesn't matter whether the software system is a purported attempt at scientific inquiry or something which is materially different from accepted scientific inquiry in some ways. You appear to be outraged at the way another person was treated, with no regard for whether or not that other person had "the slightest conception of the issues involved."

Out of curiosity, if the algorithm in use had been to use the length of the passage mod 40 to index an array of author names, would you feel similarly appalled at the way he was treated?
Chris K - 2010-07-19 20:39

There appear to be legitimate criticisms that could be directed at the goals of the system in question. I'm not sure it was ever intended as "a purported attempt at scientific inquiry." If it was meant as research, it certainly seems like it's way behind the state of the art for such things. If someone started out by complaining about those points and didn't put stupid irrelevant things front and centre, I'd probably have sympathy with them.

But what I see going on is that he wasn't crucified for anything that would differentiate what he did from something I might do. I then feel threatened by the reactions I see: there but for fortune go I. Anybody who mentioned any objections I might consider legitimate did so only in passing, only after devoting the large majority of their venom to the non-legitimate criticisms (at the very least, endorsing others' nonsense) and losing my sympathy. It appears that if I touched his subject matter I could expect to be treated just the same as he was. Maybe my impression on that point is incorrect, but if so, Web loggers aren't doing a good job of demonstrating their ability to make such nice distinctions.

And yes, if it were the algorithm you describe, I'd consider the reaction inappropriate. (Note that many Facebook quizzes and such do actually run such algorithms.) I'd be angered if he made specific and false claims about how it worked or how well it worked; I've seen no indication that he did. Even in such a case I think the reaction would be inappropriate; there are right and wrong ways to raise any given objection. I'm hard put to think of a scenario in which the reaction I'm seeing would be right.
Matt - 2010-07-19 21:09

What triggered this note? I'm not in the loop when it comes to rage against the machines.
trythil - 2010-07-20 02:19

Trythil: http://iwl.me/ , and commentary about it at such places as http://zia-narratora.livejournal.com/627422.html

Incidentally, it rates this posting of mine as most similar to the work of Cory Doctorow.
Matt - 2010-07-20 04:40

Further to that: I tried it with a few samples of my own stuff and it seems to say Cory Doctorow most often, Margaret Atwood pretty often too, occasionally others. It'd be amusing to think that I really do resemble Doctorow and Atwood simultaneously, in terms of subject matter, political outlook, and other abstract things human beings think of as characterizing the differences between authors; but since it's apparently mostly a Bayesian word-frequency thing, I think the explanation just comes down to my use of Canadian spelling.

The game would be more interesting, and less open to idiot criticism, if it produced more detailed support for results - "Your average sentence length is 17.8 words, close to Joe Bitzfilk's average of 16.4... you used this list of rare words which were also used by Mary Bloggs-Ramachandra..." and so on. But it might not be easy to translate the reasons for the results into that kind of summary in any convincing-to-humans way.
Matt - 2010-07-20 06:35

"I write like
Cory Doctorow"

I guess then that by the transitive property, I write like Matt.

Seems useless (and harmless) enough?
eloj - 2010-07-20 08:49

I think you're supposed to be angry because it didn't match you with Stieg Larsson - and doubly so because he isn't on the list of possible results. Or something. (If he isn't? I'm not sure where to look up the current list.)

It occurred to me that maybe part of the gap in values seen here has to do with many members of the lynch mob being based in the USA. The ideas there of what constitutes "racism" and how it should be treated are heavily informed by that country's unfortunate history. Demanding "racial" balance may sound a lot more reasonable to someone who thinks "race" is a binary variable than to someone who sees it the way I or you are probably inclined to, coming from a Canadian or Swedish perspective. And do note that Dmitry Chestnykh certainly appears to be a non-native writer of English, with a very different cultural background from what someone born in the USA would have. He's being picked on pretty heavily on Making Light for his clumsy English as we speak, and it makes me think less of the people who run and hang out on Making Light - including the ones who merely allow such behaviour to pass with only implicit endorsement of it.

While I'm posting another comment here, to clarify my answer to Chris K: Yes, it is correct that I am "outraged at the way another person was treated, with no regard for whether or not that other person had 'the slightest conception of the issues involved.'" The way Chestnykh is being treated would be inappropriate regardless of his qualifications or lack thereof. When I see people treating others as they're treating him I pretty quickly end up on the opposite side of whatever ideology is motivating them: ideologies that lead to such behaviour are bad even if for no other reason. See the slogan in my page header.
Matt - 2010-07-20 09:29

Having tried the application you linked to, and read some of the criticisms about it...well.

"I Write Like" is obviously someone's Web toy; I don't understand why people get worked up about it. Perhaps people should stop being so offended at the output of a program.

I use the word "obviously" -- but perhaps it's not really that obvious? Maybe Ebert's tweet about it was interpreted as some sort of implicit endorsement, and then everything just went to hell from there.

There is something to be said here about software developers being more thoughtful about the implications of what they build (i.e. code of ethics for engineers), especially when the Internet can propel a piece of software to such notoriety in such a short amount of time, but I have nothing more to say about that.

I did come across http://www.theawl.com/2010/07/a-qa-with-the-creator-of-i-write-like-the-algorithm-is-not-a-rocket-science, which seems to indicate that the author of IWL has grander aspirations than "just a toy" -- but, hell, even then, the outrage doesn't make sense. It's like cursing Amazon because you think its recommendation algorithms have an agenda to insult you.
trythil - 2010-07-20 14:26

There's an important background detail here. A lot of people have been upset, in the last year or so, because of lists of authors presented in other contexts that were said not to accurately reflect population demographics. In particular, various people's lists of favourite or influential science fiction authors - for instance, lists of authors included in anthologies - contained very few people other than white men. I've witnessed a fair bit of that anger myself, despite a conscious effort to avoid the kind of people who deal in it. There've been multiple episodes of segments of the community freaking out over a given list, calls for heads or other body parts on pikes, more stable elements making learned introspective meta-comments, and so on. Another round of this stuff sweeps through the fandom Web logs I read every few weeks. There's also a lot of weird tension over other racism- and sexism-related issues in fandom at this time in history; see that book cover thing I posted a couple weeks ago, and it's only one example I chose to comment on, among many I let pass.

This toy involves a list of authors. That list included few to no persons other than white men. Such a list, posted on the Web in 2010, automatically ended up in the firing line of the existing seething anger already directed at other lists of authors, whether the anger was really appropriate in the specific context of this toy or not.
Matt - 2010-07-20 14:48

I thought that the private reply reposted on that livejournal blog you linked was an exceptionally good response by Dmitry. I don't get why that further angered the blog writer. I kind of understand it, but I don't -get- it.
Steve C - 2010-07-21 15:09

That's a big part of what set me off. What he said was pretty much the same thing I would have said. I think it was a good thing to say. But the Web log writer and most of their friends treated it as being obviously far beyond the pale. That suggests their values are irreconcilably different from mine.
Matt - 2010-07-21 15:16

Actually, it's probably not what I would have said. I would have been less polite. But that doesn't reflect badly on Chestnykh.
Matt - 2010-07-21 15:17

The first time, I submitted some paragraphs from an astrological article and it said I write like James Joyce. Two days later I submitted a passage from my dream file and it said I write like Vladimir Nabokov. (There were no underage girls in my dreams, I swear!)
Axel - 2010-07-24 10:41

I submitted some paragraphs from a whitepaper I helped write on the use of supervised learning in legal cases, and got back H. P. Lovecraft. That may confirm a lot of people's suspicions about me and/or the legal system.
Dave Lewis - 2010-08-27 16:59

Let's tweak the software so it can tell what other artists paint like me.
Robert Randall - 2010-08-28 02:22
