Hi! I'm a scientific researcher. I have a PhD in computer science. My doctoral dissertation is mostly about the mathematical background of "similarity search." That means looking at things to find other things they are similar to. I've travelled the world to present my work on similarity search at scientific conferences - and some very smart people with very limited funds chose to use those funds to pay for me to do that.
Argument from authority has its limitations, but I would like to make very clear: I am an expert in the specific area of how computers can answer questions of the form "Which thing does this thing most resemble?" Gee, why would I mention this right now?
It happens that one of my areas of specialization is computational linguistics. That is the use of computers to study language (and vice versa). I study, among other subjects, the features that make one writer's style different from another's, and how (if at all) a computer can measure those features. It would be a very interesting scientific experiment to take a bunch of samples of different writers' work, do stats on them, and then take other pieces of writing and compare them against the samples to see whose work in the database any given input most resembled. Sounds like a fun project, right?
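For the curious, a toy version of that experiment fits in a few lines. This is purely an illustrative sketch - the author names and sample texts below are invented, and real stylometry would use much larger samples and better features (function words, character n-grams, and so on) - but the shape of the pipeline is the same: build a frequency profile per writer, then return the writer whose profile is most similar to the input.

```python
from collections import Counter
import math

# Invented toy corpus, purely for illustration.
SAMPLES = {
    "author_a": "the cat sat on the mat and the cat slept",
    "author_b": "whereas the committee shall convene forthwith pursuant to the statute",
}

def profile(text):
    """Relative word frequencies -- a crude stand-in for a style vector."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def closest_author(text, samples):
    """Return the author whose sample profile most resembles `text`."""
    target = profile(text)
    return max(samples, key=lambda a: cosine(target, profile(samples[a])))

print(closest_author("the dog sat on the rug", SAMPLES))
```

Note what this sketch already makes obvious: the system always answers with *somebody* from the database, however bad the match is.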
Such a thing could never be anywhere near 100% accurate. Depending on what standard you're comparing against, it might be as good as 50% accurate, or it might be lucky to get 10%, or it might be even worse. Sometimes, if you put in your own writing, it would produce a silly answer - a writer you think is very much unlike you. For instance, it might say you were like a writer of a different gender or ethnicity from you. It might even do that when the database contains other writers you think you resemble more closely. Sometimes, even for writers in the database themselves, it would rate a given piece of writing as more like some other writer's than like that of the person who actually wrote it - especially if the system were designed to measure style rather than identity, but even if it were designed to measure identity and nothing else. Those are unavoidable properties of how this kind of classification system works. Classifying writers' styles is especially difficult because writers consciously change their styles - to imitate one another or to avoid doing so, to create different artistic effects in different pieces of writing. Any writer who has a characteristic style noticeable uniformly in all their work is probably doing it wrong, to be blunt.
It also would be absolutely impossible to ensure that the list of writers in such a system's database would be perfectly balanced by any specific criterion - for instance, race, gender, or genre. Attempting to achieve such a balance would absolutely mean sacrificing other important things, to some degree and probably to a very large one. One of the many reasons is that putting a writer into such a system requires a large sample of the writer's work, in a form that the training algorithm can process, and those samples are hard to find, especially if one wants to avoid signing away one's soul to an organization that claims to own the data. It's not as simple as putting a paperback in a scanner and walking away.
Designing the training data for machine learning systems in general is difficult, expensive, and a really big problem. Putting in one writer absolutely means not putting in someone else; the number that can be accommodated is limited. Adding more writers may mean requiring a bigger sample from each of them, depending on the behaviour of the training algorithm, and will almost certainly mean more CPU cycles required to do the training. The cross-section of training data is shaped by whatever happens to exist electronically already for other reasons, because nobody has the funding to create a database completely from scratch.
For instance, in the case of a writing classifier it's much easier to run stats on books from Baen than from other publishers, because they have that free online library thing; but that means getting a database skewed toward science fiction. It's much easier to run stats on books from before Mickey Mouse, because they're out of copyright and available through Project Gutenberg; but that means being skewed in the direction of whatever books the Project Gutenberg volunteers like, because they make no effort to "balance" by any criterion other than what their volunteers volunteer to do. It also means being skewed toward the writers who were commercially successful enough in earlier times for their work to have survived long enough to end up in PG - and commercial success even today, and all the more in the past, is not distributed in an even-handed manner. The uneven distribution of commercial success decades before I was born may be someone's fault, but I don't believe it's mine.
Researchers like me spend mindboggling amounts of time and money constructing databases of samples (called "corpora") to use in this kind of project. We can never include all the writers we would like to, and if you want us to include writers you are interested in and we aren't particularly, hey, friend, the university is right over there. You go get the PhD and obtain the grant money, go learn what a hidden Markov model is and how to train one, read a few books on support vector machines, publish a few papers, spend the decade in poverty and celibacy that I spent, run around knocking on doors forever and sacrifice your other dreams, and then maybe you'll have some standing to tell me how to do my job.
Even if the list of authors in the database were balanced by any given criterion, there is no guarantee that the output would represent all those authors with equal or balanced probability, since the output probabilities would be determined by what people submitted for classification. Also, the space of "style resemblance" is multi-dimensional and complicated. It may well be that there are one or a few authors who really are the closest match to a very large number of possible inputs and many others who are the closest match for very few possible inputs; it's essentially a Voronoi diagram in which some of the cells may be much larger than others. (Hey, do you know what a Voronoi diagram is? If not, why the fuck do you think I should listen to you shoot your ignorant mouth off on the subject?)
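You don't have to take my word for the uneven cell sizes; here's a quick Monte Carlo sketch. Scatter some random "author" points in a modest-dimensional space (the dimensions, counts, and Gaussian model are all arbitrary choices for illustration), throw random query points at them, and count how many queries land in each author's Voronoi cell.

```python
import random
from collections import Counter

random.seed(0)

DIM = 10          # dimensionality of the toy feature space (assumed)
N_CENTRES = 20    # "authors" in the database
N_QUERIES = 20000 # random points standing in for submitted texts

centres = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_CENTRES)]

def nearest(point):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    return min(range(N_CENTRES),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centres[i], point)))

# Count how many random queries fall into each centre's Voronoi cell.
hits = Counter(nearest([random.gauss(0, 1) for _ in range(DIM)])
               for _ in range(N_QUERIES))

sizes = sorted(hits.values(), reverse=True)
print("largest cell:", sizes[0], "queries; smallest:", sizes[-1], "queries")
```

Even with the centres drawn from the same distribution as the queries, the cells come out far from equal - and nobody designed them to be unequal.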
In fact, there's a strong theoretical reason for the Voronoi cell sizes to be uneven: we're looking at a naturally occurring process, it's likely to produce a power-law distribution because that's what such processes do, and in that case there will be a few cells much larger than the others because that's what a power-law distribution means. I have not done experiments to verify the size distribution of the Voronoi cells in the writer-resemblance space; to my knowledge nobody has; but it's well-supported by consensus theory and I have written a peer-reviewed journal article on something somewhat related. You can read that article here.
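To see what a power-law size distribution implies, here's a sketch that draws hypothetical cell sizes from a Pareto distribution (the cell count and the shape parameter are arbitrary assumptions, not measured values) and checks how much of the total the few largest cells soak up:

```python
import random

random.seed(0)

# 1000 hypothetical Voronoi cells with power-law (Pareto) sizes; alpha is assumed.
sizes = sorted((random.paretovariate(1.5) for _ in range(1000)), reverse=True)

top_share = sum(sizes[:10]) / sum(sizes)
print(f"the 10 largest of 1000 cells hold {top_share:.0%} of the total size")
```

Under a perfectly uniform distribution the top 10 cells would hold exactly 1% of the total; under a power law they reliably hold many times that.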
Another interesting theoretical point has to do with intrinsic dimensionality. With something like writing style, it's reasonable to guess that the intrinsic dimensionality will be high. That means that in general, almost all points (pieces of written work) will be at the maximum distance from almost all other points (other pieces of written work). See Section 1.3 of my dissertation, the work of others cited therein, and Chapter 2. In that case, measuring the distance from one writer to another doesn't usually tell you much useful information, because it'll just say "no significant resemblance" nearly all the time; and if you're matching against a database, unless you get lucky and find a perfect match, you'll usually find no real match at all. If there is no close match, it's quite possible that a model of writer similarity might tend to match the input with the writer in the database who exhibits the highest per-word entropy; and very few lists of famous writers containing James Joyce will contain anyone with higher per-word entropy, so it's easy to predict that he in particular might end up matching a whole lot of inputs. (Now go look up what per-word entropy means, please.)
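The concentration effect is easy to demonstrate numerically. The sketch below draws random points standing in for feature vectors of texts (the Gaussian model, dimensions, and point counts are arbitrary assumptions) and measures the relative spread of pairwise distances. In high dimensions that spread collapses toward zero: almost every point sits at nearly the same distance from almost every other.

```python
import math
import random
import statistics

random.seed(0)

def rand_point(dim):
    """A random point with independent standard-normal coordinates."""
    return [random.gauss(0, 1) for _ in range(dim)]

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def relative_spread(dim, n=100):
    """Standard deviation of all pairwise distances divided by their mean."""
    pts = [rand_point(dim) for _ in range(n)]
    ds = [dist(pts[i], pts[j]) for i in range(n) for j in range(i + 1, n)]
    return statistics.stdev(ds) / statistics.mean(ds)

for dim in (2, 1000):
    print(f"dim={dim}: relative spread of pairwise distances = {relative_spread(dim):.3f}")
```

When the relative spread is tiny, "nearest" is barely more meaningful than "random" - which is exactly the problem for a nearest-writer classifier.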
Most writers are unique and unlike all other writers; and even if you can answer the question of which writer is the closest thing vaguely resembling a match, it will almost certainly be a very bad match. That is an unavoidable property of high-dimensional distributions in metric spaces, and a major topic of my dissertation and my multiple follow-up publications. In a few words, it's because high-dimensional spheres are almost all surface with almost no interior. A Koosh ball, if you remember those, is a good tactile model for how high-dimensional spheres behave.
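The "all surface, no interior" claim is one of the few here you can check with grade-school arithmetic: a ball's volume scales as radius^dim, so the fraction of a unit ball lying in its outer 1% shell is 1 - 0.99^dim.

```python
# Fraction of a unit ball's volume lying in the outer 1% shell, by dimension.
for dim in (2, 10, 100, 1000):
    shell = 1 - 0.99 ** dim
    print(f"dim={dim:4d}: {shell:.4%} of the volume is in the outer shell")
```

By dimension 1000, essentially the entire ball is shell - the Koosh ball picture in numbers.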
So: any "which author does this resemble?" system will often produce results you think are "wrong." That's how machine learning systems work (not very well), at least when applied to intrinsically high-dimensional problems like writing-style classification, and the mathematics of the underlying metric space do not imply anything bad about the people who build classification systems. Such a system may tend to fail by matching almost everybody with James Joyce; the only person about whom that implies anything interesting is Joyce himself, and it doesn't necessarily imply anything bad about him. Any "which author's work does this text resemble?" system will be forced, by lack of training data as well as by many other considerations, to have a very limited list of possible classification results - a list that doesn't match the population and doesn't match anybody's idea of a "balanced" list. That limitation does not indicate anything bad about the people who built the system. A "balanced" list is neither a reasonable nor even a remotely possible goal to demand of such a system; even if it could be achieved, it would not result in measurably better classification results, and that also does not indicate anything bad about the people who built the system.
If I ever did research in this direction, I sure had better not let anyone except my fellow scientists look at the results, eh? It's pretty clear that the Web community cannot accept or comprehend the limitations of this kind of science; and I'd be told how to do my job by people who think maybe egalitarianism would be nice, but don't have the slightest conception of the issues involved and don't have the most remote inclination to learn or listen; and then I'd be fucking crucified if my system failed to live up to unreasonable expectations.