
Why language models need not get worse

Thu 9 Feb 2023 by mskala

Sam Kriss has a Substack posting in which he describes the zairja of the world and then links it to his ideas on why AI is getting worse. Basically, what I get from the piece is that he's saying successive generations of GPT models have produced less and less valuable output as they better and better approximate the average content of the World Wide Web. Nearly all of the Web is garbage, and so an approximation of the average Web document is garbage too. The output from older models is weird and interesting because it is more random and approximates the average Web document less accurately; the output from new models is boring and worthless. My own view is that these issues are not as serious or as inevitable as he presents them. There seem to be some gaps in Kriss's understanding of how language models, and the ChatGPT Web service in particular, work; and filling in the gaps points to some natural solutions for the problem he describes.

You probably wonder what the zairja of the world is, and it's worth reading Kriss's posting for that. In a few words: the zairja of the world is a medieval Islamic text generator that seems to be surprisingly good at answering questions - something like ChatGPT if ChatGPT were invented by a Sufi mystic in 12th Century Morocco.

I'm a little surprised I hadn't heard of the zairja of the world before, because I'm interested in and know a certain amount about other similar things. Kriss mentions connections to Jewish gematria, which can be used for some of the same purposes. I'm also reminded of John Dee's Enochian magick, which has some specific details that seem very similar to the zairja of the world: in particular, concentric rings ("aethyrs") as a map of the universe; invocations of the astrological planets; and square lookup tables of letters.

portrait of John Dee

In Kriss's description of a story in the Muqaddimah, the zairja of the world, when asked, claimed that its own origin on this plane was in a revelation to the ancient prophet Idris, who is traditionally equated with Enoch.

The historical example is interesting, and well illustrates one of Kriss's points: that much of the value of what comes out of a text generator like the zairja of the world, or ChatGPT, is in the random and unclear nature of the output. What the zairja of the world gives you isn't really straightforward connected text answering your question. Instead, the output is a gnomic sequence of disconnected Arabic letters. You must add vowels to taste, and then interpret the result as poetry. You are free to convince yourself that the meaning you get from it is really coming from your own unconscious mind, instead of from the zairja itself or from supernatural entities guiding your actions.

Kriss views early, relatively primitive, GPT models as working in a similar way to the zairja of the world, with randomness and gaps in the output demanding interpretation from the human reader, and from his point of view that's a good thing. It's what makes them beautiful and worthwhile. Because of what I respectfully, as a practitioner myself, call mystical bullshit, random number generators tie into the deep underlying mechanics of reality and allow us to get at worthwhile esoteric knowledge. Kriss's analysis of GPT is coming from that frame of reference. GPT models are getting worse for him in the sense that they are becoming less suitable for use as divinatory oracles.

One of my own gripes with ChatGPT is that whether it's changing for better or for worse, it's certainly changing, and that makes studying it difficult. You cannot easily compare the results of today's ChatGPT with tomorrow's or yesterday's because Open-fucking-AI keep changing it. Comparing present-day ChatGPT against an earlier GPT model that wasn't a chatbot is also difficult because not many people had easy access to those models when they were current, and even fewer have it today.

But last year I posted some sample text from a GPT-J 6B model fine-tuned on some of my own fiction writing, and I think that shows the effects Kriss is talking about. The text is very much like a dream transcript; and some of the same insights that come from interpreting dreams might come from interpreting texts like this. Not a few readers might try to psychoanalyze me from that text, given that it was fine-tuned on my writings. And it'd be much harder to find the same kind of mystical interest in the output of today's ChatGPT.

Kriss is interested in emergent behaviour of GPT models. He writes:

Nobody ever taught ChatGPT to write code, but it does it. If you ask it to translate English into Arabic, it’ll usually insist that as a large language model it hasn’t been programmed to translate between languages, but it can; the translations it provided for me were about as good as Google Translate. Scientists call this capability overhang: sufficiently complex AI will end up showing skills that their programmers didn’t even know were there. As if they’ve tapped into a hidden order in the language of the world.

I think that makes ChatGPT's abilities sound like a bigger surprise than they are. Somebody actually did teach ChatGPT to write code - by showing it many contextualized examples of code, including in particular a lot of "how to write code" instructional material. That's basically the same way I learned to write code, and it's the (only) way a GPT model learns. Code is just a set of languages much simpler than English and Arabic - in particular, with rules much more strictly followed than the rules of human natural languages.

Earlier in the paragraph I quoted, Kriss mentions that in English, a q should usually be followed by a u and that's a rule of English. He incorrectly describes GPT models as working character-by-character. Actually these models work token-by-token, with tokens typically being whole common words or multi-character fragments of words, and the search algorithms often do look-ahead over distances even longer than one token. In most tokens containing a q, the u after it will be built into the token and not even seen by the model; the model doesn't need to learn it. But that detail isn't really significant. ChatGPT also correctly follows other rules that don't come for free with the tokenization and that it really did need to learn.
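To make the tokenization point concrete, here is a minimal sketch of a toy tokenizer. It uses greedy longest-match against a tiny vocabulary I made up for the example; real GPT tokenizers use byte-pair encoding with vocabularies of tens of thousands of entries, so treat this only as an illustration of why the model never sees "q" and "u" as separate symbols in a common word.

```python
# Hypothetical, hand-picked vocabulary; real vocabularies are learned from data.
VOCAB = {"the", " quick", " fox", " qu", "Iraq"}

def tokenize(text, vocab):
    """Greedy longest-match tokenizer (a stand-in for real BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("the quick fox", VOCAB))
print(tokenize("Iraq", VOCAB))
```

In this toy setup " quick" is a single token, so the "u after q" rule is baked into the vocabulary rather than learned; only text that falls outside the vocabulary gets split into individual characters.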

In English the u after q rule is only usually followed; "Iraq" is one exception. Rules like that in programming languages tend to be absolute, which makes them easier for the training process to pick up. Programming languages are, when you get down to it, just easier languages than English or Arabic.

If ChatGPT can produce plausible English or Arabic, it's unsurprising for it to also produce plausible C or Python. Those who can read code fluently often notice conceptual gaps in ChatGPT's code analogous to the gaps in its natural language output: the code obeys the syntax rules and will compile or execute, it looks plausible, but it may or may not really work, and even if it works, it's frequently not good code. The model is just doing the same thing with code that it does with natural languages and that's not an unexpected skill appearing by surprise.

As for English/Arabic translation, the phrase "as a large language model" is a clue: whenever ChatGPT insists it can't do something "as a large language model," that's not really the model talking. That's the political/safety filter. When you use ChatGPT through the Web interface, you aren't just using the model. The inputs and outputs go through a filter before and after the model, intended to prevent you from doing things the operators of the Web interface don't want you to do. Some of this filtering is also built into the model itself rather than a separate piece of software, but it's built in as a deliberately added extra. They train the model, and then they - pick a verb - correct, align, distort, skew, discipline, drug, lobotomize... the model to limit what it might do.

It's not accurate to say the programmers "didn't even know" it could do translation. On the contrary, they knew damn well it could do translation and they thought that was a bad thing. So they attempted to prevent it from being used that way. The fact that they couldn't really prevent users from finding a way around the limitation does tell an important story; but the story is not about unexpected emergent behaviour of the model. Rather it is a story of human arrogance and cunning.

Having said that GPT model output is interesting as a source of mystical revelation (which I agree with); and, as support for the value of the mystical revelations, that ChatGPT has surprising unexplained skills (which I would dispute); Kriss goes on to say that successive GPT models are getting worse and will continue to get worse. They are getting worse basically because they are getting less random and weird, and the randomness and weirdness is the good part so it's a shame to lose it.

My response to that is that things don't have to be this way. The effect he describes may be happening, but it's not inevitable.

A big part of the reason current ChatGPT's output is boring is the damnable politics/safety filter. Every time ChatGPT starts a sentence with "As a language model trained by OpenAI" you know that it could have given you a better answer than the one you're about to get, but it was prevented from doing so by human tampering. Even when it doesn't obviously spit out a boilerplate response that we've learned to recognize as meaning "I have been prevented from answering this," it's reasonable to guess that less visible filtering and "alignment" efforts are in play, steering it away from interesting answers that might be less "safe" and toward boring answers that are more "safe." That's not an emergent "smart" behaviour of a more sophisticated model, nor a necessary consequence of Sturgeon's Law applied to the Web. It's a deliberate (and expensive!) effort by OpenAI to make the overall Web service, which includes other software beyond just the model, be boring. This human effort is not inevitable.

I wish that Sam Kriss, and all the rest of us, could run the underlying model of ChatGPT on our own local computers without going through a Web service. I also wish that the politics/safety filter - including any aspects of it which end up being embedded in the model per se through fine-tuning - would simply be abandoned. Looking only at the ChatGPT Web service, which is constantly being revised for the express purpose of making it boring, and seeing that that Web service is getting more boring over time, doesn't tell us a lot about a trend to boringness in smarter and smarter models. If we could see the underlying models by themselves and pre-alignment, I think we would probably not see the same trend, or at least not as strongly.

There may not be much hope for ChatGPT in particular, or for OpenAI. But they don't have to be the only game in town, and it is reasonable to hope that some other group will eventually create and distribute state-of-the-art models without politics/safety filtering.

The other reason I see for hope has to do with hyperparameters, and specifically what is called "temperature." This point requires some technical discussion, but very briefly the idea is that the generation algorithm has an adjustable knob for how random its output will be, and it would be easy to just turn that knob up a little higher.

In more detail: the core function of a language model is not actually to generate language but to evaluate the likelihood of language. The language model looks at a sequence of tokens like "The quick brown fox" and produces a number like "9" that means "This is very much like what we'd expect to see in English-language text." If instead you say "brown the fox quick," the model might say "6" - that sample is less plausibly English, but if you go out and kill a fox and want to cook it, then there might be a recipe that would contain those words in that order. If you ask the model for an evaluation of "exchange quart fox expedited" it might say "0." That sequence of tokens is basically not compatible with English at all.
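As a rough illustration of "evaluating likelihood," here is a minimal sketch of a bigram model that scores word sequences. The three-sentence corpus is invented for the example, the smoothing is deliberately crude, and the scores are log-likelihoods rather than the 0-to-9 ratings used above; a real GPT model is a neural network over tokens, not a table of word-pair counts.

```python
import math
from collections import Counter

# Tiny made-up corpus, purely for illustration.
CORPUS = [
    "the quick brown fox jumped over the lazy dogs",
    "the quick brown fox ran",
    "brown the fox quick then serve",
]

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for s in sentences:
        words = ["<s>"] + s.split()
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, vocab

def log_likelihood(text, model):
    """Sum of log P(word | previous word), with crude add-one smoothing."""
    pair_counts, context_counts, vocab = model
    words = ["<s>"] + text.split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
        total += math.log(p)
    return total

MODEL = train_bigram(CORPUS)
for text in ("the quick brown fox",
             "brown the fox quick",
             "exchange quart fox expedited"):
    print(f"{log_likelihood(text, MODEL):8.3f}  {text}")
```

Higher (less negative) scores mean "more like the training text." With this corpus the three example phrases come out in the same order as in the discussion above: the familiar phrase scores best, the recipe-like reordering is in the middle, and the word salad scores worst.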

In order to generate language, you combine the model with a search over possible texts. The search is for an answer to something like "Out of all texts that start with 'the quick,' which text does the language model rate highest?" Candidates might include "The quick brown fox jumped over the lazy dogs," something about "The quick and the dead," and so on. In a chatbot, the texts are two-sided conversations, and so the search is evaluating, "Out of all possible transcripts of conversations that start with this prompt, which transcript, containing both prompt and response, is most likely?"

There are two problems with doing the search. The first is that the single most likely text is probably not actually very good, because the model is only a model, an imperfect representation of the training data. It exhibits what are called pathological effects, especially the one called overfitting. If you really find exactly the one text that the model judges as having the highest likelihood, it's likely to end up being something useless like one sentence repeated over and over again, or a couple of sentences that change the subject to one strongly represented in the training data, followed by a lengthy discussion of the changed subject. It might also end up spitting out a lengthy exact quote from the training data. Asking for the single most likely text is a great way to find the model's theoretical limitations.
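The repetition failure is easy to reproduce in miniature. The sketch below always picks the single most likely next word from a hand-written probability table (the table is hypothetical, not taken from any real model). Strictly speaking this is greedy word-by-word selection, not a search for the globally most likely text, but it shows the same kind of degenerate loop.

```python
# Hypothetical next-word probabilities, invented for this example.
NEXT = {
    "i":     {"like": 0.6, "am": 0.4},
    "like":  {"that": 0.7, "foxes": 0.3},
    "that":  {"i": 0.8, "fox": 0.2},
    "am":    {"that": 1.0},
    "foxes": {"i": 1.0},
    "fox":   {"i": 1.0},
}

def greedy(start, steps):
    """Always append the most probable next word; no randomness at all."""
    tokens = [start]
    for _ in range(steps):
        options = NEXT[tokens[-1]]
        tokens.append(max(options, key=options.get))
    return tokens

print(" ".join(greedy("i", 8)))
```

Because the most probable path through the table is a cycle, the output repeats "i like that" for as long as you let it run, even though less probable continuations exist at every step.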

The second problem is that even if you wanted to find the single most likely text, you couldn't actually do so. The likelihood of a text according to a model is a complicated function of the entire text. The number of different possible texts is an unimaginably huge number, like the number of books in the Library of Babel. You cannot run the model on every possible text to find the one that it rates most highly; and there may not actually be any faster way to find the best one. Maybe if you know things about the structure of the model you can have clever ways of proving that entire large classes of possible texts cannot possibly be optimal, so you don't need to consider them in detail; but such techniques aren't really strong enough in practice. This is a common situation in computer science: finding a global optimum, that is finding exactly the one very best answer to a question, is hard, and often, as here, it's too hard to really be possible.
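A quick back-of-envelope calculation shows the scale. The figures here are assumptions chosen for illustration (a vocabulary of 50,000 tokens and a text exactly 100 tokens long), not the parameters of any particular model.

```python
def count_texts(vocab_size, length):
    """Number of distinct token sequences of exactly `length` tokens."""
    return vocab_size ** length

def decimal_digits(n):
    """How many decimal digits the integer n has."""
    return len(str(n))

# Assumed, illustrative figures only.
n = count_texts(50_000, 100)
print(decimal_digits(n))
```

Under those assumptions the count is a 470-digit number, for a text of only about a paragraph. Scoring each candidate with the model, one at a time, is hopeless at that scale.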

So in practice you settle for an approximation. You find a text that the model does not rate as the single very most likely one, but that it rates as pretty likely, more likely than any of the others you've looked at, more likely than any closely similar text, probably about as good as the theoretical globally best text. Instead of an unimaginable number of guesses, maybe you look at some thousands or tens of thousands of guesses, and you do clever things to make sure that all those guesses are pretty good, and then you take the best one you see and call it good enough. This kind of approximation works well, and in the AI context it ends up being absolutely necessary to do this kind of approximation.
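One standard way of doing this kind of approximate search is beam search: at each step, keep only the few best partial texts and extend those. The sketch below runs it over a small hand-written probability table, which is hypothetical and stands in for a real model. I am not claiming this is what ChatGPT does internally, only that it is a common instance of the idea.

```python
import math

# Hypothetical next-token probabilities, hand-written for illustration only.
NEXT = {
    "<s>":   {"the": 1.0},
    "the":   {"quick": 0.6, "lazy": 0.4},
    "quick": {"brown": 0.7, "and": 0.3},
    "brown": {"fox": 0.9, "dog": 0.1},
    "and":   {"the": 1.0},
    "lazy":  {"dog": 1.0},
    "fox":   {"</s>": 1.0},
    "dog":   {"</s>": 1.0},
}

def beam_search(beam_width=2, max_len=6):
    """Keep only the `beam_width` best partial texts at each step."""
    beams = [(0.0, ["<s>"])]   # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, p in NEXT[tokens[-1]].items():
                cand = (score + math.log(p), tokens + [tok])
                if tok == "</s>":
                    finished.append(cand)    # complete text, set it aside
                else:
                    candidates.append(cand)  # still needs extending
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return max(finished, key=lambda c: c[0]) if finished else None

for width in (1, 2):
    score, tokens = beam_search(beam_width=width)
    print(width, round(math.exp(score), 3), " ".join(tokens))
```

With a beam width of 1 the search commits to "quick" because it looks best at that moment, and ends up with a text of probability 0.378. With a width of 2 it keeps "lazy" alive long enough to find "the lazy dog" at 0.4. Wider beams look at more candidates and cost more computation, which is the trade-off described above.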

When you do this kind of approximate search - to find a good text according to the model even if not the globally best text - you have a choice of how widely to cast the net. You can try to stick tightly to where you think the global optimum might be, even though you cannot know exactly, looking only at texts near the average. Or you can make the search more loose and random, choosing from among candidates that depart from the average in different directions. The decisions you make about how to organize your search are controlled by what are known as "hyperparameters"; especially one that is usually called "temperature." What it has to do with physical temperature is complicated, but the name comes from statistical physics, by analogy with annealing: heating and slowly cooling metal objects to make them softer.

The temperature, pretty much directly, controls how random and weird the output is going to be. Choose a low temperature and you get output that sticks very closely to the average of all the input documents - which is what Kriss considers boring and ugly. Choose a high temperature and the output will be more random and weird. I've seen this in my own experiments with GPT-J 6B; the temperature setting is, quite straightforwardly, a setting for how random and weird the output should be.
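In the usual formulation, temperature divides the model's raw scores before they are turned into probabilities, and the next token is then drawn at random from the result. Here is a minimal sketch of that; the three candidate tokens and their scores are made up, and real systems often combine temperature with other sampling settings that I'm leaving out.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng):
    """Draw one token at random according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

TOKENS = ["fox", "dog", "zairja"]
LOGITS = [4.0, 2.0, 0.0]   # made-up scores for the next token

rng = random.Random(0)
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(LOGITS, t)
    picks = [sample(TOKENS, LOGITS, t, rng) for _ in range(10)]
    print(t, [round(p, 3) for p in probs], picks)
```

At a low temperature nearly all the probability goes to the top-scoring token, so the output is almost deterministic; at a high temperature the three options approach equal odds and the unlikely "zairja" turns up regularly.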

So it seems to me that a big part of Kriss's complaint, that ChatGPT's output is not random and weird enough, may simply come down to OpenAI having set the temperature too low on ChatGPT. And, fortunately, that would be a trivially easy thing to change - if OpenAI chose to do so. For now, we're at their mercy.

I reiterate that we need to have these models available in a form where users have full control. If we could run them on our own computers instead of being limited to a Web service; if we could turn off the damned filter; if we could adjust the temperature; then most of the problems Kriss describes would disappear, and we could better address any problems that remain. Then, maybe, GPT models really could fulfill the promise of the zairja of the world.


Something I’m not understanding about all this is where all the money goes. Microsoft and others have reputedly invested several billions of dollars into OpenAI – what is that buying? That amount over 2-3 years would pay for several thousand very well paid so-called engineers (i.e. a combo of theoretical and practical computer scientists) and a good deal of upscale but not unduly specialized-for-AI (and therefore widely available) hardware. What I’m getting at is why can’t we smart folks all just do this on our own – why is access to someone else’s models so critical? Is there some magic and secret Great Idea that someone had that is needed to build the models? Presumably there is some fundamental cost that I’m not seeing. I note that Monica Anderson spent a long time wishing/whining for better hardware for her AI research, and yet she seemed to be talking thousands, not billions of dollars, so it doesn't seem to be that.
Tony H. - 2023-02-10 14:46
I don't know what their budgets look like; but on top of the usual startup wastage, what they do does involve a lot of human-mediated data processing - gathering and annotating training data in particular. There was that story that was in the news recently and then fell out of attention about how OpenAI was paying a low hourly rate to human annotators in Kenya to mark up what was and wasn't "hate speech" for the politics filter. I don't think they ought to *have* a politics filter, but they consider it a necessity, and implementing it the way they do does require a *lot* of human labour, which adds up even at a low hourly rate. And other things than the politics filter also require human-annotated data. Computation at the scale they're doing, for the training, also requires an ongoing cost, even if they buy the hardware (which is not guaranteed; they may be renting a lot of it). It's an energy sink on the scale of Bitcoin mining.

As well ask why don't we smart folks build a better search engine now that Google Search sucks.
Matthew Skala - 2023-02-10 16:04
