I went through a bit of a crunch to get Tsukurimashou 0.5 out the door before my year-end vacation. With that done, and at least 99 kanji to do before the next planned release, I have a chance to sit back and think about some longer-term and spin-off projects. Here are some ideas on kanji searching.
UPDATE: A prototype implementation of the system described here now exists as part of the Tsukurimashou project, and you can check it out via SVN from there. Packaged releases will be available eventually.
The kanji search problem
I hope S. R. Baker won't be offended if I mention an amusing conversation I had with him a few years back. We were talking about lame pop-culture stuff and he mentioned a trend he'd observed of white people wearing cheesy baseball caps with a certain Chinese character on them, just because they thought it looked cool, almost certainly without knowing what it meant. So I asked: well, which character is it? I'd like to look it up and find out what it does mean. But even if he knew what the character looked like, it wasn't clear how he could describe it to me when neither of us could read Chinese, short of creating an image file and sending me that; and he was completely flummoxed by the idea that, given a picture, I could possibly look it up. Well, I'd look in a dictionary, right? But... but... how could a dictionary of Chinese characters ever exist? Well, we have dictionaries for other languages, don't we?
The thing is, although there's a lot he might not have known about Chinese, it is a very reasonable question to ask. There are tens of thousands of Han characters - "hànzì" in Chinese, "kanji" in Japanese; other names exist for them in some of the other languages that share subsets of that character set; I'm going to say "kanji" from here on - and it's not instantly obvious how you could index them in a dictionary so that you could look them up usefully, because they're basically pictures. Alphabetic scripts, such as the one used for English, can be organized into "alphabetical order," which from a computer science perspective is equivalent to pretending that each word is a number and putting the words in numerical order. If you don't know the alphabet and the order it goes in, then an English dictionary won't help you much. And with kanji, where the problem the dictionary is trying to solve basically consists of "you don't know the alphabet," it seems at first glance that if you could use a dictionary at all, you wouldn't need one.
I imagine that the most naive expectation of what a kanji dictionary might look like would be that it's pages and pages listing all the tens of thousands of characters, one after another, with no particular structure to the ordering, and you just have to search from the start until you find the one you want. That's actually quite interesting to me as a computer scientist because it's directly related to a topic I covered at some length in my PhD dissertation, namely dimensionality. Some things are capable of being organized in such a way that it's easy to look them up. Some things aren't. You can search the alphabetical-order English dictionary in O(log n) time; searching a completely unorganized list of kanji takes the much worse O(n) time, which is fundamentally different.
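The gap between those two cases can be made concrete with a tiny sketch (Python, purely for illustration; the word list is of course made up):

```python
from bisect import bisect_left

# A sorted list of "words" can be searched in O(log n) comparisons.
words = ["apple", "banana", "cherry", "date", "elderberry"]

def lookup_sorted(sorted_words, target):
    """Binary search: repeatedly halve the interval being searched."""
    i = bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

def lookup_unsorted(blobs, target):
    """Exhaustive scan: O(n), the best we can do when the entries
    have no usable ordering at all."""
    return any(b == target for b in blobs)
```

The point is not the code but the shape of the problem: `lookup_sorted` depends entirely on the existence of an ordering, which is exactly what a naive pile of kanji lacks.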
It's possible to measure the tendency of things to be easy or difficult to organize, and say that things which are easy to organize have low "dimensionality" and things which aren't have high dimensionality. Words in English appear to have low dimensionality, and kanji appear to have high dimensionality, and that creates the difference in indexing difficulty.
And yet... if you want to have a literate culture it seems like it would be a really good thing if you could have dictionaries that actually worked. And indeed it is a fact that kanji dictionaries do exist, and have existed for thousands of years. So, how can they be made to work?
Characters and radicals
The first really important thing to observe is that kanji do have features that can be used to describe them. They're not a completely flat undifferentiated space of unrelated blobs, which is the recipe for truly high dimensionality. In particular, they have shared substructures, which often are also valid kanji in themselves, in a tree-like organization. For instance, 神 (meaning "god") can be described as 礻 (which is common to many kanji related to religion) next to 申 ("zodiac sign of the monkey" - these meanings are not necessarily relevant, but I'm mentioning them as a mnemonic device). Recognizing that kind of structure is a huge win for dictionary-building because it gives us some reason to group some kanji together. These smaller parts - or, technically, a standardized subset of them - are called "radicals" and form the basis for a lot of dictionaries.
Some of the parts are more standardized than others. For instance, in 僕 (in current Japanese usage, that's a pretty common masculine first-person pronoun, like "I" but with a gender distinction that doesn't exist in English) the thing on the left-hand side is a very common radical, but the right-hand side is something weird that seldom occurs elsewhere in the character set. Many characters include these kinds of unique or almost-unique components. So it is not the case that you can just have a short list of maybe a couple hundred components and then build everything else up from those. It still seems like you would like to, though.
So there's the start on how we could organize kanji into sorted order for use in a dictionary: define some standardized radicals (the traditional list includes 214 of them), including a priority order. To use such a dictionary, you have to memorize the list of radicals, which isn't great, but it's much better than having to memorize the entire dictionary, which is where you'd be otherwise; it's not that much different from memorizing the alphabet to use an English dictionary. You find the highest-priority radical in the character (which, depending on the exact system, may be determined by where the radicals occur in the character as well as by some radicals outranking others), and look in that section of the dictionary. Within the per-radical section, characters can be sorted by the number of "strokes," which can be counted by looking at the character. If two characters have the same highest-ranked radical and the same number of strokes, there may be no good way to order them; but at that point there are usually no more than a couple dozen kanji that match, so it's no big deal if you have to do an exhaustive search of what's left.
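Mechanically, that scheme amounts to sorting on a compound key and then scanning one small bin exhaustively. A minimal sketch - the radical numbers and stroke counts here are invented stand-ins, not taken from any real dictionary:

```python
# Hypothetical index entries: (radical priority number, residual stroke
# count, character).  These numbers are made up for illustration; a real
# dictionary would assign them from the traditional 214-radical list.
entries = [
    (113, 5, "神"),
    (113, 5, "祁"),
    (61, 7, "悦"),
    (9, 12, "僕"),
]

# Sorting on (radical, strokes) puts the whole dictionary in usable order.
ordered = sorted(entries)

def find_bin(radical, strokes, index=ordered):
    """Locate the small bin that is then examined exhaustively."""
    return [c for r, s, c in index if r == radical and s == strokes]
```

The human does the hard part - identifying the highest-priority radical and counting strokes - and the data structure does the rest.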
For a long time, that was the state of the art for kanji dictionaries, which were necessarily published as physical books printed on paper because that was pretty much the only way we had for distributing collections of reference data. Now, of course, we have computers, and modern library science; so it becomes an interesting question whether we can do better today than was done in previous centuries.
Toward the end of the paper-books era of kanji dictionaries, people started coming up with other ways of organizing them that seemed easier to use. Often these consisted of other ways to arrange the characters into a few thousand categories which could be put in a sensible order and the right one examined exhaustively - the same basic idea as "radical + number of strokes" but without necessarily requiring the user to memorize radicals. For instance, there's something called the "SKIP code" where you memorize some rules for describing a kanji, and extract a three-component index number which you can look up. The SKIP code for 神 is 1-4-5: 1 because it has a left and a right part, 4 because the left part has 4 strokes, and 5 because the right part has 5 strokes. There is another system called the Four-Corner code, which uses a different set of rules but is fundamentally the same idea: kanji go into bins according to a set of rules you must memorize, and then once you find the right bin, you examine it exhaustively to locate the one you want. Under the Four-Corner code, 神 goes in bin number 3520.6.
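Seen as data structures, these dictionary codes are just hash-like binning functions that the human computes by eye. A sketch - only the 1-4-5 code for 神 comes from the discussion above; the other entry is an invented placeholder:

```python
from collections import defaultdict

# Map each kanji to its SKIP-style bin.  The code for 神 (1-4-5:
# left-right split, 4 strokes on the left, 5 on the right) is the real
# one described above; the entry for 意 is invented for illustration.
skip_codes = {"神": (1, 4, 5), "意": (2, 5, 8)}

skip_index = defaultdict(list)
for kanji, code in skip_codes.items():
    skip_index[code].append(kanji)

# Finding the right bin is a single lookup; the bin itself is then
# examined exhaustively, just as with radical-plus-strokes.
```

The Four-Corner code is the same picture with a different binning function.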
One distinguishing feature of these "dictionary code" schemes is that the rules for the dictionary code are only useful for looking up kanji in the dictionary. Learning the dictionary code doesn't directly help you at all in learning the underlying language. For me, personally, that seems like a problem because being a bear of very little brain, my ability to memorize lists of arbitrary data is limited and it seems like a shame and a waste to spend that resource on the dictionary code when I'd rather be learning the actual language. On the one hand, a modern dictionary code involves less stuff to memorize than the much longer traditional radical list used by the older dictionaries; on the other hand, it's more arbitrary and less language-focused than the traditional radical list, which was at least sort of related to the etymology of the characters. Can we do even better with modern technology?
Nowadays, kanji dictionaries often exist as computer software rather than printed books, and that allows the use of much more convenient query methods. One in particular that I've found quite useful is "multiradical search", as seen on sljfaq.org's Web site (similar things are available through local software instead of Web services). Looking at the Web form, you're confronted by the roughly 200 radicals; but you don't have to know how to determine the one radical that would be the primary key for searching in a traditional paper dictionary. Instead, you just tick off whichever radicals you recognize, and the computer narrows down the list to show whatever kanji it has dictionary entries for that contain all the selected radicals, and eventually you can either spot your choice on the list, or start adding "number of strokes" constraints to narrow it further.
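Under the hood, multiradical search is just a subset test against a component database. A minimal sketch - the decompositions below are simplified for illustration, and a real database would be far larger:

```python
# Toy component database mapping each kanji to the radicals it contains.
# These decompositions are simplified; real databases record more detail.
components = {
    "悦": {"忄", "口", "儿"},
    "説": {"言", "口", "儿"},
    "神": {"礻", "申"},
}

def multiradical_search(selected, db=components):
    """Return every kanji that contains all of the ticked-off radicals."""
    return sorted(k for k, parts in db.items() if selected <= parts)
```

Ticking more radicals shrinks the candidate set monotonically, which is why the interface works well even when you only recognize a couple of the parts.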
The big win of this technique is that you don't have to know all the radicals, and you don't have to necessarily recognize all the weird variant forms. You can choose whichever ones you do recognize, in any order. For instance, knowing that the left side of 悦 ("ecstasy") is actually a form of 心 ("heart") is special knowledge you might not have, but it won't be a problem if you recognize that the right side includes 口 ("mouth") and 儿 ("legs"). However, it can still be a problem if the radicals you think you recognize don't match the database's organization of them, or if important parts of the character are really unique and you can only recognize the parts that are very common and don't narrow it down much.
Another big issue is that this system only recognizes whether things do or don't appear. For instance, there's a pretty common radical that looks like a Greek "beta" and appears on the left in characters like 陀 ("steep") but on the right in characters like 祁 ("intense"). It seems like a shame to have to wade through characters matching one of those cases when you're looking for the other. The page I linked to actually solves that by counting these two as completely separate radicals (and they may have different etymology); but on the other hand, you might not know that they're different, and reasonably want to look for all characters that match either case without specifying them separately, so it's not a perfect solution. And other radicals (like 心 "heart") can occur in multiple different locations; splitting all such cases would result in so many categories that the list would be hard to use. It would be nice to be able to specify a query that includes information about the layout, not just the radicals that appear; but that pushes us back toward the dictionary codes like SKIP, which do include layout information.
Here's another idea: actually draw a picture of the kanji you're looking for and let the computer try to figure out what it is. Something like that is available on sljfaq.org too. You start writing the kanji in the little window with the mouse, it displays its best guesses for what the entire character you're aiming at will look like, and usually by the time you've written about half the kanji, the one you want will have appeared on the list and you can click it. You don't need to know what the official breakdown into radicals might be. It doesn't have to be a Web service, either - some of my Japanese friends carry around little pocket-size devices with touch pads and essentially this same kind of database built in locally.
This approach works well to look up the meaning of a character if you already know how to write it. The big problem is that I don't know how to write a given kanji just by looking at it. In particular, the query fails completely if I don't get the "stroke order" exactly perfect; so unless I have that memorized, I have to do a number of queries that is factorial (!) in the number of strokes, before the computer will be able to recognize what character I'm drawing. On the other hand, stroke order is partially predictable; and it's something language-relevant that I might reasonably want to memorize and learn to guess for other reasons. So given that I am actually trying to learn the language and I have some level of proficiency at writing kanji already, this kind of query does end up being fairly useful for me. It's not a complete solution, though.
Another problem with this approach is that it's essentially querying always on the first strokes of the character and never on the last ones. If the part I'm interested in querying happens to be the final few strokes and I don't care about the earlier ones - that is, I'm looking for multiple results - it's pretty much useless. For instance, it's just not possible to answer the query "find me all characters that have 母 ("mother") on the right" with the handwriting recognition method, because the left-side strokes will be written first and those are different for all the different answers to that query. If I want to do that kind of query, I need a different method, such as multiradical search.
Other problems, related to software development
In the course of the font project, I've often found myself wanting to do queries on kanji databases that aren't directly motivated by looking at a particular kanji and wondering what it means. For instance, when I was designing Tsukurimashou's version of 意 ("idea"), I recognized that it contained three parts: 立 ("stand up"), 日 ("sun"), and 心 ("heart"). All of those were macros I'd already written. But should I structure the code that draws 意 as 立 above a composite of 日 and 心? Or as a composite of 立 and 日, above 心? Or as a flat combination of the three without substructure? Would either of those hypothetical "composites" be subroutines I could re-use in other kanji, even if the smaller combinations would not be kanji I'd care about in themselves? I need general knowledge about how these parts tend to be used throughout the writing system in order to structure the code properly. Sometimes a shared substructure is used in several places without ever appearing as a character itself, so there is some need to query the character database for structures that do not actually occur as dictionary headings. Also, sometimes I build kanji in the font in ways that do not correspond to their etymological origins, for various reasons, so there is a need to query the knowledge of what is currently in my code base, independently of the knowledge of the etymological structure of a character. Obviously, no ready-made dictionary can answer such queries. Right now I'm doing them by running grep against my source code, using memorized names of macros, or even looking up character codes in dictionaries and reverse-engineering my own software to answer "How did I code this other similar glyph?"; the difficulty of doing that and interpreting the results will grow as the code base grows and eventually, inevitably, become unworkable.
There is another somewhat related problem people face in kanji software development, which is that the set of kanji is open-ended. It sometimes happens, especially when dealing with personal names, that you will want to write a kanji that isn't in the dictionary. This is generally called "the gaiji problem," and it will continue to happen no matter how many extra characters you add to the dictionary, up until you get into millions. If you're trying to write a text file that needs to include one of these special exotic kanji, what are you going to do? It doesn't have a Unicode code point. It's not obvious what sequence of bytes you could possibly put in your file to denote that character. I suppose the Google Plus solution would be to tell the person involved "That can't possibly be your real name, you are BANNED!" but not all of us are Google and able to get away with that. Adobe has developed (and likely patented) technology based on creating and embedding special one-glyph fonts to cover kanji that otherwise don't exist. It gets pretty complicated.
The bottom line on database queries seems to be that I would really like to have:
- A general way of describing kanji and parts of kanji, including kanji-like things that do not actually exist, or at least do not have entries in any standard dictionary. It should be specific enough to cover the spatial organization of the parts and narrow down the description to a single character, not just to a collection of a few dozen for exhaustive examination.
- A rich and precise query language for these descriptions, allowing me to specify any feature of the character (radicals with or without their locations) or combination of features, without being forced to specify any feature in particular.
- Software and ready-made data to actually do queries on the existing language as well as on new databases I may create.
I'd like this to be as easy to use as grep.
Ideographic Description Sequences
The Unicode Consortium in their wisdom has thought about some of these issues, and they came up with an interesting solution (to exactly which problem is not quite clear) in the shape of the "Ideographic Description Characters" (IDCs), U+2FF0 through U+2FFB (see the PDF code chart), and instructions for their use (in section 12.2 of the standard). I'd like to be able to use those actual characters in this article, but I imagine many readers won't be able to see them, so after an example or two that you may or may not be able to read, I'm going to introduce other syntax of my own.
The idea is that when you want to describe a kanji for whatever reason - and this is especially, but not exclusively, intended for kanji that don't have standard code points of their own - you can use the IDCs along with other already-existing code points to create a sequence (an "Ideographic Description Sequence" or IDS) that explains how the character is constructed. It can't, and doesn't try to, solve all problems, but it's supposed to help with a lot of common cases. For example, to describe the character 神, which is U+795E, you could write ⿰礻申, which is U+2FF0 U+793B U+7533. The U+2FF0 is a prefix operator that specifies "take the next two things and put them next to each other." Similar operators exist for putting things one above another, one wrapped around another from the left and bottom, and so on; the set of operators is designed to capture the kinds of combining operations that actually occur in the kanji character set.
The real power of it is that each of the "next two things" after U+2FF0 can be, instead of a single existing character, a complete IDS in itself. You can build up a lot of complicated kanji from simple parts. For instance, we can describe 能 ("talent") as "⿰⿱厶月⿱匕匕": 厶 over 月 and that next to 匕 over 匕 ("myself over moon and that next to spoon over spoon"). Note that the lower-left corner is not, actually, quite identical to 月; the description isn't perfect, but it helps a lot in explaining what the character looks like.
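Because each operator has a fixed arity, an IDS parses naturally by recursive descent. A minimal sketch covering the side-by-side and stacking operators (the remaining IDCs, mostly "surround" operators, could be added to the same table):

```python
# Arity of some Ideographic Description Characters: ⿰ (left-right) and
# ⿱ (top-bottom) take two operands; ⿲ and ⿳ take three.
ARITY = {"⿰": 2, "⿱": 2, "⿲": 3, "⿳": 3}

def parse_ids(chars):
    """Parse a prefix-notation IDS into a nested tuple of the form
    (operator, operand, ...), or a bare character for a leaf."""
    head = next(chars)
    if head in ARITY:
        return (head,) + tuple(parse_ids(chars) for _ in range(ARITY[head]))
    return head

tree = parse_ids(iter("⿰⿱厶月⿱匕匕"))  # the IDS for 能 given above
# tree == ("⿰", ("⿱", "厶", "月"), ("⿱", "匕", "匕"))
```

Each operator consumes exactly as many following subtrees as its arity demands, which is why no brackets are needed in the sequence itself.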
IDSes describe kanji in a simple stack-ish language. It might be called "Polish notation," bearing in mind that it's the reverse of the more familiar Reverse Polish Notation. It's not possible to describe absolutely every kanji with absolute precision this way, but you can get pretty far, including into the gaiji territory not covered by standard code points.
If I had descriptions like this for all the characters in the dictionary, it seems like I could do a search on them with grep or something like it, and get the kinds of queries I'd really like to have. For instance, when building 能 recently, I wanted to know whether "匕 over 匕" was a common substructure that I should make into a subroutine for use elsewhere. How can I do that? Multiradical search won't work because it will just return everything that includes 匕 at least once (no way to say "twice, in that arrangement".) Handwriting recognition won't work because the part I want to query appears on the right side, which comes last in stroke order. The dictionary codes and radical-stroke system don't seem to work because they key on other features, and anyway I'm not smart enough to use them. It doesn't appear there's any good way to do this query at all. But my computer ought to be able to understand this kind of query.
Ideally, I would like to be able to type something very much like the following into my Linux command line. Assume for the sake of the discussion that I and my computer have already come to a mutual understanding of how I'm going to type 匕 on my keyboard.
idsgrep '...[tb]匕匕' dictionary
I want that to mean "search the dictionary for any characters that include, anywhere, a sub-character that can be described as 匕 and 匕 in a top-to-bottom arrangement." The "..." is an operator saying "what comes next may appear anywhere, not just at the top," something like the opposite of what "^" does in grep. The "[tb]" is meant to express the meaning of ⿱ U+2FF1 using characters that are easier for me to type and easier for you to see in your browser. I have some detailed plans for exactly how I would like the full syntax of idsgrep's patterns to actually work.
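Given character descriptions parsed into nested tuples (with the operator as the first element - a representation I'm assuming here purely for illustration), the "..." operator amounts to a recursive subtree search:

```python
def matches(pattern, tree):
    """Exact structural match of a pattern tree against a character tree."""
    if isinstance(pattern, tuple):
        return (isinstance(tree, tuple) and len(pattern) == len(tree)
                and all(matches(p, t) for p, t in zip(pattern, tree)))
    return pattern == tree

def matches_anywhere(pattern, tree):
    """The '...' operator: match at this node or at any descendant."""
    if matches(pattern, tree):
        return True
    if isinstance(tree, tuple):
        return any(matches_anywhere(pattern, sub) for sub in tree[1:])
    return False

# 能, following the earlier description, as a nested tuple:
nou = ("[lr]", ("[tb]", "厶", "月"), ("[tb]", "匕", "匕"))
# The query '...[tb]匕匕' succeeds, finding 匕-over-匕 inside 能:
assert matches_anywhere(("[tb]", "匕", "匕"), nou)
```

Running that test against every tree in a dictionary file is the whole of the idsgrep idea.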
Sidebar: IDSes cannot substitute for existing character codes
It is natural, when confronted by the underlying structures of kanji that make IDSes possible, to say "Well, then, we're really stupid to use the existing system of basically one code point for every kanji! We should instead create a code, based on IDSes or something like them, that describes each kanji by its parts! That would make software and fonts and stuff much easier, because you would only have to have maybe a hundred or a thousand shapes and some intelligence for how to lay them out, and then you could generate any character you wanted. No gaiji problem anymore, smaller font files, interesting computer science problems to solve, yay!" I once thought that myself, have heard it from a lot of non-experts, and have even heard it from experts who should know better. It really isn't as good an idea as it may sound.
First of all, the underlying set of smaller shapes you need to combine to make complete characters, isn't entirely closed. We run into cases like that right side of 僕, which isn't a standard radical; and that's actually a relatively easy one because it is a standard, though rare, kanji with a Unicode code point (菐 U+83D0, no clear meaning); it's just not a radical. In a dictionary organized for looking things up visually you might get away with pretending that the right side of 僕 is 業 ("business"), which is common and looks similar, but that really isn't true and a scaled version of 業 cannot be used to write 僕 correctly without additional processing for explaining how to delete the extra stroke. Other kanji-component things exist that are not, and cannot easily be broken up into, standard smaller parts at all. Trying to cover all such cases as radicals to combine leaves us with a huge set of wacky rare radical-like things, and our fonts still need thousands of glyphs but now we've layered a complicated layout engine on top of everything.
Another problem is that even when a kanji appears to be made up of standard parts, the parts and their layout still need to be customized to look right. Consider my example of 能 ("talent"); the thing at lower left looks enough like 月 ("moon") that that's a good way to index it in a dictionary, but it is not really a scaled-down version of 月; it has to be a different shape to look right. (Note, especially, the lack of a curve to the left at the bottom of the left stroke.) In Tsukurimashou I actually draw 月 and the lower left of 能 with two different macros. An IDS-like character encoding could go the same route and have them be different primitives, but if different font designers want to make different decisions about that, we lose standardization (the byte sequence that means 能 will differ depending on the font you're using, which is an interoperability disaster), and the number of primitives continues to grow, and so on.
The combining operations need customization, too. Look closely and you'll see that the line between top and bottom is at a different point on the left side of 能 and on the right side of 能 - so it's not enough to just say "this goes above that," you need a smart way of specifying where the line is, and how much the parts possibly overlap. Much the same occurs left to right. Specifying those dividing lines seems more like the font designer's job than the encoding system's job, but if the font only has glyphs for 厶, 月, and 匕, it's not clear where that knowledge will be stored. An automatic system will have trouble just looking at "here are 厶, 月, and 匕; put 厶 over 月 and 匕 over 匕 and put those pairs next to each other" and getting a nice glyph for 能 without additional information. Sometimes to get a nice glyph, you also need to make deeper modifications than just dividing line and overlap.
Take a look at the Tsukurimashou code and you can see how I addressed this kind of thing; the code for 能 does NOT just express "these four things go in the four quadrants of the glyph" but also contains some extra numbers describing the combining operations. Some composite glyphs also contain arbitrarily complicated code describing modifications that must be made to their parts. In other words, describing 能 in enough detail to typeset it and have it look good is more than describing the sum of its parts. If you'd rather think in terms of English-language fonts, consider whether you can get a good ff-ligature glyph just by saying "two f's next to each other and overlapped"; building kanji from parts is not much easier. Some previous Tsukurimashou-like projects have figured this out and solved it by techniques similar to mine. Others have not gotten far enough to see the problem, and made overly optimistic predictions about their ability to define large kanji from smaller parts and have the results look good without per-glyph design work. Nonetheless, the "here are the pieces of a larger thing" description method remains obviously valuable for dictionary lookups even if it's not enough for rendering.
All in all, if we try to use IDS-like encoding everywhere instead of code points per glyph, then even though Unicode actually implies in their standard that it would be cool for implementations to support that, the result is going to suck. On the rendering side it is at best a stopgap method for handling gaiji without the special one-glyph fonts; it is better applied to dictionaries rather than rendering.
Some thoughts on actually building it
Why not use actual grep, or regular expressions in general, instead of a homegrown syntax? There's a serious theoretical problem with that, namely that correctly matching these kinds of sequences is exactly what regular expressions cannot do. Supposing I wanted a slightly more specific query than '...[tb]匕匕': instead of '[tb]匕匕' appearing anywhere, suppose I wanted to search for a left-right combination where the right matched '[tb]匕匕' and the left could be anything. This would still match 能, and in fact I think that might be the only thing it would match because '[tb]匕匕' is not common except as part of 能. Then I'd like to write '[lr]?[tb]匕匕' where the '?' matches any tree structure of any size.
Well, that's pretty much exactly the same as saying I want to match balanced parentheses in a regular expression - and anything that can do that, simply can't be a regular expression. I need to do matching over a somewhat more general type of formal language. Some extensions of regular expressions designed for matching trees do exist in the computational linguistics community, and it might be wise to look at them, but so far the ones I've seen seem to be optimized for other applications in a way that would make them cumbersome for my own application.
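Over trees, though, the wildcard is easy: "? matches any complete subtree" is one extra base case in a recursive matcher - precisely the balanced-structure matching that a regular expression cannot express. A sketch under an assumed nested-tuple representation (operator first, operands after):

```python
def matches(pattern, tree):
    """Structural match where the pattern leaf '?' matches any
    complete subtree, however large."""
    if pattern == "?":
        return True
    if isinstance(pattern, tuple):
        return (isinstance(tree, tuple) and len(pattern) == len(tree)
                and all(matches(p, t) for p, t in zip(pattern, tree)))
    return pattern == tree

nou = ("[lr]", ("[tb]", "厶", "月"), ("[tb]", "匕", "匕"))
# The query '[lr]?[tb]匕匕': a left-right node with anything at all on
# the left and 匕 over 匕 on the right.
assert matches(("[lr]", "?", ("[tb]", "匕", "匕")), nou)
```

The recursion keeps its own stack, so "balanced parentheses" are matched for free; it's only the regular-expression formalism, not computers in general, that can't do this.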
Unicode's definition of IDSes includes some arbitrary limits that offend me on principle. In particular, no IDS can be more than 16 characters long, or contain more than four "operator" kinds of things in a row. The point is to reduce the possibilities for stack overflows and the like when a decoder is trying to interpret them. It is also required, or at least strongly recommended, that you use the shortest possible IDS to describe any character or part of a character: so if 能 occurred as part of something more complicated, it would be denoted by a literal 能 instead of by [lr][tb]厶月[tb]匕匕. That raises a related issue: I might want to search by entering the literal character 能 or a longer description, and I want the correct results regardless of how the character is described in the database. Storing everything by the longest instead of the shortest possible description, after relaxing Unicode's number-of-characters limit, seems like it might help, but then I'd still be faced with exploding my query 能 into its description before doing the search.
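One way to handle that literal-versus-expanded mismatch is to normalize queries before matching: keep a table mapping head characters to their full trees, and expand any bare literal through it. A sketch - the table holds only the 能 example from this article, and the names are my own invention:

```python
# Head-character table: maps a kanji that has its own code point to the
# full tree describing it.  Only the 能 example is included here; a real
# table would be generated from the whole dictionary.
heads = {"能": ("[lr]", ("[tb]", "厶", "月"), ("[tb]", "匕", "匕"))}

def expand(query):
    """Replace a bare head character with its stored full description,
    so literal and spelled-out queries match the same entries."""
    return heads.get(query, query)
```

The same table, applied while loading the dictionary, would let the database store whichever form is convenient.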
What I actually have in mind, then, is to define and use a more complicated syntax that is basically a superset of Unicode IDSes. Compliant Unicode IDSes would be valid extended IDSes, but in the most full form of the extended syntax, the description for 能 (whose ideal Unicode IDS is just the single literal character) would look like "<能>[lr][tb]<厶>(;)<月>(;)[tb]<匕>(;)<匕>(;)". Syntactic sugar in the parser would allow something more concise, such as my earlier examples, to be actually stored for both the database and the queries. The point is that that describes the entire tree structure of the character, down to the lowest level for which there are parts we can describe, but also tagging higher-level nodes (in particular, the root) with the corresponding code points when those exist. When "idsgrep" runs, it applies the parser to the query to get a tree structure for the query; scans the dictionary while also parsing that using substantially the same syntax; and writes out any matching trees. The whole thing is closely analogous to "grep," but with the parser applied to both sides and a different matching algorithm. In the typical case of just applying it to a dictionary file, I could look at the start of each line output from idsgrep, and there would be the dictionary head (a kanji matching my query) in angle brackets.
Where to get the dictionary data? For querying my own code, I can design my software to generate a suitable database from my font as it's building the font. That's good for the important special case of querying my font, and at some point in the future when Tsukurimashou is closer to complete, it might even be useful as a general kanji lookup tool. But I'd also like to do queries based on someone else's work on the entire character set, especially characters I have not yet added to Tsukurimashou, and following the actual etymological structure of the language instead of the somewhat nonstandard character structures I use internally, so I'd like to also be able to use an existing database. KanjiVG seems like a good choice. They've got a database that describes characters in sort of, but not entirely, the way IDSes do; it seems reasonably possible to write a Perl script that would read KanjiVG's XML format and generate extended IDSes that could be queried with the idsgrep utility. I think I'd rather convert to extended IDSes and query those rather than trying to design my entire query system around KanjiVG, because KanjiVG's organization is in some ways quite different from what IDSes do and what I want. In particular, they are focused more on individual strokes than on hierarchical substructures.
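As a rough feasibility check on that conversion, here is a sketch that relies on KanjiVG groups carrying kvg:element and kvg:position attributes, and handles only the simple two-part left/right and top/bottom splits; the real KanjiVG data is considerably richer than this, so treat it as a proof of concept rather than a converter:

```python
import xml.etree.ElementTree as ET

KVG = "{http://kanjivg.tagaini.net}"  # KanjiVG's XML namespace

def group_to_ids(g):
    """Convert a KanjiVG-style <g> decomposition to a plain IDS,
    handling only the simplest two-part layouts."""
    children = [c for c in g if c.tag.endswith("g")]
    positions = [c.get(KVG + "position") for c in children]
    if positions == ["left", "right"]:
        return "⿰" + group_to_ids(children[0]) + group_to_ids(children[1])
    if positions == ["top", "bottom"]:
        return "⿱" + group_to_ids(children[0]) + group_to_ids(children[1])
    # Fall back to the group's element label when the layout is unhandled.
    return g.get(KVG + "element", "?")

sample = """<g xmlns:kvg="http://kanjivg.tagaini.net" kvg:element="神">
  <g kvg:element="礻" kvg:position="left"/>
  <g kvg:element="申" kvg:position="right"/>
</g>"""
# group_to_ids(ET.fromstring(sample)) yields "⿰礻申"
```

A real converter would also have to cope with surrounds, three-part splits, and groups that have strokes but no element label, which is where most of the work would be.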
I need this or something very much like it for my own purposes, but it seems reasonable to hope that others might also find uses for it. A command-line kanji search with a rich query language seems like it'd be appropriate for many cases where existing GUI-based databases aren't appropriate, and the code could also of course be re-used as a back-end for someone's GUI project. Needing it myself, and others probably having a use for it, are criteria for a good open source project.