
Where do I draw the line?

Mon 21 Jan 2013 by mskala

It's a very common pattern in the Han writing system that a character will be made of two parts that are themselves characters, or at least elements resembling characters, placed one above the other or one next to the other. For instance, 音 (sound) can be split into 立 (stand up) above 日 (day); and 村 (village) can be split into 木 (tree) next to 寸 (inch). This kind of structure can be nested, as in 語 (language). One can do a sort of gematria with the meanings (what exactly is the deep significance of "village = tree + inch"?), but that's not the direction I'm interested in going today. Here's the thing: in the Tsukurimashou project, these two ways of constructing characters each correspond to a piece of code that's invoked many times throughout the system, and I thought it would be interesting to look at how often the different parameter values are used.

Almost everywhere in Tsukurimashou that I have two components one above the other, I invoke a macro called build_kanji.tb - that's "tb" for "top and bottom" - and almost every time I have two components one next to the other, I invoke build_kanji.lr. (You can read the source code of these on SF.JP, but it may be pretty opaque if you're not familiar with the METAFONT programming language.) Each macro takes two numeric arguments specifying where to draw the line between the two halves of the composite character and how much to overlap them, and two "text" arguments ("text" is the METAFONT data type used for this purpose) which are fragments of code to draw the two halves. It's a sort of primitive functional programming. The macro takes the current drawing area, which is initially a square, calculates two new rectangular areas by cutting it in half at the specified coordinate and then shifting the boundaries to create the specified amount of overlap, and invokes the two halves it was given, each on the appropriate side of the line. Here's an example of build_kanji.lr in use, from the code for "village".

  build_kanji.lr(480,0)
    (kanji.leftrad.wood)
    (kanji.grsix.inch);

Each character is described in terms of a box that nominally (before these scaling-down operations are applied) covers x coordinates 50 to 950 and y coordinates -50 to 850. Some don't cover all of that square area, and some trespass a little outside it; the basic box is chosen to make the kanji glyphs generally look about the right size next to kana and the Latin alphabet. And the above says that in "village," the dividing line between "wood" and "inch" is at x coordinate 480, which is just slightly to the left of centre.

A left-right split point of 480 is quite typical: when a character is made up of a left part and a right part, usually the left part is a little narrower. However, exactly where the line should be to look right depends on the particular character. Large, "busy" components usually need more space to look balanced when next to small, simple components. The "overlap" value in this case is zero, but in some characters it's positive (meaning that the two halves should be pushed together a bit - this might be relevant if the parts happen to have bigger margins than usual when considered by themselves) and in others it's negative (resulting in added space between the two).
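As a rough illustration of the geometry described above, here is a minimal Python sketch of cutting a drawing area at a split coordinate and then creating overlap. It is not the METAFONT macro: the names `Box`, `split_lr`, and `split_tb` are hypothetical, and dividing the overlap evenly between the two halves is an assumption made for the example. The real macros may also apply a baseline adjustment before the manual overlap value.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned drawing area, x from left to right, y from bottom to top."""
    left: float
    bottom: float
    right: float
    top: float


# Nominal kanji box from the text: x from 50 to 950, y from -50 to 850.
NOMINAL = Box(50, -50, 950, 850)


def split_lr(box: Box, split: float, overlap: float) -> Tuple[Box, Box]:
    """Cut box at x = split; each half then extends overlap/2 past the cut.

    Sharing the overlap evenly is an assumption for illustration only.
    A negative overlap leaves a gap between the halves.
    """
    half = overlap / 2
    left_part = box._replace(right=split + half)
    right_part = box._replace(left=split - half)
    return left_part, right_part


def split_tb(box: Box, split: float, overlap: float) -> Tuple[Box, Box]:
    """Same idea for a top-and-bottom split at y = split."""
    half = overlap / 2
    top_part = box._replace(bottom=split - half)
    bottom_part = box._replace(top=split + half)
    return top_part, bottom_part


# The "village" example: split at x = 480 with zero overlap.
wood_area, inch_area = split_lr(NOMINAL, 480, 0)
print(wood_area, inch_area)
```

With these assumptions, a call such as `split_lr(NOMINAL, 540, -20)` gives a left area ending at x = 530 and a right area starting at x = 550, leaving a 20-unit gap.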

Above I said "480 is quite typical"; but what does that really mean? It seems like it would be interesting to actually collect some data on how often the different parameter values are used. How many times is the dividing line really at 480 as opposed to elsewhere?

So I ran this Perl one-liner (and a similar one for build_kanji.tb) against the current Tsukurimashou code base:

  perl -ne 'print "$1\t$2\n" if /build_kanji.lr\((\d+),(-?\d+)\)/' mp/*.mp > lr-raw.dat

That picks up every time I invoked one of those macros with two literal numeric arguments, which is how I usually use them. It's not an absolutely complete count. METAFONT is a programming language, and sometimes I make use of that fact to calculate a number that will go into the macro argument. For instance, sometimes the amount of overlap is conditional on the typographic style of the character instead of being a fixed constant, and in such a case the regular expression won't match and that Perl code won't detect it as an invocation of build_kanji.lr. It's also worth knowing that this count is taken over the source code, not over the character set. If I invoke build_kanji.lr inside a subroutine and the subroutine is used in ten different places, it will still only count once because I only wrote "build_kanji.lr" once. And there are many other ways to write a call to build_kanji.lr with constant argument values that would be valid METAFONT syntax but won't match the regular expression; I wrote it only to match ones that look like the way I usually type it.
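To make the limits of that pattern concrete, here is a small Python sketch of the same extraction. The regular expression is the one from the Perl command, except that the dot in the macro name is escaped so that it only matches a literal "."; that change is a deviation from the original. The sample lines are made up for illustration, and in particular `20*mincho` is a hypothetical computed argument, not something taken from the Tsukurimashou source.

```python
import re

# Pattern from the Perl one-liner, with the "." escaped (a small deviation).
LR_CALL = re.compile(r"build_kanji\.lr\((\d+),(-?\d+)\)")


def extract_args(lines):
    """Yield (split, overlap) for the first literal-argument call on each line."""
    for line in lines:
        m = LR_CALL.search(line)
        if m:
            yield int(m.group(1)), int(m.group(2))


# Illustrative input lines only; not copied from the real source files.
sample = [
    "  build_kanji.lr(480,0)",          # literal arguments: picked up
    "  build_kanji.lr(540,-20)",        # negative overlap: picked up
    "  build_kanji.lr(480,20*mincho)",  # hypothetical computed overlap: missed
    "  build_kanji.lr (480, 0)",        # hypothetical spacing variant: missed
]

print(list(extract_args(sample)))
```

Only the first two sample lines produce a pair, which is the undercounting the paragraph above describes.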

Keeping those things in mind, I can say that the count picked up 451 instances of "left and right" and 348 of "top and bottom." The data file it produces looks something like this:

480     60
480     0
500     30
380     80
300     0
540     -20

There's one line for each time I invoked the macro, and the two fields are first the split coordinate and then the amount of overlap. The most obvious thing to do with a file like this would be to make a scatter plot; but that has the problem that in some cases, a given exact combination of split coordinate and overlap value occurs many times. With a simple scatter plot one of these multiples will look the same as if it just occurred once; and it would be nice to be able to see clearly which ones happen a lot and which ones only rarely.

So I used another Perl one-liner to count the number of occurrences of each pair. I made several different attempts with this one because I was trying different things with GnuPlot that wanted a few different kinds of input; the one I eventually liked best was this:

  perl -e 'while (<>) { chomp;($a,$b)=split(/\t/);$cnt{"$a\t$b"}++;} while (($k,$v)=each %cnt) { print "$k\t$v\n"; }' lr-raw.dat | sort -n > lr-hist.dat

The output of that looks something like the following.

260     0       2
270     -10     2
280     -50     1
280     0       1

Using uniq -c might actually be an easier way to do it, if I were starting from scratch; I ended up with the Perl one-liner because it was the end of an evolutionary process and some of the steps along the way did more complicated things (like rounding, and supplying zeros for combinations that don't occur at all) that uniq -c won't do.
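For comparison, the same pair counting can be sketched in Python with `collections.Counter`. The input below is the sample of raw data shown earlier plus one repeated row added purely so that a count above 1 appears; the function name `histogram` is hypothetical. Unlike the shell pipeline, this version sorts numerically on both fields.

```python
from collections import Counter

# Six rows from the sample above, plus one duplicate "480 0" row that is
# added here only for illustration.
raw = "480\t60\n480\t0\n500\t30\n380\t80\n300\t0\n540\t-20\n480\t0\n"


def histogram(text):
    """Return sorted (split, overlap, count) rows from tab-separated input."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        split, overlap = line.split("\t")
        pairs.append((int(split), int(overlap)))
    counts = Counter(pairs)
    return sorted((s, o, n) for (s, o), n in counts.items())


for split, overlap, n in histogram(raw):
    print(f"{split}\t{overlap}\t{n}")
```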

In any case, that's a file I can feed into GnuPlot. Recent versions support a "circles" plot style, which is like a scatter plot but uses circles of individually adjustable size for the points. I told it to plot my data file using the square root of the third column as the circle radius (you might like to think about why I'd use the square root), and here's what I got:

[Figure: invocations of build_kanji.lr] [Figure: invocations of build_kanji.tb]
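One likely reason for the square root: the eye tends to judge a circle by its area, and area grows with the square of the radius, so a radius proportional to the square root of the count makes the area proportional to the count itself. A short Python check of that relationship follows; the function names and the `scale` parameter are hypothetical.

```python
import math


def radius_for_count(count, scale=1.0):
    """Radius whose circle area is proportional to count."""
    return scale * math.sqrt(count)


def area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2


# A pair that occurs 4 times should cover 4 times the area of a single one.
ratio = area(radius_for_count(4)) / area(radius_for_count(1))
print(ratio)
```

If the raw count were used as the radius instead, a pair occurring 4 times would cover 16 times the area and look far more dominant than it is.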

Notice that the data has a strong grid tendency. Although the nominal coordinate space is 900 by 900 units, and METAFONT's fixed-point arithmetic is good to a few places after the decimal point, in fact I almost always use multiples of 10 for my coordinates. At 12 point, 10 units is roughly 0.04 millimetres and that's almost always as much precision as I need.
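The 0.04 millimetre figure can be checked with a little arithmetic. The sketch below assumes 1000 units per em (the text only gives the 900-unit box, so this is an assumption) and a point of 1/72 inch; with the slightly smaller traditional TeX point the result changes only in the third significant digit.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0  # assumed point definition (1/72 inch)
UNITS_PER_EM = 1000     # assumed; not stated in the text


def units_to_mm(units, point_size):
    """Convert glyph coordinate units to millimetres at a given type size."""
    em_mm = point_size / POINTS_PER_INCH * MM_PER_INCH
    return units * em_mm / UNITS_PER_EM


# 10 units at 12 point comes to a little over 0.04 mm.
print(round(units_to_mm(10, 12), 4))
```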

On the left-right plot, the points fall into vertical lines. For instance, there are many at split coordinate 450 but very few at 440 or 460, and there's a big circle (many duplicates) at the combination of split coordinate 450 and overlap 0. I think that's because for left-right splits, the parameters are very much driven by what the left-hand side happens to be. For instance, there's a large family of characters including 仕他代住使係倍 that all share the same left-hand side. I'm not going to type it here because many browser fonts can't display it as a stand-alone character, but it's etymologically derived from 人 meaning "person." And for nearly all the characters in the "person" family, I chose 300 as the split coordinate; for many I also chose 20 as the overlap, and that's visible as the large circle on the plot at those coordinates. My usual practice when adding a new character with "person" on the left is to start with (300,20) as the arguments to the macro and then change the overlap if I don't like the results. Similar considerations apply to other families of left-side components.
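One way to see the "vertical lines" without a plot is to total the counts for each split coordinate, ignoring the overlap. The Python sketch below does that for the four sample histogram rows quoted earlier; the function name `totals_by_split` is hypothetical, and the real data file would of course give a much longer table.

```python
from collections import defaultdict

# Rows copied from the sample histogram output shown earlier:
# (split coordinate, overlap, occurrences).
rows = [(260, 0, 2), (270, -10, 2), (280, -50, 1), (280, 0, 1)]


def totals_by_split(rows):
    """Sum occurrences per split coordinate, ignoring the overlap value."""
    totals = defaultdict(int)
    for split, _overlap, n in rows:
        totals[split] += n
    return dict(totals)


for split, total in sorted(totals_by_split(rows).items()):
    print(f"{split}\t{total}")
```

A split coordinate with a large total spread over several overlap values corresponds to one of the vertical lines of circles on the plot.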

One other thing to notice on the left-right plot, and it's somewhat gratifying, is that things are basically evenly spread between positive and negative overlap, and there's a lot of weight at exactly zero. That suggests that zero overlap (which really means "the estimated basic amount, which can be manually adjusted" rather than any particular dimension of the shapes being zero) is usually about right; I don't have to think very hard to guess the appropriate amount for a new character I may define.

There are a few outlier points with overlap well above 150 or below -50. I think those correspond to situations where I was using the build_kanji.lr macro to do something more complicated than just put one thing next to another; for instance, I sometimes build up a complicated character by overlaying pieces of others and then adding and removing other strokes. The regular expression search indiscriminately matches all uses of the macro, whether they correspond exactly to the final visual structure of the glyph or not.

On the top-bottom plot, first of all the range on both axes is bigger. I sometimes use top-bottom splits with very strange coordinate and overlap values. That's partly because there really is a wide range of visual split points in the character set, and partly because the oddball situations mentioned above, where I'm using build_kanji.tb to build something other than a classic top-and-bottom character, are more common than they are with build_kanji.lr.

The overlap amount may also sometimes be a fair bit more for build_kanji.tb because it's a fairly common pattern that there will be a top component that forms a cup or pocket into which the bottom component fits, causing their y intervals to really overlap by a lot. For instance, in the character 安 ("cheap") the overlap value is 150, and others like it have even larger values. I call those "hats" - the top of 安 is drawn by a macro called kanji.radical.silly_hat, and there are also macros for "ridiculous" and "conservative" hats.

In the top-bottom plot there's still a tendency for the plotted circles to form vertical lines, probably caused by the same kind of "family" effect as on the left-right plot, and there's a noticeable horizontal pattern at overlap zero. The biggest circle is at split coordinate 600 and overlap 0, which corresponds to the most popular (but by no means the only) set of parameters for the "grass" radical, which is the top of 花 ("flower").

I'm not sure that all that proves anything, but it was a fun experiment to try.

1 comment

*
村 is not

木 [plus] 寸

but instead

木[sounds like]寸

Many Chinese characters have 'sounds like' hints incorporated. ^^
- 2013-09-19 01:10

