Where do I draw the line?
Mon 21 Jan 2013 by mskala
Tags used: typography, programming, 日本語
It's a very common pattern in the Han writing system that a character will be made of two parts that are themselves characters, or at least elements resembling characters, placed one above the other or one next to the other. For instance, 音 (sound) can be split into 立 (stand up) above 日 (day); and 村 (village) can be split into 木 (tree) next to 寸 (inch). This kind of structure can be nested, as in 語 (language). One can do a sort of gematria with the meanings, (what exactly is the deep significance of "village = tree + inch"?) but that's not the direction I'm interested in going today. Here's the thing: in the Tsukurimashou project, these two ways of constructing characters each correspond to a piece of code that's invoked many times throughout the system, and I thought it would be interesting to look at how often the different parameter values are used.
Almost everywhere in Tsukurimashou that I have two components one above the other, I invoke a macro called build_kanji.tb - that's "tb" for "top and bottom" - and almost every time I have two components one next to the other, I invoke build_kanji.lr. (You can read the source code of these on SF.JP, but it may be pretty opaque if you're not familiar with the METAFONT programming language.) Each macro takes two numeric arguments specifying where to draw the line between the two halves of the composite character and how much to overlap them, and two "text" arguments ("text" is the METAFONT data type used for this purpose) which are fragments of code to draw the two halves. It's a sort of primitive functional programming. The macro takes the current drawing area, which is initially a square, calculates two new rectangular areas by cutting it in half at the specified coordinate and then shifting the boundaries to create the specified amount of overlap, and invokes the two halves it was given, each on the appropriate side of the line. Here's an example of build_kanji.lr in use, from the code for "village".
build_kanji.lr(480,0) (kanji.leftrad.wood) (kanji.grsix.inch);
Each character is described in terms of a box that nominally (before these scaling-down operations are applied) covers x coordinates 50 to 950 and y coordinates -50 to 850. Some don't cover all of that square area, and some trespass a little outside it; the basic box is chosen to make the kanji glyphs generally look about the right size next to kana and the Latin alphabet. And the above says that in "village," the dividing line between "wood" and "inch" is at x coordinate 480, which is just slightly to the left of centre.
A left-right split point of 480 is quite typical: when a character is made up of a left part and a right part, usually the left part is a little narrower. However, exactly where the line should be to look right depends on the particular character. Large, "busy" components usually need more space to look balanced when next to small, simple components. The "overlap" value in this case is zero, but in some characters it's positive (meaning that the two halves should be pushed together a bit - this might be relevant if the parts happen to have bigger margins than usual when considered by themselves) and in others it's negative (resulting in added space between the two).
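To make the geometry concrete, here is a minimal Python sketch of the box arithmetic described above. It is not the real macro, which is written in METAFONT. The names Box, split_lr and split_tb are hypothetical, and the sketch assumes the overlap is shared equally between the two halves; the actual code may distribute it differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned drawing area in glyph units (hypothetical helper type)."""
    x_lo: float
    x_hi: float
    y_lo: float
    y_hi: float


# Nominal kanji box from the text: x from 50 to 950, y from -50 to 850.
NOMINAL = Box(50, 950, -50, 850)


def split_lr(box, split, overlap):
    """Cut box at x = split and return (left, right) sub-boxes.

    ASSUMPTION: each half is extended by overlap / 2 across the cut, so a
    positive overlap pushes the halves together and a negative one opens a gap.
    """
    half = overlap / 2
    left = Box(box.x_lo, split + half, box.y_lo, box.y_hi)
    right = Box(split - half, box.x_hi, box.y_lo, box.y_hi)
    return left, right


def split_tb(box, split, overlap):
    """Cut box at y = split and return (top, bottom) sub-boxes (y grows upward)."""
    half = overlap / 2
    top = Box(box.x_lo, box.x_hi, split - half, box.y_hi)
    bottom = Box(box.x_lo, box.x_hi, box.y_lo, split + half)
    return top, bottom


# The "village" example: split at 480 with no overlap.
left, right = split_lr(NOMINAL, 480, 0)
print(left)
print(right)
```

With a split of 480 and zero overlap, the two boxes simply share the edge at x = 480. A positive overlap makes their x intervals intersect by that amount, and a negative one leaves a gap of that width.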
Above I said "480 is quite typical"; but what does that really mean? It seems like it would be interesting to actually collect some data on how often the different parameter values are used. How many times is the dividing line really at 480 as opposed to elsewhere?
So I ran the perl one-liner perl -ne 'print "$1\t$2\n" if /build_kanji.lr\((\d+),(-?\d+)\)/' mp/*.mp > lr-raw.dat (and a similar one for build_kanji.tb) against the current Tsukurimashou code base. That will count all the times I invoked one of those macros with two literal numeric arguments, which is how I usually use them. It's not an absolutely complete count. METAFONT is a programming language, and sometimes I make use of that fact to calculate a number that will go into the macro argument. For instance, sometimes the amount of overlap is conditional on the typographic style of the character instead of being a fixed constant, and in such a case the regular expression won't match and that Perl code won't detect it as an invocation of build_kanji.lr. It's also worth knowing that this count is uniform across the code base, not uniform across the character set. If I invoke build_kanji.lr inside a subroutine and the subroutine is used in ten different places, it will still only count once because I only wrote "build_kanji.lr" once. And there are many other ways to write a call to build_kanji.lr with constant argument values that would be valid METAFONT syntax but won't match the regular expression; I wrote it only to match ones that look like the way I usually type it.
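For readers who don't speak Perl, the extraction step can be sketched in Python roughly as follows. The pattern is the same as in the one-liner except that the dot is escaped, and this version reports every call on a line where the one-liner reports at most one per line. The sample calls other than the "village" line are made up for illustration, including the names hypothetical.left, hypothetical.right and style_flag.

```python
import re

# Same shape as the Perl pattern: two literal integers, no whitespace,
# the second optionally negative. The dot is escaped here.
CALL_LR = re.compile(r"build_kanji\.lr\((\d+),(-?\d+)\)")


def extract_pairs(source):
    """Return (split, overlap) for each literal-argument call found in source."""
    return [(int(s), int(o)) for s, o in CALL_LR.findall(source)]


# Only the first line is from the text; the rest are hypothetical examples.
sample = """\
build_kanji.lr(480,0) (kanji.leftrad.wood) (kanji.grsix.inch);
build_kanji.lr(540,-20) (hypothetical.left) (hypothetical.right);
build_kanji.lr(300, 20) (hypothetical.left) (hypothetical.right);
build_kanji.lr(300,if style_flag: 20 else: 0 fi) (hypothetical.left) (hypothetical.right);
"""

for split, overlap in extract_pairs(sample):
    print(f"{split}\t{overlap}")
```

Only the first two sample calls are picked up. The third has a space after the comma and the fourth computes its overlap with a conditional, which are the kinds of invocation the count misses.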
Keeping those things in mind, I can say that the count picked up 451 instances of "left and right" and 348 of "top and bottom." The data file it produces looks something like this:
480	60
480	0
500	30
380	80
300	0
540	-20
There's one line for each time I invoked the macro, and the two fields are first the split coordinate and then the amount of overlap. The most obvious thing to do with a file like this would be make a scatter plot; but that has the problem that in some cases, a given exact combination of split coordinate and overlap value occurs many times. With a simple scatter plot one of these multiples will look the same as if it just occurred once; and it would be nice to be able to see clearly which ones happen a lot and which ones only rarely.
So I used another Perl one-liner to count the number of occurrences of each pair. I made several different attempts with this one because I was trying different things with GnuPlot that wanted a few different kinds of input; the one I eventually liked best was perl -e 'while (<>) { chomp;($a,$b)=split(/\t/);$cnt{"$a\t$b"}++;} while (($k,$v)=each %cnt) { print "$k\t$v\n"; }' lr-raw.dat | sort -n > lr-hist.dat. The output of that looks something like the following.
260	0	2
270	-10	2
280	-50	1
280	0	1
Using uniq -c might actually be an easier way to do it, if I were starting from scratch; I ended up with the Perl one-liner because it was the end of an evolutionary process and some of the steps along the way did more complicated things (like rounding, and supplying zeros for combinations that don't occur at all) that uniq -c won't do.
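The counting step is small enough to sketch in Python as well. This is an illustration rather than the code that was actually run: the function name histogram is hypothetical, and it sorts numerically by split coordinate and then by overlap, which may order ties differently from sort -n.

```python
from collections import Counter


def histogram(lines):
    """Count how often each (split, overlap) pair occurs in tab-separated lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        split, overlap = (int(field) for field in line.split("\t"))
        counts[(split, overlap)] += 1
    # One (split, overlap, count) row per distinct pair, in numeric order.
    return sorted((s, o, n) for (s, o), n in counts.items())


# A few made-up input lines in the same format as lr-raw.dat.
raw = ["480\t60\n", "480\t0\n", "500\t30\n", "480\t0\n"]
for split, overlap, n in histogram(raw):
    print(f"{split}\t{overlap}\t{n}")
```

Reading the real file would just be a matter of passing an open file handle instead of the list, since iterating over a file yields its lines.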
In any case, that's a file I can feed into GnuPlot. Recent versions support a "circles" plot style, which is like a scatter plot but uses circles of individually adjustable size for the points. I told it to plot my data file using the square root of the third column as the circle radius (you might like to think about why I'd use the square root), and here's what I got:
Notice that the data has a strong grid tendency. Although the nominal coordinate space is 900 by 900 units, and METAFONT's fixed-point arithmetic is good to a few places after the decimal point, in fact I almost always use multiples of 10 for my coordinates. At 12 point, 10 units is roughly 0.04 millimetres and that's almost always as much precision as I need.
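The arithmetic behind that figure can be checked with a few lines of Python. Two things here are assumptions not stated above: that the em is 1000 units, and that a point is 1/72 inch (TeX's own point is slightly smaller, at 72.27 to the inch, which doesn't change the rounded answer).

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0  # ASSUMPTION: DTP points; TeX points are 72.27 per inch
UNITS_PER_EM = 1000     # ASSUMPTION: em size in glyph units, not given in the text


def units_to_mm(units, point_size):
    """Convert a length in glyph units to millimetres at the given type size."""
    em_mm = point_size / POINTS_PER_INCH * MM_PER_INCH
    return units / UNITS_PER_EM * em_mm


print(f"{units_to_mm(10, 12):.3f} mm")
```

Under those assumptions a 12 point em is about 4.23 mm, so 10 units comes to a little over 0.04 mm.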
On the left-right plot, the points fall into vertical lines. For instance, there are many at split coordinate 450 but very few at 440 or 460, and there's a big circle (many duplicates) at the combination of split coordinate 450 and overlap 0. I think that's because for left-right splits, the parameters are very much driven by what the left-hand side happens to be. For instance, there's a large family of characters including 仕他代住使係倍 that all share the same left-hand side. I'm not going to type it here because many browser fonts can't display it as a stand-alone character, but it's etymologically derived from 人 meaning "person." And for nearly all the characters in the "person" family, I chose 300 as the split coordinate; for many I also chose 20 as the overlap, and that's visible as the large circle on the plot at those coordinates. My usual practice when adding a new character with "person" on the left is to start with (300,20) as the arguments to the macro and then change the overlap if I don't like the results. Similar considerations apply to other families of left-side components.
One other thing to notice on the left-right plot, and it's somewhat gratifying, is that things are basically evenly spread between positive and negative overlap, and there's a lot of weight at exactly zero. That suggests that zero overlap (which really means "the estimated basic amount, which can be manually adjusted" rather than any particular dimension of the shapes being zero) is usually about right; I don't have to think very hard to guess the appropriate amount for a new character I may define.
There are a few outlier points with overlap well above 150 or below -50. I think those correspond to situations where I was using the build_kanji.lr macro to do something more complicated than just put one thing next to another; for instance, I sometimes build up a complicated character by overlaying pieces of others and then adding and removing other strokes. The regular expression search indiscriminately matches all uses of the macro, whether they correspond exactly to the final visual structure of the glyph or not.
On the top-bottom plot, first of all the range on both axes is bigger. I sometimes use top-bottom splits with very strange coordinate and overlap values. That's partly because there really is a wide range of visual split points in the character set, and partly because the oddball situations mentioned above, where I'm using build_kanji.tb to build something other than a classic top-and-bottom character, are more common than in the equivalent case for build_kanji.lr. The overlap amount may also sometimes be a fair bit more for build_kanji.tb because it's a fairly common pattern that there will be a top component that forms a cup or pocket into which the bottom component fits, causing their y intervals to really overlap by a lot. For instance, in the character 安 ("cheap") the overlap value is 150, and others like it have even larger values. I call those "hats" - the top of 安 is drawn by a macro called kanji.radical.silly_hat, and there are also macros for "ridiculous" and "conservative" hats.
In the top-bottom plot there's still a tendency for the plotted circles to form vertical lines, probably caused by the same kind of "family" effect as on the left-right plot, and there's a noticeable horizontal pattern at overlap zero. The biggest circle is at split coordinate 600 and overlap 0, which corresponds to the most popular (but by no means the only) set of parameters for the "grass" radical, which is the top of 花 ("flower").
I'm not sure that all that proves anything, but it was a fun experiment to try.
1 comment
木 [plus] 寸
but instead
木[sounds like]寸
Many Chinese characters have 'sounds like' hints incorporated. ^^
村 - 2013-09-19 01:10