
The Terrible Secret of OpenType Glyph Substitution

Mon 6 Dec 2010 by mskala

I was up until 3 this morning trying to figure out how to make OpenType glyph substitution work. That, in itself, is not news. Anyone who has tried to write substitution rules for OpenType fonts has probably gone through something similar. What is unusual, though, is that I not only succeeded, but also figured out the undocumented underlying principle so that I can predictably succeed in the future; as far as I can tell, the more usual practice is to just try things at random until one eventually either gets it working by accident, or gives up, without having learned anything useful either way.

The purpose of this entry is to provide the important information that I wasn't able to find on the Net and wish I had had. There is one important point I call the Terrible Secret, which makes all the difference to getting it to work; but rather than jump to that immediately I'm going to give the needed background first. I'll be using the terms that make sense to me, rather than the "easy" but uselessly vague simplified style used by all existing documentation I found.

The glyph substitution problem

The situation is that you're going to typeset text using a font. By "font" I mean the computer program that lives in an OpenType font file (usual extension ".otf"); that may in some cases correspond to what traditionalists would call a "typeface." Your text consists of a stream of symbols (in the computer-science sense). For instance, to typeset the word "afflicted" your stream of symbols might be "a f f l i c t e d"; that is nine symbols. Usually, each of the intangible symbols in the text corresponds to a tangible symbol called a glyph, which is a pattern of ink on the printed page. However, in some fonts, to make it look right on the page, there need to be translations applied. For instance, in traditional English-language typesetting the sequence of symbols "f f l" is usually replaced by a single glyph called a "ligature," which combines the three letters into a shape that looks nicer than just setting them separately side by side. So the input stream of symbols "a f f l i c t e d" must be converted into a new stream that might look like "a FFL.ligature i c t e d" - that is seven symbols - where "FFL.ligature" is the name of the special replacement glyph. (Note that that isn't the standard name for that particular ligature glyph - this is only an example of the need for replacement, and the standard name of "f_f_l" would be less clear.)

Which replacements must be performed depends on the font. Each OpenType font includes a computer program that explains what replacements to perform. These programs are written in a declarative language designed for the purpose, and they can be edited with font-editing software. When you typeset with an OpenType font, your software (word processor, typesetting system, maybe even the printer firmware) is supposed to interpret the substitution program and use it to determine which glyphs to set.

The designer of a font is faced with the task of writing the substitution program. Font designers often are not trained as computer programmers at all, and to make matters worse, the OpenType glyph substitution language is declarative and weird. It's quite different from the imperative languages more commonly used for computer programming, so even if you happen to be a professional computer programmer, you will probably find it very difficult to understand. I myself have a PhD in computer science and am currently employed doing research in declarative programming languages specifically, dammit, and it still took me all night. All the instructional materials I could find on the Net attempted to simplify the language by giving a bunch of examples but not actually specifying the exact behaviour of the system. So even if I were capable of understanding all the details, I couldn't, because the information just was not there.

Tools

The font stores its glyph substitution program internally in a set of "tables." These are similar but not identical to the state transition tables of a finite state transducer, if you know what one of those is. I believe that in a mathematical sense they are equivalent to a finite state transducer's state-transition tables, in that we could write a piece of software to convert between the two in either direction. FontForge allows you to edit the glyph substitution tables directly in its GUI. Other GUI font editors probably do as well.

Forget that. The glyph substitution tables in their native format are effectively impossible (at the very least, painful) to understand. Instead, what you want to edit is a format called a "feature file." Feature files are a human-readable representation, at a higher level of abstraction than the substitution table entries. They normally have the filename extension ".fea", and something described as a "specification" of the syntax is available on the Adobe Web site. You probably should read it. However, that document does not (or at least, does not clearly) specify the semantics of substitution, which are critical to actually making the damn thing work. It only clearly specifies syntax.

FontForge can read feature files and translate them into its internal representation of the state tables. It also claims to be able to write feature files. However, I suspect some of the information that makes the table human-readable is lost during read, so if you load and then save a feature, what comes out may be very hard to understand. This issue is fundamentally similar to compiling and then de-compiling a program in a more conventional programming language: you can de-compile to a functionally identical form, but because of information loss, the de-compiled source code may be much less understandable to programmers than the original source code.

Some commercial font editors may allow you to edit the feature file syntax in their GUIs without this kind of information loss, or they may support better translation from table form back to feature file syntax. I'm committed to using free - and command-line - software for this and haven't investigated the commercial GUI editors much. With FontForge, the GUI command to load a feature file is "Merge Feature Info..." on the "File" menu of the font view window. The GUI command to save a feature file is under "Element->Font Info->Lookups"; then right-click any existing lookup and choose "Save Feature File." That will save a feature file of all the lookups despite being on the per-lookup local menu; you can also save just the lookup you clicked on with "Save Lookup" in the same menu. These GUI commands are (briefly) documented on this page of the FontForge site.

In the non-GUI scripting interface (which, honestly, is probably a better way to do this) the commands are MergeFeature() and GenerateFeatureFile() for load and save respectively. It does not appear to be possible to access the transition tables at full generality from either of FontForge's scripting interfaces, the native scripting language or the Python bindings; as far as I can tell, some kinds of context substitution can only be achieved in FontForge by writing a feature file and loading it, or through the GUI (but the GUI operates at such a low level of abstraction as to be useless).

Features and substitutions

The substitution program consists of a bunch of "features" which can be turned on and off. Features are named with four-character codes usually consisting of lowercase ASCII alphanumerics. There are standard names for these, and some software may be limited to only processing the standard names it knows about. Some software also may have other weird limitations of its own - for instance, limiting the kinds of substitution that are permitted with a given feature name, or limiting the kinds of features you are allowed to turn on depending on the (human) language your document is written in. These limitations are not inherent to the feature file syntax; you must know about them because they may be a cause of things not working, but they aren't the main issue I'm interested in here.

There also exist a bunch of limited subsets of the internal substitution table language, with less mathematical expressive power than fully general substitution tables; in general, if your substitution program can be expressed in one of these limited forms, then the feature file reader will automatically convert it into the appropriate form. Many tutorials present the distinctions between the different internal table representations as important to programming, but in fact they are nearly invisible at the level of the feature file; it is written in the most general form and then dumbed down behind the scenes where possible, with the limited forms being at best syntactic sugar. The reason for things to work this way seems to be related to a desire to compact the binary font format into as few bytes as possible.

Feature files also support additional syntax for kerning. That is beyond the scope of this introduction. I am only talking about glyph substitution here.

So: each "feature" consists of a sequence of substitution rules. Each substitution rule describes a way of recognizing a subsequence of the tokens in the input, and a way of replacing some or all of that subsequence with a different subsequence. The substitution program operates by applying all the rules, in order, to transform the input sequence into the output sequence.

Here is a very small, simple example:

feature liga {
  sub f f l by f_f_l;
} liga;

That says that the sequence of tokens "f f l" will be replaced by the single token "f_f_l"; it's just the ligature example I gave earlier. This rule becomes the definition of the feature called "liga", which is one of the standard names for the general ability to do ligatures. If you turn ligatures on in your typesetting software it should activate this feature.
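To make the rule's effect concrete, here is a minimal Python sketch of a left-to-right ligature replacement. To be clear, this is only a model of what the rule means, not how any real layout engine is implemented, and the function name and token-list representation are my own invention:

```python
def apply_ligature(tokens, parts, ligature):
    """Scan left to right, replacing each occurrence of the
    glyph sequence `parts` with the single glyph `ligature`."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + len(parts)] == parts:
            out.append(ligature)   # consume the whole matched sequence
            i += len(parts)
        else:
            out.append(tokens[i])  # copy an unmatched token through
            i += 1
    return out

print(apply_ligature("a f f l i c t e d".split(), ["f", "f", "l"], "f_f_l"))
# → ['a', 'f_f_l', 'i', 'c', 't', 'e', 'd']
```

Nine symbols in, seven out, exactly as in the "afflicted" example above.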

More complex substitution rules

The pattern to look for, or left side of the substitution rule, consists of a sequence of tokens that must occur in the input in order for the rule to match. Each element in the sequence is actually an ordered set of tokens, and matches if the corresponding input token is included in the set (the reason the set is ordered will come up later). Sets for matching may be specified in four ways:

  • As a single bare token, which corresponds to a singleton set (must match that one token and no other). The tokens "f" and "l" in the above example are of this kind. Singleton token names may be preceded by a backslash to prevent them from being interpreted as something else, if they happen to coincide with reserved words: for instance, a token named "by" could be specified as "\by" to prevent it from being interpreted as the reserved word of the same name.
  • As an ordered set of single tokens in square brackets, such as "[a b c]". Note that order does matter (it is relevant in the replacement process) but makes no difference for matching.
  • The specification says that ranges may be specified with a hyphen as in "[a - z]". UPDATE: When I first posted this I complained that I couldn't get ranges to work reliably. I've now figured out why, but it's a little complicated. See below.
  • Sets of tokens may be pre-declared and then used with a syntax based on the "@" symbol; see below.

Here is an example using all the syntax described above:

@digit=[zero one two three four five six seven eight nine];

feature liga {
  sub a \b [c d e] [x-z] @digit by A B C X nine;
} liga;

The matching pattern in this case will match a sequence of five tokens, where the first token must be "a"; the second token must be "b" (and must be the glyph token of that name, even if "b" were to have some other meaning in the language); the third token may be "c", "d", or "e"; the fourth may be anything from "x" to "z" (probably "x", "y", or "z", if we trust the alphabet to be in order); and the fifth may be anything from the predefined set we created and named "@digit", which happens to consist of the digits "zero" through "nine".

Note that it is not possible to specify a variable-length token sequence on the left. The example rule always matches exactly five tokens and in general, every rule has a specific number of tokens that it always matches. The right-hand side also must be of constant length, but need not be the same constant length as the left-hand side except when using some special syntax I'll describe later.

How glyph name ranges work (NEW!)

In the initial posting of this article I complained that I couldn't get glyph ranges to work. I've now re-read the spec and found that it's actually explained quite clearly there; I just hadn't been looking in the right place.

A range will work, in the obvious way, if any of these three cases apply:

  1. If the start and end of the range are identical strings except for one index, where both strings contain uppercase or lowercase letters from A to Z. For instance, valid ranges include "[a-z]", "[a.sc-z.sc]", and "[funkyHbar-funkyQbar]".
  2. If the start and end of the range are identical except for a substring consisting of one, two, or three decimal digits. For instance, valid ranges include "[orn01-orn23]" and "[foo17bar.x-foo23bar.x]". Note that "[bleh1-bleh20]" is NOT a valid range because the digit sequence is not the same length at the two ends.
  3. In a "CID font", in which case the range will be counted over "the CID ordering". I don't know what that means.

The spec notes, and so will I, that ranges do not work over symbolic sequences that might be obvious to a human being, no matter how convenient that would be; in particular, "[zero - nine]" will not do what you want it to. They also don't work over hexadecimal.

It is not clear to me what happens if you define a range that would include glyph names that don't actually exist.

FURTHER UPDATE: This isn't the whole story. There appear to also be bugs in FontForge's handling of ranges with decimal digits in them, in particular that it will sometimes interpret all but the first element of the range as being a name containing only the digits. I'm pretty sure this is a bug in FontForge; if I can construct a good demonstration of it, I'll be reporting it to the authorities.
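The two non-CID cases above can be captured in a short Python sketch. This is my own paraphrase of the spec's wording, not logic taken from any real implementation, and the function name is mine:

```python
def is_valid_range(start, end):
    """Decide whether [start-end] is a well-formed glyph name range
    under the two non-CID cases: one differing letter A-Z/a-z, or a
    run of one to three decimal digits in the same positions."""
    if len(start) != len(end) or start == end:
        return False
    diffs = [i for i in range(len(start)) if start[i] != end[i]]
    first, last = diffs[0], diffs[-1]
    # Case 1: a single differing index, a letter A-Z or a-z on both sides.
    if first == last and start[first].isascii() and end[first].isascii() \
            and start[first].isalpha() and end[first].isalpha():
        return True
    # Case 2: the names differ only inside a run of decimal digits that
    # occupies the same positions in both names and is 1-3 digits long.
    lo, hi = first, last
    while lo > 0 and start[lo - 1].isdigit() and end[lo - 1].isdigit():
        lo -= 1
    while hi + 1 < len(start) and start[hi + 1].isdigit() and end[hi + 1].isdigit():
        hi += 1
    seg_s, seg_e = start[lo:hi + 1], end[lo:hi + 1]
    return len(seg_s) <= 3 and seg_s.isdigit() and seg_e.isdigit()

print(is_valid_range("orn01", "orn23"))   # → True
print(is_valid_range("bleh1", "bleh20"))  # → False  (digit runs differ in length)
print(is_valid_range("zero", "nine"))     # → False  (symbolic, not a letter range)
```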

Context sensitivity

The matching pattern is (always) divided into three sub-sequences, which have more complex names in the spec but which I will call the beginning, the middle, and the end. The beginning and the end may be (and frequently are) empty, but the middle must be non-empty. If you do not indicate otherwise, then the entire matching pattern will be the middle, and the beginning and end will be empty. For instance, in the matching pattern "a b x y z c" the beginning is nothing (empty), the middle is "a b x y z c", and the end is nothing.

However, if you add an apostrophe after one or more of the token slots in the matching pattern, then all tokens with apostrophes will be the middle. Then the beginning is everything before the middle and the end is everything after the middle. Either of those can still be empty if the first or last tokens have apostrophes. For instance, in the matching pattern "a b x' y' z' c" the beginning is "a b", the middle is "x y z", and the end is "c". In the matching pattern "a b y' z'" the beginning is "a b", the middle is "y z", and the end is empty. Apostrophes can be used with any of the token match types, as in "p \b' [c d e]' @digits' q" where the beginning is "p", the middle is "\b [c d e] @digits", and the end is "q". Putting an apostrophe after everything is the same as not using apostrophes at all - everything will be the middle and the beginning and end will be empty. It is a syntax error to use apostrophes in a way that cannot be divided into this beginning/middle/end structure, such as "a' x x b'".
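The splitting rule is mechanical enough to write down as a few lines of Python. This is just an illustration of the convention described above; the representation of slots as strings with a trailing apostrophe is mine:

```python
def split_pattern(slots):
    """Split a match pattern into (beginning, middle, end).  Each slot
    is a string; a trailing apostrophe marks it as part of the middle.
    Raises ValueError if the marked slots are not contiguous."""
    marked = [s.endswith("'") for s in slots]
    clean = [s.rstrip("'") for s in slots]
    if not any(marked):
        # No apostrophes: the whole pattern is the middle.
        return [], clean, []
    first = marked.index(True)
    last = len(marked) - 1 - marked[::-1].index(True)
    if not all(marked[first:last + 1]):
        raise ValueError("marked slots must form one contiguous run")
    return clean[:first], clean[first:last + 1], clean[last + 1:]

print(split_pattern("a b x' y' z' c".split()))
# → (['a', 'b'], ['x', 'y', 'z'], ['c'])
```

Feeding it "a' x x b'" raises the error, matching the syntax-error case above.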

Part of the significance of the beginning, middle, and end structure is that although all three must match for the rule to be triggered, only the middle is replaced. The sequence of beginning, middle, and end in the input is replaced by the unchanged beginning, then whatever is on the right-hand side of the rule, then the unchanged end. If that were the only consequence of using this structure, it would be syntactic sugar - you could always specify a rule with identical behaviour by just using an ordinary all-middle match and including what would otherwise be the beginning and end explicitly on the right-hand side; but there is also an important consequence I'll describe later having to do with the input pointer. Here's an example of the syntax:

@digits=[zero one two three four five six seven eight nine];

feature frac {
  sub @digits slash' @digits by fraction;
} frac;

This might be part of a program for contextual fractions (which are a standard example to which I'll return). It says that the token "slash" should be replaced by the token "fraction" but only when it appears between two tokens that are each in the pre-declared class named "@digits". So "one slash two" will trigger the replacement but "a slash b" will not. Only the token "slash" is actually replaced, with the digits on either side left unchanged; although a sequence like "one slash two" is needed to trigger the rule, the output will be "one fraction two", not just "fraction". An unchanged copy of the beginning and end are implicitly included in the output despite not being mentioned explicitly in the syntax.

It is because of their sensitivity to context that rules with non-empty beginning or end are called "context substitution rules." Beware: although the syntax does not enforce such a limitation, some software interpreting this syntax will be unable to process tables that include context rules (non-empty beginning or end) in some of the four-letter features. For instance, some software will only allow non-context rules in the "liga" feature and you're supposed to use "clig" instead for a similar feature with context substitution rules.

Substitution of corresponding characters

When you specify non-singleton sets in some or all of the input slots, it is possible to change the output substitution depending on what exact tokens were matched in the input. For instance, the rule "sub [a b c] by [x y z]" will replace "a" by "x", or "b" by "y", or "c" by "z". Note that the same effect could be achieved by specifying three rules: "sub a by x; sub b by y; sub c by z;". The set syntax functions as an abbreviation for that longer syntax.

The exact limitations of this syntax are not clear to me. It is definitely a requirement that the input and corresponding output sets must be exactly the same size: "sub [a b c] by [w x y z]" is definitely an error (though feature files can also be used to specify things called "alternates" which have syntax similar to that, are a sort of user-interface suggestion, are not the same thing as substitution, and are not further described here). It appears that output tokens can only be made to depend on input tokens at the same index within their substring, so that "sub x [a b c] by [x y z] a" is probably an error - or at least, it won't change "x b" to "y a" as may have been intended. It also appears that it won't work when the input middle and the output token sequences are of different length, as in "sub [x y z] by [a b c] [a b c]". Some of the effects you want that might be achieved by that kind of syntax if it worked, can be achieved instead by just writing multiple rules. Subject to these limitations and possibly other hidden ones I don't know about, this kind of multi-rule syntax does work in context rules, so that a rule like "sub slash @digits' by @denominator" can be a reasonable thing to write. It will replace any character in "@digits" by the corresponding character in "@denominator", but only when there is a "slash" immediately preceding it.

It is because of this kind of syntax that order within a matching set matters. The rule "sub [a b c] by [x y z]" has a different meaning from "sub [c b a] by [x y z]" even though they match the same input token sequences.
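The by-position pairing can be spelled out in a few lines of Python. This is only a sketch of the abbreviation rule described above, and the function name is my own:

```python
def expand_rule(lhs_set, rhs_set):
    """Expand a rule like "sub [a b c] by [x y z]" into the singleton
    rules it abbreviates, pairing tokens by position."""
    if len(lhs_set) != len(rhs_set):
        raise ValueError("input and output sets must be the same size")
    return [(a, b) for a, b in zip(lhs_set, rhs_set)]

# Order matters: reversing the input set changes the pairing.
print(expand_rule(["a", "b", "c"], ["x", "y", "z"]))
# → [('a', 'x'), ('b', 'y'), ('c', 'z')]
print(expand_rule(["c", "b", "a"], ["x", "y", "z"]))
# → [('c', 'x'), ('b', 'y'), ('a', 'z')]
```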

Order of application

Here is a piece of code that might seem like it should work.

@digits=[zero one two three four five six seven eight nine];
@numerator=[zero.n one.n two.n three.n four.n five.n six.n seven.n eight.n nine.n];
@denominator=[zero.d one.d two.d three.d four.d five.d six.d seven.d eight.d nine.d];

feature frac {
  sub @digits' slash @digits by @numerator;
  sub @digits slash' @digits by fraction;
  sub @digits slash @digits' by @denominator;
} frac;

The idea is that we want to be able to type sequences like "1/2" (which would become the tokens "one slash two") and have them come out as nicely-typeset fractions with the numerator written as superscript and the denominator as subscript. This is a commonly-implemented feature in OpenType fonts; I'm not entirely sure how often it's actually used - I suspect it may be more of a "Hey, isn't it cool we have technology to do this?" thing - but whatever, it makes a good example. My code above only seems to support one digit over one digit, and many fonts actually do more complicated things to support arbitrary numbers of digits over arbitrary numbers of digits, but we have to start somewhere; getting the arbitrary-digits situation to work requires the use of principles more easily explained in a simpler setting.

The desired effect of this code is that the sequence "one slash two" should be transformed into "one.n fraction two.d". The ".n" and ".d" suffixed versions of the glyph names are assumed to represent special glyphs defined by the font that display versions of the numerals suitable for numerators and denominators of fractions - for instance, superscript and subscript versions.

The way it's supposed to work is that given the sequence "one slash two" the first rule matches (it's a digit followed by a slash followed by another digit) and changes the first token from "one" to "one.n". Similarly, the second rule changes the second token from "slash" to "fraction", and the third rule changes the third token to "two.d", so the output should be "one.n fraction two.d" as desired. All these rules are perfectly in accordance with the syntax and semantics I explained above.

It doesn't work. The missing knowledge is that rules are applied one at a time, with each rule looking at the output left by the previous rule. So "one slash two" matches the first rule and is transformed into "one.n slash two". Then neither the second nor the third rules match because they each start with "@digits" and "one.n" isn't in "@digits". The output sequence is "one.n slash two".

Here is a corrected version:

@digits=[zero one two three four five six seven eight nine];
@numerator=[zero.n one.n two.n three.n four.n five.n six.n seven.n eight.n nine.n];
@denominator=[zero.d one.d two.d three.d four.d five.d six.d seven.d eight.d nine.d];

feature frac {
  sub @digits' slash @digits by @numerator;
  sub @numerator slash' @digits by fraction;
  sub @numerator fraction @digits' by @denominator;
} frac;

In this code, when given "one slash two", the first rule matches and transforms it into "one.n slash two". Then the second rule also matches, because "one.n" is in "@numerator", so the token sequence becomes "one.n fraction two". Finally, the third rule matches that, and transforms it into "one.n fraction two.d" which is the desired output.

One tutorial I read ended here with a general admonishment to pay close attention to the order in which rules are applied. You will drive yourself nuts doing that, because in the language as I've described it up to this point, the following code should work:

@digits=[zero one two three four five six seven eight nine];
@numerator=[zero.n one.n two.n three.n four.n five.n six.n seven.n eight.n nine.n];
@denominator=[zero.d one.d two.d three.d four.d five.d six.d seven.d eight.d nine.d];

feature frac {
  sub @digits slash' @digits by fraction;
  sub @digits' fraction by @numerator;
  sub @digits' @numerator by @numerator;
  sub fraction @digits' by @denominator;
  sub @denominator @digits' by @denominator;
} frac;

That is supposed to handle any number of digits in the numerator and denominator. Given input like "one two slash three four five", it is supposed to transform as follows:

  • "one two slash three four five" (unmodified input)
  • "one two fraction three four five" (@digits slash' @digits by fraction)
  • "one two.n fraction three four five" (@digits' fraction by @numerator)
  • "one.n two.n fraction three four five" (@digits' @numerator by @numerator)
  • "one.n two.n fraction three.d four five" (fraction @digits' by @denominator)
  • "one.n two.n fraction three.d four.d five" (@denominator @digits' by @denominator)
  • "one.n two.n fraction three.d four.d five.d" (@denominator @digits' by @denominator [again])

But it doesn't work.

You might guess that there's a problem with having a rule work on its own output. After the last rule in the set has changed "three.d four" to "three.d four.d", it might be a problem to use that "four.d" in the same rule to recognize "four.d five" and change it to "four.d five.d". However, that is not actually the problem. The problem also is not due to the rules being in the wrong order. They are actually in the right order - the desired sequence of events uses them in exactly the order they are written, with each one working properly on the output of the previous (or itself). Nonetheless, it still doesn't work. What's wrong?

The Terrible Secret

I'm going to put this in bold because it's very important and I've never seen it explained anywhere else before.

After a rule match has succeeded, no other match can occur with the first token of its middle strictly before the first token of the middle of the successful match; and a new rule can match with the first token of its middle on a token of the middle of the successful match only if the new rule comes strictly after the successfully-matching rule.

In the example above, here's what actually happens as a consequence.

  • "one two slash three four five" (unmodified input)
  • "one two fraction three four five" (@digits slash' @digits by fraction) Now the single token "fraction" is the middle of the successful match.
  • "@digits' fraction" cannot match because the first token of its middle is before the "fraction" token.
  • "@digits' @numerator by @numerator" certainly can't match because there are no numerator digits (none were created, because the previous rule failed to match).
  • "one two fraction three.d four five" (fraction @digits' by @denominator)
  • "one two fraction three.d four.d five" (@denominator @digits' by @denominator)
  • "one two fraction three.d four.d five.d" (@denominator @digits' by @denominator [again])

Note that the last rule, which seems to be consuming its own output, actually does work. The problem is with the earlier rules, which tried to move the point of matching backward. The substitution algorithm keeps an input pointer describing where the start of the middle of a match pattern can be. It can look backward to match beginnings of rules, but it cannot move backward to place the start of the middle earlier than the current location. This fact is mentioned, but not at all clearly, in subsection 5.f.i of the Adobe specification, which says "If the rule is matched, then the current context moves the current glyph pointer ahead in the original text by the length of the input sequence."

This behaviour does have some consequences for rules that consume their own output; they're not always allowed. If you write "sub a' a by b;" (which is very similar to the denominator rule above) and run it on "a a a", the input will transform from "a a a" to "b a a" and then to "b b a". When the rule applies the second time, it works because the "a" being matched, even though it was touched by the previous invocation, was part of the copied-through end of the previous invocation, not the substituted middle. Whereas if you wrote "sub a a by b a;" then when run on "a a a" it would change to "b a a" and stop there. The "a" at the centre of the string is part of the substituted middle of the rule invocation, and cannot be seen by the same rule when it looks a second time.

One other gotcha: although rules are applied in the order they are written, the matching location takes priority over rule order. The interpreter looks first for any rule that can apply with the first token of its middle on the first token of the input, and uses the first such rule if more than one applies. Then it looks for any rule that can apply with the first token of its middle on the second token of the input - or on the first token of what's left after the middle of the matching rule, if one matched at the first token. Rules are matched first in token order and only then in within-the-file order; beginnings and endings do not count for sequencing purposes, only middles count.

Here is a different arbitrary-length fraction substitution fragment, which is not as nice (because it will mess up strings of digits that aren't fractions) but will, at least, work:

@digits=[zero one two three four five six seven eight nine];
@numerator=[zero.n one.n two.n three.n four.n five.n six.n seven.n eight.n nine.n];
@denominator=[zero.d one.d two.d three.d four.d five.d six.d seven.d eight.d nine.d];

feature frac {
  sub slash by fraction;
  sub @digits by @numerator;
  sub fraction @numerator' by @denominator;
  sub @denominator @numerator' by @denominator;
} frac;

Note, first of all, that even though "sub slash by fraction" is first, it probably won't be the first rule actually applied. We look first for rules whose middles can match as soon as possible. Here's the correct sequence of events with the same example input as before:

  • "one two slash three four five" (unmodified input)
  • "one.n two slash three four five" (@digits by @numerator)
  • "one.n two.n slash three four five" (@digits by @numerator)
  • "one.n two.n fraction three four five" (slash by fraction)
  • "one.n two.n fraction three.n four five" (@digits by @numerator)
  • "one.n two.n fraction three.d four five" (fraction @numerator' by @denominator; note that this is matching on a token from the middle of a successful match, but that is allowed because the new rule comes after the successful rule)
  • "one.n two.n fraction three.d four.n five" (@digits by @numerator)
  • "one.n two.n fraction three.d four.d five" (@denominator @numerator' by @denominator)
  • "one.n two.n fraction three.d four.d five.n" (@digits by @numerator)
  • "one.n two.n fraction three.d four.d five.d" (@denominator @numerator' by @denominator)

Note that the changed glyphs move smoothly through the sequence: first we change the first glyph, then we change the second glyph, and so on. That's a big clue to how the software that interprets these programs actually works internally.
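All of the behaviour described above - the forward-only pointer, retrying at the same position with only later rules, looking backward for context - can be collected into one small Python simulator. To be clear, this is my own reconstruction of the semantics as I understand them, not code from any shipping OpenType engine, and every name in it is my own:

```python
def matches(seq, pos, sets):
    """Do the token sets match seq starting at pos?  An empty `sets`
    matches trivially; out-of-range positions do not match."""
    if pos < 0 or pos + len(sets) > len(seq):
        return False
    return all(seq[pos + i] in s for i, s in enumerate(sets))

def substitute(tokens, rules):
    """Apply glyph substitution rules to a token list.  Each rule is
    (beginning, middle, end, replace): the first three are lists of
    sets of glyph names, and replace maps the matched middle tokens to
    the output tokens.  The pointer only moves forward, and tokens of
    a freshly substituted middle are visible as middles only to rules
    strictly after the rule that produced them."""
    seq = list(tokens)
    p = 0           # first position where a middle may start
    boundary = 0    # end of the most recent substituted middle
    last_rule = -1  # index of the rule that produced it
    while p < len(seq):
        if p >= boundary:
            last_rule = -1      # we have left the substituted middle
        fired = False
        for i, (beg, mid, end, replace) in enumerate(rules):
            if p < boundary and i <= last_rule:
                continue        # inside a fresh middle: later rules only
            if (matches(seq, p - len(beg), beg)
                    and matches(seq, p, mid)
                    and matches(seq, p + len(mid), end)):
                out = replace(seq[p:p + len(mid)])
                seq[p:p + len(mid)] = out
                boundary, last_rule, fired = p + len(out), i, True
                break           # retry at the same position
        if not fired:
            p += 1
    return seq

digits = "zero one two three four five six seven eight nine".split()
D = set(digits)
N = {d + ".n" for d in digits}   # numerator forms
DN = {d + ".d" for d in digits}  # denominator forms
SLASH, FRAC = {"slash"}, {"fraction"}

# The earlier rule set that "should work but doesn't":
broken = [
    ([D], [SLASH], [D], lambda m: ["fraction"]),
    ([], [D], [FRAC], lambda m: [m[0] + ".n"]),
    ([], [D], [N], lambda m: [m[0] + ".n"]),
    ([FRAC], [D], [], lambda m: [m[0] + ".d"]),
    ([DN], [D], [], lambda m: [m[0] + ".d"]),
]

# The rule set that actually works:
working = [
    ([], [SLASH], [], lambda m: ["fraction"]),
    ([], [D], [], lambda m: [m[0] + ".n"]),
    ([FRAC], [N], [], lambda m: [m[0][:-2] + ".d"]),
    ([DN], [N], [], lambda m: [m[0][:-2] + ".d"]),
]

text = "one two slash three four five".split()
print(" ".join(substitute(text, broken)))
# → one two fraction three.d four.d five.d
print(" ".join(substitute(text, working)))
# → one.n two.n fraction three.d four.d five.d
```

The same simulator reproduces the self-output examples: the rule "sub a' a by b" applied to "a a a" yields "b b a", while "sub a a by b a" stops at "b a a", just as described above.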

Why does it work this way?

This is where it gets into computer science. Why is there this weird limitation on what kind of matching patterns the system will accept? Wouldn't it be better to allow the input pointer to move backward? Then it would be a lot easier to write these programs. It seems like it shouldn't be much harder to write the software that interprets them. Such a change seems like it would make the system more powerful and allow a way to do complicated substitutions not possible under these weird, poorly documented and difficult to understand rules.

Actually, I think that's kind of the problem. If I had been thinking clearly last night I might even have been able to guess that they'd put in a stupid limitation like this, based on what I know of the theory behind the system. The thing is, if the input pointer were allowed to move backward, the system would become very much more powerful indeed. In fact, it would become equivalent to a Turing machine, capable of performing any computation that can be performed by computers in general; and there are important reasons they don't want to put that kind of power into the hands of font designers. It may seem like a simple little thing, letting the pointer move in both directions instead of only forward; but it has very big consequences.

One simple consequence of allowing the pointer to move backward is the possibility for infinite loops. Under the existing limitations, you can associate every pattern match with a specific token in its input, and at most one match per token. Even if some patterns produce many new tokens, we can prove that there is some notion of "progress" in effect: every match moves forward some amount through some suitably-expanded version of the input. For any given rule set there is a finite maximum number of output tokens that can be produced per input token. There is no combination of input and rules that will just keep producing new tokens forever. Contrast that with the rule "sub a by a a", which if allowed to match in its own output would produce an infinite number of "a"s from a single one as input. Even if you put in additional limits on the ability of rules to read their own output, something similar can be constructed by laundering the extra glyphs through multiple rules without any direct recursion.
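To make the termination argument concrete, here is a toy model (mine, not an OpenType implementation) of the "sub a by a a" rule under both regimes:

```python
# Toy demonstration of why the forward-only pointer guarantees termination,
# using the text's "sub a by a a" rule.

def forward_only(glyphs):
    # Each input token is visited exactly once, so the output is at most
    # doubled: progress through the input is guaranteed.
    out = []
    for g in glyphs:
        out.extend(["a", "a"] if g == "a" else [g])
    return out

def rescan_own_output(glyphs, max_steps):
    # If a rule may match in its own output, "a" -> "a a" never finishes;
    # an artificial step budget is needed just to stop the demonstration.
    glyphs = list(glyphs)
    steps = 0
    while "a" in glyphs and steps < max_steps:
        i = glyphs.index("a")
        glyphs[i:i + 1] = ["a", "a"]
        steps += 1
    return glyphs, steps

print(forward_only(["a"]))           # ['a', 'a'], and it stops there
print(rescan_own_output(["a"], 10))  # exhausts the whole step budget
```

The first function always terminates with output at most twice the input length; the second only stops because of the budget, which is exactly the behaviour the forward-only rule forbids.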

Now imagine what happens if you're the sysadmin of a large institution like a university, populated by students, who are basically younger and (even) less responsible versions of me: just the kind of people who think that this subject matter is really cool. You buy a fancy new printer that does OpenType. You know what's going to happen. Somebody's going to send a document to your printer containing a custom font with something equivalent to a "sub a by a a" rule in it, and then your printer is going to have a seizure. You don't want that. You want to know for sure that there's no way a font can produce infinite substitution results. (Never mind that the existing rules can still produce exponential results which are really just as bad in practice; there is a principle at stake here.)

It might be thought that we could program a computer to check the font and make sure there are no infinite loops before using it. In fact, that is not possible. You just can't do it. The detailed explanation of why not is the main subject matter of a required course in second-year undergrad computer science, and it's a course many students have trouble with. It's difficult stuff. The general idea is that if you're allowed to move the pointer backward, then (as I mentioned before) you could use the glyph-substitution system to build a universal Turing machine, that is, a general-purpose question-answering system. Then it becomes possible to ask the question-answering system a self-referential question about its own behaviour to which it cannot give any answer, but must end up in an infinite loop instead. This is called the "Halting Problem"; it is a general limitation of any sufficiently powerful computer system. Moreover, if there were any way a computer could analyze a "substitution with backward-moving pointer allowed" glyph table to verify whether it contains an infinite loop, then the same technique could be applied to solving the Halting Problem, which isn't possible; therefore no such technique exists.
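The self-referential trick is easier to see in code than in prose. Here is the standard construction sketched in Python; `halts` stands for the hypothetical decider, which is an assumption of the argument, not a real function:

```python
# Suppose someone claims `halts(f)` returns True exactly when f() would
# terminate.  From any such claimed decider we can build a function the
# decider must get wrong:

def make_contrary(halts):
    def g():
        if halts(g):
            while True:   # decider said "terminates", so loop forever
                pass
        # decider said "loops forever", so return immediately
    return g

# A decider that answers "loops forever" about everything is refuted at once:
g = make_contrary(lambda f: False)
g()   # returns immediately, contradicting the decider's answer
```

Whatever a claimed decider answers about its own `g`, `g` does the opposite, so no correct decider can exist.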

For my two cents I'd rather use a Turing-complete substitution language and take my chances on infinite loops and other shenanigans - just think of the fun we could have! - but it's understandable that Apple and Adobe didn't want typesetters to have quite that much fun. Hence the forward-moving-pointer limitation. At least I can console myself by using MetaFont to generate the font data; it was built by a famous computer scientist for power rather than safety.

More syntax: named lookups

The Terrible Secret above is the important one that isn't adequately explained elsewhere; nonetheless, there are a few other points I should address. It is possible to create ready-made sets of rules, other than features, that you can then invoke like subroutines in your features. These are called "lookups" in feature files; they seem to be a strict subset of what FontForge calls "lookups" in its GUI, but I recommend ignoring FontForge's concept of "lookups" entirely. The syntax for a named lookup looks like this:

lookup FOOBAR {
  sub a b c by x y z;
  sub f f l by f_f_l;
} FOOBAR;

It seems to be a common convention that named lookups get all-caps names. Once a lookup is defined it can be invoked within a feature using the "lookup FOOBAR;" statement. That has the effect of inserting all the rules from the lookup as if they were entered again right there.

The spec says that lookups may be defined outside a feature, in which case their scope is the rest of the file, or else within a feature, in which case their scope is until the end of the feature. What it does not clearly say is that if you define a named lookup within a feature, then (at least in FontForge) it is automatically invoked at the point of definition, as well as at any time you might explicitly invoke it. So if the above example lookup is defined inside a feature, it is just as if the rules "sub a b c by x y z; sub f f l by f_f_l;" were typed at that point, as well as at any other time the command "lookup FOOBAR;" might be given. This behaviour is usually not desirable, to say the least. I don't know if it is a bug in FontForge or actually the intended behaviour of the system. The work-around is to make the lookups file-global instead of feature-local.
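Concretely, the work-around might look like this in a feature file; the lookup name and the choice of the "liga" feature are placeholders for this sketch:

```
# Defined at file scope, so (even in FontForge) it is not applied
# anywhere automatically:
lookup FOOBAR {
  sub f f l by f_f_l;
} FOOBAR;

feature liga {
  lookup FOOBAR;   # the rules take effect only where explicitly invoked
} liga;
```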

Instead of being invoked to insert blocks of rules unconditionally, lookups can also be invoked inside substitution rules. The syntax for doing that looks like "sub a' lookup FOOBAR b' c' d;". Note, particularly, that there is no "by" in that rule. It means that if the pattern "a' b' c' d" matches, then the named lookup will be invoked with its input pointer set to the "middle" token that the lookup name follows, here "a"; it must be a "middle" token. So running that rule on "a b c d": the pattern matches; "a b c d" is passed to the lookup named "FOOBAR"; the first rule of that lookup substitutes "x y z" for "a b c"; and the result is "x y z d". I am not absolutely certain of details such as how much of the beginning and end context is available to the invoked lookup. It appears, though I have not tested it, that lookups can invoke each other to arbitrary depth; however, since a lookup is not in scope until after it is defined, recursion is not possible.
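Put together, a contextual invocation might look like the sketch below. The glyph and lookup names are placeholders; I write the keyword "lookup" before the name, which is how the Adobe spec spells this kind of invocation, and I've made FOOBAR a ligature rule, since a many-glyphs-to-many-glyphs rule is not a substitution type GSUB directly supports:

```
lookup FOOBAR {
  sub a b c by a_b_c;   # ligature: three tokens in, one out
} FOOBAR;

feature calt {
  # when "a b c" is followed by "d", invoke FOOBAR at the "a" position:
  sub a' lookup FOOBAR b' c' d;
} calt;
```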

It is allowed to specify more than one lookup in this way in the same rule. The Adobe spec actually presents that as the normal case, even though from my point of view it seems to be weird special syntax that will seldom if ever be used. It is not absolutely clear to me how much of the output of one lookup is available to the next when there is more than one. The more popular "by" syntax is actually sugary shorthand for this syntax: as you can verify with the FontForge GUI (God help you), what really happens when "by" is used for sufficiently complex rules is that FontForge creates a rule-local anonymous lookup and invokes it after the first token of the middle of the match.
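For a rule of the shape used in the fraction example, the two spellings below should therefore be equivalent. The lookup name is a placeholder of mine, and the two "feature" blocks are alternatives, not meant to appear in the same file:

```
# sugary form:
feature frac {
  sub fraction @numerator' by @denominator;
} frac;

# desugared form, with the anonymous lookup written out by hand:
lookup TO_DENOM {
  sub @numerator by @denominator;
} TO_DENOM;

feature frac {
  sub fraction @numerator' lookup TO_DENOM;
} frac;
```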

There is more

As well as all the "kerning" stuff, which allows the context-specific modification of glyph position as well as which glyph gets shown, there are a few more constructions that could go in a feature file. Features and parts of the rule set can be locked out based on "scripts" and "language systems", so you can share the fun by making your font mysteriously fail to work depending on nothing obvious to the user. There is a keyword called "ignore", which allows the specification of an "exception" matching pattern that, if matched, will block some other substitution matching pattern that would otherwise match. That looks useful, but I don't understand it yet. There are a few other miscellaneous bits of syntax used for specifying font features that aren't necessarily related to glyph substitution but are convenient to specify in the same file.

All in all, there's still a lot to be learned. But that one point, about the no-backward-movement limitation, seemed really important, and it was hard-won, so if nothing else, I wanted to document that.

Another unwelcome surprise (new, November 2011)

It appears that if you use a glyph as a "mark" in the GPOS table, then you are not allowed to use it at all in the GSUB table. This is because every glyph must be categorized as at most one of "base," "ligature," "mark," or "component" in the (usually implicit) GDEF table, a fact which is alluded to rather vaguely in subsection 9.b of the specification. FontForge allows putting "mark" glyphs in substitutions, but not really. The GSUB table ends up full of garbage if you try, and it segfaults when you save. Apparently, some of the Adobe tools I don't have will fail in a more informative way. I don't see a good reason for marks in substitutions not to be allowed, and it would be useful if they were allowed, but it seems clear that they are not allowed.
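For reference, the categorization can be written out explicitly in a feature file with a GlyphClassDef statement; the glyph names below are invented placeholders. The point is that each glyph may appear in at most one of the four lists:

```
table GDEF {
  # base glyphs, ligatures, marks, components (last left empty here):
  GlyphClassDef [a b c], [f_f_l], [acutecomb], ;
} GDEF;
```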

9 comments

Axel
Thank you! Will mark for future study. Yesterday, by a happy coincidence, I posted a note on the Jupiter and Saturn glyphs.
http://hirsig.ca/extra/ZforZeus.pdf
Axel - 2010-12-07 11:32
Matt
That's interesting history. Have you, or others, done similar work on the other planetary glyphs? I've occasionally looked for such information and mostly found Web pages that are clearly (like the one you quoted) attempting to impose newly invented meaning on the glyphs without looking at their history.

We don't need to have the entire descriptivist/prescriptivist argument all over again right here and now, but I see you refer to the older forms of the glyphs as the "proper" forms. Why? The Z-formed Jupiter glyph seems to be derived from an abbreviation of the name of the object, and not even the name in current general use; mightn't it be better to use a symbol that connects to the object's meaning? (Maybe the Saturn glyph does, if it's the sickle, but as you say, you haven't found real evidence for that.) Even if the Web site you quoted in your note was wrong about both the origin of the modern glyph and the meaning of Jupiter, it seems like *some* way of getting from the meaning to the glyph might be consistent with your comments on the Eris article about how we should go from observation to meaning rather than getting bogged down in the name that human beings happened to choose for an object.

Should we write present-day English with the long s?
Matt - 2010-12-07 14:09
Axel
The ſ has much to say for it. My father had the Diderot edition of Voltaire's works - seventy leather-bound octavos - and I loved reading it with the long ſs. But seriously, typography is part engineering, part applied art, and as in any art there is room for taste. I am not about to impose the choice of any particular glyph on anyone else but I cringe when I have to publish something of my own with glyphs I loathe. In fact, when designers make fonts available with only forms of glyphs with which I disagree then it is I who am being imposed upon. Some glyphs - think especially of Pluto - should have as many as four variants just to satisfy national preferences.

Anyway, comparison with the long ſ is not quite fair because the Z-form Jupiter and uncrucified Saturn (a) have not completely disappeared, and (b) are easily recognized even by people who have never seen them. They are not radical departures, and there is precedent for fanciful variation in typography (the ampersand, the pilcrow, capital Q, and various ligatures, not to mention the more subtle differences). Typography is a conservative art whose practitioners enjoy tweaking traditional material with small quirks, and I'm content with that. As for astrology, I think the names of the seven naked-eye bodies must be meaningful - or at least must provide helpful metaphors - because their effects must have influenced the choice of their names. I can't say that for any body discovered with telescopes.
Axel - 2010-12-07 19:37
Cernael
"...you can share the fun by making your font mysteriously fail to work depending on nothing obvious to the user."
At which point, I thought fnord.
Cernael - 2010-12-19 22:28
Chris
A Turing machine from a glyph-substitution algorithm ! ! !
What a truly entrancing topic. Very well presented, my friend.
-
Don't overlook that a TM can only compute a tractable task,
though. Much of what you present as the possible outcome
of a machine recursively processing its own product involves
the infinite and Turing hated the infinite with a vengeance !

David Deutsch has some interesting input to these matters
in his splendid "Fabric of Reality".

Where can I look at more of your work ?

Chris Eccles APS MRI HND(Eng)
Chris - 2010-12-25 04:42
ThVortex
Regarding the "CID fonts", see http://en.wikipedia.org/wiki/PostScript_fonts#CID
ThVortex - 2012-01-16 20:50
Ava Odoemena
Can't we just draw .svg files and click "export to font" in Inkscape??! Sheesh. What a nightmare.
Ava Odoemena - 2012-03-16 08:59
Matt
Not if we want the typeset results to be professional-quality. Even in English ligatures are necessary for high-quality printed output, and in some languages, such as Arabic, they're necessary for the output to be readable at all.
Matt - 2012-03-16 09:16
Slash
Excellent! I used this information to get my definitions to work. I have a question. Is there a way to make a class that states exceptions or exclusions? If I want to match everything but the digits, is there a way to define a class of "not digits"? Is there a NOT operator to do this? Like defining a class of not_zero, not_one, etc., or maybe not [zero one two... etc] or ~[zero one... etc]?
Slash - 2012-09-25 11:41

