Building a build for something weird


Mon 12 Dec 2011 by mskala Tags used: software, 作りましょう

Here are some thoughts on the Tsukurimashou build system. You can find the code, and some documentation of how to use the build system, in the package, but this posting is meant to look more generally at some of the issues I encountered while building a build for something weird.

The thing is, Tsukurimashou isn't a piece of software in the normal sense, but a package of fonts. It's written sort of like software, using programming languages, but the data flow during build doesn't look much like the data flow during build of the usual kind of software package. As a result, although it seemed like using Make was the thing I wanted to do, the way I've written my Makefile doesn't look much like what we might expect on a more typical software project. Working on it has forced me to see the structure of the project quite differently from the way I'd usually look at software, and maybe some of the ideas from that can be applied to other things.

Tree structures

In very general terms, an orthodox software project usually looks like a tree. You are building basically one big thing, or at most a very small number of big things, like two or three. Your one big thing is made of big parts - limbs, if you will. Each of those is made of smaller parts, which you might call branches (at risk of confusing yourself when you start talking about revision control, but we're not doing that tonight), and they have even smaller parts (twigs?). The number of levels in this structure can vary, but it's basically a hierarchical thing where the little parts of one big thing don't overlap with the little parts of another big thing. Maybe there are a few small things that are shared across the entire project - those go in libraries and they typically go in a separate directory somewhere.

[illustration of a tree-structured project]

When you want to build orthodox software, it's fairly well-understood how to do that. The dependency structure is basically a tree; it follows the tree of organization of the basic structure of the entire system. Usually you have a directory structure that also more or less matches that tree. It is popular practice to put a Makefile in each directory, an Autoconf-generated configure script (or whatever unworthy modern innovation you want to use to replace that) at the very top, and if it's a really large project, maybe configure scripts for some of the big parts near the top as well. If you want to build the whole thing, you type "make" at the very top, and it recurses down to the leaves at the bottom, builds them, then builds the branches that depend on them, and so on, working its way up until at the top it can link the biggest pieces to create the final product. The bottom line: many things built first combine to create few big things built later to create basically just one thing built last of all.

That "Makefile in every directory" is not actually, necessarily, all that great a way to design a build system even for orthodox software that has a tree structure. At about the time I was starting to realize that it wouldn't work for Tsukurimashou, I happened upon the article "Recursive Make Considered Harmful" by Peter Miller. He makes a lot of interesting points, and he's talking about software that has a tree structure to begin with, that you meaningfully could follow with the build system if doing so weren't so "harmful." Most of what he says goes triple for Tsukurimashou, which doesn't really have a tree structure to begin with.

Build system as pipeline

Tsukurimashou doesn't really have a tree structure. Sure, some parts of it could be described in a tree-ish way; for instance, within any one font, you can split the Unicode range and say this part is the Latin alphabet, that part is the hiragana, this other part is the kanji, and so on. It makes some sense to do some of the work on those things separately. But there is a lot of work that goes into building a font that ends up being done in parallel on all the glyphs in the font, either in many small files or a few large ones. Also, if you're building the whole system, you're building many font files and you don't end up with one big thing, you end up with many medium-sized things.

The most critical thing that is non-treeish is that the population of things you're looking at as you proceed through the build system doesn't start out large and then shrink down to just one. Instead, it starts out medium-sized, then grows a lot, then shrinks a little, then shrinks some more, but it always stays at least medium-sized, and more than just one thing comes out in parallel. Perhaps a diagram would help.

[illustration of a pipeline-structured project]

What's basically going on here is that several different dimensions of font styling and Unicode ranges are multiplying together. I've got one source file called tsuku-00.mp that defines the Unicode code points U+0000 to U+00FF, which correspond to the Latin alphabet, basically. (ASCII and Latin-1, if you want to be more technical). There are similar source files that correspond to other 256-code-point ranges. For instance, tsuku-30.mp defines the Unicode code points for (among other things) Japanese kana. Some of these source files have internal structure that looks like orthodox software, with hierarchical subdivisions, or "library" code used in many places. (Last week I wrote about some of the issues with many-to-many use of library code.)

But then I've got another source file, namely tsuku-kg.mp, that defines "Kaku Goshikku style"; and one called tsuku-mi.mp for "Mincho style." And I've got a source file called tsuku-ps.mp that defines proportional spacing, as opposed to the default monospace. (All these filenames start with "tsuku-", but there are other prefixes to be described later.) When these files go through the MetaPost interpreter, the five source code files I've named so far are supposed to generate a total of eight Postscript fonts, because it has to choose one page of code points, one style, and to use or not to use the proportional spacing code:

tsuku-kg-00.pfb
tsuku-kg-30.pfb
tsuku-mi-00.pfb
tsuku-mi-30.pfb
tsuku-kg-ps-00.pfb
tsuku-kg-ps-30.pfb
tsuku-mi-ps-00.pfb
tsuku-mi-ps-30.pfb

It's multiplicative, though not purely so because not all combinations are actually supposed to be generated. At that stage, in the current development version, there are 1730 Postscript font files generated from 193 source code files.
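As a rough illustration of that multiplication (in Python, and not code from the package), the eight filenames above can be enumerated as a filtered Cartesian product. The `wanted` filter is a hypothetical stand-in for whatever rules exclude the combinations that should not be built; here it accepts everything.

```python
from itertools import product

STYLES = ["kg", "mi"]    # Kaku Goshikku, Mincho
SPACINGS = ["", "ps"]    # "" is the default monospace, "ps" is proportional
PAGES = ["00", "30"]     # 256-code-point pages


def wanted(style, spacing, page):
    """Hypothetical filter: the real project skips some combinations."""
    return True


def font_targets(family="tsuku"):
    """List the Postscript font filenames for every wanted combination."""
    targets = []
    for spacing, style, page in product(SPACINGS, STYLES, PAGES):
        if not wanted(style, spacing, page):
            continue
        # Leave the spacing part out of the name when it is the default.
        parts = [family, style] + ([spacing] if spacing else []) + [page]
        targets.append("-".join(parts) + ".pfb")
    return targets
```

Calling `font_targets()` gives the eight names in the same order as the list above; adding a style or a page to the lists grows the output multiplicatively, which is the effect described here.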

Then the next stage is the "remove overlaps" stage, where each of those Postscript fonts is processed to clean up the vectors in certain ways; each input file generates, independently, one output file in a nice one-to-one process. Then they get joined together in groups to form OpenType fonts (of which there are fewer because that format has a 16-bit instead of an 8-bit limit on the number of glyphs it can contain), and those joined files go through a few more processing stages. Meanwhile, some information that was generated as a side effect of creating the Postscript files goes through a different pipeline and eventually gets fused with the OpenType fonts. At the end you've got 20 finished product font files. Note, not just two or three.

But even though this build process involves several thousand files, and there are certainly some complicated things going on in it, it's not as complicated as we might expect for an orthodox software package with the same number of files in its build. That's because many of these operations - and especially the ones that use the largest number of files - consist of doing the same thing in the same way many times. Even the dependencies are parallel across many files. So describing the relations among the files seems like it should be much less work than describing the relations among the same number of files in an orthodox software package.

Organize by process, not by product

Reading Miller's article reinforced the thought already in my mind that trying to split this into subdirectories with their own recursive Makefiles would be unworkable. Suppose I tried to split it as I might a piece of orthodox software. What I'd probably end up doing would be splitting the source code and the build system in the way that the finished product could be logically divided. I have several "families" of fonts (Tsukurimashou, Jieubsida, Blackletter Lolita...); maybe each of those could be a directory. Each of those has several styles (Kaku, Mincho, Tenshi no Kami...); each of those could be a subdirectory. Then within a style, maybe I have proportional and monospace; within the planned future scope of the project there might also be divisions by "weight"; and so on.

After several levels I could actually start putting in the real build system - which would be duplicated completely at every leaf of the tree, though I would certainly hope a lot of the Makefile code could come in via include files instead of really being duplicated. And then it would all have to refer to the source code that would probably end up having to be in some other tree somewhere else, because it would all be "library" code shared across the entire tree, and there'd be so much source code as to require a hierarchical organization of its own. The intermediate files, like the Postscript fonts before and after overlap removal, would end up scattered uniformly across the entire build tree. (There would also be a third hierarchy for "object" files, if I followed GNU's recommendations in that matter.)

I'm sure I could build that if I had to, but I sure wouldn't want to; it would be a nightmare to maintain, especially when (as sometimes happens) I want to change the basic workflow, for instance by adding another processing stage. And all of Miller's objections to recursive Makefiles would apply strongly.

Instead, I built a much flatter directory structure for this project, following the logical structure of the process instead of the structure of the product. I have a directory - just one, and it's flat! - for source code. I have a directory for all the Postscript files before "remove overlap," which are the immediate product of the source code; then another directory for the Postscript files after "remove overlap" which come from those. Basically, each step of the pipeline reads from one directory and writes to the next. Files of the same kind, for the entire project, are kept together, even though that means some directories contain over 1700 of them in what the filesystem sees as a flat structure. As a human, I see the files in most of my directories as a multiplicative, matrix- or tensor-like structure; not a hierarchical tree-like structure.

There are some exceptions to that general approach. I have a "doc" subdirectory that looks a little more like what might be seen in an orthodox software project, with all stages of processing on some files (TeX to XDVI to PDF, with all TeX's temporary files) in the same directory. However, the number of files there is much smaller; and that is quite simply a part of the project that can be reasonably tackled with orthodox technique.

Even with the one directory per processing stage approach it would still be possible to use recursive Make. I could put a Makefile in each directory to handle building all the things that should go in that directory. Miller complained that sometimes files have dependencies outside their own directories, and this approach would be the extreme case of that, with all dependencies outside the local directory. However, all dependencies in one directory would be pointing at the same directory, the previous one, and it wouldn't be hard to understand the sequence in which it would be necessary to build the different directories, so the worst of the problems he described wouldn't come into play much and I think it could be made to work. Splitting directories by the build process instead of the finished product would still be a win even with recursive Make and its problems.

I chose not to use recursive Make, however; Tsukurimashou has just one Makefile, at the root of the source tree. Then the next question is how to actually write that Makefile.

On-the-fly Makefile generation

I don't want to write 1730 Make rules for building Postscript fonts (plus 1730 more for removing overlaps from them), let alone maintain them as the number changes with updates to the software. Fortunately, nobody would seriously ask me to; Make has basically always supported "pattern" recipes, where you write one recipe with some wildcards in it, and then if the wildcards match, Make can construct a recipe for building whatever file it's looking at. Such things are used (and many are even built in) for doing things like building object files from C source code.

So that's the first line of attack: where possible, I can write pattern recipes, and cover an entire processing stage regardless of how many files it applies to, with a single recipe.

One issue that comes up is dependency tracking. Off-the-shelf dependency tracking simply doesn't exist for MetaPost. But that's not such a huge issue. I wrote a Perl script that would parse my source code far enough to extract dependencies and write those to a Makefile fragment that gets included from my main Makefile. Figuring out when it's necessary to recompute those was a little tricky, and I ended up adopting a compromise solution. On every build I run grep against the entire MetaPost code base (but remember, that's all in one flat directory, which contains nothing else!) and find the "input" lines, which represent inter-file dependencies. I compute a long checksum of that. If the checksum hasn't changed, dependencies don't need to be recomputed; otherwise they do. It seems like more work done on every build than if I used a dependency file for every Postscript file, but it also means one dependency file instead of 1730 of them (with their associated Make recipes and work for Make to do), and the grep and checksum are very fast; it's not clear that I would really save much time with a profusion of dependency files.
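The checksum test can be sketched roughly as follows. This is a Python stand-in for the grep-and-checksum step, not the Perl script from the package; the regular expression for "input" lines, the use of SHA-256 as the "long checksum," and the stamp-file mechanism are all assumptions made for the illustration.

```python
import hashlib
import re
from pathlib import Path

# Assumed shape of a dependency line: "input" followed by a file name.
INPUT_LINE = re.compile(r"^\s*input\s+\S+")


def input_lines_digest(source_dir):
    """Checksum every "input" line in the flat source directory."""
    digest = hashlib.sha256()
    for path in sorted(Path(source_dir).glob("*.mp")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if INPUT_LINE.match(line):
                # Include the file name so moving a line between files counts.
                digest.update(f"{path.name}:{line.strip()}\n".encode("utf-8"))
    return digest.hexdigest()


def deps_need_refresh(source_dir, stamp_file):
    """Return True and update the stamp if the "input" lines have changed."""
    current = input_lines_digest(source_dir)
    stamp = Path(stamp_file)
    if stamp.exists() and stamp.read_text().strip() == current:
        return False
    stamp.write_text(current + "\n")
    return True
```

Editing glyph code without touching any "input" line leaves the digest unchanged, so the expensive dependency extraction is skipped on most builds.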

The next issue is that many of my Make recipes are more complicated than can reasonably be done in a pattern recipe. In particular, I have to do a fair bit of string processing on filenames to figure out how they should be built. To build tsuku-kg-ps-00.pfb, for instance, I need to create a temporary directory (with a unique name so it won't collide in a parallel build), create a "driver" file that invokes preintro.mp, tsuku-kg.mp, tsuku-ps.mp, and tsuku-00.mp in that order, actually build the Postscript font, and then clean up. Between Make's string manipulation functions and the kind of shell scripting I can put in recipes, it's certainly possible to do that kind of thing.
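To make the string processing concrete, here is a minimal sketch in Python of the steps just described: split the target name, work out which source files the driver must invoke and in what order, and write the driver into a uniquely named temporary directory. The function names, the `driver.mp` filename, and the exact `input ...;` line syntax are assumptions for the illustration; the real recipes do this with Make functions and shell.

```python
import tempfile
from pathlib import Path


def driver_inputs(target):
    """Map e.g. "tsuku-kg-ps-00.pfb" to the source files it is built from."""
    stem = target.rsplit(".", 1)[0]            # "tsuku-kg-ps-00"
    family, *middle, page = stem.split("-")    # style and spacing sit in the middle
    names = ["preintro.mp"]
    names += [f"{family}-{part}.mp" for part in middle]
    names.append(f"{family}-{page}.mp")
    return names


def write_driver(target):
    """Write a driver file in a fresh temporary directory and return its path."""
    # mkdtemp picks a unique name, so parallel builds do not collide.
    workdir = Path(tempfile.mkdtemp(prefix="build-"))
    driver = workdir / "driver.mp"             # hypothetical driver filename
    driver.write_text("".join(f"input {name};\n" for name in driver_inputs(target)))
    return driver
```

For `tsuku-kg-ps-00.pfb` this yields preintro.mp, tsuku-kg.mp, tsuku-ps.mp, and tsuku-00.mp, in that order. Building the font and cleaning up the temporary directory are left out of the sketch.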

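However, string functions (at first glance) don't seem to be much use in pattern recipes, because the ones in the target and prerequisite lists are expanded when the Makefile is read, before any pattern matching happens. If the pattern %.pfb is going to match tsuku-kg-ps-00.pfb, string functions in the prerequisite list won't see the name tsuku-kg-ps-00.pfb; they will see %.pfb. (The commands themselves are expanded later and can use automatic variables like $@ and $*, but by then the prerequisites have already been decided.)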

I read the GNU Make manual and learned that it is possible to turn on (apparently globally, though that isn't spelled out) a feature called "secondary expansion," which will subject all recipes to variable expansion a second time after the patterns have matched. That expansion will be able to see what the % expanded into. But to make it work, you have to escape the string processing you want to happen on the second expansion enough to make it survive the first expansion. If you also want your recipe to contain shell scripts, you need to escape those through both of Make's expansions, and if you want the shell scripts to contain Perl one-liners, or if you want to use the rather complicated automatically-generated recipes that come from Automake and don't seem to have been written with secondary expansion in mind, well, have fun.

When the vomiting subsided, I decided to use the similar in spirit, but more controllable, $(eval) function. Pattern recipes are always preferable where they can apply, but for cases where they can't, I use Make's $(foreach) function to build some nested loops that iterate through all the similar recipes I would like to define. Inside the loops, I do $(eval) on a fragment of parameterized text that represents a rule. I only have to write one of those fragments for each processing stage in my pipeline. It functions somewhat like a pattern recipe, but it's not doing pattern matching; every target gets its own explicitly-written recipe, and all of those are generated on the fly from a single source. Make has to do all the work of generating these - several thousand of them - when it loads the Makefile; but in practice, the time taken to do that isn't noticeable.
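The shape of that technique, nested loops filling in one parameterized rule template per target, can be mimicked outside Make. The sketch below is a Python analogy that produces rule text as strings; it is not the $(foreach)/$(eval) code from the Makefile, and the `mp/` and `pfb/` directory names and the `$(BUILD_PFB)` command are hypothetical.

```python
# One template per processing stage; each target gets its own explicit rule.
RULE_TEMPLATE = "pfb/{target}.pfb: {prereqs}\n\t$(BUILD_PFB) $@\n"


def generate_rules(families, styles, spacings, pages):
    """Expand the template once for every combination, as nested loops would."""
    rules = []
    for family in families:
        for style in styles:
            for spacing in spacings:
                for page in pages:
                    # An empty spacing means the default and adds no name part.
                    parts = [style] + ([spacing] if spacing else []) + [page]
                    target = "-".join([family] + parts)
                    prereqs = " ".join(f"mp/{family}-{p}.mp" for p in parts)
                    rules.append(
                        RULE_TEMPLATE.format(target=target, prereqs=prereqs)
                    )
    return rules
```

With `generate_rules(["tsuku"], ["kg", "mi"], ["", "ps"], ["00", "30"])` this produces eight explicit rules, each with its prerequisites already spelled out, so nothing depends on pattern matching at build time.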

Even this level of processing didn't seem to be quite enough because some complicated decisions must go into the choice of which combinations, out of the multiplicative universe of font families, styles, weights, and code point ranges, should actually be built. I ended up writing some of that in stripped-down Prolog and having the Makefile call it and generate another Makefile fragment to include, much in the manner of the dependency tracker. That Makefile mostly sets variables which can then be used in the on-the-fly recipe generator. There's also a dependency tracker very similar to the Postscript-on-source-files dependency tracker, for tracking the dependencies of documentation files on OpenType font files. That's another Perl script that generates a Makefile fragment to be included. But there's nothing complicated about how it works, in principle.

Virtual methods in Make code

You probably haven't seen this one before.

The thing is that among the different "families" of fonts generated by the Tsukurimashou build system, there's some amount of sharing of font shapes in a way that cuts across the other organization of the code. This is only used a little at present, but it will be used more in the future.

The Tsukurimashou fonts proper, the ones that end up having "Tsukurimashou" in their filenames, are in some sense generic. They are the core of the project. But there are other fonts that are sort of derived from them and also contain significant additions and subtractions of their own. The best (nearly the only) example in the next version (hoped-for release before Christmas) is a set of fonts called Jieubsida. The Jieubsida fonts are basically Tsukurimashou minus the Japanese kanji and plus the Korean hangul writing system. Shared content includes the Latin alphabet, a bunch of East Asian punctuation, extra stuff like dingbats, and the subset of Japanese that doesn't actively conflict with Korean. When I want to change something in that shared subset, I only want to change it in one place.

So far that's not a big problem because, hey, I've already got library files of MetaPost code I can use in multiple places, so I just define the shared stuff between Tsukurimashou and Jieubsida in those library files and include it in the right places.

The trouble is that the build system needs to be cued by special filenames. As I mentioned before, there is for instance a Postscript file called tsuku-kg-ps-00.pfb which is generated from tsuku-kg.mp, tsuku-ps.mp, and tsuku-00.mp. What about when I want to generate jieub-do-ps-00.pfb? It looks like I need to have jieub-do.mp, jieub-ps.mp, and jieub-00.mp. Well, jieub-do.mp is basically identical to tsuku-kg.mp ("do" and "kg" derive from 「돋움」 and 「角ゴシック」, respectively Korean and Japanese typographic terms with totally different literal meanings but referring to essentially the same graphic style), so if I compromised on the filenames I could merge them. And jieub-ps.mp is exactly identical to tsuku-ps.mp, so I could easily merge them.

The problem is jieub-00.mp. It happens to be exactly identical to tsuku-00.mp, but that fact is specific to page 00. It is not true for other pages; there are some where the corresponding page files between the Tsukurimashou and Jieubsida families are totally different; and there are many pages that only exist on one side or the other. So I can't have them totally separate or I'll face a lot of duplication, and I can't have them totally merged or I'll be constantly writing conditional code within the per-page files, in a way that will get progressively more crufty as the number of different families grows beyond two.

What I really want is for Tsukurimashou to be a thing like an ancestor class that is inherited by Jieubsida, overriding the behaviour of the ancestor where necessary. The inherited ancestor code still needs to be able to call back into the descendant's code, at least a little, so I need things like virtual methods. But, ideally, I'd like this to be implemented at the level of the build system, with files overriding files, rather than in the underlying programming language (which is MetaPost and, of course, doesn't have anything like classes or inheritance built in). I want to write virtual methods. In Make.

So, that's pretty much exactly what I did. Many of my Make recipes were already doing a fair bit of string processing. It wasn't really hard to make them break up the filenames, find the part that designates which family is currently being built, and then do wildcard matching to find an overriding file if one exists, or substitute "tsuku" and look for the resulting filename if one doesn't.

Suppose I want to build jieub-do-ps-00.pfb. I'm already generating the recipe for that on the fly for other reasons, so while I'm doing that, I test the existence of filenames. The file jieub-do.mp exists, so it uses that. The file jieub-ps.mp doesn't exist, so it uses tsuku-ps.mp instead. Similarly, there is no file jieub-00.mp, so it will use tsuku-00.mp, but that is specific to page 00. This "use override, or inherit" decision is made per file, not per dimension; some of the other page files (like jieub-d3.mp) do override the corresponding Tsukurimashou files.
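The per-file "use override, or inherit" test amounts to a small lookup. Here is a minimal sketch in Python, assuming a flat source directory; the function name and signature are made up for the illustration, and the real version lives in Make's string and wildcard functions.

```python
from pathlib import Path

GENERIC_FAMILY = "tsuku"   # the one ancestor everything else inherits from


def resolve_source(source_dir, family, suffix):
    """Pick the family's own file if it exists, else the generic one."""
    override = Path(source_dir) / f"{family}-{suffix}.mp"
    if override.exists():
        return override
    # No override: fall back to the Tsukurimashou file of the same suffix.
    # Whether that file exists is left for the build to discover.
    return Path(source_dir) / f"{GENERIC_FAMILY}-{suffix}.mp"
```

With only jieub-do.mp and jieub-d3.mp present on the Jieubsida side, `resolve_source(src, "jieub", "ps")` returns the path to tsuku-ps.mp, while `resolve_source(src, "jieub", "d3")` returns jieub-d3.mp. The decision is made independently for every filename, which is what gives the virtual-method behaviour.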

For the case of actually deleting functionality from the ancestor (relevant to blocking the kanji out of the Korean fonts) it also requires handling at a higher level - the recipes that would build deleted pages actually never get generated - but that's a relatively minor detail. I need it, but only for a couple of purposes. The general concept is quite simple. By the moderately laborious measure of using code to check every time I want a filename, I always use an override file if one exists and a generic file if there is no override. The overall effect is simple virtual methods: functionality (and, in particular, code point ranges) defined for Jieubsida overrides Tsukurimashou when it exists, with the Tsukurimashou functionality filling in the gaps where it doesn't. My type hierarchy is limited to two levels - just one generic family with everything else inheriting directly from it - but that's as far as it needs to go.

Final thoughts

Overall, I think most of these techniques would be unwarranted for most projects. We typically use Make for things that have a natural tree structure; it may or may not be appropriate to use recursive Make following that tree structure, but at least it usually makes sense for the process to be structured like the product when they are both naturally tree-structured. But there's no reason that the natural structure of the build process would be the same as that of the thing it's building; it only happens that software often works that way. When it doesn't, it may make sense to organize things around how they will be built instead of how they look when finished. That's probably the best take-away lesson here. Then once we're in a strange space where the usual techniques aren't appropriate, it helps to know about some of the tricks like on-the-fly recipe generation that can be used in strange spaces.
