
Kleknev: a coarse-grained profiler for build systems

Mon 11 Mar 2013 by mskala

When I was preparing the Tsukurimashou 0.7 release, I had to build the entire package several times from scratch, to verify that all the necessary pieces really were included in what I was preparing to ship. When I run the build on my development machine it normally re-uses a lot of previously-built components, only updating the parts I have recently changed. That kind of incremental compilation is one of the main functions of GNU Make. But if I'm shipping a package for others to use, it has to work on their systems which don't have a previous history of successful builds; so I need to verify that it will actually build successfully in such an environment, and verifying that means copying the release-candidate package into a fresh empty directory on my own system and checking that the entire package (including all optional features) can build there.

Tsukurimashou is a big, complicated package. It's roughly 92,000 lines of code, which may not sound like much. For comparison, the current Linux kernel is about 15,000,000. Tsukurimashou's volume of code is roughly equivalent to a 0.99 version of Linux (not clear which one - I couldn't find numbers I trusted on the Web just now, and am not motivated to go downloading old kernel sources just to count the lines). However, as detailed in one of my earlier articles, Tsukurimashou as a font meta-family is structured much differently from an orthodox software package. Things in the Tsukurimashou build tend to multiply rather than add; and one practical consequence is that building from these 92,000 lines of code, when all the optional features are enabled, produces as many output and intermediate files and takes as much computation as we might expect of a much larger package. A full build of Tsukurimashou maxes out my quad-core computer for six or eight hours, and fills about 4G of disk space.

So after a few days of building over and over, it occurred to me that I'd really like to know where all the time was going. I had a pretty good understanding of what the build process was doing, because I created it myself; but I had no quantitative data on the relative resource consumption of the different components, I had no basis to make even plausible guesses about that, and quantitative data would be really useful. In software development we often study this sort of thing on the tiny scale, nanoseconds to milliseconds, using profiling tools that measure the time consumption of different parts of a program. What I really wanted for my build system was a coarse-grained profiler: something that could analyse the eight-hour run of the full build and give me stats at the level of processes and Makefile recipes.

I couldn't find such a tool ready-made, so I built one.

Instrumenting the build

The first thing I needed was a way to instrument the build: some way to run the build system, preferably with as few modifications as possible, that would allow me to collect raw data on all the activities performed as part of that complicated multi-hour computation. Fine-grained profilers often dig deeply into the subject code (that is, the system being analyzed). For instance, to use gprof you recompile the subject code using a special compiler feature that adds data-collection instructions throughout. The result is bigger and slower than the original. Valgrind doesn't require direct modifications to the subject code, but runs it inside a software CPU emulator, incurring a huge speed penalty in exchange for extremely detailed profile data. I wanted my coarse-grained profiler to avoid that kind of thing. It should hook into the subject code as little as possible, both to reduce the performance impact of using the profiler (I don't want to spend 80 hours running the profiled version of a build that already takes eight hours) and to reduce the development load (because I, theoretically, want to spend my time creating fonts, not tools for profiling font build systems; and correctly hooking into subject code at a low level is hard).

I thought of several possible ways to instrument the build. Most of what happens in the build is mediated by Makefiles, and it would be reasonable to treat Make recipes as the fundamental units of analysis. So one of my first ideas was to add instrumentation to GNU Make to have it record all the recipes it executes. Changing other people's source code is not my favourite activity, but this might be a relatively clean way to achieve the goal. It would only work for recipes executed by GNU Make - not for the other things that occur during the Tsukurimashou build, nor for other things I might want to analyse in the future - but recording recipes executed by GNU Make would probably be enough for my immediate needs.

Another idea might be to run the build inside an environment that records filesystem access. That's a standard trick used for a number of security and stats-collection purposes: someone will create a modified C library that intercepts file access (or whatever other activities are of interest) and substitute that for the standard one. But it's a lower-level way of doing things, likely to generate more fine-grained data than I wanted. Using a special user-space filesystem with FUSE is a cleaner way to do the same sort of thing, but still at the same granularity level.
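For concreteness, here's a minimal sketch of that library-interposition trick - not something Kleknev uses, just an illustration of the technique. Compiled as a shared object and loaded via LD_PRELOAD, it logs every open(2) call before handing off to the real C library (open64 and friends are ignored to keep it short):

/* interpose.c - compile: gcc -shared -fPIC interpose.c -o interpose.so -ldl
 * use: LD_PRELOAD=./interpose.so make ... */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

int open(const char *path, int flags, ...) {
    static int (*real_open)(const char *, int, ...) = NULL;
    if (real_open == NULL)
        real_open = (int (*)(const char *, int, ...))
            dlsym(RTLD_NEXT, "open");

    int mode = 0;
    if (flags & O_CREAT) {            /* mode argument exists only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }
    fprintf(stderr, "open: %s\n", path);
    return real_open(path, flags, mode);
}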

Then there is BSD process accounting. It turns out that the Linux kernel already contains instrumentation (if built with this feature enabled) allowing it to create a running log of all the processes that terminate and their resource consumption. That's nice because it's totally noninvasive, and it might be reasonable to hope that people have already written the tools to analyse this kind of data instead of my having to create those from scratch. However, it records all processes on the entire host, so I would have to either filter the data to see just my build, do some kind of virtualization or namespace thing to put the build on a separate "host" for the purposes of process accounting, or accept that I'm looking at a profile of everything happening on my machine during the time the build was running and not just the build itself. The existing tools for analysing process-accounting logs all seem designed for intrusion detection. Many "tutorials" even introduce process accounting as "a security feature" as if that were its original and only purpose - the actual history goes back to billing for CPU time on commercial time-sharing systems. And so I would still have to do a lot of work to adapt the analysis tools to the questions that interest me most.

I ended up deciding to create a wrapper around the Bourne shell, /bin/sh. Make (when so configured, and Autotools always configures it this way for certain portability reasons) passes the commands it runs through /bin/sh; so a piece of code that executed in place of the shell would have the opportunity to log everything Make does. Each Make recipe typically corresponds to one invocation of the shell, or just a few, so logging these is almost the same as logging Make recipes; and it's not tied to a particular version of Make or having the source code for Make, and it could in principle be applied to other things that invoke the shell besides Makefiles. If the wrapper is written generally enough, it could even be reasonably applied to other things besides /bin/sh, should they be relevant to a particular subject of analysis. For instance, someone might want to wrap a language interpreter.

Kleknev

The first version of my shell wrapper is checked into the Tsukurimashou source control system. I call it Kleknev - a name chosen by a random name generator, with the aim of not colliding with anything else on the Net. It's not perfect: apparently there exists a piece of Star Trek fan fiction that uses that as a character name. But I'm reasonably hopeful that people will be able to find my project under this name without too many collisions.

When the Kleknev wrapper runs, it checks for an environment variable called KLEKNEV_LOG. If found, it will open a log file of that name (creating it if necessary, serializing access with flock(2) ) and write an entry describing the shell command it's about to run. Then it'll call /bin/sh, write another log entry describing the exit status and resource consumption of the shell, and return to its caller. It sets a KLEKNEV_PID variable to its own process ID when invoking the child, and looks for such a variable in the environment inherited from its parent, logging it if found. That way it will be possible to decode, from the log file, which shells invoked which sub-shells.

The idea is that the Kleknev wrapper can be invoked with exactly the same interface and will have the same effect as /bin/sh; it just invisibly writes log entries along the way. If the KLEKNEV_LOG environment variable doesn't exist, then the wrapper will just exec(3) the shell. The short-circuit case is meant to be efficient enough that I can afford to go ahead and use the wrapper all the time, even when not profiling, without any meaningful performance hit.
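To make that control flow concrete, here is a minimal sketch of a wrapper along these lines. It's reconstructed from the description above, not copied from Kleknev; the real wrapper records more fields, and (as discussed below) writes XML rather than this invented line-oriented format.

/* kleknev-sketch.c - a toy version of the wrapper's control flow */
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void log_line(const char *logname, const char *text) {
    FILE *log = fopen(logname, "a");   /* creates the file if necessary */
    if (log == NULL) return;
    flock(fileno(log), LOCK_EX);       /* serialize concurrent writers */
    fputs(text, log);
    fflush(log);
    flock(fileno(log), LOCK_UN);
    fclose(log);
}

int main(int argc, char **argv) {
    const char *logname = getenv("KLEKNEV_LOG");
    argv[0] = "/bin/sh";
    if (logname == NULL) {             /* short-circuit: just be the shell */
        execv("/bin/sh", argv);
        return 127;
    }

    char entry[4096];
    const char *parent = getenv("KLEKNEV_PID");  /* who invoked us? */
    snprintf(entry, sizeof entry, "start pid=%ld parent=%s cmd=%s\n",
             (long)getpid(), parent ? parent : "-",
             argc > 2 ? argv[2] : "");           /* assumes "sh -c command" */
    log_line(logname, entry);

    char pidbuf[32];                   /* tell our own children who we are */
    snprintf(pidbuf, sizeof pidbuf, "%ld", (long)getpid());
    setenv("KLEKNEV_PID", pidbuf, 1);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t child = fork();
    if (child == 0) {
        execv("/bin/sh", argv);
        _exit(127);
    }
    int status;
    struct rusage ru;                  /* filled in by wait4(2) */
    wait4(child, &status, 0, &ru);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    snprintf(entry, sizeof entry,
             "end pid=%ld status=%d user=%ld.%06ld sys=%ld.%06ld real=%.6f\n",
             (long)getpid(), WEXITSTATUS(status),
             (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
             (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec,
             (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    log_line(logname, entry);
    return WEXITSTATUS(status);
}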

I have built support into Tsukurimashou's build system to check for the existence of the Kleknev wrapper during configuration and automatically add a SHELL= line pointing at it in each Makefile. When profiling a system without that kind of built-in support, I think adding such a line manually would suffice and wouldn't be too onerous. I'm not sure whether putting SHELL= on the command line of configure would suffice; Autotools explicitly sets SHELL=/bin/sh by default, and that may not be overridable from the configure command line.

I had hoped that there would be a command-line setting I could give to Make to have it use a special value of SHELL just for that one invocation and pass it down to recursive Makes. Such an option would allow me to profile a Make run with no changes to the Makefiles and no reconfiguration at all, even when building a package not designed to work with Kleknev. But it doesn't seem to be possible. I don't understand all of Make's variable-passing rules, but it seems I can override the value of SHELL for the top-level Make from its command line; and I may even be able to tell it to pass that variable in the environment to recursive Makes (or doing so may be the default). However, because of special-case handling of the SHELL environment variable, recursive Makes will only look for its value in their own Makefiles and command lines, not in the environment. And making the top-level Makefile and all recursive ones pass SHELL= on the command lines of their children would require editing the Makefiles in the way I was trying to avoid. Since I think the path "/bin/sh" is hardcoded as the default (certainly in Autotools and probably in Make itself), naming Kleknev's wrapper "sh" and putting it in an overriding PATH directory is probably also not going to work. For the moment, since my main goal is to profile Tsukurimashou and I am able to put hooks inside Tsukurimashou, reducing the invasiveness of the profiler to zero is not my top priority anyway.

Some people reject Autotools. They are in a state of sin, but I'd like to make clear that even though I'm using it with Autotools, Kleknev is not specific to Autotools. It would work with any build executed by a version of Make that accepts the SHELL= setting the way GNU Make does. With a little more creativity, it would probably work in a lot of other kinds of systems as well. I'm not sure what other things besides Makefiles need coarse-grained profiling like this, but I hope to be surprised by what users will invent.

My very first version of the wrapper would create a binary log file that was basically just a dump of the structures returned by various data-gathering calls, particularly getrusage(2) and clock_gettime(2). Such files aren't portable, and I anticipated having trouble reading them nicely in the Perl code I would be using to analyse the results, so I also wrote a program to translate the binary log into XML; I figured I could run that piped directly into my analysis program because the XML would be too big to conveniently store on disk. Parsing XML is obnoxious in itself, but since I already had a Perl XML parser as a dependency elsewhere in the project, it made sense to reuse it, and that would spare me having to decode the binary format in Perl. The translation program in C could share code with the wrapper that writes the file, making it easier to keep them synchronized.

I was surprised to find that the XML version of the log file (uncompressed, mind you) was only about 50% bigger than the binary version. Looking at it carefully, I think the reason is that most of the bytes in both the binary and XML formats are the command-line arguments, stored as text in both places with only a minimal increase in length from binary to XML as a result of escaping characters that aren't safe in XML. At least in the Tsukurimashou build, many of the shell commands have a few kilobytes of command-line arguments, and the sum of all of those is large in comparison to the things I was recording that were really binary, like time stamps. Expanding the minority of true binary material to text and escaping all the command-line arguments doesn't blow up the file all that much. The expansion in size from using XML might be more costly on some other system where the command-line arguments tend to be a smaller part of the byte count, but I'm not working on that other system today. The bottom line: I decided to make the wrapper write XML directly to the log file instead of having a separate translator. That way I get portability benefits, including a big step toward being able, in the future, to make this work with builds distributed across multiple hosts and sharing the log file on a network.
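Part of why the expansion is so small: only five characters are special in XML text, so command-line arguments pass through almost unchanged. A sketch of the kind of escaping involved (assumed for illustration; not Kleknev's actual code):

#include <stdio.h>

/* write s to out with the five XML-special characters escaped */
static void xml_escape(FILE *out, const char *s) {
    for (; *s; s++) {
        switch (*s) {
        case '&':  fputs("&amp;", out);  break;
        case '<':  fputs("&lt;", out);   break;
        case '>':  fputs("&gt;", out);   break;
        case '"':  fputs("&quot;", out); break;
        case '\'': fputs("&apos;", out); break;
        default:   fputc(*s, out);
        }
    }
}

int main(void) {
    xml_escape(stdout, "perl -e 'print \"a<b\"'");   /* mostly passes through */
    putchar('\n');
    return 0;
}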

One small, but very annoying, aspect of making the log file XML is that that's not really possible. You cannot append new tags to an XML file and have the result still be valid XML, because of a requirement that every XML document must have exactly one top-level tag. The usual way around this issue is what Kleknev does: declare the log file to be "XML fragments" rather than an "XML document," and allow any number of tags at its top level. On parsing, I have to play some tricks on the parser to feed it a header before and a footer after the log file. What the parser sees is a complete document with one top-level tag and all the contents of the real log file inside that.
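Sketched in C for consistency with the other examples (the real analyzer is Perl, the element name is invented, and parser_feed stands in for a real streaming parser's input function), the trick looks like this:

#include <stdio.h>

/* stand-in for the input function of a streaming XML parser;
 * here it just echoes, so the output is the wrapped document */
static void parser_feed(const char *buf, size_t len) {
    fwrite(buf, 1, len, stdout);
}

static void parse_fragments(FILE *logfile) {
    char buf[8192];
    size_t n;
    parser_feed("<kleknev-log>", 13);        /* synthetic header */
    while ((n = fread(buf, 1, sizeof buf, logfile)) > 0)
        parser_feed(buf, n);                 /* the log file itself */
    parser_feed("</kleknev-log>", 14);       /* synthetic footer */
}

int main(void) {
    parse_fragments(stdin);
    return 0;
}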

Analyzing the log

After running a typical large build for seven and a half hours, I have 56 megs of XML. What happens next?

The next step is to run the XML file through a Perl program that converts it into an ASCII report of what went on during the build. The reporting program is still somewhat experimental. At the moment it includes a lot of Tsukurimashou-specific assumptions that I'd like to generalize, and when designing the whole thing, I kind of got carried away with the line printer mentality. What I ended up with is a program that makes only a single pass through the input, and streams it. Only summary stats are collected along the way, the timeline is written on the fly as it's generated, neither the whole input nor the whole output has to be kept in memory at any time, and so, at least in theory, it should be able to process absurdly large log files.
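The accumulator behind those summary stats is simple; here's a sketch (in C rather than the analyzer's actual Perl) of the one-pass, constant-memory idea behind the n/total/min/mean/max columns that appear in the tables below:

#include <stdio.h>

struct stats { unsigned long n; double total, min, max; };

/* fold one observation into the running statistics */
static void update(struct stats *s, double x) {
    if (s->n == 0 || x < s->min) s->min = x;
    if (s->n == 0 || x > s->max) s->max = x;
    s->total += x;
    s->n++;
}

int main(void) {
    struct stats cpu = { 0, 0.0, 0.0, 0.0 };
    double sample[] = { 0.221, 5.827, 43.241 };   /* made-up CPU times */
    for (int i = 0; i < 3; i++) update(&cpu, sample[i]);
    printf("n=%lu total=%.3f min=%.3f mean=%.3f max=%.3f\n",
           cpu.n, cpu.total, cpu.min, cpu.total / cpu.n, cpu.max);
    return 0;
}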

Here is the start of the report:

==============================================================================
  KLEKNEV 0.1 PROFILING REPORT
==============================================================================
*2013*
Mar10
16:55
:26.30 >=5535:0
       | tetsu
       | grep
       | 0.021000u
       | 0.0060000s
:26.32 \=0.020481r

       >=5540:0
       | tetsu
       | UNKNOWN
       | 0.0050000u
       | 0.0020000s
       \=0.0071926r

       >=5542:0
       | tetsu
       | UNKNOWN
       | 0.0060000u
       | 0.0070000s
:26.33 \=0.0083890r

And because that's a little boring, here's a clip from where it's more interesting:

       | +-<=5953:0      | perl        | gzip        |
       | | | tetsu       | perl        | gzip        |
       | +---------------------------------------------------------<=5954:0
       | | |             | perl        | gzip        |             | tetsu
       | | |             | perl        | gzip        |             | UNKNOWN
       | | |             | perl        | gzip        | UNKNOWN     | 0.0030000u
       | | |             | perl        | gzip        | 0.0030000u  | 0.0010000s
:45.21 | | |             | perl        | gzip        | 0.0010000s  \=0.0047293r
       | | |             | perl        | gzip        \=0.0098877r
       | | |             | perl        | gzip
       | +-------------------------------------------<=5957:0
       | | |             | perl        | gzip        | tetsu
       | +---------------------------------------------------------<=5958:0
       | | |             | perl        | gzip        | perl        | tetsu
       | | |             | 0.045000u   | gzip        | perl        | perl
       | | |             | 0.011000s   | gzip        | perl        | perl
:45.24 | | |             \=0.064465r   | gzip        | perl        | perl
       | | |                           | gzip        | perl        | perl
       | +---------------<=5965:0      | gzip        | perl        | perl
       | | |             | tetsu       | gzip        | perl        | perl
       | | |             | UNKNOWN     | gzip        | perl        | perl
       | | |             | 0.0020000u  | gzip        | perl        | perl
       | | |             | 0.0020000s  | gzip        | perl        | perl
:45.25 | | |             \=0.0056491r  | gzip        | perl        | perl
       | | |                           | gzip        | perl        | perl
       | +---------------<=5967:0      | gzip        | perl        | perl
       | | | UNKNOWN     | tetsu       | gzip        | perl        | perl
       | | | 0.0020000u  | perl        | gzip        | perl        | perl
       | | | 0.0020000s  | perl        | gzip        | perl        | perl
       | | \=0.045612r   | perl        | gzip        | perl        | perl
       | |               | perl        | gzip        | perl        | perl
       | +-<=5969:0      | perl        | gzip        | perl        | perl
       | | | tetsu       | perl        | gzip        | perl        | perl

Times are shown along the left, down to 1/100 of a second accuracy; but note that it only shows a time when an event has occurred. At this point in the timeline, things are happening fast with many processes starting and stopping; later on, when some of the slower tasks are running, the time labels can skip ahead by seconds or minutes at a time. The vertical scale is very nonlinear.

Processes appear as vertical blocks. At the top is the process ID and the host. Builds can run long enough for the process ID space to wrap around, so I attach a colon and a serial number to the process ID to disambiguate different processes with the same ID: the id "5965:0" means the first process 5965 to occur in the log, and there might also be a "5965:1," "5965:2," and so on. The hostname is included for a future version which may work over a network; at the moment, it's unlikely that it will differ between processes in the same log file. The combination of process ID, serial number, and host is supposed to uniquely identify the process within the log file. Note that all of these process IDs, and the parent-child relationships among them, refer to the Kleknev wrapper processes. There may be several layers of other processes (shell invokes Make which invokes jobserver task which invokes compiler and so on) between the nodes and below the leaves of the tree of Kleknev wrappers.
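A sketch of that disambiguation rule (the data structure and bound are made up for illustration; the real analyzer keeps equivalent state in Perl):

#include <stdio.h>

#define PID_LIMIT 4194304            /* illustrative: a typical pid_max ceiling */
static unsigned serial[PID_LIMIT];   /* times each PID has been seen so far */

/* the first occurrence of a PID prints as "pid:0", the next as "pid:1" */
static void print_process_id(int pid) {
    printf("%d:%u", pid, serial[pid]++);
}

int main(void) {
    print_process_id(5965); putchar('\n');   /* 5965:0 */
    print_process_id(5965); putchar('\n');   /* 5965:1 */
    return 0;
}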

At the bottom of the process in the printout, there appears a class identifier (more about which in a moment) that is supposed to identify what kind of task that process was performing, and the user CPU, system CPU, and real (wall) times of the process in seconds. In between there is ASCII art showing the parent/child relationships, and if the class identifier isn't the catch-all "UNKNOWN," then it is repeated vertically to fill the space from top to bottom.

The result is that at a glance I can see what was going on in the build at any given point, and I can see all the parent/child relationships created by recursive Make invocations.

I spent a lot of time on the code that plans the layout of where blocks can and can't appear, horizontally and vertically. I think that doing it nicely (with as few crossing-over lines as possible) is an NP-hard scheduling problem. The much easier compromise I chose (still very complicated, when all the details are in place) isn't quite as pretty as I'd hoped, but still looks interesting.

In retrospect, I maybe shouldn't have put so much effort into the tree layout. Tsukurimashou deliberately avoids the use of recursive Make, as described in my earlier article on the Tsukurimashou build and influenced by Peter Miller's classic paper. As a result, nearly all the time the process tree looks very simple: there is a master Make, it invokes some constant number of parallel processes, and then it replaces them as they expire to keep the system at full load. I had been watching the Autotools "make distcheck" target do a lot of weird nested things and I had the idea that there was a lot of recursion going on behind the scenes in ordinary builds, which my profiler would need to untangle. That turned out not to be the case. But being able to discover that was worth something, the tree layout still looks cool, it might be useful if someone wants to profile a different system that does use extensive Make recursion, and now that it's implemented, there's not much point ripping it out.

Anyway, although the timeline tree makes up almost all the volume of the profile report, it is not really so interesting most of the time. I can check back to it to see what was going on if I have a question about something specific that happened during the build, but for basic profiling the most valuable stuff is at the end: the summary statistics. Here's the first block of those. Note that it's still a work in progress; not everything I'd like to track is included here, nor quite in the form I'd eventually like.

==============================================================================
  STATISTICS BY CATEGORY

Overall real time (start of first to end of last)
grep        1.82E+04  (69%) |--------------grep--------------|
UNKNOWN     2.62E+04 (100%) |--------------------UNKNOWN---------------------|
perl        2.61E+04 (100%) |----------------------perl----------------------|
make        2.61E+04 (100%) |----------------------make----------------------|
gzip        1.85E+04  (70%) |--------------gzip---------------|
fontforge   2.61E+04  (99%) |-------------------fontforge--------------------|
mt1         1.82E+04  (69%) |--------------mt1---------------|
rmo         2210.419   (8%)                                   |---|
assembly    3867.689  (14%)                                   |------|
latex       7856.643  (30%)                                    |----latex----|
kerner      1306.752   (5%)                                            ||
TOTAL       2.62E+04 (100%) |=====================TOTAL======================|

User CPU time per process, including children
class              n            total          minimum     mean  maximum
grep               7   (0%)     0.138   (0%)     0.018    0.020    0.021
UNKNOWN          807   (3%)  9.70E+04  (33%)  1.00e-03  120.192 9.68E+04
perl             196   (0%)    48.219   (0%)  8.00e-03    0.246    3.475
make               2   (0%)  9.68E+04  (33%)     7.039 4.84E+04 9.68E+04
gzip               2   (0%)     2.590   (0%)     1.027    1.295    1.563
fontforge        181   (0%)  1.07E+04   (3%)     0.036   59.335  478.814
mt1            11208  (46%)  6.53E+04  (22%)     0.221    5.827   43.241
rmo            11310  (47%)  8245.315   (2%)     0.047    0.729   12.356
assembly         121   (0%)  6492.729   (2%)    18.061   53.659  267.252
latex              8   (0%)    19.777   (0%)     0.890    2.472    7.475
kerner            60   (0%)  5117.562   (1%)    19.468   85.293  585.799
TOTAL          23902 (100%)  2.90E+05 (100%)  1.00e-03   12.122 9.68E+04

System CPU time per process, including children
class              n            total          minimum     mean  maximum
grep               7   (0%)     0.054   (0%)  4.00e-03 7.71e-03    0.011
UNKNOWN          807   (3%)  1031.670  (34%)     0        1.278 1024.222
perl             196   (0%)     2.148   (0%)  2.00e-03    0.011    0.042
make               2   (0%)  1025.377  (33%)     0.896  512.688 1024.481
gzip               2   (0%)     0.039   (0%)     0.016    0.019    0.023
fontforge        181   (0%)    22.076   (0%)  9.00e-03    0.122    0.420
mt1            11208  (46%)   740.207  (24%)     0.023    0.066    0.359
rmo            11310  (47%)   157.911   (5%)  1.00e-03    0.014    0.069
assembly         121   (0%)    39.432   (1%)     0.163    0.326    0.556
latex              8   (0%)     1.581   (0%)     0.071    0.198    0.501
kerner            60   (0%)     1.638   (0%)  3.00e-03    0.027    0.164
TOTAL          23902 (100%)  3022.133 (100%)     0        0.126 1024.481

==============================================================================

A bunch of conditional code in the log file analyzer looks at the log entries and assigns a class to each one. This is the part that is Tsukurimashou-specific; at the moment it's done by Perl code built right into the analyzer, but I'd like to make it more generic and defined by a configuration file. Anyway, the class at this stage just reflects the general kind of work that was being done, such as "mt1" for generating subfont files with METATYPE1, or "assembly" for assembling the subfonts into full-coverage fonts. If a log entry doesn't match any of the manually-configured regular expressions, the classification code tries a more generic rule designed to identify the command name that was being executed; that's how "gzip" and "perl" end up on the list. If that doesn't work either, the process falls through into the "UNKNOWN" category.
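Transliterated into C with POSIX regex(3) for illustration (the real rules are Perl code, and the patterns and class names here are invented), the cascade looks something like this:

#include <regex.h>
#include <stdio.h>
#include <string.h>

/* illustrative rules only - the real patterns live inside the analyzer */
static const struct { const char *pattern; const char *class; } rules[] = {
    { "mtype1|mp2pf",  "mt1" },
    { "rmoverlap",     "rmo" },
    { "assemble-font", "assembly" },
};

static const char *classify(const char *cmd) {
    regex_t re;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (regcomp(&re, rules[i].pattern, REG_EXTENDED | REG_NOSUB) != 0)
            continue;
        int hit = (regexec(&re, cmd, 0, NULL, 0) == 0);
        regfree(&re);
        if (hit) return rules[i].class;
    }
    /* generic fallback: use the bare command name as the class */
    static char name[64];
    size_t n = strcspn(cmd, " \t");
    if (n > 0 && n < sizeof name) {
        memcpy(name, cmd, n);
        name[n] = '\0';
        return name;
    }
    return "UNKNOWN";
}

int main(void) {
    printf("%s\n", classify("gzip -9 tsukurimashou.tar"));   /* prints "gzip" */
    return 0;
}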

The first block of summary data shows a horizontal timeline of the build, measured in real physical time, with the intervals over which the different classes of process occurred.

Overall real time (start of first to end of last)
grep        1.82E+04  (69%) |--------------grep--------------|
UNKNOWN     2.62E+04 (100%) |--------------------UNKNOWN---------------------|
perl        2.61E+04 (100%) |----------------------perl----------------------|
make        2.61E+04 (100%) |----------------------make----------------------|
gzip        1.85E+04  (70%) |--------------gzip---------------|
fontforge   2.61E+04  (99%) |-------------------fontforge--------------------|
mt1         1.82E+04  (69%) |--------------mt1---------------|
rmo         2210.419   (8%)                                   |---|
assembly    3867.689  (14%)                                   |------|
latex       7856.643  (30%)                                    |----latex----|
kerner      1306.752   (5%)                                            ||
TOTAL       2.62E+04 (100%) |=====================TOTAL======================|

These intervals are very simple. As the label says, they run "start of first to end of last," referring to the processes in each class. It happens that gzip was invoked once for a fraction of a second near the start of the build and once at about the 70% mark; so it gets an interval covering that entire range, even though it was never used in between. Utilities that are used occasionally throughout the build will tend to cover the entire timeline.

However, there are still some interesting things to notice in this chart. We can see that for about two thirds of the computation, it was basically just building subfonts ("mt1"). Then once it finished all of those, it started removing overlaps ("rmo") and doing full-font assembly almost immediately, with the assembly running for a while after the overlap removal finished. I happen to know from reading that part of the log that it actually did only one assembly task right up front and waited until the end of overlap removal before doing the rest, but that doesn't show up in this chart.

Make chooses the priority order of recipes to invoke by rules known only to itself. There is a partial order constraint: if A depends on B, then B must finish before A can start. But other than obeying that kind of constraint, there are no guarantees about order of execution. In Tsukurimashou it happens that there are thousands of files incomparable in the partial order, for which it doesn't matter which order they are built. The "mt1" and "rmo" tasks are almost like that: for any given subfont, we must build it before we can remove overlaps on it, but we could just as well alternate between building subfonts and removing overlaps a few at a time instead of doing all the building first before starting any of the overlap removal. In fact I've seen Make do just that in other builds. I think what's driving its choices is some kind of heuristic like doing first the files that are the greatest number of dependency links away from the final targets; that could be an attempt to minimize overall latency without needing to really solve the hard scheduling problem.

After the horizontal timeline, there follows a block of user CPU times. This probably is the closest thing to a traditional "profile," because it describes the relative resource consumptions of the different categories.

User CPU time per process, including children
class              n            total          minimum     mean  maximum
grep               7   (0%)     0.138   (0%)     0.018    0.020    0.021
UNKNOWN          807   (3%)  9.70E+04  (33%)  1.00e-03  120.192 9.68E+04
perl             196   (0%)    48.219   (0%)  8.00e-03    0.246    3.475
make               2   (0%)  9.68E+04  (33%)     7.039 4.84E+04 9.68E+04
gzip               2   (0%)     2.590   (0%)     1.027    1.295    1.563
fontforge        181   (0%)  1.07E+04   (3%)     0.036   59.335  478.814
mt1            11208  (46%)  6.53E+04  (22%)     0.221    5.827   43.241
rmo            11310  (47%)  8245.315   (2%)     0.047    0.729   12.356
assembly         121   (0%)  6492.729   (2%)    18.061   53.659  267.252
latex              8   (0%)    19.777   (0%)     0.890    2.472    7.475
kerner            60   (0%)  5117.562   (1%)    19.468   85.293  585.799
TOTAL          23902 (100%)  2.90E+05 (100%)  1.00e-03   12.122 9.68E+04

Note that it is, as it says, "including children." Because of Autotools' shenanigans (in particular, the code that automatically reinvokes Make in jobserver mode), there are a couple of layers of extra processes above the entire computation. Almost all processing occurs in shells (and thus Kleknev wrappers) invoked by an instance of Make which is itself inside a Kleknev wrapper, and that is inside another instance of something (possibly Make again, but it was classified as UNKNOWN) that is also inside a Kleknev wrapper. Each second of CPU time is charged to each child and both enclosing parents. This kind of measurement is of some use because it's nice to be able to recognize processes that are parents to a large part of the computation even if they do not themselves do a large part of the computation. However, the multiple counting renders the percentages less meaningful. It would be nice to also have exclusive (not including children) time measurements, and to calculate the percentages in such a way that the total reflects only one counting of each CPU second, even if that means the inclusive percentages will add up to more than 100%.

Edited to add (inadvertently left out in first version): these CPU times come from the "struct rusage" returned by the wait4(2) function. As such, they only count resource consumption of children whose parents have waited for their termination - in other words, zombies are not counted. And the requirement for parents to wait for child termination applies to all intervening parents, not just the Kleknev wrapper itself (which already does wait, because it needs the statistics collected on termination). This matter came to my attention when I noticed some processes with very much less CPU time charged than their real time - raising yet again the age-old profiling question "Where did the time go?" It turned out that, in order to work around TeX-family software's tendency to demand interactive user attention on errors, the processing in those classes was being done inside an expect wrapper, inside a shell, inside a Kleknev wrapper. Expect creates a new process group, and for a while I thought that was the problem: maybe resource usage measurements don't cross process group boundaries. If that were the case it would be difficult or impossible to work around without modifying or discarding expect, but on further research it turned out to be simply that the expect script was not waiting on its child. It was easy to fix by adding a wait command at that level. Failure to wait on the child was a long-undetected bug in the expect wrappers used throughout the Tsukurimashou build system. This issue shouldn't be forgotten, because it's likely to turn up again in the future. If you don't wait for your children, not only do they end up shambling around the process table in a horrible semblance of life muttering "brains... brains... brains," but they may also disappear from profile statistics. This issue also illustrates a difference between process accounting and the statistics collected by Kleknev: a program could hide its resource consumption from Kleknev and other systems dependent on getrusage(2) and friends, but not from process accounting, by deliberately creating zombies. So don't use getrusage(2) for billing or intrusion detection purposes.
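Here's a toy demonstration of the effect, separate from Kleknev: a child's CPU time shows up in the parent's RUSAGE_CHILDREN figures only after the parent has waited for it.

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

static void report(const char *when) {
    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);
    printf("%s: children used %ld.%06lds user CPU\n", when,
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
}

int main(void) {
    pid_t child = fork();
    if (child == 0) {                /* child: burn some CPU, then exit */
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 300000000UL; i++) x += i;
        _exit(0);
    }
    sleep(3);                        /* give the child time to become a zombie */
    report("before wait");           /* its CPU time is not charged yet */
    waitpid(child, NULL, 0);
    report("after wait");            /* now it is */
    return 0;
}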

The numbers tell some interesting stories. Excluding the unknown and global stuff, we can look at the "mean" CPU time column and note that our heaviest processes are the ones classified as "assembly," "fontforge," and "kerner." I think most of the cost of "fontforge" is actually for a second kind of assembly operation applied after kerning; I could test that by adding another more specific classification rule to recognize processes of that kind. The large majority of the computation (after correcting for multiple counting in the percentages) was spent on "mt1"; individual processes in that class ran much faster than the kerning and assembly processes, but there were a very large number of them.

There is a similar table for system CPU time, but it's relatively uninteresting. Let's skip ahead to a later section of the report. The analyzer actually doesn't just give each process one class; it has several schemata for classification and can give the process a class label in each schema. The one called "CATEGORY," above, is the one that appears in the tree timeline. I've also defined one that simply lets the class be the last component of the current working directory when the Kleknev wrapper executed, but in this particular log file it's boring because everything happened in one directory. I intended that schema for other builds where there is a more complicated directory structure. Here's a more interesting section.

==============================================================================
  STATISTICS BY FAMILY (TSUKURIMASHOU MT1)

Overall real time (start of first to end of last)
tsuku       1.82E+04 (100%) |---------------------tsuku----------------------|
jieub       1.47E+04  (80%) |-----------------jieub-----------------|
bll          789.072   (4%)     ||
tsuita      1268.706   (6%)                                          |-|
TOTAL       1.82E+04 (100%) |=====================TOTAL======================|

User CPU time per process, including children
class              n            total          minimum     mean  maximum
tsuku           4896  (43%)  9124.277  (13%)     0.221    1.864   35.435
jieub           2628  (23%)  4.90E+04  (75%)     0.222   18.635   43.241
bll             1236  (11%)  2813.437   (4%)     0.230    2.276   41.915
tsuita          2448  (21%)  4392.772   (6%)     0.228    1.794   20.344
TOTAL          11208 (100%)  6.53E+04 (100%)     0.221    5.827   43.241

System CPU time per process, including children
class              n            total          minimum     mean  maximum
tsuku           4896  (43%)   232.341  (31%)     0.024    0.047    0.264
jieub           2628  (23%)   322.202  (43%)     0.023    0.123    0.267
bll             1236  (11%)    69.856   (9%)     0.027    0.057    0.359
tsuita          2448  (21%)   115.808  (15%)     0.025    0.047    0.145
TOTAL          11208 (100%)   740.207 (100%)     0.023    0.066    0.359

==============================================================================

This "FAMILY (TSUKURIMASHOU MT1)" schema first looks at the class from the earlier "CATEGORY" schema, and if that is anything but "mt1" it assigns a special class value that causes the log entry to be ignored. So in this section we are only looking at things from the "mt1" category of the previous section: subfont generation with Tsukurimashou's bundled version of METATYPE1. Given that the category class was mt1, this schema assigns a label by attempting to guess (from filenames on the command line) which one was built of the several font families in the Tsukurimashou meta-family.

The "TOTAL" bar at the bottom of the horizontal timeline, then, corresponds to the "mt1" bar in the previous section's timeline. We can see that "tsuku" (Tsukurimashou, the mainline Japanese font family) was processed throughout the interval, and "jieub" (Jieubsida, Korean-language) through most of it. There are small bars for "bll" (Blackletter Lolita, a planned-for-some-day blackletter version) and "tsuita" (TsuIta, an italic font family). I'm not sure there's any significance to the times at which Make chose to build those.

Processes of this kind do not have any logged children, so we can believe the CPU time percentages without needing to correct for multiple counting. I think the "mean" and "maximum" columns here are the most interesting part of this whole report. Running down the "maximum" column, note that all the numbers are roughly the same. There's only about a factor of two difference from largest to smallest. Despite the fact that there are drastic differences in the code I wrote for the different families (and I'd have expected Blackletter Lolita to be a fair bit more expensive; earlier versions of it were resource hogs to the point of triggering bugs in Metapost), the most computation-intensive subfonts in each family all cost about the same.

But if you look at the "mean" column, Jieubsida jumps out. The average Jieubsida subfont costs far more than in any of the other families - nearly exactly ten times as much as the average Tsukurimashou subfont - and that pushes Jieubsida to 75% of the total CPU time consumption for subfont generation even though it accounts for only 23% of the wrapper invocations counted in the "n" column.

Here's what I think is going on. I think the time taken to compile a subfont is determined mostly by the number of glyphs in the subfont, and doesn't depend much on what those glyphs actually are. That explains the similar maximum times for all families: there is a 256-glyph limit on subfont size, nearly every family will contain at least one subfont at or near the limit, and so they all have roughly the same maximum time. However, Jieubsida (because it contains far more glyphs, programmatically generated to cover the roughly 11000 syllables of modern standard Korean) has far more of those maximum-size subfonts and they push the average up. The other families are intended for English and Japanese and have fewer glyphs spread across the same number of subfonts, resulting in a smaller average size per subfont.

And that's cool; it is something I learned about my build that I didn't know and had no easy way to discover without the profiler. The fact of time depending primarily on number of glyphs and not on the code that generates them means I don't have to worry too much about writing expensive METAFONT code in the parts that generate the glyphs; and on the other hand, if I want to speed up my METAFONT code I should be looking at the low-level glyph drawing system and the external processing with Perl and other things hidden inside the "mt1" recipe - the parts that depend on glyph count rather than content. Those low-level things apparently eat up most of the processing time. I might have been able to discover that with a lower-level profiler working in the METAFONT language instead of with this high-level approach, but no such profiler exists (unless I write it too), and it would have been hard to know where to focus my attention in the first place.

The system CPU time shows the same effects, but the difference between Jieubsida and everything else is much smaller. Even though more glyphs consume more time across the board, wherever the extra glyph work is really eating CPU, it isn't in I/O.

I have one more report section to comment on. This one is for a schema called "FAMILY (TSUKURIMASHOU FONT ASSEMBLY)," which is just like the previous one but for tasks in category-class "assembly" instead of "mt1."

==============================================================================
  STATISTICS BY FAMILY (TSUKURIMASHOU FONT ASSEMBLY)

Overall real time (start of first to end of last)
tsuku       3867.689 (100%) |---------------------tsuku----------------------|
jieub       1223.974  (31%)                             |----jieub-----|
bll          115.090   (2%)                              ||
tsuita       256.804   (6%)                                            |-|
TOTAL       3867.689 (100%) |=====================TOTAL======================|

User CPU time per process, including children
class              n            total          minimum     mean  maximum
tsuku             49  (40%)  1368.566  (21%)    18.061   27.930   39.734
jieub             36  (29%)  4033.729  (62%)    28.839  112.048  267.252
bll               12   (9%)   341.362   (5%)    24.510   28.447   32.111
tsuita            24  (19%)   749.072  (11%)    24.227   31.211   38.318
TOTAL            121 (100%)  6492.729 (100%)    18.061   53.659  267.252

System CPU time per process, including children
class              n            total          minimum     mean  maximum
tsuku             49  (40%)    16.126  (40%)     0.257    0.329    0.382
jieub             36  (29%)    11.013  (27%)     0.163    0.306    0.556
bll               12   (9%)     4.035  (10%)     0.292    0.336    0.372
tsuita            24  (19%)     8.258  (20%)     0.301    0.344    0.386
TOTAL            121 (100%)    39.432 (100%)     0.163    0.326    0.556

==============================================================================

Remember I said that the build had invoked just one Tsukurimashou "assembly" process near the start of the "rmo" phase, and then waited until the end of "rmo" to do all the other assembly. That is visible on the horizontal timeline here; the "tsuku" bar is shown as starting much earlier than anything else, even though we can see from the CPU time totals that the "tsuku" class only did 21% of the work. Something else jumps out at me about the horizontal timeline for assembly. Let's compare it (top) with the similar timeline for subfont generation (bottom):

tsuku       3867.689 (100%) |---------------------tsuku----------------------|
jieub       1223.974  (31%)                             |----jieub-----|
bll          115.090   (2%)                              ||
tsuita       256.804   (6%)                                            |-|
TOTAL       3867.689 (100%) |=====================TOTAL======================|

tsuku       1.82E+04 (100%) |---------------------tsuku----------------------|
jieub       1.47E+04  (80%) |-----------------jieub-----------------|
bll          789.072   (4%)     ||
tsuita      1268.706   (6%)                                          |-|
TOTAL       1.82E+04 (100%) |=====================TOTAL======================|

The assembly timeline looks just like the subfont generation timeline, except squished over to the right by the extension of the "tsuku" bar. Except for that one weird early invocation of Tsukurimashou assembly, Make seems to be doing the assembly in the same order as subfont generation.

The user time totals show Jieubsida dominating, presumably because of its much greater glyph counts, much as in subfont generation. But the min/mean/max columns show quite different behaviour from what we saw with subfonts. In all families except Jieubsida, the minimum is quite near the maximum. Jieubsida is different, with almost a factor of ten between maximum and minimum; but that is still much less than the factor of roughly 150 between minimum and maximum we saw for subfont generation in all families. In this phase all the fonts being generated for a family have the same glyph count; so we are only seeing style-to-style differences, which are smaller than the differences associated with glyph count and visible in the subfont data. I don't know why Jieubsida has a much bigger (though still smaller than for subfonts) gap between smallest and largest; but I do know that some styles of Jieubsida have very complicated glyphs in them that trigger bad cases in the FontForge spline-geometry code that runs during assembly. Maybe it's simply the case that some styles of Jieubsida are much harder for FontForge to process than others, and that effect doesn't occur so much in the other families.

System time totals in this report section are interesting almost because they're so boring. All the processes seemed to take about the same amount of system time. All the averages are close to each other; and the percentage totals almost perfectly match the process-count totals. I'm not planning to revert to an earlier version just to test this, but I think this system time table would have shown totally different results if I had run the profile a couple of months ago, before I discovered the excessive-save issue that was killing my RAID. Tsukurimashou until recently would save every file hundreds of times during font assembly, which, combined with power supply issues after I moved to my current apartment, had the effect of blowing my hard drives right off their SATA busses. In that context I would expect to see much larger numbers for system time, and more significant differences from family to family because of the differing file sizes.

Final thoughts

This has been a long and complicated exploration of the Tsukurimashou build system, and I hope it'll be interesting to somebody besides me. Even if the Tsukurimashou-specific parts are not, however, I think the idea of coarse-grained profiling for build systems has general applicability. I wasn't able to find a similar pre-existing tool, so this may be the first. I'm hoping to generalize Kleknev into something people can easily use for their own projects.

For more information about Tsukurimashou, see its home page.

2 comments

*
"Since I think the path "/bin/sh" is hardcoded as the default (certainly in Autotools and probably in Make itself), naming Kleknev's wrapper "sh" and putting it in an overriding PATH directory is probably also not going to work."

I freely admit that I'm a lowly beginner in these matters, but in poking around in my own (kubuntu) /bin, I notice that /bin/sh seems to be a link of some sort to /bin/dash (/bin/bash is also extant in a standard installation). Could one option for invoking Kleknev as above be to point /bin/sh to Kleknev, which in turn invokes /bin/bash (or your shell of choice)?
cernael - 2013-07-16 14:11
*
In order to do that, you have to be root, and be willing to run the wrapper for *all* uses of /bin/sh on your entire system, including many that are internal to system administration scripts. That seems pretty dangerous; also, even if one were willing to do it, there'd be the same problem as with BSD process accounting, of how to separate out the thing you're profiling from the other logged activity on the system. Maybe the environment variables that control the wrapper could serve to separate out profiling from non-profiling runs of the wrapped shell; but that might open security holes - if a non-root user can set an environment variable that will eventually be in effect when a root shell runs, then the non-root user may be able to subvert a kleknev-wrapped shell to write to a file at root privilege level. Generally, I was trying to design something that could be run by an ordinary user without needing to do things at the entire-system level.

A related idea, safer but still requiring root privilege, would be to install the wrapper as a replacement for /bin/sh within a chroot jail rather than system-wide, but that's still a lot of work. I'd like for a profiled build to be as easy as possible to do, with as little difference from an ordinary build as possible.

In the actual implementation (which has evolved a little since the version described in this posting) I have strong warnings in the man page for the kleknev wrapper as such, saying that you should NOT replace /bin/sh with it, neither via symlink nor actually copying the binary. That is just because it includes a hardcoded path of /bin/sh as the thing it will wrap; if that happens to be the wrapper itself, then it will be a fork bomb, invoking itself recursively until resources run out. Of course, it'd be easy to hardcode some other path for the thing to wrap (such as /bin/dash), but I'm still not sure it would be a good idea. I do provide another binary called kleknev-argo, which looks at the name under which it was itself invoked, searches for that in the path while being smart enough to skip over itself, and then wraps what it finds. It would be relatively safe to replace /bin/sh with that as long as there was something else named "sh" somewhere in the path. I'd still be worried about security issues, though.

Replacing the default shell on a system-wide basis is a drastic thing to do.
Matt - 2013-07-16 14:58

