
Testing with Autotools, Valgrind, and Gcov

Sat 4 Feb 2012 by mskala

I only have limited faith in software testing, partly because of my lack of faith in software engineering in general. Most professionally-written code is crap, and the more people use "methodologies," the worse their code seems to be. I'm inclined to think that the best way to remove bugs from code is to not put them in in the first place. Nonetheless, writing tests is fun. It's an interesting way to avoid doing real work, and some of you might enjoy reading about some test-related things I tried on a couple of my recent projects.

Autotools built-in testing

I recently released the first version of IDSgrep, which is a search program for finding Japanese kanji based on their visual structure (outline of the concepts in an earlier posting). It's a smallish C program (about 1500 lines) but it implements a complicated algorithm. The algorithm is original - no other software does what IDSgrep does, so there's no "known good" standard I could compare to - and especially because much of the input data is itself buggy and under development, correctness is important. I want to know, if IDSgrep produces weird output, that it's correctly reflecting the weirdness of its input and not introducing errors of its own. Although it was written originally for my own use, I'd also like to encourage others to use IDSgrep, so it's important too that it should be able to handle whatever nonsense other users subject it to; and if they think it's buggy, it's important that there should be a good way for them to convey to me what they think is wrong. These factors add up to a situation where some automated testing might be appropriate.

Note that what I'm about to describe is not included in the version 0.1 release linked above; if you want the actual code from this article, you'll have to either wait for the next version or check out from SVN. That shouldn't be a big obstacle to the intended audience.

IDSgrep has a typical GNU Autotools build system, and Autotools has some built-in support for testing. This is primarily handled by Automake in particular; if you give it the appropriate instructions, it will include support for a "make check" target in the generated Makefile, which will invoke whatever tests you give it. I think this support is relatively recent. My Automake on my home machine is currently version 1.11.1, which is three years old but they seem to update slowly. (Bleeding-edge seems to be 1.11.3, semi-coincidentally released three days ago.) Searching the Web for information on testing with Automake turns up a lot of old discussion of "Hey, wouldn't it be nice if Automake included test support?"; much of the third-party documentation for Automake seems to predate the actual existence of such features. As of now, they do exist and are documented in the manual, but they don't seem to be much talked-about anywhere else but in the official manual. That's one reason for this article.

Automake as of version 1.11.1 can implement "make check," if at all, in one of three ways. These basically correspond to three increasingly complicated sets of driver scripts that you can ask to have built into your Makefile. There's a "simple" driver; a "parallel" test driver (which actually provides several other special features beyond just being "parallel"); and the "DejaGnu" test driver, which requires an additional set of tools from another package. I chose the "parallel" test driver for IDSgrep.

Requesting parallel test support in the first place is easy, given that I already have an Autotools build system; it just involves adding a couple more items to the line in configure.ac that initializes Automake.

AM_INIT_AUTOMAKE([foreign parallel-tests color-tests])

It had formerly just said "foreign" (which selects a certain level of GNU coding standards uptightness; the default level seems to be "panties-in-a-bunch"). The new item "parallel-tests" selects the parallel test driver script, and "color-tests" selects output in ANSI colour. With these options selected, the Autotools-generated Makefile will contain a test harness. However, it's still necessary to actually give it some tests.

Unfortunately, the Automake manual doesn't really make very clear exactly how the test harness is supposed to interface with the tests. The bits and pieces of the interface are spread throughout the chapter on testing without being collected together clearly in one place. I'm going to try to gather them here, but this is partly based on my own experiments and may not be complete.

Fundamentally, the function of the test harness is to run all the things listed in the TESTS variable and total up how many of them were "successful," while giving the user a pretty display summarizing the results. If the things listed in TESTS were not all "successful," it complains, and it records log files of their output (standard output and standard error) for purposes of haruspicy. The things listed in TESTS should (at this level) all be the names of executable files, but more on that later. In IDSgrep, they are all shell scripts in the test/ subdirectory of the build. In theory they could be compiled programs built by Makefile rules. IDSgrep's actual setting of the TESTS variable is a little more complicated, because of a Gcov layer discussed later in this article; without that layer, it would look like this.

TESTS = \
  test/andor test/anynot test/basicmatch test/bighash test/demorgan \
  test/equal test/kvg-grone test/messages test/spacing test/tsu-grone \
  test/unord test/utf8 test/vgneko

Tests communicate with the harness by return code (the small integer passed to the caller by the "exit" command in the shell script). The harness recognizes at least these values:

  • 0 - Test succeeded.
  • 77 - Test was skipped; it was impossible or inappropriate to run this test, for instance because it requires external software that wasn't installed, or it tests an optional feature that was disabled by configuration option. Such tests appear with a cyan "SKIP" notation in the results; they are counted separately from successes, but they don't cause the overall result to be "FAIL" as a regular failure would.
  • 99 - Test failed in an epic way! The driver provides a feature for specifying some tests on which a "failure" result is considered good (and a "success" result bad); the return code 99 is supposed to be for tests so designated that were epic failures, so they should actually count as failures even though you said that you wanted to count failures as successes. I don't think this feature is a good idea and won't discuss it further in this article.
  • any other non-zero values - it appears that all other nonzero values are counted as the ordinary kind of test failure. I've been using 1 for this purpose and it seems to work.
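As a concrete illustration of the convention, here's a minimal skeleton (the file name "needed-data" and the script name are made up for the demonstration) that exercises both the SKIP and the PASS paths:

```shell
# Hypothetical skeleton of a test script using the return-code convention;
# "needed-data" stands in for whatever prerequisite the test depends on.
cat > skeleton-test <<'EOF'
#!/bin/sh
if test '!' -r needed-data
then
  exit 77    # prerequisite missing: report SKIP, not FAIL
fi
exit 0       # the real checks would go here, exiting 1 on failure
EOF
rm -f needed-data
sh skeleton-test; echo "without data: exit $?"
touch needed-data
sh skeleton-test; echo "with data: exit $?"
```

Under the harness, the first invocation would show up as a cyan "SKIP" and the second as a "PASS."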

Meanwhile, the standard output and standard error of each test is saved (regardless of whether the test's result was pass or fail) to a file named by the TESTS item plus ".log"; so for instance test/andor's results go in "test/andor.log". In the event that the overall result was FAIL, the test harness also generates a "test-suite.log" file in the top directory which contains the concatenation of these log files for the failed tests. The log files are overwritten when you run "make check" again. It's this log file handling that is the main difference between the "parallel" test driver and the "simple" test driver; with "simple," stuff just goes to Make's standard I/O, and so multiple tests running at once can get their output mixed together. Some of the other features of "parallel" are consequences of the fact that Make can process the test log files as targets.

One thing I'm not thrilled about with this system is that it wants to have a separate executable file for every test; on a system of even moderate complexity there may be hundreds or more of individual tests and I don't really want to litter my distribution with tiny files. It might be possible to get around this issue by going deeper into the code and creating a "log file compiler" such that my "tests" would be different command-line arguments passed to a single test program, rather than different executable shell scripts. But for IDSgrep, I just rolled multiple small, related tests into each script. Then if one fails, I can dig into the log file to figure out exactly which case failed. Here's a typical example (sorry about the unusual Unicode characters - they are fundamental to the function of IDSgrep and it wouldn't be a good test without them).

#!/bin/sh

if test '!' -r kanjivg.eids
then
  exit 77
fi

./idsgrep '...&⿰日?⿰?月' kanjivg.eids > dem-out1.$$
echo === >> dem-out1.$$
./idsgrep '...|⿰日?⿰?月' kanjivg.eids >> dem-out1.$$

./idsgrep '...!|!⿰日?!⿰?月' kanjivg.eids > dem-out2.$$
echo === >> dem-out2.$$
./idsgrep '...!&!⿰日?!⿰?月' kanjivg.eids >> dem-out2.$$

if diff dem-out1.$$ dem-out2.$$
then
  rm -f dem-out?.$$
  exit 0
else
  rm -f dem-out?.$$
  exit 1
fi

The script starts by checking that the KanjiVG-derived dictionary file, which is needed for this test, actually exists. Someone might well compile IDSgrep without that, and in such a case, the test returns the magic 77 return code causing a "SKIP" result. It then runs the IDSgrep command-line utility four times with four different queries, to generate two different output files; and it runs "diff" on the output files and returns success if they are the same, failure if they differ. Finally, it cleans up the files it created.

Not to get into too much technical detail on IDSgrep, because this isn't an article about that, but the point of this particular test is to verify that the Boolean query operators obey DeMorgan's Laws: "X AND Y = NOT ((NOT X) OR (NOT Y))" and "X OR Y = NOT ((NOT X) AND (NOT Y))." Queries implementing each of those formulas are evaluated and it checks that the result sets are the same on both sides of each equation. It could be fooled if other things went wrong (for instance, if the output were just empty in all cases!), but other test scripts look for that.

One thing to note is the use of $$ to get the shell's process ID; that prevents a collision should two copies of this test happen to execute at once, which is especially important because we're contemplating a parallel build.
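A quick way to see why this works: $$ expands to the running shell's own process ID, so two concurrently executing copies of a script will pick different names (the "scratch" prefix here is invented for the demonstration).

```shell
# $$ is the current shell's PID; a second, simultaneous shell has its own.
here="scratch.$$"
there=`sh -c 'echo "scratch.$$"'`   # the name a second shell would pick
echo "$here"
echo "$there"
test "$here" != "$there" && echo 'no collision'
```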

I didn't bother trying to write my test shell scripts for absolute maximum portability (Autoconf-style "portable shell" as mediated by M4). I don't think it's worth doing that under the circumstances, and there's a lot to be said for keeping the complexity of the individual tests down as far as possible so as to spend time fixing bugs in real code and not in the test scripts. Thus, I assume that, for instance, Perl (which is a prerequisite of IDSgrep anyway, for other reasons) is named "perl" and is in the path.

Memory checking with Valgrind

The simple Automake-mediated tests described above are good for checking basic correctness: my users can configure the package, run "make check," and it'll run the search program with a bunch of different inputs and check that the output is basically what it should be; if those tests pass, then the program can probably be said to "work." For development purposes, though, I'd like to examine the software in a bit more detail.

IDSgrep is written in C both in fact and in spirit, and it does some tricky things with pointers in the name of efficiency and programmer convenience. Some kinds of pointer-related problems may not be very visible to users, especially not on small test cases, so we could run into a situation where the program is technically incorrect but nobody notices until one day when they run it on a larger-than-usual workload and everything comes crashing down around their ears. Memory leaks often have that profile. It's easy to allocate some memory, forget about the pointer to it, and then that memory is lost until the program terminates. You can get away with it as long as your program is only running on small cases, but then when it runs longer than usual or when other factors cause the leak to be faster than usual, it fails. Much like nurses counting the needles at the end of a surgical procedure to be sure none are forgotten inside the patient, it'd be nice to run our code for testing purposes in an environment where we track all the memory and can be sure we don't lose any.

Valgrind (last syllable pronounced like "grinned," not "grind"; it refers to the main entrance of Valhalla in Norse legend) provides something like that. Its general function is to run a program in an emulated environment where individual "tools" (of which there are several available) can peek and poke at the program's innards during and after execution. The default Valgrind tool, which many people take as synonymous with Valgrind itself, is something called Memcheck.

Memcheck tracks every bit of the program's state as "valid" or "not valid," in order to provide warnings when the program's behaviour becomes non-deterministic as a result of dependence on the value of uninitialized data. This is actually an implementation of something like the Colour concept I've written about before. Keeping a bit of metadata for every bit of data, even with aggressive data compression, seems like a lot of work, but Memcheck's authors put forward the claim that this level of accuracy is needed to correctly handle some things that C programmers do, such as copying a structure that may contain alignment gaps. If you flag an error whenever the code reads uninitialized data, you get a false positive as soon as you read the alignment gap. But if you consider data to be "initialized" as soon as it's written, you get a false negative if you read the copied, but never actually initialized, alignment gap in the copied structure. Really getting it right requires keeping track of the initialized-or-not Colour in a way that moves with the data as the data moves in and out of memory and the CPU. That's all fine; I'm sure it's worth testing and it'll protect from silly errors, but in fact I haven't gotten much use from this particular check yet with IDSgrep. I think uninitialized data errors aren't all that common in skilfully-written code, and the compiler will catch many of them anyway, despite its less-accurate model for doing so.

Memcheck's memory-leak detection is more interesting for me in the case of IDSgrep. As well as keeping track of validity of data, Memcheck also tracks all allocations and deallocations of memory blocks. If you do something like freeing a block that was already freed, running off the end of an array, or trying to dereference a pointer into West La-Laland that hasn't been allocated, it can complain, but that may not be too exciting: we could already find many such problems with GDB and segmentation faults, and Electric Fence does a more thorough check. Memcheck's specialty is that it records the call stack every time a block is allocated. At the end of the program, it traverses all the data structures to figure out which blocks are still accessible. Any that aren't, seem to have been leaked, and it can give you a report of all those call stacks, which is usually a great clue to where the memory was lost.

The post-run memory traversal process involves heuristics to figure out which values "look like" pointers, and it's easy to imagine theoretical problems with that. You can get a false negative from an integer that happens to be equal to the integer value of what would be a valid pointer, and you can get a false positive if the program is representing pointers in some other way than the standard C pointer type (for instance, hiding bits of extra data in the low bits of unaligned pointers) or with pointers into the middle of blocks (as some custom allocation layers might do). Nonetheless, the whole thing works surprisingly well in practice.

In the context of the IDSgrep test suite, what I'd like to do is have one of my tests be a Memcheck leak test. It'd be possible to imagine something more elaborate - even to the point of running the entire regular test suite inside Memcheck - but for my purposes (and especially because Valgrind is slow, what with all the virtualization) I just want to have one test with reasonable coverage that runs inside Memcheck, and I want that to be considered a success if there are no leaks. More detailed leak-testing I'll leave for the maintainer (myself) to do manually when it's appropriate.

All I needed to do to create this kind of test for IDSgrep was write another script (the test/vgneko script mentioned in the previous section's example) and include it in the package and the TESTS variable. Here are the contents of that script:

#!/bin/sh
if which valgrind > /dev/null 2> /dev/null
then
  echo '【猥】⿰⺨<畏>⿱田?【猫】⿰⺨<苗>⿱艹田' > vgneko-in.$$.eids
  valgrind --leak-check=full \
    ./idsgrep '&!⿰?畏&⿰?*⿱?田...⺨' *.eids 2> vgneko-out.$$
  cat vgneko-out.$$
  if test "x`grep 'ERROR SUMMARY: 0 errors' vgneko-out.$$`" = x
  then
    rm -f vgneko-in.$$.eids vgneko-out.$$
    exit 1
  else
    rm -f vgneko-in.$$.eids vgneko-out.$$
    exit 0
  fi
else
  echo "Can't find valgrind, skipping test."
  exit 77
fi

This script checks that there actually is a program named "valgrind" in the path; otherwise it skips the test with the return code 77 described earlier. It creates a small temporary file giving idsgrep some data to chew on, then runs it with the prefix "valgrind --leak-check=full" on the command line. That means the program actually being run is Valgrind, with the default Memcheck tool; the remainder of the command line is then taken as a command which Valgrind will run in its emulated environment. Afterward, the script uses cat to dump Valgrind's output (which it had captured to a file) to standard output so that Automake's driver can re-capture it into the log file; and it checks for a line indicating termination with no detected errors, to trigger the pass or fail result. All pretty straightforward.
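The grep-based pass/fail decision can be seen in isolation on canned output; these summary lines imitate the "ERROR SUMMARY" line Memcheck prints at the end of a run (the file names and PID are invented):

```shell
# Canned Memcheck-style summary lines, to show how the
# "ERROR SUMMARY: 0 errors" check behaves in each case.
echo '==1234== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)' > vg-clean.txt
echo '==1234== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)' > vg-dirty.txt
if test "x`grep 'ERROR SUMMARY: 0 errors' vg-clean.txt`" = x
then echo 'clean run: FAIL'
else echo 'clean run: PASS'
fi
if test "x`grep 'ERROR SUMMARY: 0 errors' vg-dirty.txt`" = x
then echo 'dirty run: FAIL'
else echo 'dirty run: PASS'
fi
```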

An alternate way of applying Memcheck might be to make use of Automake's "LOG_COMPILER" feature. Automake is capable of recognizing a configurable list of file extensions on the items in TESTS; and for each of them, you can specify a "compiler" (which is in most examples not actually a compiler at all) to be used as a prefix to create the command line to run the test. For instance, you could specify "PL_LOG_COMPILER = perl" to run all *.pl files in TESTS with Perl. The variable LOG_COMPILER is applied to test names with no extension. If you set that to "valgrind --leak-check=full" then you'll implicitly get Memcheck leak-checking on all the tests. However, that wouldn't work in the case I actually described, where the test executables are shell scripts, because it would be leak-checking the shell rather than the software that's supposed to be under test. A more elaborate scheme would be needed to actually make that idea work; I'm mentioning it only as a possible option that someone might want to explore.
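For concreteness, a hypothetical Makefile.am fragment using this mechanism might look like the following (the test names are invented; this is not code from IDSgrep):

```makefile
# *.pl tests in TESTS run under perl; extensionless tests run under valgrind
TEST_EXTENSIONS = .pl
PL_LOG_COMPILER = perl
LOG_COMPILER = valgrind --leak-check=full
TESTS = test/parse.pl test/match
```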

Thread testing with Helgrind

I set up a similar Automake-mediated test suite for another of my projects, which is a package called "ecchi." Unfortunately, at this point I can't share the code for ecchi nor even tell you specifically what it does; it's a piece of software I'm developing that is part of a currently-unpublished academic research project. But the hope is that at some point in the future when the paper is published I'll be able to release it, and in the meantime, even within my project there's a need for portability, hence Autotools. It's also the case that ecchi will be used for some very long-running computations, and because it's research it's pretty important that the results actually be correct, so a lot of the testing stuff described above is relevant too.

My reason for telling you about ecchi at all is that ecchi has a special characteristic that creates many opportunities for naughty misbehaviour compared to IDSgrep: unlike IDSgrep, ecchi is heavily multithreaded. It's meant to run on clusters of SMP compute servers and use as many cores simultaneously as I can convince the University of Manitoba to let me have. It's really easy to imagine that I could forget to lock some piece of code that should have been atomic and cause a race condition, or grab two locks in the wrong order and cause a deadlock condition, and due to the nature of the problem being solved this might not happen until three weeks into a month-long computation, and it would be a really bad scene.

Valgrind provides two tools called Helgrind and DRD ("data race detector") for analysing thread problems. Both operate in a roughly similar way to Memcheck: they run the program under test, watch the things it does, and complain if it breaks any rules for what a good program ought to do. The difference is that instead of enforcing rules on memory access and allocation, Helgrind and DRD enforce rules on thread stuff, like locking shared data before accessing it.

The exact way they work is mired in all the complicated theoretical stuff we slept through in concurrency class, and I don't really understand some of the details. From my point of view as a logic programmer, I'm inclined to think of it generally as theorem proving. There are some logical statements that can be made about the properties a program should have if it will work correctly in a multithreaded environment (e.g. "If two threads access the same data, then we are always able to determine which one does so first") and the tools attempt to prove that those statements are true.

The difference between the two seems to be in exactly which theoretical models they are reasoning over. Helgrind seems to be a more mature product, and to be based on general rules about code executing before other code. The name refers to the entrance of Hel with one ell, which is the place in the Norse afterlife where chosen Wikipedia editors have an eternal revert war over whether there should or shouldn't be a "see also" link to the Christian Hell with two ells. Or possibly the goddess thereof, if she can be said to be distinct from her domain, that being the subject of another revert war. DRD is closer in nature to Memcheck plus threading tests, and it is focused specifically on race conditions involving memory addresses. In my own project I found Helgrind to be more useful - it found bugs that DRD didn't - but I have no way of knowing whether that would be generally the case. Interestingly enough, Helgrind's manual cautions the reader that it doesn't work well on condition variables, and ecchi uses condition variables extensively; nonetheless Helgrind did seem to work well for diagnosing bugs in ecchi's use of condition variables. It found some bugs that were genuine bugs, and didn't seem to flag any non-bugs. Maybe I was using my condition variables carefully enough that they didn't trigger undesirable behaviour, or maybe the tool is just good.

One issue I did have with it, which I don't think relates to condition variables, had to do with thread-local storage. I think what was happening was that when I start a thread, the system (thread library with compiler and linker special-case support) creates a block of thread-local data for it. When the thread terminates, that block is disposed of, and can be re-used by a newly-created thread. I sometimes got race conditions reported that involved the thread-local data. Reordering some statements in a way that should have had no effect caused the warnings to cease. More study is needed, but it looked to me like what might have been happening was that the tool saw "Old thread wrote to this address... new thread read from it... they don't synchronize with each other... RACE!" without noticing that the new thread only had access to that data at all by virtue of the termination of the old one.

Much can be said about how to use the output of Helgrind, but given the structure I've already described, actually running it is easy. I've created a couple of test scripts that exercise different features of ecchi likely to trigger multithreading issues, and the scripts to run a test under Valgrind/Helgrind look very much like the one I already showed for running a test under Valgrind/Memcheck. If I wanted to do one with Valgrind/DRD, it would be similar again. Here's a sample, the script test/hgoverlap. Note that the check for termination with zero errors is just the same as with Memcheck; that's a nice consequence of their both reporting results through the Valgrind framework.

#!/bin/sh
if which valgrind > /dev/null 2> /dev/null
then
  input-generator | valgrind --tool=helgrind ./ecchi -o 2> hgoverlap.$$
  cat hgoverlap.$$
  if test "x`grep 'ERROR SUMMARY: 0 errors' hgoverlap.$$`" = x
  then
    rm -f hgoverlap.$$
    exit 1
  else
    rm -f hgoverlap.$$
    exit 0
  fi
else
  echo "Can't find valgrind, skipping test."
  exit 77
fi

Checking test coverage with Gcov

How do we know when we have enough test cases? One common target is full coverage of source lines: when we run all the tests in the suite, every line of source code that generated some code in the executable should have executed at least once. That doesn't necessarily mean it all executed correctly, because the test cases might not have detected a wrong result; but at least if control has flowed through every line, we have some assurance that our tests exercise all the intended features of the software. And this is an easy, objective criterion to evaluate as a meta-test of the test suite.

Valgrind provides some tools (specifically, a couple of profilers) that could be applied to checking source-line coverage even though their main function is to gather statistics for performance optimization. For IDSgrep I chose instead to use Gcov, partly because I wanted to learn how it worked. Gcov is a GNU tool that instead of running the program under test in an emulator, uses hooks in the C compiler to make the program collect its own statistics. Then Gcov proper is a separate tool that analyses the statistics. Overall, the flow is very much like profiling with Gprof: you compile with special flags to GCC, run the program, then run the Gcov tool.

Because it requires special (and messy and inefficient) compile flags, this isn't just something I can stick in a simple shell script to test the normal build. Instead I added an option to configure. When I want to do coverage analysis - presumably rather less often than I would run the other kinds of tests - I'll run "./configure --enable-gcov ; make clean ; make check." The "make clean" is to be sure there are none of the old non-instrumented object files kicking around; everything has to be rebuilt from scratch with coverage instrumentation.

The first step is to add a few lines to configure.ac to create the new command-line option for configure.

AC_ARG_ENABLE([gcov],
  [AS_HELP_STRING([--enable-gcov],
    [use Gcov to test the test suite])],
    [],
    [enable_gcov=no])
AM_CONDITIONAL([COND_GCOV],[test '!' "$enable_gcov" = no])

That's reasonably standard Autoconf stuff. When the user chooses the new --enable-gcov option to configure, the COND_GCOV Automake conditional will become true, and we can test the conditional with code in the Makefile.am to enable the special features needed for coverage checking. Here is the relevant code, which in the actual Makefile.am is spread out with other things in between. The order isn't terribly critical and Automake will probably reorder it anyway.

if COND_GCOV
   MAYBE_COVERAGE=--coverage --no-inline
endif

AM_CFLAGS := $(MAYBE_COVERAGE) $(AM_CFLAGS)

MOSTLYCLEANFILES = \
  idsgrep.aux idsgrep.log idsgrep.blg idsgrep.bbl idsgrep.toc \
  *.gcda *.gcno *.gcov

GCOV_TESTS = \
  test/andor test/anynot test/basicmatch test/bighash test/demorgan \
  test/equal test/kvg-grone test/messages test/spacing test/tsu-grone \
  test/unord test/utf8

define GCDEP_RECIPE
$1.log: test/rmgcda.log

endef

if COND_GCOV

  TESTS = test/rmgcda $(GCOV_TESTS) test/gcov

  $(foreach test,$(GCOV_TESTS),$(eval $(call GCDEP_RECIPE,$(test))))

  test/gcov.log: $(foreach test,$(GCOV_TESTS),$(test).log)

else
  TESTS = $(GCOV_TESTS) test/vgneko
endif

The first direct effect of COND_GCOV is to set the MAYBE_COVERAGE flag to a couple of extra options for GCC. The --coverage option is the main one: it tells GCC to compile the code with extra instrumentation, in particular a counter attached to every source line and incremented when it executes, and further code to write those counters to *.gcda files when the program terminates. The --no-inline option turns off function inlining, and in particular, the automatic inlining of small functions. That's because when a function is inlined, the compiler normally also generates a non-inline "actual function" version to be used at such times as when you ask for a pointer to the function. This actual function may later be pruned out by the linker if you don't in fact need it, but when the compiler first runs it can't reliably predict that. What I found was that for some of my functions that got inlined automatically, Gcov ended up thinking that there were never-executed source lines at the very first line of the definition, and possibly also at the closing bracket at the bottom, corresponding to the never-called actual function version of the function and its stack frame setup and teardown. Turning off inlining prevents Gcov from complaining about the not-very-interesting fact that the function is only called in its inline form.

The extra CFLAGS options go in a separate variable of their own, concatenated with the previous value of AM_CFLAGS in the line "AM_CFLAGS := $(MAYBE_COVERAGE) $(AM_CFLAGS)." Note the colon on the equals - using just a regular equals would make AM_CFLAGS call itself recursively, with undesirable effects. This thing of putting the added flags in a separate variable is standard Automake usage and covered in its manual. Worth mentioning is that many people want to also specify "-O0" when compiling a coverage-check version, on the theory that optimizing compilation's dead-code elimination will skip generating code for source lines that it knows don't execute (such as the "foo" in "if (0) { foo(); }"), and then you won't be warned that they don't execute because there was no code for those lines anyway. I'm not sure about that. I had trouble figuring out how to override the default -O2 that appears in CFLAGS, short of committing the Automake sin of actually touching the CFLAGS variable (which is supposed to be reserved for the user). I also think that to the extent possible I'd like to do my coverage check on the same code that I'm testing in other ways, optimizations and all. But it's quite possible that in the future I'll change this to disable optimization rather than disabling just inlining.
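The colon's effect is easy to demonstrate outside Automake (this sketch assumes GNU Make; the variable contents are made up):

```shell
# With plain "=", the right-hand side is expanded lazily, so the variable
# ends up referring to itself; GNU Make detects the loop and stops.
cat > selfref.mk <<'EOF'
AM_CFLAGS = -O2
AM_CFLAGS = --coverage $(AM_CFLAGS)
all: ; @echo $(AM_CFLAGS)
EOF
make -f selfref.mk 2>&1 | head -1

# With ":=", the old value is expanded once, at assignment time.
cat > snapshot.mk <<'EOF'
AM_CFLAGS = -O2
AM_CFLAGS := --coverage $(AM_CFLAGS)
all: ; @echo $(AM_CFLAGS)
EOF
make -f snapshot.mk
```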

The next thing relevant to Gcov is the MOSTLYCLEANFILES variable, listing files that should be deleted by "make mostlyclean" and friends. Most of those files listed were there already; the new additions for Gcov are "*.gcda *.gcno *.gcov," corresponding to the profusion of Gcov-related files created by the compiler (*.gcno), the program under test (*.gcda), and the gcov analysis utility (*.gcov).

Not every test in the former TESTS variable should actually be invoked when we're checking coverage. Running a program that has Gcov instrumentation inside Valgrind's additional instrumentation seems to be a bad idea. With Memcheck it'll at least be terribly slow, and as for Helgrind, it appears that the Gcov instrumentation is, itself, not thread-safe. So when I tried running ecchi with Gcov instrumentation inside Helgrind, it showed a race condition for every line executed until it had a thousand of them and then it gave me a snarky message about how it wasn't going to bother telling me about any more and I should fix my bugs before imposing on it again. Does this mean Gcov should not be used on multi-threaded code? I'm not sure. At the very least, its results should probably be taken with a grain of salt.

The tests that should be run inside the Gcov check - all of them except those that can't or shouldn't be run that way - are listed in GCOV_TESTS. When COND_GCOV is false (normal testing) then TESTS gets set to GCOV_TESTS plus the extra tests that don't work with Gcov, which in this case means the single script test/vgneko (mentioned earlier - it's a Valgrind test). If Gcov isn't being used, we're finished here.

When --enable-gcov has been chosen, I want to run all the things in GCOV_TESTS but also a couple of special Gcov-specific "tests": test/rmgcda (which runs "rm" on all the "gcda" files) and test/gcov (which actually invokes the gcov analyser). Removing the gcda files is important because the instrumentation is cumulative, as it should be to accumulate counts over multiple runs of the program under test in the test suite. The idea is that at the start of the coverage check I want to start with zero counts (implied by no gcda files existing); then after that has been set up, I want to run all my tests; then after they have all finished I want to run the gcov analyser to actually compute the coverage by checking that every line was covered.
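Putting those pieces together, the TESTS logic might be sketched like this; the tests named in GCOV_TESTS here are hypothetical stand-ins, but test/rmgcda, test/gcov, and test/vgneko are as described in the article:

# Makefile.am fragment (sketch)
GCOV_TESTS = test/parse test/match test/output

if COND_GCOV
# zero the counts first, run the tests, analyse coverage last
TESTS = test/rmgcda $(GCOV_TESTS) test/gcov
else
# normal testing: everything, including the Valgrind test
TESTS = $(GCOV_TESTS) test/vgneko
endif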

Bear in mind that this is all done with Automake's "parallel" test driver. In general, tests could run in any order or all at once. For the moment I'm not going to worry about whether the non-thread-safe Gcov instrumentation might also be non-thread-safe with regard to the file system, causing fun and games if two tests terminate and try to write to the cumulative statistics files simultaneously. But I do have to force test/rmgcda to run first (before any others start) and test/gcov to run last (after all others finish). These constraints turn into Make dependencies. GCDEP_RECIPE is a multiline Make variable defined with "define", which gets called inside a $(foreach) loop to generate, for every test in GCOV_TESTS, a recipe saying "the log file for this test depends on test/rmgcda.log." That forces test/rmgcda to be first. In the other direction it's a little easier: I just say "The file test/gcov.log depends on all the log files of tests in GCOV_TESTS." It may be possible to streamline this code further - ideally to where it can avoid using any GNU-specific features of Make. I'm still playing with it; but the code does seem to work pretty well as shown.
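A sketch of what that ordering machinery might look like, assuming (as the parallel test driver does) that each test test/foo leaves its result in test/foo.log, and using the GNU-specific define/call/eval/foreach features mentioned above:

# Makefile.am fragment (sketch; GNU-make-specific)
# every Gcov test's log depends on rmgcda's, so rmgcda runs first
define GCDEP_RECIPE
$(1).log: test/rmgcda.log
endef
$(foreach t,$(GCOV_TESTS),$(eval $(call GCDEP_RECIPE,$(t))))

# the analyser's log depends on all the others, so it runs last
test/gcov.log: $(GCOV_TESTS:=.log)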

The test/rmgcda pseudo-test is very simple; it looks like this:

#!/bin/sh

rm -f *.gcov *.gcda
exit 0

That deletes the files and always succeeds. On the other end, test/gcov is a bit more complicated because of another issue I haven't mentioned yet. When we think of source line coverage by tests the natural thing is to say we want to test every line of code. How many bugs would we like to have in our code? Zero bugs! But it's not that simple. Some lines of code may actually correspond to "should never happen" (SNH) errors - things that by design of the program ought to be impossible. We shouldn't be able to cover those lines with test cases - if we can, something is seriously wrong.

One example in IDSgrep has to do with match function results. The matching algorithm involves a bunch of special features each of which is implemented in its own function. There is a higher-level general matching function that calls a lower-level specific matching function via a pointer, and the higher-level function doesn't actually know the complete list of lower-level functions it might have to deal with. It just knows to call the pointer. This is something like OOP's "polymorphism" achieved in a non-OOP language. Lower-level functions return some of their findings by setting a value in an enumerated-type field which is supposed to have five different possible values. So the higher-level function takes that field value and uses it in a switch statement, and for reasons of safety, it doesn't trust the lower-level function to actually limit its results to the five official values. There is a "default" clause in the switch, which executes if the lower-level function returns an invalid value. In such a case it'll print an error message and immediately terminate. But that Should Never Happen - if there is no serious bug in the other code, there will be no input to the overall program that will actually cause that line to execute, and so there should exist no test suite consisting of inputs and outputs to the overall program that will achieve 100% coverage.

One way to deal with that sort of thing would be "unit" testing. I said no test suite consisting of inputs and outputs to the program could test the SNH lines, so let's create a test suite that does not consist of inputs and outputs to the program. Instead of testing the entire program, we might write a separate stub program that links to the higher-level matcher and calls it with a pointer to a made-for-the-purpose badly behaved lower-level function that will return a bad result and trigger the SNH error. I'm not enthusiastic about doing that, especially in a program as small as IDSgrep, because it means writing even more code that isn't directly useful and just exists to get around the formal requirement of 100% coverage. I know damn well that "puts("complaint");exit(1);" does what it should, and even if it doesn't, the only case where it would matter would be a case where I'd have much bigger things to worry about. That shouldn't necessitate writing another entire program, even a stub one. I suppose another way to avoid this problem would be to just live dangerously and never check for SNH errors at all. After all, if the code will never execute, it's dead code and maybe shouldn't be included to bloat the executable. But there's more to missing coverage than SNH errors. For instance, there may be code that can execute under circumstances we could create, but only with a lot of expense and inconvenience. "Printer on fire" might be an example - we want to have code that runs in that circumstance but we don't want to actually set the printer on fire every time we run tests.

Something they often do in industry is choose a random number less than 100 and (usually) more than 90, and say that that percentage of test coverage will be considered good enough, thus accommodating a reasonable amount of untestable code. That's why Johnny can't read, but it at least does have the big advantage of being something objectively measurable even if there's no sensible way to say what the acceptable percentage should be. For IDSgrep I chose a different route: I'm going to insist on 100% coverage of lines that are not manually marked as SNH, and 0% coverage of the lines that are so marked. So there's no specific overall number of uncovered lines I will tolerate, but for every uncovered line, I have to have thought about it and decided on purpose that there's a good enough excuse.

When the gcov analysis utility runs it produces a summary of the percentage coverage for each source file, but that's looking at all lines and thus not really what I want. I ended up writing a Perl script to parse the *.gcov files and compute coverage the way I want it computed. That script is a little long to include verbatim here, but you can download it from the SVN repository. It scans the *.gcov files, which include every C source line along with its execution count, and computes two figures: the percentage of executed lines among those that do not include "/* SNH */" or "// SNH", and the percentage of executed lines among those that do. Those should be 100% and 0% respectively. It prints a little report of percentage coverage for each file and overall, and passes a return code to the shell script saying whether the goal was achieved.
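For reference, each line of a *.gcov file has the form execution_count:line_number:source text, where a count of "#####" flags an executable line that never ran and "-" marks a line containing no executable code. A hypothetical fragment, showing an SNH line correctly left uncovered:

        7:   41:  if (r == BAD_RESULT) {
    #####:   42:    puts("complaint"); exit(1); /* SNH */
        -:   43:  }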

The test/gcov script (which, remember, will be run last by "make check" when --enable-gcov was selected) is shown below. It runs gcov, then runs check-coverage; if check-coverage succeeds, it returns success to the driver, and if not, it runs grep to find all the lines that weren't executed and all the lines that were marked SNH whether executed or not, and dumps those lines with windows of context into the log file. The idea is that if I have 100% coverage adjusted for SNH, the test suite will say "PASS"; if not, it'll say "FAIL" and I can look in the log to get some clues to where the coverage was inappropriate.

#!/bin/sh

( gcov -r *.c || gcov *.c ) > /dev/null 2> /dev/null
if perl -CS ./check-coverage *.gcov
then
  exit 0
else
  grep -E -A3 -B3 '(#####)|(SNH)' *.gcov
  exit 1
fi

One other small thing worth mentioning is that the above script tries to run gcov with the -r option (for ignoring system header files, basically) first, and then if that fails, it'll run gcov with no options. I first wrote "gcov -r" in the script, following the documentation for gcov, but then I discovered that on some systems including many of mine, the installed gcov is old enough not to support that option. Even if I can upgrade my own system many other people will not have done so yet, so this will fall back to not using the option when it's not available.

And there you have some notes on testing with Autotools, Valgrind, and Gcov. Happy hacking.

UPDATE: Subsequent to writing the above, I've found out the hard way that the Gcov profile-analyser tool has undesirable behaviour when it is run with multiple source filenames and there is not a *.gcno "node" file for every specified source file. This can occur, for instance, when a source file exists but doesn't get compiled due to local configuration choices, like a copy of "getopt.c" supplied for systems that need it but with the build set up to use the system's own getopt where possible. It didn't affect IDSgrep but it did affect ecchi after I changed some other things. In such a case, when Gcov sees the source file that has no node file, it not only produces an error message but aborts processing other files, so you end up with "No executable lines" messages and 0/0 coverage reported for many files containing code that certainly should have, and probably did, actually execute. A workaround is to call the Gcov analyser on just one source file at a time, replacing this:

( gcov -r *.c || gcov *.c ) > /dev/null 2> /dev/null

with this:

for srcfile in *.c
do
  ( gcov -r $srcfile || gcov $srcfile ) > /dev/null 2> /dev/null
done

It seems like maybe we should apply it to *.h files as well as *.c files, since those can certainly contain executable code, but in my tests they seem to end up getting included automatically, so it may not be necessary.

1 comment

Jeremy Leader
Very interesting. When I was at Yahoo Search Marketing our team used an autotools-based build system called Skeletor that did a lot of stuff like this, though with a different slant. Our code was mostly broken up into libraries, each of which was built seperately. Each library had its own unit tests, and we used some odd preprocessor idioms to mock-out called functions in the C code under test. Skeletor supported "make test", "make coverage", and "make memcheck" targets, which ran the unit tests without fancy tools, under a coverage analyser (can't recall which one), and under valgrind memcheck (I think we also had a target for addrcheck, which apparently no longer exists). I think our coverage analyser had built-in support for an /* unreachable */ comment. The goal was to run every reachable line of code under memcheck.

I thought Skeletor had been open-sourced, but I can't find it. The only reference I found on the web is a historical description in http://code.google.com/p/fwtemplates/wiki/FramewerkIntro. Framewerk is apparently another iteration of the same idea, by some of the same developers. Jeremy Leader - 2012-02-08 13:47

