This entry summarizes the current status of the MetaPost serial-number bug. This bug is relevant to the Tsukurimashou project, and I need a consistent URL to link to in the documentation when I discuss this issue. I'll update this entry from time to time with the latest news.
Status of different versions of MetaPost
As of 5 October 2011:
- MetaPost 0.641 from the preserved image of my old computer, probably an old teTeX installation: serial numbers overflow at 2^25.
- MetaPost 1.211 from TeXLive 2010: serial numbers overflow at 2^25.
- MetaPost 1.504 from TeXLive 2011: serial numbers overflow at 2^31. This increase in the limit was announced with the release of version 1.501.
- MetaPost SVN revisions between 1485 and 1656 (probably also some earlier versions before 1485, but the number 1656 is exact): segfault immediately. This is because of a minor mistake in an output macro, corrected in revision 1657.
- MetaPost SVN revision 1507, identifies itself as 1.601 (with the diff from 1656 to 1657 backported): last revision before the memory leaks were introduced. Serial numbers overflow at 2^31.
- MetaPost SVN revision 1697, current head, identifies itself as 1.760: memory leaks cause test program to crash after roughly 2^22 serial numbers. It's reasonable to expect the serial number limit would be 2^31 if not for that.
See my postings (1 and 2) on the MetaPost mailing list regarding the memory leaks in the current development version. It looks like it will be very difficult for anyone to fix the memory leaks, and it's not really something I want to attempt absent being asked really nicely by third parties on their own initiative. It appears it may be a better idea for the structures being leaked to not be dynamic structures in the first place, but implementing that is also not within my purview. For the moment, best that I work from SVN revision 1507, the last one before they were made dynamic.
Plan for fixing the serial number issue: keep a doubly linked list of all "independent variable instance" data structures, defined as all live structures that consume serial numbers. This will cost a couple of pointers on each one. Keep this list sorted by serial number. Adding to the list is easy because serial numbers are only assigned in one place, and always assigned in increasing order, so it's easy to always add to the end of the list. Deleting from the list is trickier because it appears variables may lose their "I am an independent variable whose serial number matters" status, let alone their "I exist at all" status, in many places, all of which have to touch the list (else we leak serial numbers or worse have deallocated memory in the list); but still probably not TOO many places.
Whenever we notice that the list happens to be empty, we can safely reset the "next serial number" counter to 1. That should eliminate most instances of the next case: whenever we want to assign a new serial number and the counter is ready to overflow soon, we walk over the list assigning them all new serial numbers starting from 1 and reset the serial number counter to whatever is next. Now the counter cannot overflow because its range multiplied by the size of the data structures it's counting, works out to more than the size of address space (but we should probably still check for overflow and die if it occurs, in case some future implementation runs on mega-huge address space without expanding the counter's range).
What could still go wrong: if there are references to serial numbers built into other data structures elsewhere, which would be broken when we renumber the independent variable instances. I don't think this happens; it appears to me that all references to the objects that have serial numbers are implemented as pointers to the actual objects rather than records of their serial numbers. It still needs to be looked into, though. When a serial number is reassigned, all records of that serial number must also be changed.
What's going on?
MetaPost is based on METAFONT, and inherits from METAFONT its linear equation solving code (which is fundamental to all arithmetic in the programming language it implements). That code, as Knuth originally wrote it, produced results dependent in some cases on the details of how individual machines allocate storage, so that the same programs might produce different output on different machines. In keeping with Knuth's general philosophy that METAFONT's output must be absolutely consistent on all systems, he changed the code to be absolutely consistent. The result was that there was a counter in the code that started at zero and would be incremented every time an "independent variable instance" was created, so that such "instances" could be given unique serial numbers. Basically, a new serial number was consumed every time a variable was initialized. There was no possibility for the counter to ever decrease. When the counter reached 2^25 (about 32 million), METAFONT would start producing incorrect results.
When the issue was discovered, the first fix implemented was to detect the case of the counter reaching 2^25 and make METAFONT terminate with a fatal error in that case. That change seems to have happened around the year 2004. It is quite surprising to me, given what I know about Knuth, that he would consider it an acceptable thing to do in released code. We can reasonably guess that better responses to the problem must be difficult if he didn't implement one. This situation has the important theoretical consequence that METAFONT can no longer execute arbitrary algorithms: the running time of any METAFONT program is limited to a constant, which in practice turns out to be a couple minutes of execution on a current desktop machine. But never mind whether it was a good idea or not, this was what was done.
By the time of that first fix, it was the case that MetaPost existed, and had inherited the issue from METAFONT. The solution implemented for METAFONT was also implemented in MetaPost.
Some time later, changes were made to increase the limit from 2^25 to 2^31. That is an increase by a factor of 64. So METAFONT and MetaPost programs with the new change could run for a few hours, instead of a few minutes, before crashing hard. It has taken a long time for this change to diffuse through the community. It's only present in relatively recent versions of METAFONT and MetaPost, and some of the very most recent versions of MetaPost also contain fast memory leaks which cause them to crash anyway, much sooner than the serial-number issue would.
I want to implement serious algorithms that could run for many hours, in MetaPost. A limit of 2^31 is not enough for me, I need it to be really unlimited. It also offends my sensibilities as a theoretical computer scientist for any finite limit of this nature to exist - and I would expect someone like Knuth to feel the same way, which is why I'm quite surprised that this issue has existed for seven years, or ever existed at all.
At present it appears the best I can do is take the most recent version of MetaPost to NOT have the major memory leaks, and attempt to remove the execution-time limit from that. At such time as I have something that works, I can try posting a patch here and/or attempt to get the official maintainers to adopt it; but the main priority is for me to have a version that works, myself.