Code refactoring by combinatorial optimization
Mon 5 Dec 2011 by mskala
Tags used: compsci, programming, 作りましょう

I encountered an interesting problem on the Tsukurimashou project recently, and some inquiries with friends confirmed my suspicion that if anyone has solved it, they've done it in a language-specific way for ridiculous languages. It appears I have to solve it again myself. Here are some notes.
If you haven't been keeping up with my earlier postings: Tsukurimashou is a family of fonts for Japanese (and Korean, and most European languages including English). Partly because this is how I'm more comfortable working, and partly for certain technical benefits, the way I'm building the fonts is not by actually drawing pictures of what the glyphs should look like, but by writing a piece of computer software that, when it runs, draws the pictures for me. One consequence is that it's then very easy to create different font styles by making changes to the common subroutines in the software; I don't have to go through and re-draw every character by hand.
The font-generating software is written primarily in a language called MetaPost, which is a variant of Knuth's Metafont, the font-generating software associated with TeX and LaTeX. There's also a layer of GNU Makefiles, Perl, Prolog, and FontForge scripts, that takes care of assembling the output of the MetaPost code into modern-format font files. Most of that is not relevant to the present discussion.
What is relevant is that it's a pretty big project for one person. The current code base (counting only MetaPost code) is about 41000 lines and covers about 700 kanji (the Chinese-derived characters that form a big part of the Japanese writing system). If we assumed that the code base scaled linearly with number of kanji, the planned scope of this project extends to about 5000 kanji, which would be about 300000 lines of code. Many commercial software packages are bigger than that, but it's substantial for a one-person hobby project.
I remember a manager I once worked with telling me that his estimating rule for programmer productivity was ten lines of code per programmer per working day; at 250 working days a year, that's 2500 lines per programmer per year, so 300,000 lines would come to about 120 programmer-years of work - the entire careers of maybe four or five people, and quite a lot for one person to do alone. Of course that productivity estimating rule doesn't apply to me, and the number of lines of code doesn't have to scale linearly with the number of kanji, but Tsukurimashou is still a big task, especially since I'm not getting paid for it. The important consequence is that workflow and organizational issues really matter. In particular, if I spend a lot of time waiting for builds, it's a problem. And some of the build tools I'm using are slow, having been chosen for important reasons other than speed.
Fonts are not very modular
The basic way we deal with big software projects is by breaking them into smaller pieces that don't connect to each other's internals too much. That has been true ever since the days of "structured programming." "Object-oriented programming" in the more recent past, and whatever kids are calling it these days, are just new faces on that basic approach of splitting big things into little things. It only sort of works with fonts.
For sure, you can split the character range and say "This part over here is Latin, as used to write English; that part is hangul, used to write Korean; and this other part is kanji, used to write Japanese." There isn't too much sharing of code between Latin and hangul. But within a range, it's just a huge mess, partly because the highest-level organization of my source code is driven by Unicode and it doesn't seem practical to rearrange that. My high-level code is structured into what I call "page" files each corresponding to an aligned block of 256 Unicode code points. Each of those calls macros defined in other files to generate graphic elements that are often shared among many pages.
For instance, there's a graphic element that looks like 女. By itself that is the character U+5973 meaning "woman," defined on page (hexadecimal) 59. But it's also used on page 65 to build U+6570 数 meaning "number." And it is part of U+597D 好 ("like"), which in addition to being a character itself is invoked on page 60 as the upper part of U+604F 恏 (no meaning specified). (I am glossing over some subtleties having to do with alternate forms, in order to simplify the exposition.) There is some, but not a lot of, tendency for macros to be used primarily in specific code point ranges.
I have to follow Unicode's page division, and Unicode's page division derives indirectly, by way of a series of committee decisions, from traditional Chinese character indexing schemes that were cutting-edge database technology roughly 1900 years ago. Some commonly-used shapes, like 女, appear all over the place. On the other hand, basically all the code points from U+5FC3 to U+6207 (nicely covering all of pages 60 and 61, a little over 500 kanji though not all of them are eligible for inclusion in my project) correspond to characters featuring the "heart" radical, which looks like 心, or its alternate form which looks like 忄; that's a well-behaved cluster. In summary, the set of pages that use a given macro is more or less random - more random for some macros, less random for others.
If a macro is used on only one page, fine, it can live in the page file. Then when I modify that macro, I just rebuild that one page, and I'm done. GNU Make can track that with no problem by looking at the time stamps. For macros used on multiple pages, I need to build a library of macros in one or more other files, and then import the libraries (which in MetaPost is done by simple file inclusion) in the relevant page files.
Libraries create their own issues
First problem, already solved: when I change a page file, the selection of library files it requires may change. It's bad if it includes library files it doesn't need, because then when I change those, I have to rebuild the page file unnecessarily. But it's worse if it doesn't include a library file it does need, because then the build fails. That's just dependency tracking and we think we understand it. The solution in this context (since it's not C or some other language with popular standards for dependency tracking) involves a Perl program that scans all the source files, figures out where every macro is defined, and edits the include lines at the top of each file to make sure it depends on exactly the things it should. I invoke that by hand when I need it, so it doesn't have to run on the majority of changes that do not affect the dependency structure. Other code extracts the dependencies and notifies Make when things change, so that I'm basically free to use macros wherever I want, and shift them among different libraries, and not worry too much about maintaining the dependencies; I just have to run the auto-dependency checker either when I think I've changed the dependencies, or when the system complains about not finding things.
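For what it's worth, the scanning part of that tool is nothing deep. Here is a rough sketch of the idea in Perl - not the actual Tsukurimashou script; the lib/ and page*.mp file layout, the "def"/"vardef" definition syntax, and the way it rewrites "input" lines are assumptions for illustration only:

#!/usr/bin/perl
# Sketch of an auto-dependency scanner.  Assumes MetaPost-style "def foo" /
# "vardef foo" definitions, "input" lines for inclusion, and a hypothetical
# lib/*.mp and page*.mp layout; the real tool is more careful than this.
use strict;
use warnings;

my %defined_in;   # macro name => library file that defines it

# pass 1: record where every library macro is defined
foreach my $lib (glob('lib/*.mp')) {
   open my $fh, '<', $lib or die "$lib: $!";
   while (<$fh>) {
      $defined_in{$1} = $lib if /^\s*(?:var)?def\s+([A-Za-z_][A-Za-z0-9_.]*)/;
   }
   close $fh;
}

# pass 2: for each page file, work out exactly which libraries it needs
# and rewrite its block of input lines (a simplification of the real edit)
foreach my $page (glob('page*.mp')) {
   open my $fh, '<', $page or die "$page: $!";
   my @lines = <$fh>;
   close $fh;
   my %needed;
   foreach my $line (@lines) {
      next if $line =~ /^\s*input\b/;          # don't count the old includes
      while ($line =~ /([A-Za-z_][A-Za-z0-9_.]*)/g) {
         $needed{ $defined_in{$1} } = 1 if exists $defined_in{$1};
      }
   }
   my @kept = grep { !/^\s*input\s+lib\// } @lines;
   my @inputs;
   push @inputs, "input $_;\n" foreach sort keys %needed;
   open my $out, '>', $page or die "$page: $!";
   print $out @inputs, @kept;
   close $out;
}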
The new problem is that my libraries are too big. It's often the case that I change one macro in a library and try to rebuild. Obviously, any page file that actually uses that macro has to be rebuilt. But what if I change, for instance, 五 (meaning "five")? That's used in a few other characters (such as 語 meaning "language"), so I have to rebuild all the page files that use it. Fine. Trouble is, "five" happens to be defined in the same library as "woman," so having touched that file, I now also have to rebuild all the page files that use "woman" because one of their dependencies has changed, even if they don't use "five." Very quickly I end up with a situation where any change anywhere triggers a rebuild on all the page files, and it's exacerbated by the basically random arrangement of macros and page files. Using even one thing in a library renders a page file dependent for build-system purposes on everything in the library even though it may not use much else in the library. I need some kind of finer-grained dependency tracking if I want to avoid a lot of superfluous rebuilding.
One way to avoid the problem, of course, would be to just put every macro in its own file. Touching one then doesn't affect any other, and page files only get rebuilt if one of their real dependencies changes. I'd rather not do that because it means a profusion of little tiny files, many of which only contain five or ten lines of live code. My current code base has about 1000 macros, which would reasonably grow at a rate comparable to the number of lines of code. Having several thousand separate tiny source files would incur big overheads in disk storage, version control, and dependency tracking. It'd mean a lot of annoying switching between files as I edit, a necessity for some appropriate directory and naming conventions more elaborate than what I'm using now, and so on.
Even just such a simple thing as copyright notices would become a disproportionate issue. My current practice is to put a 28-line copyright notice in every source file (longer than many projects use because mine is GPL3 with font-embedding exception, a relatively complicated license), which in the extreme of one file per macro would mean significantly more than half the lines, and probably more than 80% of the bytes, in my code base would just consist of copyright notices. I think having most of the code base consist of copyright notices is what kids these days call a "code smell"; of course I could change my current practice of including the full notice in every file, but it's not clear that I should, and that's not the real issue.
So instead of doing that, I want to split the libraries into smaller chunks, but not so small as one library per macro. My current divisions, mostly splitting at the Japanese government's grade-level boundaries, lead to between 20 and 200 macros per file. (That will break down once I'm seriously working on the high-school level characters where there are a couple of single grade levels containing more like a thousand.) It seems like it would be better to aim for between 5 and 50 macros per file, roughly four times as many files each one quarter the size.
The question is: what is the most useful way to divide up a big library into several small libraries?
Factoring as a mathematical problem
I'm sure you can imagine my displeasure when I looked up code factoring on the Web and discovered that kids these days, having forgotten everything invented before Java (or Ruby, in extreme cases) and apparently never having paid attention in high school, now believe that "factoring" means exactly the same thing as "improvement." An encyclopedia article I won't link to actually cites a Reliable Secondary Source defining the word that way, and emphasizes that factoring should be an evolutionary process consisting of tiny changes. Bullshit! Factoring, in the context of code, means making large non-local changes that specifically consist of placing together shared sub-structures that would otherwise be repeated. That's what the word means. It doesn't just mean "making things better" and certainly not "making things better via small local changes."
In mathematics, if you have an expression like xa²+9b²y-6abx-6aby+9b²x+ya², it looks complicated and it's hard to deal with. If you examine that expression carefully, you can figure out that it is equal to the expression (x+y)(a-3b)², which is much easier to understand. Instead of a bunch of apparently random things added together, it's a few small and simple things multiplied together. "Things multiplied together" are called "factors," so the process of looking at a complicated expression and figuring out how to write it in the form of things multiplied together is called "factoring." That's the original meaning of the word. It is reasonable to apply it by analogy to what I want to do with my libraries: look at the current mess of each page file invoking macros from all over the place, and group the similarly-used macros together so as to simplify the overall structure. Finding and merging repeated or similar structures is the best meaning to give to the term "code factoring."
Some of my sources claim that there do exist tools to do what I want, but they're all specific to ridiculous languages. That's a shame, because it seems to me to be a very abstract problem that shouldn't be specific to any programming language. I don't really expect anyone to have designed serious engineering tools just for MetaPost - it's quite possible that my project is (or will be when finished) the most complicated single artifact ever built with MetaPost - but I was hoping for a general solver of the underlying combinatorial problem, which I could feed with a simple language-specific parser that extracts the necessary data from the source code. It looks like I have to write the solver myself, not just the parser.
The abstract problem I think needs to be solved looks something like this:
Given a bipartite graph with parts A and B and an integer k, partition A into a disjoint union of k pieces, so as to minimize the number of pairs (a,b) where a is a vertex in A, b is a vertex in B, a is not adjacent to b, but b is adjacent to some other vertex c in the same piece as a.
The idea here is to minimize the number of unnecessary dependency links. The vertices in A are the macros; the vertices in B are the page files; adjacent vertices mean "this page file uses that macro"; the pieces in the partition of A are the library files. If a page file b uses a macro a, then b must include the library that contains a. If that library also contains some other macro not used by b, it's an unnecessary dependency link; and assuming all macros are touched with equal probability, the number of unnecessary rebuilds is proportional to the number of unnecessary dependency links. Of course the real probabilities are not uniform, and not all rebuilds are equally costly, but uniformity seems like a reasonable approximation and a more precise weighting is probably not worthwhile.
That's a simple math problem. It is likely to be NP-complete (these optimization kinds of things tend to be), but especially since I only require a good solution, not provably the very best solution, it seems reasonable that I should be able to have a piece of software to solve it.
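The objective, at least, is easy to compute for any candidate split. Here's a sketch in Perl (the language of my other support scripts); the data structures are hypothetical - %uses maps each page file to the list of macros it calls, and %library_of is a candidate assignment of macro names to library numbers:

# Count the unnecessary dependency links for a candidate assignment:
# pairs (page, macro) where the page does not use the macro but does
# include the library the macro was put into.
sub wasted_links {
   my ($uses, $library_of) = @_;   # page => [macros used], macro => library
   my $waste = 0;
   foreach my $page (keys %$uses) {
      my (%used, %included);
      foreach my $macro (@{ $uses->{$page} }) {
         $used{$macro} = 1;
         # libraries this page is forced to include; skipping unassigned
         # macros lets this work on partial assignments too
         $included{ $library_of->{$macro} } = 1
            if defined $library_of->{$macro};
      }
      foreach my $macro (keys %$library_of) {
         $waste++ if !$used{$macro} && $included{ $library_of->{$macro} };
      }
   }
   return $waste;
}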
It should be reasonably clear, to a programmer at the level where this kind of tool would be necessary, how the basic solver, if it existed, could be applied to actually solve the practical problem. In general, there would be a tool (probably along the lines of a Perl script) that would make a list, for every page file, of all the macros it uses. Then the solver would split the list of macros (or just those currently in a particular library we're interested in subdividing) optimally into lists for the smaller new libraries; and then some other tool outside the scope of the solver could take those lists and actually re-arrange the code to make the smaller libraries. The solver itself should be generic, not tied to any particular programming language, and from a computer science perspective, the internals of how it works are much more interesting than the nitty-gritty details of the language-specific scripts that apply it to the code base.
Solving the combinatorial problem
I don't have a good solver yet. But I spent most of yesterday afternoon playing with some ideas using ECLiPSe-CLP, which is an extended Prolog interpreter that provides a nice prototyping environment for solvers of these kinds of problems. Please note it is no relation to the integrated development environment called "Eclipse." Below is my current draft solver; understand that it is only a draft, not even yet in a condition to be checked into the Tsukurimashou public code repository, and it's changing fast and I'm not proposing that anyone should use it or even spend much time trying to understand it in detail. I'm just posting it so you can get an idea of what this kind of solver looks like in this language. Note that this version is only 90 lines, a third of which are the copyright notice. So, yeah, nine days of work.
%
% Code factoring optimizer
% Copyright (C) 2011 Matthew Skala
%
% This program is free software: you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation, version 3.
%
% As a special exception, if you create a document which uses this font, and
% embed this font or unaltered portions of this font into the document, this
% font does not by itself cause the resulting document to be covered by the
% GNU General Public License. This exception does not however invalidate any
% other reasons why the document might be covered by the GNU General Public
% License. If you modify this font, you may extend this exception to your
% version of the font, but you are not obligated to do so. If you do not
% wish to do so, delete this exception statement from your version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
% GNU General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program.  If not, see <http://www.gnu.org/licenses/>.
%
% Matthew Skala
% http://ansuz.sooke.bc.ca/
% mskala@ansuz.sooke.bc.ca
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

:-lib(ic).
:-lib(ic_global).
:-lib(ic_sets).
:-lib(lists).
:-lib(branch_and_bound).

:-local struct(model(bigfiles,smallfiles,nslices,macros,
   fsets,mvars,objective,othervars)).

factor_macros(Files,NSlices,Result):-
   build_model(Files,NSlices,Model),
   apply_constraints(Model),
   do_factoring(Model,Result).

build_model(F,N,M):-
   sort(F,FS),
   M=model{bigfiles:Bigfiles,smallfiles:FS,nslices:N,macros:Macros,
      fsets:Fsets,mvars:Mvars},
   setof(Mc,Fi^(defines_macro(Fi,Mc),memberchk(Fi,F)),Macros),
   setof(Fn,Mc^(uses_macro(Fn,Mc),memberchk(Mc,Macros)),Bigfiles),
   (foreach(_,Bigfiles),foreach(Fv,Fsets),param(N) do intset(Fv,1,N)),
   (foreach(_,Macros),foreach(Mv,Mvars),param(N) do (Mv#::1..N)).

apply_constraints(M):-
   % every bigfile contains all the macros it should, and get a list of others
   M=model{bigfiles:Bigfiles,macros:Macros,
      fsets:Fsets,mvars:Mvars,objective:Objective},
   (foreach(Bf,Bigfiles),foreach(Fv,Fsets),
    fromto([],IL,OL,VL),param(Macros,Mvars) do
      (foreach(Mc,Macros),foreach(Mv,Mvars),
       fromto(IL,ILL,OLL,OL),param(Bf,Fv) do
         (uses_macro(Bf,Mc)->Mv in Fv,ILL=OLL;
            in(Mv,Fv,DMV),[DMV|ILL]=OLL))),
   % objective is to minimize the sum of those others
   sumlist(VL,Objective),
   % impose a symmetry-breaking constraint
   (foreach(Mv,Mvars),fromto(0,IL,OL,_) do
      (Mv#=<IL+1,OL#=max(Mv,IL))).

do_factoring(M,R):-
   M=model{objective:Objective},
   bb_min(label_model(M),Objective,M,MO,_OO,
      bb_options{strategy:dichotomic,factor:0.7}),
   R=MO.

label_model(M):-
   M=model{mvars:Mvars,fsets:Fsets,objective:Objective},
   (foreach(Fv,Fsets),foreach(Cv,CL) do #(Fv,Cv)),
   append(CL,Mvars,VL),
   search(VL,0,most_constrained,indomain_random,dbs(3,bbs(5)),[]),
   label_sets(Fsets),
   indomain(Objective).

label_sets([]).
label_sets([H|T]):-
   insetdomain(H,increasing,small_first,in_notin),label_sets(T).
This is a constrained optimization problem. We have some variables whose values describe the problem (such as which macros are in which library files); we choose values for the variables subject to rules (constraints) like "every page file must include all the libraries in which it uses any of the macros"; there is a variable called the objective whose value, by virtue of the constraints, represents how much a given solution "costs"; and we want to find a choice of values for all the variables such that all the constraints are satisfied and subject to that the objective is minimized.
ECLiPSe-CLP allows a number of different types for the variables; in this problem I've chosen small integers and sets of them as the basic building blocks. This code establishes the variables:
build_model(F,N,M):-
   sort(F,FS),
   M=model{bigfiles:Bigfiles,smallfiles:FS,nslices:N,macros:Macros,
      fsets:Fsets,mvars:Mvars},
   setof(Mc,Fi^(defines_macro(Fi,Mc),memberchk(Fi,F)),Macros),
   setof(Fn,Mc^(uses_macro(Fn,Mc),memberchk(Mc,Macros)),Bigfiles),
   (foreach(_,Bigfiles),foreach(Fv,Fsets),param(N) do intset(Fv,1,N)),
   (foreach(_,Macros),foreach(Mv,Mvars),param(N) do (Mv#::1..N)).
That looks at the "defines_macro/2" predicate, which is supposed to come from the parser that analyses the code base, and finds all the macros that are defined in the library files whose names are listed in the argument F. Typically I would query this with a short list of library files (often just one) and the idea is it would logically merge those files, then split the resulting list of macros optimally into the appropriate number of new libraries. Most often I'd expect to be splitting just one library, but it's reasonable I might instead be re-arranging a few libraries that were previously split. After making the list of macros, it makes a similar list of "big files," which are the page files that call the macros.
Then the actual modelling part creates the variables of the abstract problem. For each macro, there is an integer variable with a range from 1 to N, whose value will name which of the split-up libraries that macro should go into. Obviously, every macro has to end up in exactly one of the libraries. For each big file, there is a variable whose value is a mathematical set of integers in the range 1 to N. That describes the libraries included by the big file. There will also be an objective variable describing the cost of the assignment, but it happens to be created implicitly by some later code that references it, so it isn't mentioned here.
Once I have the variables in place, I have to establish the constraints between them. Those are fairly simple in this particular problem, and are established by the apply_constraints/1 predicate. First, for every pair of a big file and a macro, if the big file uses the macro then the big file must include the library file that contains the macro. The code to establish that looks like this:
% every bigfile contains all the macros it should, and get a list of others
M=model{bigfiles:Bigfiles,macros:Macros,
   fsets:Fsets,mvars:Mvars,objective:Objective},
(foreach(Bf,Bigfiles),foreach(Fv,Fsets),
 fromto([],IL,OL,VL),param(Macros,Mvars) do
   (foreach(Mc,Macros),foreach(Mv,Mvars),
    fromto(IL,ILL,OLL,OL),param(Bf,Fv) do
      (uses_macro(Bf,Mc)->Mv in Fv,ILL=OLL;
         in(Mv,Fv,DMV),[DMV|ILL]=OLL))),
This code is a little more elaborate than just creating those constraints, because along the way it also iterates through macros not used by the current big file. Each of those gets added to a list called VL, which is used by the next constraint:
% objective is to minimize the sum of those others
sumlist(VL,Objective),
VL contains a list of boolean (integer equal to either 0 or 1) variables that were implicitly created above. Each one corresponds to a case of a macro not being used by a specific big file; the variable is 1 if and only if the macro is nonetheless included in a library that the big file includes. That's the bad case we're trying to minimize. So the sum of all of those boolean variables is the number that we want to make as small as possible, and that's exactly what the sumlist/2 call establishes.
At this point, we have put code in place that enforces that the variables, if they have values, must express a feasible (not necessarily optimal) solution or partial solution to the assign-macros-to-files problem. Prolog operates in a somewhat nonlinear way (it is not a procedural language); what happens next is that we can try assigning values to variables, and if we ever break a rule, the interpreter will detect that we did so and refuse to continue. Instead it will "backtrack" and force our code to choose a different alternative.
The apply_constraints/1 predicate also contains a couple more lines of code to do symmetry breaking, but that's a more advanced topic I may talk about later. The basic search doesn't require it.
There's a lot of theory for how to solve constraint problems in this kind of framework. For a brief introduction, you might like to read the slides and listen to the audio for lecture 10 of the third-year algorithms course I taught last Summer. (I can't promise those links will last forever, but they'll probably be valid until U of M offers the course again.)
The general idea of backtracking search is that we're going to assign values to the variables until we have a complete solution. While doing that, we build a stack of the assignments we've made. If we ever break one of the rules, then we have to go back and change one of the previously-assigned variables, popping the stack as we go. If we let it run without limit, it will eventually visit every feasible solution, so while we're going along we just keep track of the best one seen so far and when we're done, it must be the best solution possible.
That takes too long. A typical problem size might be 80 macros splitting into four libraries, so the per-macro variables are 80 integers in the range 1 to 4, a total of 160 bits of information. Any choices for those variables will be a feasible solution (some much better than others) and so looking at them all takes 2¹⁶⁰ time, on the same order as breaking strong cryptography.
The first step to making it faster is to apply branch and bound, which in ECLiPSe-CLP is available as a simple library call:
bb_min(label_model(M),Objective,M,MO,_OO,
   bb_options{strategy:dichotomic,factor:0.7}),
The bb_min/6 library predicate calls our own label_model/1 predicate (whose general function is the basic backtracking algorithm) with some added smarts. As soon as label_model/1 finds a solution - any solution - bb_min/6 intercedes. It records the solution that was found, and then adds an extra constraint to the problem saying now you must find a solution better than that one. From then on, any variable assignment that would increase the objective past the already-known solution results in an immediate backtrack. We can expect that as we find better and better solutions, backtracking will become more frequent, until eventually it's easy to know we have the best possible solution. The actual bb_min/6 call above applies some further enhancements, too; what I just described is just the basic algorithm.
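To make that concrete, here is a toy branch and bound in Perl over the macro-to-library assignment, built on the wasted_links() sketch from earlier. It illustrates the principle only - it is not how bb_min/6 works internally, and plain enumeration like this would only be usable on tiny instances:

sub branch_and_bound {
   my ($macros, $nslices, $uses) = @_;   # arrayref, integer, hashref
   my ($best_cost, $best) = (~0, undef); # ~0 stands in for "infinity"
   my %partial;                          # macro => library, assigned so far
   my $descend;
   $descend = sub {
      my ($i) = @_;
      my $cost = wasted_links($uses, \%partial);
      # the cost of a partial assignment can only grow as it is extended,
      # so anything already as bad as the incumbent can be pruned
      return if $cost >= $best_cost;
      if ($i == @$macros) {              # complete assignment: new incumbent
         ($best_cost, $best) = ($cost, { %partial });
         return;
      }
      for my $lib (1 .. $nslices) {      # try each library for the next macro
         $partial{ $macros->[$i] } = $lib;
         $descend->($i + 1);
      }
      delete $partial{ $macros->[$i] };  # undo before backtracking further
   };
   $descend->(0);
   return ($best, $best_cost);
}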
So with just this level of intelligence, we already have a solver that, in theory, should be able to find an optimal solution. How well does it work? Unfortunately, not very well. The biggest problem seems to be that the search space is very big and branch and bound doesn't provide good guidance on how to explore it.
Suppose for the sake of argument that it's a bad idea to put "five" and "woman" (my earlier examples) in the same library. I don't know whether that's actually a bad idea or not - part of the point is that it's very hard to come up with reasonable answers to questions of that type - but just suppose that it's a bad idea, and suppose that the first couple of variable assignments in the backtracking happen to make that assignment. Those assignments end up at the bottom of the stack, and then the backtracking algorithm piles dozens more variable assignments on top of them.
Because we're not going to let it run for billions of years, there is no time for it to go back and re-visit those first few decisions. It only has time to backtrack the bottom few levels of the computation tree, corresponding to the top few entries on the stack. So it will only ever see solutions in which "five" and "woman" are placed together, and if that initial decision happens to be a bad one, we'll be stuck with it anyway. This effect was quite visible in my tests - the solver finds an initial solution, then can't make much progress on improving it.
Improving the search
People have invented a lot of tricks for making this kind of search work better, and I've experimented with a lot of them. (One reason to use ECLiPSe-CLP is that it makes it easy to prototype different choices of these things.) Here's the label_model/1 predicate as it currently stands, with a support routine it uses:
label_model(M):-
   M=model{mvars:Mvars,fsets:Fsets,objective:Objective},
   (foreach(Fv,Fsets),foreach(Cv,CL) do #(Fv,Cv)),
   append(CL,Mvars,VL),
   search(VL,0,most_constrained,indomain_random,dbs(3,bbs(5)),[]),
   label_sets(Fsets),
   indomain(Objective).

label_sets([]).
label_sets([H|T]):-
   insetdomain(H,increasing,small_first,in_notin),label_sets(T).
Assigning a value to a variable in this kind of framework is called "labelling" it. My code relatively straightforwardly tries to just go ahead and label all the variables. What isn't visible above, but is important, is Prolog's implicit backtracking: every time I attempt an infeasible labelling, it stops, goes back, and automatically re-tries pieces of code until the labelling becomes feasible. In this case, the most important cause for that is the extra constraints added by bb_min/6, which keep turning the screws on the labelling code, requiring it to come up with better and better solutions.
It is really annoying to try to label both integer and set-type variables in the same code, because most of the libraries are designed to do just one or the other. Rather than trying to write my own code to switch between the two on the fly, I've instead elected to try to make the sets look like integers. The most important thing about a set (remember, the sets represent libraries included by page files) is how many elements it contains; so I add some extra variables with the line "(foreach(Fv,Fsets),foreach(Cv,CL) do #(Fv,Cv))," one per set, representing that information. There's a constraint between these integers and the sets. So the new plan is to backtrack over all the integer variables, which now include macro-to-library assignments, number of libraries per page file, and overall number of wasted dependencies, with implicit connections among them. Once there is a choice for all those integers that looks feasible, the system will call label_sets/1 to try to come up with actual values for the set variables. By that point many of them will actually have already been given their values as a consequence of the constraints, but label_sets/1 should mop up any remaining uncertainty.
Now that we're mostly only looking at a bunch of integer variables, it's possible to invoke powerful library routines. The first easy enhancement to backtracking is a smarter variable selection rule. Basic backtracking will function (if we don't care about time) no matter what order we choose the variables. The "most_constrained" argument on search/6 tells it to choose variables first that have the fewest remaining choices. That encourages the search to hit infeasible conditions, and therefore fail and backtrack, as soon as possible - which is what we want, because it saves search time. In case of ties between two variables with the same number of choices, the rule is to choose the one that participates in the largest number of constraints. This rule is a fairly popular one that people have generally found works well in practice on many problems. In my experiments it seems to work well on this one.
A similar issue applies to the order in which we try different values for a variable. I've chosen "indomain_random," which shuffles the different values into random order and then works through that list. A non-random rule might typically be something like "use the smallest value allowed by the constraints," which in fact would end up looking first at "put everything in library 1," the worst possible solution. Other deterministic rules seem likely to cause other kinds of bad behaviour; mixing it up with a random assignment makes such things easier to avoid. There is also a special benefit in this case because (in other code elsewhere) I've instructed the branch and bound to restart the search from an empty stack every time it looks for a new solution; combined with the random value assignment that guarantees we'll see a larger sample of different choices for variables toward the beginning and middle of the stack, reducing the "What if it happens to make an early bad decision?" problem I mentioned earlier.
With those improvements it seems to run better, but still isn't very good. It basically starts with a random solution first, which is typically sort of okay, and then it improves it a little, and then it bogs down in a local minimum. Adjustments to the branch-and-bound settings - in particular, telling branch and bound to look for a 30% improvement on every new solution and do binary search if it can't achieve that ("strategy:dichotomic,factor:0.7") - help somewhat. I added in the symmetry breaking constraints but they didn't seem to make much difference and with this article getting long I don't think I'll talk about them. What makes the real difference to search time is to sacrifice "best" on the altar of "good enough."
We don't really need the very best possible assignment of macros. We just want one that is pretty good. So it's not necessary to really run a search program that will look at, or prove it doesn't need to look at, every one of the 2¹⁶⁰ solutions. We just want one that can say when it finishes, "Here's a good solution, and a better one would be not much better and would take so long to find as to be impractical."
That issue is addressed by another argument to search/6, the one that currently reads "dbs(3,bbs(5))." That says: instead of trying every possible solution by backtracking, do five levels of backtracking at the end of the calculation (top of the stack) and when that fails, jump down to the bottom of the stack and try every possible assignment for the first three variables. If the search can't find a feasible solution within those limits, it will just fail, even though there might well be other solutions out there. These limits were chosen by trial and error; I'm still not sure they are the best ones, and ECLiPSe-CLP supports many other kinds of limits on search. But this rule seems to be the best compromise I've found so far. In practice, on smallish instances of the problem, different runs produce different results (because of the randomization) but they all tend to produce roughly the same quality of answers, and it seems to be hard to push for anything much better.
The next steps
How well does it work? Unfortunately, at this point it doesn't work very well. If I take a library with about 100 macros in it and tell this code to split that into four smaller ones, which is a typical size of problem where I'd like to use it, after about five minutes of processing this solver can often produce a 25 or 30% reduction in the number of wasted dependencies. That's not a big improvement in build effort stacked up against the overhead of maintaining four times as many library files. A purely random assignment often gives me 5 or 10% improvement over a single file anyway, so all the work of the heuristic search isn't actually buying me much. And it doesn't seem to be the case that if I wait another five minutes, the solver will shave off another 30%; instead, it's running into a wall where it can't improve the split any further.
I think part of what's going on is that there just isn't a whole lot of real improvement possible. The dependencies among my source files are inescapably highly connected, so that no really nice splits into smaller libraries exist. I don't want to blame too much of this problem on those Second Century Chinese scholars, nor even the Unicode Consortium (who have other crimes to answer for); both groups of people did some really clever work in solving important problems, but they didn't solve my problems. It might be interesting - but my availability for this is limited because I have other work to do - to apply this kind of thing to more typical software projects where we might hope factoring is more possible, and see if it can get better results.
Another thing I want to try is a search algorithm of the type called "local search" instead of the one described above which is a type called "global search." Really, with all solutions basically feasible, this problem is crying out for local search, and I tried the global approach first only because I had tools that were really convenient to apply, not because I thought it was really the approach most likely to succeed. The critical difference is that with global search (the kind described above) we start out with a partial solution and extend it, in a way that if left unmodified is guaranteed to eventually look at every possible solution. Local search instead maintains one or more complete solutions at all times - maybe not good solutions, but complete ones with values for all variables. It then makes changes to the complete solution(s) trying to look for better ones. The branch and bound algorithm could almost be called that, but it's more of a leash applied to an underlying global search. True local search techniques include things like simulated annealing.
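As a rough illustration of the difference, a local search for this problem can be as small as the following Perl sketch: hill climbing over complete assignments, moving one macro at a time and again leaning on the hypothetical wasted_links() from earlier. A serious attempt would add an annealing schedule or restarts; this just shows the shape of the approach:

sub local_search {
   my ($macros, $nslices, $uses, $iterations) = @_;
   my %assign;                                          # start from a random
   $assign{$_} = 1 + int(rand($nslices)) for @$macros;  # complete solution
   my $cost = wasted_links($uses, \%assign);
   for (1 .. $iterations) {
      my $m   = $macros->[ int(rand(scalar @$macros)) ];
      my $old = $assign{$m};
      my $new = 1 + int(rand($nslices));
      next if $new == $old;
      $assign{$m} = $new;                            # try moving one macro
      my $trial = wasted_links($uses, \%assign);
      if ($trial <= $cost) { $cost = $trial; }       # keep non-worsening moves
      else                 { $assign{$m} = $old; }   # otherwise undo
   }
   return (\%assign, $cost);
}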
For this particular problem, one that I'd like to try (and it's sexy, in an almost literal way) is what's called genetic optimization. Maintain a population of possible solutions, have a way of combining two of them to produce a new one (possibly with some random elements too), and implement a selective breeding program until you get a solution you like. I have some vague theoretical support for why that might be an especially good technique to apply to this particular problem, and it would be fun to implement. But it'll require a fair bit more coding to actually build a solver that works that way.
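The heart of such a solver isn't much code either; the interesting questions are the selection pressure and the population management. Just as a hedged sketch, a crossover-with-mutation step on the same macro-to-library hash representation used above might look like this:

# Combine two parent assignments: each macro inherits its library from one
# parent or the other at random, with occasional random mutation.  The
# surrounding selection loop and fitness evaluation (wasted_links again)
# are omitted; all of this is illustrative, not a finished design.
sub crossover {
   my ($mom, $dad, $nslices, $mutation_rate) = @_;   # two macro=>library maps
   my %child;
   foreach my $macro (keys %$mom) {
      $child{$macro} = (rand() < 0.5) ? $mom->{$macro} : $dad->{$macro};
      $child{$macro} = 1 + int(rand($nslices)) if rand() < $mutation_rate;
   }
   return \%child;
}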
Another global approach, which might at least provide a good starting point for the genetic optimization, would be to try to turn it into some kind of godawful matrix eigenvalue problem. In statistics there is this thing you can do variously called "singular value decomposition" or "principal component analysis" where you have N variables that may be connected to each other, you build an N by N matrix describing, pairwise, how much each thing is connected to each other thing, and then you call a library routine, and it gives you information on the groups of variables that tend to be mutually correlated. So far I haven't proven that the solution to that would actually be the solution to my problem, but it seems like a solution to it would be like a solution to my problem, so it's at least worth a look. When I first started thinking about this problem I thought that since I have potentially a few thousand macros, that would be a few thousand by few thousand matrix, with millions of entries (and "dense," i.e. many entries non-zero, which makes it expensive to work with); but on further thought I realized that I only have to look at macros in the libraries I'm splitting, typically not more than a couple hundred at a time, which means the matrices are a lot smaller and well within the amount of computation I'm willing to do.
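Building the matrix itself is the easy part. The sketch below (same hypothetical %uses structure as before) just counts, for each pair of macros, how many page files use both of them; the eigenvalue or singular value computation would be handed off to a numerical library:

# Macro-by-macro co-occurrence counts: entry (m1, m2) is the number of
# page files that use both macros.  This is the raw material a principal
# component analysis would start from.
sub cooccurrence {
   my ($uses) = @_;                  # page => [macros used]
   my %matrix;
   foreach my $page (keys %$uses) {
      my @macros = @{ $uses->{$page} };
      foreach my $m1 (@macros) {
         foreach my $m2 (@macros) {
            $matrix{$m1}{$m2}++;
         }
      }
   }
   return \%matrix;
}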
I have less hope for other popular "clustering" kinds of techniques, which tackle the general problem of splitting up a bunch of things into nicely separated clusters. I think many of those would tend not to put enough emphasis on the all-or-nothing aspect of my objective function: if a page file uses just one macro that happens to be put into a large cluster, then I'm going to end up paying for wasted dependencies to all the other macros in its cluster. But in principle, clustering is looking at this same general kind of problem and with the right definition of distances among macros, it might be a worthwhile way to think about it.
Even if, as things are starting to look, I'm not really going to save a lot of build time on Tsukurimashou this way, it seems like a fun problem to think about.
2 comments
Something else that occurred to me was that maybe taking an existing file or union of them, and splitting the whole thing into completely new parts, may not be what I really want to do anyway - especially because the most common change to these files isn't actually modifying an existing macro, but adding a new one (at which point I can be *sure* nothing that doesn't use the new macro needs to be rebuilt). So the most useful question to answer isn't "how to split this file?" but "where should I add the new macro?" That may be an easier question to answer: if I know who uses the new macro, I should be looking for library files they already have in common and don't share with too many others. And on the maintenance side, it may be an interesting question "What small change from the current organization - either moving one macro from one file to another, or splitting one or two into a new file of their own - would make the biggest difference?" Those are things I can compute relatively easily.
Matt - 2011-12-06 07:09
Bucko - 2011-12-06 01:57