« Removing things from Firefox's... | Home | Tsukurimashou 0.8 »

Papers from the recent past and near future

Wed 21 Aug 2013 by mskala Tags used:

I just got back from a trip to multiple conferences in Ontario, and that makes it a good time to update my publications page. Most people interested in my academic work are likely to find out about it from other sources, but I'm going to post some brief and relatively non-technical introductions here as well for my general Web site readers. The official versions of these papers are mostly behind paywalls, but there are also unofficial "preprint" versions available to all.

Where applicable, I am supposed to tell you that "The original publication is available at www.springerlink.com."

IanFest 66: we threw an event in Waterloo this Summer in honour of my PhD supervisor Ian Munro on the occasion of his 66th birthday, with most of his former PhD students and a bunch of other friends and colleagues coming in from all over the world to present papers related to the sort of computer science Ian does and teaches. I decided to contribute what's called a "survey paper" on the topic of array range queries.

Very briefly: if you have data in some kind of array or sequence, you might want to answer questions about sub-ranges of that sequence. For instance, "What was the highest temperature recorded between the years 1950 and 2000?" or "Was there any single referring Web site responsible for 10% or more of my traffic last month?" The obvious way to answer such a query is to march through the entire sequence; but if you have to answer many of them, you might want to build an index in advance that helps you to answer these kinds of queries faster when they come along. It should be clear that people have many reasons to want to do things like that. I and other researchers have done a lot of work recently on pushing the limits of array range queries.

The idea of a survey paper is that instead of presenting new material, it's supposed to sum up and provide a guide to the existing work in the field. Most papers that present new work start with a little bit of that kind of summary, but it's usually narrowly focused on the previous results that support the new one. A survey paper is usually a bit broader and designed to give an overview of the whole field. I'd never written one of those before, and I wasn't sure quite how to go about it. It ended up being a lot more work than I'd imagined. But I think the end result came out pretty well.

Array range queries were a natural choice because they're something Ian, I, and many of his other students have worked on a lot. A student named Sharma Thankachan from Louisiana State University came and visited my group at the University of Manitoba last Summer and again this Winter, and he and I and my postdoctoral supervisor here at U of M (Steph Durocher) collaborated on several array range query results. We found that there was a lot of existing work on similar things, and not much in the way of a good survey of what was out there - so we were constantly asking "Okay, what's the best result known on this problem? Has someone else thought of this idea before? What other questions have been asked?" and so on. It looked like I was going to have to prepare some notes on the current state of the field just for our internal use anyway; and so when the invitation for the Munro Festschrift came along, it seemed like a chance to get added value from the work I'd already be doing preparing the notes.

In fact, it spiralled out of control. I knew there was a lot of work on the array range stuff, but I'd underestimated how much this field overlapped with geometric and string searching. Everyone wants to create the next popular search engine, search engines need range queries, and so everyone wants to have a kick at that can research-wise. I asked for and was allowed a few extra pages beyond the official guideline for the length of the paper, and I narrowed my focus a couple of times, and it still feels cramped. One student suggested I should extend this into a whole book on array range queries. Corrections and new papers were and are coming out all the time. When I actually gave the talk for this I ended up citing results like "Travis's talk last Tuesday" and "Steph's talk ten minutes ago," and two of my own papers didn't even make it into the survey because the venues hadn't officially accepted them yet by the time the deadline came around. It's sort of exciting knowing that I'm really on the cutting edge like that.

Anyway, here is the unofficial version and the official version of my IanFest survey. Some more material, like the slides and maybe even a video from my talk, may be forthcoming later. Since this is a survey and a (brief) introduction of the subject area, it may be a good place to start before reading the more specific papers I describe below, which is why I'm introducing it first even though it's chronologically in the middle.

Range majority, ICALP 2011 journal version in Information and Computation: The conference version of this was actually the first array-range paper I worked on, at about the time I was moving from Waterloo to Manitoba. My co-authors are all either from Waterloo, or formerly from Waterloo. The conference version was presented (not by me) in Zurich in 2011. The newly-updated item is an extended version that appeared in a scientific journal in 2013. We were lucky to have the gap between the two be so short as two years; journal reviewing is often glacially slow.

The "range majority" problem is when you're looking for an item that occurs at least a certain fraction of the time, in the range. My example query of "Was there any single referring Web site responsible for 10% or more of my traffic last month?" is that kind of question. You choose a threshold on how frequent the item should be, and look for any item that occurs at more than that threshold as a proportion of the items in the range. The interesting thing here is that we can do it in constant time and linear space. That's much faster than looking at every item in the range (which would be linear time). At the time, we thought of the threshold as something you would choose up front (when you build the index) and not change. Later on, in subsequent papers, we realized that it would be interesting to try to make it so that you can choose your threshold at query time, not always being stuck with one choice. That makes the problem more complicated. Here's the official and the unofficial version of the paper.

Mode and least frequent, MFCS 2013: This has actually not been presented yet (the conference will be in Austria a few days from now), but the official version is already on Springer's Web site so I think it's okay to link to the preprint on my coauthor's. One of the interesting things about that previous range majority paper is that if you look for an item that is "pretty frequent," that is, over a certain fixed threshold, it's an easier problem than if you're looking for something that is really the most frequent. Range mode, in this MFCS paper, is the harder problem: "Which referrer sent me the most hits in this range of times?" We can't do that in constant time, it takes basically square-root time; but actually, we're not doing it on arrays anyway (other people have covered that pretty well). We're doing it on trees, where the query consists of two nodes in the tree and then we find the most frequent value along the path between them. That kind of query is used internally by some kinds of search engines. We're also studying least frequent elements, which is sort of the opposite thing and requires a bunch of different techniques, even though the end result in time and space complexity is quite similar.

Top-k in trees, SPIRE 2013: this one is also in the future (to be presented in Jerusalem this coming October), and the official version is not yet available, but we have a preprint. This is again a tree range situation: you have a tree structure, the nodes are labelled with what we call "colours," and you want to specify two nodes and get a list of all the different colours that occur on the path between them. The gotchas are that the colours have a priority order and you want to get them in that order; you only want to be told about each colour once (no duplicates); and you won't tell me how many colours you want. I'm just supposed to keep giving you answers until you say to stop.

The "give me answers until I say stop" thing has consequences for the efficiency of the query. If I knew in advance that you would want 1000 answers then I might be able to process them all at once in a batch and save some time overall but with the top-k query I can't afford to do that because you might say "stop" after the first one and then all my work on the others would be wasted. Details like this are important in the design of data structures.

Of course a query like that is connected with search engines. You can imagine the "colors" as different documents (that is what they actually are, in typical applications). The query you put into a system like Google maps into a path through the tree, and then you want to get out a list of the documents matching the query, listed once each, in decreasing order of whatever "PageRank" has mutated into by now, and have the option of paging through however many results you want. There are a lot of subtleties in the definition of the problem, and our version isn't necessarily the one that any given system would actually want to use, but that's the general context.

k-enclosing objects, CCCG 2013: The full proceedings of this conference are open access, no paywall, but watch out! That's a 313-page, 25-megabyte PDF file. You might prefer the preprint of just my group's paper. Most of the nine coauthors are people from the University of Manitoba; we'd been discussing the material in this paper at our group meetings, and so (unsurprisingly) everyone in the group ended up getting involved in solving the problems.

The general idea for the k-enclosing objects problem is that you have some collection of things, in some sort of space (like maybe the 2-dimensional plane) and you want to find a region that contains a specified combination of them. As Nicholas Vining said on Twitter, it might be that you're looking for a starting site in a game world with two caves and one forest. Our paper goes into several variations of that. Unfortunately, some of the results are still like O(n2) or worse. That's better than the simplest algorithm you can think of for this kind of problem, but still kind of slow. But we've learned a bit about what makes different versions of the problem harder or easier, and at the conference I got into some good discussions of how to expand and improve the results, so I'm hoping it will lead to other interesting things.

Still to come: my TUG 2013 paper about Tsukurimashou, to be presented in Tokyo in October. No PDF because the full paper is actually not written yet, but I'm looking forward to it. It'll be really nice to finally have some academic work out there on this project. I'm also hoping in the near future to put together a paper about the bit vector features of IDSgrep 0.4 (new release yesterday!), because there's some computer science in there that I think is pretty cool. In the mean time, you can download the new release and read the chapter I added at the end of the manual to get some idea of what I'm working on.

0 comments



(optional field)
(optional field)
Answer "bonobo" here to fight spam. ここに「bonobo」を答えてください。SPAMを退治しましょう!
I reserve the right to delete or edit comments in any way and for any reason.