I recently had occasion to want to use condition variables in SDL. I know enough about concurrency to, for instance, know that condition variables were the things I wanted to use; and I've got most of a PhD in computer science and can certainly be expected to read and understand the original work by people like Hoare. However, I never did actually take a concurrency class, and knowing how these things are supposed to work from a computer science point of view doesn't necessarily prepare one for whatever weirdness might be built into SDL's particular implementation. So I went looking on the Web for tutorials and other information on using SDL condition variables in particular, and I discovered to my horror that it was all written by game programmers. So here are my notes on how to use condition variables in SDL, posted both for my own reference and for anyone else who might be in a similar situation.
The deal with condition variables is that you have several threads that all have to work on some shared data structure. Moreover, they have to make decisions on when to start and stop, based on potentially complicated conditions, and sometimes they have to trigger each other to wake up as the conditions change and other threads become eligible to run. Condition variables are a general-purpose concurrency thing for dealing with that kind of situation. They serve a similar function to, but are more general and powerful than, things like mutexes and semaphores. I'll assume you know what a mutex is, because in SDL, you need those to use condition variables.
In more detail, when you're going to use a condition variable you need three kinds of things. First, you choose some variables that are to be protected by the concurrency primitives. I think those are actually the real condition variables in computer science terminology, but SDL uses the phrase "condition variable" to refer to something else, so to reduce confusion I'm going to call these variables the "protected variables" and keep "condition variable" for the SDL data structure SDL_cond. The protected variables are the ones threads read when they decide whether to run, or write when they are controlling other threads. The protected variables carry the data payload when threads signal each other.
You also have two concurrency primitives used to control threads and access to the protected variables: a mutex of type SDL_mutex, and (what SDL calls) a condition variable of type SDL_cond. The mutex is straightforward. It's used to keep access to the other things atomic. If you want to read or write the protected variables, or touch the condition variable, then you lock the mutex first, and unlock it when you're done. The condition variable is essentially a queue of waiting threads. You join the queue if you want to be notified of changes in the protected variables; when you change the protected variables, you (can, if you think anyone cares) notify all the threads that are waiting for changes.
When you wait for a signal on the condition variable, you pass the mutex (which you had better ALREADY have locked) into the SDL_CondWait call. Inside that call, SDL will unlock the mutex, wait for the signal, then lock it again before returning. So you put your SDL_CondWait call inside a lock/unlock pair, and as far as you're concerned you keep your lock, but really what's happening is that SDL is passing off the lock to other threads to give them the chance to change the protected variables. The SDL_CondWait call can be thought of as "get me a new state of the protected variables."
So, the general flow is that you lock the mutex, look at the protected variables, and if you want to wait for something to happen, then you repeat calling SDL_CondWait until the protected variables are in such a state that you want to continue running. Then you unlock the mutex and go on your way. At other times, either as part of that block or as part of something else, you might want to send a message to other threads by changing the protected variables. In that case, you lock the mutex, read and write the protected variables to reflect the information you want to convey, call SDL_CondSignal to wake up the other threads, and unlock the mutex.
The protected variables store messages to and from threads. Those messages could be as simple as setting and clearing a flag, or some kind of complicated message queue structure, or whatever. The mutex locks all access to the protected variable and the condition variable. The condition variable keeps track of which threads are waiting for messages, and gets signalled when a new message arrives. Threads are responsible for using the mutex to serialize their access; making sure that the protected variables are in a sensible state every time they unlock the mutex or wait on the condition variable; signalling the condition variable when they make a change that other threads will want to know about; and understanding the state of the protected variables whenever the mutex gets unlocked.
SDL supports a couple of extra operations which are not essential to the model (because you could fake them with the operations already given), but probably useful in practice. First, when you wait for a signal you can wait with a timeout, so that the call will return after a while even if no signal occurs. Also, when you send a signal you can choose between waking ALL waiting threads, or just ONE of the waiting threads. Depending on what you decide a signal is supposed to mean, if you know that only one waiting thread will care about your signal and you don't care which waiting thread gets the signal, then it could be an efficieny win to not wake them all.
I've got a master thread and a bunch of slave threads. The master thread piles up work in a queue and then tells the slaves to work on it. The slaves take tasks from the queue, work on them, and then take more tasks from the queue. The master wants to start the slaves and then go to sleep until they're finished, so when the last slave finishes, it has to wake the master. However, the master may or may not be enforcing a time limit. If it is, then when the time expires, it has to tell the slaves "stop working." In that case, the slaves don't start any new tasks, but they DO complete their existing tasks. The "stop working" is "finish your current tasks and then stop," not "stop working immediately." And then, as in the "run to completion" case, the last slave to finish must wake the master. It should be clear that this is fairly reasonable way to set things up, but at the same time, there are lots of potential things that could go wrong if the messages don't get from one thread to another in the right sequence. The original application is a real-time ray tracer that may or may not be trying to complete the frame by the time it has to display it on the screen; if it can't, it fills in the gaps by a process beyond the scope of this article.
I think the best way to figure out this kind of thing is to list each of the data structures and what the responsibilities of each thread are with respect to that structure. So: to begin with, I have a mutex. All threads must lock and unlock it with the standard condition variable procedure described above. I have protected variables. The protected variables will be two integers that count how many slaves are currently working, and how many slaves the master wants to have working. Really, that last could be a Boolean, because the master always wants either all or none of the slaves to be working; I'm making it an integer in case I want to do trippy load-balancing things in the future. The slaves are responsible for looking at the "how many should be working" integer to decide when to start or stop work, and updating the "how many are working" integer when they do start or stop work. The master is responsible for setting the "how many should be working" integer. It actually doesn't need to look at how many are working, because the slaves will signal it when they finish. The slaves also have the additional responsibility of knowing to stop work when there is no work, if the master hasn't stopped them first.
I'm going to use not one but two condition variables. One is for the slaves. The master signals it when slaves should start working. The other is for the master. The last slave to finish signals this condition variable when it does finish. Using two means I won't waste time waking the slaves (only to have them say "no more work to do" and go back to sleep) at the end of the job. I'm going to share the mutex between the two condition variables because most of the protected variables are protected by both and I'd need a shared mutex for those anyway.
So, let's assume the master is not enforcing the time limit. Then it locks the mutex, puts tasks in the queue, sets the "how many should be working" integer to the number of slaves there are, signals the slaves' condition variable, and then waits on the master's condition variable.
The slaves run a loop that looks something like this:
The counter-intuitive thing here is that the "work on the task" step is a sort of anti-critical section. The slave unlocks the mutex before doing that step and locks it afterward. There's another unlock/lock pair concealed in the "wait on the slaves' condition variable" step. The critical section, protected by the mutex, is inside-out: it's all the rest of the loop. Slaves will spend most of their time either doing work, or waiting for the master to tell them to start work, in one of the anti-critical sections. Note that the loop is guaranteed to hit exactly one of those per iteration. The critical section is only used for deciding whether to work and whether to wake up the master.
If the master needs to enforce the time limit, then it is responsible for somehow waking up when it is ready to stop the slaves. Most likely that would happen by virtue of it calling SDL_CondWaitTimeout instead of SDL_CondWait when it waits for the signal from the last slave, to let SDL wake it. A more complicated system (supporting more elaborate conditions for waking the master) might having the master spin off a wakeup thread that would notify it through the same condition variable the slaves use to wake the master. In either case, if the master wakes as a result of timeout instead of as a result of the slaves finishing, then it must tell the slaves to stop. It does that by just setting the "should be working" integer to zero and then going back to waiting (without timeout) on its condition variable. That has the effect of unlocking the mutex. The slaves are probably in their "work on the task" anti-critical sections. As they come out, each one locks the mutex, decrements the "number of slaves working" counter, and then loops around and goes into the "tasks in queue, but I should not be working" wait state. The last slave to finish sees that slaves working and should be working are both zero, so it signals the master. Then the master wakes and knows that the slaves are finished up to the point they were allowed to go.
After the master gets the final signal, it has the chance to reload the queue and restart the slaves by setting "should be working" appropriately and signalling the slaves' condition variable. At that point they will all be waiting on it, so they each loop around, see the new tasks in the queue, and start work again. In my application, all this would happen once per frame, with the master having to deal with blitting out the results and so on in between stopping and starting the slaves.
One last comment: having the "should be working" count separate from the queue may seem unnecessary, since I only ever test them at the same time. It looks like the master could simply empty the queue to tell the slaves to stop, and the slaves could assume that they are all supposed to work whenever the queue is not empty. I'm keeping it as a separate variable because of the way I'm storing my queue; I actually have a fixed standard queue of tasks that run every frame, and a counter indicating how many of them have been allocated to threads so far, and some other things going on. The reason for doing it that way has to do with saving time in memcpy(3) when I start tracing a new frame. Having a counter of number of slaves that should be working also allows, as I mentioned before, the possibility of making that less than "all of them" for load balancing reasons. If I had a more ordinary structure for my task queue, it would probably make sense to get rid of the "should be working" count and just trash the queue when the master wants the slaves to stop early.
Rob from 93.96.1.238 at Fri, 10 Oct 2008 17:14:59 +0000:
What the fuck do you mean "I discovered to my horror that it was all written by game programmers"?
Azadeh from 193.10.65.26 at Tue, 24 Feb 2009 09:21:23 +0000:
By SDL you mean Specification and Description Language and not SDL which game programmers use.
Matt from 216.59.226.11 at Tue, 24 Feb 2009 10:45:15 +0000:
No, I mean the Simple Directmedia Layer.
Richard Barrell from 137.222.107.66 at Fri, 14 Sep 2007 10:10:17 +0000:
Thanks for writing this. It sounds, from your description, rather similar to pthreads.