Not so obvious multi-thread programming specific bugs.

We all know that when writing multi-threaded programs one should remember about few more details like locking, using thread-safe libraries etc. Here is a list of not-so obvious bugs specific to multi-threaded programs. Many of them are not mentioned in documentation or tutorials for beginners, but I think everybody who is using threads will hit them eventually.

Use thread safe system functions


Not all system and standard library functions can be safely used. One of the most obvious examples is strtok(3) that does string tokenizing. It returns next token in each invocation and uses a global state to keep the current position in the source string. When you read the manual page for this function you will see that there is thread-save version: strtok_r(3) with additional argument: a state variable's pointer that is used instead of the global one. Other examples of such functions are:

Using variables not protected by mutexes, volatile keyword misunderstanding


Updated on 2011-03-01 after a user's comment.

You might think that you can use a shared "simple" variable, for example a single boolean flag without a mutex. For example this code:

  1. bool stop = false;
  2.  
  3. while (!stop) {
  4. sleep (1);
  5. }

compiled with optimizations turned on can't be interrupted by another thread by setting the stop variable to true. This is because the compiler is free to apply optimizations: it can see that the variable is not modified in the loop, so it can omit the while condition. Another reason is that, depending on the system's architecture, such change of memory could be unnoticed by other processors.

The volatile keyword is sometimes considered as a solution, but it has nothing to do with threads. This keyword is intended to be used in low level code (like device drivers), just to ensure writing to a device's memory etc. It does not do what we need in a multi-threaded process: it doesn't make the change in the memory visible by other processors. On some architectures it may be sufficient, but should't be used this way.

The proper solution is just to use mutexes when accessing the stop variable even if it's so "simple" memory access.

Double close() and other uses of invalid file descriptors


Consider this code fragment:

  1. fd = open ("file", O_RDONLY);
  2. if (fd < 0) exit (1);
  3.  
  4. while ((res = read (fd, buf, sizeof(buf)))) {
  5. if (res < 0) {
  6. close (fd);
  7. fprintf (stderr, "Read error!\n");
  8. break;
  9. }
  10. else {
  11. printf ("Read %zd bytes\n", res);
  12. }
  13. }
  14.  
  15. close (fd);

What's wrong with it? In a single-threaded program it will always work properly even if there is one bug: in case of a read error the file descriptor will get closed twice - the second close(2) will just return an error that will be ignored. However using this code in a multi-threaded program will get you into troubles, often very nasty. Why? Because the second close(3) may not fail... There is a race: if another thread opens a file or creates a socket between the first and the second close(3) and will get the same descriptor number the thread above will close it. Remember that file descriptors are shared between threads of the same process. Closing another threads's descriptor isn't the worst think that can happen: imagine that the code tries to write something to it before the second close(): it could write data to some other file or TCP connection! From my experience double close() is one of the hardest threading bugs that can happen because such races are rare and results are very strange errors. As a solution I always advice to check the result of every close(3) operation. It's often not check in programs, especially when the descriptor is used only to read a file (so at first look it can't fail). If we log every close(3) fail we are able to detect such bugs in our programs before the race occurs: in most cases the second close(3) has a better chance to fail than to close another descriptor.

Uncaught exceptions


An uncaught exception will cause the process to exit with an error message. When writing a multi-process network daemon such bug will terminate one process and a properly written program will re-create it. When such a daemon is converted to a multi-threaded design uncaught exceptions are more dangerous: they will kill the whole program, not just one thread. You must remember it and somewhere at the top level code catch really everything and even do catch (...) { log ("unknown exception") }. Catching "..." and not re-throwing the exception is a bad practice, but at least the program will still handle the rest of the clients. This is probably the only case when I catch "...".

Using fork()


I wrote about this in Threads and fork(): think twice before mixing them. Basically: there is no safe way to use fork() in a multi-threaded process and doing something more than execve() in the child process because you can't tell what other threads were doing at the time of the fork() call - some mutexes may be held, some thread could be in the middle of modifying some complex data etc.

Using IO when a mutex is locked


Here is a performance tip: avoid doing I/O operation while holding a mutex. At least I/O, better is to avoid any syscall or even a library call while a mutex is locked. Believe me: you don't want your threads in a very busy network demon handling at least thousands requests per second to wait for some thread that happens to write some error message via syslog(3) and has a mutex locked. Use mutexes just to synchronize access to memory and unlock them as soon as possible. Let's look at this example:

  1. pthread_mutex_lock (&mutex);
  2. if (freeSlots == 0) {
  3. syslog (LOG_ERR, "No slots available, rejecting request");
  4. }
  5. else {
  6. freeSlots--;
  7. }
  8.  
  9. pthread_mutex_unlock (&mutex);

During the syslog() call the mutex is locked. Depending on syslog daemon's configuration and machine's load it may even take few tens or hundreds of milliseconds to complete - when an fsync() is done after each log line. Just unlock the mutex before doing logging and other threads may run without taking a break.

Good advice: make a Mutex class


If you are using the C++ language, don't use POSIX mutexes functions directly. It's a lot easier when you create some class that will lock mutex in it's constructor and unlock in the destructor. This way just creating an automatic variable of this class will lock the mutex and automatically unlock it at the end of code's scope. An example of such class is scoped_lock from the Boost library.

Comments

So what's the problem with

So what's the problem with using the "stop" variable without a mutex? The while loop does not modify the variable, the access is read-only. If there is another thread that writes the variable, that's fine because it doesn't conflict with loop's read.

Because from the compiler's

Because from the compiler's point of view the "stop" variable has always value false and never changes in this part of code. So the compiler is free to optimize the loop into an infinite loop by omitting the condition. A mutex is a barrier for such an optimization.

Multi-threading and volatile

Hi, I 100% agree that this code is invalid. But not because it doesn't work, but because it's terribly difficult to figure out if and why it does. In this example, I believe it does work - the compiler doesn't know what sleep() does, so it's forced to assume that it might change "stop". If "stop" was static, then it couldn't, so the compiler could assume it doesn't (but it could call a function in the same source, which would change "stop", so the compiler actually can't assume it). But if sleep() is implemented inline as a busy loop, then the compiler does know what it does. The volatile keyword would help, because it tells the compiler to just read the variable again each time, assuming nothing. You're right that it's meant for memory mapped I/O, but it can be useful for multi-threading. Anyway, if you code's correctness depends on understanding all the details above - you're in trouble. Also, I'm not sure a Mutex would save you here. The proper solution is, I think, replacing "stop" with a Semaphore. Uri.

volatile for shared variables

I think that volatile will not help in your example because it will not ensure that the memory is synchronized between the processors involved. It works by accident on current processors but is not enforced by the C or POSIX standard. A synchronization primitive like a mutex will ensure that the proper memory synchronization steps are performed.

You're right, this section is

You're right, this section is completely wrong about the volatile keyword, I'll change it. Thanks for pointing this out.