Profiling Input/Output performance

There are various nice tools for profiling a program's CPU usage, like gprof or oprofile. These tools will tell you exactly which functions in your code consume the most CPU time, where they are called, how often, and so on. But what if your program is slower than it should be because it waits for disk I/O operations? How can you check whether I/O is the bottleneck and which operations are the slowest?

How to tell that I/O operations are the bottleneck


You might hear the hard disk's heads moving or see the HDD LED blinking constantly while your program works (or not, if the program runs on a remote machine), but how do you know that I/O is the bottleneck and not the CPU?

top and vmstat


First, the Linux top command will tell you how the processor spends its time while your program runs. Look at this line:

Cpu(s): 14.4%us,  9.2%sy,  0.0%ni,  0.0%id, 75.5%wa,  0.3%hi,  0.7%si,  0.0%st

It's easy to tell that in this situation the system is doing mostly I/O operations. The "wa" (I/O wait) column shows that the CPU spends 75% of the time waiting for I/O operations to complete, about 9% of the time in kernel context and less than 15% running user space programs. I said "easy" because in this case the system was running just one program (find) and there was only one processor.

Similar information is also available through the vmstat command. It's sometimes handier when you want to monitor the situation over a longer time. Just run vmstat 5 and let it run in a terminal.

iotop


Another useful tool is iotop. You will need a fairly recent kernel (>= 2.6.20) to use it. Like top, it shows all processes, along with the percentage of time each process spends waiting for I/O. Let's see what it shows for the find command:

5359 be/4 daper    413.09 K/s    0.00 B/s  0.00 % 93.71 % find /

We can see that the program spends 93% of its time waiting for I/O operations, and how much data it reads and writes. We can also distinguish I/O requested explicitly by the program (93.71%) from I/O performed due to swapping (0.00%). This is how we can tell that find is a program that mostly does I/O, so CPU profiling to boost its performance doesn't make sense.

iostat


This command is more useful for an administrator than for a programmer, but it's definitely worth knowing. My favorite use is iostat -x -k 5 /dev/sda (you can list many devices at a time). Let's see what it shows every 5 s while the find command is running:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6,80    0,00    8,00   85,20    0,00    0,00

Device:          rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda                0,00   125,20  103,80    5,00   415,20   520,80    17,21     1,57   14,38   8,59  93,44

The first table is what top also shows, but the second tells what each device is doing. Probably the most useful information is in the last columns:

(descriptions quoted from the iostat manual page)

  • avgrq-sz - The average size (in sectors) of the requests issued to the device. This is useful to know because performing I/O with larger requests (buffers) is faster.
  • avgqu-sz - The average queue length of the requests.
  • await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. This column is especially interesting when many processes issue I/O requests simultaneously, for example on a network server.
  • %util - Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%. This tells the system administrator whether the storage is too slow: if this value is constantly near 100%, everything is probably waiting for the disks and it's time to do something about it.

How to profile a program when it comes to I/O?


Now that you know I/O load is the problem for your program, it's time to see what exactly happens and which operations are the slowest. This is the sad moment when I must inform you that there is no tool that will produce as nice a table as gprof or oprofile; I/O profiling is a bit harder.

What can be slow? What are we looking for?


There are many reasons why the system executes your I/O requests more slowly than you would like:
  • Non-sequential I/O. Hard disks are the moving parts of our computers and are really slow when they have to do many operations that involve moving their heads.
  • Slow metadata operations (unlink, create, rename). Some filesystems, like XFS, are significantly slower than others when it comes to such operations. If we do many of them, they might slow down our program.
  • fsync(). Sometimes it's necessary to make sure data is written to the medium, and we use functions like fsync() for that. This is a really slow operation in most cases (the exception being a battery-backed RAID controller); see the sketch after this list.
  • Too many system calls. Making a system call forces the OS to make a context switch into kernel mode and back to user mode after the call finishes. Even if the call is simple (like getting the current time) the cost of context switching may impact performance if we make them often.
  • Doing I/O in small chunks. Bigger buffers may significantly boost performance.
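To illustrate the fsync() point, here is a minimal sketch (my example, not taken from any real program; the file name is a placeholder): the write() usually returns quickly because it only dirties the page cache, while fsync() blocks until the data reaches the medium. When flushing metadata such as the modification time isn't needed, fdatasync(2) can be a cheaper alternative.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "journal.log" is a placeholder file name */
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char record[] = "transaction committed\n";
    if (write(fd, record, strlen(record)) < 0)  /* fast: only dirties the page cache */
        perror("write");

    if (fsync(fd) < 0)                          /* slow: waits until the disk has the data */
        perror("fsync");

    close(fd);
    return 0;
}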

strace


The strace utility tells us what system calls a program makes. It's a good place to start. I sometimes see cases where just a strace log from a few seconds of a running program reveals the problem. Some common ones are many repeated stat() calls on the same files, reading one character at a time instead of bigger chunks, and excessive use of the time() call (not really I/O related, but a very common problem).

When using strace one has to remember that it will not show I/O done through mapped memory. The FreeBSD ktrace utility, on the other hand, does show it.

Measuring time


A very useful option is -T: it shows the time spent in each syscall, so you can see which calls are the problematic ones. Let's see it in action:

open("..", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW) = 5 <0.000051>
fstat64(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000045>
fchdir(5)                                = 0 <0.000046>
close(5)                                 = 0 <0.000047>
fstatat64(AT_FDCWD, "LanguageNames", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0 <0.000061>
open("LanguageNames", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5 <0.000053>
fchdir(5)                                = 0 <0.000045>
getdents64(5, /* 3 entries */, 32768)   = 104 <0.033817>
getdents64(5, /* 0 entries */, 32768)   = 0 <0.000053>
close(5)                                 = 0 <0.000057>
open("..", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW) = 5 <0.000055>
fstat64(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000046>
fchdir(5)

This is a fragment of a find run. Not really useful by itself, but it shows how strace reports time. In this example the getdents64 call is the slowest one (more than 30 ms); the others take less than 100 µs. I remember once seeing a case where, when reading files, the slowest operation was the last read() that returned 0 (EOF). I managed to skip that last read, because from the data in the file I could tell I had already read enough to do the work, and skipping that one call improved the performance.

To get an overview of the time spent in various system calls you can use the -c option; the output for find is (first few lines):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.15    0.138539          14      9705           getdents64
  1.72    0.002554           0      9721        12 open
  1.68    0.002498           1      4850           fstatat64
  1.47    0.002183           0      9703           fchdir
  1.31    0.001949           0      9710           close
  0.51    0.000761           0      4858           fstat64
  0.16    0.000240           0       709           write

As expected from the previous test, getdents64() is the call where almost all of the time is spent, so this is the place to look for improvements.

Per process I/O statistics: /proc/PID/io


In recent Linux kernels each process has a file named io in its /proc directory. This is the information that iotop uses. The file has the following content:

rchar: 12688872
wchar: 723095
syscr: 9872
syscw: 3931
read_bytes: 1531904
write_bytes: 552960
cancelled_write_bytes: 25395

It gives more information than iotop displays and may be used to profile our program, especially in conjunction with the drop_caches functionality described below; a small code sketch after the list shows how to read these counters from within a program. The lines contain the following information:

  • rchar, wchar - Simple counters of the data read and written by the process using functions like read() or write(), including TTY I/O.
  • syscr, syscw - Counters of the number of I/O operations (read and write respectively). We may use them to compute the average amount of data passed in one system call; in the sample above that's 12688872 / 9872 ≈ 1285 bytes per read.
  • read_bytes - Counts the data that actually had to be read from the medium to satisfy the process' read requests. So if the read data was already in the page cache, it is not counted here.
  • write_bytes - Counts the data that caused pages to be dirtied. It's similar to read_bytes, but due to the nature of write caching it can't be described so simply.
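A minimal sketch (my illustration, not from the article) that dumps these counters, for example before and after a phase of the program you want to measure:

#include <stdio.h>

static void dump_io_counters(const char *label)
{
    char line[128];
    FILE *f = fopen("/proc/self/io", "r");
    if (!f) {
        perror("fopen /proc/self/io");
        return;
    }
    printf("--- %s ---\n", label);
    while (fgets(line, sizeof line, f))   /* copy the counters verbatim */
        fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    dump_io_counters("before");
    /* ... the I/O-heavy phase you want to profile goes here ... */
    dump_io_counters("after");
    return 0;
}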

Clearing the page cache to reliably simulate a cold cache


If you expect that a cold page cache is a typical scenario for your program, it's useful to clear the page cache before measuring. In Linux the /proc/sys/vm/drop_caches file is useful for that. By writing a numerical value (1, 2 or 3) to it you force the system to clear the caches:

To free the page cache, do:
echo 1 > /proc/sys/vm/drop_caches

To free the dentry and inode caches:
echo 2 > /proc/sys/vm/drop_caches

To free both the page cache and the dentry and inode caches:
echo 3 > /proc/sys/vm/drop_caches

This can be useful for getting consistent benchmark results. Dirty pages will stay in memory even after that, so running the sync command first is also a good idea.

How to make it faster?


Now I'll present some low-level techniques that can improve I/O performance. The biggest optimizations are made at a high level: application architecture, some form of caching, different data structures and the like, but this article is about low-level details and this chapter is no exception. Still, the general optimization rules we've learned apply to I/O as well: first think about the algorithms, and once they are polished you can look at the low-level code.

Larger I/O buffers


It should be obvious by now that enlarging the I/O buffers used with functions like read() and write() is a good idea. There is of course a point above which there is no further gain in performance; you will need to experiment yourself, taking the memory requirements into consideration. Generally I don't use buffers smaller than 4 KB.
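As a minimal sketch (my example; the file names are placeholders), here is a copy loop that uses one large buffer. Vary BUF_SIZE (say 1 KB, 4 KB, 64 KB) and time the program to see where the gains level off on your system:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024)   /* try different sizes here */

int main(void)
{
    static char buf[BUF_SIZE];
    int in  = open("input.dat", O_RDONLY);
    int out = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n) {   /* short writes ignored for brevity */
            perror("write");
            return 1;
        }
    }
    if (n < 0)
        perror("read");

    close(in);
    close(out);
    return 0;
}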

When using library I/O functions like fread() or getchar(), you should remember that they buffer data internally, which is often enough to get good performance.
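If the default stdio buffer turns out to be too small, it can be enlarged with setvbuf(3). A minimal sketch (my example; the file name is a placeholder); note that setvbuf() must be called before the first operation on the stream:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("input.dat", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    /* Ask stdio for a 64 KB fully buffered stream instead of the
     * default buffer size. */
    if (setvbuf(f, NULL, _IOFBF, 64 * 1024) != 0)
        perror("setvbuf");

    int c;
    long count = 0;
    while ((c = fgetc(f)) != EOF)   /* each fgetc() now rarely triggers a syscall */
        count++;

    printf("%ld bytes read\n", count);
    fclose(f);
    return 0;
}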

Fewer system calls


I said that making a system call is costly, so avoiding system calls is good practice. One obvious technique was described above: doing larger I/O requests at a time. But there are a few nice mechanisms that can improve the situation even further:

  • Use the pread(2)/pwrite(2) functions instead of lseek(2) if you are doing random I/O on files. These functions take an additional argument, an offset in the file, and so can save one system call per operation. An example of an application that can use this is SQLite: it can be compiled with a flag that switches it to pread()/pwrite() instead of lseek(), making the database faster because all of its I/O is random.
  • Use mmap(2). This can be a nice way of accessing a file without any system call per access.
  • Use writev(2)/readv(2) if you need to write or read data to or from non-contiguous in-memory data structures. An example would be an HTTP server sending a response to a client. After request processing the response headers and content may be fragmented: the headers may sit in some linked list and the body in a separate buffer. The server can use writev(2) to send the response in one system call, without first gluing all the data together in yet another buffer (see the sketch after this list).
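A minimal sketch (my example) of the writev(2) idea: an HTTP-style header and body live in separate buffers, yet are sent with a single system call. It writes to stdout instead of a client socket just to keep the example self-contained:

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    const char header[] = "HTTP/1.0 200 OK\r\nContent-Length: 13\r\n\r\n";
    const char body[]   = "Hello, world!";

    struct iovec iov[2];
    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    /* One syscall sends both pieces, no intermediate buffer needed. */
    if (writev(STDOUT_FILENO, iov, 2) < 0)
        perror("writev");
    return 0;
}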

Tell the OS what you will be doing with the file


There is a nice (in theory) interface for optimizing disk access: posix_fadvise(2). This function is used to tell the operating system how you are going to use a file. There are various advice values, but at the time of writing many of them don't work as advertised. In practice there are two useful things this function does:

When called with the POSIX_FADV_WILLNEED flag it causes the specified file fragment (or the whole file) to be read into the page cache. The function returns immediately; the read-ahead is non-blocking. It sometimes helps a lot to do such pre-reading before accessing a file in a non-sequential order. One example is using an embedded database like SQLite: pre-reading the database file at program startup may improve its later access times a lot, because it replaces random access with a sequential read.
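A minimal sketch (my example; "data.db" is a placeholder) of such pre-reading at startup:

#define _XOPEN_SOURCE 600   /* for posix_fadvise() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.db", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* offset 0 and len 0 mean "the whole file"; the call returns
     * immediately while the kernel reads the file in the background.
     * Note: posix_fadvise() returns an error number instead of
     * setting errno. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    /* ... random-access reads on fd follow ... */
    close(fd);
    return 0;
}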

Another useful flag is POSIX_FADV_DONTNEED, which causes the specified file fragment to be removed from the page cache. It can be really helpful to tell the OS that a large file you were using will not be accessed any time soon, so it doesn't waste memory keeping it in the cache. Have you ever felt that your computer was slow just after copying a large amount of data (for example after making a backup or copying a movie to a pen drive)? That is because the copied files replaced useful things in the page cache, despite the fact that it was a one-time operation and you will not use them now. If the program that was making the backup, or the cp command, had used POSIX_FADV_DONTNEED during the operation, it would not have hurt the page cache.

If you are writing a program that uses many files and you expect that some of them are used constantly (like some indexes) and some just once (for example, a user wants to read an old email), you can use this flag to make the OS keep your constantly used files in the page cache and forget the others immediately.
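A minimal sketch (my example; the file names are placeholders) of a backup-style copy that drops its data from the cache when it finishes. Note that POSIX_FADV_DONTNEED only drops clean pages, which is why the output is flushed first:

#define _XOPEN_SOURCE 600   /* for posix_fadvise() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    static char buf[64 * 1024];
    int in  = open("big-input.dat", O_RDONLY);
    int out = open("backup.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0)
        write(out, buf, (size_t)n);        /* short writes ignored for brevity */

    fsync(out);   /* flush dirty pages so DONTNEED can actually drop them */
    posix_fadvise(in,  0, 0, POSIX_FADV_DONTNEED);   /* return values ignored */
    posix_fadvise(out, 0, 0, POSIX_FADV_DONTNEED);

    close(in);
    close(out);
    return 0;
}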

Comments

I/O monitoring in Linux

Nicely written. The higher the number of major page faults, the higher the I/O usage, so buffer caching helps a lot by turning major faults into minor ones. There are multiple tools in Linux that do the job. The article below is a useful complement for understanding I/O in Linux: http://www.slashroot.in/linux-system-io-monitoring
