Pipes in bash can be a trap!

Today a colleague at work tried to debug a bash script that didn't work as he expected. He hit one of the traps people fall into when writing bash scripts. Let's look at code that finds the largest message sent from this computer:

    #!/bin/sh

    max=-1

    grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' |
    while read size
    do
        if [ $size -gt $max ]
        then
            max=$size
        fi
    done

    echo $max

This example finds all lines containing a size=NUMBER string in the mail log file and reports the greatest size. I know it can be done with one line of awk, but I wanted to show a simple example.

At first glance the script looks like it should work, but it doesn't! It always displays -1. Why? The reason is simple, but not so easy to fix: bash runs every command of a pipeline in a separate process. That means we start grep, then sed to process its output, and something that is not directly visible: a child bash process (a subshell) for the while loop. This has an important consequence: the subshell has its own environment and its own copy of the variables. When the loop ends, the subshell that executed it is destroyed, and the final echo sees the parent's $max variable, which is separate from the $max variable the subshell updated. Our result disappears as soon as the loop has processed all lines.
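To watch the value disappear, here is a minimal sketch of the same loop that runs without a mail log. The three input lines in $sample are made up for illustration; only the pipeline is taken from the script above.

```shell
#!/bin/sh
# Made-up sample input standing in for /var/log/mail.log.
sample='from=<a@example.com>, size=120, nrcpt=1
from=<b@example.com>, size=4096, nrcpt=1
from=<c@example.com>, size=300, nrcpt=2'

max=-1

printf '%s\n' "$sample" | sed -e 's/.*size=\([0-9]*\).*/\1/' |
while read -r size
do
    if [ "$size" -gt "$max" ]
    then
        max=$size
        # This runs in the subshell, where the assignment is visible.
        echo "inside the loop: max=$max"
    fi
done

# Back in the parent shell, which never saw the assignments.
echo "after the loop: max=$max"
```

Run with bash (default settings) or dash, the lines printed inside the loop show max growing to 4096, while the last line still reports max=-1.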

Solution 1


How can we fix it? The simplest (but not perfect) solution, when you find that you need a variable set in the loop body after the loop ends, is to replace while with for:

    for size in `grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/'`
    do
        if [ $size -gt $max ]
        then
            max=$size
        fi
    done

This does the same thing, but in a different way, and it has some limitations. It works because no variable is assigned in a subprocess: grep and sed still run in subprocesses, but their output is substituted into the word list of the for command, and the loop itself executes in the main shell process. The bad news is that the shell has to collect the whole output of the pipeline and split it into words before the first iteration starts. The kernel limit on argument length applies only when an external program is executed, so it does not cap a for list, but it is still not memory efficient to build a list of all mail sizes and then process them instead of handling one number at a time.

Solution 2


To really solve this problem we must rewrite the script:

    max=-1

    while read line
    do
        size=`echo "$line" | grep 'size=' | sed -e 's/.*size=\([0-9]*\).*/\1/'`
        if [ ! "$size" ]
        then
            continue;
        fi

        if [ $size -gt $max ]
        then
            max=$size
        fi
    done < /var/log/mail.log

    echo $max

Now the loop runs in the main process and there is no limit on the size of the input. The rewrite is sometimes not trivial, and you lose performance: this way we run grep and sed once for every input line instead of once for the whole file, so the script is much slower.
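Most of that cost comes from starting external processes for every line. As a sketch only, and not part of the original script, the same extraction can be done with shell builtins (a case pattern plus POSIX parameter expansion), which keeps the loop in the main process without calling grep or sed at all. The log lines in the here-document are invented sample data; in a real script the loop would end with done < /var/log/mail.log instead.

```shell
#!/bin/sh
max=-1

while IFS= read -r line
do
    # Skip lines that do not contain "size=" followed by a digit.
    case $line in
        *size=[0-9]*) ;;
        *) continue ;;
    esac

    size=${line##*size=}     # remove everything up to the last "size="
    size=${size%%[!0-9]*}    # remove everything from the first non-digit on

    # The last "size=" on the line may not be followed by digits.
    [ -n "$size" ] || continue

    if [ "$size" -gt "$max" ]
    then
        max=$size
    fi
done <<'EOF'
Jan  1 10:00:01 host postfix/qmgr[101]: A1: from=<a@example.com>, size=120, nrcpt=1
Jan  1 10:00:02 host postfix/smtp[102]: A1: to=<b@example.com>, status=sent
Jan  1 10:00:03 host postfix/qmgr[101]: B2: from=<c@example.com>, size=4096, nrcpt=1
Jan  1 10:00:04 host postfix/qmgr[101]: C3: from=<d@example.com>, size=300, nrcpt=2
EOF

echo "$max"
```

Like the sed expression, ${line##*size=} picks the last size= on the line, so the two variants should agree on which number they extract.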

Solution 3


Using a trick we can also rewrite the first script this way:

    max=-1

    max=`
    grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' |
    while read size
    do
        if [ $size -gt $max ]
        then
            max=$size
            echo $max
        fi
    done | tail -n 1
    `

    echo $max

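It's as fast as the first one and it works: we have the maximum size in the $max variable. This solution is only convenient when there is a single variable to bring out of the loop. Note also that if no line matches, the command substitution produces nothing and $max ends up empty instead of -1.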
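If more than one value has to leave the loop, one possible extension of the same trick is to wrap the loop in a { ...; } group so that a final echo runs in the same subshell as the loop, and then split the single output line in the main shell. The sketch below is an illustration under that assumption; the count variable and the sample input are made up for the example.

```shell
#!/bin/sh
# Made-up sample input standing in for the grep output.
sample='size=120
size=4096
size=300'

result=$(
    printf '%s\n' "$sample" | sed -e 's/.*size=\([0-9]*\).*/\1/' |
    {
        max=-1
        count=0
        while read -r size
        do
            count=$((count + 1))
            if [ "$size" -gt "$max" ]
            then
                max=$size
            fi
        done
        # Still inside the same subshell as the loop, so both values are visible.
        echo "$max $count"
    }
)

# Split the single output line back into two variables in the main shell.
read -r max count <<EOF
$result
EOF

echo "max=$max count=$count"
```

Because the echo always runs, this variant also reports -1 and 0 when there is no input instead of leaving the variables empty.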

Summary


Bash is a nice shell that lets you write complex scripts, but you must always remember that it's just a shell, not a programming language, and it has funny limitations.

Comments

:-D

grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' | sort -n | tail -1

Think functional

If you only need the maximum size, a more elegant solution would be:
function max {
    read max_value
    while read value; do
        if [ $value -gt $max_value ]; then
            max_value=$value
        fi
    done
    echo $max_value
}

max_size=$(grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' | max)

Another way

#!/bin/bash

max=-1

while read size
do
    if [ ! "$size" ]
    then
        continue;
    fi

    if [ $size -gt $max ]
    then
        max=$size
    fi
done <

How about process substitution?

Nice article. I've been digging for hours for a solution to this problem. But what about this one, using process substitution? Seems closest to the initial attempt, but the while block doesn't run in a subshell:
#!/bin/bash

max=-1
    
while read size
do
    if [ ! "$size" ]
    then
        continue;
    fi
    
    if [ $size -gt $max ]
    then
        max=$size
    fi
done < <(grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/')
#      ^^^ process substitution in bash, not sh mode.
#      See http://www.gnu.org/software/bash/manual/bashref.html#Process-Substitution
#      See http://tldp.org/LDP/abs/html/process-sub.html

echo $max

zsh, ksh shell

When using zsh or ksh, the first script works. Can you explain to me why?

I've checked ksh and it seems

I've checked ksh and it seems it doesn't execute the while loop in a separate process. That may be because of this (from man ksh, about pipes): "Each command, except possibly the last, is run as a separate process". The bash manual page explains that each command in a pipeline is executed as a separate process.

bash can do that

shopt -s lastpipe
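
A minimal sketch of how that option could be applied to the first script, assuming bash 4.2 or later. lastpipe only takes effect when job control is off, which is the default in a script but not at an interactive prompt. The input lines are made up for the example.

```shell
#!/bin/bash
# Run the last command of a pipeline in the current shell (bash 4.2+).
shopt -s lastpipe

max=-1

printf 'size=120\nsize=4096\nsize=300\n' | sed -e 's/.*size=\([0-9]*\).*/\1/' |
while read -r size
do
    if [ "$size" -gt "$max" ]
    then
        max=$size
    fi
done

# The while loop ran in this shell, so the assignment survives.
echo "$max"
```

Note that the shebang has to be bash here, not sh, because shopt is a bash builtin.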