Pipe in bash can be a trap!
Today a colleague at work was debugging a bash script that didn't work the way he expected. He had hit one of the traps people fall into when writing bash scripts. Let's look at code that finds the largest message sent from this computer:
    #!/bin/sh
    max=-1
    grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' | while read size
    do
        if [ $size -gt $max ]
        then
            max=$size
        fi
    done
    echo $max
This example finds all lines containing the string size=NUMBER in the mail log file and picks the greatest size. I know it can be done with one line of awk, but I wanted to show a simple example. At first glance the script looks like it should work, but it doesn't! It always displays -1. Why? The reason is simple, but not so easy to fix: a pipe starts a new process for the command on its right side. That means we start grep, then sed to process its output, and something that is not directly visible: a bash process for the while loop. This has an important consequence: that process has its own environment and its own variables. When the loop ends, the child bash process that executed it is destroyed, and the final echo sees its own $max variable, which is different from the $max variable of the child bash process. Our result disappears once the loop has processed all the lines.
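The effect is easy to reproduce without a mail log. The sketch below is only an illustration: the three numbers are made-up sample input, and the variable names piped and redirected are hypothetical. It assumes bash with its default settings; some other shells (ksh, zsh) run the last part of a pipeline in the main process and would behave differently.

```shell
# Made-up sample input stands in for the grep | sed output.

# Case 1: the loop is the right-hand side of a pipe, so it runs in a child process.
piped=0
printf '10\n20\n30\n' | while read -r n
do
    piped=$((piped + 1))    # changes only the child's copy of the variable
done
echo "counter after the piped loop: $piped"

# Case 2: the same loop reads from a redirection, so it stays in the main process.
redirected=0
while read -r n
do
    redirected=$((redirected + 1))
done <<EOF
10
20
30
EOF
echo "counter after the redirected loop: $redirected"
```

In bash the first counter is still 0 after the loop, while the second one has counted all three lines.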
Solution 1
How can we fix it? The simplest (but not perfect) solution, in cases where you need to access a variable set in the body of the loop, is to replace while with for:
    for size in `grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/'`
    do
        if [ $size -gt $max ]
        then
            max=$size
        fi
    done
This does exactly the same thing, but in a different way, and it has some limitations. It works because no variables are assigned in subprocesses. We run grep and sed in a subprocess, but their output is substituted as the arguments of bash's for command, and the loop is executed in the main bash process. The bad news is that the number of arguments for a command is limited; on my system it's about 98000. Besides that, it's not memory efficient to build a list of all mail sizes and then process them, instead of processing just the current number.
Solution 2
To really solve this problem we must rewrite the script:
    max=-1
    while read line
    do
        size=`echo "$line" | grep 'size=' | sed -e 's/.*size=\([0-9]*\).*/\1/'`
        if [ ! "$size" ]
        then
            continue;
        fi
        if [ $size -gt $max ]
        then
            max=$size
        fi
    done < /var/log/mail.log
    echo $max
Now the loop runs in the main process, and there is no limit on the size of the input. The rewrite is sometimes not trivial, and you lose performance: this way we run grep and sed for every input line rather than once for the whole file, so the script is much slower.
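One possible way to soften the performance cost, offered as a sketch and not as part of the original script, is to extract the number with the shell's own pattern matching so that no grep or sed is started per line. The log lines below are hypothetical sample data written to a temporary file; the real script would read /var/log/mail.log instead.

```shell
# Hypothetical sample log; the real script reads /var/log/mail.log.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan  1 10:00:00 host postfix/qmgr[1]: A1: from=<a@example.com>, size=1200, nrcpt=1
Jan  1 10:00:01 host postfix/smtp[2]: A1: to=<b@example.com>, status=sent
Jan  1 10:00:02 host postfix/qmgr[1]: B2: from=<a@example.com>, size=5321, nrcpt=1
Jan  1 10:00:03 host postfix/qmgr[1]: C3: from=<a@example.com>, size=87, nrcpt=2
EOF

max=-1
while IFS= read -r line
do
    case $line in
        *size=*) ;;            # the line mentions a size, keep going
        *) continue ;;         # no size on this line, skip it
    esac
    size=${line##*size=}       # drop everything up to the last "size="
    size=${size%%[!0-9]*}      # drop everything from the first non-digit onwards
    [ -n "$size" ] || continue # "size=" was not followed by a number
    if [ "$size" -gt "$max" ]
    then
        max=$size
    fi
done < "$log"
rm -f "$log"
echo "$max"
```

The loop still runs in the main process because it reads from a redirection, and the two parameter expansions play the role of the sed expression.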
Solution 3
Using a trick we can also rewrite the first script this way:
    max=-1
    max=`
    grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' | while read size
    do
        if [ $size -gt $max ]
        then
            max=$size
            echo $max
        fi
    done | tail -n 1
    `
    echo $max
It's as fast as the first one and it works: we have the maximum size in the $max variable. This solution is trouble-free only when there is a single variable to assign.
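A variation on the same trick, sketched here with made-up sizes in place of the grep and sed output, is to group the loop and a single echo in braces so that both run in the same child process. The result is then printed once, and tail is not needed.

```shell
# Made-up sizes stand in for the grep | sed output.
max=$(
    printf '1200\n5321\n87\n' | {
        max=-1
        while read -r size
        do
            if [ "$size" -gt "$max" ]
            then
                max=$size
            fi
        done
        echo "$max"    # still inside the child process, so it sees the final value
    }
)
echo "$max"
```

As with the version above, only what the child process prints comes back to the main shell, so this is still convenient for one value at a time.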
Summary
Bash is a nice shell that lets you write complex scripts, but you must always remember that it's just a shell, not a programming language, and it has funny limitations.
Comments
:-D
grep 'size=' /var/log/mail.log | sed -e 's/.*size=\([0-9]*\).*/\1/' | sort -n | tail -1
Think functional
Another way
How about process substitution?
zsh, ksh shell
I've check ksh and it seems
bash can do that