From: Andrew Tridgell 
Subject: tip of the week (strace & ltrace)
Date:   Tue, 9 Nov 1999 01:53:26 +1100

I can just imagine the jokes about "tip of the year" given the time
since my last tip :)

This tip is courtesy of a question asked by our ISP neighbors in
Canberra. One of the ISP's engineers came in today asking why it was
really slow for some of their customers to connect to the SMTP port on
their servers while other clients got in instantly. He knew about
using telnet to the SMTP port and he found that "telnet server 25"
took about 30 seconds to give the sendmail banner back from this one
client whereas it took less than a second for all other clients. 

While we knew what the likely answer was (and luckily we were right!)
the more interesting thing is how to diagnose this sort of
problem. The tools that immediately spring to mind are strace, ltrace,
tcpdump and netcat. 

Let's start with strace. strace allows you to look at all system calls
made by a process. Combine this with a little bit of knowledge of
programming and you have an amazingly useful tool for diagnosing just
about any sort of problem on a Linux box. In this case we ran:

  strace -t -p 3128 -f -o trace.out

where 3128 was the PID of the main sendmail process (use "ps axf" to
find that. I assume you know about the f flag to ps? Very useful ...).

The above sets -t, which prefixes each system call with the time of
day to the nearest second (as we are looking for a timing problem), -f
so that we "follow" child processes, as sendmail tends to fork a child
for each connection, and -o to send output to a file.

Then connect to the sendmail daemon from the misbehaving client and
control-C the strace after the problem has happened. The trace.out
shows a big wait during a connect to port 113 on the client. We have
the answer! The problem was that the client was dropping all packets
to the auth port using a firewall rule. It should have been using
"reject" instead so that sendmail would get an immediate connection
refused rather than a 30 second wait.
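
Spotting the big wait by eye is easy in a short trace, but a busy
server produces a lot of output. Here is a rough sketch of how the
search could be automated. The function name slow_calls and the 5
second default are made up for illustration, and it assumes the trace
was written with exactly the -t -f -o options shown above, so that
every line starts with a PID followed by HH:MM:SS.

```shell
# slow_calls FILE [MIN_SECONDS]
# Scan a trace written by "strace -t -f -o FILE" and print every call that was
# followed by a pause of at least MIN_SECONDS (default 5) in the same process.
# slow_calls is a hypothetical helper name, not part of strace.
slow_calls() {
  awk -v min="${2:-5}" '
    # Lines look like: PID HH:MM:SS syscall(args) = result
    $2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
      split($2, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if ($1 in last) {
        gap = now - last[$1]
        if (gap < 0) gap += 86400      # the trace ran past midnight
        if (gap >= min) print gap "s after: " line[$1]
      }
      last[$1] = now
      line[$1] = $0
    }
  ' "$1"
}
```

Running "slow_calls trace.out 10" should point at the connect to port
113 in a trace like ours. Since -t records when each call starts, the
call printed is the one before the pause. strace also has -r and -T
options that report relative times and time spent in each call, which
may make a script like this unnecessary.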

If this has whetted your appetite for strace then I suggest that you
start playing with it now. Try "strace ls" and a few other progs. Get
used to the syntax of the output and how errors are handled. strace is
a fantastic tool but it isn't one that you can learn to use well in 5
minutes when you have a broken server and need to fix it. It is
something you need to get used to.
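
One way to get used to how errors look: a failed system call shows up
in the trace as "= -1" followed by the errno name and a description.
The sketch below counts those lines per call and errno. The name
list_errors is made up for illustration, and it does not try to handle
the "unfinished"/"resumed" line pairs that -f can produce.

```shell
# list_errors FILE
# Count failed calls in an strace output file, grouped by syscall name and
# errno name, most frequent first (for example "2 open ENOENT").
# list_errors is a hypothetical helper name, not part of strace.
list_errors() {
  awk '
    / = -1 E[A-Z0-9]+ / {
      call = $0
      sub(/\(.*/, "", call)            # keep only the text before the first "("
      n = split(call, words, " ")      # last word is the syscall name
      m = split($0, parts, " = -1 ")   # text after the final "= -1"
      split(parts[m], err, " ")        # its first word is the errno name
      count[words[n] " " err[1]]++
    }
    END { for (k in count) print count[k], k }
  ' "$1" | sort -k1,1nr -k2
}
```

Try it with "strace -o ls.out ls" followed by "list_errors ls.out".
A surprising number of failed calls are perfectly normal, which is
part of what you learn by playing with strace on healthy programs.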

So what about those other tools I mentioned? 

Well, ltrace would have been really useful if the strace hadn't
already solved the problem. ltrace is like strace but it shows all
library calls instead of system calls. That's fantastic when you want to
see the gory details of how a program is working. It is also very
instructive to do an strace of ltrace to work out how ltrace works -
it's a really neat trick.

The other tool I used was netcat. netcat was used to confirm our
diagnosis by reproducing the problem. I ran the following as root 
on a spare host:
  nc -l -p 113
then did a "telnet server 25" from that host. We immediately saw the
ident connection from sendmail to the client and sendmail didn't give
its banner. As soon as we did a control-C on nc we got the banner from
sendmail. So we'd reproduced the problem.
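
For the curious, what sendmail was waiting for on port 113 is an ident
(RFC 1413) reply. The query is a pair of port numbers on one line, and
the reply repeats them followed by either a user ID or an error. A
minimal sketch of building an error reply is below. The function name
ident_error_reply is made up for illustration, and NO-USER is just one
of the error types the RFC defines.

```shell
# ident_error_reply
# Read one ident query ("<port> , <port>") from stdin and print an RFC 1413
# error reply for it, terminated by CRLF.
# ident_error_reply is a hypothetical helper name.
ident_error_reply() {
  # Accept a final line even if it lacks a trailing newline.
  IFS= read -r query || [ -n "$query" ] || return 1
  # Queries arrive with CRLF line endings; drop the carriage return.
  query=$(printf '%s' "$query" | tr -d '\r')
  printf '%s : ERROR : NO-USER\r\n' "$query"
}
```

For example, "echo '6193, 23' | ident_error_reply" prints
"6193, 23 : ERROR : NO-USER". In principle, typing a line like that
into the nc session should let sendmail carry on without waiting for
the control-C, though that is an expectation rather than something
tested here.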

netcat is a really great tool. I suggest that everyone gets it now and
reads the docs that come with it. It has a thousand uses (some of
which could get you in big trouble!). One of my favorites is teaching
people protocols by making them type them manually. Try this:
  nc -l -p 8080
then connect to localhost:8080 with your favorite browser and see if
you can type in some HTML from memory. Cute hey?
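
If you get stuck on what to type, here is a sketch of a minimal reply
a browser should accept: a status line, a couple of headers, a blank
line, then the HTML. The function name http_response is made up for
illustration, and the choice of HTTP/1.0 with these two headers is an
assumption about what is enough, not a complete server.

```shell
# http_response BODY
# Print a minimal HTTP/1.0 response carrying BODY as an HTML document.
# http_response is a hypothetical helper name.
http_response() {
  body=$1
  # Count bytes, not characters; the arithmetic expansion strips any padding
  # that wc adds on some systems.
  length=$(( $(printf '%s' "$body" | wc -c) ))
  printf 'HTTP/1.0 200 OK\r\n'
  printf 'Content-Type: text/html\r\n'
  printf 'Content-Length: %s\r\n' "$length"
  printf '\r\n'                        # blank line ends the headers
  printf '%s' "$body"
}
```

To cheat instead of typing, pipe it into the listener:
"http_response '<h1>hello</h1>' | nc -l -p 8080". Note that some
netcat variants want "nc -l 8080" without the -p.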

Finally, I mentioned tcpdump. That's one that everyone should know how
to use. tcpdump is a packet sniffer and we could have solved the
sendmail/ident problem by just sniffing the connection between the two
hosts and noticing the SYN from the server to the client on TCP port
113 that never gets a reply. strace is more fun :)
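
For completeness, a sketch of the tcpdump approach. The helper below
only builds the filter expression; the name ident_filter and the
example host name are made up for illustration.

```shell
# ident_filter CLIENT
# Print a pcap filter matching ident (TCP port 113) traffic to or from CLIENT.
# ident_filter is a hypothetical helper name, not part of tcpdump.
ident_filter() {
  printf 'host %s and tcp port 113\n' "$1"
}
```

Run it on the server as root with something like
'tcpdump -n "$(ident_filter client)"', where "client" stands for the
misbehaving host. Repeated SYN packets going out with nothing coming
back would suggest the packets are being dropped.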

Cheers, Tridge

PS: rsync 2.3.2 is out.


