Part of my master’s project work involves using network measurement tools to gather information about a path to a website. One useful type of tool that I don’t believe is used very often is a bandwidth estimation tool. These types of tools employ one of a variety of methods to estimate the available bandwidth between each TTL hop along a given router path to a host. To learn more about these tools, including clink, I recommend reading “Creating a Bandwidth Estimation Testbed Summer 2001 Status Report.”
One of these tools, clink, was written by Allen Downey and makes significant improvements over Van Jacobson’s similar tool, pathchar. Unfortunately, I noticed a problem where clink seemed to hang on certain hosts. I don’t believe I am alone in reporting this problem. In “Measuring Bandwidth between PlanetLab Nodes” (PDF Link), published in the proceedings of PAM 2005 – Passive & Active Measurement Workshop, the researchers noticed that clink would hang on PlanetLab’s machines and attributed the hang to a possible Linux kernel version problem. It is possible that the kernel is the cause, but I ran into another situation where clink appeared to hang, and it might explain their hang as well.
When clink experiences a timeout on a probe to a TTL hop, it simply retries the probe. If the router has been set up not to respond to UDP probes, as many routers on today’s Internet are, clink will keep probing that router forever with no success. To the end user this looks like a hang, but a tcpdump will confirm that clink is still firing off the same UDP probe packet over and over. When clink was written in 1998-99, many routers were configured to (nicely) respond to a probe, but this is no longer the case.
Because I found clink’s bandwidth estimation using the even-odd technique, as described in the SIGCOMM paper, to be the best available, I rewrote part of the code to fix the infinite loop bug caused by router timeouts. I introduced two new program arguments. The first is a maximum number of retries per probe, and the second is a maximum number of probe failures per TTL hop. With the first argument, a probe of a specific size against a specific TTL hop can be retried multiple times before the probe is declared a failure. Then, if the number of probe failures on a specific TTL hop exceeds the second argument, the TTL hop is marked as failed and skipped. Clink then goes on to measure the rest of the hops.
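To make the two limits concrete, here is a rough sketch of the retry logic in Python. It is not the actual patch (clink itself is written in C), and every name in it (`measure_hop`, `measure_path`, `max_retries`, `max_failures`, `send_probe`, `fake_probe`) is made up for illustration. It also assumes that a "retry" means an extra attempt after the first one, and that a probe function returns a round-trip time, or `None` on timeout.

```python
from typing import Callable, Dict, List, Optional

# A probe takes (ttl, size) and returns an RTT in seconds, or None on timeout.
# This signature is an assumption for the sketch, not clink's real interface.
Probe = Callable[[int, int], Optional[float]]


def measure_hop(send_probe: Probe, ttl: int, sizes: List[int],
                max_retries: int, max_failures: int) -> Optional[List[Optional[float]]]:
    """Probe one TTL hop with each packet size.

    Returns the RTTs collected (None for a probe that failed), or None
    if the hop exceeded max_failures and should be skipped.
    """
    failures = 0
    rtts: List[Optional[float]] = []
    for size in sizes:
        rtt = None
        # One initial attempt plus up to max_retries retries.
        for _ in range(1 + max_retries):
            rtt = send_probe(ttl, size)
            if rtt is not None:
                break
        if rtt is None:
            failures += 1
            if failures > max_failures:
                return None  # give up on this hop instead of looping forever
        rtts.append(rtt)
    return rtts


def measure_path(send_probe: Probe, max_ttl: int, sizes: List[int],
                 max_retries: int, max_failures: int) -> Dict[int, Optional[List[Optional[float]]]]:
    """Measure every hop; a failed hop maps to None and later hops still run."""
    return {ttl: measure_hop(send_probe, ttl, sizes, max_retries, max_failures)
            for ttl in range(1, max_ttl + 1)}


def fake_probe(ttl: int, size: int) -> Optional[float]:
    """Simulated network: the router at hop 2 never answers."""
    return None if ttl == 2 else 0.001 * size


if __name__ == "__main__":
    print(measure_path(fake_probe, 3, [64, 128, 256], max_retries=2, max_failures=1))
```

With the simulated silent router at hop 2, that hop is reported as failed while hops 1 and 3 are still measured, which is the behavior described above.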
I am not publishing the code patches yet as I am still testing them, but if you are interested in taking a peek, please comment and I’ll email you a copy.