Tech Support Notes

MTR / Traceroute

MTR

MTR (or Matt's traceroute) combines the functionality of the traceroute and ping programs in a single network diagnostic tool. To have mtr run the test and then print a report, use the -r (--report) flag. In report mode the default is to send 10 packets; to send more, use the -c flag. You can also tell mtr to skip hostname lookups with the -n (--no-dns) flag.

Basic Usage

Here is an example of a 50-count MTR to google.com that skips hostname lookups:

[gitlab] (~) >>>  mtr -c 50 --no-dns --report google.com
HOST: gitlab                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 208.43.6.65                0.0%    50    0.3   1.8   0.3  42.8   6.1
  2.|-- 66.228.118.213             0.0%    50    0.3   0.4   0.3   8.2   1.1
  3.|-- 173.192.18.210             0.0%    50    0.2   0.2   0.2   0.2   0.0
  4.|-- 50.97.16.37                0.0%    50    0.3   3.0   0.3  47.9   7.4
  5.|-- 72.14.233.85               0.0%    50    8.1  14.0   0.3  42.9  12.3
  6.|-- 209.85.254.133             0.0%    50    0.6   0.6   0.6   0.8   0.0
  7.|-- 173.194.115.70             0.0%    50    0.3   0.3   0.3   0.3   0.0

Without the --report (-r) option, mtr will run continuously in an interactive environment. The interactive mode reflects current round trip times to each host. In most cases, the --report mode provides sufficient data in a useful format.
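Because --report output is plain text in fixed columns, it is also easy to post-process. The following is a minimal Python sketch, not part of mtr itself, that parses one hop line from a report like the one above. The names Hop, HOP_RE and parse_hop are made up for illustration, and the pattern assumes the default report layout shown in these notes.

```python
import re
from typing import NamedTuple, Optional


class Hop(NamedTuple):
    number: int
    host: str
    loss_pct: float
    sent: int
    last: float
    avg: float
    best: float
    worst: float
    stdev: float


# Matches lines such as "  5.|-- 72.14.233.85   0.0%  50  8.1  14.0  0.3  42.9  12.3".
# The "%" is optional because 100% loss is printed as "100.0" in these reports.
HOP_RE = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_hop(line: str) -> Optional[Hop]:
    """Return a Hop for a report hop line, or None for headers and other lines."""
    m = HOP_RE.match(line)
    if m is None:
        return None
    number, host, loss, sent, *times = m.groups()
    return Hop(int(number), host, float(loss), int(sent), *map(float, times))


hop = parse_hop("  5.|-- 72.14.233.85               0.0%    50    8.1  14.0   0.3  42.9  12.3")
print(hop)
```

Lines that are not hop lines, such as the HOST: header, return None, so the function can be applied to every line of a saved report.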

The Loss% column shows the percentage of packet loss at each hop. The Snt column counts the number of packets sent.

The next four columns, Last, Avg, Best, and Wrst, are all measurements of latency in milliseconds (ms). Last is the latency of the last packet sent, Avg is the average latency of all packets, and Best and Wrst display the best (shortest) and worst (longest) round trip time for a packet to this host. In most cases, the average (Avg) column should be the focus of your attention.

The final column, StDev, provides the standard deviation of the latencies to each host. The higher the standard deviation, the greater the difference is between measurements of latency.
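To make the column definitions concrete, here is a small Python sketch that derives the same figures from a list of round trip times for one hop. It is an illustration only: the function name summarize is made up, the sample values are invented, and the use of the population standard deviation is an assumption, so mtr's own numbers may differ slightly.

```python
import statistics
from typing import Dict, List, Optional


def summarize(rtts_ms: List[Optional[float]]) -> Dict[str, float]:
    """Summarize one hop's probes; None marks a probe that got no reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    row = {"Loss%": 100.0 * (sent - len(replies)) / sent, "Snt": sent}
    if not replies:
        # No replies at all: report zeros, as in the "???" rows of a report.
        row.update({"Last": 0.0, "Avg": 0.0, "Best": 0.0, "Wrst": 0.0, "StDev": 0.0})
        return row
    row.update({
        "Last": replies[-1],                    # most recent reply
        "Avg": statistics.mean(replies),        # mean round trip time
        "Best": min(replies),                   # shortest round trip time
        "Wrst": max(replies),                   # longest round trip time
        "StDev": statistics.pstdev(replies),    # assumption: population std dev
    })
    return row


# Invented sample: four probes, the third one unanswered.
row = summarize([1.0, 2.0, None, 3.0])
print(row)
```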

Verifying Loss/Latency

When analyzing the output of an MTR report, you are looking for two things: loss and latency. First, we'll discuss loss. If you see a percentage of loss at any particular hop, that may be an indication of a problem with that particular router, but not necessarily. It is common practice among some service providers to rate limit the ICMP protocol and de-prioritize ICMP packets, which can give the illusion of packet loss when there is in fact none. To figure out if the loss is legitimate, look at the subsequent hop. If the next hop shows 0% loss, you can be relatively sure that the reported loss is simply due to ICMP rate limiting. This can be seen at hop 4 in the following example:

[gitlab] (~) >>>  mtr -c 50 -r  example.com
HOST: gitlab                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 208.43.6.65-static.revers  0.0%    50    0.5   2.8   0.3  49.3   7.8
  2.|-- ae14.dar02.sr01.dal01.net  0.0%    50    9.6   0.4   0.2   9.6   1.3
  3.|-- ae14.bbr01.eq01.dal03.net  0.0%    50    0.3   0.5   0.3   9.8   1.3
  4.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
  5.|-- 108-161-251-64.edgecastcd  0.0%    50    0.7   4.8   0.5  31.5   7.7
  6.|-- 93.184.216.119             0.0%    50    0.7   0.6   0.5   1.1   0.1

If the loss were actual loss, the output would look more like this:

[gitlab] (~) >>> mtr -c 50 -r -n 192.168.1.100
HOST: gitlab                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 208.43.6.65                0.0%    50    0.4   1.9   0.3  42.5   6.1
  2.|-- 66.228.118.213             0.0%    50    0.3   0.3   0.3   2.1   0.3
  3.|-- 173.192.18.254             0.0%    50    0.3   1.4   0.3  34.4   5.1
  4.|-- 80.239.195.177             0.0%    50    0.3   0.8   0.3  27.2   3.8
  5.|-- 192.205.37.49              0.0%    50    1.7   3.1   1.1   6.3   1.3
  6.|-- 12.122.139.18              0.0%    50    8.3   8.5   6.5  10.4   1.1
  7.|-- 12.122.28.158              0.0%    50    9.9   8.9   7.3  10.9   1.1
  8.|-- 12.122.103.113             0.0%    50    7.2   8.3   6.5  10.3   1.1
  9.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
 10.|-- 71.144.128.226            96.0%    50    7.6   7.7   7.6   7.9   0.2
 11.|-- 71.144.128.123            98.0%    50    7.4   7.4   7.4   7.4   0.0
 12.|-- 192.168.1.254             84.0%    50    9.8   9.6   8.3  11.7   1.2
 13.|-- ???                       100.0    50    0.0   0.0   0.0   0.0   0.0
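The next-hop check described above can be sketched in a few lines of Python. This is only a rough heuristic under the assumptions stated in these notes, and the function name classify_loss and its labels are made up for illustration. The input is the Loss% column of a report, one value per hop.

```python
from typing import List


def classify_loss(loss_by_hop: List[float]) -> List[str]:
    """Label each hop "ok", "likely rate limited", or "possible real loss"."""
    labels = []
    for i, loss in enumerate(loss_by_hop):
        if loss == 0.0:
            labels.append("ok")
        elif i + 1 < len(loss_by_hop) and loss_by_hop[i + 1] == 0.0:
            # The next hop answered every probe, so traffic passed through this hop.
            labels.append("likely rate limited")
        else:
            # Loss continues at the next hop, or this is the last hop.
            labels.append("possible real loss")
    return labels


# Loss% columns copied from the two reports above.
first_report = [0.0, 0.0, 0.0, 100.0, 0.0, 0.0]
second_report = [0.0] * 8 + [100.0, 96.0, 98.0, 84.0, 100.0]

print(classify_loss(first_report))
print(classify_loss(second_report))
```

In the first report only hop 4 is flagged, and as likely rate limiting. In the second, every hop from 9 onward is flagged as possible real loss because the loss carries through to the end of the route.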

The other thing we're interested in is latency. Because of physical distance, latency generally increases with the number of hops in a route, but the increases should be consistent and roughly linear. High latency can also be caused by a problem with the return route. In the following report, latency rises sharply partway along the route:

[saturn] (~) >>> mtr -r www.google.com
Start: Mon Apr 21 11:48:38 2014
HOST: saturn                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- homeportal                 0.0%    10    4.0   1.9   1.2   4.5   1.2
  2.|-- 70-138-176-2.lightspeed.h 10.0%    10  102.3  36.4  22.9 102.3  26.8
  3.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
  4.|-- 71.144.128.122            60.0%    10   23.7  23.5  23.0  23.9   0.0
  5.|-- 72.14.233.56               0.0%    10  390.6 360.4 342.1 396.7   0.2
  6.|-- 209.85.254.247             0.0%    10  391.6 360.4 342.1 396.7   0.4
  7.|-- 64.233.174.46              0.0%    10  391.8 360.4 342.1 396.7   2.1
  8.|-- gw-in-f147.1e100.net       0.0%    10  392.0 360.4 342.1 396.7   1.2

As the report above shows, latency jumps sharply between hops 4 and 5 and stays high for the rest of the route. Because round trip times remain high after the fourth hop, this may point to a network latency issue. The report alone cannot determine the actual cause of the spike.
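Spotting such a jump can also be done mechanically. The Python sketch below is one possible approach, not a standard tool: it scans the Avg column and reports the largest rise between consecutive hops that replied. The function name largest_jump is made up, and hops that never replied are passed as None so their 0.0 placeholder does not distort the comparison.

```python
from typing import List, Optional, Tuple


def largest_jump(avgs: List[Optional[float]]) -> Optional[Tuple[int, int, float]]:
    """Return (from_hop, to_hop, increase_ms) for the biggest rise in Avg.

    Hops are numbered from 1. None marks a hop that never replied; it is skipped.
    """
    best = None
    prev_hop, prev_avg = None, None
    for hop, avg in enumerate(avgs, start=1):
        if avg is None:
            continue
        if prev_avg is not None:
            rise = avg - prev_avg
            if best is None or rise > best[2]:
                best = (prev_hop, hop, rise)
        prev_hop, prev_avg = hop, avg
    return best


# Avg column of the report above; hop 3 ("???") never replied.
report_avgs = [1.9, 36.4, None, 23.5, 360.4, 360.4, 360.4, 360.4]
jump = largest_jump(report_avgs)
print(jump)
```

For this report the largest rise is between hops 4 and 5, which matches the reading above.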

When troubleshooting either loss or latency, it is best to get a 300-count bi-directional MTR (one report run from each end toward the other), since packets can take completely different routes to and from a particular destination.

Other caveats/notes

Timeouts can happen for various reasons. Some routers discard ICMP packets, and the missing replies are shown in the output as timeouts (???). Timeouts are not necessarily an indication of packet loss: packets may still reach their destination without significant loss or latency. Timeouts may be attributable to routers dropping packets for QoS (quality of service) purposes, or to an issue with the return route.

traceroute

traceroute attempts to trace the route an IP packet would follow to some internet host by launching probe packets with a small TTL (time to live) and then listening for an ICMP "time exceeded" reply from a gateway. It starts its probes with a TTL of one and increases it by one until it gets an ICMP "port unreachable" (or TCP reset), which means the probe reached the host, or until it hits a maximum (which defaults to 30 hops).
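The TTL loop described above can be illustrated with a small Python simulation. No packets are sent: send_probe is a hypothetical stand-in that pretends each router on a made-up path decrements the TTL, and the addresses are documentation-only examples. It is a sketch of the idea, not how traceroute is implemented.

```python
from typing import List, Tuple

MAX_HOPS = 30  # traceroute's default hop limit, as stated above


def send_probe(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe over `path` (routers in order, destination last).

    The router at which the TTL runs out answers "time exceeded";
    a probe with enough TTL reaches the destination, which answers
    "port unreachable".
    """
    if ttl < len(path):
        return ("time exceeded", path[ttl - 1])
    return ("port unreachable", path[-1])


def trace(path: List[str], max_hops: int = MAX_HOPS) -> List[Tuple[int, str]]:
    """Probe with TTL 1, 2, 3, ... until the destination answers or max_hops is hit."""
    hops = []
    for ttl in range(1, max_hops + 1):
        reply, sender = send_probe(path, ttl)
        hops.append((ttl, sender))
        if reply == "port unreachable":
            break  # the destination itself answered
    return hops


# Hypothetical path: two routers, then the destination.
route = trace(["192.0.2.1", "198.51.100.1", "203.0.113.9"])
print(route)
```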

Basic Usage

[pluto] (~) >>> traceroute example.com
traceroute to example.com (93.184.216.119), 30 hops max, 60 byte packets
 1  81.95.150.129-static.reverse.softlayer.com (81.95.150.129)  0.311 ms  0.328 ms  0.342 ms
 2  159.253.158.150-static.reverse.softlayer.com (159.253.158.150)  0.216 ms  0.264 ms  0.195 ms
 3  ae9.bbr02.eq01.ams02.networklayer.com (50.97.18.250)  0.461 ms  0.378 ms ae9.bbr01.eq01.ams02.networklayer.com (50.97.18.238)  0.361 ms
 4  ae1.bbr02.eq01.wdc02.networklayer.com (50.97.18.214)  79.037 ms  78.867 ms ae7.bbr01.eq01.ams02.networklayer.com (50.97.18.212)  0.583 ms
 5  ae7.bbr01.eq01.wdc02.networklayer.com (173.192.18.194)  79.019 ms ae1.bbr02.eq01.wdc02.networklayer.com (50.97.18.214)  79.051 ms ae7.bbr01.eq01.wdc02.networklayer.com (173.192.18.194)  79.025 ms
 6  ae0.bbr01.tl01.atl01.networklayer.com (173.192.18.153)  91.321 ms ae7.bbr01.eq01.wdc02.networklayer.com (173.192.18.194)  79.318 ms  79.299 ms
 7  ae13.bbr02.eq01.dal03.networklayer.com (173.192.18.134)  136.871 ms  111.083 ms ae0.bbr01.tl01.atl01.networklayer.com (173.192.18.153)  91.426 ms
 8  ae7.bbr01.eq01.dal03.networklayer.com (173.192.18.208)  111.741 ms  111.949 ms  112.678 ms
 9  ae7.bbr01.eq01.dal03.networklayer.com (173.192.18.208)  110.812 ms core1.dfw.edgecastcdn.net (206.223.118.44)  115.655 ms  113.653 ms
10  108-161-251-64.edgecastcdn.net (108.161.251.64)  112.967 ms core1.dfw.edgecastcdn.net (206.223.118.44)  113.778 ms  113.571 ms
11  108-161-251-64.edgecastcdn.net (108.161.251.64)  115.423 ms  113.521 ms  115.516 ms
12  93.184.216.119 (93.184.216.119)  113.905 ms  113.841 ms  112.882 ms

If we are not concerned with resolving hostnames along the way, we can use the -n flag:

[pluto] (~) >>> traceroute -n 10.106.15.195
traceroute to 10.106.15.195 (10.106.15.195), 30 hops max, 60 byte packets
 1  10.104.167.129  0.324 ms  0.335 ms  0.367 ms
 2  10.0.29.118  0.339 ms  0.355 ms  0.385 ms
 3  10.0.30.218  2.960 ms  0.481 ms  2.955 ms
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * 10.0.29.125  115.563 ms 10.0.29.121  113.872 ms
11  10.0.29.181  112.238 ms 10.0.29.179  112.696 ms 10.0.29.181  112.442 ms
12  10.0.29.181  113.675 ms 10.0.29.179  111.937 ms 10.106.15.195  113.310 ms

At times an * appears in the output in place of a value. It means that no reply to that probe arrived within the wait time. The probe may never have reached the router, the router may not send ICMP replies (or may rate limit them), or the reply may have been lost on its way back. traceroute prints an * in all of these cases, so the * alone does not tell you which one applies.
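When reading a long trace, it can help to count the unanswered probes per hop. The Python sketch below does this for a single line of output; summarize_hop is a made-up helper, and it assumes the line layout shown in the examples above (hop number first, times followed by " ms").

```python
import re
from typing import List, Tuple


def summarize_hop(line: str) -> Tuple[int, int, List[float]]:
    """Return (hop_number, timed_out_probes, rtts_ms) for one traceroute line."""
    tokens = line.split()
    hop = int(tokens[0])
    timeouts = tokens[1:].count("*")  # each * is one probe with no reply
    rtts = [float(ms) for ms in re.findall(r"([\d.]+) ms", line)]
    return hop, timeouts, rtts


silent = summarize_hop(" 4  * * *")
partial = summarize_hop("10  * 10.0.29.125  115.563 ms 10.0.29.121  113.872 ms")
print(silent)
print(partial)
```

A hop with three timeouts followed by later hops that do reply, as in hops 4 through 9 above, usually means those routers simply do not answer probes.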

Other Flags

  • -w = Wait time. Takes a value, in seconds, to use as the time to wait for a response to each probe.
  • -q = Number of queries (probes) per hop. By default traceroute sends 3.
  • -f = First TTL. By default this is 1, which means the trace starts at the first router in the path. Use the -f flag to start at a later hop.
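When traceroute is run from a script, the flags above can be assembled into an argument list. The following Python sketch shows one way to do that. The helper traceroute_argv and its parameter names are made up for illustration, and it only builds the command without running it.

```python
from typing import List, Optional


def traceroute_argv(host: str,
                    wait_s: Optional[float] = None,
                    queries: Optional[int] = None,
                    first_ttl: Optional[int] = None,
                    numeric: bool = False) -> List[str]:
    """Build a traceroute command line from the flags described above."""
    argv = ["traceroute"]
    if numeric:
        argv.append("-n")                  # skip hostname lookups
    if wait_s is not None:
        argv += ["-w", str(wait_s)]        # seconds to wait per probe
    if queries is not None:
        argv += ["-q", str(queries)]       # probes per hop
    if first_ttl is not None:
        argv += ["-f", str(first_ttl)]     # TTL of the first probe
    argv.append(host)
    return argv


argv = traceroute_argv("example.com", wait_s=2, queries=1, first_ttl=4, numeric=True)
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run(argv, capture_output=True, text=True) on a machine where traceroute is installed.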