MTR (Matt's traceroute) combines the functionality of the traceroute and ping programs in a single network diagnostic tool. To have the test run to completion and then print a report, use the -r (--report) flag. By default mtr sends 10 packets; to send a different number, use the -c (--report-cycles) flag. You can also tell mtr to skip hostname lookups with the -n (--no-dns) flag.
Here is an example of a 50-count MTR to google.com that skips hostname lookups:
[gitlab] (~) >>> mtr -c 50 --no-dns --report google.com
HOST: gitlab Loss% Snt Last Avg Best Wrst StDev
1.|-- 208.43.6.65 0.0% 50 0.3 1.8 0.3 42.8 6.1
2.|-- 66.228.118.213 0.0% 50 0.3 0.4 0.3 8.2 1.1
3.|-- 173.192.18.210 0.0% 50 0.2 0.2 0.2 0.2 0.0
4.|-- 50.97.16.37 0.0% 50 0.3 3.0 0.3 47.9 7.4
5.|-- 72.14.233.85 0.0% 50 8.1 14.0 0.3 42.9 12.3
6.|-- 209.85.254.133 0.0% 50 0.6 0.6 0.6 0.8 0.0
7.|-- 173.194.115.70 0.0% 50 0.3 0.3 0.3 0.3 0.0
Without the --report (-r) option, mtr runs continuously in an interactive display that updates the round-trip times to each host as new probes are sent. In most cases, the --report mode provides sufficient data in a useful format.
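To try the interactive display yourself, simply omit the --report flag (the gitlab prompt below is reused from the earlier examples purely for illustration); press q to quit:
[gitlab] (~) >>> mtr google.com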
The Loss% column shows the percentage of packet loss at each hop. The Snt column counts the number of packets sent.
The next four columns, Last, Avg, Best, and Wrst, are all measurements of latency in milliseconds (ms). Last is the latency of the most recent packet, Avg is the average latency of all packets, and Best and Wrst display the best (shortest) and worst (longest) round-trip time to that host. In most cases, the average (Avg) column should be the focus of your attention.
The final column, StDev, provides the standard deviation of the latencies to each host. The higher the standard deviation, the greater the difference is between measurements of latency.
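Note that report mode truncates long hostnames so the columns stay aligned, as you will see in some of the reports below. If your version of mtr supports it, the -w (--report-wide) flag produces the same report without truncating hostnames:
[gitlab] (~) >>> mtr -w -c 50 google.com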
When analyzing the output of an MTR report, you are looking for two things: loss and latency. First, we'll discuss loss. If you see a percentage of loss at any particular hop, that may be an indication of a problem with that particular router. However, it is common practice among some service providers to rate limit or de-prioritize ICMP traffic, which can give the illusion of packet loss when there is in fact no loss. To figure out whether the loss is legitimate, look at the subsequent hop: if the next hop shows 0% loss, you can be relatively sure the loss reported is simply due to ICMP rate limiting. This can be seen in the following example:
[gitlab] (~) >>> mtr -c 50 -r example.com
HOST: gitlab Loss% Snt Last Avg Best Wrst StDev
1.|-- 208.43.6.65-static.revers 0.0% 50 0.5 2.8 0.3 49.3 7.8
2.|-- ae14.dar02.sr01.dal01.net 0.0% 50 9.6 0.4 0.2 9.6 1.3
3.|-- ae14.bbr01.eq01.dal03.net 0.0% 50 0.3 0.5 0.3 9.8 1.3
4.|-- ??? 100.0 50 0.0 0.0 0.0 0.0 0.0
5.|-- 108-161-251-64.edgecastcd 0.0% 50 0.7 4.8 0.5 31.5 7.7
6.|-- 93.184.216.119 0.0% 50 0.7 0.6 0.5 1.1 0.1
If the loss were actual loss, the output would look more like this:
[gitlab] (~) >>> mtr -c 50 -r -n 192.168.1.100
HOST: gitlab Loss% Snt Last Avg Best Wrst StDev
1.|-- 208.43.6.65 0.0% 50 0.4 1.9 0.3 42.5 6.1
2.|-- 66.228.118.213 0.0% 50 0.3 0.3 0.3 2.1 0.3
3.|-- 173.192.18.254 0.0% 50 0.3 1.4 0.3 34.4 5.1
4.|-- 80.239.195.177 0.0% 50 0.3 0.8 0.3 27.2 3.8
5.|-- 192.205.37.49 0.0% 50 1.7 3.1 1.1 6.3 1.3
6.|-- 12.122.139.18 0.0% 50 8.3 8.5 6.5 10.4 1.1
7.|-- 12.122.28.158 0.0% 50 9.9 8.9 7.3 10.9 1.1
8.|-- 12.122.103.113 0.0% 50 7.2 8.3 6.5 10.3 1.1
9.|-- ??? 100.0 50 0.0 0.0 0.0 0.0 0.0
10.|-- 71.144.128.226 96.0% 50 7.6 7.7 7.6 7.9 0.2
11.|-- 71.144.128.123 98.0% 50 7.4 7.4 7.4 7.4 0.0
12.|-- 192.168.1.254 84.0% 50 9.8 9.6 8.3 11.7 1.2
13.|-- ??? 100.0 50 0.0 0.0 0.0 0.0 0.0
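If you suspect that loss at a hop is just ICMP rate limiting, another way to cross-check (assuming your mtr build was compiled with the support, and noting that these probe types may require root privileges) is to repeat the test with UDP or TCP probes and compare the loss figures:
[gitlab] (~) >>> mtr -u -c 50 -r -n example.com
[gitlab] (~) >>> mtr -T -P 443 -c 50 -r -n example.com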
The other thing we're interested in is latency. Latency generally increases with the physical distance covered and the number of hops in a route, but the increases should be consistent and gradual. Latency can also be caused by a problem with the return route.
[saturn] (~) >>> mtr -r www.google.com
Start: Mon Apr 21 11:48:38 2014
HOST: saturn Loss% Snt Last Avg Best Wrst StDev
1.|-- homeportal 0.0% 10 4.0 1.9 1.2 4.5 1.2
2.|-- 70-138-176-2.lightspeed.h 10.0% 10 102.3 36.4 22.9 102.3 26.8
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- 71.144.128.122 60.0% 10 23.7 23.5 23.0 23.9 0.0
5.|-- 72.14.233.56 0.0% 10 390.6 360.4 342.1 396.7 0.2
6.|-- 209.85.254.247 0.0% 10 391.6 360.4 342.1 396.7 0.4
7.|-- 64.233.174.46 0.0% 10 391.8 360.4 342.1 396.7 2.1
8.|-- gw-in-f147.1e100.net 0.0% 10 392.0 360.4 342.1 396.7 1.2
As you can see from the above report, latency jumps significantly between hops 4 and 5 and remains high through the rest of the route. This may point to a network latency issue, since round-trip times stay high after the fourth hop. From this report alone, it is impossible to determine the actual cause of the latency spike.
When troubleshooting both loss and latency, it is best to get a 300-count bidirectional MTR (one report from each end of the connection), since packets can take completely different routes to and from a particular destination.
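A minimal sketch of that workflow, assuming you have shell access on both ends (the remote prompt and the addresses below are placeholders): run the same 300-count report from each side and compare the results hop by hop:
[gitlab] (~) >>> mtr -c 300 -r -n <remote-ip>
[remote] (~) >>> mtr -c 300 -r -n <gitlab-public-ip>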
Timeouts can happen for various reasons. Some routers discard ICMP entirely, and the missing replies show up in the output as timeouts (???). Timeouts are not necessarily an indication of packet loss; packets may still reach their destination without significant loss or latency. Timeouts may be attributable to routers dropping packets for QoS (quality of service) purposes, or there may be an issue with the return routes causing them.
traceroute attempts to trace the route an IP packet would follow to some internet host by launching probe packets with a small TTL (time to live) and then listening for an ICMP "time exceeded" reply from a gateway. It starts its probes with a TTL of one and increases this by one until it gets an ICMP "port unreachable" (or TCP reset), which means the probe reached the host, or until it hits a maximum (which defaults to 30 hops).
[pluto] (~) >>> traceroute example.com
traceroute to example.com (93.184.216.119), 30 hops max, 60 byte packets
1 81.95.150.129-static.reverse.softlayer.com (81.95.150.129) 0.311 ms 0.328 ms 0.342 ms
2 159.253.158.150-static.reverse.softlayer.com (159.253.158.150) 0.216 ms 0.264 ms 0.195 ms
3 ae9.bbr02.eq01.ams02.networklayer.com (50.97.18.250) 0.461 ms 0.378 ms ae9.bbr01.eq01.ams02.networklayer.com (50.97.18.238) 0.361 ms
4 ae1.bbr02.eq01.wdc02.networklayer.com (50.97.18.214) 79.037 ms 78.867 ms ae7.bbr01.eq01.ams02.networklayer.com (50.97.18.212) 0.583 ms
5 ae7.bbr01.eq01.wdc02.networklayer.com (173.192.18.194) 79.019 ms ae1.bbr02.eq01.wdc02.networklayer.com (50.97.18.214) 79.051 ms ae7.bbr01.eq01.wdc02.networklayer.com (173.192.18.194) 79.025 ms
6 ae0.bbr01.tl01.atl01.networklayer.com (173.192.18.153) 91.321 ms ae7.bbr01.eq01.wdc02.networklayer.com (173.192.18.194) 79.318 ms 79.299 ms
7 ae13.bbr02.eq01.dal03.networklayer.com (173.192.18.134) 136.871 ms 111.083 ms ae0.bbr01.tl01.atl01.networklayer.com (173.192.18.153) 91.426 ms
8 ae7.bbr01.eq01.dal03.networklayer.com (173.192.18.208) 111.741 ms 111.949 ms 112.678 ms
9 ae7.bbr01.eq01.dal03.networklayer.com (173.192.18.208) 110.812 ms core1.dfw.edgecastcdn.net (206.223.118.44) 115.655 ms 113.653 ms
10 108-161-251-64.edgecastcdn.net (108.161.251.64) 112.967 ms core1.dfw.edgecastcdn.net (206.223.118.44) 113.778 ms 113.571 ms
11 108-161-251-64.edgecastcdn.net (108.161.251.64) 115.423 ms 113.521 ms 115.516 ms
12 93.184.216.119 (93.184.216.119) 113.905 ms 113.841 ms 112.882 ms
If we are not concerned with resolving hostnames along the way, use the -n flag:
[pluto] (~) >>> traceroute -n 10.106.15.195
traceroute to 10.106.15.195 (10.106.15.195), 30 hops max, 60 byte packets
1 10.104.167.129 0.324 ms 0.335 ms 0.367 ms
2 10.0.29.118 0.339 ms 0.355 ms 0.385 ms
3 10.0.30.218 2.960 ms 0.481 ms 2.955 ms
4 * * *
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * 10.0.29.125 115.563 ms 10.0.29.121 113.872 ms
11 10.0.29.181 112.238 ms 10.0.29.179 112.696 ms 10.0.29.181 112.442 ms
12 10.0.29.181 113.675 ms 10.0.29.179 111.937 ms 10.106.15.195 113.310 ms
There are times when you will see an * in the output rather than a value. This indicates that no reply was received for that probe before the timeout. The cause can be anything from the probe never reaching the router in question, to the router being configured not to respond, to the reply getting lost on its way back. Whatever the cause, traceroute simply prints an * for that probe.
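When a trace is full of *'s, it can sometimes help to switch the probe type, since some routers answer ICMP echo or TCP probes even when they ignore traceroute's default UDP probes. Both options below are standard in the Linux traceroute utility but typically require root, and the target is only a placeholder:
[pluto] (~) >>> sudo traceroute -I example.com
[pluto] (~) >>> sudo traceroute -T -p 443 example.com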
-w = Wait time. Sets how long, in seconds, traceroute waits for a response to each probe.
-q = Number of queries per hop. By default traceroute sends 3 probes to each hop.
-f = First TTL. By default the value is 1, which means the trace starts at the first router in the path; this flag lets you start the trace at a later hop instead.
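As an illustrative combination of the flags above (the values are arbitrary), the following waits up to 2 seconds for each reply, sends a single probe per hop, and starts probing at the fifth hop:
[pluto] (~) >>> traceroute -n -w 2 -q 1 -f 5 example.com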