adds post on using traceroute/mtr
parent 7f59a40446
commit e4fbb697f0
@@ -0,0 +1,259 @@
Title: Using traceroute/mtr, or: Diagnosing network problems 101
Date: 2019-12-13T16:39-05:00
Author: Wxcafé
Category: 
Slug: content/using_traceroute-mtr,_or:_diagnosing_network_problems_101
 | 
			
		||||
 | 
			
		||||
I was in a Twitter discussion recently about **traceroute**, and how it is not
necessarily as simple as it seems, and how it causes a lot of confusion on the
user side and a lot of frustration on the network admin side. So I decided to
write this little guide on how `traceroute` and `mtr` work, how to use them, and
how to <s>read the tea leaves</s> interpret the output.
 | 
			
		||||
 | 
			
		||||
### How it works
 | 
			
		||||
 | 
			
		||||
`traceroute` and `mtr` (and similar tools) all work the same way: they send
packets with a low TTL (Time to Live, the number of hops a packet will be
transmitted for before <s>dying</s> being dropped), and rely on the routers at
each step of the way to send back an ICMP Type 11 packet (TTL Expired).
`traceroute` sends UDP by default, whereas `mtr` sends ICMP, but the idea is the
same: first you send a packet with a TTL of 1, it expires on the first hop,
which tells you it did. Then you send a packet with a TTL of 2, and the second
router along the way tells you it expired. And you do that again and again until
you get to the target, which answers the probe itself instead of reporting an
expired TTL.
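To make the loop concrete, here is a minimal Python sketch that only *simulates* the probing logic; it sends no packets. The `simulate_traceroute` function, the `route` list and the `.example` host names are all made up for illustration, and `None` stands for a hop that never replies.

```python
from typing import List, Optional


def simulate_traceroute(path: List[Optional[str]], max_ttl: int = 30) -> List[str]:
    """Simulate TTL-limited probing along a hypothetical path.

    `path` lists the devices in order, ending with the target.
    A value of None marks a device that never replies to our probes.
    """
    lines = []
    for ttl in range(1, max_ttl + 1):
        # A probe sent with this TTL expires at the ttl-th device on the path
        # (or is delivered, if that device is the target). Past the end of the
        # path there is nobody left to answer.
        responder = path[ttl - 1] if ttl <= len(path) else None
        shown = responder if responder is not None else "*"
        lines.append(f"{ttl:2d}  {shown}")
        if ttl == len(path) and responder is not None:
            break  # the target itself answered, so we are done
    return lines


# Hypothetical route: the second hop stays silent, the fourth is the target.
route = ["bowser.wx", None, "router-a.example", "target.example"]
for line in simulate_traceroute(route):
    print(line)
```

If the last entry of the path is `None` (a target that silently drops the probes), the loop never sees an answer from the target and keeps going until `max_ttl`.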
 | 
			
		||||
 | 
			
		||||
The layer 4 protocol you're using doesn't matter (in general), because the TTL
is a field in the IP header, so you'll get an answer anyway. But you can switch
which one you're using to debug different problems, whether it is reachability
in general or on a specific TCP port, or something else.
 | 
			
		||||
 | 
			
		||||
`traceroute` only has a 'report mode', in that it immediately outputs to the
terminal and tries three times, and that's it. `mtr`, on the other hand, uses
a curses interface by default, and tries until you tell it to stop, gathering
stats along the way, but it can also do reporting similarly to `traceroute`, and
can try multiple times even in report mode.
 | 
			
		||||
 | 
			
		||||
### How to use it
 | 
			
		||||
 | 
			
		||||
`traceroute` and `mtr` are pretty simple to use: you point them at your
destination and shoot. Here are a few common and useful flags:
 | 
			
		||||
 | 
			
		||||
#### `traceroute`:
 | 
			
		||||
 | 
			
		||||
- `-4`/`-6`: use IPv4/IPv6 (it will use **v4** by default)
- `-I`: use ICMP instead of UDP packets
- `-T`: use TCP SYN instead of UDP packets
- `-U`: use UDP but keep the port consistent (by default, the port is
  incremented with each packet sent)
- `-n`: do not use reverse DNS to get hostnames in the results. Useful if your
  DNS is broken.
- `-p <port>`: destination port for TCP or UDP with `-U`
- `-A`: look up and show the AS number of each hop
- `-N`: selects the number of packets sent simultaneously (default is 16; too
  few will be slow, too many might get filtered)
 | 
			
		||||
 | 
			
		||||
#### `mtr`
 | 
			
		||||
 | 
			
		||||
- `-4`/`-6`: use IPv4/IPv6 (it will use **v6** by default)
- `-r`/`-w`: generate a report instead of going into the interactive interface
  (`-w` is for the "wide" mode, which doesn't cut hostnames)
- `-j`/`-x`/`-C`: output json/xml/csv, respectively
- `-n`: do not use reverse DNS to get hostnames in the results.
- `-z`: lookup and show AS number of each hop
- `-c`: number of cycles to run for
- `-s <size>`: specify packet size
- `-u`: use UDP instead of ICMP packets
- `-T`: use TCP instead of ICMP packets
- `-P <port>`: destination port for UDP and TCP
 | 
			
		||||
 | 
			
		||||
`mtr` also has an interactive mode (in fact, it's the default). A few useful
shortcuts for that mode:
 | 
			
		||||
 | 
			
		||||
- `p` will pause display updates, `<SPACE>` will unpause
- `d` will switch display mode between statistics and two per-packet displays
- `n` will toggle reverse DNS resolution on/off
- `r` will reset the display, dropping all history and starting from scratch
- `y` will toggle IP info and cycle between AS number lookup, IP address
  display, country, RIR, and date of registration of the network.
- `q` will quit (useful to know 😁)
 | 
			
		||||
 | 
			
		||||
### How to interpret the output (the most important part)
 | 
			
		||||
 | 
			
		||||
So, now that we know all of that... how do we read the output?
 | 
			
		||||
 | 
			
		||||
``` shell
> traceroute wxcafe.net
traceroute to wxcafe.net (62.210.115.205), 30 hops max, 60 byte packets
 1  bowser.wx (10.0.42.1)  0.224 ms  0.272 ms  0.324 ms
 2  * * *
 3  B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50)  5.967 ms B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)  2.234 ms B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50)  5.959 ms
 4  * * *
 5  0.ae3.BR2.NYC4.ALTER.NET (140.222.1.59)  4.696 ms  4.692 ms 0.ae2.BR2.NYC4.ALTER.NET (140.222.229.93)  4.618 ms
 6  verizon.com.customer.alter.net (152.179.78.154)  4.245 ms  3.719 ms  3.251 ms
 7  ae-2-3211.edge7.Paris1.Level3.net (4.69.133.238)  112.460 ms  111.249 ms  109.206 ms
 8  212.3.235.202 (212.3.235.202)  87.401 ms  87.113 ms  86.841 ms
 9  49e-s202b-1-dc2-a9k1.dc2.poneytelecom.eu (195.154.1.29)  86.806 ms  86.919 ms  87.126 ms
10  51.158.8.83 (51.158.8.83)  87.125 ms  86.566 ms  88.011 ms
11  wxcafe.net (62.210.115.205)  87.847 ms  87.766 ms  87.778 ms
```
 | 
			
		||||
 | 
			
		||||
Here's an example of a traceroute from my laptop to `wxcafe.net`. Just at a
glance, we can see a few things: I'm on Verizon's network. The first hop is my
private router (it has a private IPv4 address). The next hop does not send us
ICMP TTL Expired packets for some reason. After that, we got three answers from
two different routers: Verizon does some load balancing, so we're going over
multiple paths. Then once again a hop that doesn't answer, then two answers from
Verizon core (this NSFNET block is now Verizon's...), then again Verizon core,
and suddenly we went over the Atlantic and we're on Level3's Paris1 router! Then
another Level3 IP that doesn't have a reverse DNS entry, and we enter
Online.net's network (poneytelecom is their ISP name). Finally we see the
server's gateway, and the server itself!
 | 
			
		||||
 | 
			
		||||
The numbers after the host part all show round trip time (there are three
because traceroute sends three packets to each hop by default), so we can spot
very clearly the moment we went from the US over to France even without looking
at the router names: when it goes from 4.2ms to 112ms, it's because the packet
took a trip in some submarine cables. We can also see that some later hops have
lower RTT than some earlier ones (for example hop 5 has a lower RTT than hop 3,
and hops 8, 9, 10 and 11 all have lower RTTs than hop 7). This is due to the
fact that traceroute gets data from each hop independently: the replies from
hop 8 have no link with the replies from hop 7, and in general network devices
are much faster at forwarding packets than they are at generating ICMP TTL
Expired replies. Thus the packets we got back from hop 7 didn't necessarily
take longer to travel back to us, they just took longer to be generated (though
they **can** sometimes take longer to travel back: the path the packets take
from our machine to the target is not necessarily the same that they take to get
from some random hop on the way back to our machine!)
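The "spot the ocean crossing" reading can also be done mechanically. The sketch below is only an illustration: `best_rtts` and `biggest_jump` are made-up helper names, and taking the *lowest* RTT per hop is my own choice, on the assumption that it is the value least affected by slow ICMP generation. The sample is hops 5 to 8 of the output above.

```python
import re
from typing import Dict, Tuple

# Matches "4.696 ms" but not the dotted parts of an IP address or hostname.
RTT_RE = re.compile(r"(\d+\.\d+) ms")


def best_rtts(output: str) -> Dict[int, float]:
    """Map hop number -> lowest RTT on that line (silent hops are skipped)."""
    result = {}
    for line in output.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # header line or blank line
        rtts = [float(value) for value in RTT_RE.findall(line)]
        if rtts:
            result[int(parts[0])] = min(rtts)
    return result


def biggest_jump(rtts: Dict[int, float]) -> Tuple[int, int, float]:
    """Return (previous hop, hop, RTT increase) for the largest increase."""
    hops = sorted(rtts)
    jumps = [(a, b, rtts[b] - rtts[a]) for a, b in zip(hops, hops[1:])]
    return max(jumps, key=lambda jump: jump[2])


sample = """\
 5  0.ae3.BR2.NYC4.ALTER.NET (140.222.1.59)  4.696 ms  4.692 ms 0.ae2.BR2.NYC4.ALTER.NET (140.222.229.93)  4.618 ms
 6  verizon.com.customer.alter.net (152.179.78.154)  4.245 ms  3.719 ms  3.251 ms
 7  ae-2-3211.edge7.Paris1.Level3.net (4.69.133.238)  112.460 ms  111.249 ms  109.206 ms
 8  212.3.235.202 (212.3.235.202)  87.401 ms  87.113 ms  86.841 ms
"""

prev_hop, hop, delta = biggest_jump(best_rtts(sample))
print(f"biggest RTT increase: hop {prev_hop} -> hop {hop} (+{delta:.1f} ms)")
```

On this sample the largest increase is between hops 6 and 7, which is exactly where the router names switch from New York to Paris. The negative "jump" from hop 7 to hop 8 is the slow-reply effect described above, not packets travelling back in time.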
 | 
			
		||||
 | 
			
		||||
Now, let's see another one:
 | 
			
		||||
 | 
			
		||||
``` shell
> sudo traceroute -T -p 22 imaginair.es
traceroute to imaginair.es (188.40.106.245), 30 hops max, 60 byte packets
 1  bowser.wx (10.0.42.1)  0.169 ms  0.165 ms  0.206 ms
 2  * * *
 3  B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)  5.816 ms  5.821 ms 5.868 ms
 4  * * *
 5  0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)  3.578 ms 0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107)  3.513 ms 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)  3.572 ms
 6  ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145)  2.903 ms  3.714 ms  3.695 ms
 7  et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226)  85.106 ms  84.522 ms 83.907 ms
 8  46.33.77.6 (46.33.77.6)  88.457 ms  88.430 ms  89.192 ms
 9  core21.fsn1.hetzner.com (213.239.245.217)  98.676 ms  99.107 ms 99.088 ms
10  ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)  99.047 ms  97.777 ms ex9k1.dc13.fsn1.hetzner.com (213.239.245.242)  97.651 ms
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
```
 | 
			
		||||
 | 
			
		||||
Here, we can see that it starts the same way as before, except it goes through
Frankfurt, Germany, instead of Paris, but then it stops inside Hetzner's
network... Why? Because the firewall of the target (imaginair.es) filters TCP
port 22, and will neither accept the probe nor forward it. So it's dropped, and
there's no reply (ICMP TTL Expired or otherwise) for traceroute to receive! As
it doesn't know what happened, it goes up to its maximum TTL (30 by default) and
then gives up.
 | 
			
		||||
 | 
			
		||||
Alright, let's move to `mtr`...
 | 
			
		||||
 | 
			
		||||
``` shell
> mtr -w -T -P 5050 wxcafe.net
Start: 2019-12-13T18:54:21-0500
HOST: cwh                                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- bowser.wx                                    0.0%    10    0.5   0.4   0.3   0.6   0.1
  2.|-- tunnel536764.tunnel.tserv4.nyc4.ipv6.he.net  0.0%    10    5.0   5.3   4.3   6.1   0.6
  3.|-- ve422.core1.nyc4.he.net                      0.0%    10    2.8   3.2   2.6   4.1   0.5
  4.|-- 100ge4-1.core1.par2.he.net                  50.0%    10   89.8  88.0  73.8  97.3   9.0
  5.|-- ???                                         100.0    10    0.0   0.0   0.0   0.0   0.0
  6.|-- 2001:bc8:400:1::8e                           0.0%    10   75.1  75.0  74.4  75.8   0.4
  7.|-- 2001:bc8:400:100::7f                        20.0%    10   74.6 126.5  74.4 591.5 163.4
  8.|-- wxcafe.net                                   0.0%    10   74.3  78.3  73.8 113.9  12.5
```
 | 
			
		||||
 | 
			
		||||
Here I try to `mtr` from my machine to `wxcafe.net`, over TCP port 5050. At
first glance we can see that I use Hurricane Electric's tunnel service to get
IPv6 (because Verizon won't provide v6 yet... Come on, it's 2019...), and that
this time most of the way goes through HE's network (up to Paris). This is
probably because they peer directly with Online.net/Iliad in Paris, and don't
want to pay for the traffic by sending it to one of their transits when they can
transport it over to the peering point.
 | 
			
		||||
 | 
			
		||||
We can also see that there's a lot more info visible, and that the layout looks
a lot better! Here the fields are, in order: hostname, Loss percentage, number
of packets sent, RTT of the last packet, average RTT, best RTT, worst RTT, and
standard deviation of the RTTs.
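As an illustration of that column order, here is a rough Python sketch that splits one row of the plain `mtr -w` report into named fields. `MtrRow` and `parse_mtr_row` are hypothetical names, and the parsing assumes the layout shown above (no `-b` or `-z` extras); it is not an official parser for `mtr` output.

```python
from typing import NamedTuple


class MtrRow(NamedTuple):
    hop: int
    host: str
    loss_pct: float
    sent: int
    last: float
    avg: float
    best: float
    worst: float
    stdev: float


def parse_mtr_row(line: str) -> MtrRow:
    """Parse one hop line of a plain `mtr -w` report (assumed layout)."""
    tokens = line.split()
    # The seven right-most columns are the statistics, in report order.
    loss, sent, last, avg, best, worst, stdev = tokens[-7:]
    hop = int(tokens[0].split(".")[0])  # "4.|--" -> 4
    host = " ".join(tokens[1:-7])
    return MtrRow(
        hop, host,
        float(loss.rstrip("%")),  # "50.0%" and "100.0" both parse
        int(sent),
        float(last), float(avg), float(best), float(worst), float(stdev),
    )


row = parse_mtr_row(
    "  4.|-- 100ge4-1.core1.par2.he.net                  50.0%    10   89.8  88.0  73.8  97.3   9.0"
)
print(row.host, row.loss_pct, row.worst)
```

Reading the statistics from the right-hand side keeps the sketch independent of how wide the hostname column happens to be.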
 | 
			
		||||
 | 
			
		||||
From that we can deduce that it sent 10 packets, and thus the
Last/Average/Best/Worst/Standard Deviation fields are a lot more useful than the
simple three RTT values we got from `traceroute`!
 | 
			
		||||
 | 
			
		||||
We also notice that in the Loss% column, besides the host that didn't answer our
probes, there are also two hops that have respectively 50% and 20% loss. Now, we
could jump to the conclusion that this means these hops dropped our packets, and
that something's wrong with them! But on closer inspection, later hops don't
show that drop, and everything works well... That's weird.
 | 
			
		||||
 | 
			
		||||
The reason why that's happening is simple: sometimes, routers have other things
to do with their time than reply to any rando's packet that has an expired TTL.
Replying with an ICMP TTL Expired packet is actually very low priority for
routers, and when they have other stuff going on they sometimes simply don't
answer. This obviously doesn't mean that there's something *actually* wrong on
the path, or the "Loss" would continue down to the later hops! Reading this kind
of loss as a real problem is actually a very common error.
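That rule of thumb (loss only matters if it continues down to the last hop) can be written down as a small Python sketch. `loss_starts_at` is a made-up helper and the check is a deliberately crude heuristic, not something `mtr` does itself; the first list is the Loss% column from the report above.

```python
from typing import List, Optional


def loss_starts_at(loss_by_hop: List[float]) -> Optional[int]:
    """Return the 1-based hop where loss begins and persists to the final hop.

    Returns None when the final hop shows no loss, in which case loss at
    intermediate hops is most likely just routers deprioritising replies.
    """
    start = None
    # Walk backwards from the final hop for as long as loss is present.
    for hop in range(len(loss_by_hop), 0, -1):
        if loss_by_hop[hop - 1] > 0:
            start = hop
        else:
            break
    return start


# Loss% column of the report above: hops 4, 5 and 7 look bad in isolation...
report = [0.0, 0.0, 0.0, 50.0, 100.0, 0.0, 20.0, 0.0]
print(loss_starts_at(report))  # ...but the final hop is clean

# Hypothetical path where the loss carries through to the target.
print(loss_starts_at([0.0, 0.0, 10.0, 12.0, 11.0]))
```

The first call reports nothing to worry about, while the hypothetical second path points at hop 3 as the place where the trouble starts.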
 | 
			
		||||
 | 
			
		||||
Let's look at a last one:
 | 
			
		||||
 | 
			
		||||
``` shell
> mtr -4 -wbz -T -P 22 imaginair.es
Start: 2019-12-13T19:44:34-0500
HOST: cwh                                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS???    bowser.wx (10.0.42.1)                                 0.0%    10    0.6   0.5   0.3   0.7   0.1
  2. AS???    ???                                                  100.0    10    0.0   0.0   0.0   0.0   0.0
  3. AS701    B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50)   0.0%    10    5.6   6.6   4.4   8.9   1.4
     AS701    B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)
  4. AS???    ???                                                  100.0    10    0.0   0.0   0.0   0.0   0.0
  5. AS???    0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107)            0.0%    10   63.9   9.7   3.1  63.9  19.1
     AS???    0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)
  6. AS3257 4436ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145)            0.0%    10    3.8   4.2   2.5   6.9   1.3
  7. AS3257 4436et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226)      0.0%    10   84.0  85.8  83.7  94.8   3.4
  8. AS3257 443646.33.77.6                                            0.0%    10   89.0 100.1  89.0 162.4  22.6
  9. AS24940  core21.fsn1.hetzner.com (213.239.245.217)             0.0%    10   98.8 100.4  98.1 115.4   5.3
     AS24940  core22.fsn1.hetzner.com (213.239.245.178)
 10. AS24940  ex9k1.dc13.fsn1.hetzner.com (213.239.245.242)         0.0%    10  100.0  98.8  97.7 100.0   0.7
     AS24940  ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)
 11. AS???    ???                                                  100.0    10    0.0   0.0   0.0   0.0   0.0
```
 | 
			
		||||
 | 
			
		||||
Here again we chose a target that won't work, `imaginair.es` on TCP port 22. In
this case, though, we can see that there is no long trail of `* * *`: mtr simply
shows `AS??? ??? 100.0`, 100% loss. It's clear what's happening: if the last hop
is unknown with 100% loss, the probes are being blocked somewhere.
 | 
			
		||||
 | 
			
		||||
We can also see multiple addresses for some hops; once again, these are due to
load balancing. Some of the ASN lookups failed, and that happens sometimes.

There was also some display error on hops 6, 7 and 8, probably because the AS
lookup code got two results and displayed both, breaking the display... :/ Here
the right address for hop 8 is `46.33.77.6`.
 | 
			
		||||
 | 
			
		||||
--- 
 | 
			
		||||
 | 
			
		||||
Anyway, if you want to report a network problem to an engineer... generally,
you're better off running `mtr -wbz <target>` and letting the person on the
other end figure it out. And don't open a report if you're not sure it's
a network error!
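If you end up producing these reports often, a tiny wrapper can keep the flags consistent. This is only a sketch: `mtr_report_command` is a made-up helper, it merely builds the argument list from flags covered in this post (`-w`, `-b`, `-z`, `-c`, `-T`, `-P`), and actually running it (for instance with `subprocess.run`) is left to you.

```python
import shlex
from typing import List, Optional


def mtr_report_command(target: str, tcp_port: Optional[int] = None,
                       cycles: int = 10) -> List[str]:
    """Build an `mtr` report command line for `target` (nothing is executed)."""
    cmd = ["mtr", "-w", "-b", "-z", "-c", str(cycles)]
    if tcp_port is not None:
        # Probe a specific TCP port instead of using ICMP.
        cmd += ["-T", "-P", str(tcp_port)]
    cmd.append(target)
    return cmd


print(shlex.join(mtr_report_command("wxcafe.net")))
print(shlex.join(mtr_report_command("imaginair.es", tcp_port=22)))
```

Building a list instead of a single string means the target is passed as one argument and never goes through a shell.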
 | 
			
		||||