adds post on using traceroute/mtr
parent
7f59a40446
commit
e4fbb697f0
@ -0,0 +1,259 @@
Title: Using traceroute/mtr, or: Diagnosing network problems 101
Date: 2019-12-13T16:39-05:00
Author: Wxcafé
Category:
Slug: content/using_traceroute-mtr,_or:_diagnosing_network_problems_101

I was in a Twitter discussion recently about **Traceroute**, and how it is not
necessarily as simple as it seems, and how it causes a lot of confusion on the
user side and a lot of frustration on the network admin side. So I decided to
write this little guide on how `traceroute` and `mtr` work, how to use them, and
how to <s>read the tea leaves</s> interpret the output.
### How it works

`traceroute` and `mtr` (and similar tools) all work the same way: they send
packets with a low TTL (Time to Live, the number of hops a packet will be
transmitted for before <s>dying</s> being dropped), and rely on the routers at
each step of the way to send back an ICMP Type 11 packet (Time Exceeded, a.k.a.
TTL Expired). `traceroute` sends UDP by default, whereas `mtr` sends ICMP, but
the idea is the same: first you send a packet with a TTL of 1, it expires at
the first hop, which tells you it did. Then you send a packet with a TTL of 2,
and the second router along the way tells you it expired. And you do that again
and again until you get to the target.
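
To make the TTL-stepping loop above concrete, here is a minimal sketch in Python. It does not send any packets: it only simulates the mechanism over a made-up list of hops. The function name `simulate_traceroute` and the hop names are assumptions for illustration, not anything `traceroute` or `mtr` expose.

```python
from typing import List, Tuple


def simulate_traceroute(path: List[str], max_ttl: int = 30) -> List[Tuple[int, str, str]]:
    """Simulate TTL-limited probing along a hypothetical path.

    `path` lists the routers in order and ends with the target.
    Returns one (ttl, responder, kind_of_reply) tuple per probe sent.
    """
    results = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        for index, hop in enumerate(path):
            if index == len(path) - 1:
                # The probe made it to the target, which answers itself.
                results.append((ttl, hop, "reply from target"))
                return results
            remaining -= 1  # every router decrements the TTL
            if remaining == 0:
                # TTL ran out here: the router drops the probe and tells us.
                results.append((ttl, hop, "ICMP TTL Expired"))
                break
    return results


# Hypothetical three-hop path: two routers, then the target.
for ttl, responder, kind in simulate_traceroute(["router-a", "router-b", "target"]):
    print(ttl, responder, kind)
```

Each probe only reveals the hop where its TTL ran out, which is why the real tools need one round of probes per hop, and why they give up at a maximum TTL when the target never answers.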

The layer 4 protocol you're using doesn't matter (in general), because the TTL
is a field of the IP header, so you'll get an answer anyway. But you can switch
which one you're using to debug different problems, whether it is reachability
in general or on a specific TCP port, or something else.

`traceroute` only has a 'report mode', in that it immediately outputs to the
terminal, sends three probes per hop, and that's it. `mtr`, on the other hand,
uses a curses interface by default, and keeps trying until you tell it to stop,
gathering stats along the way. It can also produce a report similar to
`traceroute`'s, and can run multiple cycles even in report mode.
### How to use it

`traceroute` and `mtr` are pretty simple to use: you point them at your
destination and shoot. Here are a few common and useful flags:

#### `traceroute`:

- `-4`/`-6`: use IPv4/IPv6 (it will use **v4** by default)
- `-I`: use ICMP instead of UDP packets
- `-T`: use TCP SYN instead of UDP packets
- `-U`: use UDP but keep the port consistent (by default, the port is
  incremented with each packet sent)
- `-n`: do not use reverse DNS to get hostnames in the results. Useful if your
  DNS is broken.
- `-p <port>`: destination port for TCP or UDP with `-U`
- `-A`: look up and show the AS number of each hop
- `-N`: selects the number of packets sent simultaneously (default is 16; too
  few will be slow, too many might get filtered)

#### `mtr`

- `-4`/`-6`: use IPv4/IPv6 (it will use **v6** by default)
- `-r`/`-w`: generate a report instead of going into the interactive interface
  (`-w` is for the "wide" mode, which doesn't cut hostnames)
- `-j`/`-x`/`-C`: output json/xml/csv, respectively
- `-n`: do not use reverse DNS to get hostnames in the results.
- `-z`: lookup and show AS number of each hop
- `-c`: number of cycles to run for
- `-s <size>`: specify packet size
- `-u`: use UDP instead of ICMP packets
- `-T`: use TCP instead of ICMP packets
- `-P <port>`: destination port for UDP and TCP

`mtr` also has an interactive mode (in fact, it's the default). A few useful
shortcuts for that mode:

- `p` will pause display updates, `<SPACE>` will unpause
- `d` will switch display mode between statistics and two per-packet displays
- `n` will toggle reverse DNS resolution on/off
- `r` will reset the display, dropping all history and starting from scratch
- `y` will toggle IP info and cycle between AS number lookup, IP address
  display, country, RIR, and date of registration of the network.
- `q` will quit (useful to know 😁)

### How to interpret the output (the most important part)

So, now that we know all of that... how do we read the output?

``` shell
> traceroute wxcafe.net
traceroute to wxcafe.net (62.210.115.205), 30 hops max, 60 byte packets
1 bowser.wx (10.0.42.1) 0.224 ms 0.272 ms 0.324 ms
2 * * *
3 B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50) 5.967 ms B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48) 2.234 ms B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50) 5.959 ms
4 * * *
5 0.ae3.BR2.NYC4.ALTER.NET (140.222.1.59) 4.696 ms 4.692 ms 0.ae2.BR2.NYC4.ALTER.NET (140.222.229.93) 4.618 ms
6 verizon.com.customer.alter.net (152.179.78.154) 4.245 ms 3.719 ms 3.251 ms
7 ae-2-3211.edge7.Paris1.Level3.net (4.69.133.238) 112.460 ms 111.249 ms 109.206 ms
8 212.3.235.202 (212.3.235.202) 87.401 ms 87.113 ms 86.841 ms
9 49e-s202b-1-dc2-a9k1.dc2.poneytelecom.eu (195.154.1.29) 86.806 ms 86.919 ms 87.126 ms
10 51.158.8.83 (51.158.8.83) 87.125 ms 86.566 ms 88.011 ms
11 wxcafe.net (62.210.115.205) 87.847 ms 87.766 ms 87.778 ms
```

Here's an example of a traceroute from my laptop to `wxcafe.net`. Just at a
glance, we can see a few things: I'm on Verizon's network. The first hop is my
private router (it has a private IPv4 address). The next hop does not send us
ICMP TTL Expired packets for some reason. After that, the three answers come
from two different routers: Verizon does some load balancing, so we're going
over multiple paths. Then once again a hop that doesn't answer, then two
answers from Verizon core (this NSFNET block is now Verizon's...), then again
Verizon core, and suddenly we went over the Atlantic and we're on Level3's
Paris1 router! Then another Level3 IP that doesn't have a reverse DNS entry,
and we enter Online.net's network (poneytelecom is their ISP name). Finally we
see the server's gateway, and the server itself!

The numbers after the host part all show round trip time (there are three
because traceroute sends three packets to each hop by default), so we can spot
very clearly the moment we went from the US over to France even without looking
at the router names: when it goes from 4.2ms to 112ms, it's because the packet
took a trip through some submarine cables. We can also see that some later hops
have lower RTT than some earlier ones (for example hop 5 has a lower RTT than
hop 3, and hops 8, 9, 10 and 11 all have lower RTTs than hop 7). This is due to
the fact that traceroute gets data from each hop independently: the replies
from hop 8 have no link with the replies from hop 7, and in general network
devices are much faster at forwarding packets than they are at generating ICMP
TTL Expired replies. Thus the packets we got back from hop 7 didn't necessarily
take longer to travel back to us, they just took longer to be generated (though
they **can** sometimes take longer to travel back: the path the packets take
from our machine to the target is not necessarily the same as the one they take
to get from some random hop on the way back to our machine!)
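
If you'd rather let a script spot the ocean crossing, here is a small Python sketch of that reasoning. It takes the best (lowest) RTT per hop, since slow ICMP generation can only inflate a value, and looks for the biggest increase between consecutive answering hops. The helper names `best_rtts` and `biggest_jump` are made up for this example, and the parsing assumes output shaped like the `traceroute` run above.

```python
import re
from typing import Dict, Optional, Tuple

# Matches the RTT values, e.g. "112.460 ms".
RTT_PATTERN = re.compile(r"(\d+(?:\.\d+)?) ms")


def best_rtts(output: str) -> Dict[int, float]:
    """Map hop number to its lowest RTT, skipping hops that never answered."""
    best = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header line or blank line
        rtts = [float(value) for value in RTT_PATTERN.findall(line)]
        if rtts:  # "* * *" lines have no RTT at all
            best[int(fields[0])] = min(rtts)
    return best


def biggest_jump(best: Dict[int, float]) -> Optional[Tuple[int, int, float]]:
    """Return (from_hop, to_hop, increase_in_ms) for the largest RTT increase."""
    hops = sorted(best)
    jumps = [(a, b, best[b] - best[a]) for a, b in zip(hops, hops[1:])]
    return max(jumps, key=lambda jump: jump[2]) if jumps else None
```

Fed the output above, this should point at the step from hop 6 to hop 7, which is the New York to Paris leg. It is only a heuristic: a single slow router can still produce a misleading jump.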

Now, let's see another one:

``` shell
> sudo traceroute -T -p 22 imaginair.es
traceroute to imaginair.es (188.40.106.245), 30 hops max, 60 byte packets
1 bowser.wx (10.0.42.1) 0.169 ms 0.165 ms 0.206 ms
2 * * *
3 B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48) 5.816 ms 5.821 ms 5.868 ms
4 * * *
5 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131) 3.578 ms 0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107) 3.513 ms 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131) 3.572 ms
6 ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145) 2.903 ms 3.714 ms 3.695 ms
7 et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226) 85.106 ms 84.522 ms 83.907 ms
8 46.33.77.6 (46.33.77.6) 88.457 ms 88.430 ms 89.192 ms
9 core21.fsn1.hetzner.com (213.239.245.217) 98.676 ms 99.107 ms 99.088 ms
10 ex9k1.dc13.fsn1.hetzner.com (213.239.245.238) 99.047 ms 97.777 ms ex9k1.dc13.fsn1.hetzner.com (213.239.245.242) 97.651 ms
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
```

Here, we can see that it starts the same way as before, except it goes through
Frankfurt, Germany instead of Paris, but then it stops inside Hetzner's
network... Why? Because the firewall of the target (imaginair.es) filters TCP
port 22, and will neither accept the packet nor forward it. So it's dropped,
and there's nothing for traceroute to receive: no ICMP TTL Expired from a hop,
and no TCP answer from the target! As it doesn't know what happened, it keeps
going up to its maximum TTL (30 by default) and then gives up.

Alright, let's move to `mtr`...

``` shell
> mtr -w -T -P 5050 wxcafe.net
Start: 2019-12-13T18:54:21-0500
HOST: cwh Loss% Snt Last Avg Best Wrst StDev
1.|-- bowser.wx 0.0% 10 0.5 0.4 0.3 0.6 0.1
2.|-- tunnel536764.tunnel.tserv4.nyc4.ipv6.he.net 0.0% 10 5.0 5.3 4.3 6.1 0.6
3.|-- ve422.core1.nyc4.he.net 0.0% 10 2.8 3.2 2.6 4.1 0.5
4.|-- 100ge4-1.core1.par2.he.net 50.0% 10 89.8 88.0 73.8 97.3 9.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- 2001:bc8:400:1::8e 0.0% 10 75.1 75.0 74.4 75.8 0.4
7.|-- 2001:bc8:400:100::7f 20.0% 10 74.6 126.5 74.4 591.5 163.4
8.|-- wxcafe.net 0.0% 10 74.3 78.3 73.8 113.9 12.5
```

Here I try to `mtr` from my machine to `wxcafe.net`, over TCP port 5050. At
first glance we can see that I use Hurricane Electric's tunnel service to get
IPv6 (because Verizon won't provide v6 yet... Come on, it's 2019...), and that
this time most of the way goes through HE's network (up to Paris). This is
probably because they peer directly with Online.net/Iliad in Paris, and don't
want to pay for the traffic by sending it to one of their transits when they can
transport it over to the peering point.

We can also see that there's a lot more info visible, and that the layout looks
a lot better! Here the fields are, in order: hostname, loss percentage, number
of packets sent, RTT of the last packet, average RTT, best RTT, worst RTT, and
standard deviation of the RTTs.

From the Snt column we can see that it sent 10 packets to each hop, and thus
the Last/Average/Best/Worst/Standard Deviation fields are a lot more useful
than the simple three RTT values we got from `traceroute`!

We also notice that in the Loss% column, besides the host that didn't answer our
probes, there are also two hops that have respectively 50% and 20% loss. Now, we
could jump to the conclusion that this means these hops dropped our packets, and
that something's wrong with them! But on closer inspection, later hops don't
show that drop, and everything works well... That's weird.

The reason why that's happening is simple: sometimes, routers have other things
to do with their time than reply to any rando's packet that has an expired TTL.
Replying with an ICMP TTL Expired packet is actually very low priority for
routers, and when they have other stuff going on they sometimes simply don't
answer. This obviously doesn't mean that there's something *actually* wrong on
the path, or the "Loss" would continue down to the later hops! Misreading this
as real packet loss is actually a very common error.
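
That rule of thumb ("loss only counts if it continues down to the later hops") can be written down as a short Python sketch. This is a heuristic of my own for illustration, not something `mtr` computes, and the names `persistent_loss` and `first_lossy_hop` are made up. For each hop it keeps only the loss that every later hop shows as well.

```python
from typing import List, Optional


def persistent_loss(loss_by_hop: List[float]) -> List[float]:
    """For each hop, the part of its Loss% that all later hops also show."""
    result = []
    floor = float("inf")
    # Walk from the target back to the first hop, tracking the lowest loss seen.
    for loss in reversed(loss_by_hop):
        floor = min(floor, loss)
        result.append(floor)
    return list(reversed(result))


def first_lossy_hop(loss_by_hop: List[float], threshold: float = 1.0) -> Optional[int]:
    """1-based number of the first hop where persistent loss starts, if any."""
    for hop_number, loss in enumerate(persistent_loss(loss_by_hop), start=1):
        if loss >= threshold:
            return hop_number
    return None


# Loss% column of the report above, hops 1 to 8.
report = [0.0, 0.0, 0.0, 50.0, 100.0, 0.0, 20.0, 0.0]
print(persistent_loss(report))   # every hop ends up at 0.0
print(first_lossy_hop(report))   # None: nothing is actually wrong on the path
```

With the report above, the 50%, 100% and 20% values all vanish because the target itself answers every probe. A path where loss starts at some hop and stays through the last one would be flagged at that hop.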

Let's look at a last one:

``` shell
> mtr -4 -wbz -T -P 22 imaginair.es
Start: 2019-12-13T19:44:34-0500
HOST: cwh Loss% Snt Last Avg Best Wrst StDev
1. AS??? bowser.wx (10.0.42.1) 0.0% 10 0.6 0.5 0.3 0.7 0.1
2. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
3. AS701 B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50) 0.0% 10 5.6 6.6 4.4 8.9 1.4
        AS701 B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)
4. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5. AS??? 0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107) 0.0% 10 63.9 9.7 3.1 63.9 19.1
        AS??? 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)
6. AS3257 4436ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145) 0.0% 10 3.8 4.2 2.5 6.9 1.3
7. AS3257 4436et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226) 0.0% 10 84.0 85.8 83.7 94.8 3.4
8. AS3257 443646.33.77.6 0.0% 10 89.0 100.1 89.0 162.4 22.6
9. AS24940 core21.fsn1.hetzner.com (213.239.245.217) 0.0% 10 98.8 100.4 98.1 115.4 5.3
        AS24940 core22.fsn1.hetzner.com (213.239.245.178)
10. AS24940 ex9k1.dc13.fsn1.hetzner.com (213.239.245.242) 0.0% 10 100.0 98.8 97.7 100.0 0.7
        AS24940 ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)
11. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
```

Here again we chose a target that won't work, `imaginair.es` on TCP port 22. In
this case, though, we can see that there is no long trail of `* * *`: mtr simply
shows `AS??? ??? 100.0`, 100% loss. It's clear what's happening: if the last hop
is unknown with 100% loss, the traffic is being blocked somewhere.

We can also see multiple addresses for some hops; once again these are due to
load-balancing. Some of the ASN lookups failed, and that happens sometimes.

There was also some display error on hops 6, 7 and 8 (the stray `4436` stuck to
the hostname), probably because the AS lookup code got two results and
displayed both, breaking the display... :/ Here the right address for hop 8 is
`46.33.77.6`.

---

Anyway, if you want to report a network problem to an engineer... generally,
you're better off running `mtr -wbz <target>` and letting the person on the
other end figure it out. And don't open a report if you're not sure it's
a network error!
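
If you collect these reports from a script, a tiny Python helper can build that command line safely. This is only a sketch: `mtr_report_command` is a hypothetical name, the `-c` default of 10 simply mirrors the reports shown above, and it only builds the argument list, it does not run anything.

```python
from typing import List


def mtr_report_command(target: str, cycles: int = 10, ipv4_only: bool = False) -> List[str]:
    """Build the argv for a wide mtr report showing IPs and AS numbers."""
    if not target or target.startswith("-"):
        # Refuse anything mtr could mistake for an option.
        raise ValueError(f"not a usable hostname or address: {target!r}")
    if cycles < 1:
        raise ValueError("cycles must be at least 1")
    command = ["mtr", "-w", "-b", "-z", "-c", str(cycles)]
    if ipv4_only:
        command.append("-4")
    command.append(target)
    return command


print(" ".join(mtr_report_command("wxcafe.net")))
```

On a machine that has `mtr` installed, the returned list can be handed to `subprocess.run(..., capture_output=True, text=True)` and the captured text pasted into the report as-is.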