adds post on using traceroute/mtr
parent
7f59a40446
commit
e4fbb697f0
@ -0,0 +1,259 @@
|
||||
Title: Using traceroute/mtr, or: Diagnosing network problems 101
Date: 2019-12-13T16:39-05:00
Author: Wxcafé
Category:
Slug: content/using_traceroute-mtr,_or:_diagnosing_network_problems_101
|
||||
|
||||
I was in a Twitter discussion recently about **Traceroute**, and how it is not
necessarily as simple as it seems, and how it causes a lot of confusion on the
user side and a lot of frustration on the network admin side. So I decided to
write this little guide on how `traceroute` and `mtr` work, how to use them, and
how to <s>read the tea leaves</s> interpret the output.
|
||||
|
||||
### How it works
|
||||
|
||||
`traceroute` and `mtr` (and similar tools) all work the same way: they send
packets with a low TTL (Time to Live, the number of hops a packet will be
transmitted for before <s>dying</s> being dropped), and rely on the routers at
each step of the way to send back an ICMP Type 11 packet (Time Exceeded, called
"TTL Expired" in the rest of this post). `traceroute` sends UDP by default,
whereas `mtr` sends ICMP, but the idea is the same: first you send a packet with
a TTL of 1. It expires on the first hop, and that router tells you it did. Then
you send a packet with a TTL of 2, and the second router along the way tells you
it expired. And you do that again and again until you get to the target.
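
To make the TTL trick concrete, here's a toy simulation in Python. It doesn't send any real packets (that would need raw sockets and root): `PATH`, `send_probe` and `trace` are made-up names for this sketch, and the "network" is just a list of addresses where the last entry is the target.

```python
# Toy model of the TTL trick: no real packets are sent.
# PATH is a made-up route; the last entry is the target, the rest are routers.
PATH = ["10.0.42.1", "100.41.130.50", "140.222.1.59", "62.210.115.205"]


def send_probe(path, ttl):
    """Return (address, kind) for whichever node answers a probe sent with this TTL."""
    routers, target = path[:-1], path[-1]
    for address in routers:
        ttl -= 1  # every router decrements the TTL before forwarding
        if ttl == 0:
            # The packet dies here, and this router tells us about it.
            return address, "ttl-expired"
    # The TTL survived every router: the probe reached the target.
    return target, "reached"


def trace(path, max_ttl=30):
    """Probe with TTL 1, 2, 3... until the target answers or max_ttl is hit."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        address, kind = send_probe(path, ttl)
        hops.append((ttl, address, kind))
        if kind == "reached":
            break
    return hops


if __name__ == "__main__":
    for ttl, address, kind in trace(PATH):
        print(ttl, address, kind)
```

Each TTL value reveals exactly one hop, which is why the real tools print one line per TTL.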
|
||||
|
||||
The layer 4 protocol you're using doesn't matter (in general), because the TTL
is a field of the IP header, so you'll get an answer anyway. But you can switch
which one you're using to debug different problems, whether it is reachability
in general or on a specific TCP port, or something else.
|
||||
|
||||
`traceroute` only has a 'report mode', in that it immediately outputs to the
terminal, sends three probes per hop, and that's it. `mtr`, on the other hand,
uses a curses interface by default and keeps probing until you tell it to stop,
gathering stats along the way. It can also produce a report similar to
`traceroute`'s, and can run multiple cycles even in report mode.
|
||||
|
||||
### How to use it
|
||||
|
||||
`traceroute` and `mtr` are pretty simple to use: you point them at your
destination and shoot. Here are a few common and useful flags:
|
||||
|
||||
#### `traceroute`:
|
||||
|
||||
- `-4`/`-6`: use IPv4/IPv6 (it will use **v4** by default)
- `-I`: use ICMP instead of UDP packets
- `-T`: use TCP SYN instead of UDP packets
- `-U`: use UDP but keep the port consistent (by default, the port is
  incremented with each packet sent)
- `-n`: do not use reverse DNS to get hostnames in the results. Useful if your
  DNS is broken.
- `-p <port>`: destination port for TCP or UDP with `-U`
- `-A`: lookup and show AS number of each hop
- `-N`: selects the number of packets sent simultaneously (default is 16; too
  few will be slow, too many might get filtered)
|
||||
|
||||
#### `mtr`
|
||||
|
||||
- `-4`/`-6`: use IPv4/IPv6 (it will use **v6** by default)
- `-r`/`-w`: generate a report instead of going into the interactive interface
  (`-w` is for the "wide" mode, which doesn't cut hostnames)
- `-j`/`-x`/`-C`: output json/xml/csv, respectively
- `-n`: do not use reverse DNS to get hostnames in the results.
- `-z`: lookup and show AS number of each hop
- `-c`: number of cycles to run for
- `-s <size>`: specify packet size
- `-u`: use UDP instead of ICMP packets
- `-T`: use TCP instead of ICMP packets
- `-P <port>`: destination port for UDP and TCP
|
||||
|
||||
`mtr` also has an interactive mode (in fact, it's the default). A few useful
shortcuts for that mode:
|
||||
|
||||
- `p` will pause display updates, `<SPACE>` will unpause
- `d` will switch display mode between statistics and two per-packet displays
- `n` will toggle reverse DNS resolution on/off
- `r` will reset the display, dropping all history and starting from scratch
- `y` will toggle IP info and cycle between AS number lookup, IP address
  display, country, RIR, and date of registration of the network.
- `q` will quit (useful to know 😁)
|
||||
|
||||
### How to interpret the output (the most important part)
|
||||
|
||||
So, now that we know all of that... how do we read the output?
|
||||
|
||||
``` shell
> traceroute wxcafe.net
traceroute to wxcafe.net (62.210.115.205), 30 hops max, 60 byte packets
 1  bowser.wx (10.0.42.1)  0.224 ms  0.272 ms  0.324 ms
 2  * * *
 3  B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50)  5.967 ms B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)  2.234 ms B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50)  5.959 ms
 4  * * *
 5  0.ae3.BR2.NYC4.ALTER.NET (140.222.1.59)  4.696 ms  4.692 ms 0.ae2.BR2.NYC4.ALTER.NET (140.222.229.93)  4.618 ms
 6  verizon.com.customer.alter.net (152.179.78.154)  4.245 ms  3.719 ms  3.251 ms
 7  ae-2-3211.edge7.Paris1.Level3.net (4.69.133.238)  112.460 ms  111.249 ms  109.206 ms
 8  212.3.235.202 (212.3.235.202)  87.401 ms  87.113 ms  86.841 ms
 9  49e-s202b-1-dc2-a9k1.dc2.poneytelecom.eu (195.154.1.29)  86.806 ms  86.919 ms  87.126 ms
10  51.158.8.83 (51.158.8.83)  87.125 ms  86.566 ms  88.011 ms
11  wxcafe.net (62.210.115.205)  87.847 ms  87.766 ms  87.778 ms
```
|
||||
|
||||
Here's an example of a traceroute from my laptop to `wxcafe.net`. Just at a
glance, we can see a few things: I'm on Verizon's network. The first hop is my
private router (it has a private IPv4 address). The next hop does not send us
ICMP TTL Expired packets for some reason. After that, our three probes got
answers from two different routers: Verizon does some load balancing, so we're
going over multiple paths. Then once again a hop that doesn't answer, then
answers from two Verizon core routers (this NSFNET block is now Verizon's...),
then again Verizon core, and suddenly we went over the Atlantic and we're on
Level3's Paris1 router! Then another Level3 IP that doesn't have a reverse DNS
entry, and we enter Online.net's network (poneytelecom is their ISP name).
Finally we see the server's gateway, and the server itself!
|
||||
|
||||
The numbers after the host part all show round trip time (there are three
because traceroute sends three packets to each hop by default), so we can spot
very clearly the moment we went from the US over to France even without looking
at the router names: when it goes from 4.2ms to 112ms (between hops 6 and 7),
it's because the packet took a trip in some submarine cables. We can also see
that some later hops have lower RTT than some earlier ones (for example hop 5
has a lower RTT than hop 3, and hops 8, 9, 10 and 11 all have lower RTTs than
hop 7). This is due to the fact that traceroute gets data from each hop
independently: the replies from hop 8 have no link with the replies from hop 7,
and in general network devices are much faster at forwarding packets than they
are at generating ICMP TTL Expired replies. Thus the packets we got back from
hop 7 didn't necessarily take longer to travel back to us, they just took
longer to be generated (though they **can** sometimes take longer to travel
back: the path the packets take from our machine to the target is not
necessarily the same as the one they take to get from some random hop on the
way back to our machine!).
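
If you'd rather not eyeball it, here's a small Python sketch that finds where the RTT jumps the most. The values are the lowest of the three RTTs for each answering hop, copied by hand from the traceroute output above; `BEST_RTT_MS` and `biggest_jump` are names made up for this example, not something traceroute gives you.

```python
# Lowest RTT (ms) per answering hop, copied from the traceroute output above.
# Hops 2 and 4 never answered, so they are simply absent.
BEST_RTT_MS = {
    1: 0.224,
    3: 2.234,
    5: 4.618,
    6: 3.251,
    7: 109.206,
    8: 86.841,
    9: 86.806,
    10: 86.566,
    11: 87.766,
}


def biggest_jump(best_rtt_ms):
    """Return (hop_before, hop_after, increase_ms) for the largest RTT increase
    between two consecutive answering hops."""
    hops = sorted(best_rtt_ms)
    jumps = [
        (best_rtt_ms[after] - best_rtt_ms[before], before, after)
        for before, after in zip(hops, hops[1:])
    ]
    increase, before, after = max(jumps)
    return before, after, increase


if __name__ == "__main__":
    before, after, increase = biggest_jump(BEST_RTT_MS)
    print(f"largest jump: hop {before} -> hop {after} (+{increase:.1f} ms)")
```

On this data the largest jump is between hops 6 and 7, which is the Atlantic crossing. Using the lowest RTT per hop rather than the average is a deliberate choice here: it is the value least affected by a router being slow to generate its reply.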
|
||||
|
||||
Now, let's see another one:
|
||||
|
||||
``` shell
> sudo traceroute -T -p 22 imaginair.es
traceroute to imaginair.es (188.40.106.245), 30 hops max, 60 byte packets
 1  bowser.wx (10.0.42.1)  0.169 ms  0.165 ms  0.206 ms
 2  * * *
 3  B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)  5.816 ms  5.821 ms  5.868 ms
 4  * * *
 5  0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)  3.578 ms 0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107)  3.513 ms 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)  3.572 ms
 6  ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145)  2.903 ms  3.714 ms  3.695 ms
 7  et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226)  85.106 ms  84.522 ms  83.907 ms
 8  46.33.77.6 (46.33.77.6)  88.457 ms  88.430 ms  89.192 ms
 9  core21.fsn1.hetzner.com (213.239.245.217)  98.676 ms  99.107 ms  99.088 ms
10  ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)  99.047 ms  97.777 ms ex9k1.dc13.fsn1.hetzner.com (213.239.245.242)  97.651 ms
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
```
|
||||
|
||||
Here, we can see that it starts the same way as before, except it goes through
Frankfurt, Germany instead of Paris, but then it stops inside Hetzner's
network... Why? Because the firewall of the target (imaginair.es) filters TCP
port 22: it neither accepts our SYN nor forwards it. The packet is silently
dropped, so nothing comes back for traceroute to receive: no ICMP TTL Expired,
and no TCP reply from the target either! As it doesn't know what happened, it
goes up to its maximum TTL (30 by default) and then gives up.
|
||||
|
||||
Alright, let's move to `mtr`...
|
||||
|
||||
``` shell
> mtr -w -T -P 5050 wxcafe.net
Start: 2019-12-13T18:54:21-0500
HOST: cwh                                            Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- bowser.wx                                     0.0%    10    0.5   0.4   0.3   0.6   0.1
  2.|-- tunnel536764.tunnel.tserv4.nyc4.ipv6.he.net   0.0%    10    5.0   5.3   4.3   6.1   0.6
  3.|-- ve422.core1.nyc4.he.net                       0.0%    10    2.8   3.2   2.6   4.1   0.5
  4.|-- 100ge4-1.core1.par2.he.net                   50.0%    10   89.8  88.0  73.8  97.3   9.0
  5.|-- ???                                          100.0    10    0.0   0.0   0.0   0.0   0.0
  6.|-- 2001:bc8:400:1::8e                            0.0%    10   75.1  75.0  74.4  75.8   0.4
  7.|-- 2001:bc8:400:100::7f                         20.0%    10   74.6 126.5  74.4 591.5 163.4
  8.|-- wxcafe.net                                    0.0%    10   74.3  78.3  73.8 113.9  12.5
```
|
||||
|
||||
Here I try to `mtr` from my machine to `wxcafe.net`, over TCP port 5050. At
first glance we can see that I use Hurricane Electric's tunnel service to get
IPv6 (because Verizon won't provide v6 yet... Come on, it's 2019...), and that
this time most of the path goes through HE's network (up to Paris). This is
probably because they peer directly with Online.net/Iliad in Paris, and don't
want to pay for the traffic by sending it to one of their transits when they can
transport it over to the peering point themselves.
|
||||
|
||||
We can also see that there's a lot more info visible, and that the layout looks
a lot better! Here the fields are, in order: hostname, loss percentage, number
of packets sent, RTT of the last packet, average RTT, best RTT, worst RTT, and
standard deviation of the RTTs.
|
||||
|
||||
From that we can deduce that it sent 10 packets to each hop, and thus the
Last/Average/Best/Worst/Standard Deviation fields are a lot more useful than the
simple three RTT values we got from `traceroute`!
|
||||
|
||||
We also notice that in the Loss% column, besides the host that didn't answer our
probes, there are also two hops that have respectively 50% and 20% loss. Now, we
could jump to the conclusion that this means these hops dropped our packets, and
that something's wrong with them! But on closer inspection, later hops don't
show that drop, and everything works well... That's weird.
|
||||
|
||||
The reason why that's happening is simple: sometimes, routers have other things
to do with their time than reply to any rando's packet that has an expired TTL.
Replying with an ICMP TTL Expired packet is actually very low priority for
routers, and when they have other stuff going on they sometimes simply don't
answer. This doesn't mean that there's something *actually* wrong on the path:
if there were, the "Loss" would continue down to the later hops! Reading loss on
an intermediate hop as real packet loss is actually a very common error.
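
That rule of thumb can be written down as a small Python sketch. It is only a heuristic, not something `mtr` does for you: `MTR_LOSS` holds the Loss% column copied by hand from the report above, and `classify_loss` and its labels are names invented for this example.

```python
# (hop number, Loss%) pairs copied from the mtr report above.
MTR_LOSS = [
    (1, 0.0), (2, 0.0), (3, 0.0), (4, 50.0),
    (5, 100.0), (6, 0.0), (7, 20.0), (8, 0.0),
]


def classify_loss(hops):
    """Label each hop's loss, given (hop, loss_percent) pairs in path order.

    "ok":         no loss reported at this hop
    "cosmetic":   some later hop answers with less loss, so our packets clearly
                  got through; the router was just not replying to us
    "suspicious": the loss carries on all the way down the path
    """
    labels = {}
    for index, (hop, loss) in enumerate(hops):
        # Loss figures of the later hops that answered at least once.
        later_answering = [l for _, l in hops[index + 1:] if l < 100.0]
        if loss == 0.0:
            labels[hop] = "ok"
        elif later_answering and min(later_answering) < loss:
            labels[hop] = "cosmetic"
        else:
            labels[hop] = "suspicious"
    return labels


if __name__ == "__main__":
    for hop, label in classify_loss(MTR_LOSS).items():
        print(hop, label)
```

On the report above, hops 4, 5 and 7 come out as "cosmetic" because hop 8 (the target) shows 0% loss. Loss that shows up on a hop and stays at least as high on every hop after it is the case worth reporting.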
|
||||
|
||||
Let's look at a last one:
|
||||
|
||||
``` shell
> mtr -4 -wbz -T -P 22 imaginair.es
Start: 2019-12-13T19:44:34-0500
HOST: cwh                                                           Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS???   bowser.wx (10.0.42.1)                                   0.0%    10    0.6   0.5   0.3   0.7   0.1
  2. AS???   ???                                                    100.0    10    0.0   0.0   0.0   0.0   0.0
  3. AS701   B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50)     0.0%    10    5.6   6.6   4.4   8.9   1.4
     AS701   B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)
  4. AS???   ???                                                    100.0    10    0.0   0.0   0.0   0.0   0.0
  5. AS???   0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107)              0.0%    10   63.9   9.7   3.1  63.9  19.1
     AS???   0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)
  6. AS3257  4436ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145)          0.0%    10    3.8   4.2   2.5   6.9   1.3
  7. AS3257  4436et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226)    0.0%    10   84.0  85.8  83.7  94.8   3.4
  8. AS3257  443646.33.77.6                                          0.0%    10   89.0 100.1  89.0 162.4  22.6
  9. AS24940 core21.fsn1.hetzner.com (213.239.245.217)               0.0%    10   98.8 100.4  98.1 115.4   5.3
     AS24940 core22.fsn1.hetzner.com (213.239.245.178)
 10. AS24940 ex9k1.dc13.fsn1.hetzner.com (213.239.245.242)           0.0%    10  100.0  98.8  97.7 100.0   0.7
     AS24940 ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)
 11. AS???   ???                                                    100.0    10    0.0   0.0   0.0   0.0   0.0
```
|
||||
|
||||
Here again we chose a target that won't work, `imaginair.es` on TCP port 22. In
this case, though, we can see that there is no long trail of `* * *`: mtr simply
shows `AS??? ???` with `100.0` in the Loss% column, i.e. 100% loss. It's clear
what's happening: if the last hop is unknown with 100% loss, the traffic is
blocked somewhere.
|
||||
|
||||
We can also see multiple addresses for some hops. Once again, these are due to
load-balancing. Some of the ASN lookups failed, and that happens sometimes.
|
||||
|
||||
There was also some display error on hops 6, 7 and 8, probably because the AS
lookup code got two results and displayed both, breaking the display... :/ Here
the right address for hop 8 is `46.33.77.6`.
|
||||
|
||||
---
|
||||
|
||||
Anyway, if you want to report a network problem to an engineer... generally,
you're better off running `mtr -wbz <target>` and letting the person on the
other end figure it out. And don't open a report if you're not sure it's
a network error!
|