adds post on using traceroute/mtr

master
Clément Hertling 8 months ago
parent
commit
e4fbb697f0
1 changed files with 259 additions and 0 deletions
  1. +259
    -0
      content/using_traceroute-mtr,_or:_diagnosing_network_problems_101.md

Title: Using traceroute/mtr, or: Diagnosing network problems 101
Date: 2019-12-13T16:39-05:00
Author: Wxcafé
Category:
Slug: content/using_traceroute-mtr,_or:_diagnosing_network_problems_101

I was in a Twitter discussion recently about **Traceroute**, and how it was not
necessarily as simple as it seemed, and that it caused a lot of confusion on the
user side and a lot of frustration on the network admin side. So I decided to
write this little guide on how `traceroute` and `mtr` work, how to use them, and
how to <s>read the tea leaves</s> interpret the output.

### How it works

`traceroute` and `mtr` (and similar tools) all work the same way: they send
packets with low TTL (Time to Live, the number of hops a packet will be
transmitted for before <s>dying</s> being dropped), and rely on the routers on
each step of the way to send back an ICMP Type 11 packet (Time Exceeded, a.k.a.
TTL Expired). `traceroute`
sends UDP by default, whereas `mtr` sends ICMP, but the idea is the same: first
you send a packet with a TTL of 1, it expires on the first hop, which tells you
it did. Then you send a packet with a TTL of 2, and the second router along the
way tells you it expired. And you do that again and again until you get to the
target.
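
To make the idea concrete, here's a tiny simulation of that loop. It doesn't send any real packets: `path` is a made-up list of hop names (target last), and `probe` and `simulated_traceroute` are hypothetical helpers that only mimic the TTL logic. This is not how either tool is actually implemented.

```python
def probe(path, ttl):
    """Simulate one probe sent with the given TTL.

    Returns (name of whoever answered, whether it was the target).
    """
    for index, hop in enumerate(path):
        if index == len(path) - 1:
            return hop, True      # the target itself answers the probe
        ttl -= 1                  # each router decrements the TTL...
        if ttl == 0:
            return hop, False     # ...and reports back when it reaches zero
    raise ValueError("path must not be empty")


def simulated_traceroute(path, max_ttl=30):
    """Probe with TTL 1, 2, 3... until the target answers or max_ttl is hit."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        replier, reached_target = probe(path, ttl)
        hops.append((ttl, replier))
        if reached_target:
            break
    return hops


if __name__ == "__main__":
    for ttl, name in simulated_traceroute(["gateway", "isp-core", "target"]):
        print(ttl, name)
```

Each TTL value reveals exactly one more hop, which is why the real output is a numbered list.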

The layer 4 protocol you're using doesn't matter (in general), because the TTL
is a field of the IP header, so you'll get an answer anyway. But you can switch
which one you're using to debug different problems, whether it is reachability
in general or on a specific TCP port, or something else.
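
As an illustration of why the payload protocol doesn't matter, here's a rough sketch that packs a bare IPv4 header by hand. `build_ipv4_header` is a hypothetical helper, the addresses are documentation-range placeholders, and the checksum is left at zero, so this is only good for looking at the layout, not for sending anything.

```python
import socket
import struct


def build_ipv4_header(src, dst, ttl, protocol):
    """Pack a minimal 20-byte IPv4 header (no options, checksum left at 0)."""
    version_ihl = (4 << 4) | 5        # version 4, header length 5 x 32 bits
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl,
        0,                            # DSCP / ECN
        20,                           # total length (header only here)
        0,                            # identification
        0,                            # flags + fragment offset
        ttl,                          # byte 8: the TTL routers decrement
        protocol,                     # byte 9: ICMP, UDP, TCP...
        0,                            # header checksum (not computed)
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )


if __name__ == "__main__":
    for proto in (socket.IPPROTO_ICMP, socket.IPPROTO_UDP, socket.IPPROTO_TCP):
        header = build_ipv4_header("192.0.2.1", "198.51.100.7", 3, proto)
        print("protocol", header[9], "-> TTL", header[8])
```

Whatever goes in the protocol byte, the TTL sits at the same offset in the same header, and that's all a router looks at before deciding to drop the packet.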

`traceroute` only has a 'report mode', in that it immediately outputs to the
terminal, sends three probes per hop, and that's it. `mtr`, on the other hand,
uses a curses interface by default and keeps probing until you tell it to stop,
gathering stats along the way. It can also produce a report similar to
`traceroute`'s, and can run multiple cycles even in report mode.

### How to use it

`traceroute` and `mtr` are pretty simple to use: you point them at your
destination and shoot. Here are a few common and useful flags:

#### `traceroute`

- `-4`/`-6`: use IPv4/IPv6 (it will use **v4** by default)
- `-I`: use ICMP instead of UDP packets
- `-T`: use TCP SYN instead of UDP packets
- `-U`: use UDP but keep the port consistent (by default, the port is
incremented with each packet sent)
- `-n`: do not use reverse DNS to get hostnames in the results. Useful if your
DNS is broken.
- `-p <port>`: destination port for TCP or UDP with `-U`
- `-A`: lookup and show AS number of each hop
- `-N`: selects the number of packets sent simultaneously (default is 16; too
few will be slow, too many might get filtered)

#### `mtr`

- `-4`/`-6`: use IPv4/IPv6 (it will use **v6** by default)
- `-r`/`-w`: generate a report instead of going into the interactive interface
(`-w` is for the "wide" mode, which doesn't cut hostnames)
- `-j`/`-x`/`-C`: output json/xml/csv, respectively
- `-n`: do not use reverse DNS to get hostnames in the results.
- `-z`: lookup and show AS number of each hop
- `-c`: number of cycles to run for
- `-s <size>`: specify packet size
- `-u`: use UDP instead of ICMP packets
- `-T`: use TCP instead of ICMP packets
- `-P <port>`: destination port for UDP and TCP
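
If you end up scripting this, a small wrapper can keep the flags straight. The sketch below only builds the argument list and doesn't run anything. `mtr_report_command` and its defaults are my own invention; the flags are the ones listed above, plus `-b` (show both hostnames and IPs).

```python
def mtr_report_command(target, protocol="icmp", port=None, cycles=10,
                       ip_version=None):
    """Build an argv list for a wide mtr report with IPs and AS numbers."""
    cmd = ["mtr", "-w", "-b", "-z", "-c", str(cycles)]
    if ip_version in (4, 6):
        cmd.append(f"-{ip_version}")          # force IPv4 or IPv6
    if protocol == "tcp":
        cmd.append("-T")
    elif protocol == "udp":
        cmd.append("-u")
    elif protocol != "icmp":
        raise ValueError(f"unknown protocol: {protocol!r}")
    if port is not None:
        if protocol == "icmp":
            raise ValueError("ICMP probes have no destination port")
        cmd += ["-P", str(port)]
    cmd.append(target)
    return cmd


if __name__ == "__main__":
    print(" ".join(mtr_report_command("wxcafe.net", protocol="tcp", port=5050)))
```

The resulting list can be handed to something like `subprocess.run` as-is, which avoids any shell quoting trouble with the target name.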

`mtr` also has an interactive mode (in fact, it's the default). A few useful
shortcuts for that mode:

- `p` will pause display updates, `<SPACE>` will unpause
- `d` will switch display mode between statistics and two per-packet displays
- `n` will toggle reverse DNS resolution on/off
- `r` will reset the display, dropping all history and starting from scratch
- `y` will toggle IP info and cycle between AS number lookup, IP address
display, country, RIR, and date of registration of the network.
- `q` will quit (useful to know 😁)

### How to interpret the output (the most important part)

So, now that we know all of that... how do we read the output?

``` shell
> traceroute wxcafe.net
traceroute to wxcafe.net (62.210.115.205), 30 hops max, 60 byte packets
1 bowser.wx (10.0.42.1) 0.224 ms 0.272 ms 0.324 ms
2 * * *
3 B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50) 5.967 ms B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48) 2.234 ms B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50) 5.959 ms
4 * * *
5 0.ae3.BR2.NYC4.ALTER.NET (140.222.1.59) 4.696 ms 4.692 ms 0.ae2.BR2.NYC4.ALTER.NET (140.222.229.93) 4.618 ms
6 verizon.com.customer.alter.net (152.179.78.154) 4.245 ms 3.719 ms 3.251 ms
7 ae-2-3211.edge7.Paris1.Level3.net (4.69.133.238) 112.460 ms 111.249 ms 109.206 ms
8 212.3.235.202 (212.3.235.202) 87.401 ms 87.113 ms 86.841 ms
9 49e-s202b-1-dc2-a9k1.dc2.poneytelecom.eu (195.154.1.29) 86.806 ms 86.919 ms 87.126 ms
10 51.158.8.83 (51.158.8.83) 87.125 ms 86.566 ms 88.011 ms
11 wxcafe.net (62.210.115.205) 87.847 ms 87.766 ms 87.778 ms
```

Here's an example of a traceroute from my laptop to `wxcafe.net`. Just at a
glance, we can see a few things: I'm on Verizon's network. The first hop is my
private router (it has a private IPv4 address). The next hop does not send us
ICMP TTL Expired packets for some reason. After that, we got three answers from
two different routers: Verizon does some load balancing, so we're going over
multiple paths. Then once again a hop that doesn't answer, then two answers from
Verizon core (this NSFNET block is now Verizon's...), then again Verizon core,
and suddenly we went over the Atlantic and we're on Level3's Paris1 router! Then
another Level3 IP that doesn't have a reverse DNS entry, and we enter
Online.net's network (poneytelecom is their ISP name). Finally we see the
server's gateway, and the server itself!

The numbers after the host part all show round trip time (there are three
because traceroute sends three packets per hop by default), so we can spot
very clearly the moment we went from the US over to France even without looking
at the router names: when it goes from 4.2ms to 112ms, it's because the packet
took a trip in some submarine cables. We can also see that some later hops have
lower RTT than some earlier ones (for example hop 5 has a lower RTT than hop 3,
and hops 8, 9, 10 and 11 all have lower RTTs than hop 7). This is due to the
fact that traceroute gets data from each hop independently: the replies from
hop 8 have no link with the replies from hop 7, and in general network devices
are much faster at forwarding packets than they are at generating ICMP TTL
Expired replies. Thus the packets we got back from hop 7 didn't necessarily
take longer to travel back to us, they just took longer to be generated (though
they **can** sometimes take longer to travel back: the path the packets take
from our machine to the target is not necessarily the same as the one they take
to get from some random hop on the way back to our machine!).
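
If you'd rather not eyeball the numbers, spotting the jump can be scripted. Here's a rough sketch that pulls the RTTs out of `traceroute` output and finds the biggest increase between two consecutive answering hops. `best_rtts` and `largest_jump` are hypothetical helpers, and the regex assumes the Linux `traceroute` output format shown above. I use the lowest RTT per hop on the assumption that it is the value least affected by slow ICMP generation.

```python
import re

# Excerpt of the traceroute output above (hops 3 to 5 left out for brevity).
SAMPLE = """\
 1  bowser.wx (10.0.42.1)  0.224 ms  0.272 ms  0.324 ms
 2  * * *
 6  verizon.com.customer.alter.net (152.179.78.154)  4.245 ms  3.719 ms  3.251 ms
 7  ae-2-3211.edge7.Paris1.Level3.net (4.69.133.238)  112.460 ms  111.249 ms  109.206 ms
 8  212.3.235.202 (212.3.235.202)  87.401 ms  87.113 ms  86.841 ms
"""

# A number immediately followed by " ms" (and not the tail of an IP address).
RTT_PATTERN = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?) ms")


def best_rtts(output):
    """Map hop number -> lowest RTT in ms, skipping hops that never answered."""
    result = {}
    for line in output.strip().splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue                      # header line or something unexpected
        rtts = [float(value) for value in RTT_PATTERN.findall(line)]
        if rtts:
            result[int(fields[0])] = min(rtts)
    return result


def largest_jump(best):
    """Return (increase in ms, hop before, hop after), or None if < 2 hops."""
    hops = sorted(best)
    jumps = [(best[b] - best[a], a, b) for a, b in zip(hops, hops[1:])]
    return max(jumps) if jumps else None


if __name__ == "__main__":
    increase, before, after = largest_jump(best_rtts(SAMPLE))
    print(f"biggest jump: +{increase:.1f} ms between hop {before} and hop {after}")
```

On this excerpt the biggest jump lands between hops 6 and 7, which matches the Atlantic crossing described above.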

Now, let's see another one:

``` shell
> sudo traceroute -T -p 22 imaginair.es
traceroute to imaginair.es (188.40.106.245), 30 hops max, 60 byte packets
1 bowser.wx (10.0.42.1) 0.169 ms 0.165 ms 0.206 ms
2 * * *
3 B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48) 5.816 ms 5.821 ms 5.868 ms
4 * * *
5 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131) 3.578 ms 0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107) 3.513 ms 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131) 3.572 ms
6 ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145) 2.903 ms 3.714 ms 3.695 ms
7 et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226) 85.106 ms 84.522 ms 83.907 ms
8 46.33.77.6 (46.33.77.6) 88.457 ms 88.430 ms 89.192 ms
9 core21.fsn1.hetzner.com (213.239.245.217) 98.676 ms 99.107 ms 99.088 ms
10 ex9k1.dc13.fsn1.hetzner.com (213.239.245.238) 99.047 ms 97.777 ms ex9k1.dc13.fsn1.hetzner.com (213.239.245.242) 97.651 ms
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
```

Here, we can see that it starts the same way as before, except it goes through
Frankfurt, Germany instead of Paris, but then it stops inside Hetzner's
network... Why? Because the firewall of the target (imaginair.es) filters TCP
port 22, and will neither accept the packet nor forward it. So it's silently
dropped, and traceroute gets nothing back: no ICMP TTL Expired, and no SYN-ACK
or RST from the target either! As it doesn't know what happened, it keeps going
up to its maximum TTL (30 by default) and then gives up.
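
When the output ends in a wall of `* * *`, the useful bit of information is the last hop that did answer. A minimal sketch for extracting it, assuming the same Linux `traceroute` output format as above (`last_responding_hop` is a hypothetical helper):

```python
# Tail of the traceroute output above, cut after hop 13.
SAMPLE = """\
 9  core21.fsn1.hetzner.com (213.239.245.217)  98.676 ms  99.107 ms  99.088 ms
10  ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)  99.047 ms  97.777 ms ex9k1.dc13.fsn1.hetzner.com (213.239.245.242)  97.651 ms
11  * * *
12  * * *
13  * * *
"""


def last_responding_hop(output):
    """Return (hop number, first host shown) for the last hop that answered.

    Returns None if no hop answered at all.
    """
    last = None
    for line in output.strip().splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue                      # not a hop line
        if any(field != "*" for field in fields[1:]):
            last = (int(fields[0]), fields[1])
    return last


if __name__ == "__main__":
    print(last_responding_hop(SAMPLE))
```

Here that points at Hetzner's `ex9k1` router on hop 10, so whatever is eating the packets sits right behind it.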

Alright, let's move to `mtr`...

``` shell
> mtr -w -T -P 5050 wxcafe.net
Start: 2019-12-13T18:54:21-0500
HOST: cwh Loss% Snt Last Avg Best Wrst StDev
1.|-- bowser.wx 0.0% 10 0.5 0.4 0.3 0.6 0.1
2.|-- tunnel536764.tunnel.tserv4.nyc4.ipv6.he.net 0.0% 10 5.0 5.3 4.3 6.1 0.6
3.|-- ve422.core1.nyc4.he.net 0.0% 10 2.8 3.2 2.6 4.1 0.5
4.|-- 100ge4-1.core1.par2.he.net 50.0% 10 89.8 88.0 73.8 97.3 9.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- 2001:bc8:400:1::8e 0.0% 10 75.1 75.0 74.4 75.8 0.4
7.|-- 2001:bc8:400:100::7f 20.0% 10 74.6 126.5 74.4 591.5 163.4
8.|-- wxcafe.net 0.0% 10 74.3 78.3 73.8 113.9 12.5
```

Here I try to `mtr` from my machine to `wxcafe.net`, over TCP port 5050. At
first glance we can see that I use Hurricane Electric's tunnel service to get
IPv6 (because Verizon won't provide v6 yet... Come on, it's 2019...), and that
this time most of the way goes through HE's network (up to Paris). This is
probably because they peer directly with Online.net/Iliad in Paris, and don't
want to pay for the traffic by sending it to one of their transits when they can
transport it over to the peering point.

We can also see that there's a lot more info visible, and that the layout looks
a lot better! Here the fields are, in order: hostname, Loss percentage, number
of packets sent, RTT of the last packet, average RTT, best RTT, worst RTT, and
standard deviation of the RTTs.
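
To show how those columns map onto a line of output, here's a small parsing sketch. It assumes the plain `-w` report layout shown above (no `-b` or `-z`, which add extra columns), and `MtrHop` and `parse_mtr_line` are names I made up. If you have a choice, `mtr -j` gives you JSON and is probably a saner thing to parse.

```python
from dataclasses import dataclass


@dataclass
class MtrHop:
    number: int
    host: str
    loss_pct: float
    sent: int
    last: float
    avg: float
    best: float
    worst: float
    stdev: float


def parse_mtr_line(line):
    """Parse one hop line of a plain `mtr -w` report into an MtrHop."""
    fields = line.split()
    number = int(fields[0].split(".")[0])        # "4.|--" -> 4
    host = " ".join(fields[1:-7])                # everything before the stats
    loss, sent, last, avg, best, worst, stdev = fields[-7:]
    return MtrHop(
        number=number,
        host=host,
        loss_pct=float(loss.rstrip("%")),        # "50.0%" or "100.0"
        sent=int(sent),
        last=float(last),
        avg=float(avg),
        best=float(best),
        worst=float(worst),
        stdev=float(stdev),
    )


if __name__ == "__main__":
    line = "  4.|-- 100ge4-1.core1.par2.he.net  50.0%  10  89.8  88.0  73.8  97.3  9.0"
    print(parse_mtr_line(line))
```

Counting the seven stats columns from the right keeps the parsing independent of the column widths.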

From that we can deduce that it sent 10 packets, and thus the
Last/Average/Best/Worst/Standard Deviation fields are a lot more useful than the
simple three RTT values we got from `traceroute`!

We also notice that in the Loss% column, besides the host that didn't answer our
probes, there are also two hops that show 50% and 20% loss respectively. Now, we
could jump to the conclusion that this means these hops dropped our packets, and
that something's wrong with them! But on closer inspection, later hops don't
show that drop, and everything works well... That's weird.

The reason why that's happening is simple: sometimes, routers have other things
to do with their time than reply to any rando's packet that has an expired TTL.
Replying with an ICMP TTL Expired packet is actually very low priority for
routers, and when they have other stuff going on they sometimes simply don't
answer. This obviously doesn't mean that there's something *actually* wrong on
the path, or the "Loss" would continue down to the later hops! Reading this kind
of loss as a real problem is actually a very common error.
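
That rule of thumb (loss only counts if it carries on to the later hops) is easy to turn into code. The sketch below is a heuristic under that assumption, not something `mtr` does itself. `persistent_loss` and `first_suspect_hop` are hypothetical names, and the second data set is invented for contrast.

```python
def persistent_loss(losses):
    """For each hop, the loss percentage still visible at every later hop."""
    carried = []
    floor = float("inf")
    for loss in reversed(losses):
        floor = min(floor, loss)     # lowest loss seen from here to the target
        carried.append(floor)
    return carried[::-1]


def first_suspect_hop(losses):
    """1-based number of the first hop whose loss persists to the target."""
    for hop, loss in enumerate(persistent_loss(losses), start=1):
        if loss > 0:
            return hop
    return None


# Loss% column of the mtr report above, hop 1 to hop 8.
FROM_REPORT = [0.0, 0.0, 0.0, 50.0, 100.0, 0.0, 20.0, 0.0]
# Invented example where loss starts at hop 3 and never goes away.
INVENTED = [0.0, 0.0, 30.0, 35.0, 30.0]

if __name__ == "__main__":
    print(first_suspect_hop(FROM_REPORT))
    print(first_suspect_hop(INVENTED))
```

For the report above this finds nothing to worry about, since the target itself shows 0% loss. In the invented case it points at hop 3, where the loss starts and then sticks around all the way to the end.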

Let's look at a last one:

``` shell
> mtr -4 -wbz -T -P 22 imaginair.es
Start: 2019-12-13T19:44:34-0500
HOST: cwh Loss% Snt Last Avg Best Wrst StDev
1. AS??? bowser.wx (10.0.42.1) 0.0% 10 0.6 0.5 0.3 0.7 0.1
2. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
3. AS701 B3447.NYCMNY-LCR-22.verizon-gni.net (100.41.130.50) 0.0% 10 5.6 6.6 4.4 8.9 1.4
AS701 B3447.NYCMNY-LCR-21.verizon-gni.net (100.41.130.48)
4. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5. AS??? 0.ae5.BR1.NYC1.ALTER.NET (140.222.228.107) 0.0% 10 63.9 9.7 3.1 63.9 19.1
AS??? 0.ae6.BR1.NYC1.ALTER.NET (140.222.228.131)
6. AS3257 4436ae13.cr0-nyc2.ip4.gtt.net (173.205.47.145) 0.0% 10 3.8 4.2 2.5 6.9 1.3
7. AS3257 4436et-0-0-49.cr11-fra2.ip4.gtt.net (89.149.180.226) 0.0% 10 84.0 85.8 83.7 94.8 3.4
8. AS3257 443646.33.77.6 0.0% 10 89.0 100.1 89.0 162.4 22.6
9. AS24940 core21.fsn1.hetzner.com (213.239.245.217) 0.0% 10 98.8 100.4 98.1 115.4 5.3
AS24940 core22.fsn1.hetzner.com (213.239.245.178)
10. AS24940 ex9k1.dc13.fsn1.hetzner.com (213.239.245.242) 0.0% 10 100.0 98.8 97.7 100.0 0.7
AS24940 ex9k1.dc13.fsn1.hetzner.com (213.239.245.238)
11. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
```

Here again we chose a target that won't work, `imaginair.es` on TCP port 22. In
this case, though, there is no long trail of `* * *`: mtr simply shows
`AS??? ??? 100.0`, i.e. 100% loss. It's clear what's happening: if the last hop
is unknown with 100% loss, the traffic is being blocked somewhere.

We can also see multiple addresses for some hops; once again these are due to
load-balancing. Some of the ASN lookups failed, which happens sometimes.

There was also some display error on hops 6, 7 and 8 (the stray `4436` glued to
the front of the hostnames), probably because the AS lookup code got two results
and displayed both, breaking the display... :/ Here the right address for hop 8
is `46.33.77.6`.

---

Anyway, if you want to report a network problem to an engineer... generally,
you're better off running `mtr -wbz <target>` (wide report, both hostnames and
IPs, AS numbers) and letting the person on the other end figure it out. And
don't open a report if you're not sure it's a network error!
