Tuesday, May 20, 2014

10G > 1G

10Gb/s ain't what it used to be

It was only a few years ago that 10Gb/s kit cost tens of thousands of dollars and needed massive XenPaks to plug in as optics. It's now 2014, 10Gb/s SFP+ optics cost about $200 each, and the closest I get to a XenPak is the broken one I use as a bottle opener.

Because it's so cheap, it's a no-brainer to put 10Gb/s NICs in your servers, but there's no guarantee that your network can support 10Gb/s the whole way through. You might think that a 1Gb/s bottleneck in your network isn't a big deal, and that TCP can fumble around and find the top speed for your connection, but you might be disappointed to hear that it's not that easy.

TCP is dumb

TCP doesn't have any parameters, internal or external, for how fast it's sending data. It has a window that limits how much unacknowledged data can be outstanding, and this window moves along as acknowledgements are received. The size of this window sets an upper bound on the average speed (based on latency and packet loss, feel free to explore this), but not on the instantaneous speed. This becomes a problem as bandwidth and latency both increase.

Long fat networks

The TCP window keeps track of every byte "in flight" - in other words, all data that has been sent and not acknowledged. Once the window is full, TCP can't send any more data until the oldest data has been acknowledged, and it needs a buffer (window) to track this. To keep the path full, the smallest this buffer can be is latency x bandwidth, and this number can get very big very quickly. If you're trying to send at 1Gb/s to a destination 160ms away, you need a window of 20MB - if you want to do this at 10Gb/s, you need a window of 200MB! Compared to the 128GB flash drives that clip onto your keyring, this doesn't seem like a huge amount, but to the switches and routers in your network, this is a lot to soak up if your traffic has to go across a slow part of the network.
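The window arithmetic above is just the bandwidth-delay product. Here is a minimal sketch in POSIX shell, assuming the 160ms figure is the round-trip time and using decimal megabytes as the post does (the function name `bdp_bytes` is made up for this example):

```shell
# Bandwidth-delay product: bytes that must be in flight to keep a path full.
# $1 = bandwidth in bits per second, $2 = round-trip time in milliseconds
bdp_bytes() {
    echo $(( $1 * $2 / 1000 / 8 ))
}

bdp_bytes 1000000000 160     # 1Gb/s at 160ms
bdp_bytes 10000000000 160    # 10Gb/s at 160ms
```

The two calls print 20000000 and 200000000, which are the 20MB and 200MB windows quoted above.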

How 10 goes into 1

If your sender and receiver have 10Gb/s connections, but the network has a 1Gb/s segment in the middle, you can run into interesting problems. With 160ms of latency in the way, your sender dumps 20MB onto the network at 10Gb/s - in the time it takes to arrive at the start of the 1Gb/s segment, the 1Gb/s segment can only send 2MB - leaving 18MB to be dealt with. If you have big enough buffers, then this will eke out into the network at 1Gb/s and everything will be fine!
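The 2MB and 18MB figures can be checked with a few lines of shell. This is only a sketch of the simplified picture in the paragraph above: it assumes the whole 20MB window leaves the sender back to back at 10Gb/s and ignores slow start. The function names are invented for illustration.

```shell
# Time in milliseconds for the sender to emit a burst at its own line rate.
# $1 = burst size in bytes, $2 = sender rate in bits per second
burst_ms() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

# Bytes still queued at the bottleneck once the whole burst has arrived.
# $1 = burst size in bytes, $2 = sender rate (bit/s), $3 = bottleneck rate (bit/s)
backlog_bytes() {
    echo $(( $1 * ($2 - $3) / $2 ))
}

burst_ms 20000000 10000000000
backlog_bytes 20000000 10000000000 1000000000
```

The burst takes 16ms to arrive, and in that time the 1Gb/s segment forwards only 2MB, so the second call prints 18000000.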

However, 20MB is a big buffer - it holds 160ms of data. We've all seen buffer bloat (when buffers fill up and stay full and add extra latency to the network), and heard about it being a bad thing, but this is an instance where buffers are *very* important. If you have no buffers, your TCP stream starts up and immediately drops 90% of its packets, and things go very bad, very fast.

Labbing this up

You can simulate this yourself between two Linux machines with tc and iperf. First, plug them into each other at 10Gb/s, make sure they're tuned (net.ipv4.tcp_rmem/wmem need to have the last parameter set to about 64MB), and test between them. Assuming sub-millisecond latency, you should see very close to 10Gb/s of TCP throughput (if not, the servers are mistuned, or underpowered).

box1: iperf -i1 -s
box2: iperf -i1 -c box1 -t300

Looking good? If not, you're out of luck in this instance - TCP tuning is outside the scope of this post; go and ask ESnet what to do.
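Without going into tuning proper, the one setting mentioned earlier (the last field of net.ipv4.tcp_rmem and tcp_wmem) is easy to check. The sketch below assumes a Linux box with the `sysctl` command available and treats 64MB as 64 x 1024 x 1024 bytes; `triple_max` is a helper name made up for this example.

```shell
# Print the third (maximum) field of a "min default max" sysctl triple.
triple_max() {
    set -- $1
    echo "$3"
}

want=$((64 * 1024 * 1024))   # 64MB, the figure suggested above

for key in net.ipv4.tcp_rmem net.ipv4.tcp_wmem; do
    # Skip quietly on systems that do not expose these sysctls.
    cur=$(sysctl -n "$key" 2>/dev/null) || continue
    max=$(triple_max "$cur")
    if [ "$max" -lt "$want" ]; then
        echo "$key max is $max bytes, below $want"
    fi
done
```

If either line is reported, something like `sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"` raises the maximum; the first two values there are placeholders, so keep whatever your system already uses for them.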

Assuming this is all working, we'll add some latency with a netem qdisc on box1's egress, and see what happens:

sudo tc qdisc add dev eth0 root netem delay 50ms limit 10000

Try the iperf again - is it still working well? If you're not averaging 7-8Gb/s then you might want to do some tuning, and come back when it's looking better.

Now we've got a known-good at 50ms, let's try simulating a 1Gb/s bottleneck in the network. Apply an ingress policer to box2 as follows:

sudo tc qdisc add dev eth0 handle ffff: ingress
sudo tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 match ip src 0.0.0.0/0 police rate 1000mbit burst 100k mtu 100k drop flowid :1

Try your iperf again - how does it look? When I tested this, I couldn't get more than 10Mb/s - something is seriously wrong! Let's try sending some UDP through to see what's happening:

box1: iperf -i1 -s -u -l45k
box2: iperf -i1 -u -l45k -c box1 -t300 -b800m

Odd... we can send 800Mb/s of UDP with little or no packet loss, but can't get more than 10-20Mb/s of TCP?!
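A rough way to see why the UDP test survives while TCP does not: a token-bucket policer only forwards a back-to-back burst cleanly for as long as its bucket lasts. The sketch below assumes a simple single-bucket model, that tc reads `burst 100k` as 100 x 1024 bytes, and that the sender bursts at the full 10Gb/s; `max_clean_burst` is a name invented for this example.

```shell
# Largest back-to-back burst a token-bucket policer passes without dropping.
# The bucket drains at (arrival - policed) rate while the burst is arriving.
# $1 = bucket size in bytes, $2 = arrival rate (bit/s), $3 = policed rate (bit/s)
max_clean_burst() {
    echo $(( $1 * $2 / ($2 - $3) ))
}

max_clean_burst 102400 10000000000 1000000000
```

Under those assumptions the answer is a little under 114kB. Each 45k UDP datagram fits inside that with room to spare, whereas a TCP sender that releases a large window in one go would overrun the bucket almost immediately.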

The fix for this is adding a shaper to make sure nothing gets dropped by the policer. We can add this on box1 in this instance as follows:

sudo tc qdisc del dev eth0 root
sudo tc qdisc add dev eth0 root handle 1: tbf rate 900mbit burst 12000 limit 500m mtu 100000
sudo tc qdisc add dev eth0 parent 1: netem delay 50ms limit 10000

You'll notice we deleted and then re-added the latency - this is just a limitation of how we chain these qdiscs together. But give it a shot - try an iperf between the two boxes with TCP, and magic - you can get 850Mb/s now!
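To confirm that the shaper really is keeping the policer quiet, the drop counters from tc can be compared before and after a test run. This is a small sketch that assumes the usual `(dropped N, overlimits ...)` statistics line printed by `tc -s`; the helper name `dropped_counts` is made up.

```shell
# Pull the "dropped N" counters out of `tc -s ... show` output read on stdin.
dropped_counts() {
    sed -n 's/.*(dropped \([0-9][0-9]*\),.*/\1/p'
}

# Usage on a live box (eth0 as in the examples above):
#   tc -s qdisc show dev eth0 | dropped_counts
#   tc -s filter show dev eth0 parent ffff: | dropped_counts
```

If the counter on the box2 policer stops climbing once the shaper is in place on box1, the shaper is doing its job.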

A sustainable fix

We're not going to add shapers everywhere just because we know that some parts of our network have bad bottlenecks. It's okay though - smart people are working on a fix. By adding a knob to TCP where we explicitly say how fast we want it to go, we can make TCP pace itself, instead of emptying out its window onto the network, and then getting upset when it doesn't get magically taken care of. This is still experimental, but I'm keen to hear if anyone has had any luck with it - this is a very important step forward for TCP, and will become gradually more important as our networks get longer and fatter.
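The paragraph above doesn't name the mechanism, so treat the following as one possible illustration only. The fq qdisc (the sch_fq mentioned in the comments below) paces each flow and accepts a `maxrate` cap. The sketch prints the command instead of running it, because replacing the root qdisc would also remove the netem and tbf setup from the lab; `fq_pacing_cmd` is a hypothetical helper name.

```shell
# Build (but do not run) the tc command that installs fq with a per-flow cap.
# $1 = interface, $2 = rate in tc notation, for example 900mbit
fq_pacing_cmd() {
    echo "tc qdisc replace dev $1 root fq maxrate $2"
}

fq_pacing_cmd eth0 900mbit
```

Review the printed line and run it with sudo on the sending host if it looks right for your setup.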


  1. 1) netem is hard to trust, and combining it with other qdiscs is dicey
    2) Policers suck, due to inbound buffer limits forcing drops
    3) sch_fq on the hosts works really well now
    4) fq_codel on the 1Gb/sec link will also work really well
    but you have to actually have a 1Gb/sec box in the middle, and in most
    cases you'll end up dropping at the switch rather than at the box.

    I'd be rather interested in results with sch_fq and fq_codel in your setup.

  2. Second, your statement about "With 160ms of latency in the way, your sender dumps 20MB onto the network at 10Gb/s - in the time it takes to arrive at the start of the 1Gb/s segment, the 1Gb/s segment can only send 2MB - leaving 18MB to be dealt with. If you have big enough buffers, then this will eke out into the network at 1Gb/s and everything will be fine!"

    is not correct. TCP will ramp up to sending a little over 1Gb/sec + buffering, then have a drop, hopefully long before it has a 10x1 bandwidth disparity.

    1. That all looks cool, codel looks quite interesting. Yes, I have oversimplified how TCP works here, but it does illustrate the idea that TCP is good at sending bursty traffic, and bad at sending at a constant rate.

      The reason I'm chasing this particular setup is because I've seen it a few times on WANs in the last few years, and it's common when purchasing a 1Gb/s circuit for the rate limit to be enforced by a policer and leave queueing up to the customer. The take-home message is that rate shapers with deep buffers are a good thing (tm) for elephant flows, and this needs to be balanced against buffer bloat - it's just interesting looking at where and how you apply shapers.

  3. Your policer is incorrect for ipv6 and other traffic...

    As for using fq_codel instead of a policer, here is a simple setup for a test. I used 800mbit because I don't have 10gigE available to step down from.

    I think this gives the best of both worlds - low latency AND deep enough buffers for high throughput.

    IFACE=lo # pick your interface
    DEV=ifb0 # not set in the original script; ifb0 is assumed from the modprobe line

    modprobe ifb # the module is named ifb; loading it creates ifb0
    modprobe sch_fq_codel
    # the two del commands fail harmlessly if nothing is configured yet
    tc qdisc del dev $IFACE handle ffff: ingress 2>/dev/null
    tc qdisc add dev $IFACE handle ffff: ingress

    tc qdisc del dev $DEV root 2>/dev/null
    tc qdisc add dev $DEV root handle 1: htb default 10
    # you can probably tune the htb quantum some and/or add per ip buckets
    tc class add dev $DEV parent 1: classid 1:1 htb rate 800mbit
    tc class add dev $DEV parent 1:1 classid 1:10 htb rate 800mbit ceil 800mbit
    tc qdisc add dev $DEV parent 1:10 handle 110: fq_codel
    ifconfig $DEV up # if you forget to do this you'll lock up the interface

    # this matches both ipv6 and ipv4 traffic (and arp, etc)
    # and redirects everything arriving on $IFACE through the shaper on $DEV

    tc filter add dev $IFACE parent ffff: protocol all prio 10 u32 \
    match u32 0 0 flowid 1:1 action mirred egress redirect dev $DEV

  4. Using your policer I get 9mbit/sec out of localhost. Using the one above (and configured for 1000mbit) I get 940mbits. I'll try it myself on real hardware later, but it is my hope you'll do better on your hardware with fq_codel enabled (more with ecn enabled).

    Combining it with netem is a bit dicey, you should do that on another interface or box to get a correct result.

  5. I'm not trying to make a better policer here. It's really common to buy an n Gb/s circuit from a supplier and find they've put a very aggressive policer on it, and then have to find workarounds. If I change the parameters on my policer and reconfigure the burst from 100kB to 1MB or 10MB then the problem also goes away.

    Having a bad policer in the middle is the whole point of this lab, as it demonstrates a common problem in the ISP world.

  6. My point being you can point them at a script that actually works well as a policer.... (but I haven't had a chance to get it going on real hardware yet)

  7. Like I said, raising the burst size to 1MB or 10MB on the simple policer fixes the problem too - codel looks to be useful for some things, but in the real world a managed circuit would be protected by policers with small burst sizes on vendor hardware, not by a kernel patch on a Linux box.