Yahoo’s L3 Direct Server Return (DSR), an alternative to LVS-TUN, explored.


I had the opportunity to play with Yahoo’s implementation of L3 DSR using LVS and iptables. It was great fun, so I thought I had better write a blog post about using it.

So, what exactly is L3 DSR and why should I care when I already have LVS-TUN?

L3 DSR is an alternative technique for achieving direct server return (DR or DSR) at Layer 3. Instead of using an IPIP tunnel like LVS-TUN, it changes the destination IP address when sending traffic to the real server, just as LVS-NAT does (in fact, we'll be using LVS in NAT mode!). At the same time, it marks the traffic using the DSCP field in the IP header, so that the destination address can be rewritten back to the VIP address when the packet arrives at the real server. This rewriting ability is not native to iptables but is provided as a module by Yahoo: https://github.com/yahoo/l3dsr

The process is completed by solving the ARP issue as you would for normal LVS-DR: we'll add the VIP address to the loopback adapter and then configure the server not to respond to ARP requests for it. The reply traffic works just the same as with LVS-TUN and LVS-DR, routing straight back to the client with the VIP as its source address, so you'll need to make sure any router or firewall on the return path is relaxed about spoofed or martian packets.

Why you should care is a harder question to answer.

Is it faster? I don’t know... but probably.

I’m unsure of the true overhead of an IPIP tunnel compared with LVS-NAT plus iptables manipulation.

If anyone knows or can guess better than me, please do comment below! According to the Yahoo presentation, however, it does allow a higher MTU, so it should yield better performance through less fragmentation of packets.
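To put a rough number on the MTU point, here is a small sketch of the arithmetic. It assumes a 1500-byte Ethernet MTU and IPv4/TCP headers without options, and the helper name `mss` is my own. An IPIP tunnel wraps each packet in an extra 20-byte outer IPv4 header, which is what eats into the usable payload:

```shell
#!/bin/sh
# Rough per-segment payload arithmetic (assumes no IP or TCP options).
MTU=1500          # typical Ethernet MTU (assumption)
IP_HDR=20         # IPv4 header without options
TCP_HDR=20        # TCP header without options
IPIP_OVERHEAD=20  # extra outer IPv4 header added by an IPIP tunnel

# mss: hypothetical helper; prints the largest TCP payload that fits in
# one packet for a given number of bytes of encapsulation overhead.
mss() {
    echo $(( MTU - $1 - IP_HDR - TCP_HDR ))
}

echo "L3 DSR (no tunnel): $(mss 0) bytes per segment"
echo "LVS-TUN (IPIP):     $(mss "$IPIP_OVERHEAD") bytes per segment"
```

This only shows the header overhead; it says nothing about the CPU cost of either method.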

Check out the cool presentation:

L3-DSR-Yahoo

It also discusses one drawback of this method. The DSCP field is normally used for QoS on your network, so this method is not compatible with environments where QoS is also needed!

LVS Director

On the LVS Director you’ll need iptables. We’ll configure the VIP as a standard LVS-NAT virtual service and mark its packets using an iptables rule.

My LVS setup looks like this:

[root@lbmaster ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=32768)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.72.28:80 wlc persistent 300
  -> 192.168.65.223:80            Masq    100    0          0

You can use the following commands to create a similar LVS configuration:

[root@lbmaster ~]# echo "
-A -t 192.168.72.28:80 -s wlc -p 300
-a -t 192.168.72.28:80 -r 192.168.65.223:80 -m -w 100" | ipvsadm -R

My iptables rules look like this:

[root@lbmaster ~]# iptables -t mangle -L -n -v
Chain PREROUTING (policy ACCEPT 932 packets, 115K bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DSCP       tcp  --  *      *       0.0.0.0/0            192.168.72.28       tcp dpt:80 DSCP set 0x01

I added the iptables rule using the following command. Just modify it for your VIP and choose a unique number for the mark (it’s just like using firewall marks!). DSCP is a 6-bit field, so the value must be between 0 and 63:

iptables -t mangle -A PREROUTING -p tcp -d 192.168.72.28 --dport 80 -j DSCP --set-dscp 1

Note: This method can be used on our Loadbalancer.org Enterprise appliance! Simply configure a Layer 4 NAT mode VIP as normal, add the firewall rule above to the load balancer under Maintenance > Firewall Script, and complete the real server setup as described below.

Example load balancer Firewall Script entry:

VIP1=192.168.72.28
VIP1_PRT=80
VIP1_DSCP=1
iptables -t mangle -A PREROUTING -p tcp -d ${VIP1} --dport ${VIP1_PRT} -j DSCP --set-dscp ${VIP1_DSCP}
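
If you have more than one VIP, each needs its own DSCP value. The following is only a sketch of how the entry above might be generalised: the second VIP (192.168.72.29:443) and the helper name `print_dscp_rules` are made up for illustration. It prints the iptables commands rather than running them, so you can check them before adding them to the script:

```shell
#!/bin/sh
# Each entry is VIP:PORT:DSCP. The second entry is a made-up example.
VIPS="192.168.72.28:80:1 192.168.72.29:443:2"

# print_dscp_rules: hypothetical helper; prints one iptables command per
# entry instead of executing it.
print_dscp_rules() {
    for entry in $VIPS; do
        # Split "VIP:PORT:DSCP" using plain parameter expansion.
        vip=${entry%%:*}
        rest=${entry#*:}
        port=${rest%%:*}
        dscp=${rest#*:}
        # DSCP is a 6-bit field, so only 0-63 is valid.
        if [ "$dscp" -lt 0 ] || [ "$dscp" -gt 63 ]; then
            echo "invalid DSCP value for $vip: $dscp" >&2
            return 1
        fi
        echo "iptables -t mangle -A PREROUTING -p tcp -d $vip --dport $port -j DSCP --set-dscp $dscp"
    done
}

print_dscp_rules
```

Each real server then needs a matching DADDR rule for every DSCP value it should accept.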

Real Server

The real server is arguably where the other drawback lies: you’ll need to compile the Yahoo iptables module to get this to work. This wasn’t too much of a chore. I've tried it on CentOS 5/6/7; for CentOS 6/7, please follow the instructions below:

1. Install dependencies:

[root@localhost ~]# yum install iptables-services kernel-devel iptables-devel gcc make git

2. Replace firewalld with iptables (CentOS 7; CentOS 6 has no firewalld or systemctl, so skip this step there):

[root@localhost ~]# systemctl disable firewalld.service && systemctl enable iptables.service && systemctl stop firewalld.service && systemctl start iptables.service

3. Clone Yahoo's l3dsr repo using Git:

[root@localhost ~]# git clone https://github.com/yahoo/l3dsr.git

4. Make and install iptables extension:

[root@localhost ~]# cd l3dsr/linux/
[root@localhost linux]# make
[root@localhost linux]# make libdir=/usr/lib64 install

5. Test iptables rule using the new extension:

[root@localhost linux]# iptables -t mangle -A INPUT -m dscp --dscp 1 -j DADDR --set-daddr=192.168.72.28
[root@localhost linux]# iptables -t mangle -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
DADDR      all  --  anywhere             anywhere             DSCP match 0x01DADDR set 192.168.72.28

6. Add the iptables rule to your firewall script:

[root@localhost ~]# vi /etc/sysconfig/iptables

Example /etc/sysconfig/iptables:

*mangle
:PREROUTING ACCEPT [187:15126]
:INPUT ACCEPT [187:15126]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [124:16284]
:POSTROUTING ACCEPT [124:16284]
-A INPUT -m dscp --dscp 1 -j DADDR --set-daddr 192.168.72.28
COMMIT
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [184:27992]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT

Note: I experienced an annoying problem when using "iptables-save". When it saves the rules, it omits the space between "-j DADDR" and "--set-daddr", which stops iptables from reloading them!
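
One possible workaround is to filter the saved rules through sed before writing them to disk. This is only a sketch: it assumes the broken text appears literally as "-j DADDR--set-daddr" in your iptables-save output (check yours first), and the helper name `fix_daddr_spacing` is my own:

```shell
#!/bin/sh
# fix_daddr_spacing: hypothetical helper; reads iptables-save output on
# stdin and re-inserts the missing space before --set-daddr.
# Assumes the broken form is exactly "-j DADDR--set-daddr".
# Lines that are already correct pass through unchanged.
fix_daddr_spacing() {
    sed 's/-j DADDR--set-daddr/-j DADDR --set-daddr/'
}

# Demonstration on a sample broken line:
echo '-A INPUT -m dscp --dscp 0x01 -j DADDR--set-daddr 192.168.72.28' | fix_daddr_spacing
```

If the output looks right for your rules, it could be used as "iptables-save | fix_daddr_spacing > /etc/sysconfig/iptables".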

7. Solve the ARP issue:

[root@localhost ~]# echo "
DEVICE=lo:1
IPADDR=192.168.72.28
NETMASK=255.255.255.255
ONBOOT=yes" > /etc/sysconfig/network-scripts/ifcfg-lo:1
[root@localhost ~]# systemctl restart network
[root@localhost ~]# echo "
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.enp0s3.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
net.ipv4.conf.enp0s3.arp_announce=2" > /etc/sysctl.d/98-arp-values.conf
[root@localhost ~]# sysctl --system

Note: Change "enp0s3" to your primary network adapter, that is, the adapter the traffic will be received on.

Okay, let's see if it works...

We'll connect using a web browser and see if we can catch the connection with ipvsadm and tcpdump:

[root@lbmaster ~]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=32768)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.72.28:80 wlc persistent 300
  -> 192.168.65.223:80             Masq    100    3          3
[root@lbmaster ~]# ipvsadm -lcn
IPVS connection entries
pro expire state       source             virtual            destination
TCP 15:00  ESTABLISHED 192.168.72.30:33906 192.168.72.28:80  192.168.65.223:80
TCP 00:07  CLOSE       192.168.72.30:33896 192.168.72.28:80  192.168.65.223:80
TCP 00:29  CLOSE_WAIT  192.168.72.30:33842 192.168.72.28:80  192.168.65.223:80
TCP 00:07  CLOSE       192.168.72.30:33888 192.168.72.28:80  192.168.65.223:80
TCP 04:57  NONE        192.168.72.30:0    192.168.72.28:80  192.168.65.223:80
TCP 15:00  ESTABLISHED 192.168.72.30:33904 192.168.72.28:80  192.168.65.223:80
TCP 15:00  ESTABLISHED 192.168.72.30:33902 192.168.72.28:80  192.168.65.223:80
[root@localhost ~]# tcpdump -i enp0s3 '(ip and (ip[1] & 0xfc) >> 2 == 1)' -vvv
tcpdump: listening on enp0s3, link-type EN10MB (Ethernet), capture size 262144 bytes
23:57:43.212659 IP (tos 0x4, ttl 64, id 62694, offset 0, flags [DF], proto TCP (6), length 462)
    192.168.72.30.33906 > localhost.localdomain.http: Flags [P.], cksum 0x8ed3 (correct), seq 1639366993:1639367403, ack 1125292737, win 229, options [nop,nop,TS val 219873249 ecr 4457346], length 410: HTTP, length: 410
        GET / HTTP/1.1
        Host: 192.168.72.28
        Connection: keep-alive
        Cache-Control: max-age=0
        Upgrade-Insecure-Requests: 1
        User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36
        Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        Accept-Encoding: gzip, deflate
        Accept-Language: en-US,en;q=0.8,en-GB;q=0.6

Note: In the above example, the "1" after ">> 2 ==" is the DSCP value we used to mark traffic. Also remember to use your own interface in place of "enp0s3".
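
If the filter expression looks cryptic: ip[1] is the second byte of the IP header (the old TOS byte), whose upper six bits hold the DSCP value and whose lower two bits are used for ECN. The sketch below does the same mask-and-shift in shell arithmetic; the helper name `tos_to_dscp` is my own:

```shell
#!/bin/sh
# tos_to_dscp: hypothetical helper; prints the DSCP value carried in a
# TOS byte, using the same arithmetic as the tcpdump filter.
tos_to_dscp() {
    # 0xfc masks off the two low-order ECN bits; >> 2 shifts DSCP down.
    echo $(( ($1 & 0xfc) >> 2 ))
}

# The capture above reports "tos 0x4":
tos_to_dscp 0x4
```

This is why a packet marked with DSCP 1 shows up in tcpdump as "tos 0x4".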

So, in conclusion...

This is a pretty cool method of achieving direct server return across a router hop! I think LVS-TUN is slightly easier to set up, but marking the DSCP field instead may offer some performance benefits over using tunnels.

If this sounds desirable to you then give it a try and let us know how you get on by commenting below!