Using Linux TCP Splicing with HAProxy Willy Tarreau - 2007/01/06 - Alexandre Cassen has started a project called Linux Layer7 Switching (L7SW), whose goal is to provide kernel services to help userland proxies achieving very high performance. Right now, the project consists in a loadable kernel module providing TCP Splicing under Linux. TCP Splicing is a method by which a userland proxy can tell the kernel that it considers it has no added value on the data part of a connection, and that the kernel can perform the transfers it itself, thus relieving the proxy from a potentially heavy job. There are two advantages to this method : - it reduces the number of process wakeups - it reduces the number of data copies between user-space and kernel buffers This method is particularly suited to protocols in which data is sent till the end of the session. This is the case for FTP data for instance, and it is also the case for the BODY part of HTTP/1.0. The great news is that haproxy has been designed from the beginning with a clear distinction between the headers and the DATA phase, so it was a child's game to add hooks to Alex's library in it Be careful! Both versions are to be considered BETA software ! Run them on your systems if you want, but do not complain if it crashes twice a day ! Anyway, it seems stable on our test machines. In order to use TCP Splicing on haproxy, you need : - Linux Layer7 Switching code version 0.1.1 : [ http://linux-l7sw.sf.net/ ] - Haproxy version 1.3.5 : [ http://haproxy.1wt.eu/download/1.3/src/ ] Then, you must untar both packages in any location, let's assume you'll be using /tmp. First extract l7sw and : $ cd /tmp $ tar zxf layer7switch-0.1.1.tar.gz $ cd layer7switch-0.1.1 L7SW currently only supports Linux kernel 2.6.19+. If you prefer to use it on a more stable kernel, such as 2.6.16.X, you can apply this patch to the L7SW directory : [ http://haproxy.1wt.eu/download/patches/tcp_splice-0.1.1-linux-2.6.16.diff ] $ patch -p1 -d kernel < tcp_splice-0.1.1-linux-2.6.16.diff Alternatively, if you prefer to run it on 2.4.33+, you can apply this patch to the L7SW directory : [ http://haproxy.1wt.eu/download/patches/tcp_splice-0.1.1-linux-2.4.33.diff ] $ patch -p1 -d kernel < tcp_splice-0.1.1-linux-2.4.33.diff Then build the kernel module as described in the L7SW README. Basically, you just have to do this once your tree has been patched : $ cd kernel $ make You can either install the resulting module (tcp_splice) or load it now. During early testing periods, it might be preferable to avoid installing anything and just load it manually : $ sudo insmod tcp_splice.*o $ cd .. Now that the module is loaded, you need to build the libtcpsplice library on which haproxy currently relies : $ cd userland/libtcpsplice $ make $ cd .. For the adventurous, there's also a proof of concept in the userlan/switchd directory, it may be useful if you encounter problems with haproxy for instance. But it is not needed at all here. OK, L7SW is ready. Now you have to extract haproxy and tell it to build using libtcpsplice : $ cd /tmp $ tar zxf haproxy-1.3.5.tar.gz $ cd haproxy-1.3.5 $ make USE_TCPSPLICE=1 TCPSPLICEDIR=/tmp/layer7switch-0.1.1/userland/libtcpsplice There are other options to make, which are hugely recommended, such as CPU=, REGEX=, and above all, TARGET= so that you use the best syscalls and functions for your system. Generally you will use TARGET=linux26, but 2.4 users with an epoll-patched kernel will use TARGET=linux24e. This is very important because failing to specify those options will disable important optimizations which might hide the tcpsplice benefits ! Please consult the haproxy's README. Now that you have haproxy built with support for tcpsplice, and that the module is loaded, you have to write a config. There is an example in the 'examples' directory. Basically, you just have to add the "option tcpsplice" keyword BOTH in the frontend AND in the backend sections that you want to accelerate. If the option is specified only in the frontend or in the backend, then no acceleration will be used. It is designed this way to allow some front-back combinations to use it without forcing others to use it. Of course, if you use a single "listen" section, you just have to specify it once. As of now (l7sw-0.1.1 and haproxy-1.3.5), you need the CAP_NETADMIN capability to START and to RUN. For human beings, it means that you have to start haproxy as root and keep it running as root, so it must not drop its priviledges. This is somewhat annoying, but we'll try to find a solution later. Also, l7sw-0.1.1 does not yet support TCP window scaling nor SACK. So you have to disable both features on the proxy : $ sudo sysctl -w net.ipv4.tcp_window_scaling=0 $ sudo sysctl -w net.ipv4.tcp_sack=0 $ sudo sysctl -w net.ipv4.tcp_dsack=0 $ sudo sysctl -w net.ipv4.tcp_tw_recycle=1 You can now check that everything works as expected. Run "vmstat 1" or "top" in one terminal, and haproxy in another one : $ sudo ./haproxy -f examples/tcp-splicing-sample.cfg Transfering large file through it should not affect it much. You should observe something like 10% CPU instead of 95% when transferring 1 MB files at full speed. You can play with the tcpsplice option in the configuration to see the effects. Troubleshooting --------------- This software is still beta, and you will probably encounter some caveats. I personnally ran into a few issues that we'll try to address with Alex. First of all, I had occasionnal lockups on my SMP machine which I never had on an UP one. So if you get problems on an SMP machine, please reboot it in UP and do not lose your time on this. I also noticed that sometimes, some sessions remained established even after the end of the program. You might also see some situtations where even after the proxy's exit, the traffic still passes through the system. It may happen when you have a limited source port range and that you reuse a TIME_WAIT session matching exactly the same source and destinations. This will need to be addressed too. You can play with tcp_splice variables and timeouts here in /proc/sys/net/ : $ ls /proc/sys/net/tcp_splice/ debug_level timeout_established timeout_listen timeout_synsent timeout_close timeout_finwait timeout_synack timeout_timewait timeout_closewait timeout_lastack timeout_synrecv $ sysctl net/tcp_splice net.tcp_splice.debug_level = 0 net.tcp_splice.timeout_synack = 120 net.tcp_splice.timeout_listen = 120 net.tcp_splice.timeout_lastack = 30 net.tcp_splice.timeout_closewait = 60 net.tcp_splice.timeout_close = 10 net.tcp_splice.timeout_timewait = 120 net.tcp_splice.timeout_finwait = 120 net.tcp_splice.timeout_synrecv = 60 net.tcp_splice.timeout_synsent = 120 net.tcp_splice.timeout_established = 900 You can also consult the full session list here : $ head /proc/net/tcp_splice_conn FromIP FPrt ToIP TPrt LocalIP LPrt DestIP DPrt State Expires 0A000301 4EBB 0A000302 1F40 0A000302 817B 0A000301 0050 CLOSE 7 0A000301 4E9B 0A000302 1F40 0A000302 8165 0A000301 0050 CLOSE 7 Since a session exists at least in CLOSE state for 10 seconds, you just have to consult this entry less than 10 seconds after a test to see a session. Please report your successes, failures, suggestions or fixes to the L7SW mailing list here (do not use the list to report other haproxy bugs) : https://lists.sourceforge.net/lists/listinfo/linux-l7sw-devel Motivations ----------- I've always wanted haproxy to be the fastest and most reliable software load balancer available. L7SW is an opportunity to make get a huge performance boost on high traffic sites (eg: photo sharing, streaming, ...). In turn, I find it a shame that Alex wastes his time redevelopping a proxy as a proof of concept for his kernel code. While it is a fun game to enter into, it really becomes harder when you need to get close to customers' needs. So by porting haproxy early to L7SW, I get both the opportunity to get an idea of what it will soon be capable of, and help Alex spend more time on the complex kernel part. Have fun ! Willy