HAProxy

The Reliable, High Performance TCP/HTTP Load Balancer

2009/08/23 - Quick test of version 1.4-dev2 : barrier of 100k HTTP req/s crossed

Introduction

The first test only accepts a new connection, reads the request, parses it, checks an ACL, sends a redirect and closes. A session rate of 132000 connections per second could even be measured in pure TCP mode, but this is not very useful :

The second test forwards the request to a real server instead, and fetches a 64-byte object :

These improvements are due to the ability to tell the system to merge some carefully chosen TCP packets at critical phases of the session. This results in a lower number of packets per session, which in turn saves both bandwidth and CPU cycles. The smallest session is now down to 5-6 packets on each side, down from 9 initially.
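As a rough illustration of what the packet reduction means, the sketch below computes the per-side packet rate and the fraction of packets saved. The 100000 sessions/s figure is only an illustrative input, and the function names are made up for this example.

```python
# Packets per session on each side, as quoted in the text above.
PACKETS_BEFORE = 9
PACKETS_AFTER = (5, 6)


def packet_rate(sessions_per_s, packets_per_session):
    """Packets per second on one side for a given session rate."""
    return sessions_per_s * packets_per_session


def saving(before, after):
    """Fraction of packets saved when going from `before` to `after`."""
    return 1 - after / before


if __name__ == "__main__":
    rate = 100_000  # illustrative session rate, not a measurement
    print("before:", packet_rate(rate, PACKETS_BEFORE), "packets/s")
    for after in PACKETS_AFTER:
        print(f"after ({after} pkts):", packet_rate(rate, after), "packets/s,",
              f"{saving(PACKETS_BEFORE, after):.0%} saved")
```

Under these assumptions, going from 9 to 5-6 packets saves between one third and a bit less than half of the packets on each side.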

2009/04/18 - New benchmark of HAProxy at 10 Gbps using Myricom's 10GbE NICs (Myri-10G PCI-Express)

Introduction

Precisely

one year ago

high performance 10GbE NICs

Myricom

Lab setup

myri10ge

Hardware / software setup :

	Machine  Role    Mobo           CPU                      Kernel         myri drv     myri fw  software
	AMD2     Client  ASUS M3A32MVP  AMD Phenom X4/3GHz       2.6.27smp-wt5  1.4.3-1.358  1.4.36   inject31
	C2D      Proxy   ASUS P5E       intel C2D E8200/2.66GHz  2.6.27smp-wt5  1.4.3-1.358  1.4.36   haproxy-1.3.17-12
	AMD1     Server  ASUS M3A32MVP  AMD X2/3.2GHz            2.6.27smp-wt5  1.4.3-1.358  1.4.36   httpterm 1.3.0

The tests are quite simple. An HTTP request generator runs on the faster Phenom (amd2) since this software is the heaviest of the chain. HAProxy runs on the Core2Duo (c2d). The web server runs on the smaller Athlon (amd1). Connections are point-to-point since I still have no 10GbE switch (donations accepted ;-)). HAProxy is configured to use kernel splicing in the response path :

	listen http-splice
		bind	:8000
		option	splice-response
		server	srv1 1.0.0.2:80

Here is a photo of the machines connected together.

Tests methodology

A script calls the request generator for object sizes from 64 bytes to 10 MB. The request generator continuously connects to HAProxy to fetch the selected object from the server in loops for 1 minute, with 500 to 1000 concurrent connections. Statistics are collected every second, so we have 60 measures. The 5 best ones are arbitrarily eliminated because they may include some noise. The next 20 best ones are averaged and used as the test's result. This means that the 35 remaining values are left unused. This is not a problem because they include values collected during ramping up/down. In practice, tests show that using 20 to 40 values reports the same results. Note that network bandwidth is measured at the HTTP level and accounts for neither TCP ACKs nor TCP headers. Other captures at the network level show slightly higher throughput.
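The selection rule described above can be sketched in a few lines. This is not the original script, only a minimal reconstruction of the stated rule; the function name and its parameters are hypothetical.

```python
def benchmark_result(samples, discard_best=5, keep=20):
    """Average the `keep` best samples after dropping the `discard_best` top ones.

    `samples` holds the per-second measurements (60 of them for a 1-minute run).
    """
    if len(samples) < discard_best + keep:
        raise ValueError("not enough samples for this selection rule")
    ranked = sorted(samples, reverse=True)            # best values first
    kept = ranked[discard_best:discard_best + keep]   # skip the possibly noisy top
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Synthetic data only: 60 fake per-second values from 1 to 60.
    fake = list(range(1, 61))
    print(benchmark_result(fake))
```

With 60 samples, the 35 lowest values, which include the ramp-up and ramp-down seconds, never enter the average.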

The collected values are then passed to another script which produces a GNUPLOT script which, when run, produces a PNG graph. The graph shows in green the number of hits per second, which also happens to be the connection rate since HAProxy does only one hit per connection. In red, we have the data rate (HTTP headers+data only) reached for each object size. In general, the larger the object, the smaller the connection overhead and the higher the bandwidth.
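The relation between the two curves is simple arithmetic: the HTTP-level data rate is the hit rate multiplied by the bytes transferred per hit. The sketch below assumes this plain model; `header_bytes` is a hypothetical parameter, since the actual header size used in the tests is not stated.

```python
def http_data_rate_bps(hits_per_s, object_bytes, header_bytes=0):
    """HTTP-level data rate in bits per second (headers + data only)."""
    return hits_per_s * (object_bytes + header_bytes) * 8


def hits_needed(target_bps, object_bytes, header_bytes=0):
    """Hit rate required to reach `target_bps` with objects of a given size."""
    return target_bps / ((object_bytes + header_bytes) * 8)


if __name__ == "__main__":
    # Hit rate needed to fill 1 Gbps with 4 kB objects, ignoring headers.
    print(round(hits_needed(1e9, 4096)), "hits/s")
```

This is why small objects are bound by the connection rate while large ones are bound by the link: the same bandwidth needs far fewer hits per second as the object grows.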

Tests in single-process mode, 8kB buffers, TCP splicing, LRO enabled, Jumbo frames

splicing

9.950 Gbps

38628 sessions/s

55% more

gigabit is reached for objects larger than about 4kB

The CPU usage has dropped a lot since the introduction of LRO and TCP splicing. Forwarding 9.95 Gbps of Ethernet traffic consumes less than 20% of the CPU :

	root@c2d:tmp# vmstat 1
	procs                      memory      swap          io     system       cpu
	r  b  w   swpd   free   buff  cache   si   so    bi    bo    in    cs  us sy id
	1  0  0      0 1951376   3820  19868    0    0     0     0 25729 23965  1 15 84
	0  0  0      0 1950652   3820  19868    0    0     0     0 25744 23818  3 17 80
	0  0  0      0 1950632   3820  19868    0    0     0     0 25720 24652  1 18 80
	0  0  0      0 1949512   3820  19868    0    0     0     0 25531 24047  3 16 81
	1  0  0      0 1948484   3820  19868    0    0     0     0 25911 22706  2 19 79
	0  0  0      0 1949388   3820  19868    0    0     0     0 26189 23757  3 15 82
	1  0  0      0 1948460   3820  19868    0    0     0     0 25811 23766  1 20 79
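To check the "less than 20%" claim against the capture above, the figures can be averaged with a short script. This sketch assumes the column layout shown in the capture, where the last three fields of each row are `us`, `sy` and `id`; the function name is made up for the example.

```python
# Data rows copied from the vmstat capture above.
VMSTAT_SAMPLE = """\
1  0  0      0 1951376   3820  19868    0    0     0     0 25729 23965  1 15 84
0  0  0      0 1950652   3820  19868    0    0     0     0 25744 23818  3 17 80
0  0  0      0 1950632   3820  19868    0    0     0     0 25720 24652  1 18 80
0  0  0      0 1949512   3820  19868    0    0     0     0 25531 24047  3 16 81
1  0  0      0 1948484   3820  19868    0    0     0     0 25911 22706  2 19 79
0  0  0      0 1949388   3820  19868    0    0     0     0 26189 23757  3 15 82
1  0  0      0 1948460   3820  19868    0    0     0     0 25811 23766  1 20 79
"""


def cpu_averages(text):
    """Return the mean (user, system, idle) percentages over all data rows."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n = len(rows)
    user = sum(int(r[-3]) for r in rows) / n
    system = sum(int(r[-2]) for r in rows) / n
    idle = sum(int(r[-1]) for r in rows) / n
    return user, system, idle


if __name__ == "__main__":
    user, system, idle = cpu_averages(VMSTAT_SAMPLE)
    print(f"user={user:.2f}% system={system:.2f}% idle={idle:.2f}%")
```

Over these seven samples, user plus system time averages a little over 19%, almost all of it spent in the kernel.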

Tests in single-process mode, 8kB buffers, TCP splicing, LRO enabled, standard frames

jumbo frame converter

8.55 Gbps

9.2 Gbps

Concerning the CPU usage, at full load (9.2 Gbps at 1500-byte frames), the CPU usage was extremely low : 2% user, 15% system. That means that the LRO feature of the Myri-10G NIC is extremely efficient at offloading the system :

	root@c2d:tmp# vmstat 1
	procs                      memory      swap          io     system       cpu
	r  b  w   swpd   free   buff  cache   si   so    bi    bo    in    cs  us sy id
	0  0  0      0 1811764   3820  19844    0    0     0     0 26289 19089  2 14 84
	0  0  0      0 1787924   3820  19844    0    0     0     0 26268 19578  1 13 86
	1  0  0      0 1772704   3820  19844    0    0     0     0 26288 18517  3 13 84
	0  0  0      0 1774260   3820  19848    0    0     0     0 26284 19522  2 15 83
	1  0  0      0 1766144   3820  19848    0    0     0     0 26260 19042  3 15 83
	1  0  0      0 1734412   3820  19848    0    0     0     0 26270 18603  4 15 81
	0  0  0      0 1719864   3820  19848    0    0     0     0 26294 18886  1 14 85
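To give an idea of why receive offload matters with standard frames, the sketch below estimates how many frames per second cross the wire at the rates quoted above. It is a simplification that treats the MTU as the whole frame size and ignores Ethernet, IP and TCP overhead; the function name is hypothetical.

```python
def frames_per_second(bits_per_s, frame_bytes):
    """Approximate frame rate for a given bit rate and frame size."""
    return bits_per_s / (frame_bytes * 8)


if __name__ == "__main__":
    # Rates taken from the text; frame sizes are the two MTUs discussed.
    print("1500-byte frames at 9.2 Gbps :", round(frames_per_second(9.2e9, 1500)))
    print("9000-byte frames at 9.95 Gbps:", round(frames_per_second(9.95e9, 9000)))
```

Under these assumptions, standard frames mean over five times more frames per second than jumbo frames, yet the `in` column above stays around 26000 per second, far below either frame rate.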

Session setup/teardown rates

	      acl blacklist src 1.0.0.0/8
	      tcp-request content reject if blacklist
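The `blacklist` ACL above matches on the source address only. As a small illustration of which addresses fall in `1.0.0.0/8`, here is a sketch using Python's `ipaddress` module. It only mimics the match, not HAProxy's implementation, and the function name is made up. The lab addresses seen earlier (such as `1.0.0.2`) are inside this range, so presumably every test connection gets rejected right after being accepted.

```python
import ipaddress

# Same network as in the "acl blacklist src 1.0.0.0/8" line above.
BLACKLIST = ipaddress.ip_network("1.0.0.0/8")


def is_blacklisted(src):
    """True if the source address belongs to the blacklisted network."""
    return ipaddress.ip_address(src) in BLACKLIST


if __name__ == "__main__":
    for addr in ("1.0.0.1", "1.255.255.254", "2.0.0.1"):
        print(addr, "rejected" if is_blacklisted(addr) else "accepted")
```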

The session rate has broken the symbolic 100k/s barrier with 105931 sessions per second.

HTTP session rate

Many other people are interested in the HTTP session rate on the client side. The difference with the previous test is that we have to completely parse the HTTP request before taking a decision. There are a number of circumstances where a client request may not be routed to a server after being parsed. This happens when the request is invalid, blocked by ACLs or redirected to an external server. So in order to make the test the most representative of real usage, the configuration has been set to perform a redirect when an ACL detects that the local servers are being shut down for maintenance :

	      acl service_down nbsrv -lt 2
	      redirect prefix http://backup.site if service_down
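The decision taken by these two lines can be modelled as follows. This is only a sketch of the logic, not HAProxy code: the function name is hypothetical, and it assumes the documented default behaviour of `redirect prefix`, which is a 302 response whose Location is the prefix followed by the original request URI.

```python
def handle_request(nbsrv, uri, prefix="http://backup.site"):
    """Return (status, location) when redirecting, or None to forward normally.

    `nbsrv` is the number of usable servers, as tested by "nbsrv -lt 2".
    """
    service_down = nbsrv < 2          # acl service_down nbsrv -lt 2
    if service_down:
        return 302, prefix + uri      # redirect prefix http://backup.site
    return None


if __name__ == "__main__":
    print(handle_request(1, "/index.html"))
    print(handle_request(2, "/index.html"))
```

The point of the test is that the request must be fully parsed before the redirect can be built, since the Location header depends on the requested URI.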

82702 HTTP requests per second

What's really interesting here is that the user CPU usage oscillates between 23 and 30%, meaning that the HTTP parser takes less than one third of a core to process 82700 requests/s. This extrapolates to a potential of about 250000-300000 HTTP requests per second being processed when keep-alive is supported. Of course this does not take into account the added work induced by network traffic.
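The extrapolation above amounts to dividing the measured request rate by the share of a core spent in user mode. The sketch below assumes perfectly linear scaling with user CPU alone, which is an idealisation; the function name is made up for the example.

```python
def extrapolate(requests_per_s, user_cpu_fraction):
    """Request rate if the parser could use a whole core (linear scaling)."""
    return requests_per_s / user_cpu_fraction


if __name__ == "__main__":
    measured = 82700
    for share in (1 / 3, 0.30):
        print(f"{share:.0%} user CPU -> {round(extrapolate(measured, share))} req/s")
```

With one third of a core, this gives roughly 248000 requests/s, and about 276000 at 30%, in line with the range quoted above. Lower user CPU shares would give higher figures, but the kernel-side network work noted in the text is not part of this model.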

Conclusion

Another nice feature set is haproxy's ability to act as an MTU converter for 10GbE networks. Alteon did that 10 years ago to permit 1500 and 9000 bytes MTU to coexist on gigabit networks. Now the same issue happens with 10GbE. Some servers will be set to use jumbo frames but the outer networks will not support them. Setting haproxy in full transparent mode between two interfaces should help a lot. Note that the same could be true for IPv6/IPv4 translation at full line rate.

- revisit last year's benchmarks

Contacts

Feel free to contact me for any questions or comments :

Main site : http://1wt.eu/
e-mail :
