Egg Timer

Functionality is easy, performance is hard, part two: Wait Time

Egg Timer

CPU performance; Wall time, CPU time, wait time

Last time we looked at how a typical system deals with purely CPU-bound applications and how the webserver, OS and CPU cores work together to produce the maximum throughput for your system. We started by looking at how we measure the ‘speed’ of your application by taking the ‘wall time’ (i.e. the time it takes to run your app as measured by your wall clock) and then dividing it into ‘CPU time’ (I.e the amount of time the CPU is actually using cycles on your app) and ‘wait time’ (the amount of time the CPU is waiting for something to happen). In this post we’ll look at the wait time of an application so if you havent already done so you should read the first post on cpu time.

What is wait time?

Simply put the CPU is waiting around for something to happen. That’s not to say you have a feral CPU, rather you have told it to wait for something. For example it could be waiting for:

  • A disk to read or write some data
  • An event to trigger (perhaps from the user)
  • Another process to return
  • A database query to respond
  • A response from a web-service or REST API

Probably the most common form of waiting in a web-app is when you issue a request across a network to retrieve some data, be that from a database or some sort of API. Once your app has established a connection to the remote server and initiated the request, the CPU doesnt really have to do anything; it’s just sat there waiting for the network to respond. As we will see, this kind of behaviour scales very differently to the CPU time we looked at last time.

Scaling Wait Time

To simulate waiting for a network resource we can create a simple script that ‘sleeps’:


This will make the CPU sleep for 200ms, that’s it! If you run this script using curl and time then you should see approximately 200ms reported:

rathers@geoff:~$ time curl

real    0m0.214s
user    0m0.009s
sys     0m0.003s

If you remember from last time we had a formula that could work out the max requests per second for our application based on response time and number of cores:

Max Req/s  =  No Cores /  Response Time

So again using my 4 core system we should expect to see 4 / 0.2 = 20 requests per second. Right? Well let’s find out! First let’s try it with a concurrency of 4, which gave us 20 req/s last time:

rathers@geoff:~$ ab -c 4 -n 100
Requests per second:    19.87 [#/sec] (mean)

So far so good! So last time even when we doubled the concurrency to 8 we still only managed 20 req/s. What about now?

rathers@geoff:~$ ab -c 8 -n 100
Requests per second:    39.72 [#/sec] (mean)

Much better! So how high will this go? Will req/s keep on doubling as I double the concurrency?

rathers@geoff:~$ ab -c 32 -n 1000
Requests per second:    158.22 [#/sec] (mean)
rathers@geoff:~$ ab -c 128 -n 10000
Requests per second:    627.36 [#/sec] (mean)
rathers@geoff:~$ ab -c 256 -n 10000
Requests per second:    1189.24 [#/sec] (mean)
rathers@geoff:~$ ab -c 512 -n 10000
Requests per second:    663.02 [#/sec] (mean)

Bascially, yes! We have linear scalability!

This graph shows nicely shows what is going on (click the graph to go to the raw data). The CPU is having no problems whatsoever initially; response time remains a constant 200ms and req/s scale up linearly. That is until we hit a concurrency of 256. At this point the req/s starts tailing off and response times starts increasing, both exponentially!



If you remember from last time, when we had reached our maximum throughput of 20 req/s our CPU cores were totally maxed out and we could see that in dstat. This time even at nearly 1200 req/s my CPU cores are still nowhere near maxed out as this dstat trace shows:

usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq|
 23   9  65   0   1   2: 22   9  65   0   1   3: 25   9  63   0   1   2: 25   8  64   0   1   2|
 19  12  64   0   2   3: 16  12  66   0   1   5: 20  11  65   0   1   3: 21  10  64   0   1   4|
 20  12  62   0   1   5: 26  12  57   0   1   4: 31  12  48   3   1   4: 24  12  60   0   1   3|
 18  22  55   0   2   3: 23  12  61   0   1   3: 21  11  61   0   2   5: 21  14  60   0   1   5|
 31  16  47   0   2   4: 29  11  55   0   1   4: 26  13  56   0   1   4: 27  13  54   0   2   4|
 21  13  59   0   2   5: 22  12  60   0   2   4: 20  15  59   0   1   5: 20  14  60   0   1   5|
 29  13  53   0   1   4: 29  13  53   0   1   4: 28  12  55   0   1   4: 24  11  60   0   1   4|
 35  11  49   0   2   4: 23  13  59   0   1   4: 31  11  53   0   1   4: 23  12  59   1   1   4|

All 4 cores are around 60% idle! Hopefully this illustrates how different CPU-time and wait-time are.

The Bottleneck

So what is so special about the number 256? Things were exactly as you might expect up to this number; the CPU was quite happy to sit and do 256 waits in parallel, waiting is afterall pretty easy to do! Even I could probably manage to do 256 lots of nothing in parallel! The answer is nothing really to do with the application, or the number of CPUs or RAM or anything like that; it’s simply an apache config value. The apache config on my system limits apache to 256 worker processes. This means as soon as you ask for more than 256 requests in parallel things start to go wrong as all the worker processes are busy and cannot accept the extra work.

Optimising Webserver Performance for Wait Time

As we’ve seen so far because our application is just waiting we are limited by the number of processes that the webserver will give us. This value will become the maximum number of requests that our application can handle concurrently.

So how do we scale further? Well it really depends on your webserver. In all the examples so far I have been using the Apache webserver. To scale our waiting app further we need to look into some of the internals of Apache and something that Apache calls the ‘Multi-Processing Module’ or MPM.

Apache MPMs

If you’re using Apache in a standard configuration, the likelihood is that you’re using the ‘prefork’ MPM already. This is a simple but robust MPM that uses a single master process that forks worker processes. Each worker process is single threaded.

The distinction between threads and processes is important here. A process is a stand alone task governed by the OS. It is given its own block of memory that only it can access and if it tries to access the memory belonging to another process the OS will shut it down. This is known as a segmentation fault (or segfault, illegal operation, crash etc). Many processes can run in parallel as the OS will manage swapping processes on and off the various CPU cores based on demand and relative process priorities. A thread is analogous to a process but it is a stream of operation within the process itself. The process is responsible for managing the threads within it. So the OS can manage multiple processes and a process can manage multiple threads within it.

So when we say prefork MPM produces single threaded processes this means that each process can only do one thing at once. Therefore to serve an HTTP request, a spare process must be available to accept and process the request. Initially apache will start with very few worker processes (5 on my system). When it detects that it is running out of processes it will spawn more. Because it is computationally expensive to create more processes apache limits the rate at which more are spawned. However the rate increases exponentially until the number of processes reaches a hard limit set by the MaxServers directive (256 on my system). You can see this happen quite clearly if you use the tools I gave you in my previous post. If you use apachebench to batter apache with loads of requests and monitor the number of processes running you will see it slowly creep up, then the rate of increase will keep rising until eventually the number of processes plateaus.

Tuning prefork MPM

For the most part apache will take care of itself. The only value you need to really worry about is the MaxClients directive which as we’ve seen caps the number of worker processes that can be spawned. Set this value too low and you wont be able to serve enough traffic, set it too high and you’ll run out of RAM. It also depends on your application and your machine as to how you tune this value. If you recall our CPU-limited example from last time you’ll remember that no matter how hard we tried we couldnt squeeze more than 20 req/s out of it. If this were all we had to serve there wouldnt be much point in setting MaxClients much beyond 20 as the CPU simply cant cope with more than 20 requests in parallel. Allowing it to go higher is only going to make things worse. Now consider our wait-limited example above. This is happily going up to 256 req/s and this is in fact limited by MaxClients itself. So it seems logical to allow MaxClients to go higher if we have enough RAM to do so.

Other Apache MPMs

The main MPMs besides prefork are called worker, event and winnt. As event is experimental and winnt is for use on Windows (and running servers on Windows is the reserve of the insane) we shall look briefly at worker MPM.

Worker MPM is essentially the same as prefork MPM in terms of how the apache processes are spawned and killed but each process can have multiple threads running within it. This allows you to have fewer processes running to achieve the same level of concurrency as with prefork. The problem with this approach though is that many of the scripting languages that can be embedded into apache (PHP in our case) aren’t guaranteed to be thread-safe and so to use worker MPM you need to interface with your scripting language over FastCGI or similar.

Worker MPM is not a panacea however, you will still have a maximum number of threads that can be in operation at anyone time as set by your apache config. For a wait-heavy application you will probably be able to squeeze more out of your server using worker than with prefork but you’ll still have a limit.

Event based servers

If your application does a lot of waiting then you might want to consider using an event driven platform such as node.js. The event-driven architecture takes a fundamentally different approach to dealing with having to wait for things to happen. Rather then make a whole process or thread block until something has happened you instead register event handlers. When something happens that you are interested in (e.g. your API query has got a response) then an event is raised and any registered event handlers are executed. This gives a totally non-linear, asynchronous flow and makes good use of system resources when dealing with lots of waiting around.

Real-world Applications

The examples I have used across this post and the last were deliberately simple and one-dimensional but real-world applications are much more complicated. That said they are essentially made of these two components; time you are hammering the CPU and time when you’re waiting around. Hopefully you now have an understanding of the effect these two things have on your server and how to recognise them.

Optimising Applications

I started my previous post with the hypothetical situation of having to support 1000 req/s for a given application. So if you benchmark your app and you’re only getting 500 req/s out of it, how do you squeeze the extra performance out? Well you first need to work out what your limits are. If at that 500 req/s limit your CPUs are maxed out then you’re probably limited by CPU time. If the CPUs aren’t maxed out but all your apache processes are busy then you are likely limited by wait time.

The details of specific optimisations are beyond the scope of this article but to reduce your CPU time you’re going to have to look at using less CPU cycles somehow. That maybe by streamlining your code, efficiency improvements, caching or maybe even compiling your PHP code. To reduce your wait time you’ll need to look at increasing the speed of whatever is holding you up. That could be a disk raid setup, database or query optimisation, faster network between you and an API server etc, or if you can you should look at caching the output of the slow parts so you dont have to wait for them all the time.

Premature Optimisation

It is important to know what the limits of your application are and the nature of them before you start to optimise it. Imagine your code does lots of disk writes and is not meeting your non-functional requirements for throughput in req/s. You might at first think you are CPU-time limited so you start stripping out for loops and using more efficient algorithms in your code. It wont make any difference and is arguably wasted effort. If you monitored your server while benchmarking you should have been able to see the massive amounts going onto the disk and you would have seen the write speed of the disk plateau as you push the load upwards. So what you should really be doing is looking at is making the disks go faster!


We’ve seen how CPUs and webservers respond to high load in terms of being asked to do a lot of stuff at once and also being told to wait around for other stuff to happen. We’ve looked at a few simple tools to monitor what your system is doing and how to optimise some aspects of the server for your situation. Next time you are asked to make something ‘faster’, ask yourself what that actually means in terms of your application and be sure to establish exactly what the limiting factors are before you try and optimise around them.

Egg Timer

Functionality is Easy, Performance is Hard, Part one: CPU Time

Egg Timer

Make It Work

Software developers are really good at getting stuff working, they really are! Even when faced with absurd deadlines, scatter-brained product owners and conflicting demands. Sure some people need a little guidance here and there and some may produce more bugs than others but on the whole functionality is delivered without too much fuss. A lot of the applications that we build are relatively simple in terms of functionality so I have confidence in most developers to be able to get stuff working to meet the functional requirements of a project. A working system, fully tested, fully signed off, perfect, job done. Oh what’s that? It needs to serve 1000 pages per second? Errr, help! This sort of non-functional requirement often gets overlooked until the last minute and dealing with it, in my experience, involves a lot more fuss than the functional stuff, especially when left this late in the day.

Make it scale

In the web development industry it’s easy to see how people are more adept at making things work than making them scale. Perhaps you start out doing front-end stuff, then move onto some backend programming using PHP or Python, maybe moving onto Java or C#. Maybe you start by building sites for your mates then get a job writing business process systems. It is probably fairly rare that you get asked to support massive volumes of traffic, to query huge volumes of data, to reduce page response times or to reduce server resource utilisation. ‘The Server’ may well just be a black box that just copes with everything magically.

Rather than sit here whining I’ve decided to explain some of the core concepts behind performance-tuning of web applications. First off; how we make the most of the CPU.

Performance Tuning CPU Utilisation in Web Applications

The CPU is the heart of our server and for many applications it is likely to be the first bottleneck that you hit. First let’s examine how we measure the CPU performance of our application

Wall Time

Simply put this is the time it takes to render a particular page of your application, as you might measure it with your wall clock. You can easily get a rough idea of this by doing something like this:

time curl http://localhost/myapp.php

(I’m assuming many things here, like you’re using Linux, PHP, you’re hosting your app locally etc)

There are two main components to the wall time that we’ll look at. ‘CPU time’ is how much time your process was actually swapped onto the CPU consuming it’s cycles. ‘I/O time’ or ‘wait time’ is the amount of time that your process was basically waiting for something to happen, be that waiting for a response from a database or API, or waiting for local resources to become available (e.g. disk, CPU etc).

Performance tuning CPU time

An application that uses pure CPU is fairly rare but a very simplistic app would be something that just counts up to a very big number:

for ($i=0; $i<1000000000; $i++) {
//do nothing

If you run this code you will see your CPU ‘spike’ to 100% (or near as damn-it). If you have a multi-core system you will notice it only spikes one core (more on that later). Let’s turn the big number down a bit to simulate a realistic application. If you change it to 4 million then on the system I’m currently using this takes about 200ms of wall time. Because this is basically a pure CPU application we can assume that the CPU time is essentially the same as the wall time (not strictly true but true enough for this illustration), so we’re consuming 200ms of CPU time to render this page. That’s still fairly quick right? A user probably wouldn’t notice 200ms of delay, with all the network latency and asset loading of the web page to do, this will pale into insignificance right? Plus the page isn’t doing much so we’ll be able to serve loads of them right? Nope.

Think that we’re using the entire CPU for 200ms. Nothing else can realistically use that CPU core for a period of 200ms. So how many of these things can we actually serve up? Well doing some simple maths if we continually hit this page, serially requesting a new page as soon as one has been served up then we’ll be constantly eating up the CPU time of a single core. With purely serial requests like this against a single core you’ll be able to get up to 5 requests per second. Simple really it’s just 1/0.2

5 requests per second ain’t so great. That’s basically saying that some guy could melt your server by just pressing F5 as fast as he can. As soon as your server is trying to serve more than 5 req/s it is going to be in trouble if that is sustained over a long period.

So how do we scale it? Well the obvious thing is to get a faster CPU. If I double my CPU speed then my CPU time is going to come down to 100ms and I’ll be able to serve 10 req/s. Awesome! But there is a limit to how high you can go with this kind of ‘vertical’ scaling. More prudent would be to add more cores. If I add a second core then the operating system will be able to run two of these page requests in parallel at the same speed as before. My CPU time will remain at 200ms but I’ll be able to do 2 at a time so I’ll be able to get to 10 req/s just as with the faster CPU scenario. This is horizontal scaling and it is much more sustainable. Push it up to 8 cores and I’ve got 40 req/s. 32 cores (not unrealistic for some servers!) and I’ve got 160 req/s. We’re flying now! For a purely CPU-bound application the following formula quite neatly sums this up:

Max Req/s  =  Number of Cores /  Response Time

Monitoring CPU time

When you have multiple cores and multiple parallel requests attacking them it can get quite complex to work out what is going on from the point of view of the user agent, you really need to get inside the server and have a look (from an OS perspective that is!). To illustrate the scaling I’ve just described working in practice there are some handy tools we can use. For flinging lots of requests at our page we can use apachebench. To monitor CPU utilisation you can use any of the standard linux tools such as vmstat, top, iostat, but I personally recommend dstat. These both work fantastically on Linux. If you’re using Mac or Windows you’re on your own I’m afraid; you should have installed a proper OS! 😛

If you run dstat -f you’ll see loads of numbers printing down the screen (make sure your terminal is wide enough to fit it all in). Mine looks like this:

-------cpu0-usage--------------cpu1-usage--------------cpu2-usage--------------cpu3-usage------ --dsk/sda-----dsk/sdb-- --net/eth0- ---paging-- ---system--
usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq| read  writ: read  writ| recv  send|  in   out | int   csw 
  7   3  87   4   0   0:  7   2  90   1   0   0:  6   1  92   1   0   0:  6   2  90   2   0   0| 416k   40k:  36k   28B|   0     0 |   0     0 |1529  2318 
  6   0  93   0   1   0:  5   1  93   0   0   1:  4   2  93   0   1   0:  5   1  94   0   0   0|   0     0 :   0     0 |   0     0 |   0     0 |1143  1887 
  7   3  90   0   0   0:  4   1  95   0   0   0:  6   1  93   0   0   0:  7   2  91   0   0   0|   0     0 :   0     0 |   0     0 |   0     0 |1224  1920 
  8  13  79   0   1   0:  5   2  92   0   1   0:  7   1  92   0   0   0:  3   0  97   0   0   0|   0     0 :   0     0 |   0     0 |   0     0 |1454  2230 
 10   3  87   0   1   0:  5   1  94   0   0   0:  5   1  94   0   0   0:  4   1  95   0   0   0|   0     0 : 132k    0 | 196B   98B|   0     0 |1395  2289

There’s a lot to take in there but we’re really only interested in a few of the columns. Dstat is printing a line every second with loads of info about what is going on in your computer. Stuff like CPU usage, disk and network activity. If you look at the first row you’ll see the top level headings and you’ll see one for each CPU core you have (you can see that I have 4 cores). Each core then lists 6 different columns, most important of which is ‘idl’ which is a percentage measure of how much time the CPU is idling. If you run the cpu-spiking loop we made before but with the number shoved back up to a billion or so you’ll be able to see the spike in dstat:

usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq
  5   1  94   0   0   0:  9   1  90   0   0   0:  3   1  96   0   0   0
  4   2  93   0   1   0: 18   0  82   0   0   0:  6   0  93   0   1   0
  6   1  93   0   0   0:100   0   0   0   0   0:  5   2  93   0   0   0
 10   0  90   0   0   0:100   0   0   0   0   0: 10   0  90   0   0   0
  6   1  92   0   1   0: 99   0   0   0   1   0:  6   1  93   0   0   0
  7   1  92   0   0   0:100   0   0   0   0   0:  3   1  95   0   0   1
  6   1  92   0   1   0: 67   1  32   0   0   0:  4   1  95   0   0   0
  2   3  95   0   0   0:  3   1  96   0   0   0:  4   0  95   0   1   0
  5   9  85   0   0   1:  7   0  93   0   0   0:  3   1  96   0   0   0

Watch cpu1 as the ‘idl’ value drops to 0 and the ‘usr’ value goes up to 100. This is the process ‘spiking’ that CPU. Notice how the other CPUs aren’t bothered by this at all; they sit there mostly idling. If you run the big loop script in parallel in two different terminals you’ll see that it will spike two CPUs, 3 loop processes will max out 3 CPUs etc.

If you stick the loop script under a webserver such as apache we can now watch the scaling example I presented in practice. Based on our previous numbers we should expect to be able to serve 5 requests per second per core with our ~200ms script. On a 4 core box like mine we’d expect to see four times that number, i.e. about 20 req/s. So let’s see! Apachebench is a simple application that allows you to fire requests at a URL and control how many requests to complete and also how many to send in parallel. Let’s try only 1 at a time and 100 requests:

rathers@geoff:~$ ab -n 100                                                                                                             
This is ApacheBench, Version 2.3                                                                                                               
Copyright 1996 Adam Twiss, Zeus Technology Ltd,                                                                                            
Licensed to The Apache Software Foundation,                                                                                                  
Benchmarking (be patient).....done

Server Software:        Apache/2.2.22
Server Hostname:
Server Port:            80

Document Path:          /cpu.php
Document Length:        0 bytes

Concurrency Level:      1
Time taken for tests:   19.835 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      20200 bytes
HTML transferred:       0 bytes
Requests per second:    5.04 [#/sec] (mean)
Time per request:       198.353 [ms] (mean)
Time per request:       198.353 [ms] (mean, across all concurrent requests)
Transfer rate:          0.99 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   191  198   4.0    197     211
Waiting:      191  198   4.0    197     211
Total:        191  198   4.0    197     211

Percentage of the requests served within a certain time (ms)
  50%    197
  66%    197
  75%    198
  80%    201
  90%    205
  95%    206
  98%    211
  99%    211
 100%    211 (longest request)

A lot of output, but the figure we’re most interested in at the moment is “Requests per second”. This is at 5.04 req/s which is quite a bit lower than our estimate of 20! This is because we’re only running one request in parallel. Apachebench is firing off a request and waiting for that one to complete before it sends another. We are only allowing the server to process 1 loop at a time so 5 req/s is our theoretical max. We still have lots of headroom with our additional cores though so we should be able to get much higher. By doubling the concurrency we should get to 10 req/s:

rathers@geoff:~$ ab -c 2 -n 100

Requests per second:    10.09 [#/sec] (mean)

Nice. Let’s keep going. Concurrency of 3:

rathers@geoff:~$ ab -c 3 -n 100

Requests per second:    15.08 [#/sec] (mean)

Ok, what about concurrency of 4? This should give us 20 in theory, which also should be our theoretical max:

rathers@geoff:~$ ab -c 4 -n 100

Requests per second:    20.23 [#/sec] (mean)

Pretty close! So what happens if we push it too far? What if we try 8 concurrency?

rathers@geoff:~$ ab -c 8 -n 100

Requests per second:    20.33 [#/sec] (mean)

It’s basically the same! How can this be so? Well if you look at some of the other stats you’ll see why. If you look at the “Time per request” metric you’ll see that the mean was 200ms when concurrency was 4, but when concurrency was 8 the time per request was 400ms; double the amount! So we see that the CPU can’t churn out any more than 20 pages per second but if you ask it to do more in parallel it will just take longer to complete the requests. The OS is using time multiplexing to share 4 cores out amongst 8 processes. Each process only gets the CPU for half the time so it takes twice as long to complete its task.

Will this keep going forever? What if we hit it with a concurrency of 10 times our sweet spot? 100 times? Well you’ll not get above 20 req/s but the average request time will just keep going up and up. Eventually you’ll start hitting bottlenecks in other areas and the OS will be spending an increasing amount of resources swapping processes on and off the CPU cores. This means it becomes increasingly inefficient. So you’ll see the number of req/s actually start to decrease as you push the concurrency higher.

Here is a graph of actual results gathered on my box using our example script:


As you can see we hit the 20 req/s plateau very quickly and the throughput remains pretty much constant no matter what concurrency we throw at it, although you can start to see it tailing off slightly at higher concurrencies. The response time on the other hand starts off on the 200ms plateau but very soon starts rising linearly. The full results are publicly available on google docs

Dstat and apachebench Together

If you monitored dstat while doing the above tests you’ll see that the single concurrency example doesn’t just use one core, and the 2 concurrency example uses more than 2 cores. This is because the OS is choosing a different CPU core to process your request each time so on average with the single concurrency example you should see 25% utilisation across four cores rather than 100% on one and 0% on the others.

Monitoring Apache worker processes

When we were hitting the server with a concurrency greater than 1 the OS was spreading the load across multiple processes across multiple cores. This is because apache is (in some configurations anyway) set to spawn multiple webserver processes. Each process is effectively independent of the others and your OS can move them around CPU cores as it pleases. Because they are separate processes as far as the OS is concerned you can quite easily ask the OS to tell you how many apache processes are running. Run this command to see all your apache processes (you might need to grep for ‘httpd’ rather than apache on some systems):

rathers@geoff:~$ ps aux |grep apache
root      1369  0.0  0.2 127408  7964 ?        Ss   11:36   0:00 /usr/sbin/apache2 -k start
www-data  1391  0.0  0.1 127432  5012 ?        S    11:36   0:00 /usr/sbin/apache2 -k start
www-data  1392  0.0  0.1 127432  5012 ?        S    11:36   0:00 /usr/sbin/apache2 -k start
www-data  1393  0.0  0.1 127432  5012 ?        S    11:36   0:00 /usr/sbin/apache2 -k start
www-data  1394  0.0  0.1 127432  5012 ?        S    11:36   0:00 /usr/sbin/apache2 -k start
www-data  1395  0.0  0.1 127432  5012 ?        S    11:36   0:00 /usr/sbin/apache2 -k start
rathers   3191  0.0  0.0  28076   844 pts/1    S+   12:26   0:00 grep --color=auto apache

This should show you the grep command itself, the master apache process (running as root) and the worker processes (running as a low-privilege user). On my system I have 5 worker processes. So to monitor how many apache worker processes are running at any one time you can run this nifty little command:

watch -n 1 "ps aux |grep apache |wc -l"

This actually reports an artificially high number for reasons I don’t fully understand but it gives you a good tool to be able to monitor the increase and decrease in the number of apache processes. If you set this running and then run one of the high concurrency apachebench tests (concurrency of 20 say) then you’ll see the number or processes steadily rise until reaching a plateau. Once apachebench has finished its test then you’ll see the number slowly decrease back down to a low plateau. I’ll look at this in more detail later but this should give you an idea of the how apache is spawning more processes to deal with the concurrency (the number of worker processes should roughly match the concurrency) and how the OS is distributing the processes around your CPU cores.


We’ve seen how to spike a single CPU core and then watch how the OS starts to spread the work amongst less busy cores when further CPU intensive processes are started. We’ve seen the effect of pure CPU time on the requests per second that an app can achieve and it’s relationship to response time. We then experimented with scaling our CPUs vertically and horizontally. Finally we looked briefly at how apache and the OS work together to spawn worker processes and spread them out across available CPUs in order to make the most of your system’s resources.

In my next post we’ll look at the performance implication of wait time which is a totally different beast to CPU time.