Tag: benchmark

  • Jetty 12 performance figures

    TL;DR

    This is a quick blog to share the performance figures of Jetty 12 and to compare them to Jetty 11, updated for the release of 12.0.2. The outcome of our benchmarks is that Jetty 12 with EE10 Servlet-6.0 is 5% faster than Jetty 11 with EE9 Servlet-5.0, and the Jetty-12 Core API is 22% faster.

    These percentages are calculated from the P99 integrals, which is the total latency excluding the 1% slowest requests in each of the 180 sample periods. The maximum latency of the 1% slowest requests in each sample shows a modest improvement for Jetty-12 servlets and a significant improvement for the core API.

    Introduction

    We use one server (powered by a 16-core Intel Core i9-7960X) connected with a 10GbE link to a dedicated switch, to which four clients are each connected with a single 1GbE link. The server runs a single instance of the Jetty Server with the default of 200 threads and a 32 GB heap managed by ZGC, and each client runs a single instance of the Jetty Load Generator with an 8 GB heap, also managed by ZGC. This is all automated, and the source of this benchmark is public too.

    Each client applies a load of 60,000 requests per second, for a total of 240,000 requests per second handled by the single server. We make sure there is no saturation anywhere: CPU consumption, network links, RAM, JVM heap and HTTP response statuses are all monitored to ensure this is the case. We also make sure no errors are reported in the HTTP response statuses or on the network interfaces.

    We collect all processing times of the Jetty Server (i.e. the time it takes from reading the first byte of the request from the socket to writing the last byte of the response to the socket) in histograms, from which we then graph the minimum, P99 and maximum latencies, and calculate integrals (i.e. the sum of all measurements) for the latter two.
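
    The “integral” calculation can be illustrated with a short sketch. This is only an illustration of the arithmetic, not the benchmark’s actual code, and it assumes the HdrHistogram library for the histograms:

    import org.HdrHistogram.Histogram;
    import org.HdrHistogram.HistogramIterationValue;

    public class P99Integral
    {
        public static void main(String[] args)
        {
            Histogram histogram = new Histogram(3); // 3 significant decimal digits
            // Record processing times in microseconds (these values are made up).
            for (long latencyUs : new long[]{7, 12, 34, 35, 4500})
                histogram.recordValue(latencyUs);

            long p99 = histogram.getValueAtPercentile(99.0);
            long p99Integral = 0;
            // Sum all measurements, excluding those above the P99 value.
            for (HistogramIterationValue v : histogram.recordedValues())
            {
                if (v.getValueIteratedTo() <= p99)
                    p99Integral += v.getValueIteratedTo() * v.getCountAtValueIteratedTo();
            }
            System.out.printf("min=%d p99=%d max=%d p99Integral=%d%n",
                histogram.getMinValue(), p99, histogram.getMaxValue(), p99Integral);
        }
    }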

    Jetty 12

    In the following three benchmarks, the CPU consumption of the server machine stayed around 40%.

    Below are plots of the maximum (graph in yellow) and minimum-to-P99 (graph in red) processing times for Jetty 12:

    ee10

    The yellow graph shows a single early peak at 4500 µs, then most peaks below 1000 µs, with an average sample peak of 434 µs.

    The red graph shows a minimum processing time of 7 µs and a P99 processing time of 34.8 µs.

    core

    The yellow graph shows a few peaks at around 2000 µs, with the average sample peak being 335 µs.

    The red graph shows a minimum processing time of 6 µs and a P99 processing time of 30.0 µs.

    Comparing to Jetty 11

    Comparing apples to apples required us to slightly modify our benchmark.

    In the following benchmark, the CPU consumption of the server machine stayed around 40%, like with Jetty 12.

    Below are the same plots of the maximum (graph in yellow) and minimum-to-P99 (graph in red) processing times for Jetty 11:

    The yellow graph shows a few peaks at around 4000 µs, with an average sample peak of 487 µs.

    The red graph shows a minimum processing time of 8 µs and a P99 processing time of 36.6 µs.

  • Introducing Jetty Load Generator

    The Jetty Project just released the Jetty Load Generator, a Java 11+ library to load-test any HTTP server, which supports both HTTP/1.1 and HTTP/2.
    The project was born in 2016, with specific requirements. At the time, very few load-test tools had support for HTTP/2, but Jetty’s HttpClient did. Furthermore, few tools supported web-page-like resources, which were important to model in order to compare the multiplexed HTTP/2 behavior (up to ~100 concurrent HTTP/2 streams on a single connection) against the HTTP/1.1 behavior (6-8 connections). Lastly, we were more interested in measuring quality of service rather than throughput.
    The Jetty Load Generator generates requests asynchronously, at a specified rate, independently from the responses. This is the Jetty Load Generator core design principle: we wanted the request generation to be constant, and measure response times independently from the request generation. In this way, the Jetty Load Generator can impose a specific load on the server, independently of the network round-trip and independently of the server-side processing time. Adding more load generators (on the same machine if it has spare capacity, or using additional machines) will allow the load against the server to increase linearly.
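    To make the principle concrete, here is a toy illustration of rate-driven load. This is not how the Jetty Load Generator is implemented, and the URL is a placeholder: requests are fired on a fixed schedule, and latencies are recorded as responses arrive, so the sending rate never depends on how quickly the server answers.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ConstantRateSketch
    {
        public static void main(String[] args)
        {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/")).build();
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

            int rate = 100; // requests per second, independent of response times
            scheduler.scheduleAtFixedRate(() ->
            {
                long sent = System.nanoTime();
                client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                    .thenAccept(response ->
                    {
                        long latencyUs = (System.nanoTime() - sent) / 1_000;
                        System.out.println("status=" + response.statusCode() + " latency=" + latencyUs + " µs");
                    });
            }, 0, 1_000_000 / rate, TimeUnit.MICROSECONDS);
        }
    }
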
    Using this core principle, you can set up the load testing by having N load generator loaders that impose the load on the server, and 1 load generator probe that imposes a very light load and measures response times.
    For example, you can have 4 loaders that impose 20 requests/s each, for a total of 80 requests/s seen by the server. With this load on the server, what would be the experience, in terms of response times, of additional users that make requests to the server? This is exactly what the probe measures.
    If the load on the server is increased to 160 requests/s, what would the probe experience? The same response times? Worse? And what are the probe response times if the load on the server is increased to 240 requests/s?
    Rather than trying to measure some form of throughput (“what is the max number of requests/s the server can sustain?”), the Jetty Load Generator measures the quality of service seen by the probe, as the load on the server increases. This is, in practice, what matters most for HTTP servers: knowing that, when your server has a load of 1024 requests/s, an additional user can still see response times that are acceptable. And knowing how the quality of service changes as the load increases.
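    For example, such a setup could be driven with the command-line tool introduced below; the flags are the same ones shown later in this post, and the rates are illustrative, assuming a single-resource tree so that the resource rate roughly equals the request rate:

    # on each of the 4 loader machines: impose ~20 requests/s
    $ java -jar jetty-load-generator-starter-1.0.2-uber.jar --scheme https --host your_server --port 443 --resource-rate 20 --iterations 1200

    # on the probe machine: impose a very light load and record its response times
    $ java -jar jetty-load-generator-starter-1.0.2-uber.jar --scheme https --host your_server --port 443 --resource-rate 1 --iterations 60 --display-stats
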
    The Jetty Load Generator builds on top of Jetty’s HttpClient features, and offers:

    • A builder-style Java API, to embed the load generator into your own code and have full access to all events emitted by the load generator (a minimal sketch is shown below).
    • A command-line tool, similar to Apache’s ab or wrk2, with histogram reporting, for ease of use, scripting, and integration with CI servers.
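
    For the embedded case, a minimal sketch of the builder-style API might look like the following; the method names (for example resourceRate) are recalled from the project README and should be verified against the current API documentation:

    import org.mortbay.jetty.load.generator.LoadGenerator;
    import org.mortbay.jetty.load.generator.Resource;

    public class EmbeddedLoadTest
    {
        public static void main(String[] args) throws Exception
        {
            LoadGenerator generator = LoadGenerator.builder()
                .scheme("https")
                .host("your_server")
                .port(443)
                .resource(new Resource("/")) // the resource (tree) to request
                .resourceRate(1)             // resource trees per second
                .build();

            // begin() starts the load and returns a CompletableFuture that
            // completes when the configured load has been fully generated.
            generator.begin().join();
        }
    }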

    Download the latest command-line tool uber-jar from: https://repo1.maven.org/maven2/org/mortbay/jetty/loadgenerator/jetty-load-generator-starter/

    $ cd /tmp
    $ curl -O https://repo1.maven.org/maven2/org/mortbay/jetty/loadgenerator/jetty-load-generator-starter/1.0.2/jetty-load-generator-starter-1.0.2-uber.jar
    

    Use the --help option to display the available command line options:

    $ java -jar jetty-load-generator-starter-1.0.2-uber.jar --help
    

    Then run it, for example:

    $ java -jar jetty-load-generator-starter-1.0.2-uber.jar --scheme https --host your_server --port 443 --resource-rate 1 --iterations 60 --display-stats
    

    You will obtain an output similar to the following:

    ----------------------------------------------------
    -------------  Load Generator Report  --------------
    ----------------------------------------------------
    https://your_server:443 over http/1.1
    resource tree     : 1 resource(s)
    begin date time   : 2021-02-02 15:38:39 CET
    complete date time: 2021-02-02 15:39:39 CET
    recording time    : 59.657 s
    average cpu load  : 3.034/1200
    histogram:
    @                     _  37 ms (0, 0.00%)
    @                     _  75 ms (0, 0.00%)
    @                     _  113 ms (0, 0.00%)
    @                     _  150 ms (0, 0.00%)
    @                     _  188 ms (0, 0.00%)
    @                     _  226 ms (0, 0.00%)
    @                     _  263 ms (0, 0.00%)
    @                     _  301 ms (0, 0.00%)
                       @  _  339 ms (46, 76.67%) ^50%
       @                  _  376 ms (7, 11.67%) ^85%
      @                   _  414 ms (5, 8.33%) ^95%
    @                     _  452 ms (1, 1.67%)
    @                     _  489 ms (0, 0.00%)
    @                     _  527 ms (0, 0.00%)
    @                     _  565 ms (0, 0.00%)
    @                     _  602 ms (0, 0.00%)
    @                     _  640 ms (0, 0.00%)
    @                     _  678 ms (0, 0.00%)
    @                     _  715 ms (0, 0.00%)
    @                     _  753 ms (1, 1.67%) ^99% ^99.9%
    response times: 60 samples | min/avg/50th%/99th%/max = 303/335/318/753/753 ms
    request rate (requests/s)  : 1.011
    send rate (bytes/s)        : 189.916
    response rate (responses/s): 1.006
    receive rate (bytes/s)     : 41245.797
    failures          : 0
    response 1xx group: 0
    response 2xx group: 60
    response 3xx group: 0
    response 4xx group: 0
    response 5xx group: 0
    ----------------------------------------------------
    

    Use the Jetty Load Generator for your load testing, and report comments and issues at https://github.com/jetty-project/jetty-load-generator. Enjoy!

  • Do Loom’s Claims Stack Up? Part 2: Thread Pools?

    “Project Loom aims to drastically reduce the effort of writing, maintaining, and observing high-throughput concurrent applications that make the best use of available hardware. … The problem is that the thread, the software unit of concurrency, cannot match the scale of the application domain’s natural units of concurrency — a session, an HTTP request, or a single database operation. …  Whereas the OS can support up to a few thousand active threads, the Java runtime can support millions of virtual threads. Every unit of concurrency in the application domain can be represented by its own thread, making programming concurrent applications easier. Forget about thread-pools, just spawn a new thread, one per task.” – Ron Pressler, State of Loom, May 2020

    In this series of blogs, we are examining the new Loom virtual thread features now available in OpenJDK 16 early access releases. In part 1 we saw that Loom’s claim of 1,000,000 virtual threads was true, but perhaps a little misleading, as that only applies to threads with near-empty stacks.  If threads actually have deep stacks, then the achievable number of virtual threads is bound by memory and is back to the same order of magnitude as kernel threads.  In this part, we will further examine the claims and ramifications of Project Loom, specifically whether we can now forget about thread pools. Spoiler: cheap threads can do expensive things!
    All the code from this blog is available in our loom-trial project and has been run on my dev machine (Intel® Core™ i7-6820HK CPU @ 2.70GHz × 8, 32GB memory,  Ubuntu 20.04.1 LTS 64-bit, OpenJDK Runtime Environment (build 16-loom+9-316)) with no specific tuning and default settings unless noted otherwise. 

    Matching the scale?

    Project Loom makes the claim that applications need threads because kernel threads “cannot match the scale of the application domain’s natural units of concurrency”!
    Really???  We’ve seen that, without tuning, we can achieve 32k of either type of thread on my laptop.  We think it would be fair to assume that with careful tuning this could be stretched beyond 100k for either technology.  Is this really below the natural scale of most applications?  How many applications have a natural scale of more than 32k simultaneous parallel tasks?  Don’t get me wrong, there are many apps that do exceed those scales, and Jetty has users that put an extra 0 on that, but they are the minority, and in reality very few applications are ever going to see that demand for concurrency.
    So if the vast majority of applications would be covered by blocking code with a concurrency of 32k, then what’s the big deal? Why do those apps need Loom? Or, by the same argument, why would they need to be written in an asynchronous style?
    The answer is that you rarely see any application deployed with 10,000s of threads; instead, threads are limited by a thread pool, typically to 100s or 1000s of threads.  The default thread pool size in Jetty is 200, which we sometimes see increased to the 1000s, but we have never seen a 32k thread pool, even though my un-tuned laptop could supposedly support it!
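    For reference, the 200-thread default mentioned above is just the maximum size of Jetty’s QueuedThreadPool. A minimal embedded-Jetty sketch shows where that number is configured:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.util.thread.QueuedThreadPool;

    public class PoolSizedServer
    {
        public static void main(String[] args) throws Exception
        {
            QueuedThreadPool threadPool = new QueuedThreadPool(200); // maximum threads, the default
            Server server = new Server(threadPool);
            // ... add connectors and handlers as usual ...
            server.start();
            server.join();
        }
    }
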
    So what’s going on? Why are thread pools typically so limited and what about the claim that Loom means we can “Forget about thread pools”?

    Why Thread Pools?

    One reason we are told that thread pools are used is that kernel threads are slow to start, thus having a bunch of them pre-started and waiting for a task in a pool improves latency.  Loom claims its virtual threads are much faster to start, so let’s test that with StartThreads, which reports:

    kStart(ns) ave:137,903 from:1,000 min:47,466 max:6,048,540
    vStart(ns) ave: 10,881 from:1,000 min: 4,648 max:  486,078

    So that claim checks out: virtual threads start an order of magnitude faster than kernel threads.  If start time were the only reason for thread pools, then Loom’s claim of forgetting about thread pools would hold.
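    A rough sketch of such a measurement (not the actual StartThreads code) simply times how long start() takes to return for each kind of thread, which is a simplification of what StartThreads reports, and uses the same Loom EA builder API that appears later in this post:

    public class StartSketch
    {
        public static void main(String[] args) throws Exception
        {
            int count = 1_000;
            long kernelNanos = 0;
            long virtualNanos = 0;
            for (int i = 0; i < count; i++)
            {
                long t0 = System.nanoTime();
                Thread kernel = new Thread(() -> {});
                kernel.start();
                kernelNanos += System.nanoTime() - t0;
                kernel.join();

                long t1 = System.nanoTime();
                // Loom EA API (build 16-loom); later JDKs use Thread.ofVirtual() instead.
                Thread.builder().virtual().task(() -> {}).start();
                virtualNanos += System.nanoTime() - t1;
            }
            System.out.printf("kStart(ns) ave:%,d%n", kernelNanos / count);
            System.out.printf("vStart(ns) ave:%,d%n", virtualNanos / count);
        }
    }
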
    Start time may explain why we have thread pools, but it doesn’t explain why thread pools are frequently sized far below the system’s capacity for threads: 100s instead of 10,000s.  What is the reason that thread pools are sized as they are?

    Why Small Thread Pools?

    Giving a thread a task to do is a resource commitment: it says that a flow of control may proceed to consume CPU, memory and other resources that will be needed to run to completion, or at least until a blocking point where it can wait for those resources.  Most of those resources are not on the stack, thus limiting the number of available threads is a way to limit a wide range of resource consumption and to give quality of service:

    • If your back-end services can only handle 100s of simultaneous requests, then a thread pool with 100s of threads will avoid swamping them with too much load. If your JDBC driver only has 100 pooled connections, then 1,000,000 threads hammering on those connections or other locks are going to have a lot of contention.
    • For many applications a late response is a wrong response, thus it may well be better to handle 1000 tasks in a timely way with the 1001st task delayed, rather than to try to run all 1001 tasks together and have them all risk being late.
    • Graceful degradation under excess load.  Processing a task will need to use heap memory, and if too much memory is demanded, an OutOfMemoryError is fatal for any Java application.  Limiting the number of threads is a coarse-grained way of limiting a class of heap usage.  Indeed, in part 1 we saw that it was heap memory that limited the number of virtual threads.

    Having a limited thread pool allows an application to be tested to that limit, so that it can be proved that the application has the memory and other resources necessary to service all of those threads.  Traditional thinking has been that if the configured number of threads is insufficient for the load presented, then either the excess load must wait, or the application should start using asynchronous techniques to use those threads more efficiently (rather than increase the number of threads beyond the resource capacity of the machine).
    A limited thread pool is a coarse-grained limit on all resources, not only threads: limiting the number of threads puts a limit on concurrent lock contention, memory consumption and CPU usage.

    Virtual Threads vs Thread Pool

    Having established that there might be some good reasons to use thread pools, let’s see if Loom gives us any good reason not to use them.  We have created a FakeDataBase class, which simulates a JDBC connection pool of 100 connections with a semaphore, and then in ManyTasks we run 100,000 tasks that each do 2 selects and 1 insert against the database, with a small amount of CPU consumed both with and without the semaphore acquired.  The core of the thread pool test is:

     for (int i = 0; i < tasks; i++)
       pool.execute(newTask(latch));

    and this is compared against the Loom virtual thread code of:

     for (int i = 0 ; i < tasks; i++)
       Thread.builder().virtual().task(newTask(latch)).start();

    And the results are…. drum roll… pretty much the same for both types of thread:

    Pooled  K Threads 33,729ms
    Spawned V Threads 34,482ms

    The pooled kernel threads do appear to be consistently a little bit better, but this test is not that rigorous, so let’s call it the same, which is kind of expected as the total duration is primarily constrained by the concurrency of the database.
    So was there any difference at all?  Here is the system monitor graph during both runs: kernel threads with a pool are the first period on the left (60-30 s), and then virtual threads follow after a change-over peak (30 s - 0 s):

    Kernel threads with a thread pool do not stress the CPU at all, but virtual threads alone use almost twice as much CPU! There is also a hint of more memory being used.
    The thread pool run has 100k tasks in the thread pool queue and 100 kernel threads that take tasks, 100 at a time, and each task takes one of the 100 semaphore permits 3 times, with little or no contention.
    The Loom approach has 100k independent virtual threads that each contend 3 times for the 100 semaphore permits, with up to 99,900 threads needing to be added and then removed 3 times from the semaphore’s wake-up queue.  The extra queuing for virtual threads could easily explain the excess CPU needed, but more investigation is needed to be definitive.
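    A hedged sketch of the kind of simulation described above might look like the following; the actual FakeDataBase and ManyTasks code in the loom-trial project may differ in detail:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.ThreadLocalRandom;

    public class FakeDataBase
    {
        private final Semaphore connections = new Semaphore(100); // simulated JDBC connection pool

        public void query() throws InterruptedException
        {
            connections.acquire();   // take a "connection"
            try
            {
                spinCpu(10_000);     // small amount of CPU with the permit held
            }
            finally
            {
                connections.release();
            }
            spinCpu(10_000);         // small amount of CPU without the permit
        }

        // Each task does 2 selects and 1 insert, i.e. three acquisitions of a permit.
        public static Runnable newTask(FakeDataBase db, CountDownLatch latch)
        {
            return () ->
            {
                try
                {
                    db.query();
                    db.query();
                    db.query();
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
                finally
                {
                    latch.countDown();
                }
            };
        }

        private static void spinCpu(int iterations)
        {
            long sink = 0;
            for (int i = 0; i < iterations; i++)
                sink += ThreadLocalRandom.current().nextLong();
            if (sink == 42)
                System.out.println(); // keep the JIT from removing the loop
        }
    }
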
    However, tasks limited by a resource like JDBC are not really the highly concurrent tasks that Loom is targeted at.  To truly test Loom (and async), we need to look at a type of task that just won’t scale with blocking threads dispatched from a thread pool.

    Virtual Threads vs Async APIs

    One highly concurrent workload that we often see on Jetty is chat-room-style interaction (or games) written on CometD and/or WebSocket.  Such applications often have many 10,000s or even 100,000s of connections to the server that are mostly idle, waiting for a message to receive or an event to send. Currently we achieve these scales only by asynchronous threadless waiting, with all its ramifications of complex async APIs and the application needing async callbacks.  Luckily, CometD was originally written when there were only async servlets and no async IO, thus it still has the option to be deployed using blocking I/O reads and writes.  This gives it good potential for a like-for-like comparison between async pooled kernel threads and blocking virtual threads.
    However, we still have a concern that this style of application/load will not be suitable for Loom, because each message to a chat room will fan out to the 10s, 100s or even 1000s of other users waiting in that room.  Thus a single read could result in many blocking write operations, which are typically done with deep stacks (parsing, framework, handling, marshalling, then writing) and other resources (buffers, locks, etc.). You can see in the following flame graph, from a CometD load test using Loom virtual threads, that even with a fast client the biggest block of time is spent in the blue peak on the left, which is writing with deep stacks. It is this part of the graph that needs to scale if we have more and/or slower clients:

    Jetty with CometD chat on Loom

    To fairly test Loom, it is not sufficient to just replace the limited pool of kernel threads with unlimited virtual threads.  Jetty goes to a lot of effort with its eat-what-you-kill scheduling, using reserved threads to ensure that whenever a selector thread runs a potentially blocking task, another thread is ready to take over the selector.  We can’t just put Loom virtual threads on top of this, otherwise it would pay the cost and complexity of core Jetty plus the overheads of Loom.  Moreover, we have also learnt about the risk of thread starvation that can result in highly concurrent applications if important tasks (e.g. HTTP/2 flow control) are deferred.  Since virtual threads can be postponed (potentially indefinitely) by CPU-bound applications or by the use of non-Loom-aware locks (such as the synchronized keyword), they are not suitable for all tasks within Jetty.
    Thus we think a better approach is to keep the core of Jetty running on kernel threads, but to spawn a virtual thread to do the actual work of reading, parsing, calling the application and writing the response.  If we flag those tasks with InvocationType.NON_BLOCKING, then they will be called directly by the selector thread, with no executor overhead. These tasks can then spawn a new virtual thread to proceed with the reading, parsing, handling, marshalling, writing and blocking.  Thus we have created the jetty-10.0.x-loom branch to use this approach, which will hopefully give a good basis for fair comparisons.
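    A simplified sketch of that approach follows; this is not the actual jetty-10.0.x-loom code, just an illustration of a task that is safe to run on the selector thread and immediately hands the blocking work to a freshly spawned virtual thread:

    import org.eclipse.jetty.util.thread.Invocable;

    public class SpawningTask implements Invocable, Runnable
    {
        private final Runnable blockingWork; // read, parse, call the application, write...

        public SpawningTask(Runnable blockingWork)
        {
            this.blockingWork = blockingWork;
        }

        @Override
        public InvocationType getInvocationType()
        {
            // Safe to call directly from the selector thread: run() never blocks itself.
            return InvocationType.NON_BLOCKING;
        }

        @Override
        public void run()
        {
            // Loom EA API (build 16-loom); later JDKs use Thread.ofVirtual() instead.
            Thread.builder().virtual().task(blockingWork).start();
        }
    }
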
    Our initial runs of our CometD benchmark with just 20 clients resulted in long GCs followed by out-of-memory failures! This is due to the usage of ThreadLocal for gathering latency statistics: each virtual thread was creating a latency-capture data structure, only to use it once and then throw it away!  While this problem is solvable by changing the CometD benchmark code, it reaffirms that threads use resources other than stack, and that Loom virtual threads are not a drop-in replacement for kernel threads.
    We are aware that the handling of ThreadLocal is a well-known problem in Loom, but until it is solved it may be a surprisingly hard problem to cope with, since you typically don’t know whether a library your application depends on uses ThreadLocal or not.
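    A minimal illustration of the trap, with hypothetical names rather than the CometD benchmark code: a per-thread structure is cheap when a few pooled threads reuse it, but with one virtual thread per task every task allocates its own copy, uses it once, and throws it away:

    public class LatencyRecorder
    {
        // A sizable per-thread structure: reused for many tasks by a pooled thread,
        // but allocated afresh by every newly spawned virtual thread.
        private static final ThreadLocal<long[]> SAMPLES =
            ThreadLocal.withInitial(() -> new long[1024 * 1024]); // ~8 MiB per thread

        public static void record(int index, long latencyNanos)
        {
            SAMPLES.get()[index] = latencyNanos;
        }
    }
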
    With the CometD benchmark modified to not use ThreadLocal, we can now take Loom/Jetty/CometD to a moderate number of clients (1000, which generated the flame graph above), with the following results:

    CLIENT: Async Jetty/CometD server
    ========================================
    Testing 1000 clients in 100 rooms, 10 rooms/client
    Sending 1000 batches of 10x50 bytes messages every 10000 µs
    Elapsed = 10015 ms
    - - - - - - - - - - - - - - - - - - - -
    Outgoing: Rate = 990 messages/s - 99 batches/s - 12.014 MiB/s
    Incoming: Rate = 99829 messages/s - 35833 batches/s(35.89%) - 26.352 MiB/s
                    @     _  3,898 µs (112993, 11.30%)
                       @  _  7,797 µs (141274, 14.13%)
                       @  _  11,696 µs (136440, 13.65%)
                       @  _  15,595 µs (139590, 13.96%) ^50%
                       @  _  19,493 µs (142883, 14.29%)
                      @   _  23,392 µs (130493, 13.05%)
                    @     _  27,291 µs (112283, 11.23%) ^85%
            @             _  31,190 µs (59810, 5.98%) ^95%
      @                   _  35,088 µs (12968, 1.30%)
     @                    _  38,987 µs (4266, 0.43%) ^99%
    @                     _  42,886 µs (2150, 0.22%)
    @                     _  46,785 µs (1259, 0.13%)
    @                     _  50,683 µs (910, 0.09%)
    @                     _  54,582 µs (752, 0.08%)
    @                     _  58,481 µs (567, 0.06%)
    @                     _  62,380 µs (460, 0.05%) ^99.9%
    @                     _  66,278 µs (365, 0.04%)
    @                     _  70,177 µs (232, 0.02%)
    @                     _  74,076 µs (82, 0.01%)
    @                     _  77,975 µs (13, 0.00%)
    @                     _  81,873 µs (2, 0.00%)
    Messages - Latency: 999792 samples
    Messages - min/avg/50th%/99th%/max = 209/15,095/14,778/35,815/78,184 µs
    Messages - Network Latency Min/Ave/Max = 0/14/78 ms
    SERVER: Async Jetty/CometD server
    ========================================
    Operative System: Linux 5.8.0-33-generic amd64
    JVM: Oracle Corporation OpenJDK 64-Bit Server VM 16-ea+25-1633 16-ea+25-1633
    Processors: 12
    System Memory: 89.26419% used of 31.164349 GiB
    Used Heap Size: 73.283676 MiB
    Max Heap Size: 2048.0 MiB
    - - - - - - - - - - - - - - - - - - - -
    Elapsed Time: 10568 ms
       Time in Young GC: 5 ms (2 collections)
       Time in Old GC: 0 ms (0 collections)
    Garbage Generated in Eden Space: 3330.0 MiB
    Garbage Generated in Survivor Space: 4.227936 MiB
    Average CPU Load: 397.78314/1200
    ========================================
    Jetty Thread Pool:
        threads:                174
        tasks:                  302146
        max concurrent threads: 34
        max queue size:         152
        queue latency avg/max:  0/11 ms
        task time avg/max:      1/3316 ms
    

     

    CLIENT: Loom Jetty/CometD server
    ========================================
    Testing 1000 clients in 100 rooms, 10 rooms/client
    Sending 1000 batches of 10x50 bytes messages every 10000 µs
    Elapsed = 10009 ms
    - - - - - - - - - - - - - - - - - - - -
    Outgoing: Rate = 990 messages/s - 99 batches/s - 13.774 MiB/s
    Incoming: Rate = 99832 messages/s - 41201 batches/s(41.27%) - 27.462 MiB/s
                     @    _  2,718 µs (99690, 9.98%)
                       @  _  5,436 µs (116281, 11.64%)
                       @  _  8,155 µs (115202, 11.53%)
                       @  _  10,873 µs (108572, 10.87%)
                      @   _  13,591 µs (106951, 10.70%) ^50%
                       @  _  16,310 µs (117139, 11.72%)
                       @  _  19,028 µs (114531, 11.46%)
                    @     _  21,746 µs (94080, 9.42%) ^85%
                @         _  24,465 µs (71479, 7.15%)
          @               _  27,183 µs (34358, 3.44%) ^95%
      @                   _  29,901 µs (11526, 1.15%) ^99%
     @                    _  32,620 µs (4513, 0.45%)
    @                     _  35,338 µs (2123, 0.21%)
    @                     _  38,056 µs (988, 0.10%)
    @                     _  40,775 µs (562, 0.06%)
    @                     _  43,493 µs (578, 0.06%) ^99.9%
    @                     _  46,211 µs (435, 0.04%)
    @                     _  48,930 µs (187, 0.02%)
    @                     _  51,648 µs (31, 0.00%)
    @                     _  54,366 µs (27, 0.00%)
    @                     _  57,085 µs (1, 0.00%)
    Messages - Latency: 999254 samples
    Messages - min/avg/50th%/99th%/max = 192/12,630/12,476/29,704/54,558 µs
    Messages - Network Latency Min/Ave/Max = 0/12/54 ms
    SERVER: Loom Jetty/CometD server
    ========================================
    Operative System: Linux 5.8.0-33-generic amd64
    JVM: Oracle Corporation OpenJDK 64-Bit Server VM 16-loom+9-316 16-loom+9-316
    Processors: 12
    System Memory: 88.79622% used of 31.164349 GiB
    Used Heap Size: 61.733116 MiB
    Max Heap Size: 2048.0 MiB
    - - - - - - - - - - - - - - - - - - - -
    Elapsed Time: 10560 ms
       Time in Young GC: 23 ms (8 collections)
       Time in Old GC: 0 ms (0 collections)
    Garbage Generated in Eden Space: 8068.0 MiB
    Garbage Generated in Survivor Space: 3.6905975 MiB
    Average CPU Load: 413.33084/1200
    ========================================
    Jetty Thread Pool:
        threads:                14
        tasks:                  0
        max concurrent threads: 0
        max queue size:         0
        queue latency avg/max:  0/0 ms
        task time avg/max:      0/0 ms
    

    The results here are a bit mixed, but there are some positives for Loom:

    • Both approaches easily achieved the 1000 msg/s sent to the server and 99.8k msg/s received from the server (messages have an average fan-out factor of 100).
    • The Loom version broke up those messages into 41k responses/s, whilst the async version used bigger batches at 35k responses/s, with each response carrying more messages. We need to investigate why, but we think Loom is faster at starting to run the task (no time in the thread pool queue, no time to “wake up” an idle thread).
    • Loom had better latency, both average (~12.5 ms vs ~14.8 ms) and max (~54.6 ms vs ~78.2 ms)
    • Loom used more CPU: 413/1200 vs 398/1200 (4% more)
    • Loom generated more garbage, ~8068.0 MiB vs ~3330.0 MiB, and fewer objects made it to survivor space.

    This is an interesting but inconclusive result.  It is at a low scale, on a fast loopback network, with a client unlikely to cause blocking, so it does not really test either approach.  We now need to scale this test to many 10,000s of clients on a real network, which will require multiple load-generation machines and careful measurement.  This will be the subject of part 3 (probably some weeks away).

    Conclusion (part 2) – Cheap threads can do expensive things

    It is good that Project Loom adds inexpensive and fast spawning/blocking virtual threads to the JVM.  But cheap threads can do expensive things!
    Having 1,000,000 concurrent application entities is going to take memory, CPU and other resources, no matter whether they block or use async callbacks. It may be that entirely different programming styles are needed for Loom, as is suggested by Loom Structured Concurrency; however, we have not yet seen anything that provides limitations on the resources that can be used by unlimited spawning of virtual threads. There are also indications that Loom’s flexible stack management comes with a CPU cost.  However, it has been moderately simple to update Jetty to experiment with using Loom to call a blocking application, and we’d very much encourage others to load test their application on the jetty-10.0.x-loom branch.
    Many of Loom’s claims have stacked up: blocking code is much easier to write, and virtual threads are very fast to start and cheap to block. However, other key claims either do not hold up or have yet to be substantiated: we do not think virtual threads give natural scaling, as threads themselves are not the limiting factor; rather, it is the resources that are used that determine the scaling.  The suggestion to “Forget about thread-pools, just spawn a new thread…” feels like an invitation to create unstable applications unless other substantive resource management strategies are put in place.
    Given that Duke’s “new clothes” woven by Loom are not one-size-fits-all, it would be a mistake to stop developing asynchronous APIs for things such as DNS and JDBC on the unsubstantiated suggestion that Loom virtual threads will make them unnecessary.

  • Eat What You Kill

    A producer consumer pattern for Jetty HTTP/2 with mechanical sympathy

    Developing scalable servers in Java now requires careful consideration of mechanical sympathy to achieve both high throughput and low latency.  With the introduction of HTTP/2 multiplexed semantics to Jetty, we have taken the opportunity to introduce a new execution strategy, named “eat what you kill”[n]The EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.[/n], which aims to avoid dispatch latency, run tasks with hot caches, reduce contention and parallel slowdown, and reduce memory footprint and queue depth.

    The problem

    The problem we are trying to solve is the producer consumer pattern, where one process produces tasks that need to be run (consumed). This is a common pattern, with two key examples in the Jetty Server:

    • a NIO Selector produces connection IO events that need to be consumed
    • a multiplexed HTTP/2 connection produces HTTP requests that need to be consumed by calling the Servlet Container

    For the purposes of this blog, we will consider the problem in general, with the producer represented by the following interface:

    public interface Producer
    {
        Runnable produce();
    }

    The optimisation problem that we are trying to solve is how to handle potentially many producers, each producing many tasks to run, and how to run the tasks that they produce so that they are consumed in a timely and efficient manner.

    Produce Consume

    The simplest solution to this pattern is to iteratively produce and consume as follows:

    while (true)
    {
        Runnable task = _producer.produce();
        if (task == null)
            break;
        task.run();
    }

    This strategy iteratively produces and consumes tasks in a single thread per Producer:

    Threading-PC

    It has the advantage of simplicity, but suffers the fundamental flaw of head-of-line (HOL) blocking: if one of the tasks blocks or executes slowly (e.g. task C3 above), then subsequent tasks will be held up. This is actually good for an HTTP/1 connection, where responses must be produced in the order of the requests, but it is unacceptable for HTTP/2 connections, where responses must be able to return in arbitrary order and one slow request cannot hold up other fast ones. It is also unacceptable for the NIO selection use-case, as one slow/busy/blocked connection must not prevent other connections from being produced/consumed.

    Produce Execute Consume

    To solve the HOL blocking problem, multiple threads must be used so that produced tasks can be executed in parallel, and even if one is slow or blocks, the other threads can progress the other tasks. The simplest application of threading is to place every task that is produced onto a queue, to be consumed by an Executor:

    while (true)
    {
        Runnable task = _producer.produce();
        if (task == null)
            break;
        _executor.execute(task);
    }

    This strategy could be considered the canonical solution to the producer consumer problem, where producers are separated from consumers by a queue, and it is at the heart of architectures such as SEDA. It solves the head-of-line blocking issue well, since all tasks produced can complete independently in different threads (or cached threads):

    Threading-PEC

    However, while it solves the HOL blocking issue, it introduces a number of other significant issues:

    • Tasks are produced by one thread and then consumed by another thread. This means that tasks are consumed on CPU cores with cold caches and that extra CPU time is required (indicated above in orange) while the cache loads the task-related data. For example, when producing an HTTP request, the parser will identify the request method, URI and fields, which will be in the CPU’s cache. If the request is consumed by a different thread, then all the request data must be loaded into the new CPU’s cache. This is an aspect of Parallel Slowdown, which Jetty has previously needed to avoid as it can have a considerable impact on server throughput.
    • Slow consumers may cause an arbitrarily large queue of tasks to build up, as the producers may just keep adding to the queue faster than tasks can be consumed.  This means that no back pressure is applied to the production of tasks, and out-of-memory problems can result. Conversely, if the queue size is limited with a blocking queue, then HOL blocking problems can re-emerge, as producers are prevented from queuing tasks that could be executed.
    • Every task produced will experience a dispatch latency as it is passed to a new thread to be consumed. While extra latency does not necessarily reduce the throughput of the server, it can represent a reduction in the quality of service.  The diagram above shows the total of 5 tasks completing sooner than with ProduceConsume, but if the server were busy then tasks might need to wait some time in the queue before being allocated a thread.
    • Another aspect of parallel slowdown is the contention between related tasks that a single producer may produce. For example, a single HTTP/2 connection is likely to produce requests for the same client session, accessing the same user data. If multiple requests from the same connection are executed in parallel on different CPU cores, then they may contend for the same application locks and data and therefore be less efficient.  Another way to think about this is that if a 4-core machine is handling 8 connections that each produce 4 requests, then each core will handle 8 requests.  If each core handles 4 requests from each of 2 connections, then there will be no contention between cores.  However, if each core handles 1 request from each of 8 connections, then the chances of contention will be high.  It is far better for total throughput for a single connection’s load not to be spread over all the system’s cores.

    Thus the ProduceExecuteConsume strategy has solved the HOL blocking concern, but at the expense of very poor performance on both latency (dispatch times) and execution (cold caches), as well as introducing concerns of contention and back pressure. Many of these additional concerns involve the concept of Mechanical Sympathy, where the underlying mechanical design (i.e. CPU cores and caches) must be considered when designing scalable software.

    How Bad Is It?

    Pretty bad! We have written a benchmark project that compares the Produce Consume and Produce Execute Consume strategies (both described above). The Test Connection used simulates a typical HTTP request handling load, where the production of the task equates to parsing the request and creating the request object, and the consumption of the task equates to handling the request and generating a response.

    ewyk1

    It can be seen that the ProduceConsume strategy achieves almost 8 times the throughput of the ProduceExecuteConsume strategy.  In doing so, however, the ProduceExecuteConsume strategy uses a lot less CPU (probably because it is idle during the dispatch delays). Yet even when the throughput is normalised to what might be achieved if 60% of the available CPU were used, this strategy still reduces throughput by 30%!  This is most probably due to the processing inefficiencies of cold caches and contention between tasks in the ProduceExecuteConsume strategy. It clearly shows that, to avoid HOL blocking, the ProduceExecuteConsume strategy gives up significant throughput, whether you consider achieved or theoretical measures.

    What Can Be Done?

    Disruptor ?

    Consideration of the SEDA architecture led to the development of the Disruptor pattern, which describes itself as a “High performance alternative to bounded queues for exchanging data between concurrent threads”.  This pattern attacks the problem by replacing the queue between producer and consumer with a better data structure that can greatly improve the handing-off of tasks between threads, by considering the mechanical sympathy concerns that affect the queue data structure itself.

    While replacing the queue with a better mechanism may well greatly improve performance, our analysis was that in Jetty it was the parallel slowdown of sharing the task data between threads that dominated any issues with the queue mechanism itself. Furthermore, the problem domain of a full SEDA-like architecture, whilst similar to the Jetty use-cases, is not similar enough to take advantage of some of the more advanced semantics available with the Disruptor.

    Even with the most efficient queue replacement, the Jetty use-cases will suffer from some dispatch latency and from the parallel slowdown of cold caches and contending related tasks.

    Work Stealing ?

    Another technique for avoiding parallel slowdown is a Work Stealing scheduling strategy:

    In a work stealing scheduler, each processor in a computer system has a queue of work items to perform…. New items are initially put on the queue of the processor executing the work item. When a processor runs out of work, it looks at the queues of other processors and “steals” their work items.

    This concept initially looked very promising, as it appeared that it would allow related tasks to stay on the same CPU core and avoid the parallel slowdown issues described above.
    It would require the single task queue to be broken up into multiple queues, but there are suitable candidates for finer-granularity queues available (e.g. the connection).

    Unfortunately, several efforts to implement it within Jetty failed to find an elegant solution, because it is not generally possible to pin a queue or thread to a processor, and the interaction of task queues vs thread pool queues added an additional level of complexity. Moreover, because the approach still involves queues, it does not solve the back pressure issues, and the execution of tasks in a queue may flush the cache between production and consumption.

    However, consideration of the principles of Work Stealing inspired the creation of a new scheduling strategy that attempts to achieve the same result, but without any queues.

    Eat What You Kill!

    The “Eat What You Kill”[n]The EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.[/n] strategy (which could have been more prosaically named ExecuteProduceConsume) has been designed to get the best of both worlds of the strategies presented above. It is nicknamed after the hunting movement that says a hunter should only kill an animal they intend to eat. Applied to the producer consumer problem, this policy says that a thread must only produce (kill) a task if it intends to consume (eat) it immediately. However, unlike the ProduceConsume strategy that adheres to this principle, EatWhatYouKill still performs dispatches, but only to recruit new threads (hunters) to produce and consume more tasks while the current thread is busy eating!

    private volatile boolean _threadPending;
    private final AtomicBoolean _producing = new AtomicBoolean(false);
    ...
        // A newly recruited thread clears the pending flag before it tries to produce.
        _threadPending = false;
        while (true)
        {
            // Only one thread at a time may produce from this producer.
            if (!_producing.compareAndSet(false, true))
                break;
            Runnable task;
            try
            {
                task = _producer.produce();
            }
            finally
            {
                _producing.set(false);
            }
            if (task == null)
                break;
            // Before consuming, recruit at most one other thread to keep producing.
            if (!_threadPending)
            {
                _threadPending = true;
                _executor.execute(this);
            }

            // Consume (eat) the task this thread just produced (killed), with a hot cache.
            task.run();
        }

    This strategy can still operate like ProduceConsume, using a loop to produce and consume tasks with a hot cache. A dispatch is performed to recruit a new thread to produce and consume, but on a busy server, where the delay in dispatching a new thread may be large, the extra thread may arrive after all the work is done. Thus the extreme case on a busy server is that this strategy can behave like ProduceConsume with an extra no-op dispatch:

    Threading-EPC-busy

    Serial queueless execution like this is optimal for a server’s throughput: there is no queue of produced tasks wasting memory, as tasks are only produced when needed, and tasks are always consumed with hot caches immediately after production.  Ideally, each core and/or thread in a server is serially executing related tasks in this pattern… unless of course one task takes too long to execute and we need to avoid HOL blocking.

    EatWhatYouKill avoids HOL blocking because it is able to recruit additional threads to iterate on production and consumption when the server is less busy and the dispatch delay is less than the time needed to consume a task.  In such cases, new threads will be recruited to assist with producing and consuming, but each thread will consume what it produced, using a hot cache, and tasks can complete out of order:

    Threading-EPC

    On a mostly idle server, the dispatch delay may always be less than the time to consume a task, and thus every task may be produced and consumed in its own dispatched thread:

    Threading-EPC-idle

    In this idle case there is a dispatch for every task, which is exactly the same dispatch cost as ProduceExecuteConsume.  However, this is only the worst-case dispatch overhead for EatWhatYouKill, and it only happens on a mostly idle server, which has spare CPU. Even in this worst case, EatWhatYouKill still has the advantage of always consuming with a hot cache.

    An alternative way to visualise this strategy is to consider it like ProduceConsume, but dispatching extra threads to work-steal production and consumption. These work-stealing threads will only manage to steal work if the server has spare capacity and the consumption of a task is risking HOL blocking.

    This strategy has many benefits:

    • A hot cache is always used to consume a produced task.
    • Good back pressure is achieved by making production contingent on either another thread being available or prior consumption being completed.
    • There will only ever be one outstanding dispatch to the thread pool per producer which reduces contention on the thread pool queue.
    • Unlike ProduceExecuteConsume, which always incurs the cost of a dispatch for every task produced, ExecuteProduceConsume will only dispatch additional threads if the time to consume exceeds the time to dispatch.
    • On systems where the dispatch delay is of the same order of magnitude as consuming a task (which is likely as the dispatch delay is often comprised of the wait for previous tasks to complete), then this strategy is self balancing and will find an optimal number of threads.
    • While contention between related tasks can still occur, it will be less of a problem on busy servers because related tasks will tend to be consumed iteratively, unless one of them blocks or executes slowly.

    How Good Is It ?

    Indications from the benchmarks are that it is very good!

    ewyk2

    For the benchmark, ExecuteProduceConsume achieved better throughput than ProduceConsume because it was able to use more CPU cores when appropriate. When normalised for CPU load, it achieved near-identical results to ProduceConsume, which is to be expected since both consume tasks with hot caches and ExecuteProduceConsume only incurs dispatch costs when they are productive.

    This indicates that you can kill your cake and eat it too! The same efficiency as ProduceConsume can be achieved with the same HOL blocking prevention as ProduceExecuteConsume.

    Conclusion

    The EatWhatYouKill (aka ExecuteProduceConsume) strategy has been integrated into Jetty-9.3 for both NIO selection and HTTP/2 request handling. This makes it possible for the following sequence of events to occur within a single thread of execution:

    1. A selector thread T1 wakes up because it has detected IO activity.
    2. (T1) An ExecuteProduceConsume strategy processes the selected keys set.
    3. (T1) An EndPoint with input pending is produced from the selected keys set.
    4. Another thread T2 is dispatched to continue producing from the selected keys set.
    5. (T1) The EndPoint with input pending is consumed by running the HTTP/2 connection associated with it.
    6. (T1) An ExecuteProduceConsume strategy processes the I/O for the HTTP/2 connection.
    7. (T1) An HTTP/2 frame is produced by the HTTP/2 connection.
    8. Another thread T3 is dispatched to continue producing HTTP/2 frames from the HTTP/2 connection.
    9. (T1) The frame is consumed by possibly invoking the application to produce a response.
    10. (T1) The thread returns from the application and attempts to produce more frames from the HTTP/2 connection, if there is I/O left to process.
    11. (T1) The thread returns from HTTP/2 connection I/O processing and attempts to produce more EndPoints from the selected keys set, if there is any left.

    This allows a single thread with a hot cache to handle a request from I/O selection, through frame parsing, to response generation, with no queues or dispatch delays. This offers maximum efficiency of handling while avoiding unacceptable HOL blocking.

    Early indications are that Jetty-9.3 is indeed demonstrating a significant step forward in both low latency and high throughput.   This site has been running on EWYK Jetty-9.3 for some months.  We are confident that with this new execution strategy, Jetty will provide the most performant and scalable HTTP/2 implementation available in Java.

  • CometD 2.4.0 WebSocket Benchmarks

    Slightly more than one year has passed since the last CometD 2 benchmarks, and more than three years since the CometD 1 benchmark. During this year we have done a lot of work on CometD, both by adding features and by continuously improving performance and stability to make it faster and more scalable.
    With the upcoming CometD 2.4.0 release, one of the biggest changes is the implementation of a WebSocket transport for both the Java client and the Java server.
    The WebSocket protocol is being finalized at the IETF, and major browsers all support various draft versions of the protocol (and Jetty supports all draft versions), so while WebSocket is slowly picking up, it is interesting to compare how WebSocket behaves with respect to HTTP for the typical scenarios that use CometD.
    We conducted several benchmarks using the CometD load tools on Amazon EC2 instances.

    HTTP Benchmark Results

    Below you can find the benchmark result graph when using the CometD long-polling transport, based on plain HTTP.

    Unlike the previous benchmark, where we reported the average latency, this time we report the median latency, which is a better indicator of the latencies seen by the clients.
    A comparison with the previous benchmark would be unfair, since the hosts were different (both in number and computing power), and the JVM was also different.
    As you can see from the graph above, the median latency is pretty much the same no matter the number of clients, with the exception of 50k clients at 50k messages/s.
    The median latency stays well under 200 ms even at more than 50k messages/s; it is in the range of 2-4 ms up to 10k messages/s, and around 50 ms at 20k messages/s, even for 50k clients.
    The result for 50k clients and 50k messages/s is a bit strange, since the hosts (both server and clients) had plenty of CPU available and plenty of threads available (which rules out lock contention issues in the code that would have increased thread usage).
    Could it be possible that at that message rate we hit some limit of the EC2 platform? It might be possible, and this blog post confirms that there are indeed limits in the virtualization of the network interfaces between host and guest. I have heard from other people who have performed benchmarks on EC2 that they also hit limits very close to what the blog post above describes.
    In any case, one server with 20k clients serving 50k messages/s with 150 ms median latency is a very good result.
    For completeness, the 99th percentile latency is around 350 ms for 20k and 50k clients at 20k messages/s, and around 1500 ms for 20k clients at 50k messages/s; it is much lower, quite close to the median latency, for the other results.

    WebSocket Benchmark Results

    The results for the same benchmarks using the WebSocket transport were quite impressive, and you can see them below.

    Note that this graph uses a totally different scale for latencies and number of clients.
    Whereas for HTTP we had 800 ms as the maximum latency (on the Y axis), for WebSocket we have 6 ms (yes, you read that right); and whereas for HTTP we somehow topped out at 50k clients per server, here we could go up to 200k.
    We did not merge the two graphs into a single one, to avoid the WebSocket trend lines being collapsed onto the X axis.
    With HTTP, having more than 50k clients on the server was troublesome at any message rate, but with WebSocket 200k clients were stable up to 20k messages/s. Beyond that, we probably hit EC2 limits again, and the results were unstable: some runs could complete successfully, others could not.

    • The median latencies, for almost any number of clients and any message rate, are below 10 ms, which is quite impressive.
    • The 99th percentile latency is around 300 ms for 200k clients at 20k messages/s, and around 200 ms for 50k clients at 50k messages/s.

    We have also conducted some benchmarks varying the payload size from the default of 50 bytes to 500 bytes and 2000 bytes, but the results we obtained with the different payload sizes were very similar, so we can say that payload size has very little impact (if any) on latencies in this benchmark configuration.
    We have also monitored memory consumption in the “idle” state (that is, with clients connected and sending meta connect requests every 30 seconds, but not sending messages):

    • HTTP: 50k clients occupy around 2.1 GiB
    • WebSocket: 50k clients occupy around 1.2 GiB, and 200k clients occupy 3.2 GiB.

    The benefits of WebSocket being a lighter-weight protocol than HTTP are clear in all cases.

    Conclusions

    The conclusions are:

    • The work the CometD project has done to improve performance and scalability was worth the effort, and CometD offers a truly scalable solution for server-side event-driven web applications, for both HTTP and WebSocket.
    • As the WebSocket protocol gains adoption, CometD can leverage the new protocol without any change required to applications; they will just perform faster.
    • Server-to-server CometD communication can now be extremely fast by using WebSocket. We have already updated the CometD scalability cluster Oort to take advantage of these enhancements.

    Appendix–Benchmark Details

    The server was one EC2 instance of type “m2.4xlarge” (67 GiB RAM, 8 cores Intel(R) Xeon(R) X5550 @2.67GHz) running Ubuntu Linux 11.04 (2.6.38-11-virtual #48-Ubuntu SMP 64-bit).
    The clients were 10 EC2 instances of type “c1.xlarge” (7 GiB RAM, 8 cores Intel Xeon E5410 @2.33GHz) running Ubuntu Linux 11.04 (2.6.38-11-virtual #48-Ubuntu SMP 64-bit).
    The JVM used was Oracle’s Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) version 1.7.0 for both clients and server.
    The server was started with the following options:

    -Xmx32g -Xms32g -Xmn16g -XX:-UseSplitVerifier -XX:+UseParallelOldGC -XX:-UseAdaptiveSizePolicy -XX:+UseNUMA

    while the clients were started with the following options:

    -Xmx6g -Xms6g -Xmn3g -XX:-UseSplitVerifier -XX:+UseParallelOldGC -XX:-UseAdaptiveSizePolicy -XX:+UseNUMA

    The OS was tuned to allow a larger number of file descriptors, as described here.