Category: performance

  • Thread Starvation with Eat What You Kill

    This is going to be a blog of mixed metaphors as I try to explain how we avoid thread starvation when we use Jetty’s eat-what-you-kill[n]The EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.[/n] scheduling strategy.
Jetty has several instances of a computing pattern called ProduceConsume, where a task is run that produces other tasks that need to be consumed. An example of a Producer is the HTTP/1.1 Connection, where the Producer task looks for IO activity on any connection. Each IO event detected is a Consumer task which will read and handle the IO event (typically a HTTP request). In Java NIO terms, the Producer in this example is running the NIO Selector and the Consumers are handling the HTTP protocol and the application's Servlets. Note that the split between Producing and Consuming can be rather arbitrary, and we have tried to make the HTTP protocol part of the Producer, but as we have previously blogged, that split has poor mechanical sympathy. So the key abstraction of the Producer Consumer pattern for Jetty is that we use it when the tasks produced can be executed in any order or in parallel: HTTP requests from different connections or HTTP/2 frames from different streams.

    Eat What You Kill

Mechanical Sympathy not only affects where the split between producing and consuming lies, but also how the Producer task and Consumer tasks should be executed (typically by a thread pool), and such considerations can have a dramatic effect on server performance. For example, if one thread produces a task, then it is likely that the CPU's cache is now hot with all the data relating to that task, and so it is best that the same CPU consumes that task using the hot cache. This could be achieved with a complex core-locking mechanism, but it is far more straightforward to consume the task using the same thread.
Jetty has an ExecutionStrategy called Eat-What-You-Kill (EWYK), that has excellent mechanical sympathy properties. We have previously explained this strategy in detail, but in summary it follows the hunter's ethic[n]The EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.[/n] that one should only kill (produce) something that you intend to eat (consume). This strategy allows a thread to only run the producing task if it is immediately able to run any consumer task that is produced (using the hot CPU cache). In order to allow other consumer tasks to run in parallel, another thread (if available) is dispatched to do more producing and consuming.

    Thread Starvation

EWYK is an excellent execution strategy that has given Jetty significantly better throughput and reduced latency. That said, it is susceptible to thread starvation when it bites off more than it can chew.
The issue is that EWYK works by using the same thread that produced a task to immediately consume the task, and it is possible (even likely) that the consumer task will block, as it is often calling application code which may do blocking IO or wait for some other event. To ensure this does not block the entire server, EWYK will dispatch another task to the thread pool that will do more producing.
The problem is that if the thread pool is empty (because all the threads are in blocking application code), then the last non-blocked producing thread may produce a task which it then runs and blocks in as well. A task to do more producing will have been dispatched to the thread pool, but as it was generated from the last available thread, that producing task will be waiting in the job queue for an available thread. All the threads are blocked, and it may be that they are all blocked on IO operations that will only be unblocked if some data is read/written. Unless something calls the NIO Selector, the read/write will not be seen. Since the Selector is called by the Producer task, and that is waiting in the queue, and the queue is stalled because all the threads are blocked waiting for the Selector, the server is now deadlocked by thread starvation!
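    The starvation scenario can be reproduced in miniature. The sketch below (class and method names are illustrative, not Jetty API) fills a fixed pool with consumers blocked in "application code", then queues a producing task that would deliver the IO event they are waiting for. With no spare thread, the producer is stranded in the queue:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StarvationSketch
{
    /** Returns true if the producing task got a thread within the timeout. */
    public static boolean producerRan(int poolSize, int blockedConsumers) throws Exception
    {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch ioEvent = new CountDownLatch(1);
        try
        {
            // Consumers block in "application code", waiting for an IO event
            // that only the producer (the Selector) can deliver.
            for (int i = 0; i < blockedConsumers; i++)
                pool.execute(() -> {
                    try { ioEvent.await(); } catch (InterruptedException ignored) {}
                });

            // The producing task is queued behind the blocked consumers.
            Future<?> producer = pool.submit(ioEvent::countDown);
            producer.get(500, TimeUnit.MILLISECONDS);
            return true;   // the producer ran and unblocked the consumers
        }
        catch (TimeoutException starved)
        {
            return false;  // every thread was blocked: the producer never ran
        }
        finally
        {
            ioEvent.countDown();  // let the demo exit cleanly
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println("starved: " + !producerRan(4, 4));
    }
}
```

    With one thread to spare (three blocked consumers in a pool of four) the producer runs and everything unblocks; with all four blocked, the pool is starved.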

    Always two there are!

Jetty's clever solution to this problem is to not only run our EWYK execution strategy, but also to run the alternative ProduceExecuteConsume strategy, where one thread does all the producing and always dispatches any produced tasks to the thread pool. Because this is not mechanically sympathetic, we run the producer task at low priority. This effectively reserves one thread from the thread pool to always be a producer, but because it is low priority it will seldom run unless the server is idle – or completely stalled due to thread starvation. This means that Jetty always has a thread available to Produce, thus there is always a thread available to run the NIO Selector, and any IO events that will unblock threads will be detected. This needs one more trick to work – the producing task must be able to tell if a detected IO task is non-blocking (i.e. a wakeup of a blocked read or write), in which case it executes it itself rather than submitting the task to any execution strategy. Jetty uses the InvocationType interface to tag such tasks and thus avoid thread starvation.
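    The tagging idea can be sketched as follows (a simplified illustration; Jetty's actual Invocable/InvocationType API differs in detail): tasks advertise whether they may block, and the reserved low-priority producer runs non-blocking tasks itself rather than dispatching them to the pool:

```java
import java.util.concurrent.Executor;

public class InvocationSketch
{
    public enum InvocationType { BLOCKING, NON_BLOCKING, EITHER }

    /** A task tagged with whether it may block when run. */
    public interface Invocable extends Runnable
    {
        InvocationType getInvocationType();
    }

    public static Invocable task(InvocationType type, Runnable body)
    {
        return new Invocable()
        {
            public InvocationType getInvocationType() { return type; }
            public void run() { body.run(); }
        };
    }

    /**
     * How a reserved, low-priority producer handles a produced task:
     * non-blocking tasks (e.g. waking up a blocked read/write) are run
     * inline; anything that may block is handed to the pool.
     */
    public static String handle(Invocable task, Executor pool)
    {
        if (task.getInvocationType() == InvocationType.NON_BLOCKING)
        {
            task.run();
            return "ran-inline";
        }
        pool.execute(task);
        return "dispatched";
    }
}
```

    Running a non-blocking wakeup inline means the reserved producer never gives up its thread, so it can always get back to the Selector.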
This is a great solution when a thread can be dedicated to always Producing (e.g. NIO Selecting). However, Jetty has other Producer-Consumer patterns that cannot afford a dedicated thread. HTTP/2 Connections are consumers of IO Events, but are themselves producers of parsed HTTP/2 frames, which may be handled in parallel due to the multiplexed nature of HTTP/2. So each HTTP/2 connection is itself a Produce-Consume pattern, but we cannot allocate a Producer thread to each connection, as a server may have many tens of thousands of connections!
Yet, to avoid thread starvation, we must also always call the Producer task for HTTP/2. This is necessary as it may parse HTTP/2 flow control frames that are needed to unblock the IO being done by application threads that are blocked and holding all the available threads from the pool.
Even if there is a thread reserved as the Producer/Selector by a connector, it may detect IO on a HTTP/2 connection and use the last thread from the thread pool to Consume that IO. If that produces a HTTP/2 frame and the EWYK strategy is used, then the last thread may Consume that frame and it too may block in application code. So even if the reserved thread detects more IO, there are no more available threads to consume it!
So the solution in HTTP/2 is similar to the approach with the Connector. Each HTTP/2 connection has two execution strategies – EWYK, which is used when the calling thread (the Connector's consumer) is allowed to block, and the traditional ProduceExecuteConsume strategy, which is used when the calling thread is not allowed to block. The HTTP/2 Connection then advertises itself as an InvocationType of EITHER to the Connector. If the Connector is running normally, an EWYK strategy will be used and the HTTP/2 Connection will do the same. However, if the Connector is running the low priority ProduceExecuteConsume strategy, it invokes the HTTP/2 connection as non-blocking. This tells the HTTP/2 Connection that when it is acting as a Consumer of the Connector's task, it must not block – so it uses its own ProduceExecuteConsume strategy, as it knows the Producer will only parse the HTTP/2 frame and not perform the Consume task itself (which may block).
The final part is that the HTTP/2 frame Producer can look at the frames produced. If they are frames that will not block when handled (i.e. Flow Control frames), they are handled by the Producer and not submitted to any strategy to be Consumed. Thus, even if the Server is on its last thread, Flow Control frames will be detected, parsed and handled – unblocking other threads and avoiding starvation!

  • Jetty-9.3 Features!

    Jetty 9.3.0 is almost ready and Release Candidate 1 is available for download and testing!  So this is just a quick blog to introduce you to what is new and encourage you to try it out!

    HTTP2

The headline feature in Jetty-9.3 is HTTP/2 support. This protocol is now a proposed standard from the IETF and described in RFC7540. The Jetty team has been closely involved with the development of this standard, and while we have some concerns about the result, we believe that there are significant quality of service gains to be had by deploying HTTP/2. The protocol has features that can greatly reduce the time to render a web page, which is good for clients; plus it has some good economies in using fewer connections, which is good for servers.

Jetty has comprehensive support for HTTP/2: Client, and Server with negotiated, upgraded and direct connections; the protocol is already supported by the majority of current browsers. Since HTTP/2 is substantially based on the SPDY protocol, we have dropped SPDY support from Jetty-9.3.

Deploying HTTP/2 in the server is just the same as configuring an https connector: java -jar $JETTY_HOME/start.jar --add-to-startd=http2 will get you going (more blogs and documentation coming)!

    Webtide is actively seeking users interested in deploying HTTP2 and collaborating on analysis of load, latency, configuration and optimisations.

    ALPN

To support standards-based negotiation of protocols over new connections (e.g. to select HTTP/2 or HTTP/1.1 over TLS), Jetty-9.3 supports the Application Layer Protocol Negotiation (ALPN) mechanism, which replaces our previous support for NPN.

ALPN will automatically be enabled when HTTP/2 is enabled with start.jar. This downloads a non-Eclipse jar containing our own extension to the OpenJDK, which is not covered by the Eclipse licenses.

    SNI

Jetty-9.3 also supports Server Name Indication (SNI) during TLS/SSL negotiation. This allows the keystore to contain multiple server certificates that have specific or wildcard domain(s) encoded in their distinguished name or in the Subject Alternative Name X.509 extension. This allows a server with many virtual hosts/contexts to pick the appropriate TLS/SSL certificate for a connection.

Enabling SNI support is as simple as adding the multiple certificates to your keystore file!

    Java 8

Jetty-9.3 is built and targeted for Java 8. This change was prompted by the SNI extension's reliance on a Java 8 API and the HTTP/2 specification's need for TLS ciphers that are only available in Java 8. It is possible to build Jetty-9.3 for Java 7, and we were considering releasing it as such, with a few configuration tricks to enable the few classes that require Java 8; however, we decided that since Java 7 is end-of-life it was not worth the complication to support it directly in the release. If you really need Java 7, then please speak to Webtide about a build of 9.3 for Java 7.

    Eat What You Kill

It is impossible to change the protocol a server speaks without dramatic changes to how it is optimized to scale to high loads and low latencies. The support of HTTP/2 required some fundamental changes to the core scheduling strategies, specifically with regards to the challenge of handling multiplexed requests from a single connection. Jetty 9.3 contains a new scheduling strategy nicknamed Eat What You Kill that makes 9.3 faster out of the box and gives us the opportunity to continue to improve throughput and latency as we tune the algorithm.

    Reactive Asynchronous IO Flows?

Jetty 9.2 already supports the Servlet Asynchronous IO API and Asynchronous Servlets. However, in Jetty 9.3 that support has been made even more fundamental: all IO in Jetty is now fundamentally asynchronous from the connector to the servlet streams, and robust under arbitrary access from non-container-managed threads.

So Jetty-9.3 is a good basis on which to develop with the servlet asynchronous APIs. However, as we have some concerns about the complexity of those APIs, we are actively experimenting with better APIs based on Reactive Programming, and specifically on the Flow abstraction developed by Doug Lea as a candidate class for Java 9. We have a working prototype that runs on Jetty-9.3 which we hope to release soon. Please contact us if you are interested in participating in this development, as real use-cases are required to test these abstractions!

  • G1 Garbage Collector at GeeCON 2015

    I had the pleasure to speak at the GeeCON 2015 Conference in Kraków, Poland, where I presented a HTTP/2 session and a new session about the G1 garbage collector (slides below).

    I have to say that GeeCON has become one of my favorite conferences, along with the Codemotion Conference.
    While Codemotion is bigger and spans more themes (Java, JavaScript/WebUI/WebUX, IoT, Makers and Embedded), GeeCON is smaller and more focused on Java and Scala.
    The GeeCON organizers have made a terrific effort, and the results are, in my opinion, superb.
    The conference venue is a bit far from the center of Kraków (where most speaker hotels were), but easily reachable by bus. When the conference is over, you still have a couple of hours of light to visit the city center, which is gorgeous.
The sessions I followed were very interesting and the speakers were top quality.
The accommodation was good, food and beverages at the conference were excellent, and on the second day there was a party for all conference attendees in a huge beer bar, with a robot war tournament for those that like to flip someone else's battle robot. Fantastic!
    Definitely mark GeeCON on your calendars for the next year.
The slides for the G1 session aim at explaining some of the lesser known areas of G1 and present tuning advice along with a real-world use case of migration from CMS to G1.
    Contact us if you want to tune the GC behavior of your Jetty applications, either deployed to a standalone Jetty server or coded using Jetty embedded.

  • Eat What You Kill

    A producer consumer pattern for Jetty HTTP/2 with mechanical sympathy

Developing scalable servers in Java now requires careful consideration of mechanical sympathy issues to achieve both high throughput and low latency. With the introduction of HTTP/2 multiplexed semantics to Jetty, we have taken the opportunity to introduce a new execution strategy, named “eat what you kill”[n]The EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.[/n], which aims at: avoiding dispatch latency; running tasks with hot caches; reducing contention and parallel slowdown; and reducing memory footprint and queue depth.

    The problem

The problem we are trying to solve is the producer consumer pattern, where one process produces tasks that must then be run, i.e. consumed. This is a common pattern, with two key examples in the Jetty Server:

    • a NIO Selector produces connection IO events that need to be consumed
    • a multiplexed HTTP/2 connection produces HTTP requests that need to be consumed by calling the Servlet Container

For the purposes of this blog, we will consider the problem in general, with the producer represented by the following interface:

    public interface Producer
    {
        Runnable produce();
    }

The optimisation task that we are trying to solve is how to handle potentially many producers, each producing many tasks to run, and how to run the tasks that they produce so that they are consumed in a timely and efficient manner.
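    For experimenting with the strategies below, a minimal Producer can be backed by a queue of pending tasks (this helper is illustrative, not part of Jetty):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueProducer
{
    public interface Producer
    {
        Runnable produce();
    }

    /** An illustrative Producer backed by a queue of pending tasks. */
    public static Producer of(Runnable... tasks)
    {
        Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
        for (Runnable task : tasks)
            queue.add(task);
        return queue::poll;   // returns null once there is nothing left to produce
    }

    public static void main(String[] args)
    {
        Producer producer = of(
            () -> System.out.println("task A"),
            () -> System.out.println("task B"));

        // The simplest consuming loop: produce and run until exhausted.
        Runnable task;
        while ((task = producer.produce()) != null)
            task.run();
    }
}
```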

    Produce Consume

    The simplest solution to this pattern is to iteratively produce and consume as follows:

    while (true)
    {
        Runnable task = _producer.produce();
        if (task == null)
            break;
        task.run();
    }

    This strategy iteratively produces and consumes tasks in a single thread per Producer:

Threading-PC

It has the advantage of simplicity, but suffers the fundamental flaw of head-of-line (HOL) blocking: if one of the tasks blocks or executes slowly (e.g. task C3 above), then subsequent tasks will be held up. This is actually good for a HTTP/1 connection, where responses must be produced in the order of the requests, but is unacceptable for HTTP/2 connections, where responses must be able to return in arbitrary order and one slow request cannot hold up other fast ones. It is also unacceptable for the NIO selection use-case, as one slow/busy/blocked connection must not prevent other connections from being produced/consumed.

    Produce Execute Consume

    To solve the HOL blocking problem, multiple threads must be used so that produced tasks can be executed in parallel and even if one is slow or blocks, the other threads can progress the other tasks. The simplest application of threading is to place every task that is produced onto a queue to be consumed by an Executor:

    while (true)
    {
        Runnable task = _producer.produce();
        if (task == null)
            break;
        _executor.execute(task);
    }

This strategy could be considered the canonical solution to the producer consumer problem, where producers are separated from consumers by a queue; it is at the heart of architectures such as SEDA. It solves the head-of-line blocking issue well, since all tasks produced can complete independently in different threads (or cached threads):

    Threading-PEC

    However, while it solves the HOL blocking issue, it introduces a number of other significant issues:

• Tasks are produced by one thread and then consumed by another. This means that tasks are consumed on CPU cores with cold caches, and extra CPU time is required (indicated above in orange) while the cache loads the task-related data. For example, when producing a HTTP request, the parser will identify the request method, URI and fields, which will be in the CPU's cache. If the request is consumed by a different thread, then all the request data must be loaded into the new CPU's cache. This is an aspect of Parallel Slowdown, which Jetty has needed to avoid previously as it can have a considerable impact on server throughput.
• Slow consumers may cause an arbitrarily large queue of tasks to build up, as the producers may just keep adding to the queue faster than tasks can be consumed. This means that no back pressure is applied to the production of tasks, and out of memory problems can result. Conversely, if the queue size is limited with a blocking queue, then HOL blocking problems can re-emerge, as producers are prevented from queuing tasks that could be executed.
• Every task produced will experience a dispatch latency as it is passed to a new thread to be consumed. While extra latency does not necessarily reduce the throughput of the server, it can represent a reduction in the quality of service. The diagram above shows the total of 5 tasks completing sooner than with ProduceConsume, but if the server were busy then tasks might need to wait some time in the queue before being allocated a thread.
• Another aspect of parallel slowdown is the contention between related tasks which a single producer may produce. For example, a single HTTP/2 connection is likely to produce requests for the same client session, accessing the same user data. If multiple requests from the same connection are executed in parallel on different CPU cores, then they may contend for the same application locks and data and therefore be less efficient. Another way to think about this is that if a 4 core machine is handling 8 connections that each produce 4 requests, then each core will handle 8 requests. If each core can handle 4 requests from each of 2 connections, then there will be no contention between cores. However, if each core handles 1 request from each of 8 connections, then the chances of contention will be high. It is far better for total throughput that a single connection's load is not spread over all the system's cores.

    Thus the ProduceExecuteConsume strategy has solved the HOL blocking concern but at the expense of very poor performance on both latency (dispatch times) and execution (cold caches), as well as introducing concerns of contention and back pressure. Many of these additional concerns involve the concept of Mechanical Sympathy, where the underlying mechanical design (i.e. CPU cores and caches) must be considered when designing scalable software.

    How Bad Is It?

Pretty Bad! We have written a benchmark project that compares the Produce Consume and Produce Execute Consume strategies (both described above). The Test Connection used simulates a typical HTTP request handling load, where the production of the task equates to parsing the request and creating the request object, and the consumption of the task equates to handling the request and generating a response.

    ewyk1

It can be seen that the ProduceConsume strategy achieves almost 8 times the throughput of the ProduceExecuteConsume strategy. However, the ProduceExecuteConsume strategy is also using a lot less CPU (probably because it is idle during the dispatch delays). Yet even when the throughput is normalised to what might be achieved if 60% of the available CPU were used, this strategy still reduces throughput by 30%! This is most probably due to the processing inefficiencies of cold caches and contention between tasks in the ProduceExecuteConsume strategy. This clearly shows that to avoid HOL blocking, the ProduceExecuteConsume strategy gives up significant throughput, whether you consider achieved or theoretical measures.

    What Can Be Done?

    Disruptor ?

Consideration of the SEDA architecture led to the development of the Disruptor pattern, which describes itself as a “High performance alternative to bounded queues for exchanging data between concurrent threads”. This pattern attacks the problem by replacing the queue between producer and consumer with a better data structure that can greatly improve the handing off of tasks between threads, by addressing the mechanically sympathetic concerns that affect the queue data structure itself.

While replacing the queue with a better mechanism may well greatly improve performance, our analysis was that in Jetty it was the parallel slowdown of sharing the task data between threads that dominated, rather than any issues with the queue mechanism itself. Furthermore, the problem domain of a full SEDA-like architecture, whilst similar to the Jetty use-cases, is not similar enough to take advantage of some of the more advanced semantics available with the disruptor.

Even with the most efficient queue replacement, the Jetty use-cases will suffer some dispatch latency and parallel slowdown from cold caches and contending related tasks.

    Work Stealing ?

    Another technique for avoiding parallel slowdown is a Work Stealing scheduling strategy:

    In a work stealing scheduler, each processor in a computer system has a queue of work items to perform…. New items are initially put on the queue of the processor executing the work item. When a processor runs out of work, it looks at the queues of other processors and “steals” their work items.

This concept initially looked very promising, as it appeared that it would allow related tasks to stay on the same CPU core and avoid the parallel slowdown issues described above.
It would require the single task queue to be broken up into multiple queues, but there are suitable candidates for finer granularity queues (e.g. the connection).

Unfortunately, several efforts to implement it within Jetty failed to find an elegant solution, because it is not generally possible to pin a queue or thread to a processor, and the interaction of task queues vs thread pool queues added an additional level of complexity. Moreover, because the approach still involves queues, it does not solve the back pressure issues, and the execution of tasks in a queue may flush the cache between production and consumption.

However, consideration of the principles of Work Stealing inspired the creation of a new scheduling strategy that attempts to achieve the same result, but without any queues.

    Eat What You Kill!

The “Eat What You Kill”[n]The EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.[/n] strategy (which could have been more prosaically named ExecuteProduceConsume) has been designed to get the best of both worlds of the strategies presented above. It is nicknamed after the hunting movement that says a hunter should only kill an animal they intend to eat. Applied to the producer consumer problem, this policy says that a thread must only produce (kill) a task if it intends to consume (eat) it immediately. However, unlike the ProduceConsume strategy, which also adheres to this principle, EatWhatYouKill still performs dispatches, but only to recruit new threads (hunters) to produce and consume more tasks while the current thread is busy eating!

private volatile boolean _threadPending;
private final AtomicBoolean _producing = new AtomicBoolean(false);
...
public void run()
{
    _threadPending = false;
    while (true)
    {
        // Only one thread may produce at a time
        if (!_producing.compareAndSet(false, true))
            break;
        Runnable task;
        try
        {
            task = _producer.produce();
        }
        finally
        {
            _producing.set(false);
        }
        if (task == null)
            break;
        // Recruit another thread to produce while this one consumes
        if (!_threadPending)
        {
            _threadPending = true;
            _executor.execute(this);
        }
        task.run();
    }
}

This strategy can still operate like ProduceConsume, using a loop to produce and consume tasks with a hot cache. A dispatch is performed to recruit a new thread to produce and consume, but on a busy server, where the delay in dispatching a new thread may be large, the extra thread may arrive after all the work is done. Thus the extreme case on a busy server is that this strategy can behave like ProduceConsume with an extra no-op dispatch:

    Threading-EPC-busy

Serial queueless execution like this is optimal for a server's throughput: there is no queue of produced tasks wasting memory, as tasks are only produced when needed, and tasks are always consumed with hot caches immediately after production. Ideally each core and/or thread in a server is serially executing related tasks in this pattern… unless of course one task takes too long to execute and we need to avoid HOL blocking.

EatWhatYouKill avoids HOL blocking as it is able to recruit additional threads to iterate on production and consumption if the server is less busy and the dispatch delay is less than the time needed to consume a task. In such cases, new threads will be recruited to assist with producing and consuming, but each thread will consume what it produced using a hot cache, and tasks can complete out of order:

Threading-EPC

On a mostly idle server, the dispatch delay may always be less than the time to consume a task, and thus every task may be produced and consumed in its own dispatched thread:

Threading-EPC-idle

In this idle case there is a dispatch for every task, which is exactly the same dispatch cost as ProduceExecuteConsume. However, this is only the worst-case dispatch overhead for EatWhatYouKill, and it only happens on a mostly idle server, which has spare CPU. Even in this worst case, EatWhatYouKill still has the advantage of always consuming with a hot cache.

An alternate way to visualise this strategy is to consider it like ProduceConsume, but dispatching extra threads to work-steal production and consumption. These work-stealing threads will only manage to steal work if the server has spare capacity and the consumption of a task risks HOL blocking.

    This strategy has many benefits:

    • A hot cache is always used to consume a produced task.
    • Good back pressure is achieved by making production contingent on either another thread being available or prior consumption being completed.
    • There will only ever be one outstanding dispatch to the thread pool per producer which reduces contention on the thread pool queue.
    • Unlike ProduceExecuteConsume, which always incurs the cost of a dispatch for every task produced, ExecuteProduceConsume will only dispatch additional threads if the time to consume exceeds the time to dispatch.
    • On systems where the dispatch delay is of the same order of magnitude as consuming a task (which is likely as the dispatch delay is often comprised of the wait for previous tasks to complete), then this strategy is self balancing and will find an optimal number of threads.
• While contention between related tasks can still occur, it will be less of a problem on busy servers, because related tasks will tend to be consumed iteratively, unless one of them blocks or executes slowly.
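    The loop shown earlier can be exercised end-to-end. Here is a condensed, self-contained harness (illustrative, not Jetty's actual EatWhatYouKill implementation) in which every produced task is consumed exactly once, while threads are recruited exactly as the strategy dictates:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class EwykHarness implements Runnable
{
    interface Producer { Runnable produce(); }

    private final Producer _producer;
    private final Executor _executor;
    private final AtomicBoolean _producing = new AtomicBoolean(false);
    private volatile boolean _threadPending;

    EwykHarness(Producer producer, Executor executor)
    {
        _producer = producer;
        _executor = executor;
    }

    @Override
    public void run()
    {
        _threadPending = false;
        while (true)
        {
            if (!_producing.compareAndSet(false, true))
                break;                      // another thread is already producing
            Runnable task;
            try
            {
                task = _producer.produce();
            }
            finally
            {
                _producing.set(false);
            }
            if (task == null)
                break;                      // nothing left to produce
            if (!_threadPending)
            {
                _threadPending = true;
                _executor.execute(this);    // recruit a thread to keep producing
            }
            task.run();                     // eat what we killed, with a hot cache
        }
    }

    /** Runs `tasks` trivial tasks through the strategy; returns how many ran. */
    public static int consumed(int tasks) throws Exception
    {
        Queue<Runnable> work = new ConcurrentLinkedQueue<>();
        AtomicInteger done = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++)
            work.add(() -> { done.incrementAndGet(); latch.countDown(); });

        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.execute(new EwykHarness(work::poll, pool));
        latch.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return done.get();
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(consumed(200));
    }
}
```

    Because the queue poll is atomic and a thread only produces while it holds the _producing flag, each task is produced and consumed exactly once, regardless of how many recruited threads join in.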

    How Good Is It ?

Indications from the benchmarks are that it is very good!

    ewyk2

For the benchmark, ExecuteProduceConsume achieved better throughput than ProduceConsume because it was able to use more CPU cores when appropriate. When normalised for CPU load, it achieved near identical results to ProduceConsume, which is to be expected, since both consume tasks with hot caches and ExecuteProduceConsume only incurs dispatch costs when they are productive.

This indicates that you can kill your cake and eat it too! The same efficiency as ProduceConsume can be achieved with the same HOL blocking prevention as ProduceExecuteConsume.

    Conclusion

    The EatWhatYouKill (aka ExecuteProduceConsume) strategy has been integrated into Jetty-9.3 for both NIO selection and HTTP/2 request handling. This makes it possible for the following sequence of events to occur within a single thread of execution:

    1. A selector thread T1 wakes up because it has detected IO activity.
    2. (T1) An ExecuteProduceConsume strategy processes the selected keys set.
    3. (T1) An EndPoint with input pending is produced from the selected keys set.
    4. Another thread T2 is dispatched to continue producing from the selected keys set.
    5. (T1) The EndPoint with input pending is consumed by running the HTTP/2 connection associated with it.
    6. (T1) An ExecuteProduceConsume strategy processes the I/O for the HTTP/2 connection.
    7. (T1) A HTTP/2 frame is produced by the HTTP/2 connection.
    8. Another thread T3 is dispatched to continue producing HTTP/2 frames from the HTTP/2 connection.
    9. (T1) The frame is consumed by possibly invoking the application to produce a response.
    10. (T1) The thread returns from the application and attempts to produce more frames from the HTTP/2 connection, if there is I/O left to process.
    11. (T1) The thread returns from HTTP/2 connection I/O processing and attempts to produce more EndPoints from the selected keys set, if there is any left.

This allows a single thread with a hot cache to handle a request from I/O selection, through frame parsing, to response generation, with no queues or dispatch delays. This offers maximum efficiency of handling while avoiding the unacceptable HOL blocking.

    Early indications are that Jetty-9.3 is indeed demonstrating a significant step forward in both low latency and high throughput.   This site has been running on EWYK Jetty-9.3 for some months.  We are confident that with this new execution strategy, Jetty will provide the most performant and scalable HTTP/2 implementation available in Java.

  • HTTP/2 Push Demo

    I have recently presented “HTTP/2 and Java: Current Status” at a few conferences (slides below).

    The HTTP/2 protocol has two big benefits over HTTP/1.1: Multiplexing and HTTP/2 Push.
    The first feature, Multiplexing, gives an edge to modern web sites that perform ~100 requests per page to one or more domains.
    With HTTP/1.1 these web sites had to open 6-8 connections to a domain, and then send only 6-8 requests at a time.
    With a network roundtrip of 150ms, it takes ~15 roundtrips to perform ~100 requests, or 15 * 150ms = 2250ms, in the best conditions and without taking into account the download time.
    With HTTP/2 Multiplexing, those ~100 requests can be sent all at once to the server, reducing the roundtrip time from 2250ms to ideally just 150ms.
    The second feature, HTTP/2 Push, allows the server to preemptively push to clients not only the primary resource that has been requested (typically an HTML page), but also secondary resources associated with it (typically CSS files, JavaScript files, image files, etc.).
    HTTP/2 Push complements Multiplexing by saving the roundtrips needed to fetch all resources required to render a page.
    The net result of these two features is a vastly improved web site performance which, following well known studies, can be directly related to more page views, and eventually to more revenue for your business.
    The Jetty Project first implemented these features in SPDY in 2012, and improved them in HTTP/2.
    We are now promoting this work to become part of the Servlet 4.0 specification.
    If you are interested in speeding up your web site (even if it is in PHP – Jetty can host any PHP site, including WordPress), contact us.
    The presentation I gave at conferences includes a demo that shows an example of HTTP/2 versus HTTP/1.1, available at GitHub.

  • Jetty 9 Quick Start

    The auto discovery features of the Servlet specification can make deployments slow and uncertain. Working in collaboration with Google App Engine, the Jetty team has developed the Jetty quickstart mechanism that allows rapid and consistent starting of a Servlet server. Google App Engine has long used Jetty for its footprint and flexibility, and now fast and predictable starting is a new compelling reason to use Jetty in the cloud. Because Google App Engine is specifically designed with highly variable, bursty workloads in mind, rapid start-up time leads directly to cost-efficient scaling.

    jetty-quickstart

    Servlet Container Discovery Features

    The last few iterations of the Servlet 3.x specification have added a lot of features related to making development easier by auto discovery and configuration of Servlets, Filters and frameworks:

    Discovered / From:                   Selected Container Jars   WEB-INF/classes/*   WEB-INF/lib/*.jar
    Annotated Servlets & Filters                    Y                      Y                   Y
    web.xml fragments                               Y                                          Y
    ServletContainerInitializers                    Y                                          Y
    Classes discovered by HandlesTypes              Y                      Y                   Y
    JSP Taglib descriptors                          Y                                          Y

    Slow Discovery

    Auto discovery of Servlet configuration can be useful during the development of a webapp as it allows new features and frameworks to be enabled simply by dropping in a jar file. However, for deployment, the need to scan the contents of many jars can have a significant impact on the start time of a webapp. In the cloud, where server instances are often spun up on demand, having a slow start time can have a big impact on the resources needed.

    Consider a cluster under load that determines that an extra node is desirable: every second the new node spends scanning its own classpath for configuration is a second that the entire cluster is overloaded and giving less than optimal performance. To counter the inability to quickly bring on new instances, cluster administrators have to provision extra idle capacity that can handle a burst in load while more capacity is brought on line. This idle capacity is carried for the entire time the application is running, and thus a slowly starting server increases costs over the whole application deployment.

    On average server hardware, a moderate webapp with 36 jars (using Spring, Jersey & Jackson) took 3,158ms to deploy on Jetty 9. Now this is pretty fast, and some Servlet containers (e.g. Glassfish) take more time just to initialise their logging. However 3s is still over a reasonable threshold for making a user wait whilst dynamically starting an instance. Also, many webapps now have over 100 jar dependencies, so scanning time can be a lot longer.

    Using the standard metadata-complete option does not significantly speed up start times, as TLDs and HandlesTypes classes must still be discovered even with metadata-complete. Unpacking the war file saves a little more time, but even with both of these, Jetty still takes 2,747ms to start.

    Unknown Deployment

    Another issue with discovery is that the exact configuration of a webapp is not fully known until runtime. New Servlets and frameworks can be accidentally deployed if jar files are upgraded without their contents being fully investigated. This gives some deployers a large headache with regards to security audits and general predictability.

    It is possible to disable some auto discovery mechanisms by using the metadata-complete setting within web.xml, however that does not disable the scanning for HandlesTypes, so it does not avoid the need to scan, nor the auto deployment of accidentally included types and annotations.
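    For reference, metadata-complete is declared as an attribute of the web-app root element in WEB-INF/web.xml:

```xml
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee"
         version="3.1"
         metadata-complete="true">
  <!-- annotations and web-fragment.xml files will not be scanned,
       but HandlesTypes scanning still occurs -->
  ...
</web-app>
```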

    Jetty 9 Quickstart

    From release 9.2.0 of Jetty, we have included the quickstart module that allows a webapp to be pre-scanned and preconfigured.  This means that all the scanning is done prior to deployment and all configuration is encoded into an effective web.xml, called WEB-INF/quickstart-web.xml, which can be inspected to understand what will be deployed before deploying.

    Not only does the quickstart-web.xml contain all the discovered Servlets, Filters and Constraints, but it also encodes as context parameters all discovered:

    • ServletContainerInitializers
    • HandlesTypes classes
    • Taglib Descriptors

    With the quickstart mechanism, Jetty is able to entirely bypass all scanning and discovery modes and start a webapp in a predictable and fast way.

    Using Quickstart

    Preparing a Jetty instance to test the quickstart mechanism is extremely simple using the Jetty module system. Firstly, if JETTY_HOME is pointing to a Jetty distribution >= 9.2.0, then we can prepare a Jetty instance for a normal deployment of our benchmark webapp:

    > JETTY_HOME=/opt/jetty-9.2.0.v20140526
    > mkdir /tmp/demo
    > cd /tmp/demo
    > java -jar $JETTY_HOME/start.jar
      --add-to-startd=http,annotations,plus,jsp,deploy
    > cp /tmp/benchmark.war webapps/

    and we can see how long this takes to start normally:

    > java -jar $JETTY_HOME/start.jar
    2014-03-19 15:11:18.826:INFO::main: Logging initialized @255ms
    2014-03-19 15:11:19.020:INFO:oejs.Server:main: jetty-9.2.0.v20140526
    2014-03-19 15:11:19.036:INFO:oejdp.ScanningAppProvider:main: Deployment monitor [file:/tmp/demo/webapps/] at interval 1
    2014-03-19 15:11:21.708:INFO:oejsh.ContextHandler:main: Started o.e.j.w.WebAppContext@633a6671{/benchmark,file:/tmp/jetty-0.0.0.0-8080-benchmark.war-_benchmark-any-8166385366934676785.dir/webapp/,AVAILABLE}{/benchmark.war}
    2014-03-19 15:11:21.718:INFO:oejs.ServerConnector:main: Started ServerConnector@21fd3544{HTTP/1.1}{0.0.0.0:8080}
    2014-03-19 15:11:21.718:INFO:oejs.Server:main: Started @2579ms

    So the JVM started and loaded the core of Jetty in 255ms, but another 2,324ms were needed to scan and start the webapp.

    To quick start this webapp, we need to enable the quickstart module and use an example context xml file to configure the benchmark webapp to use it:

    > java -jar $JETTY_HOME/start.jar --add-to-startd=quickstart
    > cp $JETTY_HOME/etc/example-quickstart.xml webapps/benchmark.xml
    > vi webapps/benchmark.xml

    The benchmark.xml file should be edited to point to the benchmark.war file:

    <?xml version="1.0"  encoding="ISO-8859-1"?>
    <!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure_9_0.dtd">
    <Configure class="org.eclipse.jetty.quickstart.QuickStartWebApp">
      <Set name="autoPreconfigure">true</Set>
      <Set name="contextPath">/benchmark</Set>
      <Set name="war"><Property name="jetty.webapps" default="."/>/benchmark.war</Set>
    </Configure>

    Now the next time the webapp is run, it will be preconfigured (taking a little bit longer than normal start):

    > java -jar $JETTY_HOME/start.jar
    2014-03-19 15:21:16.442:INFO::main: Logging initialized @237ms
    2014-03-19 15:21:16.624:INFO:oejs.Server:main: jetty-9.2.0.v20140526
    2014-03-19 15:21:16.642:INFO:oejdp.ScanningAppProvider:main: Deployment monitor [file:/tmp/demo/webapps/] at interval 1
    2014-03-19 15:21:16.688:INFO:oejq.QuickStartWebApp:main: Quickstart Extract file:/tmp/demo/webapps/benchmark.war to file:/tmp/demo/webapps/benchmark
    2014-03-19 15:21:16.733:INFO:oejq.QuickStartWebApp:main: Quickstart preconfigure: o.e.j.q.QuickStartWebApp@54318a7a{/benchmark,file:/tmp/demo/webapps/benchmark,null}(war=file:/tmp/demo/webapps/benchmark.war,dir=file:/tmp/demo/webapps/benchmark)
    2014-03-19 15:21:19.545:INFO:oejq.QuickStartWebApp:main: Quickstart generate /tmp/demo/webapps/benchmark/WEB-INF/quickstart-web.xml
    2014-03-19 15:21:19.879:INFO:oejsh.ContextHandler:main: Started o.e.j.q.QuickStartWebApp@54318a7a{/benchmark,file:/tmp/demo/webapps/benchmark,AVAILABLE}
    2014-03-19 15:21:19.893:INFO:oejs.ServerConnector:main: Started ServerConnector@63acac21{HTTP/1.1}{0.0.0.0:8080}
    2014-03-19 15:21:19.894:INFO:oejs.Server:main: Started @3698ms

    After preconfiguration, on all subsequent starts it will be quick started:

    > java -jar $JETTY_HOME/start.jar
    2014-03-19 15:21:26.069:INFO::main: Logging initialized @239ms
    2014-03-19 15:21:26.263:INFO:oejs.Server:main: jetty-9.2.0-SNAPSHOT
    2014-03-19 15:21:26.281:INFO:oejdp.ScanningAppProvider:main: Deployment monitor [file:/tmp/demo/webapps/] at interval 1
    2014-03-19 15:21:26.941:INFO:oejsh.ContextHandler:main: Started o.e.j.q.QuickStartWebApp@559d6246{/benchmark,file:/tmp/demo/webapps/benchmark/,AVAILABLE}{/benchmark/}
    2014-03-19 15:21:26.956:INFO:oejs.ServerConnector:main: Started ServerConnector@4a569e9b{HTTP/1.1}{0.0.0.0:8080}
    2014-03-19 15:21:26.956:INFO:oejs.Server:main: Started @1135ms

    So quickstart has reduced the start time from 3158ms to approx 1135ms!

    Moreover, the entire configuration of the webapp is visible in webapps/benchmark/WEB-INF/quickstart-web.xml. This file can be examined and all the deployed elements can be easily audited.

    Starting Faster!

    Avoiding TLD scans

    The Jetty 9.2 distribution switched from the Glassfish JSP engine to the Apache Jasper JSP implementation. Unfortunately this JSP implementation always scans for TLDs, which turns out to take significant time during startup. So we have modified the standard JSP initialisation to skip TLD parsing altogether if the JSPs have been precompiled.

    To let the JSP implementation know that all JSPs have been precompiled, a context attribute needs to be set in web.xml:

    <context-param>
      <param-name>org.eclipse.jetty.jsp.precompiled</param-name>
      <param-value>true</param-value>
    </context-param>

    This is done automagically if you use the Jetty Maven JSPC plugin. Now after the first run, the webapp starts in 797ms:

    > java -jar $JETTY_HOME/start.jar
    2014-03-19 15:30:26.052:INFO::main: Logging initialized @239ms
    2014-03-19 15:30:26.245:INFO:oejs.Server:main: jetty-9.2.0.v20140526
    2014-03-19 15:30:26.260:INFO:oejdp.ScanningAppProvider:main: Deployment monitor [file:/tmp/demo/webapps/] at interval 1
    2014-03-19 15:30:26.589:INFO:oejsh.ContextHandler:main: Started o.e.j.q.QuickStartWebApp@414fabe1{/benchmark,file:/tmp/demo/webapps/benchmark/,AVAILABLE}{/benchmark/}
    2014-03-19 15:30:26.601:INFO:oejs.ServerConnector:main: Started ServerConnector@50473913{HTTP/1.1}{0.0.0.0:8080}
    2014-03-19 15:30:26.601:INFO:oejs.Server:main: Started @797ms

    Bypassing start.jar

    The Jetty start.jar is a very powerful and flexible mechanism for constructing a classpath and executing a configuration encoded in Jetty XML format. However, this mechanism does take some time to build the classpath. The start.jar mechanism can be bypassed by using the --dry-run option to generate and reuse a complete command line to start Jetty:

    > RUN=$(java -jar $JETTY_HOME/start.jar --dry-run)
    > eval $RUN
    2014-03-19 15:53:21.252:INFO::main: Logging initialized @41ms
    2014-03-19 15:53:21.428:INFO:oejs.Server:main: jetty-9.2.0.v20140526
    2014-03-19 15:53:21.443:INFO:oejdp.ScanningAppProvider:main: Deployment monitor [file:/tmp/demo/webapps/] at interval 1
    2014-03-19 15:53:21.761:INFO:oejsh.ContextHandler:main: Started o.e.j.q.QuickStartWebApp@7a98dcb{/benchmark,file:/tmp/demo/webapps/benchmark/,AVAILABLE}{file:/tmp/demo/webapps/benchmark/}
    2014-03-19 15:53:21.775:INFO:oejs.ServerConnector:main: Started ServerConnector@66ca206e{HTTP/1.1}{0.0.0.0:8080}
    2014-03-19 15:53:21.776:INFO:oejs.Server:main: Started @582ms

    Classloading?

    With the quickstart mechanism, the start time of the Jetty server is dominated by classloading, with over 50% of the CPU being profiled within URLClassLoader, and the next hot spot is 4% in XML parsing. Thus a small gain could be made by pre-parsing the XML into byte code calls, but any more significant progress will probably need examination of the class loading mechanism itself. We have experimented with combining all classes into a single jar or a classes directory, but with no further gains.

    Conclusion

    The Jetty-9 quickstart mechanism provides almost an order of magnitude improvement in start time. This allows fast and predictable deployment, making Jetty the ideal server to be used in dynamic clouds such as Google App Engine.

  • Jetty-9 Iterating Asynchronous Callbacks

    While Jetty has internally used asynchronous IO since 7.0, Servlet 3.1 has added asynchronous IO to the application API, and Jetty-9.1 now supports asynchronous IO in an unbroken chain from application to socket. Asynchronous APIs can often look intuitively simple, but there are many important subtleties to asynchronous programming, and this blog looks at one important pattern used within Jetty. Specifically, we look at how an iterating callback pattern is used to avoid deep stacks and unnecessary thread dispatches.

    Asynchronous Callback

    Many programmers wrongly believe that asynchronous programming is about Futures. However Futures are a mostly broken abstraction and could best be described as a deferred blocking API rather than an Asynchronous API.    True asynchronous programming is about callbacks, where the asynchronous operation calls back the caller when the operation is complete.  A classic example of this is the NIO AsynchronousByteChannel write method:

    <A> void write(ByteBuffer src,
                   A attachment,
                   CompletionHandler<Integer,? super A> handler);
    public interface CompletionHandler<V,A>
    {
      void completed(V result, A attachment);
      void failed(Throwable exc, A attachment);
    }

    With an NIO asynchronous write, a CompletionHandler instance is passed that is called back once the write operation has completed or failed. If the write channel is congested, then no calling thread is held or blocked whilst the operation waits for the congestion to clear, and the callback will be invoked by a thread typically taken from a thread pool.

    The Servlet 3.1 Asynchronous IO API is syntactically very different, but semantically similar to NIO. Rather than have a callback when a write operation has completed, the API has a WriteListener that is called when a write operation can proceed without blocking:

    public interface WriteListener extends EventListener
    {
        public void onWritePossible() throws IOException;
        public void onError(final Throwable t);
    }

    Whilst this looks different to the NIO write CompletionHandler, effectively a write is possible only when the previous write operation has completed, so the callbacks occur on essentially the same semantic event.

    Callback Threading Issues

    So that asynchronous callback concept looks pretty simple!  How hard could it be to implement and use!   Let’s consider an example of asynchronously writing the data obtained from an InputStream.  The following WriteListener can achieve this:

    public class AsyncWriter implements WriteListener
    {
      private InputStream in;
      private ServletOutputStream out;
      private AsyncContext context;
      public AsyncWriter(AsyncContext context,
                         InputStream in,
                         ServletOutputStream out)
      {
        this.context=context;
        this.in=in;
        this.out=out;
      }
      public void onWritePossible() throws IOException
      {
        byte[] buf = new byte[4096];
        while(out.isReady())
        {
          int l=in.read(buf,0,buf.length);
          if (l<0)
          {
            context.complete();
            return;
          }
          out.write(buf,0,l);
        }
      }
      ...
    }

    Whenever a write is possible, this listener will read some data from the input and write it asynchronous to the output. Once all the input is written, the asynchronous Servlet context is signalled that the writing is complete.

    However there are several key threading issues with a WriteListener like this, from both the caller's and callee's point of view. Firstly, this is not entirely non-blocking, as the read from the input stream can block. However, if the input stream is from the local file system and the output stream is to a remote socket, then the probability and duration of the input blocking is much less than that of the output, so this is substantially non-blocking asynchronous code and thus is reasonable to include in an application. What this means for asynchronous operation providers (like Jetty) is that you cannot trust any code you call back not to block, and thus you cannot use an important thread (eg one iterating over selected keys from a Selector) to do the callback, else an application may inadvertently block other tasks from proceeding. Asynchronous IO implementations must therefore often dispatch a thread to perform a callback to application code.

    Because dispatching threads is expensive in both CPU and latency, asynchronous IO implementations look for opportunities to optimise away thread dispatches to callbacks. The Servlet 3.1 API has, by design, such an optimisation with the out.isReady() call, which allows iteration of multiple operations within the one callback. A dispatch to onWritePossible only happens when it is required to avoid a blocking write, and often many write iterations can proceed within a single callback. An NIO CompletionHandler based implementation of the same task is only able to perform one write operation per callback and must wait for the invocation of the completion handler for that operation before proceeding:

    public class AsyncWriter implements CompletionHandler<Integer,Void>
    {
      private InputStream in;
      private AsynchronousByteChannel out;
      private CompletionHandler<Void,Void> complete;
      private byte[] buf = new byte[4096];
      public AsyncWriter(InputStream in,
                         AsynchronousByteChannel out,
                         CompletionHandler<Void,Void> complete)
      {
        this.in=in;
        this.out=out;
        this.complete=complete;
        completed(0,null);
      }
      public void completed(Integer written, Void attachment)
      {
        try
        {
          int l=in.read(buf,0,buf.length);
          if (l<0)
            complete.completed(null,null);
          else
            out.write(ByteBuffer.wrap(buf,0,l),null,this);
        }
        catch(IOException e)
        {
          failed(e,attachment);
        }
      }
      ...
    }

    Apart from an unrelated significant bug (left as an exercise for the reader to find), this version of the AsyncWriter has a significant threading challenge. If the write completes trivially without blocking, should the callback to the CompletionHandler be dispatched to a new thread, or should it just be called from the scope of the write using the caller thread? If a new thread is always used, then many dispatch delays will be incurred and throughput will be very low. But if the callback is invoked from the scope of the write call, then if the callback does a re-entrant call to write, it may call the callback again, which calls write again, and so on, and a very deep stack will result; often a stack overflow can occur.

    The JVM’s implementation of NIO resolves this dilemma by doing both! It performs the callback in the scope of the write call until it detects a deep stack, at which time it dispatches the callback to a new thread. While this does work, I consider it a little bit of a worst-of-both-worlds solution: you get deep stacks and you get dispatch latency. Yet it is an accepted pattern, and Jetty-8 uses this approach for callbacks via our ForkInvoker class.
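    That approach can be sketched as follows. This is illustrative code in the spirit of the ForkInvoker idea, with an invented class name and depth limit, not Jetty's actual implementation:

```java
import java.util.concurrent.Executor;

// Sketch of "invoke in the caller's scope until the stack gets deep, then
// dispatch": a per-thread depth counter decides whether a callback is run
// re-entrantly or handed to an executor. Names and limit are illustrative.
public class ForkLimitingInvoker {
    private static final int MAX_DEPTH = 4;
    private final ThreadLocal<Integer> depth = ThreadLocal.withInitial(() -> 0);
    private final Executor executor;

    public ForkLimitingInvoker(Executor executor) {
        this.executor = executor;
    }

    public void invoke(Runnable callback) {
        int d = depth.get();
        if (d < MAX_DEPTH) {
            depth.set(d + 1);           // run re-entrantly in the caller's scope
            try { callback.run(); }
            finally { depth.set(d); }
        } else {
            executor.execute(callback); // stack too deep: dispatch instead
        }
    }
}
```

    A self-invoking callback therefore runs in the caller's scope a few times, then pays a dispatch; as the text says, you get both deep(ish) stacks and dispatch latency.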

    Jetty-9 IO Callbacks

    For Jetty-9, we wanted the best of all worlds. We wanted to avoid deep re-entrant stacks and to avoid dispatch delays. In a similar way to Servlet 3.1 WriteListeners, we wanted to substitute iteration for re-entrancy whenever possible. Thus Jetty does not use the NIO asynchronous IO channel APIs, but rather implements its own asynchronous IO pattern using the NIO Selector, our own EndPoint abstraction and a simple Callback interface:

    public interface EndPoint extends Closeable
    {
      ...
      void write(Callback callback, ByteBuffer... buffers)
        throws WritePendingException;
      ...
    }
    public interface Callback
    {
      public void succeeded();
      public void failed(Throwable x);
    }

    One key feature of this API is that it supports gather writes, so that there is less need for either iteration or re-entrancy when writing multiple buffers (eg headers, chunk and/or content).  But other than that it is semantically the same as the NIO CompletionHandler and if used incorrectly could also suffer from deep stacks and/or dispatch latency.

    Jetty Iterating Callback

    Jetty’s technique to avoid deep stacks and/or dispatch latency is to use the IteratingCallback class as the basis of callbacks for tasks that may take multiple IO operations:

    public abstract class IteratingCallback implements Callback
    {
      protected enum State
        { IDLE, SCHEDULED, ITERATING, SUCCEEDED, FAILED };
      private final AtomicReference<State> _state =
        new AtomicReference<>(State.IDLE);
      abstract protected void completed();
      abstract protected State process() throws Exception;
      public void iterate()
      {
        while(_state.compareAndSet(State.IDLE,State.ITERATING))
        {
          State next;
          try
          {
            next = process();
          }
          catch (Exception e)
          {
            failed(e);
            return;
          }
          switch (next)
          {
            case SUCCEEDED:
              if (!_state.compareAndSet(State.ITERATING,State.SUCCEEDED))
                throw new IllegalStateException("state="+_state.get());
              completed();
              return;
            case SCHEDULED:
              if (_state.compareAndSet(State.ITERATING,State.SCHEDULED))
                return;
              continue;
            ...
          }
        }
      }
      public void succeeded()
      {
        loop: while(true)
        {
          switch(_state.get())
          {
            case ITERATING:
              if (_state.compareAndSet(State.ITERATING,State.IDLE))
                break loop;
              continue;
            case SCHEDULED:
              if (_state.compareAndSet(State.SCHEDULED,State.IDLE))
                iterate();
              break loop;
            ...
          }
        }
      }
      ...
    }

    IteratingCallback is itself an example of another pattern used extensively in Jetty-9:  it is a lock-free atomic state machine implemented with an AtomicReference to an Enum.  This pattern allows very fast and efficient lock free thread safe code to be written, which is exactly what asynchronous IO needs.
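    As a stand-alone illustration of this pattern (not Jetty code; the class and states are invented for the example), here is a tiny lock-free state machine that guards a close operation:

```java
import java.util.concurrent.atomic.AtomicReference;

// Lock-free atomic state machine: transitions are attempted with
// compareAndSet and retried in a loop, so no locks are ever taken.
// Purely illustrative; names and states are invented for this example.
public class ConnectionState {
    public enum State { OPEN, CLOSING, CLOSED }

    private final AtomicReference<State> state = new AtomicReference<>(State.OPEN);

    /** Returns true for exactly one caller: the one that performs the close. */
    public boolean close() {
        while (true) {
            State s = state.get();
            switch (s) {
                case OPEN:
                    if (state.compareAndSet(State.OPEN, State.CLOSING)) {
                        // ... release resources here ...
                        state.set(State.CLOSED);
                        return true;
                    }
                    continue; // lost the race; re-read the state and retry
                case CLOSING:
                case CLOSED:
                    return false; // another thread is closing, or has closed
                default:
                    throw new IllegalStateException("state=" + s);
            }
        }
    }

    public State get() { return state.get(); }
}
```

    However many threads race into close(), exactly one wins the OPEN-to-CLOSING transition and releases the resources; the rest observe the state and back off without ever blocking.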

    The IteratingCallback class iterates on calling the abstract process() method until such time as it returns the SUCCEEDED state to indicate that all operations are complete.  If the process() method is not complete, it may return SCHEDULED to indicate that it has invoked an asynchronous operation (such as EndPoint.write(...)) and passed the IteratingCallback as the callback.

    Once scheduled, there are two possible outcomes for a successful operation. In the case that the operation completes trivially, succeeded() will have been called back within the scope of the write, so the state will have been switched from ITERATING to IDLE. The while loop in iterate() will then fail to set the SCHEDULED state and will continue by switching from IDLE to ITERATING, thus calling process() again iteratively.

    In the case that the scheduled operation does not complete within the scope of process(), the iterate() while loop will succeed in setting the SCHEDULED state and break the loop. When the IO infrastructure subsequently dispatches a thread to call back succeeded(), it will switch from SCHEDULED to IDLE state and itself call the iterate() method to continue to iterate on calling process().

    Iterating Callback Example

    A simplified example of using an IteratingCallback to implement the AsyncWriter example from above is given below:

    private class AsyncWriter extends IteratingCallback
    {
      private final Callback _callback;
      private final InputStream _in;
      private final EndPoint _endp;
      private final ByteBuffer _buffer;
      public AsyncWriter(InputStream in,EndPoint endp,Callback callback)
      {
        _callback=callback;
        _in=in;
        _endp=endp;
        _buffer = BufferUtil.allocate(4096);
      }
      protected State process() throws Exception
      {
        int l=_in.read(_buffer.array(),
                       _buffer.arrayOffset(),
                       _buffer.capacity());
        if (l<0)
        {
           _callback.succeeded();
           return State.SUCCEEDED;
        }
        _buffer.position(0);
        _buffer.limit(l);
        _endp.write(this,_buffer);
        return State.SCHEDULED;
      }
    }

    Several production quality examples of IteratingCallbacks can be seen in the Jetty HttpOutput class, including a real example of asynchronously writing data from an input stream.

    Conclusion

    Jetty-9 has had a lot of effort put into using efficient lock-free patterns to implement a high performance, scalable IO layer that can be seamlessly extended all the way into the servlet application via Servlet 3.1 asynchronous IO. Iterating callbacks and lock-free state machines are just some of the advanced techniques Jetty is using to achieve excellent scalability results.

  • Jetty SPDY push improvements

    After having some discussions on spdy-dev and gaining some experience with our current push implementation, we’ve decided to change a few things for the better.
    Jetty now sends all push resources non-interleaved to the client. That means that the push resources are sent sequentially to the client, one after the other.
    The ReferrerPushStrategy automatically detects which resources need to be pushed for a specific main resource; see SPDY – we push! for details. Previously we just sent the push resources back to the client in random order. However, with the change to send the resources sequentially, it is best to keep the order in which the first browser client requested those resources, so we changed the implementation of ReferrerPushStrategy accordingly.
    This all aims at improving the time needed to render the page in the browser, by sending the data to the browser in the order the browser needs it.

  • Jetty SPDY to HTTP Proxy

    We have had SPDY to SPDY and HTTP to SPDY proxy functionality implemented in Jetty for a while now.
    An important and very common use case, however, is a SPDY to HTTP proxy. Imagine a network architecture where network components like firewalls need to inspect application layer contents. If those network components are not SPDY aware and able to read the binary protocol, you need to terminate SPDY before passing the traffic through those components. The same goes for other network components like load balancers, etc.
    Another common use case is that you might not be able to migrate your legacy application from an HTTP connector to SPDY, maybe because you can’t use Jetty for your application or your application is not written in Java.
    Quite a while ago, we implemented SPDY to HTTP proxy functionality in Jetty; we just hadn’t blogged about it yet. Using that proxy it’s possible to gain all the SPDY benefits where they really count: on the slow internet with high latency, while terminating SPDY on the frontend and talking plain HTTP to your backend components.
    Here’s the documentation to setup a SPDY to HTTP proxy:
    http://www.eclipse.org/jetty/documentation/current/spdy-configuring-proxy.html#spdy-to-http-example-config

  • Asynchronous Rest with Jetty-9

    This blog is an update for Jetty-9 of one published for Jetty 7 in 2008 as an example web application that uses the Jetty asynchronous HTTP client and the asynchronous Servlet 3.0 API to call an eBay restful web service. The technique combines the Jetty asynchronous HTTP client with the Jetty server’s ability to suspend servlet processing, so that threads are not held while waiting for rest responses. Thus threads can handle many more requests, and web applications using this technique should obtain at least tenfold increases in performance.

    Screenshot from 2013-04-19 09:15:19

    The screen shot above shows four iframes calling either a synchronous or the asynchronous demonstration servlet, with the following results:

    Synchronous Call, Single Keyword
    A request to lookup ebay auctions with the keyword “kayak” is handled by the synchronous implementation. The call takes 261ms and the servlet thread is blocked for the entire time. A server with 100 threads in a pool would be able to handle 383 requests per second.
    Asynchronous Call, Single Keyword
    A request to lookup ebay auctions with the keyword “kayak” is handled by the asynchronous implementation. The call takes 254ms, but the servlet request is suspended so the request thread is held for only 5ms. A server with 100 threads in a pool would be able to handle 20,000 requests per second (if not constrained by other limitations).
    Synchronous Call, Three Keywords
    A request to lookup ebay auctions with keywords “mouse”, “beer” and “gnome” is handled by the synchronous implementation. Three calls are made to ebay in series, each taking approx 306ms, with a total time of 917ms, and the servlet thread is blocked for the entire time. A server with 100 threads in a pool would be able to handle only 109 requests per second!
    Asynchronous Call, Three Keywords
    A request to lookup ebay auctions with keywords “mouse”, “beer” and “gnome” is handled by the asynchronous implementation. The three calls can be made to ebay in parallel, each taking approx 300ms, with a total time of 453ms, and the servlet request is suspended so the request thread is held for only 7ms. A server with 100 threads in a pool would be able to handle 14,000 requests per second (if not constrained by other limitations).
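    The capacity figures quoted above follow from simple arithmetic: the pool size divided by the time each thread is held per request. A sketch:

```java
// Capacity model behind the figures above: a pool of N threads, each held
// for t seconds per request, can serve at most N / t requests per second.
public class Capacity {
    public static long requestsPerSecond(int threads, double heldMillis) {
        return Math.round(threads / (heldMillis / 1000.0));
    }
}
```

    For example, 100 threads each blocked for 261ms gives 100 / 0.261 ≈ 383 requests per second, while 100 threads each held for only 5ms gives 100 / 0.005 = 20,000 requests per second.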

    It can be seen by these results that asynchronous handling of restful requests can dramatically improve both the page load time and the capacity by avoiding thread starvation.
    The code for the example asynchronous servlet is available from jetty-9 examples and works as follows:

    1. The servlet is passed the request, which is detected as the first dispatch, so the request is suspended and a list to accumulate results is added as a request attribute:
      // If no results, this must be the first dispatch, so send the REST request(s)
      if (results == null) {
          final Queue<Map<String, String>> resultsQueue = new ConcurrentLinkedQueue<>();
          request.setAttribute(RESULTS_ATTR, results = resultsQueue);
          final AsyncContext async = request.startAsync();
          async.setTimeout(30000);
          ...
    2. After suspending, the servlet creates and sends an asynchronous HTTP exchange for each keyword:
      for (final String item:keywords) {
        _client.newRequest(restURL(item)).method(HttpMethod.GET).send(
          new AsyncRestRequest() {
            @Override
            void onAuctionFound(Map<String,String> auction) {
              resultsQueue.add(auction);
            }
            @Override
            void onComplete() {
              if (outstanding.decrementAndGet()<=0)
                async.dispatch();
            }
          });
      }
    3. All the rest requests are handled in parallel by the eBay servers and when each of them completes, the call back on the exchange object is called. The code (shown above) extracts auction information in the base class from the JSON response and adds it to the results list in the onAuctionFound method.  In the onComplete method, the count of expected responses is then decremented and when it reaches 0, the suspended request is resumed by a call to dispatch.
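    The countdown in steps 2 and 3 can be reduced to a small standalone sketch (the names here, such as gather, are mine and not from the example): each callback adds its result and decrements the shared counter, and whichever thread happens to decrement it to zero is the one that resumes the request, so the dispatch runs exactly once and only after every response has arrived.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CountdownDemo {
    // Runs one simulated "REST callback" per keyword on a thread pool and
    // returns the number of results gathered when the last callback fires.
    static int gather(String... keywords) throws InterruptedException {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        AtomicInteger outstanding = new AtomicInteger(keywords.length);
        CountDownLatch dispatched = new CountDownLatch(1); // stands in for async.dispatch()

        ExecutorService pool = Executors.newFixedThreadPool(keywords.length);
        for (String keyword : keywords) {
            pool.submit(() -> {
                results.add("auction:" + keyword);        // like onAuctionFound()
                if (outstanding.decrementAndGet() == 0)   // like onComplete()
                    dispatched.countDown();
            });
        }
        dispatched.await(); // the request is "resumed" only after the last response
        pool.shutdown();
        return results.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(gather("mouse", "beer", "gnome")); // prints 3
    }
}
```

    Because each add happens before its own decrement, and the atomic decrements order the threads, the resumed request is guaranteed to see all of the results.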
    4. After being resumed (dispatched), the request is re-dispatched to the servlet. This time the request is not initial and has results, so the results are retrieved from the request attribute and normal servlet style code is used to generate a response:
      Queue<Map<String, String>> results = (Queue<Map<String, String>>) request.getAttribute(RESULTS_ATTR);
      response.setContentType("text/html");
      PrintWriter out = response.getWriter();
      out.println("<html><body>");
      for (Map<String, String> m : results) {
        out.print("<p>" + m + "</p>"); // markup reduced for brevity
      ...
      out.println("</body></html>");
      
    5. Note that the example lacks some error and timeout handling.
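    One way to cover the timeout case would be to register an AsyncListener on the AsyncContext alongside the setTimeout call in step 1. This is a sketch only, not part of the example (it uses the standard Servlet 3.0 listener API; the choice of a 504 status is mine):

```java
async.addListener(new AsyncListener() {
    @Override
    public void onTimeout(AsyncEvent event) throws IOException {
        // eBay did not answer within the 30s timeout: fail the request
        HttpServletResponse response = (HttpServletResponse) event.getAsyncContext().getResponse();
        response.sendError(HttpServletResponse.SC_GATEWAY_TIMEOUT);
        event.getAsyncContext().complete();
    }

    @Override
    public void onError(AsyncEvent event) {
        event.getAsyncContext().complete();
    }

    @Override
    public void onComplete(AsyncEvent event) {}

    @Override
    public void onStartAsync(AsyncEvent event) {}
});
```

    A more polished version might instead render whatever partial results had arrived before the timeout, since the results queue is already populated incrementally.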

    This example shows how the Jetty asynchronous client can easily be combined with the asynchronous servlets of Jetty-9 (or the Continuations of Jetty-7) to produce very scalable web applications.