Back to the Future with Cross-Context Dispatch

Cross-Context Dispatch reintroduced to Jetty-12

With the release of Jetty 12.0.8, we’re excited to announce the (re)implementation of a somewhat maligned and deprecated feature: Cross-Context Dispatch. This feature, while having been part of the Servlet specification for many years, has seen varied levels of use and support. Its re-introduction in Jetty 12.0.8, however, marks a significant step forward in our commitment to supporting the diverse needs of our users, especially those with complex legacy and modern web applications.

Understanding Cross-Context Dispatch

Cross-Context Dispatch allows a web application to forward requests to or include responses from another web application within the same Jetty server. Although it has been available as part of the Servlet specification for an extended period, it was deemed optional with Servlet 6.0 of EE10, reflecting its status as a somewhat niche feature.

Initially, Jetty 12 moved away from supporting Cross-Context Dispatch, driven by a desire to simplify the server architecture amidst substantial changes, including support for multiple environments (EE8, EE9, and EE10). These updates mean Jetty can now deploy web applications using either the javax namespace (EE8) or the jakarta namespace (EE9 and EE10), all using the latest optimized jetty core implementations of HTTP: v1, v2 or v3.

Reintroducing Cross-Context Dispatch

The decision to reintegrate Cross-Context Dispatch in Jetty 12.0.8 was influenced significantly by the needs of our commercial clients, some of whom still rely on this feature in their legacy applications. Our commitment to supporting our clients’ requirements, including the need to maintain and extend legacy systems, remains a top priority.

One of the standout features of the newly implemented Cross-Context Dispatch is its ability to bridge applications across different environments. This means a web application based on the javax namespace (EE8) can now dispatch requests to, or include responses from, a web application based on the jakarta namespace (EE9 or EE10). This functionality opens up new pathways for integrating legacy applications with newer, modern systems.

Looking Ahead

The reintroduction of Cross-Context Dispatch in Jetty 12.0.8 is more than just a nod to legacy systems; it can be used as a bridge to the future of Java web development. By allowing for seamless interactions between applications across different Servlet environments, Jetty-12 opens the possibility of incremental migration away from legacy web applications.

16/05/2024
If Virtual Threads are the solution, what is the problem?
Java’s Virtual Threads (aka Project Loom or JEP 444) have arrived as a full platform feature in Java 21, which has generated considerable interest and many projects (including Eclipse Jetty) are adding support.

I have previously been somewhat skeptical about how significant the advantages of Virtual Threads over Platform Threads (aka Native Threads) actually are. I’ve also pointed out that cheap Threads can do expensive things, so that using Virtual Threads may not be a universal panacea for concurrent programming.

However, even with those doubts, it is clear that Virtual Threads do have advantages in memory utilization and speed of startup. In this blog we look at what kinds of applications may benefit from those advantages.

In short, we investigate which scalability problems Virtual Threads are the solution for.

Axioms

Firstly let’s agree on what is accepted about Virtual Thread usage:
- Writing asynchronous code is extraordinarily difficult. “Yeah I know” you say… yeah but no, it is harder than that! Avoiding the need to write application logic in asynchronous style is key to improving the quality and stability of an application. This blog is not generally advocating you write your applications in an asynchronous style.
- Virtual Threads are very cheap to create. From a performance perspective there is no reason to pool already started Virtual Threads and such pools are considered an anti-pattern. If a Virtual Thread is needed, then just create a new one.
- Virtual Threads use less memory. This is accepted, but with some significant caveats. Specifically the memory saving is achieved because Virtual Threads only allocate stack memory as needed, whilst Platform Threads provision stack size based on a worst case maximal usage. This is not exactly an apples vs oranges comparison.
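
The "just create a new one" axiom can be sketched with the JDK alone. This is a minimal illustration, assuming Java 21 or later; the class name `VirtualThreadPerTask` and the 10 ms sleep (a stand-in for any blocking call) are made up for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadPerTask {
    // Submits `tasks` blocking jobs and returns how many of them ran on a Virtual Thread.
    public static int countVirtual(int tasks) {
        AtomicInteger ranOnVirtual = new AtomicInteger();
        // No pooling: this executor starts a brand new Virtual Thread for every task.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    Thread.sleep(10); // hypothetical stand-in for a blocking call
                    if (Thread.currentThread().isVirtual())
                        ranOnVirtual.incrementAndGet();
                    return null;
                });
            }
        } // close() waits for all submitted tasks to finish
        return ranOnVirtual.get();
    }

    public static void main(String[] args) {
        System.out.println(countVirtual(1_000) + " tasks ran on Virtual Threads");
    }
}
```

Each task gets its own Thread and the executor is simply closed when the work is done; nothing is kept around for reuse.
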
If some are good, are more even better?

Consider a blocking style application running on a traditional application server that is not scaling sufficiently. On inspection you see that all the Threads in the pool (default size 200) are allocated and that there are no Threads available to do more work!

Would making more Threads available be the solution to this scalability problem? Perhaps 2000 Platform Threads will help? Still slow? Let’s try 10,000 Platform Threads! Running out of memory? Then perhaps unlimited Virtual Threads will solve the scalability problems?

What if on further inspection it is found that the pool Threads are mostly blocked waiting for a JDBC Database connection from the JDBC Connection Pool (default size 8) and that as a result the Thread pool is exhausted.

If every request needs the database, then any additional Threads will all just block on the same JDBC pool, thus more Threads will not make a more Scalable solution.

Alternatively, if only some requests need to use the database, then having more Threads would allow requests that do not need the database to proceed to completion. However, a fraction of requests would still end up blocked on the JDBC pool. Thus any limited Platform Thread pool could still become exhausted.

With unlimited Virtual Threads there is no effective limit on the number of Threads, so non database requests could always continue, but the queue of Threads waiting on JDBC would also be unlimited, as would the total of any resources held by those Threads whilst waiting. Thus the application would only scale for some types of request, whilst giving JDBC dependent requests the same poor Quality of Service as before.
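
The effect can be reproduced in a small simulation. The sketch below is not JDBC code: a `Semaphore` stands in for the connection pool, the class name `PoolBottleneck` and the hold time are assumptions for illustration, and Java 21 or later is assumed for the Virtual Thread executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolBottleneck {
    // Runs `requests` concurrent requests that each need one of `poolSize` connections.
    // Returns the highest number of requests that held a connection at the same time.
    public static int maxConcurrentHolders(int requests, int poolSize, long holdMillis) {
        Semaphore connectionPool = new Semaphore(poolSize); // hypothetical stand-in for a JDBC pool
        AtomicInteger holders = new AtomicInteger();
        AtomicInteger maxHolders = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < requests; i++) {
                executor.submit(() -> {
                    connectionPool.acquire(); // every extra Thread just queues here
                    try {
                        int now = holders.incrementAndGet();
                        maxHolders.accumulateAndGet(now, Math::max);
                        Thread.sleep(holdMillis); // stand-in for the query
                    } finally {
                        holders.decrementAndGet();
                        connectionPool.release();
                    }
                    return null;
                });
            }
        } // close() waits for all requests to finish
        return maxHolders.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int max = maxConcurrentHolders(1_000, 8, 5);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("max concurrent holders=" + max + ", elapsed=" + elapsedMillis + "ms");
    }
}
```

However many Threads are started, the number doing database work never exceeds the pool size, so the elapsed time is governed by the pool and not by the Thread count.
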

Finite Resources

If an application’s scalability is constrained by access to a finite resource, then it is unlikely that “more Threads” is the solution to any scalability problems. Just like you can’t solve traffic by adding cars to a congested road, adding Threads to an already busy server may make things worse.

Some common examples of finite resources that applications can encounter are:
- CPU: If the server CPU is near 100% utilization, then the existing number of Threads are sufficient to keep it fully loaded. More and/or faster CPUs are needed before any increase in Threads could be beneficial.
- Database: Many database technologies cannot handle many concurrent requests, so parallelism is restricted. If the bottleneck is the database, then it needs to be re-engineered rather than laid siege to by more concurrent Threads.
- Local Network: An application may block reading or writing data because it has reached the limit on the local network. In such cases, more Threads will not increase throughput, but they might improve latency if some threads can progress reading new requests and have responses ready to write once the network becomes less congested. However there is a cost in waiting (see below).
- Locks: Parallel applications often use some form of lock or mutual exclusion to serialize access to common data structures. Contention on those locks can limit parallelism and require redesign rather than just more Threads.
- Caches: CPU, memory, file system and object caches are key tools in speeding up execution. However, if too many different tasks are executed concurrently, the capacity of these caches to hold relevant data may be exceeded and execution with a cold cache can be very slow. Sometimes it is better to do fewer things concurrently and serialize the excess so that caches can be more effective than trying to do everything at once.
If an application’s lack of scalability is due to Threads waiting for finite resources, then any additional Threads (Platform or Virtual) are unlikely to help and may make your application less stable. At best, careful redesign is needed before Thread counts can be increased in any advantageous way.

Infinite (OK Scalable) Resources

Not all resources are finite and some can be considered infinite, at least for some purposes. But let’s call them “Scalable” rather than infinite. Examples of scalable resources that an application may block on include:
- Database: Not all databases are created equal and some types of database have scalability in excess of the request rates experienced by a server. However, such scalability often comes at a latency cost as the database may be remote and/or distributed, thus applications may block waiting for the database, even if it has capacity to handle more requests in parallel.
- Micro services: A scalable database is really just a specific example of a micro service that may be provided by a remote and/or distributed system that has effectively infinite capacity at the cost of some latency. Applications can often find themselves waiting on one or more such services.
- Remote Networks: Local data center networks are often very VERY fast and in many situations they can outstrip even the combined capacity of many client systems. An application sending/receiving larger content may block writing/reading them due to a slow client, but still have enough local network capacity to communicate with many other clients in parallel.
- Local Filesystems: Typically file systems are faster than networks, but slower than CPU. They also may have significant latency vs throughput tradeoffs (less so now that drives seldom need to spin physical disks). Thus Threads may block on local IO even though there is additional capacity available.
Applications that lack scalability due to Threads waiting for such scalable resources may benefit from more Threads. Whilst some Threads are waiting for a database, micro service, slow client network or file system, it is likely that other Threads can progress even if they need to access the same types of resources.

Platform Thread pools can easily be increased to many thousands of Threads or more before typical servers will have memory issues. If scalability is needed beyond that, then Virtual Threads can offer practically unlimited additional Threads, but see the caveats below.

Furthermore, the fast starting Virtual Threads can be of significant benefit in situations where small jobs with long latency can be carried out in parallel. Consider an application that processes requests using data from several micro services, each with some access latency. If these are done serially, then the total request latency is the summation of all. Sometimes asynchronous code is used to execute micro service requests in parallel, but spinning up a couple of Virtual Threads in this situation is simpler, less error prone and applicable to more APIs.
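
As a rough sketch of that pattern, assuming Java 21 or later: `callService` below is a hypothetical stand-in for a remote micro service call (it only sleeps), and the service names and 100 ms latencies are made up.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLookups {
    // Hypothetical remote call: sleeps to simulate access latency, then returns a value.
    static String callService(String name, long latencyMillis) throws InterruptedException {
        Thread.sleep(latencyMillis);
        return name + "-data";
    }

    // Blocking style code that still performs both lookups concurrently.
    public static String handleRequest() {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> user = executor.submit(() -> callService("user", 100));
            Future<String> orders = executor.submit(() -> callService("orders", 100));
            // Each get() blocks, but both calls are already in flight on their own Threads.
            return user.get() + "," + orders.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        String result = handleRequest();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        // Elapsed time should be close to the slowest call rather than the sum of both.
        System.out.println(result + " in " + elapsedMillis + "ms");
    }
}
```

The request handling code stays sequential to read, with no callbacks, and the same approach works for any blocking client API.
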

Too Much of a Good Thing?

There is also some concern about using Virtual Threads with low latency scalable resources that seldom block. Since Virtual Threads are not preempted, there can be starvation and/or fairness problems if they are not blocked by slow resources. This is probably a good problem to have, but it will need some management at extreme scales for some applications.

The Cost of Waiting

We have identified that there are indeed scalable resources on which an application may wait with many Threads. However, there is no such thing as a free lunch and waiting Threads may have a significant cost, even if they are Virtual. Specifically how/where an application waits can greatly affect resource usage.

Consider a traditional application server with a limited Thread pool that is running near capacity, but with additional demand. While the 200 odd Threads are busy handling 200 concurrent requests, there are additional requests waiting to be handled. However, in an asynchronous server like Jetty, those additional requests can be cheaply parked and may be represented just by a single set bit in a selector or perhaps a tiny entry in a queue that holds only a reference to a connection that is ready to be read.

Now consider if requests were serviced by Virtual Threads instead of waiting for a pooled Platform Thread to become available. Pending requests would be allowed to proceed to some blocking point in the application. Waiting like this within the application can have additional expenses including:
- An input buffer will be allocated to read the request and any content it has.
- A read is performed into the input buffer, thus removing network back pressure so a client is enabled to send more request/data even if the server is unable to handle them.
- An object representation of the request will be built, containing at least the metadata and frequently some application data if there is an XML or JSON payload.
- Sessions may be activated and brought into memory from caches or passivation stores.
- The allocated Thread runs deep inside the application code, potentially reaching near maximal stack depth.
- Application objects created on the heap are held in memory with references from the stack.
- An output buffer may be allocated, along with additional character conversion resources.
When request handling blocks within the application, all these additional resources may be allocated and held during that wait. Worse still, because of the lack of back pressure, a client may send more request/data resulting in more Threads and associated resources being allocated and also being held whilst the application waits for some resource.

Provisioning for the Worst Case

We have seen that there are indeed applications that may benefit from having additional Threads available to service requests. But we have also seen that such additional Threads may incur additional costs beyond just the stack size. Waiting/Blocking within an application will typically be done with a deep stack and other resources allocated. Whilst Virtual Threads might be effectively infinite, it is unlikely that these other required resources are equally scalable.

When an application experiences a worst case peak in load, then ultimately some resource will run out. To provide good Quality of Service, it is vital that such resource exhaustion is handled gracefully, allowing some request handling to continue rather than suffering catastrophic failure.

With traditional Platform Thread based pools, stack memory is already provisioned for worst case stacks for all Threads and the thread pool size limit is also an indirect limit on the number of concurrent resources used. Threads have sufficient resources available to complete their handling whilst any excess requests suffer latency whilst waiting cheaply for an available Thread. Furthermore, the back pressure resulting from not reading all offered requests can prevent additional load from being sent by the clients. Thread limits are imperfect resource limits, but at least they are some kind of limit that can provide some graceful degradation under load.

Alternatively, an application using Virtual Threads that has no explicit resource management will be likely to exhaust some of the resources used by those Threads. This can result in an OutOfMemoryError or similar, as the unlimited Virtual Threads each allocate deep stacks and other resources needed for request handling. The cost of average memory savings may be insufficient provisioning for the worst case, resulting in catastrophic failure rather than graceful degradation. An analogy is that building more roads can actually make traffic worse if the added cars overwhelm other infrastructure.

Many applications are written without explicit resource limitations/management. Instead they rely on the imperfect Thread pool for at least some minimal protection. If that is removed, then some form of explicit resource limitation/management is likely to be needed in its place. Stable servers need to be provisioned for the worst case, not the average one.
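
One simple form of explicit limit is a counting semaphore around request handling. The sketch below is only an illustration of the idea, not a Jetty API: the `RequestLimiter` class, its `handle` method and the "503" fallback value are all hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class RequestLimiter {
    private final Semaphore permits;

    public RequestLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the handler if a permit is free; otherwise returns the `rejected` value
    // immediately (for example a 503 response) instead of waiting inside the application.
    public <T> T handle(Supplier<T> handler, T rejected) {
        if (!permits.tryAcquire())
            return rejected;
        try {
            return handler.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        RequestLimiter limiter = new RequestLimiter(1);
        // The nested call arrives while the only permit is held, so it is rejected.
        String result = limiter.handle(() -> "outer+" + limiter.handle(() -> "inner", "503"), "503");
        System.out.println(result); // prints "outer+503"
    }
}
```

Rejecting early keeps the worst case bounded: at most `maxConcurrent` requests hold buffers, sessions and deep stacks at once, whatever kind of Thread is running them.
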

Conclusion

There are applications that can scale better if more Threads are available, but it is not all applications (at least not without significant redesign). Consideration needs to be given to what will limit the worst case load for a server/application if it is not to be Threads. Specifically, the costs of waiting within the application may be such that scalability is likely to have a limit that will not be enforced by practically infinite Virtual Threads.

It may be that resources have limitations well within the capacity of large but limited Platform Thread pools, which are perfectly capable of scaling to many thousands of threads. So experiments with scaling a Platform Thread pool should first be used to see what limits do apply to an application.

If no upper limit is found before Platform Threads exhaust kernel memory, then Virtual Threads will allow scaling beyond that limit until some other limit is found. Thus the ultimate resource limit will need to be explicitly managed if catastrophic failure is to be avoided (but, to be fair, applications using Thread pools should also do some explicit resource limit management rather than rely just on the coarse limits of a Thread pool).

Recommendation

If Virtual Threads are not the general solution to scalability then what is? There is no one-size-fits-all solution, but I believe many applications that are limited by blocking on the network would benefit from being deployed in a server like Eclipse Jetty, that can do much of the handling for them asynchronously. Let Jetty read your requests asynchronously and prepare the content as parsed JSON, XML, or form data. Only then allocate a Thread (Virtual or Platform) with a large output buffer so the application can be written in blocking style, but will not block on either reading the request or writing the response. Finally, once the response is prepared, then let Jetty flush it to the network asynchronously. Jetty has always somewhat supported this model (e.g. by delaying dispatch to a Servlet until the first packet of data arrives), but with Jetty-12 we are adding more mechanisms to asynchronously prepare requests and flush responses, whilst leaving the application written in blocking style. More to come on this in future blogs!
18/10/2023
New Jetty 12 Maven Coordinates

Now that Jetty 12.0.1 is released to Maven Central, we’ve started to get a few questions about where some artifacts are, or when we intend to release them (as folks cannot find them).

Things have changed with Jetty, starting with the 12.0.0 release.

First, our historical versioning of <servlet_support>.<major>.<minor> is no longer being used.

With Jetty 12, we are now using a more traditional <major>.<minor>.<patch> versioning scheme for the first time.

Also new in Jetty 12 is that the Servlet layer has been separated away from the Jetty Core layer.

The Servlet layer has been moved to the new Environments concept introduced with Jetty 12.

| Environment | Jakarta EE | Servlet | Jakarta Namespace | Jetty GroupID |
|---|---|---|---|---|
| ee8 | EE8 | 4 | javax.servlet | org.eclipse.jetty.ee8 |
| ee9 | EE9 | 5 | jakarta.servlet | org.eclipse.jetty.ee9 |
| ee10 | EE10 | 6 | jakarta.servlet | org.eclipse.jetty.ee10 |

Table: Jetty Environments

This means the old Servlet specific artifacts have been moved to environment specific locations both in terms of Java namespace and also their Maven Coordinates.

Example:

Jetty 11 – Using Servlet 5
Maven Coord: org.eclipse.jetty:jetty-servlet
Java Class: org.eclipse.jetty.servlet.ServletContextHandler

Jetty 12 – Using Servlet 6
Maven Coord: org.eclipse.jetty.ee10:jetty-ee10-servlet
Java Class: org.eclipse.jetty.ee10.servlet.ServletContextHandler

We have a migration document which lists all of the migrated locations from Jetty 11 to Jetty 12.

These new versioning and environment features built into Jetty mean that new major versions of Jetty will not be as common as they have been in the past.

20/09/2023
Introducing Jetty-12
For the last 18 months, Webtide engineers have been working on the most extensive overhaul of the Eclipse Jetty HTTP server and Servlet container since its inception in 1995. The headline for the release of Jetty 12.0.0 could be “Support for the Servlet 6.0 API from Jakarta EE 10“, but the full story is of a root and branch overhaul and modernization of the project to set it up for yet more decades of service.

This blog is an introduction to the features of Jetty 12, many of which will be the subject of further deep-dive blogs.

Servlet API independent

In order to support the Servlet 6.0 API, we took the somewhat counterintuitive approach of making Jetty Servlet API independent. Specifically we have removed any dependency on the Servlet API from the core Jetty HTTP server and handler architecture. This is taking Jetty back to its roots as it was Servlet API independent for the first decade of the project.

The Servlet API independent approach has the following benefits:
- There is now a set of jetty-core modules that provide a high performance and scalable HTTP server. The jetty-core modules are usable directly when there is no need for the Servlet API and the overhead introduced by its features and legacy.
- For projects like Jetty, support must be maintained for multiple versions of the Servlet APIs. We are currently supporting branches for Servlet 3.1 in Jetty 9.4.x; Servlet 4.0 in Jetty 10.0.x; and Servlet 5.0 in Jetty 11.0.x. Adding a fourth branch to maintain would have been intolerable. With Jetty 12, our ongoing support for Servlet 4.0, 5.0 and 6.0 will be based on the same core HTTP server in the one branch.
- The Servlet APIs have many deprecated features that are no longer best practise. With Servlet 6.0, some of these were finally removed from the specification (e.g. Object Wrapper Identity). Removing these features from the Jetty core modules allows for better performance and cleaner implementations of the current APIs.
Multiple EE Environments

To support the Servlet APIs (and related Jakarta EE APIs) on top of the jetty-core, Jetty 12 uses an Environment abstraction that introduces another tier of class loading and configuration. Each Environment holds the applicable Jakarta EE APIs needed to provide Servlet support (but not the full suite of EE APIs).

Multiple environments can be run simultaneously on the same server and Jetty-12 supports:
- EE8 (Servlet 4.0) in the javax.* namespace,
- EE9 (Servlet 5.0) in the jakarta.* namespace with deprecated features,
- EE10 (Servlet 6.0) in the jakarta.* namespace without deprecated features,
- Core environments with no Servlet support or overhead.
The implementations of the EE8 & EE9 environments are taken substantially from the current Jetty-10 and Jetty-11 releases, so that applications that are dependent on those can be deployed on Jetty-12 with minimal risk of changes in behaviour (i.e. they are somewhat “bug for bug compatible”). Even if there is no need to simultaneously run different environments, the upgrading of applications to current and future releases of the Jakarta EE specifications will be simpler as it is decoupled from a major release of the server itself. For example, it is planned that EE 11 support (probably with Servlet 6.1) will be made available in a Jetty 12.1.0 release rather than in a major upgrade to a 13.0.0 release.
Core Environment
As mentioned above, the jetty-core modules are now available for direct support of HTTP without the need for the overhead and legacy of the Servlet API. As part of this effort many APIs have been updated and refined:
- The core Sessions are now directly usable
- A core Security model has been developed, that is used to implement the Servlet security model, but avoids some of the bizarre behaviours (I’m talking about you exposed methods!).
- The Jetty Websocket API has been updated and can be used over the top of the core Websocket APIs
- The Jetty HttpClient APIs have been updated.
Performance

Jetty 12 has achieved significant performance improvements. Our continuous performance tracking indicates that we have equal or better CPU utilisation for given load with lower latency and no long tail of quality of service.

Our tests currently offer 240,000 requests per second and then measure quality of service by latency (99th percentile and maximum). Below is the plot of latency for Jetty 11:

This shows that the orange 99th percentile latency is almost too small in the plot to see (at 24.1 µs average), and all you do see is the yellow plot of the maximal latency (max 1400 µs). Whilst these peaks look large, the scale is in microseconds, so the longest maximal delay is just over 1.4 milliseconds and 99% of requests are handled in 0.024ms!

Below is the same plot of latency for Jetty 12 handling 240,000 requests per second:

The 99th percentile latency is now only 20.2 µs and the peaks are less frequent and rarely over 1 ms, with the maximum of 1100µs.

You can see the latest continuous performance testing of jetty-12 here.

New Asynchronous IO abstraction
In the jetty-core is a new asynchronous abstraction that is a significant evolution of the asynchronous approaches developed in Jetty over many previous releases.

But “Loom” I hear some say. Why be asynchronous if “Loom” will solve all your problems? Firstly, Loom is not a silver bullet, and we have seen no performance benefits of adopting Loom in the core of Jetty. If we were to adopt Loom in the core we’d lose the significant benefits of our advanced execution strategy (which ensures that tasks have a good chance of being executed on a CPU core with a hot cache filled with the relevant data).

However, there are definitely applications that will benefit from the simple scaling offered by Loom’s virtual Threads, thus Jetty has taken the approach to stay asynchronous in the core, but to have optional support of Loom in our Execution strategy. Virtual threads may be used by the execution strategy, rather than submitting blocking jobs to a thread pool. This is a best of both worlds approach as it lets us deal with the highly complex but efficient/scalable asynchronous core, whilst letting applications be written in blocking style and still scale.
But I hear others say: “why yet another async abstraction when there are already so many: reactive, Flow, NIO, servlet, etc”? Adopting a simple but powerful core async abstraction allows us to simply adapt to support many other abstractions: specifically Servlet asynchronous IO, Flow and blocking InputStream/OutputStream are trivial to implement. Other features of the abstraction are:
- Input side can be used iteratively, avoiding deep stacks and needless dispatches. Borrowed from Servlet API.
- Demand API simplified from Flow/Reactive
- Retainable ByteBuffers for zero copy handling
- Content abstraction to simply handle errors and trailers inline.
The asynchronous APIs are available to be used directly in jetty-core, or applications may simply wrap them in alternative asynchronous or blocking APIs, or simply use Servlets and never see them (but benefit from them).

Below is an example of using the new APIs to asynchronously read content from a Content.Source into a string:
```
public static class FutureString extends CompletableFuture<String> {
  private final CharsetStringBuilder text;
  private final Content.Source source;

  public FutureString(Content.Source source, Charset charset) {
    this.source = source;
    this.text = CharsetStringBuilder.forCharset(charset);
    source.demand(this::onContentAvailable);
  }

  private void onContentAvailable() {
    while (true) {
      Content.Chunk chunk = source.read();
      if (chunk == null) {
        source.demand(this::onContentAvailable);
        return;
      }

      try {
        if (Content.Chunk.isFailure(chunk))
          throw chunk.getFailure();

        if (chunk.hasRemaining())
          text.append(chunk.getByteBuffer());

        if (chunk.isLast() && complete(text.build()))
          return;
      } catch (Throwable e) {
        // Fail the future and stop; without this return the loop would keep reading.
        completeExceptionally(e);
        return;
      } finally {
        chunk.release();
      }
    }
  }
}
```
The asynchronous abstraction will be explained in detail in a later blog, but a few things are worth noting about the code above:
- There are no data copies into buffers (as is often needed with read(byte[] buffer) style APIs). The chunk may be a slice of a buffer that was read directly from the network and there are retain() and release() to allow references to be kept if need be.
- All data and meta flows via pull style calls to the Content.Source.read() method, including bytes of content, failures and EOF indication. Even HTTP trailers are sent as Chunks. This avoids the mutual exclusion that can be needed if there are onData and onError style callbacks.
- The read style is iterative, so there is less need to break down code into multiple callback methods.
- The only callback is to the onContentAvailable method that is passed to Content.Source#demand(Runnable) and is called back when demand is met (i.e. read can be called with a non null return).
Handler, Request & Response design

The core building blocks of a Jetty Server are the Handler, Request and Response interfaces. These have been significantly revised in Jetty 12 to:
- Fully embrace and support the asynchronous abstraction. The previous Handler design predated asynchronous request handling and thus was not entirely suitable for purpose.
- The Request is now immutable, which solves many issues (see “Mutable Request” in Less is More Servlet API) and allows for efficiencies and simpler asynchronous implementations.
- Duplication has been removed from the APIs so that wrapping requests and responses is now simpler and less error prone (e.g. there is no longer the need to wrap both a sendError and setStatus method to capture the response status).
Here is an example Handler that asynchronously echoes all of a request’s content back to the response, including any Trailers:
```
public boolean handle(Request request, Response response, Callback callback) {
  response.setStatus(200);
  long contentLength = -1;
  for (HttpField field : request.getHeaders()) {
    if (field.getHeader() != null) {
      switch (field.getHeader()) {
        case CONTENT_LENGTH -> {
          response.getHeaders().add(field);
          contentLength = field.getLongValue();
        }
        case CONTENT_TYPE -> response.getHeaders().add(field);
        case TRAILER -> response.setTrailersSupplier(HttpFields.build());
        case TRANSFER_ENCODING -> contentLength = Long.MAX_VALUE;
      } 
    } 
  } 
  if (contentLength > 0)
    Content.copy(request, response, Response.newTrailersChunkProcessor(response), callback);
  else
    callback.succeeded();
  return true;
}
```
Security

With sponsorship from the Eclipse Foundation and the Open Source Technology Improvement Fund, Webtide was able to engage Trail of Bits for a significant security collaboration. There have been 25 issues of various severity discovered, including several which have resulted in CVEs against the previous Jetty releases. The Jetty project has a good security record and this collaboration is proving a valuable way to continue that.

Big update & cleanup

Jetty is a 28 year old project. A bit of cruft and legacy has accumulated over that time, not to mention that many RFCs have been obsoleted (several times over) in that period.

The new architecture of Jetty 12, together with the name space break of jakarta.* and the removal of deprecated features in Servlet 6.0, has allowed for a big clean out of legacy implementations and updates to the latest RFCs.

Legacy support is still provided where possible, either by compliance modes selecting older implementations or just by using the EE8/EE9 Environments.

Conclusion

The Webtide team is really excited to bring Jetty 12 to the market. It is so much more than just a Servlet 6.0 container, offering a fabulous basis for web development for decades more to come.
18/05/2023
Less is More? Evolving the Servlet API!
With the release of the Servlet API 5.0 as part of Eclipse Jakarta EE 9.0, the standardization process has completed its move from the now-defunct Java Community Process (JCP) to being fully open source at the Eclipse Foundation, including the new Jakarta EE Specification Process (JESP) and the transition of the APIs from the javax.* to the jakarta.* namespace. The move represents a huge amount of work from many parties, but ultimately it was all meta work, in that the Servlet 5.0 API is identical to the 4.0 API in all regards but name, licenses, and process, i.e. nothing functional has changed.

But with the transition behind us, the Servlet API project is now free to develop the standard into a 5.1 or 6.0 release. So in this blog, I will put forward my ideas for how we should evolve the Servlet specification: specifically, I think that before we add new features to the API, it is time to remove some.

Backward Compatibility

Version 1.0 was created in 1997 and it is amazing that over 2 decades later, a Servlet written against that version should still run in the very latest EE container. So why, with such a great backward-compatible record, should we even contemplate introducing breaking changes to future Servlet API specifications? Let’s consider some of the reasons that a developer might choose to use EE Servlets over other available technologies:

Performance

Not all web applications need high performance and when they do, it is seldom the Servlet container itself that is the bottleneck. Yet pure performance remains a key selection criterion for containers, as developers either wish to have the future possibility of high request rates or need every spare cycle available to help their application meet an acceptable quality of service. There is also the environmental impact of the carbon footprint of unnecessary cycles wasted in the trillions upon trillions of HTTP requests executed. Thus application containers always compete on performance, but unfortunately many of the features added over the years have had detrimental effects on overall performance, as they often break the “No Taxation without Representation” principle: that there should not be a cost for all requests for a feature only used by <1%.

Features

Developers seek to have the current best practice features available in their container. This may be as simple as changing from byte[] to ByteBuffers or Collections, or it may be more fundamental integration of things such as dependency injection, coding by convention, asynchronous, reactive, etc. The specification has done a reasonable job supporting such features over the years, but mistakes have been made and some features now clash, causing ambiguity and complexity. Ultimately feature integration can be an N² problem, so reducing or simplifying existing features can greatly reduce the complexity of introducing new features.

Portability

The availability of multiple implementations of the Servlet specification is a key selling point. However, the very same issues of poor integration of many features have resulted in too many dark corners of the specification where the expected behavior of a container is simply not defined, so portability is by no means guaranteed. Too often we find ourselves needing to be bug-for-bug compatible with other implementations rather than following the actual specification.

Familiarity

Any radical departure from the core Servlet API will force developers away from what they know and to evaluate alternatives. But there are many non-core features in the API, and this blog will make the case that there are some features which can be removed and/or simplified while hardly being noticed by the bulk of applications. My aim with this blog is that your typical Servlet developer will think: “why is he making such a big fuss about something I didn’t know was there”, whilst your typical Servlet container implementer will think “Exactly! that feature is such a PITA!!!”.

If the Servlet API is to continue to be relevant, then it needs to be able to compete with state-of-the-art HTTP servers that do not support decades of EE legacy. Legacy can be both a strength and a weakness, and I believe now is the time to focus on the former. The namespace break from javax.* to jakarta.* has already introduced a discontinuity in backward compatibility. Keeping 5.0 identical in all but name to 4.0 was the right thing to do to support automatic porting of applications. However, it has also given developers a reason to consider alternatives, so now is the time to act to ensure that Servlet 6.0 is a good basis for the future of EE Servlets.

Getting Cross about Cross-Context Dispatch

Let’s just all agree upfront, without going into the details, that cross-context dispatch is a bad thing. For the purposes of the rest of this blog, I’m ignoring the many issues of cross-context dispatch. I’ll just say that every issue I will discuss below becomes even more complex when cross-context dispatch is considered, as it introduces: additional class loaders; different session values in the same session ID space; different authentication realms; authorization bypass. Don’t even get me started on the needless mind-bending complexities of a context that forwards to another then forwards back to the original…

Modern web applications are now often broken up into many microservices, so the concept of one webapp invoking another is not in itself bad, but the idea of those services being co-located in the same container instance is not a very general or flexible assumption. By all means, the Servlet API should support a mechanism to forward or include other resources, but ideally, this should be done in a way that works equally for co-resident, co-located, and remote resources.

So let’s just assume cross-context dispatch is already dead.

Exclude Include

The concept of including another resource in a response should be straightforward, but the specification of RequestDispatcher.include(...) is just bizarre!
```
@WebServlet(urlPatterns = {"/servletA/*"})
public static class ServletA extends HttpServlet
{
    @Override
    protected void doGet(HttpServletRequest request,
                         HttpServletResponse response) throws ServletException, IOException
    {
        request.getRequestDispatcher("/servletB/infoB").include(request, response);
    }
}
```
The ServletA above includes ServletB in its response. However, whilst within ServletB, any calls to getServletPath() or getPathInfo() will still return the original values used to call ServletA, rather than the “/servletB” or “/infoB” values for the target Servlet (as is done for a call to forward(...)). Instead, the container must set an ever-growing list of Request attributes to describe the target of the include, and any non-trivial Servlet that acts on the actual URI path must do something like:
```
protected void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException
{
    String servletPath;
    String pathInfo;
    if (request.getAttribute(RequestDispatcher.INCLUDE_REQUEST_URI) != null)
    {
        servletPath = (String)
            request.getAttribute(RequestDispatcher.INCLUDE_SERVLET_PATH);
        pathInfo = (String)
            request.getAttribute(RequestDispatcher.INCLUDE_PATH_INFO);
    }
    else
    {
        servletPath = request.getServletPath();
        pathInfo = request.getPathInfo();
    }
    String pathInContext = URIUtil.addPaths(servletPath, pathInfo);
    // ...
}
```
Most Servlets do not do this, so they cannot correctly be the target of an include. The Servlets that do check correctly are, more often than not, needlessly wasting CPU cycles for the vast majority of requests that are not included.

Meanwhile, the container itself must set (and then reset) at least 5 attributes, just in case the target resource might look up one of them. Furthermore, the container must disable most of the APIs on the response object during an include, to prevent the included resource from setting the headers. So the included Servlet must be trusted to know that it is being included in order to serve the correct resource, but is then not trusted to not call APIs that are inconsistent with that knowledge. Servlets should not need to know the details of how they were invoked in order to generate a response. They should just use the paths and parameters of the request passed to them to generate a response, regardless of how that response will be used.

Ultimately, there is no need for an include API given that the specification already has a reasonable forward mechanism that supports wrapping. The ability to include one resource in the response of another can be provided with a basic wrapper around the response:
```
@WebServlet(urlPatterns = {"/servletA/*"})
public static class ServletA extends HttpServlet
{
    @Override
    protected void doGet(HttpServletRequest request,
                         HttpServletResponse response) throws ServletException, IOException
    {
        request.getRequestDispatcher("/servletB/infoB")
            .forward(request, new IncludeResponseWrapper(response));
    }
}
```
Such a response wrapper could also do useful things like ensuring the included content-type is correct and better dealing with error conditions rather than ignoring an attempt to send a 500 status. To assist with porting, the include method can be deprecated and its implementation replaced with a request wrapper that reinstates the deprecated request attributes:
```
@Deprecated
default void include(ServletRequest request, ServletResponse response)
    throws ServletException, IOException
{
    forward(new Servlet5IncludeAttributesRequestWrapper(request),
            new IncludeResponseWrapper(response));
}
```
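As a rough illustration of the response wrapper idea, the sketch below models it without the Servlet API. MiniResponse, BufferResponse, and IncludeWrapper are hypothetical stand-ins invented for this example (they are not Servlet or Jetty types). The wrapper passes body content through but ignores attempts to change the status or headers, so the target can be written as an ordinary resource.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for a response; not the Servlet API.
interface MiniResponse
{
    void setStatus(int status);
    void setHeader(String name, String value);
    void write(String content);
}

// A trivial response that records what was set on it.
class BufferResponse implements MiniResponse
{
    int status = 200;
    final Map<String, String> headers = new LinkedHashMap<>();
    final StringBuilder body = new StringBuilder();

    @Override public void setStatus(int status) { this.status = status; }
    @Override public void setHeader(String name, String value) { headers.put(name, value); }
    @Override public void write(String content) { body.append(content); }
}

// Gives "include" semantics to a plain forward:
// content passes through, status and headers are protected.
class IncludeWrapper implements MiniResponse
{
    private final MiniResponse wrapped;

    IncludeWrapper(MiniResponse wrapped) { this.wrapped = wrapped; }

    @Override public void setStatus(int status) { /* ignored during an include */ }
    @Override public void setHeader(String name, String value) { /* ignored during an include */ }
    @Override public void write(String content) { wrapped.write(content); }
}

class IncludeDemo
{
    // Stands in for the included resource: it does not know it is wrapped.
    static void targetB(MiniResponse response)
    {
        response.setStatus(500);
        response.setHeader("Content-Type", "text/plain");
        response.write("from B");
    }

    public static void main(String[] args)
    {
        BufferResponse response = new BufferResponse();
        response.write("A[");
        targetB(new IncludeWrapper(response));
        response.write("]");
        System.out.println(response.status + " " + response.headers + " " + response.body);
    }
}
```

In this model the outer response keeps its 200 status and empty headers while the body of B is embedded in the output of A. A real wrapper would likely do more, for example reporting the attempted 500 rather than silently dropping it.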
Dispatch the DispatcherType

The inclusion of the method Request.getDispatcherType() in the Servlet API is almost an admission of defeat that the specification got it wrong in so many ways that required a Servlet to know how and/or why it is being invoked in order to function correctly. Why must a Servlet know its DispatcherType? Probably so it knows it has to check the attributes for the corresponding values? But what if an error page is generated asynchronously by including a resource that forwards to another? In such a pathological case, the request will contain attributes for ERROR, ASYNC, and FORWARD, yet the type will just be FORWARD.

The concept of DispatcherType should be deprecated and it should always return REQUEST. Backward compatibility can be supported by optionally applying a wrapper that determines the deprecated DispatcherType only if the method is called.

Unravelling Wrappers

A key feature that really needs to be revised is 6.2.2 Wrapping Requests and Responses, introduced in Servlet 2.3. The core concept of wrappers is sound, but the requirement of Wrapper Object Identity (see Object Identity Crisis below) has significant impacts. But first let’s look at a simple example of a request wrapper:
```
public static class ForcedUserRequest extends HttpServletRequestWrapper
{
    private final Principal forcedUser;
    public ForcedUserRequest(HttpServletRequest request, Principal forcedUser)
    {
        super(request);
        this.forcedUser = forcedUser;
    }
    @Override
    public Principal getUserPrincipal()
    {
        return forcedUser;
    }
    @Override
    public boolean isUserInRole(String role)
    {
        return forcedUser.getName().equals(role);
    }
}
```
This request wrapper overrides the existing getUserPrincipal() and isUserInRole(String) methods to force a user identity. This wrapper can be applied in a filter or in a Servlet as follows:
```
@WebServlet(urlPatterns = {"/servletA/*"})
public static class ServletA extends HttpServlet
{
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException
    {
        request.getServletContext()
            .getRequestDispatcher("/servletB" + request.getPathInfo())
            .forward(new ForcedUserRequest(request, new UserPrincipal("admin")),
                     response);
    }
}
```
Such wrapping is an established pattern in many APIs and is mostly without significant problems. For Servlets there are some issues: it should be better documented whether the wrapped user identity is propagated if ServletB makes any EE calls (I think no?); some APIs have become too complex to sensibly wrap (e.g. HttpInputStream with non-blocking IO). But even with these issues, there are good safe usages for this wrapping to override existing methods.

Object Identity Crisis!

The Servlet specification allows for wrappers to do more than just override existing methods! In 6.2.2, the specification says that:

“… the developer not only has the ability to override existing methods on the request and response objects, but to provide new API… “

So the example above could introduce new API to access the original user principal:
```
public static class ForcedUserRequest extends HttpServletRequestWrapper
{
    // ... getUserPrincipal & isUserInRole as above
    public Principal getOriginalUserPrincipal()
    {
        return super.getUserPrincipal();
    }
    public boolean isOriginalUserInRole(String role)
    {
        return super.isUserInRole(role);
    }
}
```
In order for targets to use these new APIs, they must be able to downcast the passed request/response to the known wrapper type:
```
@WebServlet(urlPatterns = {"/servletB/*"})
public static class ServletB extends HttpServlet
{
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException
    {
        ForcedUserRequest myr = (ForcedUserRequest)req;
        resp.getWriter().printf("user=%s orig=%s wasAdmin=%b%n",
            req.getUserPrincipal(),
            myr.getOriginalUserPrincipal(),
            myr.isOriginalUserInRole("admin"));
    }
}
```
This downcast will only work if the wrapped object is passed through the container without any further wrapping, thus the specification requires “wrapper object identity”:

… the container must ensure that the request and response object that it passes to the next entity in the filter chain, or to the target web resource if the filter was the last in the chain, is the same object that was passed into the doFilter method by the calling filter. The same requirement of wrapper object identity applies to the calls from a Servlet or a filter to RequestDispatcher.forward or RequestDispatcher.include, when the caller wraps the request or response objects.

This “wrapper object identity” requirement means that the container is unable to itself wrap requests and responses as they are passed to filters and servlets. This restriction has, directly and indirectly, a huge impact on the complexity, efficiency, and correctness of Servlet container implementations, all for very dubious and redundant benefits:

Bad Software Components

In the example of ServletB above, it is a very bad software component as it cannot be invoked simply by respecting the signature of its methods. The caller must have a priori knowledge that the passed request will be downcast and any other caller will be met with a ClassCastException. This defeats the whole point of an API specification like Servlets, which is to define good software components that can be variously assembled according to their API contracts.

No Multiple Concerns

It is not possible for multiple concerns to wrap request/responses. If another filter applies its own wrappers, then the downcast will fail. The requirement for “wrapper object identity” requires the application developer to have total control over all aspects of the application, which can be difficult with discovered web fragments and ServletContainerInitializers.

Mutable Requests

By far the biggest impact of “wrapper object identity” is that it forces requests to be mutable! Since the container is not allowed to do its own wrapping within RequestDispatcher.forward(...) then the container must make the original request object mutable so that it changes the value returned from getServletPath() to reflect the target of the dispatch. It is this mutability that has significant impacts on complexity, efficiency, and correctness:

Mutating the underlying request makes the example implementation of isOriginalUserInRole(String) incorrect because it calls super.isUserInRole(String) whose result can be mutated if the target Servlet has a run-as configuration. Thus this method will inadvertently return the target rather than the original role.

There is the occasional need for a target Servlet to know details of the original request (often for debugging), but the original request can mutate so it cannot be used. Instead, an ever-growing list of Request attributes must be set and then cleared on the original request, just in case of the small chance that the target will need one of them. A trivial forward of a request can thus require at least 12 Map operations just to make available the original state, even though it is very seldom required. Also, some aspects of the event history of a request are not recoverable from the attributes: the result of the isUserInRole method; the original target of an include that does another include.

Mutable requests cannot be safely passed to asynchronous processes, because there will be a race between the other thread’s call to a request method and any mutations required as the request propagates through the Servlet container (see the “Off to the Races” example below). As a result, asynchronous applications SHOULD copy all the values from the request that they MIGHT later need… or more often than not they don’t, and many work by good luck, but may fail if timing on the server changes.

Using immutable objects can have significant benefits by allowing the JVM optimizer and GC to have knowledge that field values will not change. By forcing the containers to use mutable request implementations, the specification removes the opportunity to access these benefits. Worse still, the complexity of the resulting request object makes them rather heavyweight and thus they are often recycled in object pools to save on the cost of creation. Such pooled objects used in asynchronous environments can be a recipe for disaster as asynchronous processes may reference a request object after it has been recycled into another request.

Unnecessary

New APIs can be passed on objects set as request attribute values that will pass through multiple other wrappers, coexist with other new APIs in attributes and do not require the core request methods to have mutable returns.

The “wrapper object identity” requirement has little utility yet significant impacts on the correctness and performance of implementations. It significantly impairs the implementation of the container for a feature that can be rendered unusable by a wrapper applied by another filter. It should be removed from Servlet 6.0 and requests passed in by the container should be immutable.
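To make the attribute-based alternative concrete, here is a minimal sketch that avoids the Servlet API altogether. MiniRequest, BaseRequest, PassThroughWrapper, OriginalUser, and the attribute name are all hypothetical, invented for this illustration. The point is only that an API carried as an attribute value is still reachable after any number of additional wrappers, whereas a downcast of the request itself would fail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for a request; not the Servlet API.
interface MiniRequest
{
    Object getAttribute(String name);
    void setAttribute(String name, Object value);
}

class BaseRequest implements MiniRequest
{
    private final Map<String, Object> attributes = new HashMap<>();

    @Override public Object getAttribute(String name) { return attributes.get(name); }
    @Override public void setAttribute(String name, Object value) { attributes.put(name, value); }
}

// Any wrapper, whether applied by the container or by another filter.
class PassThroughWrapper implements MiniRequest
{
    private final MiniRequest wrapped;

    PassThroughWrapper(MiniRequest wrapped) { this.wrapped = wrapped; }

    @Override public Object getAttribute(String name) { return wrapped.getAttribute(name); }
    @Override public void setAttribute(String name, Object value) { wrapped.setAttribute(name, value); }
}

// The "new API" is an ordinary object carried as an attribute value.
interface OriginalUser
{
    String ATTRIBUTE = "example.originalUser"; // hypothetical attribute name

    String getOriginalUserName();
}

class AttributeApiDemo
{
    // The target looks the API up instead of downcasting the request.
    static String target(MiniRequest request)
    {
        OriginalUser original = (OriginalUser)request.getAttribute(OriginalUser.ATTRIBUTE);
        return original == null ? "unknown" : original.getOriginalUserName();
    }

    public static void main(String[] args)
    {
        MiniRequest request = new BaseRequest();
        request.setAttribute(OriginalUser.ATTRIBUTE, (OriginalUser)() -> "alice");
        // Two unrelated wrappers are applied before the target is reached.
        System.out.println(target(new PassThroughWrapper(new PassThroughWrapper(request))));
    }
}
```

The target also degrades gracefully when the attribute is absent, which a ClassCastException from a failed downcast does not.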

Asynchronous Life Cycle

A bit of history

Jetty continuations were a non-standard feature introduced in Jetty-6 (around 2005) to support thread-less waiting for asynchronous events (e.g. typically another HTTP request in a chat room). Because the Servlet API had not been designed for thread-safe access from asynchronous processes, the continuations feature did not attempt to let arbitrary threads call the Servlet API. Instead, it has a suspend/resume model that once the asynchronous wait was over, the request was re-dispatched back into the Servlet container to generate a response, using the normal blocking Servlet API from a well-defined context.

When the continuation feature was standardized in the Servlet 3.0 specification, the Jetty suspend/resume model was supported with the ServletRequest.startAsync() and AsyncContext.dispatch() methods. However (against our strongly given advice), a second asynchronous model was also enabled, as represented by ServletRequest.startAsync() followed by AsyncContext.complete(). With the start/complete model, instead of generating a response by dispatching a container-managed thread, serialized on the request, to the Servlet container, arbitrary asynchronous threads could generate the response by directly accessing the request/response objects and then call the AsyncContext.complete() method when the response had been fully generated to end the cycle. The result is that the entire API, designed not to be thread safe, was now exposed to concurrent calls. Unfortunately there was (and is) very little in the specification to help resolve the many races and ambiguities that resulted.

Off to the Races

The primary race introduced by start/complete is that described above caused by mutable requests that are forced by “wrapper object identity”. Consider the following asynchronous Servlet:
```
@WebServlet(urlPatterns = {"/async/*"}, asyncSupported = true)
@RunAs("special")
public static class AsyncServlet extends HttpServlet
{
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException
    {
        AsyncContext async = request.startAsync();
        PrintWriter out = response.getWriter();
        async.start( () ->
        {
            response.setStatus(HttpServletResponse.SC_OK);
            out.printf("path=%s special=%b%n",
                       request.getServletPath(),
                       request.isUserInRole("special"));
            async.complete();
        });
    }
}
```
If invoked via a RequestDispatcher.forward(...), then the result produced by this Servlet is a race: will the thread dispatched to execute the lambda execute before or after the thread returns from the `doGet` method (and any applied filters) and the pre-forward values for the path and role are restored? Not only could the path and role be reported either for the target or caller, but the race could even split them so they are reported inconsistently. To avoid this race, asynchronous Servlets must copy any value that they may use from the request before starting the asynchronous thread, which is needless complexity and expense. Many Servlets do not actually do this and just rely on happenstance to work correctly.
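The copy-before-async workaround can be sketched in plain Java. MutableRequest and the forward method below are hypothetical models of a container that mutates and then restores the request around a dispatch; they are not Servlet or Jetty code. To keep the outcome deterministic, the deferred work is run after the forward has returned, which corresponds to the asynchronous thread losing the race.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a mutable request; not the Servlet API.
class MutableRequest
{
    private volatile String servletPath;

    MutableRequest(String servletPath) { this.servletPath = servletPath; }

    String getServletPath() { return servletPath; }
    void setServletPath(String servletPath) { this.servletPath = servletPath; }
}

class SnapshotDemo
{
    // The "servlet": returns work that some other thread will run later.
    static Supplier<String> service(MutableRequest request, boolean copyFirst)
    {
        // Copy taken while the request still describes the forward target.
        final String copiedPath = request.getServletPath();
        return copyFirst
            ? () -> "path=" + copiedPath
            : () -> "path=" + request.getServletPath();
    }

    // The "container": mutates the request for the forward, then restores it.
    static Supplier<String> forward(MutableRequest request, String target, boolean copyFirst)
    {
        String original = request.getServletPath();
        request.setServletPath(target);
        try
        {
            return service(request, copyFirst);
        }
        finally
        {
            request.setServletPath(original);
        }
    }

    public static void main(String[] args)
    {
        MutableRequest request = new MutableRequest("/caller");
        // Reading the live request after the forward returned sees the restored value.
        System.out.println(forward(request, "/target", false).get());
        // Reading the copy always sees the value for the target.
        System.out.println(forward(request, "/target", true).get());
    }
}
```

Without the copy the deferred work reports the caller's path, and with the copy it reports the target's path. With an immutable request the copy would be unnecessary.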

This problem is the result of the start/complete lifecycle of asynchronous Servlets permitting/encouraging arbitrary threads to call the existing APIs that were not designed to be thread-safe. This issue is avoided if the request object passed to doGet is immutable and if it is the target of a forward, it will always act as that target. However, there are other issues of the asynchronous lifecycle that cannot be resolved just with immutability.

Out of Time

The example below is a very typical race that exists in many applications between a timeout and asynchronous processing:
```
@Override
protected void doGet(HttpServletRequest request,
                     HttpServletResponse response) throws IOException
{
    AsyncContext async = request.startAsync();
    PrintWriter out = response.getWriter();
    async.addListener(new AsyncListener()
    {
        @Override
        public void onTimeout(AsyncEvent asyncEvent) throws IOException
        {
            response.setStatus(HttpServletResponse.SC_BAD_GATEWAY);
            out.printf("Request %s timed out!%n", request.getServletPath());
            out.printf("timeout=%dms%n ", async.getTimeout());
            async.complete();
        }
    });
    CompletableFuture<String> logic = someBusinessLogic();
    logic.thenAccept(answer ->
    {
        response.setStatus(HttpServletResponse.SC_OK);
        out.printf("Request %s handled OK%n", request.getServletPath());
        out.printf("The answer is %s%n", answer);
        async.complete();
    });
}
```
Because the handling of the result of the business logic may be executed by a non-container-managed thread, it may run concurrently with the timeout callback. The result can be an incorrect status code and/or the response content being interleaved. Even if both lambdas grab a lock to mutually exclude each other, the results are sub-optimal, as both will eventually execute and one will ultimately throw an IllegalStateException, causing extra processing and a spurious exception that may confuse developers/deployers.
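The lock-only behaviour described above can be reproduced with a small model. FakeAsyncContext is a hypothetical stand-in whose complete() rejects a second call; it is an assumption for illustration, not the real AsyncContext. The second helper shows a compare-and-set guard, a common application-level workaround under the current API rather than anything proposed in this blog.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for AsyncContext: a second complete() is an error.
class FakeAsyncContext
{
    private boolean completed;

    synchronized void complete()
    {
        if (completed)
            throw new IllegalStateException("already completed");
        completed = true;
    }
}

class TimeoutRaceDemo
{
    // A lock stops the two callbacks interleaving, but both still run.
    static Runnable locked(ReentrantLock lock, FakeAsyncContext async, StringBuilder out, String message)
    {
        return () ->
        {
            lock.lock();
            try
            {
                out.append(message);
                async.complete();
            }
            finally
            {
                lock.unlock();
            }
        };
    }

    // A compare-and-set guard lets only the first callback do any work.
    static Runnable guarded(AtomicBoolean done, FakeAsyncContext async, StringBuilder out, String message)
    {
        return () ->
        {
            if (done.compareAndSet(false, true))
            {
                out.append(message);
                async.complete();
            }
        };
    }

    public static void main(String[] args)
    {
        // Lock only: both callbacks write, and the second complete() fails.
        FakeAsyncContext async1 = new FakeAsyncContext();
        StringBuilder out1 = new StringBuilder();
        ReentrantLock lock = new ReentrantLock();
        locked(lock, async1, out1, "OK;").run();
        try
        {
            locked(lock, async1, out1, "TIMEOUT;").run();
        }
        catch (IllegalStateException e)
        {
            System.out.println("lock only: " + out1 + " -> " + e.getMessage());
        }

        // Guard: the second callback is a no-op.
        FakeAsyncContext async2 = new FakeAsyncContext();
        StringBuilder out2 = new StringBuilder();
        AtomicBoolean done = new AtomicBoolean();
        guarded(done, async2, out2, "OK;").run();
        guarded(done, async2, out2, "TIMEOUT;").run();
        System.out.println("guarded: " + out2);
    }
}
```

The guard works, but every asynchronous Servlet has to remember to add it to every callback, which is the kind of burden the container could carry instead.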

The current specification of the asynchronous life cycle is the worst of both worlds for container implementers. On the one hand, they must implement the complexity of request-serialized events, so that for a given request there can only be a single container-managed thread in service(...), doFilter(...), onWritePossible(), onDataAvailable(), onAllDataRead() and onError(); yet on the other hand, an arbitrary application thread is permitted to concurrently call the API, thus requiring additional thread-safety complexity. All the benefits of request-serialized threads are lost by the ability of arbitrary other threads to call the Servlet APIs.

Request Serialized Threads

The fix is twofold: firstly make more Servlet APIs immutable (as discussed above) so they are safe to call from other threads; secondly and most importantly, any API that does mutate state should only be able to be called from request-serialized threads! The latter might seem a bit draconian, as it will make the lambda passed to thenAccept in the example above throw an IllegalStateException when it tries to setStatus(int) or call complete(). However, there are huge benefits in complexity and correctness, and only some simple changes are needed to rework existing code.

Any code running within a call to service(...), doFilter(...), onWritePossible(), onDataAvailable(), onAllDataRead() and onError() will already be in a request-serialized thread, and thus will require no change. It is only code executed by threads managed by other asynchronous components (e.g. the lambda passed to thenAccept() above) that needs to be scoped. There is already the method AsyncContext.start(Runnable) that allows a non-container thread to access the context (i.e. classloader) associated with the request. An additional similar method AsyncContext.dispatch(Runnable) can be provided that not only scopes the execution but mutually excludes it and serializes it against any call to the methods listed above and any other dispatched Runnable. The Runnables passed may be executed within the scope of the dispatch call if possible (making the thread momentarily managed by the container and request serialized) or scheduled for later execution. Thus calls to mutate the state of a request can only be made from threads that are serialized.

To make accessing the dispatch(Runnable) method more convenient, an executor can be provided with AsyncContext.getExecutor() which provides the same semantics. The example above can now be simply updated:
```
@Override
protected void doGet(HttpServletRequest request,
                     HttpServletResponse response) throws IOException
{
    AsyncContext async = request.startAsync();
    PrintWriter out = response.getWriter();
    async.addListener(new AsyncListener()
    {
        @Override
        public void onTimeout(AsyncEvent asyncEvent) throws IOException
        {
            response.setStatus(HttpServletResponse.SC_BAD_GATEWAY);
            out.printf("Request timed out after %dms%n ", async.getTimeout());
            async.complete();
        }
    });
    CompletableFuture<String> logic = someBusinessLogic();
    logic.thenAcceptAsync(answer ->
    {
        response.setStatus(HttpServletResponse.SC_OK);
        out.printf("The answer is %s%n", answer);
        async.complete();
    }, async.getExecutor());
}
```
Because the executor from AsyncContext.getExecutor() is used to invoke the business logic consumer, the timeout and business logic response methods are mutually excluded. Moreover, because they are serialized by the container, the request state can be checked between each, so that if the business logic has completed the request, then the timeout callback will never be called, even if the underlying timer expires while the response is being generated. Conversely, if the business logic result is generated after the timeout, then the lambda to generate the response will never be called. Because both of the tasks in this example call complete, only one of them will ever be executed.
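One way such an executor might behave is sketched below. RequestSerializedExecutor is a hypothetical class written for this illustration, not Jetty code and not a proposed signature; its complete() method stands in for AsyncContext.complete(). Tasks run one at a time, on the calling thread when nothing else is running and otherwise queued behind the running task, and anything submitted after completion is dropped.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Hypothetical sketch of the proposed semantics; not Jetty or Servlet API code.
class RequestSerializedExecutor implements Executor
{
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private boolean running;
    private boolean completed;

    // Ends the request cycle: queued and later tasks are dropped.
    public void complete()
    {
        synchronized (this)
        {
            completed = true;
            queue.clear();
        }
    }

    @Override
    public void execute(Runnable task)
    {
        synchronized (this)
        {
            if (completed)
                return;
            queue.add(task);
            if (running)
                return; // the thread already running tasks will pick this one up
            running = true;
        }

        // This thread is now the only one running tasks for the request.
        while (true)
        {
            Runnable next;
            synchronized (this)
            {
                next = queue.poll();
                if (next == null)
                {
                    running = false;
                    return;
                }
            }
            try
            {
                next.run();
            }
            catch (RuntimeException x)
            {
                // A real container would report this; the sketch just keeps going.
            }
        }
    }
}

class SerializedDemo
{
    public static void main(String[] args)
    {
        RequestSerializedExecutor executor = new RequestSerializedExecutor();
        StringBuilder out = new StringBuilder();
        Runnable answer = () -> { out.append("answer"); executor.complete(); };
        Runnable timeout = () -> { out.append("timeout"); executor.complete(); };
        executor.execute(answer);
        executor.execute(timeout); // dropped: the cycle is already complete
        System.out.println(out);
    }
}
```

In this model neither callback needs its own lock or guard, and the loser of the race simply never runs, so there is no spurious IllegalStateException.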

And Now You’re Complete

In the example below, a non-blocking read listener has been set on the request input stream, thus a callback to onDataAvailable() has been scheduled to occur at some time in the future. In parallel, an asynchronous business process has been initiated that will complete the response:
```
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException
{
    AsyncContext async = request.startAsync();
    request.getInputStream().setReadListener(new MyReadListener());
    CompletableFuture<String> logicB = someBusinessLogicB();
    PrintWriter out = response.getWriter();
    logicB.thenAcceptAsync(b ->
    {
        out.printf("The answer for %s is B=%s%n", request.getServletPath(), b);
        async.complete();
    }, async.getExecutor());
}
```
The example uses the proposed APIs above so that any call to complete is mutually excluded and serialized with the call to doGet and onDataAvailable(...). Even so, the current spec is unclear whether the complete should prevent any future callback to onDataAvailable(...) or whether the effect of complete() should be delayed until the callback is made (or times out). Given that the actions can now be request-serialized, the spec should require that once a request-serialized thread that has called complete returns, then the request cycle is complete and there will be no other callbacks other than onComplete(...), thus cancelling any non-blocking IO callbacks.

To Be Removed

Before extending the Servlet specification, I believe the following existing features should be removed or deprecated:
- Cross context dispatch deprecated and existing methods return null. Once a request is matched to a context, then it will only ever be associated with that context and the getServletContext() method will return the same value no matter what state the request is in.
- The “Wrapper Object Identity” requirement is removed and the request object will be required to be immutable in regards to the methods affected by a dispatch and may be referenced by asynchronous threads.
- The RequestDispatcher.include(...) is deprecated and replaced with utility response wrappers. The existing API can be deprecated and its implementation changed to use a request wrapper to simulate the existing attributes.
- The special attributes for FORWARD, INCLUDE, ASYNC are removed from the normal dispatches. Utility wrappers will be provided that can simulate these attributes if needed for backward compatibility.
- The getDispatcherType() method is deprecated and returns REQUEST, unless a utility wrapper is used to replicate the old behavior.
- Servlet API methods that mutate state will only be callable from request-serialized container-managed threads and will otherwise throw IllegalStateException. New AsyncContext.dispatch(Runnable) and AsyncContext.getExecutor() methods will provide access to request-serialization for arbitrary threads/lambdas/Runnables.
With these changes, I believe that many web applications will not be affected and most of the remainder could be updated with minimal effort. Furthermore, utility filters can be provided that apply wrappers to obtain almost all deprecated behaviors other than Wrapper Object Identity. In return for the slight break in backward compatibility, the benefit of these changes would be significant simplifications and efficiencies of the Servlet container implementations. I believe that only with such simplifications can we have a stable base on which to build new features into the Servlet specification. If we can’t take out the cruft now, then when?

The plan is to follow this blog up with another proposing some more rationalisation of features (I’m looking at you, sessions and authentication), before another blog proposing some new features and future directions.
13/04/2021
A story about Unix, Unicode, Java, filesystems, internationalization and normalization
Recently, I’ve been investigating some test failures that I only experienced on my own machine, which happens to run some flavor of Linux. Investigating those failures, I ran down a rabbit hole that involves Unix, Unicode, Java, filesystems, internationalization and normalization. Here is the story of what I found down at the very bottom.

A story about Unix internationalization

One test that was failing is testAccessUniCodeFile, with the following exception:
```
java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: swedish-å.txt
	at java.base/sun.nio.fs.UnixPath.encode(UnixPath.java:145)
	at java.base/sun.nio.fs.UnixPath.<init>(UnixPath.java:69)
	at java.base/sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:279)
	at java.base/java.nio.file.Path.resolve(Path.java:515)
	at org.eclipse.jetty.util.resource.FileSystemResourceTest.testAccessUniCodeFile(FileSystemResourceTest.java:335)
	...
```
This test asserts that Jetty can read files with non-ASCII characters in their names. But the failure happens in Path.resolve, when trying to create the file, before any Jetty code is executed. But why?
When accessing a file, the JVM has to deal with Unix system calls. The Unix system call typically used to create a new file or open an existing one is int open(const char *path, int oflag, …); which accepts the file name as its first argument.
In this test, the file name is "swedish-å.txt" which is a Java String. But that String isn’t necessarily encoded in memory in a way that the Unix system call expects. After all, a Java String is not the same as a C const char * so some conversion needs to happen before the C function can be called.
We know the Java String is represented internally as a sequence of UTF-16 code units. But how is the C const char * actually supposed to be represented? Well, that depends. The Unix spec specifies that internationalization depends on environment variables. So the encoding of the C const char * depends on the LANG, LC_CTYPE and LC_ALL environment variables and the JVM has to transform the Java String to a format determined by these environment variables.
Let’s have a look at those in a terminal:
```
$ echo "LANG=\"$LANG\" LC_CTYPE=\"$LC_CTYPE\" LC_ALL=\"$LC_ALL\""
LANG="C" LC_CTYPE="" LC_ALL=""
$
```
C is interpreted as a synonym of ANSI_X3.4-1968 by the JVM which itself is a synonym of US-ASCII.
I’ve explicitly set this variable in my environment as some commands use it for internationalization and I appreciate that all the command line tools I use strictly stick to English. For instance:
```
$ sudo LANG=C apt-get remove calc
...
Do you want to continue? [Y/n] n
Abort.
$ sudo LANG=fr_BE.UTF-8 apt-get remove calc
Do you want to continue? [O/n] n
Abort.
$
```
Notice the prompt to the question Do you want to continue? that is either [Y/n] (C locale) or [O/n] (Belgian-French locale) depending on the contents of this variable. Up until now, I didn’t know that it also impacted what files the JVM could create or open!
Knowing that, it is now obvious why the file cannot be created: it is not possible to convert the "swedish-å.txt" Java String to an ASCII C const char * simply because there is no way to represent the å character in ASCII.
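The conversion problem can be reproduced without touching the filesystem at all. The following is a minimal sketch, not code from the Jetty test suite: the class name `FileNameEncodingCheck` and its helper are made up for illustration, and `sun.jnu.encoding` is an internal, JDK-specific system property that is assumed here to reflect the encoding the JVM picked from the locale.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileNameEncodingCheck {
    /** Returns true if every character of the name can be represented in the given charset. */
    static boolean canEncode(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String name = "swedish-\u00e5.txt"; // \u00e5 is the letter å

        // Internal property (assumption: present on OpenJDK builds); it shows the
        // encoding the JVM derived from LANG / LC_CTYPE / LC_ALL for file names.
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));

        System.out.println(canEncode(name, StandardCharsets.US_ASCII)); // false
        System.out.println(canEncode(name, StandardCharsets.UTF_8));    // true
    }
}
```

Running it once with LANG=C and once with LANG=en_US.UTF-8 should show the first line change while the two charset checks stay the same.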
Changing the LANG environment variable to en_US.UTF-8 allowed the JVM to successfully make that Java-to-C string conversion which allowed that test to pass.
Our build has now been changed to force the LC_ALL environment variable (as it is the one that overrides the other ones) to en_US.UTF-8 before running our tests to make sure this test passes even on environments with non-Unicode locales.

A story about filesystem Unicode normalization

There was an extra pair of failing tests, that reported the following error:
```
java.lang.AssertionError:
Expected: is <404>
     but: was <200>
Expected :is <404>
Actual   :<200>
```
For context, those tests are about creating a file with a non-ASCII name encoded in some way and trying to serve it over HTTP with a request to the same non-ASCII name encoded in a different way. This is needed because Unicode supports different normalization forms, notably Normalization Form Canonical Composition (NFC) and Normalization Form Canonical Decomposition (NFD). For our example string “swedish-å.txt”, this means there are two ways to encode the letter “å”: either U+00e5 LATIN SMALL LETTER A WITH RING ABOVE (NFC) or U+0061 LATIN SMALL LETTER A followed by U+030a COMBINING RING ABOVE (NFD).
Both are canonically equivalent, meaning that a Unicode string with the letter “å” encoded either as NFC or NFD should be considered the same. Is that true in practice?
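To make the two forms concrete, here is a small sketch using the JDK's `java.text.Normalizer`. The class and method names are illustrative only. It shows that the two spellings are different Java Strings, yet compare equal once both are brought to the same normalization form.

```java
import java.text.Normalizer;

public class NormalizationForms {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Canonical equivalence: equal once both sides are in the same normalization form. */
    static boolean canonicallyEquivalent(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        String composed = "swedish-\u00e5.txt";    // U+00E5 (NFC)
        String decomposed = "swedish-a\u030a.txt"; // U+0061 U+030A (NFD)

        System.out.println(composed.equals(decomposed));                      // false
        System.out.println(composed.length() + " vs " + decomposed.length()); // 13 vs 14
        System.out.println(canonicallyEquivalent(composed, decomposed));      // true
    }
}
```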
The failing tests are about creating a file whose name is NFC-encoded then trying to serve it over HTTP with the file name encoded in the URL as NFD and vice-versa.
When running those tests on macOS on APFS, the encoding never matters and macOS will find the file with an NFC-encoded filename when you try to open it with an NFD-encoded canonically equivalent filename and vice-versa.
When running those tests on Linux on ext4 or Windows on NTFS, the encoding always matters and Linux/Windows will not find the file with an NFC-encoded filename when you try to open it with an NFD-encoded canonically equivalent filename and vice-versa.
And this is exactly what the tests expect:
```
if (OS.MAC.isCurrentOs())
  assertThat(response.getStatus(), is(HttpStatus.OK_200));
else
  assertThat(response.getStatus(), is(HttpStatus.NOT_FOUND_404));
```
What I discovered is that when running those tests on Linux on ZFS, the encoding sometimes matters and Linux may find the file with an NFC-encoded filename when you try to open it with an NFD-encoded canonically equivalent filename and vice-versa, depending upon the ZFS normalization property; quoting the manual:
```
normalization = none | formC | formD | formKC | formKD
    Indicates whether the file system should perform a unicode normalization of file names whenever two file names are compared, and which normalization algorithm should be used. File names are always stored unmodified, names are normalized as part of any comparison process. If this property is set to a legal value other than none, and the utf8only property was left unspecified, the utf8only property is automatically set to on. The default value of the normalization property is none. This property cannot be changed after the file system is created.
```
So if we check the normalization of the filesystem upon which the test is executed:
```
$ zfs get normalization /
NAME                      PROPERTY       VALUE          SOURCE
rpool/ROOT/nabo5t         normalization  formD          -
$
```
we can understand why the tests fail: due to the normalization done by ZFS, Linux can open the file given canonically equivalent filenames, so the test mistakenly assumes that Linux cannot serve this file. But if we create a new filesystem with no normalization property:
```
$ zfs get normalization /unnormalized/test/directory
NAME                      PROPERTY       VALUE          SOURCE
rpool/unnormalized        normalization  none           -
$
```
and run a copy of the tests from it, the tests succeed.
So we’ve adapted both tests to detect whether the filesystem supports canonical equivalence and to base the assertion on that detection instead of hardcoding which OS behaves in which way.
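One way such a detection could work is sketched below. This is an assumption about the approach, not the code that went into the Jetty tests, and `CanonicalEquivalenceProbe` is a hypothetical name. The idea is to create a file under its NFC name and ask the filesystem whether the NFD name resolves to it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;

public class CanonicalEquivalenceProbe {
    /**
     * Creates a file with an NFC-encoded name in dir and reports whether the
     * filesystem also finds it under the canonically equivalent NFD name.
     * Throws InvalidPathException if the JVM's locale cannot encode the name.
     */
    static boolean supportsCanonicalEquivalence(Path dir) {
        String base = "probe-" + System.nanoTime() + "-\u00e5.txt";
        String nfc = Normalizer.normalize(base, Normalizer.Form.NFC);
        String nfd = Normalizer.normalize(base, Normalizer.Form.NFD);
        try {
            Path created = Files.createFile(dir.resolve(nfc));
            try {
                return Files.exists(dir.resolve(nfd));
            } finally {
                Files.deleteIfExists(created); // always clean up the probe file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = Path.of(System.getProperty("java.io.tmpdir"));
        System.out.println("canonical equivalence: " + supportsCanonicalEquivalence(tmp));
    }
}
```

A test can then expect a 200 when the probe returns true and a 404 otherwise, whatever the operating system is.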
22/01/2021
Community Projects & Contributors Take on Jakarta EE 9
With the recent release of JakartaEE9, the future for Java has never been brighter. In addition to headline projects moving forward into the new jakarta.* namespace, there has been a tremendous amount of work done throughout the community to stay at the forefront of the changing landscape. These efforts are the summation of hundreds of hours by just as many developers and highlight the vibrant ecosystem in the Jakarta workspace.
The Jakarta EE contributors and committers came together to shape the 9 release. They chose to create a reality that benefits the entire Jakarta EE ecosystem. Sometimes, we tend to underestimate our influence and the power of our actions. Now that open source is the path of Jakarta EE, you, me, all of us can control the outcome of this technology.
Such examples that are worthy of emulation include the following efforts. In their own words:

Eclipse Jetty – The Jetty project recently released Jetty 11, which has worked towards full compatibility with JakartaEE9 (Servlet, JSP, and WebSocket). We are driven by a mission statement of “By Developers, For Developers”, and the Jetty team has worked since the announcement of the so-called “Big Bang” approach to move Jetty entirely into the jakarta.* namespace. Not only did this position Jetty as a platform for other developers to push their products into the future, but also allowed the project to quickly adapt to innovations that are sure to come.

[Michael Redich] The Road to Jakarta EE 9, an InfoQ news piece, was published this past October to highlight the efforts by Kevin Sutter, Jakarta EE 9 Release Lead at IBM, and to describe the progress made this past year in making this new release a reality. The Java community should be proud of their contributions to Jakarta EE 9, especially implementing the “big bang,” and discussions have already started for Jakarta EE 9.1 and Jakarta EE 10. The Q&A with Kevin Sutter in the news piece includes the certification and voting process for all the Jakarta EE specifications, plans for upcoming releases of Jakarta EE, and how Java developers can get involved in contributing to Jakarta EE. Personally, I am happy to have been involved in Jakarta EE having authored 14 Jakarta EE-related InfoQ news items for the three years, and I look forward to taking my Jakarta EE contributions to the next level. I have committed to contributing to the Jakarta NoSQL specification which is currently under development. The Garden State Java User Group (in which I serve as one of its directors) has also adopted Jakarta NoSQL. I challenge anyone who still thinks that the Java programming language is dead because these past few years have been an exciting time to be part of this amazing Java community!

WildFly 22 Beta1 contains a tech preview EE 9 variant called WildFly Preview that you can download from the WildFly download page. The WildFly team is still working on passing the needed (Jakarta EE 9) TCKs (watch for updates via the wildfly.org site.) WildFly Preview includes a mix of native EE 9 APIs and implementations (i.e. ones that use the jakarta.* namespace) along with many APIs and implementations from EE 8 (i.e. ones that use the javax.* namespace). This mix of namespaces is made possible by using the Eclipse community’s excellent Eclipse Transformer project to bytecode transform legacy EE 8 artifacts to EE 9 when the server is provisioned. Applications that are written for EE 8 can also run on WildFly Preview, as a similar transformation is performed on any deployments managed by the server.

Apache TomEE is a Jakarta EE application server based on Apache Tomcat. The project’s main focus is the Web Profile up until Jakarta EE 8. However, with Jakarta EE 9 and some parts being optional or pruned, the project is considering the full platform for the future. TomEE is so far a couple of tests down (99% coverage) before it reaches compatibility with Jakarta EE 8 (See Introducing TCK Work and how it helps the community jump into the effort). For Jakarta EE 9, the Community decided to pick a slightly different path than other implementations. We have already produced a couple of Apache TomEE 9 milestones for Jakarta EE 9 based on a customised version of the Eclipse Transformer. It fully supports the new jakarta.* namespace. Not to forget, the project also implements MicroProfile.

Open Liberty is in the process of completing a Compatible Implementation for Jakarta EE 9. For several months, the Jakarta EE 9 implementation has been rolling out via the “monthly” Betas. Both of the Platform and Web Profile TCK testing efforts are progressing very well with 99% success rates. The expectation is to declare one (or more) of the early Open Liberty 2021 Betas as a Jakarta EE 9 Compatible Implementation. Due to Open Liberty’s flexible architecture and “zero migration” goal, customers can be assured that their current Java EE 7, Java EE 8, and Jakarta EE 8 applications will continue to execute without any changes required to the application code or server configuration. But, with a simple change to their server configuration, customers can easily start experimenting with the new “jakarta” namespace in Jakarta EE 9.

Jelastic PaaS is the first cloud platform that has already made Jakarta EE 9 release available for the customers across a wide network of distributed hosting service providers. For the last several months Jelastic team has been actively integrating Jakarta EE 9 within the cloud platform and in December made an official release. The certified container images with the following software stacks are already updated and available for customers across over 100 data centers: Tomcat, TomEE, GlassFish, WildFly and Jetty. Jelastic PaaS provides an easy way to create environments with new Jakarta EE 9 application servers for deep testing, compatibility checks and running live production environments. It’s also possible now to redeploy existing containers with old versions to the newest ones in order to reduce the necessary migration efforts, and to expedite adoption of cutting-edge cloud native tools and products.
[Amelia Eiras] Pull Request 923 - Jakarta EE 9 Contributors Card is a formidable example of eleven-Jakartees coming together to create, innovate and collaborate on an Integration-Feature that makes it so that no contributor, who helped on Jakarta EE 9 release, be forgotten in the new landing page for the EE 9 Release. Who chose those Contributors? None. That is the sole point of the existence of PR923. I chose to lead the work on the PR and worked openly by prompt communications delivered the day that Tomitribe submitted the PR – Jakarta EE Working Group message to the forum to invite other Jakartees to provide input in the creation of the new feature. With Triber Andrii, who wrote the code and the feedback of those involved, the feature is active and used in the EE 9 contributors cards, YOU ROCK WALL!
The Integration-Feature will be used in future releases. We hope that it is also adopted by any project, community, or individual in or outside the Eclipse Foundation to say ThankYOU with actions to those who help power & maintain any community.
- PR logistics: 11 Jakartees came together and produced 116 exchanges that helped merge the code. Thank you, Chris (Eclipse WebMaster) for helping check the side of INFRA. The PR’s exchanges lead us to choose the activity from 2 GitHub sources: 1) https://github.com/jakartaee/specifications/pulls all merged pulls and 2) https://github.com/eclipse-ee4j all repositories.
- PR Timeframe: the Contributors’ work accomplished from October 31st, 2019 to November 20th, 2020, was boxed and is frozen. The result is that the Contributor Cards highlight 6 different Jakartees at a time every 15 seconds. A total of 171 Jakartee Contributors (committers and contributors, leveled) belong to the amazing people behind EE 9 code. While working on that PR, other necessary improvements became obvious. A good example is the visual tweaks PR #952 we submitted that improved the landing page’s formatting, cards’ visual, etc.
Via actions, we chose to not “wait & see”, saving the project $budget, but also enabling openness to tackle the stuff that could have been dropped into “nonsense”.
In open-source, our actions project a temporary part of ourselves, with no exceptions. Those actions affect positively or negatively any ecosystem. Thank you for taking the time to read this #SharingIsCaring blog.
12/01/2021
Do Loom’s Claims Stack Up? Part 2: Thread Pools?
“Project Loom aims to drastically reduce the effort of writing, maintaining, and observing high-throughput concurrent applications that make the best use of available hardware. … The problem is that the thread, the software unit of concurrency, cannot match the scale of the application domain’s natural units of concurrency — a session, an HTTP request, or a single database operation. … Whereas the OS can support up to a few thousand active threads, the Java runtime can support millions of virtual threads. Every unit of concurrency in the application domain can be represented by its own thread, making programming concurrent applications easier. Forget about thread-pools, just spawn a new thread, one per task.” – Ron Pressler, State of Loom, May 2020

In this series of blogs, we are examining the new Loom virtual thread features now available in OpenJDK 16 early access releases. In part 1 we saw that Loom’s claim of 1,000,000 virtual threads was true, but perhaps a little misleading, as that only applies to threads with near-empty stacks. If threads actually have deep stacks, then the achieved number of virtual threads is bound by memory and is back to being the same order of magnitude as kernel threads. In this part, we will further examine the claims and ramifications of Project Loom, specifically if we can now forget about Thread Pools. Spoiler: Cheap threads can do expensive things!
All the code from this blog is available in our loom-trial project and has been run on my dev machine (Intel® Core™ i7-6820HK CPU @ 2.70GHz × 8, 32GB memory, Ubuntu 20.04.1 LTS 64-bit, OpenJDK Runtime Environment (build 16-loom+9-316)) with no specific tuning and default settings unless noted otherwise.

Matching the scale?

Project Loom makes the claim that applications need virtual threads because kernel threads “cannot match the scale of the application domain’s natural units of concurrency”!
Really??? We’ve seen that without tuning, we can achieve 32k of either type of thread on my laptop. We think it would be fair to assume that with careful tuning, that could be stretched to beyond 100k for either technology. Is this really below the natural scale of most applications? How many applications have a natural scale of more than 32k simultaneous parallel tasks? Don’t get me wrong, there are many apps that do exceed those scales and Jetty has users that put an extra 0 on that, but they are the minority and in reality very few applications are ever going to see that demand for concurrency.
So if the vast majority of applications would be covered by blocking code with a concurrency of 32k, then what’s the big deal? Why do those apps need Loom? Or, by the same argument, why would they need to be written in asynchronous style?
The answer is that you rarely see any application deployed with 10,000s of threads; instead, threads are limited by a thread pool, typically to 100s or 1000s of threads. The default thread pool size in Jetty is 200, which we sometimes see increased to 1000s, but we have never seen a 32k thread pool even though my un-tuned laptop could supposedly support it!
So what’s going on? Why are thread pools typically so limited and what about the claim that Loom means we can “Forget about thread pools”?

Why Thread Pools?

One reason we are told that thread pools are used is because kernel threads are slow to start, thus having a bunch of them pre-started, waiting for a task in a pool improves latency. Loom claims their virtual threads are much faster to start, so let’s test that with StartThreads, which reports:
```
kStart(ns) ave:137,903 from:1,000 min:47,466 max:6,048,540
vStart(ns) ave: 10,881 from:1,000 min: 4,648 max:  486,078
```
So that claim checks out. Virtual threads start an order of magnitude faster than kernel threads. If start time was the only reason for thread pools, then Loom’s claim of forgetting about thread pools would hold.
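For readers who want to reproduce this kind of measurement, here is a minimal sketch of how a start latency can be sampled. It is not the StartThreads class from the loom-trial project, and the name `StartLatency` is hypothetical. It only measures ordinary (kernel) threads so that it runs on any JDK; to compare against virtual threads, the thread construction line would be swapped for the virtual thread builder of whichever Loom build is in use.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StartLatency {
    /** Starts one thread and returns the nanoseconds between start() and the first line of run(). */
    static long measureStartNanos() {
        AtomicLong startedAt = new AtomicLong();
        Thread t = new Thread(() -> startedAt.set(System.nanoTime()));
        long before = System.nanoTime();
        t.start();
        try {
            t.join(); // join() guarantees the value written by the thread is visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
        return startedAt.get() - before;
    }

    public static void main(String[] args) {
        int samples = 1_000;
        long total = 0;
        long min = Long.MAX_VALUE;
        long max = 0;
        for (int i = 0; i < samples; i++) {
            long ns = measureStartNanos();
            total += ns;
            min = Math.min(min, ns);
            max = Math.max(max, ns);
        }
        System.out.printf("kStart(ns) ave:%,d from:%,d min:%,d max:%,d%n",
                total / samples, samples, min, max);
    }
}
```

The numbers it prints will vary with the machine and JVM warm-up, so they are only comparable when both kinds of thread are measured in the same run.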
But start time only explains why we have thread pools; it doesn’t explain why thread pools are frequently sized far below the system’s capacity for threads: 100s instead of 10,000s. What is the reason that thread pools are sized as they are?

Why Small Thread Pools?

Giving a thread a task to do is a resource commitment. It is saying that a flow of control may proceed to consume CPU, memory and other resources that will be needed to run to completion or at least until a blocking point, where it can wait for those resources. Most of those resources are not on the stack, thus limiting the number of available threads is a way to limit a wide range of resource consumption and give quality of service:
- If your back-end services can only handle 100s of simultaneous requests, then a thread pool with 100s of threads will avoid swamping them with too much load. If your JDBC driver only has 100 pooled connections, then 1,000,000 threads hammering on those connections or other locks are going to have a lot of contention.
- For many applications a late response is a wrong response, thus it may well be better to handle 1000 tasks in a timely way with the 1001st task delayed, rather than to try to run all 1001 tasks together and have them all risk being late.
- Graceful degradation under excess load. Processing a task will need to use heap memory and if too much memory is demanded, an OutOfMemoryError is fatal for almost any Java application. Limiting the number of threads is a coarse grained way of limiting a class of heap usage. Indeed in part 1, we saw that it was heap memory that limited the number of virtual threads.
Having a limited thread pool allows an application to be tested to that limit so that it can be proved that an application has the memory and other resources necessary to service all of those threads. Traditional thinking has been that if the configured number of threads is insufficient for the load presented, then either the excess load must wait, or the application should start using asynchronous techniques to more efficiently use those threads (rather than increase the number of threads beyond the resource capacity of the machine).
A limited thread pool is a coarse grained limit on all resources, not only threads. Limiting the number of threads puts a limit on concurrent lock contention, memory consumption and CPU usage.
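The effect of that coarse limit is easy to observe. The sketch below is illustrative only (the class name and the 1 ms sleep standing in for real work are assumptions): it pushes many tasks through a small fixed pool and records the highest number of tasks that were ever running at once, which can never exceed the pool size.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPoolDemo {
    /** Runs the given number of short tasks on a fixed pool and returns the peak concurrency seen. */
    static int peakConcurrency(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                // Record how many tasks are active right now, and the highest value seen.
                peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(1); // stand-in for work that holds memory, locks, connections
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    done.countDown();
                }
            });
        }
        try {
            done.await(); // excess tasks simply wait in the pool's queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + peakConcurrency(4, 200));
    }
}
```

Whatever resources a single task needs, the application only has to be provisioned and tested for `poolSize` of them at a time.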

Virtual Threads vs Thread Pool

Having established that there might be some good reasons to use thread pools, let’s see if Loom gives us any good reasons not to use them. So we have created a FakeDataBase class which simulates a JDBC connection pool of 100 connections with a semaphore and then in ManyTasks we run 100,000 tasks that do 2 selects and 1 insert to the database, with a small amount of CPU consumed both with and without the semaphore acquired. The core of the thread pool test is:
```
 for (int i = 0; i < tasks; i++)
   pool.execute(newTask(latch));
```
and this is compared against the Loom virtual thread code of:
```
 for (int i = 0 ; i < tasks; i++)
   Thread.builder().virtual().task(newTask(latch)).start();
```
And the results are…. drum roll… pretty much the same for both types of thread:
```
Pooled  K Threads 33,729ms
Spawned V Threads 34,482ms
```
The pooled kernel thread does appear to be consistently a little bit better, but this test is not that rigorous so let’s call it the same, which is kind of expected as the total duration is pretty much going to be primarily constrained by the concurrency of the database.
So was there any difference at all? Here is the system monitor graph during both runs: kernel threads with a pool are the first period on the left (60-30s) and then virtual threads after a change over peak (30s – 0s):

Kernel threads with thread pool do not stress the CPU at all, but virtual threads alone use almost twice as much CPU! There is also a hint of more memory being used.
The thread pool has 100k tasks in the thread pool queue, 100 kernel threads that take tasks, 100 at a time, and each task takes one of 100 semaphore permits 3 times, with little or no contention.
The Loom approach has 100k independent virtual threads that each contend 3 times for the 100 semaphore permits, with up to 99,900 threads needing to be added then removed 3 times from the semaphore’s wake up queue. The extra queuing for virtual threads could easily explain the excess CPU needed, but more investigation is needed to be definitive.
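The contention pattern described above can be sketched with a semaphore-guarded fake connection pool. This is not the FakeDataBase or ManyTasks code from the loom-trial project; the class below is a hypothetical stand-in that uses ordinary threads, one per task, so that every task queues on the semaphore itself rather than in a thread pool queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class FakeConnectionPool {
    private final Semaphore permits;
    private final AtomicInteger inUse = new AtomicInteger();
    private final AtomicInteger peakInUse = new AtomicInteger();

    FakeConnectionPool(int connections) {
        permits = new Semaphore(connections);
    }

    /** Simulates one statement: wait for a connection, hold it briefly, release it. */
    void execute() throws InterruptedException {
        permits.acquire(); // threads beyond the permit count queue up here
        try {
            peakInUse.accumulateAndGet(inUse.incrementAndGet(), Math::max);
            Thread.sleep(1); // stand-in for the database round trip
        } finally {
            inUse.decrementAndGet();
            permits.release();
        }
    }

    /** Spawns one thread per task; returns the most connections ever held at once. */
    static int runThreadPerTask(int connections, int tasks) {
        FakeConnectionPool db = new FakeConnectionPool(connections);
        CountDownLatch latch = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            new Thread(() -> {
                try {
                    db.execute(); // select
                    db.execute(); // select
                    db.execute(); // insert
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    latch.countDown();
                }
            }).start();
        }
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return db.peakInUse.get();
    }

    public static void main(String[] args) {
        System.out.println("peak connections in use: " + runThreadPerTask(100, 1_000));
    }
}
```

The semaphore still caps the database concurrency at the permit count, so throughput is the same; what changes is where the waiting tasks sit and how much work it takes to park and wake them.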
However, tasks limited by a resource like JDBC are not really the highly concurrent tasks that Loom is targeted at. To truly test Loom (and async), we need to look at a type of task that just won’t scale with blocking threads dispatched from a thread pool.

Virtual Threads vs Async APIs

One highly concurrent workload that we often see on Jetty is chat room style interaction (or games) written on CometD and/or WebSocket. Such applications often have many 10,000s or even 100,000s of connections to the server that are mostly idle, waiting for a message to receive or an event to send. Currently we achieve these scales only by asynchronous threadless waiting, with all its ramifications of complex async APIs into the application needing async callbacks. Luckily, CometD was originally written when there were only async servlets and not async IO, thus it still has the option to be deployed using blocking I/O reads and writes. This gives it good potential to be a like-for-like comparison between async pooled kernel threads vs blocking virtual threads.
However, we still have a concern that this style of application/load will not be suitable for Loom because each message to a chat room will fan out to the 10s, 100s or even 1000s of other users waiting in that room. Thus a single read could result in many blocking write operations, which are typically done with deep stacks (parsing, framework, handling, marshalling, then writing) and other resources (buffers, locks etc). You can see in the following flame graph from a CometD load test using Loom virtual threads, that even with a fast client the biggest block of time is spent in the blue peak on the left, that is writing with deep stacks. It is this part of the graph that needs to scale if we have either more and/or slower clients:

Jetty with CometD chat on Loom

To fairly test Loom, it is not sufficient to just replace the limited pool of kernel threads with infinite virtual threads. Jetty goes to lots of effort with its eat what you kill scheduling using reserved threads to ensure that whenever a selector thread calls a potentially blocking task, another selector thread has been executed. We can’t just put Loom virtual threads on top of this, else it will be paying the cost and complexity of core Jetty plus the overheads of Loom. Moreover, we have also learnt the risk of Thread Starvation that can result in highly concurrent applications if you defer important tasks (e.g. HTTP/2 flow control). Since virtual threads can be postponed (potentially indefinitely) by CPU bound applications or the use of non-Loom-aware locks (such as the synchronized keyword), they are not suitable for all tasks within Jetty.
Thus we think a better approach is to keep the core of Jetty running on kernel threads, but to spawn a virtual thread to do the actual work of reading, parsing, and calling the application and writing the response. If we flag those tasks with InvocationType.NON_BLOCKING, then they will be called directly by the selector thread, with no executor overhead. These tasks can then spawn a new virtual thread to proceed with the reading, parsing, handling, marshalling, writing and blocking. Thus we have created the jetty-10.0.x-loom branch, to use this approach and hopefully give a good basis for fair comparisons.
Our initial runs with our CometD benchmark with just 20 clients resulted in long GCs followed by out of memory failures! This is due to the usage of ThreadLocal for gathering latency statistics and each virtual thread was creating a latency capture data structure, only to use it once and then throw it away! While this problem is solvable by changing the CometD benchmark code, it reaffirms that threads use resources other than stack and that Loom virtual threads are not a drop in replacement for kernel threads.
We are aware that the handling of ThreadLocal is a well known problem in Loom, but until solved it may be a surprisingly hard problem to cope with, since you don’t typically know if a library your application depends on uses ThreadLocal or not.
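The cost is easy to demonstrate without Loom. The following sketch is an illustration, not the CometD benchmark code; the class name and the `long[1024]` stand-in for a latency capture structure are assumptions. It counts how many times a ThreadLocal initializer runs when the same tasks are executed on a small pool versus on one fresh thread per task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalCost {
    /** Builds a ThreadLocal whose initializer counts how often a per-thread structure is created. */
    private static ThreadLocal<long[]> countingLocal(AtomicInteger inits) {
        return ThreadLocal.withInitial(() -> {
            inits.incrementAndGet();
            return new long[1024]; // stand-in for a per-thread statistics structure
        });
    }

    /** Pooled threads: the structure is created at most once per pool thread. */
    static int initsWithPool(int poolSize, int tasks) {
        AtomicInteger inits = new AtomicInteger();
        ThreadLocal<long[]> stats = countingLocal(inits);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++)
            pool.execute(() -> stats.get()[0]++);
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return inits.get();
    }

    /** Thread per task: the structure is created, used once and discarded for every task. */
    static int initsWithThreadPerTask(int tasks) {
        AtomicInteger inits = new AtomicInteger();
        ThreadLocal<long[]> stats = countingLocal(inits);
        for (int i = 0; i < tasks; i++) {
            Thread t = new Thread(() -> stats.get()[0]++);
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return inits.get();
    }

    public static void main(String[] args) {
        System.out.println("pool of 4, 1000 tasks:   " + initsWithPool(4, 1_000) + " initializations");
        System.out.println("thread per task, 1000:   " + initsWithThreadPerTask(1_000) + " initializations");
    }
}
```

With a thread per task, whether kernel or virtual, any library that caches expensive state in a ThreadLocal ends up rebuilding it for every task.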
With the CometD benchmark modified to not use ThreadLocal, we can now take Loom/Jetty/CometD to a moderate number of clients (1000 which generated the flame graph above) with the following results:
```
CLIENT: Async Jetty/CometD server
========================================
Testing 1000 clients in 100 rooms, 10 rooms/client
Sending 1000 batches of 10x50 bytes messages every 10000 µs
Elapsed = 10015 ms
- - - - - - - - - - - - - - - - - - - -
Outgoing: Rate = 990 messages/s - 99 batches/s - 12.014 MiB/s
Incoming: Rate = 99829 messages/s - 35833 batches/s(35.89%) - 26.352 MiB/s
                @     _  3,898 µs (112993, 11.30%)
                   @  _  7,797 µs (141274, 14.13%)
                   @  _  11,696 µs (136440, 13.65%)
                   @  _  15,595 µs (139590, 13.96%) ^50%
                   @  _  19,493 µs (142883, 14.29%)
                  @   _  23,392 µs (130493, 13.05%)
                @     _  27,291 µs (112283, 11.23%) ^85%
        @             _  31,190 µs (59810, 5.98%) ^95%
  @                   _  35,088 µs (12968, 1.30%)
 @                    _  38,987 µs (4266, 0.43%) ^99%
@                     _  42,886 µs (2150, 0.22%)
@                     _  46,785 µs (1259, 0.13%)
@                     _  50,683 µs (910, 0.09%)
@                     _  54,582 µs (752, 0.08%)
@                     _  58,481 µs (567, 0.06%)
@                     _  62,380 µs (460, 0.05%) ^99.9%
@                     _  66,278 µs (365, 0.04%)
@                     _  70,177 µs (232, 0.02%)
@                     _  74,076 µs (82, 0.01%)
@                     _  77,975 µs (13, 0.00%)
@                     _  81,873 µs (2, 0.00%)
Messages - Latency: 999792 samples
Messages - min/avg/50th%/99th%/max = 209/15,095/14,778/35,815/78,184 µs
Messages - Network Latency Min/Ave/Max = 0/14/78 ms
SERVER: Async Jetty/CometD server
========================================
Operative System: Linux 5.8.0-33-generic amd64
JVM: Oracle Corporation OpenJDK 64-Bit Server VM 16-ea+25-1633 16-ea+25-1633
Processors: 12
System Memory: 89.26419% used of 31.164349 GiB
Used Heap Size: 73.283676 MiB
Max Heap Size: 2048.0 MiB
- - - - - - - - - - - - - - - - - - - -
Elapsed Time: 10568 ms
   Time in Young GC: 5 ms (2 collections)
   Time in Old GC: 0 ms (0 collections)
Garbage Generated in Eden Space: 3330.0 MiB
Garbage Generated in Survivor Space: 4.227936 MiB
Average CPU Load: 397.78314/1200
========================================
Jetty Thread Pool:
    threads:                174
    tasks:                  302146
    max concurrent threads: 34
    max queue size:         152
    queue latency avg/max:  0/11 ms
    task time avg/max:      1/3316 ms
```
```
CLIENT: Loom Jetty/CometD server
========================================
Testing 1000 clients in 100 rooms, 10 rooms/client
Sending 1000 batches of 10x50 bytes messages every 10000 µs
Elapsed = 10009 ms
- - - - - - - - - - - - - - - - - - - -
Outgoing: Rate = 990 messages/s - 99 batches/s - 13.774 MiB/s
Incoming: Rate = 99832 messages/s - 41201 batches/s(41.27%) - 27.462 MiB/s
                 @    _  2,718 µs (99690, 9.98%)
                   @  _  5,436 µs (116281, 11.64%)
                   @  _  8,155 µs (115202, 11.53%)
                   @  _  10,873 µs (108572, 10.87%)
                  @   _  13,591 µs (106951, 10.70%) ^50%
                   @  _  16,310 µs (117139, 11.72%)
                   @  _  19,028 µs (114531, 11.46%)
                @     _  21,746 µs (94080, 9.42%) ^85%
            @         _  24,465 µs (71479, 7.15%)
      @               _  27,183 µs (34358, 3.44%) ^95%
  @                   _  29,901 µs (11526, 1.15%) ^99%
 @                    _  32,620 µs (4513, 0.45%)
@                     _  35,338 µs (2123, 0.21%)
@                     _  38,056 µs (988, 0.10%)
@                     _  40,775 µs (562, 0.06%)
@                     _  43,493 µs (578, 0.06%) ^99.9%
@                     _  46,211 µs (435, 0.04%)
@                     _  48,930 µs (187, 0.02%)
@                     _  51,648 µs (31, 0.00%)
@                     _  54,366 µs (27, 0.00%)
@                     _  57,085 µs (1, 0.00%)
Messages - Latency: 999254 samples
Messages - min/avg/50th%/99th%/max = 192/12,630/12,476/29,704/54,558 µs
Messages - Network Latency Min/Ave/Max = 0/12/54 ms
SERVER: Loom Jetty/CometD server
========================================
Operative System: Linux 5.8.0-33-generic amd64
JVM: Oracle Corporation OpenJDK 64-Bit Server VM 16-loom+9-316 16-loom+9-316
Processors: 12
System Memory: 88.79622% used of 31.164349 GiB
Used Heap Size: 61.733116 MiB
Max Heap Size: 2048.0 MiB
- - - - - - - - - - - - - - - - - - - -
Elapsed Time: 10560 ms
   Time in Young GC: 23 ms (8 collections)
   Time in Old GC: 0 ms (0 collections)
Garbage Generated in Eden Space: 8068.0 MiB
Garbage Generated in Survivor Space: 3.6905975 MiB
Average CPU Load: 413.33084/1200
========================================
Jetty Thread Pool:
    threads:                14
    tasks:                  0
    max concurrent threads: 0
    max queue size:         0
    queue latency avg/max:  0/0 ms
    task time avg/max:      0/0 ms
```
The results here are a bit mixed, but there are some positives for Loom:
- Both approaches easily achieved the 1000 msg/s sent to the server and 99.8k msg/s received from the server (messages have an average fan-out of a factor 100).
- The Loom version broke up those messages into 41k responses/s whilst the async version used bigger batches at 35k responses/s, with each response carrying more messages. We need to investigate why, but we think Loom is faster at starting to run the task (no time in the thread pool queue, no time to “wake up” an idle thread).
- Loom had better latency, both average (~12.6 ms vs ~15.1 ms) and max (~54.6 ms vs ~78.2 ms).
- Loom used more CPU: 413/1200 vs 398/1200 (4% more)
- Loom generated more garbage: ~8068.0 MiB vs ~3330.0 MiB, and fewer objects made it to survivor space.
This is an interesting but inconclusive result. It is at a low scale on a fast loopback network with a client unlikely to cause blocking, so not really testing either approach. We now need to scale this test to many 10,000s of clients on a real network, which will require multiple load generation machines and careful measurement. This will be the subject of part 3 (probably some weeks away).
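The ratios in the comparison can be rechecked from the raw figures in the two reports. The following is a minimal sketch; the class and method names are ours and not part of the benchmark code. It gives roughly 3.9% more CPU for Loom, about 2.4 times the Eden garbage, and a fan-out of about 100.

```java
// Recomputes the ratios quoted in the comparison from the raw figures in the two reports.
final class LoomVsAsyncRatios
{
    // How much larger b is than a, as a percentage of a.
    static double percentMore(double a, double b)
    {
        return (b - a) / a * 100.0;
    }

    // How many times larger b is than a.
    static double factor(double a, double b)
    {
        return b / a;
    }

    public static void main(String[] args)
    {
        // Average CPU load (out of 1200): async 397.78314, Loom 413.33084.
        System.out.printf("Loom CPU load vs async: %.1f%% more%n", percentMore(397.78314, 413.33084));
        // Garbage generated in Eden space: async 3330.0 MiB, Loom 8068.0 MiB.
        System.out.printf("Loom Eden garbage vs async: %.2fx%n", factor(3330.0, 8068.0));
        // Loom run: 990 messages/s sent, 99832 messages/s received.
        System.out.printf("Fan-out (Loom run): %.1fx%n", factor(990, 99832));
    }
}
```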

Conclusion (part 2) – Cheap threads can do expensive things

It is good that Project Loom adds inexpensive and fast spawning/blocking virtual threads to the JVM. But cheap threads can do expensive things!
Having 1,000,000 concurrent application entities is going to take memory, CPU and other resources, no matter if they block or use async callbacks. It may be that entirely different programming styles are needed for Loom, as is suggested by Loom Structured Concurrency; however, we have not yet seen anything that limits the resources that can be consumed by unlimited spawning of virtual threads. There are also indications that Loom’s flexible stack management comes with a CPU cost. However, it has been moderately simple to update Jetty to experiment with using Loom to call a blocking application, and we’d very much encourage others to load test their application on the jetty-10.0.x-loom branch.
Many of Loom’s claims have stacked up: blocking code is much easier to write, virtual threads are very fast to start and cheap to block. However, other key claims either do not hold up or have yet to be substantiated: we do not think virtual threads give natural scaling, as threads themselves are not the limiting factor; rather, it is the resources used that determine the scaling. The suggestion to “Forget about thread-pools, just spawn a new thread…” feels like an invitation to create unstable applications unless other substantive resource management strategies are put into place.
Given that Duke’s “new clothes” woven by Loom are not one-size-fits-all, it would be a mistake to stop developing asynchronous APIs for things such as DNS and JDBC on the unsubstantiated suggestion that Loom virtual threads will make them unnecessary.
29/12/2020
Do Loom’s Claims Stack Up? Part 1: Millions of Threads?
“Project Loom aims to drastically reduce the effort of writing, maintaining, and observing high-throughput concurrent applications that make the best use of available hardware. … The problem is that the thread, the software unit of concurrency, cannot match the scale of the application domain’s natural units of concurrency — a session, an HTTP request, or a single database operation. … Whereas the OS can support up to a few thousand active threads, the Java runtime can support millions of virtual threads. Every unit of concurrency in the application domain can be represented by its own thread, making programming concurrent applications easier. Forget about thread-pools, just spawn a new thread, one per task.” – Ron Pressler, State of Loom, May 2020

Project Loom brings virtual threads (back) to the JVM to reduce the effort of writing high-throughput concurrent applications. Loom has generated a fair bit of interest with claims that asynchronous APIs may no longer be necessary for things like Futures, JDBC, DNS, Reactive, etc. Since Loom is now available in OpenJDK 16 early access builds, we thought it was a good time to test out some of the amazing claims that have been made for Duke’s new opaque clothing that has been woven by Loom! Spoiler – Duke might not be naked, but its attire could be a tad see-through!
All the code from this blog is available in our loom-trial project and has been run on my dev machine (Intel® Core™ i7-6820HK CPU @ 2.70GHz × 8, 32GB memory, Ubuntu 20.04.1 LTS 64-bit, OpenJDK Runtime Environment (build 16-loom+9-316)) with no specific tuning and default settings unless noted otherwise.

Some History

We started writing what would become Eclipse Jetty in 1995 on Java 0.9. For its first decade, Jetty was a blocking server using a thread per request and then a thread per connection, and large thread pools (sometimes many thousands) were sufficient to handle almost all the loads offered.
However, there were a few deployments that wanted more parallelism, plus the advent of virtual hosting meant that servers were often sharing physical machines with other server instances, all trying to pre-allocate max resources in their idle thread pools to handle potential load spikes.
Thus there was some demand for async and so Jetty-6 in 2006 introduced some asynchronous I/O. Yet it was not until Jetty-9 in 2012 that we could say that Jetty was fully asynchronous through the container and to the application and we still fight with the complexity of it today.
Through this time, Java threads were initially implemented by Green Threads and there were lots of problems of live lock, priority inversion, etc. It was a huge relief when native threads were introduced to the JVM and thus we were a little surprised at the enthusiasm expressed for Loom, which appears to be a revisit of late-stage MxN Green Threads and suffers from at least some similar limitations (e.g. the CPUBound test demonstrates that the lack of preemption makes virtual tasks unsuitable for CPU bound tasks). This paper from 2002 on Multithreading in Solaris gives an excellent background on this subject and describes the switch from the MxN threading to 1:1 native threads with terms like “better scalability”, “simplicity”, “improved quality” and that MxN had “not quite delivered the anticipated benefits”. Thus we are really interested to find out what is so different this time around.
The Jetty team has a near-unique perspective on the history of both Java threading and the development of highly concurrent large throughput Java applications, which we can use to evaluate Loom. It’s almost like we were frozen in time for decades to bring back our evil selves from the past 🙂

One Million Threads!

That’s a lot of threads and it is a claim that is really easy to test! Here is an extract from MaxVThreads:
```
CountDownLatch hold = new CountDownLatch(1);
while (threads.size() < 1_000_000)
{
    CountDownLatch started = new CountDownLatch(1);
    Thread thread = Thread.builder().virtual().task(() ->
    {
        try
        {
            started.countDown();
            hold.await();
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
    }).start();
    threads.add(thread);
    started.await();
    System.err.printf("%s: %,d%n", thread, threads.size());
}
```
Which we ran and got:
```
...
VirtualThread[@244165d6,...]:   999,998
VirtualThread[@6f40da3b,...]:   999,999
VirtualThread[@1cfca01c,...]: 1,000,000
```
Async is Dead!!!
Long live Loom!!!
Lunch is Free!!!
Bullets are Silver!!!
29/12/2020
Eat What You Kill without Starvation!
Jetty 9 introduced the Eat-What-You-Kill execution strategy to apply mechanically sympathetic techniques to the scheduling of threads in the producer-consumer pattern that is used for core capabilities in the server. The initial implementations proved vulnerable to thread starvation and Jetty-9.3 introduced dual scheduling strategies to keep the server running, which in turn suffered from lock contention on machines with more than 16 cores. The Jetty-9.4 release now contains the latest incarnation of the Eat-What-You-Kill scheduling strategy, which provides mechanical sympathy without the risk of thread starvation in a single strategy. This blog is an update of the original post with the latest refinements. (Note: the EatWhatYouKill strategy is named after a hunting proverb in the sense that one should only kill to eat. The use of this phrase is not an endorsement of hunting nor killing of wildlife for food or sport.)

Parallel Mechanical Sympathy

Parallel computing is a “false friend” for many web applications. The textbooks will tell you that parallelism is about decomposing large tasks into smaller ones that can be executed simultaneously by different computing engines to complete the task faster. While this is true, the issue is that for web application containers there is no agreement on what the “large task” that needs to be decomposed actually is.

From the application’s point of view, the large task to be solved is how to render a complex page for a user, combining multiple requests and resources, using many services for authentication and perhaps RESTful access to a data model on multiple back end servers. For the application, parallelism can improve the quality of service of rendering a single page by spreading the decomposed tasks over all the available CPUs of the server.

However, a web application container has a different large task to solve: how to provide service to hundreds or thousands, maybe even hundreds of thousands, of simultaneous users. Unfortunately, for the container, the optimal way to allocate its decomposed tasks to CPUs is completely opposite to how the application would like its decomposed tasks to be executed.

Consider a server with 4 CPUs serving 4 users, each of which has 4 tasks. The application’s ideal view of parallel decomposition looks like:

Label UxTy represents Task y for User x. Tasks for the same user are coloured alike.

This view suggests that each user’s combined task will be executed in minimum time. However, some users must wait for prior users’ tasks to complete before their execution can start, so average latency is higher.

Furthermore, we know from Mechanical Sympathy that such ideal execution is rarely possible, especially if there is data shared between tasks. Each CPU needs time to load its cache and registers with data before it can be acted on. If that data is specific to the problem each user is trying to solve, then the real view of the parallel execution looks more like the following, with the orange blocks indicating the time taken to load the CPU cache with user and task related data:

Label UxTy represents Task y for User x. Tasks for the same user are coloured alike. Orange blocks represent cache load time.

So from the container’s point of view, the last thing it wants is the data from one user’s large problem spread over all its CPUs, because that means that when it executes the next task, it will have a cold cache that must be reloaded with the data of the next user. Furthermore, executing tasks for the same user on different CPUs risks Parallel Slowdown, where the cost of mutual exclusion, synchronisation and communication between CPUs can increase the total time needed to execute the tasks to more than that of serial execution. If the tasks are fully mutually excluded on user data (unlikely, but a bounding case), then the execution could look like:

For optimal execution from the container’s point of view, it is far better if tasks from each user, which use common data, are kept on the same CPU so the cache only needs to be loaded once and there is no mutual exclusion on user data:

While this style of execution does not achieve the minimal latency and maximal throughput of the idealised application view, in reality it is the fairest and most efficient execution, with all users receiving similar quality of service and the optimal average latency.

In summary, when scheduling the execution of parallel tasks, it is best to keep tasks that share data on the same CPU so that they may benefit from a hot cache (the original blog contains some micro benchmark results that quantify the benefit).
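As an illustration of that summary, the following minimal sketch routes every task for a given user to the same single-threaded lane. It is not Jetty code: the `UserAffinityScheduler` class and its methods are hypothetical, and because Java has no portable way to pin a thread to a CPU, a thread is used here only as a stand-in for a CPU.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: tasks that share user data always run on the same thread.
final class UserAffinityScheduler
{
    private final ExecutorService[] lanes;

    UserAffinityScheduler(int laneCount)
    {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++)
            lanes[i] = Executors.newSingleThreadExecutor();
    }

    // All tasks for the same user are queued on the same single-threaded lane.
    void execute(String user, Runnable task)
    {
        lanes[Math.floorMod(user.hashCode(), lanes.length)].execute(task);
    }

    // Stops accepting tasks and waits for the queued ones to finish.
    void shutdownAndWait()
    {
        for (ExecutorService lane : lanes)
            lane.shutdown();
        try
        {
            for (ExecutorService lane : lanes)
            {
                if (!lane.awaitTermination(5, TimeUnit.SECONDS))
                    throw new IllegalStateException("lane did not terminate");
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Runs tasksPerUser tasks for each user and records which threads ran them.
    static Map<String, Set<String>> demo(int users, int tasksPerUser)
    {
        UserAffinityScheduler scheduler = new UserAffinityScheduler(4);
        Map<String, Set<String>> threadsByUser = new ConcurrentHashMap<>();
        for (int t = 0; t < tasksPerUser; t++)
        {
            for (int u = 0; u < users; u++)
            {
                String user = "U" + u;
                scheduler.execute(user, () ->
                    threadsByUser.computeIfAbsent(user, k -> ConcurrentHashMap.newKeySet())
                        .add(Thread.currentThread().getName()));
            }
        }
        scheduler.shutdownAndWait();
        return threadsByUser;
    }

    public static void main(String[] args)
    {
        // Each user maps to exactly one thread name.
        demo(4, 4).forEach((user, threads) -> System.out.println(user + " ran on " + threads));
    }
}
```

The cost of this simple hashing scheme is that two busy users can land on the same lane while other lanes sit idle, which is the same trade-off between cache affinity and parallelism that the rest of this post explores.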

Produce Consume (PC)

In order to facilitate the decomposition of large problems into smaller ones, the Jetty container uses the Producer-Consumer pattern:
- The NIO Selector produces IO events that need to be consumed by reading, parsing and handling the data.
- A multiplexed HTTP/2 connection produces Frames that need to be consumed by calling the Servlet Container. Note that the producer of HTTP/2 frames is itself a consumer of IO events!
The producer-consumer pattern adds another way that tasks can be related by data. Not only might they be for the same user, but consuming a task will share the data that results from producing the task. A simple implementation can achieve this by using only a single CPU to both produce and consume the tasks:
```
while (true)
{
    Runnable task = _producer.produce();
    if (task == null)
        break;
    task.run();
}
```
The resulting execution pattern has good mechanical sympathy characteristics:

Label UxPy represents Produce Task y for User x; label UxCy represents Consume Task y for User x. Tasks for the same user are coloured in similar tones. Orange blocks are cache load times.

Here all the produced tasks are immediately consumed on the same CPU with a hot cache! Cache load times are minimised, but the cost is that the server will suffer from Head of Line (HOL) Blocking, where the serial execution of tasks from a queue means that tasks are forced to wait for the completion of unrelated tasks. In this case tasks for U1C0 need not wait for U0C0, and U2C0 tasks need not wait for U1C1 or U0C1, etc. There is no parallel execution and thus this is not an optimal usage of the server resources.

Produce Execute Consume (PEC)

To solve the HOL blocking problem, multiple CPUs must be used so that produced tasks can be executed in parallel and even if one is slow or blocks, the other CPU can progress the other tasks. To achieve this, a typical solution is to have one Thread executing on a CPU that will only produce tasks, which are then placed in a queue of tasks to be executed by Threads running on other CPUs. Typically the task queue is abstracted into an Executor:
```
while (true)
{
    Runnable task = _producer.produce();
    if (task == null)
        break;
    _executor.execute(task);
}
```
This strategy could be considered the canonical solution to the producer consumer problem, where producers are separated from consumers by a queue, and it is at the heart of architectures such as SEDA. This strategy solves the head of line blocking issue well, since all tasks produced can complete independently in different Threads on different CPUs:

This represents a good improvement in throughput and average latency over the simple Produce Consume solution, but the cost is that every consumed task is executed on a different Thread (and thus likely a different CPU) from the one that produced the task. While this may appear like a small cost for avoiding HOL blocking, our experience is that CPU cache misses significantly reduced the performance of early Jetty 9 releases.
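The thread hand-off that PEC implies can be observed with a small, self-contained sketch. It is an illustration only: the `ProduceExecuteConsumeDemo` class is hypothetical, a plain queue stands in for `_producer`, and a fixed pool from `java.util.concurrent` stands in for Jetty's executor.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of Produce Execute Consume: one thread produces, pool threads consume.
final class ProduceExecuteConsumeDemo
{
    // Returns one "producerThread -> consumerThread" entry per task.
    static List<String> run(int taskCount)
    {
        List<String> handoffs = new CopyOnWriteArrayList<>();
        String producerThread = Thread.currentThread().getName();

        // A plain queue stands in for the producer of tasks.
        Queue<Runnable> pending = new ArrayDeque<>();
        for (int i = 0; i < taskCount; i++)
            pending.add(() -> handoffs.add(producerThread + " -> " + Thread.currentThread().getName()));

        ExecutorService executor = Executors.newFixedThreadPool(4);
        while (true)
        {
            Runnable task = pending.poll(); // produce
            if (task == null)
                break;
            executor.execute(task);         // consumed later, on a pool thread
        }

        executor.shutdown();
        try
        {
            if (!executor.awaitTermination(5, TimeUnit.SECONDS))
                throw new IllegalStateException("tasks did not complete in time");
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handoffs;
    }

    public static void main(String[] args)
    {
        // Every line shows a consumer thread that differs from the producer thread.
        run(8).forEach(System.out::println);
    }
}
```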

Eat What You Kill (EWYK) AKA Execute Produce Consume (EPC)

To achieve good mechanical sympathy and avoid HOL blocking, Jetty has developed the Execute Produce Consume strategy, which we have nicknamed Eat What You Kill (EWYK) after the expression which states that a hunter should only kill an animal they intend to eat. Applied to the producer consumer problem, this policy says that a thread should only produce (kill) a task if it intends to consume (eat) it. A task queue is still used to achieve parallel execution, but it is the producer that is dispatched rather than the produced task:
```
    while (true)
    {
        Runnable task = _producer.produce();
        if (task == null)
            break;
        _executor.execute(this); // dispatch production
        task.run(); // consume the task ourselves
    }
```
The result is that a task is consumed by the same Thread, and thus likely the same CPU, that produced it, so that consumption is always done with a hot cache:

Moreover, because any thread that completes consuming a task will immediately attempt to produce another task, there is the possibility of a single Thread/CPU executing multiple produce/consume cycles for the same user. The result is improved average latency and reduced total CPU time.
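A runnable sketch of the pure EPC loop makes the "same thread" property visible. It is a simplification under stated assumptions rather than Jetty's implementation: the `ExecuteProduceConsumeDemo` class, its counter-based `produce()` method and the fixed pool of 4 threads are all hypothetical, and the tasks never block.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of Execute Produce Consume: the producer is dispatched, not the task.
final class ExecuteProduceConsumeDemo implements Runnable
{
    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final AtomicInteger remaining;
    private final CountDownLatch consumed;
    private final List<Boolean> sameThread = new CopyOnWriteArrayList<>();

    ExecuteProduceConsumeDemo(int taskCount)
    {
        remaining = new AtomicInteger(taskCount);
        consumed = new CountDownLatch(taskCount);
    }

    // Hypothetical producer: hands out taskCount tasks, then null.
    private Runnable produce()
    {
        if (remaining.getAndDecrement() <= 0)
            return null;
        String producedOn = Thread.currentThread().getName();
        return () ->
        {
            // Record whether consumption happens on the producing thread.
            sameThread.add(producedOn.equals(Thread.currentThread().getName()));
            consumed.countDown();
        };
    }

    @Override
    public void run()
    {
        while (true)
        {
            Runnable task = produce();
            if (task == null)
                break;
            executor.execute(this); // dispatch production to another thread
            task.run();             // consume the task ourselves
        }
    }

    // Runs the strategy and returns one flag per task: true if produced and consumed on one thread.
    static List<Boolean> demo(int taskCount)
    {
        ExecuteProduceConsumeDemo strategy = new ExecuteProduceConsumeDemo(taskCount);
        strategy.executor.execute(strategy); // start the first producer
        try
        {
            if (!strategy.consumed.await(5, TimeUnit.SECONDS))
                throw new IllegalStateException("tasks were not consumed in time");
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        finally
        {
            strategy.executor.shutdown();
        }
        return strategy.sameThread;
    }

    public static void main(String[] args)
    {
        System.out.println("consumed on producing thread: " + demo(100));
    }
}
```

Because these tasks never block, the fixed pool is never exhausted; the next section describes what happens when they do.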

Starvation!

Unfortunately, a pure implementation of EWYK suffers from a fatal flaw! Since any thread producing a task will go on to consume that task, it is possible for all threads/CPUs to be consuming at once. This was initially seen as a feature, as it exerted good back pressure on the network: a busy server used all its resources consuming existing tasks rather than producing new tasks. However, in an application server, consuming a task may be a blocking process that waits for more data/frames to be produced. Unfortunately, if every thread/CPU ends up consuming such a blocking task, then there are no threads left available to produce the tasks to unblock them. Deadlock!

A real example of this occurred with HTTP/2, when every Thread from the pool was blocked in an HTTP/2 request because it had used up its flow control window. The windows can be expanded by flow control frames from the other end, but there were no threads available to process the flow control frames!
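The shape of this failure can be reproduced in a few lines with a plain `java.util.concurrent` pool. This is a sketch only, not the HTTP/2 code: the `StarvationDemo` class is hypothetical, a `CountDownLatch` stands in for the flow control window update, and the timeout exists purely so that the demonstration terminates; without it the blocked tasks would wait forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of thread starvation: every pool thread blocks, so the unblocking task cannot run.
final class StarvationDemo
{
    // Returns how many blocked consumers gave up waiting for the window update.
    static int run(int poolSize, long timeoutMillis)
    {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch windowUpdate = new CountDownLatch(1);
        AtomicInteger timedOut = new AtomicInteger();

        // Every pool thread picks up a task that blocks until the window update arrives.
        for (int i = 0; i < poolSize; i++)
        {
            pool.execute(() ->
            {
                try
                {
                    if (!windowUpdate.await(timeoutMillis, TimeUnit.MILLISECONDS))
                        timedOut.incrementAndGet();
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The task that would unblock them sits in the queue: no thread is free to run it.
        pool.execute(windowUpdate::countDown);

        pool.shutdown();
        try
        {
            if (!pool.awaitTermination(timeoutMillis * 4 + 1000, TimeUnit.MILLISECONDS))
                throw new IllegalStateException("pool did not terminate");
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return timedOut.get();
    }

    public static void main(String[] args)
    {
        System.out.println("consumers that timed out: " + run(2, 200));
    }
}
```

At least one consumer always times out, because the window update can only run after a blocked consumer has given up its thread.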

Thus the EWYK execution strategy used in Jetty is now adaptive and it can use the most appropriate of the three strategies outlined above, ensuring there is always at least one thread/CPU producing so that starvation does not occur. To be adaptive, Jetty uses two mechanisms:
- Tasks that are produced can be interrogated via the Invocable interface to determine if they are nonblocking, blocking or can be run in either mode. NON_BLOCKING or EITHER tasks can be directly consumed by PC model.
- The thread pools used by Jetty implement the TryExecutor interface, which supports the method boolean tryExecute(Runnable task). This allows the scheduler to know if a thread was available to continue producing and thus allows EWYK/EPC mode; otherwise the task must be passed to an executor to be consumed in PEC mode. To implement this semantic, Jetty maintains a dynamically sized pool of reserved threads that can respond to tryExecute(Runnable) calls.
Thus the simple produce consume (PC) model is used for non-blocking tasks; for blocking tasks the EWYK, aka Execute Produce Consume (EPC) mode is used if a reserved thread is available, otherwise the SEDA style Produce Execute Consume (PEC) model is used.

The adaptive EWYK strategy can be written as:
```
    while (true)
    {
        Runnable task = _producer.produce();
        if (task == null)
            break;
        if (Invocable.getInvocationType(task) == NON_BLOCKING)
            task.run();                      // Produce Consume
        else if (_executor.tryExecute(this)) // recruit a new producer?
            task.run();                      // Execute Produce Consume (EWYK!)
        else
            _executor.execute(task);         // Produce Execute Consume
    }
```
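The decision made for each produced task can be isolated into a small, testable sketch. Everything here is a simplified stand-in rather than Jetty's API: `SimpleTryExecutor`, `InvocationType` and `Mode` are hypothetical types defined for the example, only two invocation types are modelled, and the fake executor merely counts reserved threads instead of running anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

// Hypothetical sketch of the adaptive choice between PC, EPC and PEC for each produced task.
final class AdaptiveStrategySketch
{
    enum InvocationType { BLOCKING, NON_BLOCKING }

    enum Mode { PC, EPC, PEC }

    // Simplified stand-in for a try-executor: tryExecute succeeds only if a reserved thread is free.
    interface SimpleTryExecutor extends Executor
    {
        boolean tryExecute(Runnable task);
    }

    // Mirrors the if/else chain of the adaptive loop for a single task.
    static Mode select(InvocationType type, SimpleTryExecutor executor, Runnable producer)
    {
        if (type == InvocationType.NON_BLOCKING)
            return Mode.PC;  // run directly, it cannot block the producer
        if (executor.tryExecute(producer))
            return Mode.EPC; // another thread took over producing, so we may block
        return Mode.PEC;     // no spare producer: hand the task to the executor
    }

    // Applies select() to a sequence of tasks with a fixed number of reserved threads.
    static List<Mode> decide(int reservedThreads, InvocationType... types)
    {
        int[] reserved = {reservedThreads};
        SimpleTryExecutor executor = new SimpleTryExecutor()
        {
            @Override
            public boolean tryExecute(Runnable task)
            {
                if (reserved[0] == 0)
                    return false;
                reserved[0]--; // a real pool would hand the task to the reserved thread here
                return true;
            }

            @Override
            public void execute(Runnable task)
            {
                // A real pool would queue the task; the sketch only models the decision.
            }
        };
        Runnable producer = () -> {};
        List<Mode> modes = new ArrayList<>();
        for (InvocationType type : types)
            modes.add(select(type, executor, producer));
        return modes;
    }

    public static void main(String[] args)
    {
        // One reserved thread: the first blocking task gets EPC, the second falls back to PEC.
        System.out.println(decide(1, InvocationType.BLOCKING, InvocationType.NON_BLOCKING, InvocationType.BLOCKING));
    }
}
```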
Chained Execution Strategies

As stated above, in the Jetty use-case it is common for the execution strategy used by the IO layer to call tasks that are themselves an execution strategy for producing and consuming HTTP/2 frames. Thus EWYK strategies can be chained and by knowing some information about the mode in which the prior strategy has executed them the strategies can be even more adaptive.

The adaptable chainable EWYK strategy is outlined here:
```
  while (true)
  {
    Runnable task = _producer.produce();
    if (task == null)
      break;
    if (thisThreadIsNonBlocking())
    {
      switch (Invocable.getInvocationType(task))
      {
        case NON_BLOCKING:
          task.run();                 // Produce Consume
          break;
        case BLOCKING:
          _executor.execute(task);    // Produce Execute Consume
          break;
        case EITHER:
          executeAsNonBlocking(task); // Produce Consume
          break;
      }
    }
    else
    {
      switch (Invocable.getInvocationType(task))
      {
        case NON_BLOCKING:
          task.run();                   // Produce Consume
          break;
        case BLOCKING:
          if (_executor.tryExecute(this))
            task.run();                 // Execute Produce Consume (EWYK!)
          else
            _executor.execute(task);    // Produce Execute Consume
          break;
        case EITHER:
          if (_executor.tryExecute(this))
            task.run();                 // Execute Produce Consume (EWYK!)
          else
            executeAsNonBlocking(task); // Produce Consume
          break;
      }
    }
  }
```
An example of how the chaining works is that the HTTP/2 task declares itself as invocable EITHER in blocking or non blocking mode. If the IO strategy is operating in PEC mode, then the HTTP/2 task is in its own thread and free to block, so it can itself use EWYK and potentially execute a blocking task that it produced.

However, if the IO strategy has no reserved threads, it cannot risk queuing an important Flow Control frame in a job queue. Instead, it can execute the HTTP/2 task as a non blocking task in PC mode. So even if the last available thread was running the IO strategy, it can use PC mode to execute HTTP/2 tasks in non blocking mode. The HTTP/2 strategy is then always able to handle flow control frames, as they are non-blocking tasks run as PC, while all other frames that may block are queued with PEC.

Conclusion

The EWYK execution strategy has been implemented in Jetty to improve performance through mechanical sympathy, whilst avoiding the issues of Head of Line blocking, Thread Starvation and Parallel Slowdown. The team at Webtide continue to work with our clients and users to analyse and innovate better solutions to serve high performance real world applications.
28/03/2019

| Environment | Jakarta EE | Servlet | Jakarta Namespace | Jetty GroupID |
| --- | --- | --- | --- | --- |
| ee8 | EE8 | 4 | `javax.servlet` | `org.eclipse.jetty.ee8` |
| ee9 | EE9 | 5 | `jakarta.servlet` | `org.eclipse.jetty.ee9` |
| ee10 | EE10 | 6 | `jakarta.servlet` | `org.eclipse.jetty.ee10` |
