The Opera browser is working well with the CometD JavaScript library.
However, a problem was recently reported by the BlastChat guys: with Opera, long-polling requests were strangely disconnecting and immediately reconnecting. This problem only happened when the long-poll request was held by the CometD server for the whole duration of the long-polling timeout.
Reducing the long-polling timeout from the default 30 seconds to 20 seconds made the problem disappear.
This made me think that some other entity had a 30 second timeout, and was killing the request just before the CometD server had the chance to respond to it.
Such entities may be front-end web servers (such as when Apache Httpd is deployed in front of the CometD server), as well as firewalls or other network components.
But in this case all other major browsers were working fine; only Opera was failing.
So I typed about:config in Opera’s address bar to access Opera’s configuration options, and filtered with the keyword timeout in the “Quick find” text field.
The second entry is “HTTP Loading Delayed Timeout” and it is set at 30 seconds.
Increasing that value to 45 seconds made the problem disappear.
In my opinion, that value is a bit too aggressive, especially these days when Comet techniques are commonly used and WebSocket is not yet widely deployed.
The simple workaround is to set the CometD long-poll timeout to 20-25 seconds as explained here, but it would be great if Opera's default were set to a bigger value.
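For reference, this is what that workaround looks like in web.xml (a sketch; it assumes the standard `timeout` init parameter of the CometD servlet, expressed in milliseconds):

```xml
<servlet>
    <servlet-name>cometd</servlet-name>
    <servlet-class>org.cometd.server.CometdServlet</servlet-class>
    <init-param>
        <!-- Hold long polls for at most 25 s, safely below Opera's 30 s limit -->
        <param-name>timeout</param-name>
        <param-value>25000</param-value>
    </init-param>
</servlet>
```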
Category: CometD
-
CometD and Opera
-
CometD 2.4.0 WebSocket Benchmarks
Slightly more than one year has passed since the last CometD 2 benchmarks, and more than three years since the CometD 1 benchmark. During this year we have done a lot of work on CometD, both by adding features and by continuously improving performance and stability to make it faster and more scalable.
With the upcoming CometD 2.4.0 release, one of the biggest changes is the implementation of a WebSocket transport for both the Java client and the Java server.
The WebSocket protocol is being finalized at the IETF, and major browsers all support various draft versions of the protocol (and Jetty supports all draft versions), so while WebSocket adoption is slowly picking up, it is interesting to compare how WebSocket behaves with respect to HTTP for the typical scenarios that use CometD.
We conducted several benchmarks using the CometD load tools on Amazon EC2 instances.
HTTP Benchmark Results
Below you can find the benchmark result graph when using the CometD long-polling transport, based on plain HTTP.

Unlike the previous benchmark, where we reported the average latency, this time we report the median latency, which is a better indicator of the latency seen by clients.
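To see why the median is more representative, here is a small self-contained sketch (not part of the CometD load tools; an illustration only) where a few slow outliers inflate the average while barely moving the median:

```java
import java.util.Arrays;

public class LatencyStats {
    // Average latency: skewed upwards by a few slow outliers.
    static double average(long[] samples) {
        return Arrays.stream(samples).average().orElse(0);
    }

    // Median latency: the value half the clients are below; robust to outliers.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // 9 fast responses and one 1-second outlier, in milliseconds.
        long[] latencies = {2, 2, 3, 3, 3, 4, 4, 5, 5, 1000};
        System.out.println("average = " + average(latencies) + " ms"); // 103.1 ms
        System.out.println("median  = " + median(latencies) + " ms");  // 4 ms
    }
}
```

A single 1-second hiccup drags the average above 100 ms even though 9 out of 10 clients saw latencies of 5 ms or less.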
Comparison with the previous benchmark would be unfair, since the hosts were different (both in number and computing power), and the JVM also was different.
As you can see from the graph above, the median latency is pretty much the same no matter the number of clients, with the exception of 50k clients at 50k messages/s.
The median latency stays well under 200 ms even at more than 50k messages/s, and it is in the range of 2-4 ms until 10k messages/s, and around 50 ms for 20k messages/s, even for 50k clients.
The result for 50k clients and 50k messages/s is a bit strange, since the hosts (both server and clients) had plenty of CPU available and plenty of threads available (which rules out locking contention issues in the code that would have bumped up thread usage).
Could it be that at that message rate we hit some limit of the EC2 platform? It might be possible, and this blog post confirms that there are indeed limits in the virtualization of the network interfaces between host and guest. I have heard from other people who have performed benchmarks on EC2 that they too hit limits very close to what the blog post above describes.
In any case, one server with 20k clients serving 50k messages/s with 150 ms median latency is a very good result.
For completeness, the 99th percentile latency is around 350 ms for 20k and 50k clients at 20k messages/s, around 1500 ms for 20k clients at 50k messages/s, and much lower (quite close to the median latency) for the other results.
WebSocket Benchmark Results
The results for the same benchmarks using the WebSocket transport were quite impressive, and you can see them below.

Note that this graph uses a totally different scale for latencies and number of clients.
Whereas for HTTP the maximum latency on the Y axis was 800 ms, for WebSocket it is 6 ms (yes, you read that right); and whereas for HTTP we topped out at 50k clients per server, here we could go up to 200k.
We did not merge the two graphs into one, to avoid the WebSocket trend lines being collapsed onto the X axis.
With HTTP, having more than 50k clients on the server was troublesome at any message rate, but with WebSocket 200k clients were stable up to 20k messages/s. Beyond that, we probably hit EC2 limits again, and the results were unstable: some runs could complete successfully, others could not.
- The median latencies, for almost any number of clients and any message rate, are below 10 ms, which is quite impressive.
- The 99th percentile latency is around 300 ms for 200k clients at 20k messages/s, and around 200 ms for 50k clients at 50k messages/s.
We have also conducted some benchmarks varying the payload size from the default of 50 bytes to 500 bytes to 2000 bytes, but the results we obtained with the different payload sizes were very similar, so we can say that payload size has very little impact (if any) on latencies in this benchmark configuration.
We have also monitored memory consumption in “idle” state (that is, with clients connected and sending meta connect requests every 30 seconds, but not sending messages):
- HTTP: 50k clients occupy around 2.1 GiB
- WebSocket: 50k clients occupy around 1.2 GiB, and 200k clients occupy 3.2 GiB.
The benefits of WebSocket being a lighter weight protocol with respect to HTTP are clear in all cases.
Conclusions
The conclusions are:
- The work the CometD project has done to improve performance and scalability was worth the effort, and CometD offers a truly scalable solution for server-side event-driven web applications, for both HTTP and WebSocket.
- As the WebSocket protocol gains adoption, CometD can leverage the new protocol without any change required to applications; they will just perform faster.
- Server-to-server CometD communication can now be extremely fast by using WebSocket. We have already updated the CometD scalability cluster Oort to take advantage of these enhancements.
Appendix–Benchmark Details
The server was one EC2 instance of type “m2.4xlarge” (67 GiB RAM, 8 cores Intel(R) Xeon(R) X5550 @2.67GHz) running Ubuntu Linux 11.04 (2.6.38-11-virtual #48-Ubuntu SMP 64-bit).
The clients were 10 EC2 instances of type “c1.xlarge” (7 GiB RAM, 8 cores Intel Xeon E5410 @2.33GHz) running Ubuntu Linux 11.04 (2.6.38-11-virtual #48-Ubuntu SMP 64-bit).
The JVM used was Oracle’s Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode) version 1.7.0 for both clients and server.
The server was started with the following options:
-Xmx32g -Xms32g -Xmn16g -XX:-UseSplitVerifier -XX:+UseParallelOldGC -XX:-UseAdaptiveSizePolicy -XX:+UseNUMA
while the clients were started with the following options:
-Xmx6g -Xms6g -Xmn3g -XX:-UseSplitVerifier -XX:+UseParallelOldGC -XX:-UseAdaptiveSizePolicy -XX:+UseNUMA
The OS was tuned for allowing a larger number of file descriptors, as described here.
-
Prelim Cometd WebSocket Benchmarks
I have done some very rough preliminary benchmarks on the latest cometd-2.4.0-SNAPSHOT with the latest Jetty-7.5.0-SNAPSHOT and the results are rather impressive. The features that these two releases have added are:
- Jetty NIO optimised with the latest JVMs and JITs in mind.
- Latest websocket draft implemented and optimised.
- Websocket client implemented.
- Jackson JSON parser/generator used for cometd
- Websocket cometd transport for the server improved.
- Websocket cometd transport for the bayeux client implemented.
The benchmarks that I’ve done have all been on my notebook using the localhost network, which is not the most realistic of environments, but it still tells us a lot about the raw performance of cometd/jetty. Specifically:
- Both the server and the client are running on the same machine, so they are effectively sharing the 8 CPUs available. The client typically takes 3x more CPU than the server (for the same load), so this is kind of like running the server on a dual core and the client on a 6 core machine.
- The local network has very high throughput which would only be matched by gigabit networks. It also has practically no latency, which is unlike any real network. The long polling transport is more dependent on good network latency than the websocket transport, so the true comparison between these transports will need testing on a real network.
The Test
The cometd load test is a simulated chat application. For this test I tried the long-polling and websocket transports for 100, 1000 and 10,000 clients, each logged into 10 randomly selected chat rooms out of a total of 100 rooms. The messages sent were all 50 characters long and were published in batches of 10 messages at once, each to a randomly selected room. There was a pause between batches that was adjusted to find a good throughput that didn’t have bad latency. However, little effort was put into finding the optimal settings to maximise throughput.
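The publish pattern described above can be sketched in plain Java (an illustration only; the real load tool lives in the CometD repository, and the names here are invented):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ChatLoadSketch {
    static final int TOTAL_ROOMS = 100;
    static final int ROOMS_PER_CLIENT = 10;
    static final int BATCH_SIZE = 10;
    static final int MESSAGE_LENGTH = 50;

    // Each simulated client logs into 10 distinct rooms chosen at random out of 100.
    static List<Integer> pickRooms(Random random) {
        List<Integer> rooms = new ArrayList<>();
        for (int i = 0; i < TOTAL_ROOMS; i++)
            rooms.add(i);
        Collections.shuffle(rooms, random);
        return new ArrayList<>(rooms.subList(0, ROOMS_PER_CLIENT));
    }

    // A batch is 10 messages of 50 characters, each published to a randomly chosen room.
    static List<String> makeBatch(List<Integer> rooms, Random random) {
        char[] payload = new char[MESSAGE_LENGTH];
        Arrays.fill(payload, 'x');
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < BATCH_SIZE; i++) {
            int room = rooms.get(random.nextInt(rooms.size()));
            batch.add("/chat/room" + room + "|" + new String(payload));
        }
        return batch;
    }
}
```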
The runs were all done on JVMs that had been warmed up, but the runs were moderately short (approx. 30s), so steady state was not guaranteed and the margin of error on these numbers will be pretty high. However, I also did a long-run test at one setting just to make sure that steady state can be achieved.
The Results
The bubble chart above plots messages per second against number of clients for both long-polling and websocket transports. The size of the bubble is the maximal latency of the test, with the smallest bubble being 109 ms and the largest 646 ms. Observations from the results are:
- Regardless of transport, we achieved hundreds of thousands of messages per second! These are great numbers and show that we can cycle the cometd infrastructure at high rates.
- The long-polling throughput is probably over-reported, because many messages are being queued into each HTTP response. The most HTTP responses I saw was 22,000 responses per second, so for many applications it will be the HTTP rate that limits the throughput rather than the cometd rate. However, the websocket throughput did not benefit from any such batching.
- The maximal latency for all websocket measurements was significantly better than long polling, with all websocket messages being delivered in < 200ms and the average was < 1ms.
- The websocket throughput increased with connections, which probably indicates that at low numbers of connections we were not generating a maximal load.
A Long Run
The throughput tests above need to be redone on a real network with longer runs. However, I did do one long run (about 3 hours) of 1,000,013,657 messages at 93,856/sec. The results suggest no immediate problems with long runs. Neither the client nor the server needed to do an old generation collection, and all young generation collections took on average only 12 ms.
The output from the client is below:
Statistics Started at Fri Aug 19 15:44:48 EST 2011
Operative System: Linux 2.6.38-10-generic amd64
JVM : Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM runtime 17.1-b03 1.6.0_22-b04
Processors: 8
System Memory: 55.35461% used of 7.747429 GiB
Used Heap Size: 215.7406 MiB
Max Heap Size: 1984.0 MiB
Young Generation Heap Size: 448.0 MiB
- - - - - - - - - - - - - - - - - - - -
Testing 1000 clients in 100 rooms, 10 rooms/client
Sending 1000000 batches of 10x50 bytes messages every 10000 µs
- - - - - - - - - - - - - - - - - - - -
Statistics Ended at Fri Aug 19 18:42:23 EST 2011
Elapsed time: 10654717 ms
Time in JIT compilation: 57 ms
Time in Young Generation GC: 118473 ms (8354 collections)
Time in Old Generation GC: 0 ms (0 collections)
Garbage Generated in Young Generation: 2576746.8 MiB
Garbage Generated in Survivor Generation: 336.53125 MiB
Garbage Generated in Old Generation: 532.35156 MiB
Average CPU Load: 433.23907/800
----------------------------------------
Outgoing: Elapsed = 10654716 ms | Rate = 938 msg/s = 93 req/s = 0.4 Mbs
All messages arrived 1000013657/1000013657
Messages - Success/Expected = 1000013657/1000013657
Incoming - Elapsed = 10654716 ms | Rate = 93856 msg/s = 90101 resp/s(96.00%) = 35.8 Mbs
Thread Pool - Queue Max = 972 | Latency avg/max = 3/62 ms
Messages - Wall Latency Min/Ave/Max = 0/8/135 ms
Note that the client was using 433/800 of the available CPU, while you can see that the server (below) was using only 170/800. This suggests that the server has plenty of spare capacity if it were given the entire machine.
Statistics Started at Fri Aug 19 15:44:47 EST 2011
Operative System: Linux 2.6.38-10-generic amd64
JVM : Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM runtime 17.1-b03 1.6.0_22-b04
Processors: 8
System Memory: 55.27913% used of 7.747429 GiB
Used Heap Size: 82.58406 MiB
Max Heap Size: 2016.0 MiB
Young Generation Heap Size: 224.0 MiB
- - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - -
Statistics Ended at Fri Aug 19 18:42:23 EST 2011
Elapsed time: 10655706 ms
Time in JIT compilation: 187 ms
Time in Young Generation GC: 140973 ms (12073 collections)
Time in Old Generation GC: 0 ms (0 collections)
Garbage Generated in Young Generation: 1652646.0 MiB
Garbage Generated in Survivor Generation: 767.625 MiB
Garbage Generated in Old Generation: 1472.6484 MiB
Average CPU Load: 170.20532/800
Conclusion
These results are preliminary, but excellent nonetheless! The final releases of jetty 7.5.0 and cometd 2.4.0 will be out within a week or two, and we will be working to bring you some more rigorous benchmarks with those releases.
-
CometD JSON library pluggability
It all started when my colleague Joakim showed me the results of some JSON libraries benchmarks he was doing, which showed Jackson to be the clear winner among many libraries.
So I decided that for the upcoming CometD 2.4.0 release it would be good to make CometD independent of the JSON library used, so that Jackson or other libraries could be plugged in.
Historically, CometD has used Jetty’s JSON library, and this is still the default if no other library is configured.
Running a CometD specific benchmark using Jetty’s JSON library and Jackson (see this test case) shows, on my laptop, this sample output:

Parsing:
...
jackson context iteration 1: 946 ms
jackson context iteration 2: 949 ms
jackson context iteration 3: 944 ms
jackson context iteration 4: 922 ms
jetty context iteration 1: 634 ms
jetty context iteration 2: 634 ms
jetty context iteration 3: 636 ms
jetty context iteration 4: 639 ms

Generating:
...
jackson context iteration 1: 548 ms
jackson context iteration 2: 549 ms
jackson context iteration 3: 552 ms
jackson context iteration 4: 561 ms
jetty context iteration 1: 788 ms
jetty context iteration 2: 796 ms
jetty context iteration 3: 798 ms
jetty context iteration 4: 805 ms
Jackson is roughly 45% slower in parsing and 45% faster in generating, so not bad for Jetty’s JSON compared to the best in class.
Apart from efficiency, Jackson certainly has more features than Jetty’s JSON library with respect to serializing/deserializing custom classes, so having a pluggable JSON library in CometD is only better for end users, who can now choose the solution that fits them best.
Unfortunately, I could not integrate the Gson library, which does not seem to have the capability of deserializing arbitrary JSON into java.util.Map object graphs, like Jetty’s JSON and Jackson are able to do in one line of code.
If you have insights on how to make Gson work, I’ll be glad to hear them.
The documentation on how to configure CometD’s JSON library can be found here.
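As a sketch, plugging in Jackson should amount to one init parameter in web.xml (this assumes the `jsonContext` parameter and the Jackson context class as shipped with CometD 2.4.0; check the documentation linked above for the exact names):

```xml
<servlet>
    <servlet-name>cometd</servlet-name>
    <servlet-class>org.cometd.server.CometdServlet</servlet-class>
    <init-param>
        <!-- Plug in Jackson instead of the default Jetty JSON library -->
        <param-name>jsonContext</param-name>
        <param-value>org.cometd.server.JacksonJSONContextServer</param-value>
    </init-param>
</servlet>
```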
UPDATE
After a suggestion from Tatu Saloranta of Jackson, the Jackson parsing is now faster than Jetty’s JSON library by roughly 20%:

...
jackson context iteration 1: 555 ms
jackson context iteration 2: 506 ms
jackson context iteration 3: 506 ms
jackson context iteration 4: 532 ms
jetty context iteration 1: 632 ms
jetty context iteration 2: 637 ms
jetty context iteration 3: 639 ms
jetty context iteration 4: 635 ms
-
CometD Message Flow Control with Listeners
In the last blog entry I talked about message flow control using CometD’s lazy channels.
Now I want to show how it is possible to achieve a similar flow control using specialized listeners that allow user code to manipulate the ServerSession message queue.
The ServerSession message queue is a data structure that is accessed concurrently when messages are published and delivered to clients, so it needs appropriate synchronization when accessed.
In order to simplify these synchronization requirements, CometD allows you to add DeQueueListeners to ServerSessions, with the guarantee that these listeners will be called with the appropriate locks acquired, so that user code can freely modify the queue’s content.
Below you can find an example of a DeQueueListener that keeps only the first message of a series of messages published to the same channel within a tolerance period of 1000 ms, and removes the others (it relies on the timestamp extension):

String channelName = "/stock/GOOG";
long tolerance = 1000;
ServerSession session = ...;
session.addListener(new ServerSession.DeQueueListener()
{
    public void deQueue(ServerSession session, Queue<ServerMessage> queue)
    {
        long lastTimeStamp = 0;
        for (Iterator<ServerMessage> iterator = queue.iterator(); iterator.hasNext();)
        {
            ServerMessage message = iterator.next();
            if (channelName.equals(message.getChannel()))
            {
                long timeStamp = Long.parseLong(message.get(Message.TIMESTAMP_FIELD).toString());
                if (timeStamp <= lastTimeStamp + tolerance)
                {
                    System.err.println("removed " + message);
                    iterator.remove();
                }
                else
                {
                    System.err.println("kept " + message);
                    lastTimeStamp = timeStamp;
                }
            }
        }
    }
});

Other possibilities include keeping the last message (instead of the first), coalescing the message fields following a particular logic, or even clearing the queue completely.
DeQueueListeners are called when CometD is about to deliver messages to the client, so clearing the queue completely results in an empty response being sent to the client.
This is different from the behavior of lazy channels, which allow delaying the message delivery until a configurable timeout expires.
However, lazy channels do not alter the number of messages being sent, while DeQueueListeners can manipulate the message queue.
Therefore, CometD message flow control is often best accomplished by using both mechanisms: lazy channels to delay message delivery, and DeQueueListeners to reduce/coalesce the number of messages sent.
-
CometD Message Flow Control with Lazy Channels
In the CometD introduction post, I explained how the CometD project provides a solution for writing low-latency server-side event-driven web applications.
Examples of this kind of applications are financial applications that provide stock quote price updates, or online games, or position tracking systems for fast moving objects (think a motorbike on a circuit).
These applications have in common the fact that they generate a high rate of server-side events, say on the order of 10 events per second.
With such an event rate, you soon start wondering whether it is really appropriate to send every event to clients (and therefore 10 events/s), or whether it is better to save bandwidth and computing resources and send events to clients at a lower rate.
For example, even if the stock quote price changes 10 times a second, it will probably be enough to deliver changes once a second to a web application that is meant to be used by humans: I would be surprised if a person could make any use of (or even see and remember) a stock price that was updated two tenths of a second ago (and that has in the meanwhile already changed 2 or 3 times). (Disclaimer: I am not involved in financial applications, I am just making a hypothesis here for the sake of explaining the concept.)
The CometD project provides lazy channels to implement this kind of message flow control (it also provides other message flow control means, of which I’ll speak in a future entry).
A channel can be marked as lazy during its initialization on the server side:

BayeuxServer bayeux = ...;
bayeux.createIfAbsent("/stock/GOOG", new ConfigurableServerChannel.Initializer()
{
    public void configureChannel(ConfigurableServerChannel channel)
    {
        channel.setLazy(true);
    }
});

Any message sent to that channel will be marked as a lazy message, and will be delivered lazily: either when a timeout (the max lazy timeout) expires, or when the long poll returns, whichever comes first.
It is possible to configure the duration of the max lazy timeout, for example to be 1 second, in web.xml:

<servlet>
    <servlet-name>cometd</servlet-name>
    <servlet-class>org.cometd.server.CometdServlet</servlet-class>
    <init-param>
        <param-name>maxLazyTimeout</param-name>
        <param-value>1000</param-value>
    </init-param>
</servlet>

With this configuration, lazy channels will have a max lazy timeout of 1000 ms, and messages published to a lazy channel will be delivered in a batch once a second.
Assuming, for example, that you have a steady rate of 8 messages per second arriving to server-side that update the GOOG stock quote, you will be delivering a batch of 8 messages to clients every second, instead of delivering 1 message every 125 ms.
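That arithmetic can be made concrete with a toy model (an illustration only, not CometD code):

```java
public class LazyBatching {
    // With a steady event rate (events/s) and a max lazy timeout (ms),
    // each lazy delivery carries roughly rate * timeout / 1000 messages.
    static long messagesPerDelivery(long eventsPerSecond, long maxLazyTimeoutMs) {
        return eventsPerSecond * maxLazyTimeoutMs / 1000;
    }

    // Deliveries drop from one per event to one per timeout period.
    static long deliveriesPerMinute(long maxLazyTimeoutMs) {
        return 60_000 / maxLazyTimeoutMs;
    }

    public static void main(String[] args) {
        // 8 GOOG updates/s with a 1000 ms max lazy timeout:
        System.out.println(messagesPerDelivery(8, 1000)); // 8 messages per batch
        System.out.println(deliveriesPerMinute(1000));    // 60 deliveries/min instead of 480
    }
}
```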
Lazy channels do not immediately reduce the bandwidth consumption (since no messages are discarded), but combined with a GZip filter that compresses the output they allow bandwidth savings, by compressing more messages for each delivery (in general it is better to compress one larger text than many small ones).
You can browse the CometD documentation for more information, look at the online javadocs, post to the mailing list or pop up in the IRC channel #cometd on irc.freenode.org.
-
CometD Introduction
The CometD project provides tools to write server-side event-driven web applications.
This kind of web application is becoming more popular, thanks to the fact that browsers have become truly powerful (JavaScript performance problems are now a relic of the past) and are widely deployed, so they are a very good platform for no-install applications.
Point to the URL, done.
Server-side event-driven web applications are those web applications that receive events from third-party systems on the server side, and need to deliver those events with very low latency to clients (mostly browsers).
Examples of such applications are chat applications, monitoring consoles, financial applications that provide stock quotes, online collaboration tools (e.g. document writing, code review), online games (e.g. chess, poker), social network applications, latest news information, mail applications, messaging applications, etc.
The key point of these applications is low latency: you cannot play a one-minute chess game if your application polls the chess server every 5-10 seconds to download your opponent’s moves.
These applications can be written using Comet techniques, but the moment you think using those techniques is simple, you’ll be faced with browser glitches, nasty race conditions, scalability issues and, in general, with the complexity of asynchronous, multi-threaded programming.
For example, Comet techniques do not specify how to identify a specific client. How can browser A tell the server to send a message to browser B?
It soon turns out that you need some sort of client identifier, and perhaps you want to support multiple clients in the same browser (so no, the HTTP session is not enough).
Add to that connection heartbeats, error detection, authentication and disconnection and other features, and you realize you are building a protocol on top of HTTP.
And this is where the CometD project comes to the rescue, providing that protocol on top of HTTP (the Bayeux protocol), and easy-to-use libraries that shield developers from said complexities.
In a nutshell, CometD enables publish/subscribe web messaging: it makes it possible to send a message from a browser to another browser (or to several other browsers), or to send a message to the server only, or to have the server send messages to a browser (or to several other browsers).
Below you can find an example of the JavaScript API, used in conjunction with the Dojo Toolkit. You can subscribe to a channel, which represents the topic for which you want to receive messages.
For example, a stock quote web application may publish quote updates for Google to channel /stock/GOOG on the server side, and all browsers that subscribed to that channel will receive the message with the updated stock quote (and whatever other information the application puts in the message):

dojox.cometd.subscribe("/stock/GOOG", function(message)
{
    // Update the DOM with the content from the message
});

Equally easy is publishing messages to the server on a particular channel:
dojox.cometd.publish("/game/chess/12345", { move: "e4" });

And at the end, you can disconnect:
dojox.cometd.disconnect();
You can find more information on the CometD site, and in the documentation section.
You can have a skeleton CometD project setup in seconds using Maven archetypes, as explained in the CometD primer. The Maven archetypes support Dojo, jQuery and (optionally) integrate with Spring.
Download and try out the latest CometD 2.1.1 release.
-
Is WebSocket Chat Simpler?
A year ago I wrote an article asking Is WebSocket Chat Simple?, where I highlighted the deficiencies of this much touted protocol for implementing simple comet applications like chat. After a year of intense debate there have been many changes and there are new drafts of both the WebSocket protocol and WebSocket API. Thus I thought it worthwhile to update my article with comments to see how things have improved (or not) in the last year.
The text in italics is my wishful thinking from a year ago
The text in bold italics is my updated comments.
Is WebSocket Chat Simple (take II)?
The WebSocket protocol has been touted as a great leap forward for bidirectional web applications like chat, promising a new era of simple Comet applications. Unfortunately there is no such thing as a silver bullet and this blog will walk through a simple chat room to see where WebSocket does and does not help with Comet applications. In a WebSocket world, there is even more need for frameworks like cometD.
Simple Chat
Chat is the “helloworld” application of web-2.0, and a simple WebSocket chat room is included with Jetty 7, which now supports WebSockets. The source of the simple chat can be seen in svn for the client-side and server-side.
The key part of the client-side is to establish a WebSocket connection:
join: function(name) {
    this._username = name;
    var location = document.location.toString().replace('http:', 'ws:');
    this._ws = new WebSocket(location);
    this._ws.onopen = this._onopen;
    this._ws.onmessage = this._onmessage;
    this._ws.onclose = this._onclose;
},

It is then possible for the client to send a chat message to the server:
_send: function(user, message) {
    user = user.replace(':', '_');
    if (this._ws)
        this._ws.send(user + ':' + message);
},

and to receive a chat message from the server and to display it:
_onmessage: function(m) {
    if (m.data) {
        var c = m.data.indexOf(':');
        var from = m.data.substring(0, c)
            .replace('<', '&lt;')
            .replace('>', '&gt;');
        var text = m.data.substring(c + 1)
            .replace('<', '&lt;')
            .replace('>', '&gt;');
        var chat = $('chat');
        var spanFrom = document.createElement('span');
        spanFrom.className = 'from';
        spanFrom.innerHTML = from + ': ';
        var spanText = document.createElement('span');
        spanText.className = 'text';
        spanText.innerHTML = text;
        var lineBreak = document.createElement('br');
        chat.appendChild(spanFrom);
        chat.appendChild(spanText);
        chat.appendChild(lineBreak);
        chat.scrollTop = chat.scrollHeight - chat.clientHeight;
    }
},

For the server side, we simply accept incoming connections as members:
public void onConnect(Connection connection)
{
    _connection = connection;
    _members.add(this);
}

and then for all messages received, we send them to all members:
public void onMessage(byte frame, String data)
{
    for (ChatWebSocket member : _members)
    {
        try
        {
            member._connection.sendMessage(data);
        }
        catch (IOException e)
        {
            Log.warn(e);
        }
    }
}

So we are done, right? We have a working chat room: let’s deploy it and we’ll be the next Google GChat! Unfortunately, reality is not that simple, and this chat room is a long way short of the kind of functionality that you expect from a chat room, even a simple one.
Not So Simple Chat
On Close?
With a chat room, the standard use-case is that you establish your presence in the room and it remains until you explicitly leave the room. In the context of web chat, that means that you can send and receive chat messages until you close the browser or navigate away from the page. Unfortunately the simple chat example does not implement this semantic, because the WebSocket protocol allows for an idle timeout of the connection. So if nothing is said in the chat room for a short while, the WebSocket connection will be closed, either by the client, the server or even an intermediary. The application will be notified of this event by the onClose method being called.

So how should the chat room handle onClose? The obvious thing to do is for the client to simply call join again and open a new connection back to the server:

_onclose: function() {
    this._ws = null;
    this.join(this._username);
}

This indeed maintains the user’s presence in the chat room, but is far from an ideal solution, since every few idle minutes the user will leave the room and rejoin. For the short period between connections, they will miss any messages sent and will not be able to send any chat themselves.

Keep-Alives
In order to maintain presence, the chat application can send keep-alive messages on the WebSocket to prevent it from being closed due to an idle timeout. However, the application has no idea at all about what the idle timeouts are, so it will have to pick some arbitrary frequent period (e.g. 30s) to send keep-alives and hope that is less than any idle timeout on the path (more or less as long-polling does now).
Ideally a future version of WebSocket will support timeout discovery, so it can either tell the application the period for keep-alive messages or it could even send the keep-alives on behalf of the application.
The latest drafts of the websocket protocol do include control packets for ping and pong, which can effectively be used as messages to keep alive a connection. Unfortunately this mechanism is not actually usable because: a) there is no javascript API to send pings; b) there is no API to communicate to the infrastructure if the application wants the connection kept alive or not; c) the protocol does not require that pings are sent; d) neither the websocket infrastructure nor the application knows the frequency at which pings would need to be sent to keep alive the intermediaries and other end of the connection. There is a draft proposal to declare timeouts in headers, but it remains to be seen if that gathers any traction.
Unfortunately keep-alives don’t avoid the need for onClose to initiate new WebSockets, because the internet is not a perfect place, and especially with wifi and mobile clients, sometimes connections just drop. It is a standard part of HTTP that if a connection closes while being used, GET requests are retried on new connections, so users are mostly insulated from transient connection failures. A WebSocket chat room needs to work under the same assumption: even with keep-alives, it needs to be prepared to reopen a connection when onClose is called.

Queues
With keep-alives, the WebSocket chat connection should mostly be a long-lived entity, with only the occasional reconnect due to transient network problems or server restarts. Occasional loss of presence might not be seen as a problem, unless you’re the dude who just typed a long chat message on the tiny keyboard of your vodafone360 app, or you are playing on chess.com instead of chatting and you don’t want to abandon a game due to transient network issues. So for any reasonable level of quality of service, the application is going to need to “pave over” any small gaps in connectivity by providing some kind of message queue in both client and server. If a message is sent during the period of time that there is no WebSocket connection, it needs to be queued until such time as the new connection is established.
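A minimal sketch of such a queue, in Java for brevity (all names invented; a real implementation would live in the browser client or in a framework like cometD): buffer messages while there is no connection, and flush the backlog when a new connection is established:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Pave over the gaps between WebSocket connections by queueing while disconnected.
public class GapQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private Consumer<String> connection; // null while disconnected

    public synchronized void send(String message) {
        if (connection != null)
            connection.accept(message); // connected: send immediately
        else
            pending.add(message);       // disconnected: queue for later
    }

    // Called from the onopen of the replacement WebSocket: flush the backlog in order.
    public synchronized void onOpen(Consumer<String> newConnection) {
        connection = newConnection;
        while (!pending.isEmpty())
            connection.accept(pending.poll());
    }

    // Called from onclose: stop sending until a new connection appears.
    public synchronized void onClose() {
        connection = null;
    }

    public synchronized int pendingCount() {
        return pending.size();
    }
}
```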
Timeouts
Unfortunately, some failures are not transient and sometimes a new connection will not be established. We can’t allow queues to grow forever and pretend that a user is present long after their connection is gone. Thus both ends of the chat application will also need timeouts, and a user will not be seen to have left the chat room until they have had no connection for the period of the timeout, or until an explicit leaving message is received.
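The server-side presence rule described above can be sketched as follows; the class, its methods and the idea of passing time in explicitly (rather than reading a clock) are assumptions made for illustration and testability.

```javascript
// Presence sketch: a user is dropped only after their connection has
// been gone for timeoutMs, or after an explicit leave message.
class Presence {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.present = new Set();
    this.disconnectedAt = new Map(); // user -> time the connection dropped
  }
  connected(user) { this.present.add(user); this.disconnectedAt.delete(user); }
  disconnected(user, now) { this.disconnectedAt.set(user, now); }
  left(user) { this.present.delete(user); this.disconnectedAt.delete(user); }
  // Periodic sweep: expire users whose connection gap exceeded the timeout.
  sweep(now) {
    for (const [user, t] of this.disconnectedAt) {
      if (now - t >= this.timeoutMs) this.left(user);
    }
  }
  isPresent(user) { return this.present.has(user); }
}
```

A reconnect within the timeout simply calls connected again, which cancels the pending expiry.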
Ideally a future version of WebSocket will support an orderly close message, so the application can distinguish between a network failure (and keep the user’s presence for a time) and an orderly close as the user leaves the page (and remove the user’s presence).
Both the protocol and API have been updated with the ability to distinguish an orderly close from a failed close. The WebSocket API now has a CloseEvent that is passed to the onclose method and contains the close code and reason string sent with an orderly close; this will allow simpler handling in the endpoints and avoid pointless client retries.
Message Retries
Even with message queues, there is a race condition that makes it difficult to completely close the gaps between connections. If the onClose method is called very soon after a message is sent, then the application has no way to know if that close event happened before or after the message was delivered. If quality of service is important, then the application currently has no option but to have some kind of per-message or periodic acknowledgment of message delivery.

Ideally a future version of WebSocket will support orderly close, so that delivery can be known for non-failed connections and the complication of acknowledgements can be avoided unless the highest quality of service is required.
Orderly close is now supported (see above).
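A minimal sketch of per-message acknowledgment follows. The sequence-number scheme, field names and class are assumptions for illustration, not part of any protocol: each outgoing message carries an id, the server is assumed to acknowledge ids it has received, and anything still unacknowledged when a new connection opens is resent (so the server must also de-duplicate by id).

```javascript
// Per-message acknowledgment sketch (protocol fields are assumptions).
class AckTracker {
  constructor() {
    this.seq = 0;
    this.unacked = new Map(); // seq -> message body awaiting an ack
  }
  // Wrap a body with a sequence number before sending it.
  wrap(body) {
    const id = ++this.seq;
    this.unacked.set(id, body);
    return { id, body };
  }
  // Server confirmed delivery of this id.
  ack(id) { this.unacked.delete(id); }
  // On reconnect: everything still unacked must be resent
  // (possibly a duplicate, so the receiver de-duplicates by id).
  pendingRetries() {
    return [...this.unacked].map(([id, body]) => ({ id, body }));
  }
}
```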
Backoff
With onClose handling, keep-alives, message queues, timeouts and retries, we finally will have a chat room that can maintain a user’s presence while they remain on the web page. But unfortunately the chat room is still not complete, because it needs to handle errors and non-transient failures. Some of the circumstances that need to be handled include:

- If the chat server is shut down, the client application is notified of this simply by a call to onClose rather than an onOpen call. In this case, onClose should not just reopen the connection, as the result would be a 100% CPU busy loop. Instead the chat application has to infer that there was a connection problem and at least pause a short while before trying again – potentially with a retry backoff algorithm to reduce retries over time.

Ideally a future version of WebSocket will allow more access to connection errors, as the handling of no-route-to-host may be entirely different from the handling of a 401 unauthorized response from the server.
The WebSocket protocol is now fully HTTP compliant before the 101 of the upgrade handshake, so responses like 401 can legally be sent. Also the WebSocket API now has an onerror callback, but unfortunately it is not yet clear under what circumstances it is called, nor is there any indication that information like a 401 response or a 302 redirect would be available to the application.
- If the user types a large chat message, then the WebSocket frame sent may exceed some resource limit on the client, server or an intermediary. Currently the WebSocket response to such resource issues is to simply close the connection. Unfortunately for the chat application, this may look like a transient network failure (coming after a successful onOpen call), so it may just reopen the connection and naively retry sending the message, which will again exceed the maximum message size – and we can lather, rinse and repeat! Again it is important that any automatic retries performed by the application are limited by a backoff timeout and/or a maximum number of retries.

Ideally a future version of WebSocket will be able to send an error status as something distinct from a network failure or idle timeout, so the application will know not to retry errors.
While there is no general error control frame, there is now a reason code defined in the orderly close, so that for any error serious enough to force the connection to be closed the following can be communicated: 1000 – normal closure; 1001 – shutdown or navigate away; 1002 – protocol error; 1003 – data type cannot be handled; 1004 – message is too large. These are a great improvement, but it would be better if such errors could be sent in control frames, so that the connection does not need to be sacrificed in order to reject a single large message or an unknown type.
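The two ideas above – don’t blindly retry errors, and back off when you do retry – can be sketched with a pair of small helpers. The function names and the base/cap values are illustrative assumptions; the close codes are the ones listed above.

```javascript
// Decide from the close code whether a reconnect-and-retry makes sense.
// 1002 (protocol error), 1003 (bad data type) and 1004 (message too
// large) indicate errors that a blind retry would simply repeat.
function shouldRetry(closeCode) {
  return closeCode !== 1002 && closeCode !== 1003 && closeCode !== 1004;
}

// Exponential backoff with a cap: 1s, 2s, 4s, ... up to 30s, so repeated
// failures do not turn into a busy loop of reconnect attempts.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}
```

An onclose handler would then consult shouldRetry(event.code) and, if a retry is warranted, schedule the reconnect after backoffDelay(attempt) milliseconds.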
Does it have to be so hard?
The above scenario is not the only way that a robust chat room could be developed. With some compromises on quality of service and some good user interface design, it would certainly be possible to build a chat room with a less complex usage of a WebSocket. However, the design decisions represented by the above scenario are not unreasonable even for chat, and they are certainly applicable to applications needing a better QoS than most chat rooms.
What this blog illustrates is that there is no silver bullet and that WebSocket will not solve many of the complexities that need to be addressed when developing robust Comet web applications. Hopefully some features such as keep-alives, timeout negotiation, orderly close and error notification can be built into a future version of WebSocket, but it is not the role of WebSocket to provide the more advanced handling of queues, timeouts, reconnections, retries and backoffs. If you wish to have a high quality of service, then either your application or the framework that it uses will need to deal with these features.
CometD with WebSocket
CometD version 2 will soon be released with support for WebSocket as an alternative transport to the currently supported JSON long-polling and JSONP callback-polling. CometD supports all the features discussed in this blog and makes them available transparently to browsers with or without WebSocket support. We are hopeful that WebSocket usage will be able to give us even better throughput and latency for CometD than the already impressive results achieved with long-polling.
CometD 2 has been released and we now have even more impressive results. WebSocket support is built into both Jetty and CometD, but uptake has been somewhat hampered by the multiple versions of the protocol in the wild and by patchy, changing browser support.
Programming to a framework like CometD remains the easiest way to build a Comet application, as well as to have portability over “old” techniques like long-polling and emerging technologies like WebSocket.
