Thursday, March 27, 2008

Scalable Java Web Applications, Part 1

Monday, July 11, 2005

The things they never teach you in any Java book or Java class...

I had a few things go wrong on a project of mine recently that reminded me of just how complicated it can be to create a highly functional stateless web application.

Java is making it simple to create Enterprise Scalable applications, and yet a robust and clean method of maintaining state from page to page in a browser eludes us to this day.

If you have a rich web application where the users follow a workflow across multiple screens, your application will have to be stateful to some extent. You have to have the ability to "remember" what a user enters on one screen and carry that forward from screen to screen.

And what we have to work with is a combination of half-baked browser technologies that each serve some limited little purpose, yet can be glued together in a "Frankenstein Kluge" to sorta-kinda make everything work.

Don't take my "half-baked" comment the wrong way; these technologies all work well for what they were originally intended for, but none of them were intended to be a robust state-saving mechanism.

There are many issues to consider with each of these technologies.

  • Cookies: Cookies can be disabled on a browser. They can be hacked or spoofed or stolen unless care is taken to properly encrypt them. Cookies are also treated in erratic ways on most browsers. Example: In the Internet Explorer browser, opening a "New" window will replicate all the cookies from the existing open window, but launching a new IE browser from an icon will not. So an application will either remember you, or not, depending on whether you open a new "window" or a new "instance" of your browser. And 99.9% of the web-surfing population of the world will never understand the subtle difference between those two states.
  • GET and POST parameters: GET and POST parameters are intended to be used for user input. But they can also be used to piggy-back state data from page to page (possibly as hidden fields). The problem is, all parameters look the same to the server, so this makes it easy for a user to hijack your application data. Don't expose key fields to users, and beware of cross-site scripting, SQL injection, buffer overflows, parameter replacement, and all other manner of input manipulation.
  • Hidden fields: These are just a flavor of HTML parameter, so they inherit all the weaknesses and dangers of HTML parameters. And hidden fields are only hidden until someone does "View Source". They are a clumsy and risky and tedious method for saving state...but they do work.
  • URL Re-writing: Kluge. Every place you have a URL reference in your entire application, you add some code to "inject" a session id into the GET or POST parameters. And this session id is a key into a table that contains the user's data on the server side. This is a common alternative to using cookies. In theory, this sounds easy. But in practice, I've found that URLs come from all over the place, and URL re-writing becomes a major plumbing project to implement correctly and completely.
  • JavaScript: JavaScript isn't really used to preserve application data from page to page, but JavaScript can inject more dynamic behavior into an existing page, and thus perhaps cut down on server round-trips. However, JavaScript can be disabled. It is difficult to work with, and doesn't work the same in all browsers.
  • Applets: You can say goodbye to your stateless HTML pages and code your entire application as a big stateful Java applet. Yay! However, Java has a long track record of browser compatibility issues, as well as producing awkward and sluggish graphical user interfaces. And most browsers are still too clumsy at identifying and handling the plugin for Java. Microsoft managed to effectively kill Java Applets during the browser wars. Microsoft has finally gotten out of the Java business, and modern Java plugins are getting smarter, but it might be too-little too-late for Java Applets to make a comeback. We'll see.
  • Server Session data: Server Session data is dependent on cookies (or URL-rewriting), and so inherits all of the weaknesses of these technologies (including the browser "New Window" problem). Server Session data only lives in the memory of one server (unless you take other measures described later in this article), so it is vulnerable to server fail-over, and switch-over. If you store application state in the session, you can be vulnerable to the "back" button on a browser knocking your application out of sync. You also have to be mindful of memory usage in the session, because your session store will be multiplied by the number of users you have. Since the server has no way of knowing when the user closes the browser, the server must implement an expiration mechanism for session data. And that in itself presents another set of timeout/logout issues to contend with.
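
One way to take some of the risk out of hidden fields is to sign them on the server, so that a value edited by the user gets rejected when it comes back. Here is a minimal sketch of that idea using an HMAC from the standard Java crypto classes. The class name SignedField and the secret are placeholders of mine, not part of any framework, and note that signing does not hide the value from "View Source"; it only detects that somebody changed it.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: signs a hidden-field value so the server can detect tampering. */
final class SignedField {
    private final SecretKeySpec key;

    SignedField(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Returns "value.signature". The value stays readable, but cannot be altered unnoticed. */
    String sign(String value) {
        return value + "." + hmac(value);
    }

    /** Returns the original value, or null if the signature is missing or does not match. */
    String verify(String signed) {
        int dot = signed.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String value = signed.substring(0, dot);
        byte[] actual = signed.substring(dot + 1).getBytes(StandardCharsets.US_ASCII);
        byte[] expected = hmac(value).getBytes(StandardCharsets.US_ASCII);
        // Constant-time comparison, so timing does not leak how much of the signature matched.
        return MessageDigest.isEqual(expected, actual) ? value : null;
    }

    private String hmac(String value) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] raw = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            // URL-safe Base64 never contains '.', so the last '.' always separates the signature.
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

You would call sign() when writing the hidden field into the page and verify() when the form is posted back, treating a null result as a tampered request.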

Each of these little technologies does its own thing in its own little space, and by gluing them together you have a full Web Application.

It has been my experience that most Java Web applications today use a combination of Hidden Fields and Session Data to manage application information. The really huge and highly-specialized web applications (like Google and Amazon and eBay) are all total custom jobs. They have the resources and army of developers it takes to do their own plumbing for everything. Google makes crazy use of JavaScript, but they have the resources to pull that off.

But for the rest of us in the I.T. world, we need to develop highly functional websites cheaply and quickly, and so we need to rely on frameworks and technologies to do as much of the plumbing as possible for us. We need cookies. We need server-side session data.

The Best Practice is to use Hidden Fields for any state data (tracking where the user is and what the user is doing) and to use Session Data as a stateless reserve (things you want to remember about a user regardless of what page they are on).

In practice, this IS difficult to code properly, and requires a detailed technical design, with careful layout of the dataflow for the entire application.

Beware of the multiple-submit problem. Example: User presses a button to "submit" an order to the database, and then hits the "Back" button on the browser and presses "submit" again. Did they intend to place a 2nd order? Or did they intend to update the first order? Can your application tell the difference and handle either?

Beware of the refresh problem. Example: User presses a button to "submit" an order to the database, and the user gets a result screen that says something like "Thank you for your order". If the user simply hits "Refresh" on that screen, they will cause the order to be submitted again, because the "submit" operation was part of the request/response process that generated the result screen.

There may also be a few key commit points in an application where you need some hard mechanism to detect and prevent duplicate submissions, perhaps using tokens or timestamps on objects. Most modern frameworks (Struts, JSF, etc.) have some tools that can help.
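
To make the token idea concrete, here is a rough sketch of a one-time submit token. This is not how Struts or JSF implement it; the class name SubmitTokens is made up for illustration, and in a real application the set of outstanding tokens would live in the user's session rather than in a standalone object.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical one-time token store for detecting duplicate form submissions. */
final class SubmitTokens {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    /** Call when rendering the form; put the returned token in a hidden field. */
    String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    /** Call when the form is posted. Returns true only the first time a token is presented. */
    boolean consume(String token) {
        return token != null && outstanding.remove(token);
    }
}
```

If consume() returns false, the post is either a "Back"-then-resubmit, a refresh, or a forged request, and the application can show the existing order instead of creating a second one.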

The good news is that if you follow all the tips and Best Practices I'm outlining in this article, you will be able to develop an application that is immune to back buttons and refreshes without too much plumbing work on your part.

The following is a pretty good article on the subject of redirects and duplicate submissions. It's technically very accurate, if you look past the somewhat flakey writing in places:

http://www.theserverside.com/articles/article.tss?l=RedirectAfterPost

LOAD BALANCING
Okay, I'm on page 3 of this particular blog posting, and I still haven't gotten to the problem I encountered on MY project. :)

My problem was with load balancing and server session data. But in order to understand why load balancing is such a problem for a web application, you have to understand the nature of the technologies that the web is built upon.

As mentioned above, a modern Java application needs to store some amount of data in the Server Session. Writing a session-free Web Application would be complicated and cost-prohibitive, and incompatible with modern frameworks that assume session usage.

But if you've chosen to use Server Session data, how do you scale your application up to multiple servers? How can you store application data in a single server session when your user might be load-balanced across multiple servers? The issues involved with architecting a Web Farm for a Java application are addressed in the following article:

http://www.onjava.com/pub/a/onjava/2001/09/26/load.html

For most applications, you will have three choices:

1) Don't load-balance an individual user. Once a user hits one server, keep them locked to that server. Any session storage they use will always be available to them because they never leave that server. This method is called "Server Affinity". All the users on the whole system are still load-balanced over multiple servers, but each individual user is locked into one server.

2) Replicate all session data across all servers. No matter which server the user goes to, their data will be waiting for them.

3) Store session data in a central repository that all servers can access. To prevent a single point of failure, this central repository should itself be distributed across multiple servers.
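
To illustrate what "Server Affinity" in option 1 means, here is a toy sketch that picks a server purely from the session id, so the same user always lands on the same machine. Real load balancers do this with their own cookies, IP addresses, or other tricks; the AffinityRouter class and the server names are assumptions of mine for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical router illustrating server affinity: same session id, same server. */
final class AffinityRouter {
    private final List<String> servers;

    AffinityRouter(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        this.servers = new ArrayList<>(servers);
    }

    /** Chooses a server from the session id alone, so repeat requests hit the same machine. */
    String route(String sessionId) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(sessionId.hashCode(), servers.size());
        return servers.get(index);
    }
}
```

Notice that nothing here copies session data anywhere. If the chosen server dies or the server list changes, the user is routed to a machine that has never heard of them.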

As with all things in technology, there are trade-offs for each of these.

From my own little informal polling of places I've worked at, and people I know, and articles I've read on the web, and people I've met at JavaOne, it seems that most applications use option #1.

However, option #1 can be difficult for your admins to make work 100%.

Option #1 is the most simple and straight-forward from the point-of-view of the application developer. You write your Java applications to use session data any way you choose, and your application will run in a server farm the exact same way it does on your single desktop machine. I love option #1 because it gives us developers a lot fewer plumbing issues to worry about.

The problems with option #1 are as follows: Some ISPs (like AOL) will use dynamic IP addresses on outbound traffic. So, the user's IP address might change as the user moves from page to page, and that makes it difficult for a Load Balancer to figure out that this is the same user that placed the previous request. But a "smart" Load Balancer will also use cookies to determine a user's identity. However, cookies get encrypted along with the rest of the request if your application runs under SSL (HTTPS). So, you need a "really smart" Load Balancer that can also decode SSL. (Or some kind of other external SSL encoder/decoder box sitting in front of the balancer.) At this point, we are talking about some pricey hardware that takes considerable time to configure. And even after going through all this trouble, this solution provides no fail-over if a server goes down. All users on the crashing server will be lost, because their session data was never replicated to any other machine.

Some admins have told me that even the smartest Load Balancers on the market get it wrong on rare occasion, and lose track of a user's Server Affinity. I can't confirm this from personal experience though. I've had really good experiences on every project I've been on that used option #1.

Option #2 gives you a more robust solution that can allow for true fail-over of all users to other servers. Plus, option #2 doesn't require smart hardware or other products that provide server affinity for a given user. With option #2, you can have regular load balancers.

The problems with option #2 are as follows: The data in your session will likely have to be serializable so that it can be transmitted between servers. This is tricky to accomplish in Java, and it is not easy to test. If any of your objects contain something in their data chain that isn't serializable, your application won't distribute session data on the server farm, and you won't catch that bug until you actually deploy the application to such a farm. Also, there is considerable overhead in maintaining copies of all user data on all servers in the farm. Some Application Server vendors are better and smarter at managing this than others. Your mileage will vary.
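
One cheap way to catch non-serializable session data before it reaches a server farm is to push each object through standard Java serialization in a unit test. The sketch below does that with plain java.io streams. The class and method names are my own, and this only approximates what an application server does when it replicates a session; vendors may use their own mechanisms.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

/** Hypothetical test helper: checks whether an object graph survives Java serialization. */
final class SessionSerializationCheck {

    /** Writes the value to bytes and reads it back, walking the whole object graph. */
    static Object roundTrip(Object value) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    /** Returns false if anything reachable from the value is not serializable. */
    static boolean isReplicable(Object value) {
        try {
            roundTrip(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("unexpected serialization failure", e);
        }
    }
}
```

Running isReplicable() over every object you intend to put in the session, with realistic data in it, finds the problem on your desktop instead of in production.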

Option #3 gives you all the robustness and fail-over of option #2, but it offloads the session management tasks to another group of servers.

The problems with option #3 are as follows: In addition to your data needing to be serializable, the mechanism you use to store session data will likely be custom and/or proprietary depending on which Application Server vendor you are using. Saving session data to an external server is not functionality that is part of the Java standard, but is rather something left to the imagination of each developer and/or vendor.

Option #3 is the most robust and complete, but also requires the most plumbing and overhead to accomplish.

For the current project I'm working on, we've been coding all along assuming we'd be using option #1. So we have not paid any attention to making sure our session data was serializable. Well, we finally moved our application to production, and it turns out our Load Balancer doesn't support Server Affinity through SSL. I guess it wasn't clever enough.

Our Server Admin didn't like option #1 anyway, because it's not a 100% solution that supports fail-over. So, we decided to try option #2. Which means I spent a few days re-writing a bunch of our objects that we store in the server session, because these objects contained member objects that were not serializable (and could not be made serializable because they were 3rd-party objects).
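
For anyone facing the same rewrite, one common pattern is to keep only the serializable inputs in the session object and mark the third-party member transient, rebuilding it on demand after the session has been copied to another server. The sketch below shows the shape of it. ThirdPartyFormatter is a made-up stand-in for whatever non-serializable class you are stuck with, and this is an illustration of the pattern, not the exact code from my project.

```java
import java.io.Serializable;

/** Stand-in for a third-party class that cannot be made serializable (hypothetical). */
final class ThirdPartyFormatter {
    private final String pattern;

    ThirdPartyFormatter(String pattern) {
        this.pattern = pattern;
    }

    String format(String name) {
        return pattern.replace("{}", name);
    }
}

/** Session-safe holder: stores the serializable input and rebuilds the formatter lazily. */
final class FormatterHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String pattern;
    // transient: skipped by serialization, so it is null again after deserialization.
    private transient ThirdPartyFormatter formatter;

    FormatterHolder(String pattern) {
        this.pattern = pattern;
    }

    /** Returns the formatter, recreating it if this holder arrived from another server. */
    synchronized ThirdPartyFormatter get() {
        if (formatter == null) {
            formatter = new ThirdPartyFormatter(pattern);
        }
        return formatter;
    }
}
```

This only works when the third-party object can be recreated from data you are able to serialize; if it holds state you cannot reconstruct, it probably does not belong in the session at all.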

But now the application works in a server farm, and we have 100% fail-over. It's really cool! However, this does add a lot of overhead to the servers to replicate the sessions, and it adds a fair amount of overhead to us Java coders who have to be so careful what we stick in the session, and make sure it's really serializable.

And after everything we've had to do to get this working (and I still have much work to do), I can't help but think that the Java Community should have come up with some kind of more standardized toolkit or solution for dealing with session data. I'm thinking that the HttpSession object should not even allow a coder to store non-serializable data in the session. Even if your application uses option #1 today, it might need option #2 or #3 in the future, so maybe it's best to start early with making sure all session data is serializable?
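
Until something like that exists, you can approximate it yourself by funneling all session writes through a wrapper that rejects anything not marked Serializable. The sketch below uses a plain map in place of the real session so it stands alone; StrictSession is a hypothetical name, and in a real application the same check would sit in front of the actual session's setAttribute call.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical session stand-in that refuses attributes not marked Serializable. */
final class StrictSession {
    private final Map<String, Serializable> attributes = new HashMap<>();

    /** Stores the value, or removes the attribute when the value is null. */
    void setAttribute(String name, Object value) {
        if (value == null) {
            attributes.remove(name);
            return;
        }
        if (!(value instanceof Serializable)) {
            throw new IllegalArgumentException(
                "Session attribute '" + name + "' is not Serializable: " + value.getClass().getName());
        }
        attributes.put(name, (Serializable) value);
    }

    Object getAttribute(String name) {
        return attributes.get(name);
    }
}
```

Be aware that the instanceof check only looks at the top-level object. A Serializable class can still hold a non-serializable member, so this guard catches the obvious mistakes early but is no substitute for actually serializing the data in a test.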

Microsoft .NET has the same problem as Java. You have to make your C# objects serializable if you want to use option #2 or option #3.

Despite the efforts of Java and .NET to make Enterprise Web Development simpler, there is still quite an array of technical plumbing issues that make Web Development complicated and risky and buggy. And we just have to keep all these issues in mind as we design and code the application, because they are difficult to fix late in the project. :)
