Ounce of Prevention
As the senior code reviewer, it is my responsibility to vet all incoming changes for possible performance hits.
Obviously bad:
session.setAttribute(UUID.randomUUID().toString(), new LargeXMLTree());
Not so obviously bad (but essentially equivalent):
< ... rendered="#{conditionforsomelargepage}">
    <a4j:include viewId="#{somelargepage}" />
</ ... >
Essentially equivalent because the rendered flag only suppresses output: a4j:include still pulls the whole included page into the view, and with server-side state saving that tree gets parked in the session either way.
For new code, I made sure that somelargepage was blank.xhtml if rendered was false.
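The guard amounts to a couple of lines in the backing bean. A minimal sketch, with hypothetical names mirroring the EL above:

// Hypothetical session-scoped bean behind #{conditionforsomelargepage}
// and #{somelargepage}.
public class PageBean {
    private boolean conditionforsomelargepage;

    public boolean isConditionforsomelargepage() {
        return conditionforsomelargepage;
    }

    // Hand the include a blank view unless the page will actually render,
    // so the large page's component tree never gets built into the session.
    public String getSomelargepage() {
        return conditionforsomelargepage ? "/someLargePage.xhtml" : "/blank.xhtml";
    }
}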
I didn't change the old pages. What I didn't realize (this release took 8 months from inception to get out the door, I forgot some changes, alright?) was that old pages used to be included like this:
< ... rendered="#{conditionforsomelargepage}">
    <ui:include src="#{somelargepage}" />
</ ... >
Which isn't bad at all. Unless you have ui:debug defined anywhere, in which case it is.
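For reference, the culprit can be a single innocuous line left over from development (hotkey value arbitrary):

<ui:debug hotkey="p" />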
Pound of Cure
It's hard to tell that your servers are supersaturated. The code worked fine before. We have not enabled the new UI features yet. It never came up in test. The logs show different errors each time a server goes down. Servers are going down once every few hours. What happened?
We have no response-time analytics. We have no baseline for time spent in gc. We have no way of telling session sizes in prod. We have no way of telling if this is a rogue action or a rising tide. We have no way of telling if this is an app memory bug or a server configuration problem. We have no load testing capabilities.
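Some of those gaps are cheap to close. A gc baseline, for one, only takes logging flags; on the HotSpot JVMs of that era, something like this would have given us one (an illustration, not our actual startup line):

java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log ...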
Nightmare. We had hints, though. The servers would be fine with no errors for a few hours, until west coast people signed on. The logs would show nothing except db timeouts on very basic queries. It happened around the same time each day. Eventually we rolled it back. Time wasted: 1 week.
Wasted Effort
But we got out of it. Idea one was to write a service to determine session size. *Heavily* modified jamm was the answer here.
This was a waste of time and did not contribute to the final solution. Hey, I now have a JSP I can drop into a war that will give you session size, though. Not all bad. Time wasted: 1 week.
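For what it's worth, the measuring core of that JSP boils down to very little. A sketch against stock jamm (names are mine; the real version needed the heavy modifications mentioned above):

import java.util.Enumeration;
import javax.servlet.http.HttpSession;
import org.github.jamm.MemoryMeter;

// Minimal sketch: sum the deep size of every session attribute.
// Requires starting the JVM with -javaagent:jamm.jar.
public final class SessionSizer {
    private static final MemoryMeter METER = new MemoryMeter();

    public static long sizeInBytes(HttpSession session) {
        long total = 0;
        Enumeration<?> names = session.getAttributeNames();
        while (names.hasMoreElements()) {
            String name = (String) names.nextElement();
            total += METER.measureDeep(session.getAttribute(name));
        }
        return total;
    }
}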
Repro Found
Step two was to write Selenium tests from scratch. 1500 lines of code and one hijacked OCR server later, we were able to reproduce the same server behavior. Time spent: 1 week.
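A stripped-down sketch of what such a load loop can look like (WebDriver API, with hypothetical URLs and element ids):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// Hypothetical sketch: open many independent browser sessions so the
// server accumulates one HTTP session per login, then watch the heap.
public class SessionLoad {
    public static void main(String[] args) {
        for (int i = 0; i < 200; i++) {
            WebDriver driver = new FirefoxDriver(); // fresh browser = fresh server session
            try {
                driver.get("http://test-box/app/login.jsf");       // hypothetical URL
                driver.findElement(By.id("username")).sendKeys("load" + i);
                driver.findElement(By.id("password")).sendKeys("secret");
                driver.findElement(By.id("loginButton")).click();
                driver.get("http://test-box/app/bigPage.jsf");     // page with the include
            } finally {
                driver.quit(); // the browser dies; the server-side session lingers
            }
        }
    }
}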
Using a trial version of YourKit, I spent 6 hours running the same Selenium test case and taking snapshots. By the numbers: the new code used 156% of the memory per session that the old code did. The new code did not change application memory used (retained minus session). I cannot stress enough how much this overall view was needed to prove that it wasn't a leak or an application memory bug.
Solution
After you can reproduce a problem, fixing it becomes almost trivial. You can literally just tweak at random until you see the problem go away.
Of course, I knew of an optimization beforehand, having already applied it to the new code, so the fix was not random.
Post Notes
I wrote many emails about how our testing infrastructure had holes that let this problem through. I wrote many more about the money and time costs of fixing that infrastructure, and about the pros and cons of different ways of fixing our software development life cycle.
Overall, it was one of the more interesting tasks I've had; "proving" things is in my DNA as a maths major.
Final lesson: large JSF apps simply do not scale.