Monday, July 13, 2015

When Spinning Plates Crash

JEE was not made for the working man. 

I mean, if you look at the intro docs[0] you are going to see things like custom deployment descriptor capability. In practice what this means is that a deployment person is expected to take an existing working application, packaged in a custom zip file, open the very file that governs the whole container (how it is deployed, how things are named, which classes are and are not loaded), edit incredibly ill-commented xml files, and have it all work in a new environment straight away.
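For anyone who has not had the pleasure, here is roughly what one of those files looks like. This is a made-up WEB-INF/web.xml fragment; the parameter names, datasource name, and values are invented for illustration, but the elements are the standard ones a deployer gets told to edit:

    <!-- hypothetical fragment; names and values invented for illustration -->
    <web-app>
      <context-param>
        <param-name>integrationEndpointUrl</param-name>
        <param-value>http://some-host:8080/backend</param-value>
      </context-param>
      <resource-ref>
        <res-ref-name>jdbc/AppDS</res-ref-name>
        <res-type>javax.sql.DataSource</res-type>
        <res-auth>Container</res-auth>
      </resource-ref>
      <session-config>
        <session-timeout>30</session-timeout>
      </session-config>
    </web-app>

That is the file a deployer is supposed to get right on the first try, per environment.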

No. JEE was made by corporations with agendas: BEA, Oracle, IBM, etc., each with their own "value added" piece. Specifications written by corporations are nothing more than joint advertising efforts for their products.

An anonymous greybeard from the early, cocaine-80's days of selling computer hardware once told me he went to a demo where the machine, the size of a room, had nothing inside but their tech lead and a small system faking out the real thing. They wanted $3 million for it. A guy in a box, literally hard-wiring answers.

I spent all week getting each production version its own release build branch. I'm still not done.
  • It's maven's SNAPSHOT. 
  • It's svn. 
  • It's ongoing poor deployment practices. 
  • It's slow corp machines.
  • It's corrupted outlook pst files.
  • It's rushed integrations with no real transaction support.
  • It's huge static custom css
  • It's 100 degree heat walking to and from work
  • It's interviews taking 90 minutes.
  • It's bamboo building the current repository version instead of the latest branch version.
  • It's the COMPLETE LACK OF AUTOMATED TAPE BACKUPS.
  • the list goes on and on and on and on


Saturday, July 4, 2015

When You Bungle a Deployment

Ounce of Prevention

As the senior code reviewer, it is my responsibility to review all incoming changes for possible performance hits.

Obviously bad:
  session.setAttribute(UUID.randomUUID().toString(), new LargeXMLTree());
Not so obviously bad (but essentially equivalent):
    < ... rendered="#{conditionforsomelargepage}">
      <a4j:include viewId="#{somelargepage}" />
    </ ... >
For new code, I made sure that somelargepage was blank.xhtml whenever rendered was false. I didn't change the old pages. What I didn't realize (this release took 8 months from inception to get out the door; I forgot some changes, alright?) was that the old pages used to be included like
    < ... rendered="#{conditionforsomelargepage}">
      <ui:include src="#{somelargepage}" />
    </ ... >
Which isn't bad at all. Unless you have ui:debug defined anywhere, in which case it is.
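
For reference, the guard I put on the new code was roughly the following. This is a sketch, not the exact code: blank.xhtml is just an empty facelet, the EL names are placeholders, and whether you do the switch in the EL expression or in a backing bean getter is a matter of taste.

    <!-- sketch: point the include at an empty page when the condition is false,
         so the large page never gets pulled into the view at all -->
    < ... rendered="#{conditionforsomelargepage}">
      <a4j:include viewId="#{conditionforsomelargepage ? somelargepage : 'blank.xhtml'}" />
    </ ... >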

Pound of Cure

It's hard to tell when your servers are supersaturated. The code worked fine before. We haven't enabled the new UI features yet. It never came up in test. The logs show different errors each time a server goes down. Servers are going down once every few hours. What happened?

We have no response time analytics. We have no baseline for time spent in gc. We have no way of telling session sizes in prod. We have no way of telling if this is a rogue action or a rising tide. We have no way of telling if this is an app memory bug or a server configuration problem. We have no load testing capability.
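
Some of that is expensive to build, but the gc baseline at least is nearly free. Something like the standard HotSpot gc logging flags of that era (the log path here is a placeholder) would have given us before-and-after numbers to compare:

    # standard HotSpot gc logging flags for Java 7/8; the log path is a placeholder
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/app/gc.log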

Nightmare. We had hints though. The servers would be fine with no errors for a few hours until west coast people signed on. The logs would show nothing except for db timeouts of very basic queries. It happened around the same time each day. Eventually we rolled it back. Time wasted: 1 week.

Wasted Effort

But we got out of it. Idea one was to write a service to determine session size. A *heavily* modified jamm was the answer here.

This was a waste of time and did not contribute to the final solution. Hey, I have a jsp I can drop into a war now that will give you session size, though. Not all bad. Time wasted: 1 week.
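
For the curious, the idea boils down to something like this. It's a sketch written as a servlet rather than the jsp, for readability, using the jamm API as it was around then; it assumes jamm is on the classpath and the JVM was started with -javaagent:jamm.jar so MemoryMeter can use Instrumentation.

    // sketch of the session-size idea, written as a servlet for readability;
    // assumes jamm on the classpath and -javaagent:jamm.jar at JVM startup
    import java.io.IOException;
    import java.util.Enumeration;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;
    import org.github.jamm.MemoryMeter;

    public class SessionSizeServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            MemoryMeter meter = new MemoryMeter();
            HttpSession session = req.getSession(false);
            long total = 0;
            if (session != null) {
                Enumeration<String> names = session.getAttributeNames();
                while (names.hasMoreElements()) {
                    String name = names.nextElement();
                    // deep size of each attribute's object graph
                    // (shared objects get counted once per attribute)
                    long bytes = meter.measureDeep(session.getAttribute(name));
                    total += bytes;
                    resp.getWriter().println(name + " = " + bytes + " bytes");
                }
            }
            resp.getWriter().println("total = " + total + " bytes");
        }
    }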

Repro Found

Step two was to write selenium tests from scratch. 1500 lines of code and one hijacked OCR server later, we were able to reproduce the same server behavior. Time spent: 1 week.
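
The shape of the test was nothing clever: log in, hit the heavy pages, log out, repeat, with many of these running at once. Roughly like the sketch below; the URLs, element ids, and credentials are placeholders, and the real thing was far longer.

    // sketch of the repro loop; URLs, element ids, and credentials are placeholders
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class ReproLoop {
        public static void main(String[] args) {
            WebDriver driver = new FirefoxDriver();
            try {
                for (int i = 0; i < 1000; i++) {
                    // log in, walk the pages that pull in the large includes, log out
                    driver.get("http://test-server/app/login.xhtml");
                    driver.findElement(By.id("username")).sendKeys("loaduser" + i);
                    driver.findElement(By.id("password")).sendKeys("password");
                    driver.findElement(By.id("loginButton")).click();
                    driver.get("http://test-server/app/someLargePage.xhtml");
                    driver.get("http://test-server/app/logout.xhtml");
                }
            } finally {
                driver.quit();
            }
        }
    }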

Using a trial version of YourKit, I spent 6 hours running the same selenium test case and taking snapshots. By the numbers: the new code used 156% of the memory per session that the old code did. The new code did not change app memory used (retained minus session). I cannot stress enough how much this overall view was needed to prove that it wasn't a leak or an application memory bug.

Solution

After you can reproduce a problem, it becomes almost trivial to fix. You can literally just tweak at random until you see the problem go away.

Of course, I knew of an optimization beforehand, having already applied it to the new code. So the fix was not random.

Post Notes

I wrote many emails about how our testing infrastructure had holes that let this problem go through. I wrote many more about the money and time costs of fixing that infrastructure, and the pros and cons of different ways of fixing our software development life cycle.

Overall, it was one of the more interesting tasks I've had, as "proving" is in my DNA as a maths major.

Final lesson: large JSF apps simply do not scale.