I thought I’d write a post on some bizarre behavior I recently encountered with the Java SAX parser. It might help people who are using Flying Saucer for PDF generation.
A bug was reported to us in which PDF documents generated by a web application we had developed occasionally had words split across two lines. We tried hard to establish a pattern for when the split happened, but after trying many different scenarios and much deliberation we concluded that the split was random and that no reliable pattern could be used to replicate it. It was so hard to reproduce that it could take a tester a whole day to find a single split word in the generated PDF documents.
The web application uses Flying Saucer (XHTML Renderer) to render a PDF from an XHTML document. Flying Saucer is one of the most widely used PDF renderers in the Java technology stack, so it was quite bizarre to encounter such a problem. Some googling also revealed that no one else had reported anything similar; for most people Flying Saucer worked like a breeze. So, after exhausting every other option for finding the bug in our own code, we were left with no choice but to look for it in the Flying Saucer code itself.
The web application uses a combination of Spring MVC and Apache Tiles to generate an XHTML document describing the format and content of the PDF to be generated. The application was running on an IBM WebSphere server and used WebSphere's default Java 6 SAX parser to convert the XHTML document into a W3C DOM. Our investigation found that, in the DOM, the split words were represented by multiple text nodes. For example, if 'performance' was split into 'perfor' and 'mance' in the PDF, then in the DOM 'perfor' and 'mance' belonged to two different text nodes. Flying Saucer, the PDF renderer, treats each text node separately, which caused text to break in the middle of words.
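The effect is easy to reproduce directly with the standard DOM API: two adjacent text nodes can together hold a single word, and a renderer that processes each text node as a separate run will see 'perfor' and 'mance' independently. A minimal standalone sketch (plain JAXP, not the application code — the element and strings are just illustrative):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SplitTextNodeDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element p = doc.createElement("p");
        doc.appendChild(p);

        // Simulate a parser that delivered one word as two character
        // chunks, producing two adjacent text nodes instead of one.
        p.appendChild(doc.createTextNode("perfor"));
        p.appendChild(doc.createTextNode("mance"));

        NodeList children = p.getChildNodes();
        System.out.println(children.getLength());   // 2 — two separate text nodes
        System.out.println(p.getTextContent());     // "performance" — one word
    }
}
```

The whole-element text is intact, which is why the XHTML itself always looked fine; only a consumer that walks text nodes one at a time, as Flying Saucer does, exposes the split.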
We still don’t know why the SAX parser arbitrarily creates two text nodes to represent one word. An obvious solution was to switch to a different parser, and doing so did appear to fix the problem. However, normalising the document — that is, combining adjacent text nodes so that no word can be represented by more than one text node — was a much easier option: the Java API already provides a method for this (Document.normalize()), and simply calling it fixed our problem.
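Document.normalize() is part of the standard org.w3c.dom API; it merges adjacent text nodes throughout the tree. A minimal sketch of the fix, with the split reproduced by hand (the element names are illustrative, and the hand-off to the renderer is omitted):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element p = doc.createElement("p");
        doc.appendChild(p);

        // Two adjacent text nodes holding one word, as produced by the parser.
        p.appendChild(doc.createTextNode("perfor"));
        p.appendChild(doc.createTextNode("mance"));
        System.out.println(p.getChildNodes().getLength());    // 2

        // The one-line fix: merge adjacent text nodes across the whole tree.
        doc.normalize();
        System.out.println(p.getChildNodes().getLength());    // 1
        System.out.println(p.getFirstChild().getNodeValue()); // "performance"
    }
}
```

In a Flying Saucer pipeline the natural place for the call would be on the DOM just before it is handed to the renderer (e.g. before ITextRenderer.setDocument(...)), so the renderer only ever sees whole-word text nodes.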
This seems to have solved the problem with no more word-splits reported to date.