I’ve been posting my week in review based on a week starting on Monday for some time now. I’ve been thinking about this and it doesn’t really make much sense. Calibre has a release pretty much every Friday now. So starting next week I’m going to change my week in review to run Friday through Thursday. This way the features I talk about in my review will be in the just-released version.

TXT Input

First the small changes. Heuristic processing now enables smarten punctuation to further my goal of TXT documents coming out looking great. Hard scene breaks are now separated from the text to ensure they don’t accidentally get merged into the paragraph before or after. The formatting type none was renamed to plain to correspond with the formatting output option.

The only big change for TXT input is a new paragraph type option called off. When it is specified, no modifications are applied to the paragraph structure of the text. This is especially useful for Markdown and Textile formatted documents. It ensures there are no changes that will cause elements to render incorrectly.

TXTZ Input

A bug caused images to not be included when converting. With Kovid’s help this has been corrected.

TXT Output

I modified Textile output to not write %’s for span tags. The span tag is superfluous in calibre’s Textile output because it does not contain any real information. The span tags are invisible when rendering the XHTML. The %’s cluttered up the resultant TXT so they were removed.

PML Input

PML input saw a lot of work relating to t and T tags. The entire handling of these tags was rewritten. Unfortunately, there is no way to have these two tags map one to one to XHTML, so only some common cases are handled.

  • T’s that do not start the line are ignored.
  • t’s that start and end the line use a margin for the text block.
  • t’s that start a line and end another line use a margin for the text block.
  • t’s that start a line but end before a line ending will use a text-indent.
  • t’s that are in the middle of lines are ignored. Open and closed t blocks within a line are ignored.

Heuristics

Once again the italicize common cases regex was tweaked. This time it was to fix an issue with None being inserted in the text before adjacent underscores. I’m hoping this is the last time for a while that I need to tweak it.

Kindle Interface

The work I did on the APNX format was undertaken for a very real-world reason: integrating APNX generation into calibre’s Kindle device interface plugin.

The 0.7.45 release saw the initial inclusion of this feature. After receiving some user feedback I’ve tweaked it for the 0.7.46 release. The 0.7.45 release included a very basic APNX file that would create a page every 1024 bytes of uncompressed HTML.

In 0.7.46 there are a lot of differences. Writing the APNX can be disabled. This is very useful for Kindle 2 users, as the Kindle interface works for both the Kindle 2 and the Kindle 3.

There are now two parsers for generating pages. The default is the fast parser. It uses the uncompressed length of the MOBI HTML and creates a page every 2300 bytes. A few users complained that 1024 created too many pages, about double what you would find in an average paperback book. The 2300 number is a bit more than double 1024. I chose it after counting the number of characters in a page of an average paperback book: I counted approximately 2240 and added an additional 60 characters to account for markup per page. Thus 2300.
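To make the arithmetic concrete, here is a minimal sketch of the fast approach in Python. It is not the actual calibre code: the function name `fast_page_offsets` is made up for illustration, and treating offset 0 as the start of the first page is an assumption.

```python
def fast_page_offsets(text_length, bytes_per_page=2300):
    """Return the byte offset at which each page starts.

    text_length is the uncompressed length of the MOBI HTML.
    Hypothetical helper: assumes the first page starts at offset 0.
    """
    return list(range(0, text_length, bytes_per_page))


if __name__ == "__main__":
    # A 10000 byte book: compare the new page size with the old one.
    print(len(fast_page_offsets(10000)))        # page size 2300
    print(len(fast_page_offsets(10000, 1024)))  # page size 1024
```

For a 10000 byte book this gives 5 pages at 2300 bytes per page versus 10 pages at 1024, which matches the "about double" complaint.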

The other parser, which can be enabled in the Kindle interface’s settings, is the accurate parser. It works by decompressing the MOBI HTML and looking at the actual content. The big difference, and the reason I’m calling it an accurate parser, is that it looks at the amount of visible text to decide when a page ends and a new one begins. The assumption is there are 30 lines per page and each line can have up to 70 characters. The parser starts a new line every time it encounters a new paragraph and every 70 characters in a paragraph.
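The line counting rule can be sketched roughly like this. It is a simplified illustration, not the real parser: it takes a list of already extracted paragraph strings instead of MOBI HTML, it reports positions in the visible text rather than byte offsets in the file, and the name `accurate_page_starts` and the choice to count an empty paragraph as one line are my assumptions.

```python
def accurate_page_starts(paragraphs, chars_per_line=70, lines_per_page=30):
    """Return (paragraph_index, char_index) pairs marking where each page begins.

    Hypothetical helper: works on plain paragraph strings, not MOBI HTML.
    """
    pages = []
    lines = 0
    for p_idx, para in enumerate(paragraphs):
        # Every paragraph starts a new line, and a long paragraph wraps
        # onto another line every chars_per_line characters.
        # Assumption: an empty paragraph still takes up one line.
        for start in range(0, max(len(para), 1), chars_per_line):
            if lines % lines_per_page == 0:
                # The line count just reached a page boundary.
                pages.append((p_idx, start))
            lines += 1
    return pages


if __name__ == "__main__":
    # 20 paragraphs of 140 characters: 2 lines each, 40 lines in total.
    book = ["x" * 140] * 20
    print(accurate_page_starts(book))
```

With 20 paragraphs of 140 characters each, every paragraph takes two lines, so the second page begins at the start of paragraph index 15 (line 30).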

The major disadvantage of the accurate parser, and the reason it’s not the default, is that it’s slow. It requires the text to be decompressed and parsed. With a PalmDoc compressed file this can take a few seconds, but with a HUFF/CDIC compressed file it can take minutes.

The other, minor disadvantage of the accurate parser is that it cannot work on DRM content. The fast parser can because the uncompressed text length is stored unencrypted in the MOBI header. If the accurate parser is chosen it will fall back to the fast parser for DRM content. So whenever a Mobipocket book (AZW, MOBI, PRC) is sent to the Kindle, an APNX file can and will (unless disabled) be generated.
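The selection logic described above boils down to something like the following sketch. The function name, its parameters, and the "fast"/"accurate" strings are hypothetical and are not the plugin's real settings or API.

```python
def select_apnx_parser(apnx_enabled, requested, has_drm):
    """Pick the parser used to build the APNX file, or None to skip it.

    Hypothetical helper: requested is either "fast" or "accurate".
    """
    if not apnx_enabled:
        # The user turned APNX generation off, so nothing is written.
        return None
    if requested == "accurate" and has_drm:
        # The accurate parser needs the decrypted text, so fall back.
        return "fast"
    return requested


if __name__ == "__main__":
    print(select_apnx_parser(True, "accurate", has_drm=True))
```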

One thing I will note about the accurate parser is that it currently ignores all markup and only looks at text. That means it can be made even more accurate by accounting for <div class="mbp_pagebreak" />, <br>, <hr>, images, margins, and font size changes. I do plan to add support for most if not all of these in the future. But most books people read on their Kindle are pretty much all text, and the accurate parser already does a good enough job of giving page numbers that correspond to the page length in a paperback book, so I don’t see a pressing need to spend the time on it at this moment.