Calibre Week in Review

Once again this is a big week with a lot of little changes. The majority of which were related to TXT input.

Format Specifications

I was thinking about the fact that for all of the formats I support I use the format specification to know how the reading and writing should happen and how they aren’t part of calibre proper. I have a set of documents that outline what is known about each format I handle. I say what is known because in some cases (eReader) the binary format is reverse engineered and a lot of it is guess work. The documents I use are partly a collection of information available in (sometimes many) different places and some of it is my own work. I’ve now added these documents to calibre proper in the top level format_docs directory. Hopefully people will find this useful and help others work on these formats.

GUI

Recently there was a request to add auto complete to (just like it is in tags) to the authors metadata field in the GUI. I added this a few versions ago and it caused an uproar. Many people loved the feature and many people hated how after completing it would add the completion character at the end of the completion. Even though when you save the changes the completion character is removed people a small group of vocal users didn’t like the way it looked while editing. Kovid changed completion so the separator character isn’t inserted after completion and since I as well as other liked this behavior said that I should re-implement it as a tweak. So in 0.7.45 set the tweak completer_append_separator to True to have it insert the separator character after completion.

Heuristic Processing

Lee and I did some more work on Heuristics. Mainly he did the work. I’ve pretty much just been getting the options set up on the command line and in the GUI for him. There is a new option for replacing soft scene breaks with a hard scene break. The replacement text is user defined but the history drop down comes preloaded with a number of common cases.

I did a little heuristic work myself. Namely I tweaked the italicize patterns to make them more robust and I in the process I simplified them.

FB2

FB2 output was updated to handle creating soft scene breaks added on empty paragraphs and top margins. Because FB2 does not specify how the document is supposed to look (this is left to the reader software, elements only define type not layout) I chose inserting blank lines between paragraphs to create scene breaks.

PML

PML Input had some tweaks regarding soft scene break. I reduced the number of empty lines between paragraphs to create a soft scene break. I haven’t seen any documents that need this change. However, the more I thought about how it was handled, I realized that a valid document can use fewer lines.

Now that PML Input retains soft scene breaks it’s only natural to have PML output write them. Empty paragraphs and margin based spacing are both accounted for. In addition I added support for left margins being retained in the resultant PML.

There was one small bug fix. Looking over the PML docs again I noticed that c and r codes need to be closed on the next line following their opening. I modified the output code to ensure this happens. There was some general work to produce cleaner output as well.

While I was working on the above I decided that since I previously changed PML input to create a multi-level TOC that I should also have PML output write a multi-level TOC. Currently this is based on the tags being pointed to by the TOC items and by them not being headings. Only Cn TOC markers are supported at this time. Xn markers are going to need a bit more work.

TXT

TXT input paragraph processing was restructure so paragraph transformations are always applied. Previously they were not being applied when Markdown or Textile formatting was used. A user on MobileRead had modified their TXT file and simply added #’s in from of the headings to have them formatted in the output. The user did not make any other changes to have their document conform to Markdown and the resultant output was not very nice. I seems very common for users to simply stick Markdown or Textile formatting into their documents and rely on calibre to clean them.

Dehyphenatation of TXT input was tweaking. It now looks for heuristics and dehyphenate options to be enabled. In this case it will be run over all TXT input including Markdown and Textile formatted documents.

There were a few bug fixes related to various issues. Spaces at the beginning of lines were not properly preserved. Spaces within documents were getting converted to entities when they shouldn’t. A regression that broke block formatted paragraphs was fixed.

Print formatted documents not have the indents retained.

For people like me who do not like indented paragraphs I’ve added an option to remove indents from TXT input documents.

There was one small bug fix in TXT output and that was to have TXT output show all TOC items. Previously it was only showing top level items.

TXTZ

I’ve added support for a new pseudo format call TXTZ. It’s essentially just TXT files put into a zip archive with the extension .txtz. It can contain images which should make working with Markdown and Textile formatted text easier. Also, it has metadata support via an OPF file called metadata.opf within the archive. This OPF file will be referenced for metadata reading and writing. Both input and output of TXTZ support has been added.