This week saw massive improvements to TXT input. I started the week with a slew of changes and as soon as I had implemented the first of them Lee Dolsen contacted me. We’ve worked together before improving PDF input. Since then he’s done a lot of work with preprocessing of PDF and other not so clean input.
TXT input now auto detects the character encoding of the file. It isn’t 100% accurate but should work for the majority of cases. It’s using chardet for the detection. Unfortunately, cp1252 is the most common encoding that gives people issues and unless you’re using things like smart quotes and curly apostrophes it doesn’t always detect properly.
I started getting TXT input to detect the document structure. Mainly, are the paragraphs arranged in block, single line, or print fashion. Lee saw the detection code and modifying some of his preprocessing code he was able to greatly increased the detection accuracy over my initial work. He’s also added an unformatted type that assumes the text is one big blob and tries to determine paragraphs in much the same way PDF input tries to determine them. By unwrapping based upon punctuation and other factors.
In addition to detecting the paragraph style used in the document, TXT input now tries to detect the formatting of the text content. Markdown formatted text is detected. I’ve also added a heuristic processor which runs by default if either Markdown is not detected or if the user has not specified the formatting as none (which disables any type of formatting processing).
The heuristic processor uses some ideas from GutenMark. Specifically italicizing common words and certain contentions used in Project Gutenberg texts that denote italics. I started working on a set of heuristics to detect chapter headings but Lee quickly pointed out he had already created something similar using regular expressions in his preprocessing code. I quickly began using it in my heuristic processor and it’s working well. Chapter headings and subheadings are now formatted with the appropriate h tags. He has some plans to enhance the detection further using a word list.
TCR, PDB PalmDoc and PDB zTXT inputs all pass the extracted text to the TXT input plugin for processing. This allows them to take advantage of all the work that’s gone into TXT input. Also, with auto detection now being part of TXT input it should allow for one time conversion instead of convert, check, tweak some options, convert again. Especially since these formats don’t make it easy to see how the text is structured within the file without first converting.
TXT input wasn’t the only part of TXT support that was touched. I spent some time cleaning up the TXT output. Consistent spacing is now created around headings. Also, when using the –remove-paragraph-spacing option, headings are not indented with a tab. The output now looks much cleaner and I consider it acceptable for reading.
Not to be left out FB2 output got a small bug fix. With all the work rewriting it I broke having it read covers. If you were converting an EPUB for instance that specified the cover (or title page) in the guide rather than the spine it would not be included. Also, the –cover option was being ignored. Now that’s fixed and external covers are inserted properly.