Calibre’s robust conversion engine can help with formatting tasks. Converting with calibre helps when you are dealing with a format you cannot easily edit, or a format you aren’t very familiar with. It’s also helpful when the insides of the file are such a mess that they’re hard to understand.
Many of these tasks are geared toward people who already have an ebook in an existing format that they need to clean up. A common case is a book from Project Gutenberg. Ebooks from Project Gutenberg are in no way poor quality. However, a TXT file is not going to have any formatting because the format doesn’t support it, and a large number of people want to add fancy formatting to classic titles. Another reason to convert first is HTML documents produced by Microsoft Word. While Word is a great text editor, its HTML output is atrocious. There is often lots of extra formatting information that is unnecessary and unused. Using calibre as a first step can reduce the amount of time these people need to spend adding formatting.
Some common ebook formats, such as MOBI (used by Amazon) and eReader PDB, are compiled. They are not easily edited, and one of the easiest ways to make changes is to convert to a format that has robust editing tools.
There have been many ebook formats over the decades that ebooks have been around. I’m not going to go into why there are so many, but know that many of them are not widely used any more. It’s easier to take an ebook in a format like Plucker from Project Gutenberg and convert it to an easy-to-edit format than to work in the Plucker format directly.
Now that the whys are out of the way, let’s move on to the hows.
When converting from any format there are a few output formats I recommend. EPUB is of course easy to edit using Sigil. If you want to use HTML as your base format I would recommend looking into calibre’s HTMLZ output. HTMLZ is really a ZIP file with a single HTML file inside of it. One major advantage of HTMLZ over EPUB is that HTMLZ output has some options that can be used to clean up the output. For instance, HTMLZ can condense CSS styling to a basic set and produce a separate CSS file inside of the archive.
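As a rough sketch of what that looks like from a script, the Python below builds an `ebook-convert` command line for HTMLZ output and lists what ended up inside the archive. The helper names `htmlz_command` and `list_htmlz` and the file names are my own placeholders. The `--htmlz-css-type` and `--htmlz-class-style` options come from calibre’s command line tool, but check `ebook-convert --help` for your version before relying on them.

```python
import zipfile


def htmlz_command(source, dest, css_type="class", class_style="external"):
    """Build an ebook-convert argument list for HTMLZ output.

    css_type: "class", "inline" or "tag" (how CSS is written out).
    class_style: "external" (separate CSS file) or "inline" (CSS in <head>).
    """
    return [
        "ebook-convert", source, dest,
        "--htmlz-css-type", css_type,
        "--htmlz-class-style", class_style,
    ]


def list_htmlz(path):
    """Return the names of the files stored in an HTMLZ (ZIP) archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# To actually run the conversion (requires calibre to be installed):
#   import subprocess
#   subprocess.run(htmlz_command("book.mobi", "book.htmlz"), check=True)
#   print(list_htmlz("book.htmlz"))
```

Because HTMLZ is a plain ZIP archive, any ZIP tool can unpack it once the conversion is done.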
Another option for conversion is TXT output. TXT by itself will remove all formatting. If the file is very poorly formatted, it could be easier to start from scratch and add all formatting back yourself. TXT output also has the advantage of being able to produce Markdown or Textile formatted output. Both of these give basic formatting while removing extraneous formatting information.
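Here is a minimal sketch of choosing between plain, Markdown, and Textile TXT output from a script. The function name `txt_command` and the file names are placeholders of mine. The `--txt-output-formatting` option is the one calibre’s `ebook-convert` uses for this, as far as I know, so confirm it against `ebook-convert --help` on your install.

```python
TXT_FORMATS = ("plain", "markdown", "textile")


def txt_command(source, dest, formatting="markdown"):
    """Build an ebook-convert argument list for TXT output.

    formatting selects plain text, Markdown, or Textile output.
    """
    if formatting not in TXT_FORMATS:
        raise ValueError(f"formatting must be one of {TXT_FORMATS}")
    return [
        "ebook-convert", source, dest,
        "--txt-output-formatting", formatting,
    ]


# To actually run the conversion (requires calibre to be installed):
#   import subprocess
#   subprocess.run(txt_command("word_export.html", "book.txt"), check=True)
```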
Finally, when converting with calibre look at the Heuristic formatting options. They are disabled by default and should be used with care (except with TXT input that is not Markdown or Textile formatted). There are heuristics for many common formatting needs, such as detecting and formatting chapters and chapter subtitles. There are also heuristics for italicizing common cases, removing unnecessary hyphens, and fixing broken lines. There are more options than these, and it’s a good idea to go through them all and enable the ones that apply to the ebook you’re working on.
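On the command line the heuristics work the other way around from the GUI checkboxes: `--enable-heuristics` turns the whole set on, and individual `--disable-*` flags switch off the ones you don’t want. The sketch below covers only four of them. The `heuristics_command` helper and the short names in `DISABLE_FLAGS` are my own, and the flag spellings should be checked against `ebook-convert --help` for your calibre version.

```python
# Short names (mine) mapped to the ebook-convert flags that turn a heuristic off.
DISABLE_FLAGS = {
    "chapter_headings": "--disable-markup-chapter-headings",
    "italicize": "--disable-italicize-common-cases",
    "dehyphenate": "--disable-dehyphenate",
    "unwrap_lines": "--disable-unwrap-lines",
}


def heuristics_command(source, dest, skip=()):
    """Build an ebook-convert argument list with heuristic processing on.

    skip lists the heuristics (keys of DISABLE_FLAGS) to leave switched off.
    """
    command = ["ebook-convert", source, dest, "--enable-heuristics"]
    for name in skip:
        command.append(DISABLE_FLAGS[name])
    return command


# To actually run the conversion (requires calibre to be installed):
#   import subprocess
#   subprocess.run(
#       heuristics_command("gutenberg.txt", "book.epub", skip=["italicize"]),
#       check=True,
#   )
```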
My recommendation is this: if you’re starting with a poorly formatted ebook, use conversion to do some of the basic work for you. Conversion is an automated process and can take care of some tedious tasks, but it’s not a replacement for human editing.