Calibre, ePub and epubcheck: the curse of editing XHTML

Calibre is an ebook management application. It comes with a nice ebook reader too, which I use all the time. [Calibre]

Calibre is also the most common ePub generator. Its format converters are robust and battle-hardened.

This post is a record of what I actually did to make the ePub for Libra Shrugged. There are almost certainly bits I could have done better some other way, and a lot of bits where I got way too deep into technical twiddling just because I could.

(Comments suggesting using LaTeX instead will get you slapped over the Internet.)

If you don’t understand any of the technical detail here, don’t worry about it — there must surely be better ways than this.

The horror

Since this is on computers, there are some gotchas — specifically, that ePub is an absolute shower of a format, and you will be editing XHTML by hand if you want to get onto Apple Books and the other minor ebook stores.

The good news is that Calibre has a pretty good XHTML editor — right-click on book title, “Edit book”. The bad news is that you’ll need it.

If you’ve lived your life wrong enough that you’re hand-editing ePub XHTML, you should probably install epubcheck. There’s an online version, but I just installed the Java .jar file locally — it’s much faster. [EPub Validator; Github]

The developer of Calibre considers epubcheck broken, which it is, and wrong about the ePub specification, which it is. Unfortunately, Apple Books requires your book to pass epubcheck anyway, with no errors or warnings. [Apple]

At one point I unzipped the ePub into separate XHTML files. This let me hand-tweak the files directly in vim, then add them back to the ePub using zip -f (freshen). Nobody who isn’t me should expect to have to do this sort of thing, but I’m a control addict.

(I’m using zip -f and not just making a zip file of the separate files because that way, the mimetype file stays both uncompressed and first in the zip file — if it isn’t, epubcheck complains. ePub is weird and annoying.)
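A minimal sketch of that unzip, edit and freshen loop, plus a quick check that mimetype is still the first, uncompressed entry. The function names and file paths are illustrative assumptions, not anything Calibre or epubcheck provides. The check assumes the first zip entry has a 30-byte local header with no extra field, so the file name and its contents sit at byte offsets 30 to 57.

```shell
# Unpack an ePub into a working directory for hand-editing.
unpack_epub() {            # usage: unpack_epub book.epub workdir
    unzip -q "$1" -d "$2"
}

# Freshen: copy edited files back over entries that already exist in the ePub.
# Untouched entries (including mimetype) keep their position and compression.
freshen_epub() {           # usage: freshen_epub book.epub workdir
    epub=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    (cd "$2" && zip -f "$epub")
}

# Succeeds if bytes 30-57 of the file are the literal text
# "mimetypeapplication/epub+zip", i.e. mimetype is first and stored.
mimetype_ok() {            # usage: mimetype_ok book.epub
    magic=$(dd if="$1" bs=1 skip=30 count=28 2>/dev/null)
    [ "$magic" = "mimetypeapplication/epub+zip" ]
}
```

Something like `mimetype_ok book.epub || echo "mimetype is not first and stored"` after each freshen gives an early warning before running epubcheck. Note that zip exits with a non-zero status when there was nothing to freshen.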

Getting Calibre

If you have Windows or Mac, just download the latest version (5.4.2 as I write this) from Calibre and use that. [Calibre]

Update 2022: Ubuntu 22.04’s distro version of Calibre works fine. sudo apt install calibre and ignore the rest of this section, you lucky person.

I use Xubuntu. Unfortunately, Ubuntu 20.04 has a broken version of Calibre that can’t possibly start or work — Ubuntu pulled a development version from Debian, nobody noticed before release time that it literally didn’t work at all, and now the broken version’s stuck in place for the next five years. [Launchpad; ubuntu-devel mailing list]

(The broken version still has a functional ebook-viewer.)

If you’re like me and insist on using Linux, you can follow the Linux install instructions without running the installer as root. I did the isolated install per the Linux download page: [Calibre]

wget -nv -O- https://download.calibre-ebook.com/linux-installer.sh | sh /dev/stdin install_dir=~/calibre-bin isolated=y

(Calibre doesn’t offer distro packages, because the author has had so many bug reports from broken distro versions that he tells users to get the official binary instead.)

After installing Calibre to my home directory in this way, I start it from a terminal.

Convert your DOCX in Calibre

I wrote both books in LibreOffice in its native ODT format. Calibre’s conversion of LibreOffice ODT files is much better in 2020 than it was for Attack of the 50 Foot Blockchain in 2017.

But I wanted clickable indexes — so I saved the book file in LibreOffice as DOCX, and sent that to Calibre for conversion.

This is the easy part. You click the “Add book” button to import the DOCX, you right-click on the book name and go Convert books→Convert individually.

Choose the following options:

Metadata: Output format: EPUB. Enter Title, Author(s), Tags. Add the cover image.
Page setup: Output profile: Generic e-ink HD. Input profile: Default profile.
DOCX input: Do not add a page after every endnote.
EPUB output: Flatten EPUB file structure; ePub version 3

Then click “OK” to generate your ePub!
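The same conversion can probably be scripted with Calibre's ebook-convert command-line tool. This is a sketch, not what the post did: the option names are a best-effort mapping of the dialog settings above and are worth checking against `ebook-convert --help` for your version. The function name, the EBOOK_CONVERT variable, the example install path, and the title, author and tag values are all placeholders.

```shell
# Convert a DOCX to ePub 3 with roughly the dialog settings listed above.
# EBOOK_CONVERT may point at an isolated install (assumed path, for example
# ~/calibre-bin/calibre/ebook-convert); by default it is looked up on PATH.
convert_docx() {   # usage: convert_docx book.docx book.epub cover.jpg
    ${EBOOK_CONVERT:-ebook-convert} "$1" "$2" \
        --title "Placeholder Title" \
        --authors "Placeholder Author" \
        --tags "placeholder, tags" \
        --cover "$3" \
        --input-profile default \
        --output-profile generic_eink_hd \
        --docx-no-pagebreaks-between-notes \
        --epub-flatten \
        --epub-version 3
}
```

Setting `EBOOK_CONVERT=echo` prints the command line instead of running it, which is a cheap way to review the options first.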

Back cover

I wanted the ePub to finish with the back cover image from the paperback. So I added the image in the Calibre editor, which called it images/image.jpeg, and added the following code near the end of the final XHTML file:

<div><svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="100%" height="100%" viewBox="0 0 1640 2550"><image width="1640" height="2550" xlink:href="images/image.jpeg"/></svg></div>

I also needed to declare my use of SVG in content.opf by adding properties="svg" for the XHTML file it was in:

<item id="id11" href="index_split_031.xhtml" media-type="application/xhtml+xml" properties="svg"/>
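A small check for this, sketched under some assumptions: the ePub is unzipped with a flattened layout, content.opf sits next to the XHTML files, and each manifest item is on a single line. The function name is hypothetical. It lists XHTML files that contain an svg element but whose manifest item does not declare properties="svg".

```shell
# Print XHTML files that use <svg> without properties="svg" in content.opf.
find_undeclared_svg() {    # usage: find_undeclared_svg unzipped_dir
    dir=$1
    for f in "$dir"/*.xhtml; do
        [ -e "$f" ] || continue
        grep -q '<svg' "$f" || continue
        name=$(basename "$f")
        # Look at the manifest line for this file, then for an svg property.
        if ! grep -F "href=\"$name\"" "$dir/content.opf" \
                | grep -q 'properties="[^"]*svg'; then
            echo "$name"
        fi
    done
}
```

No output means every file that embeds SVG is declared. Anything printed is a candidate for the manifest edit shown above.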

Cross-references

In LibreOffice and Word, if you want multiple references to a single footnote or endnote, you create the footnote or endnote and then you add cross-references to it. These are clunky and inconvenient, but they work fine in the original program, and the links work in a PDF.

If you convert DOCX to ePub in Calibre, the cross-reference becomes a link — not to the desired footnote, but to the spot in the text where the footnote or endnote of that number is.

If you have both footnotes and endnotes, this may not even be the right footnote or endnote, because Calibre mixes your footnotes and endnotes together into a single list — then doesn’t renumber cross-references to them.

If you convert ODT to ePub in Calibre, the cross-reference becomes just plain text with the wrong footnote or endnote number.

The answer is to edit the XHTML. I had to go through my fifteen cross-references and cut-and-paste in the XHTML from the correct endnote in place of the erroneous cross-reference anchor.

Calibre claims to do cross-references correctly, from both ODT and DOCX, but I can’t say it’s ever worked properly for me. [MobileRead forum; Launchpad]

Indexes

Alphabetical index entries are plain text, not clickable links, in both LibreOffice and Word. This is obviously silly, which is why making index entries into clickable links is an open feature request in LibreOffice. [Document Foundation]

Q. What’s duller than indexing?
A. Indexing a second time, to get the ePub index right.

You might correctly note that indexes are superfluous in ebooks, which have a search function — but professionally-published nonfiction ePubs tend to have indexes with page numbers and links. And having an index does look professional as hell. (And in self-publishing, you need every advantage you can get.)

Calibre will import an index from ODT as … plain text. It shows page numbers, without hyperlinks — which is doubly useless. So don’t import from ODT if you want an index.

Calibre will import an index from DOCX, and construct hyperlinks from it! It’ll use its own linking, not the page numbers. The links all work, but the result looks visually like an HTML conversion error.

So if you want page numbers, but you also want working links: import your book as DOCX, and you get to edit the XHTML directly again. You’ll need a copy of the index with page numbers, ’cos you’re going to need to put every single page number into your XHTML by hand.

This will also force you to closely proofread your index, so … good?

XHTML filenames

Calibre creates ePub 3.2 books with .html filenames, but epubcheck requires .xhtml filenames. I fixed this with shell scripts applied to the unzipped files.

(If you just cut’n’paste these lines without understanding what I did here, you may wreck your book files, and have to start over with exporting to ePub.)

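# Rewrite links inside each split file (dots escaped so they match literally).
for j in $(seq -f %03g 0 31); do for i in $(seq -f %03g 0 31); do sed -i "s/index_split_$i\.html/index_split_$i.xhtml/g" "./index_split_$j.html"; done; done
# Rewrite the same links in the table of contents, navigation and manifest.
for j in toc.ncx nav.xhtml content.opf; do for i in $(seq -f %03g 0 31); do sed -i "s/index_split_$i\.html/index_split_$i.xhtml/g" "$j"; done; done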

Then zip -f to freshen the files into the ePub.
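The loops above only rewrite references. The post doesn't show the files themselves being renamed. A sketch that does both in one pass, assuming a flattened layout, three-digit index_split names and GNU sed, could look like this (the function name is hypothetical):

```shell
# Rename index_split_NNN.html to .xhtml and fix every reference to them.
# Run inside the unzipped ePub directory.
rename_split_files() {
    for f in index_split_*.html; do
        [ -e "$f" ] || continue
        mv "$f" "${f%.html}.xhtml"
    done
    for f in index_split_*.xhtml toc.ncx nav.xhtml content.opf; do
        [ -e "$f" ] || continue
        # One regex covers all file numbers; the dot is matched literally.
        sed -i -E 's/(index_split_[0-9]{3})\.html/\1.xhtml/g' "$f"
    done
}
```

One caveat: zip -f only updates entries that already exist in the archive, so renamed files would have to be added under their new names, and the old .html entries removed (for instance with zip -d), as a separate step.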

<li> in headings

Calibre adds an <ol><li></ol> to every heading and subheading. Every ePub reader seems to handle this fine — except FBReader, my favoured ebook reader on Android, which displays a “1.” before each header.

The actual XHTML looks something like:

<ol class="list_"> <li id="id_RefHeading___Toc28800_897132658" value="2" class="block_10"><b class="calibre5">Introduction: Taking over the money</b></li></ol>

Solution: after you’ve unzipped the files, go through and remove every <ol></ol>, convert the <li></li> to <p></p> and remove the value= attribute from the <p> or else epubcheck complains.

Alternate or additional solution: check stylesheet.css for display: list-item; on styles that shouldn’t have it, and replace those with display: block;.
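Both fixes can be sketched with sed and grep. This assumes GNU sed and that each heading's ol, li and closing tags sit on one line, as in the sample above. The function names are hypothetical. A genuine one-item numbered list written on a single line would be converted too, so it's worth reviewing the changes before zipping anything back up.

```shell
# Turn Calibre's one-item heading lists into plain paragraphs, in place.
# Only handles an <ol>, its single <li> and both closing tags on one line.
unlist_headings() {        # usage: unlist_headings index_split_001.xhtml
    sed -i -E \
        -e 's#<ol[^>]*> *<li([^>]*)>(.*)</li> *</ol>#<p\1>\2</p>#' \
        -e 's#(<p( [^>]*)?) value="[^"]*"#\1#' \
        "$1"
}

# Print the stylesheet lines that set display: list-item, for manual review.
find_list_item_styles() {  # usage: find_list_item_styles stylesheet.css
    grep -n 'display: *list-item' "$1"
}
```

The second function only reports line numbers. Deciding which of those styles belong to headings, and so should become display: block, is left as a manual step.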

Font troubles

If you wrote the book in a particular font, the index generated from a DOCX may be in whatever Calibre thinks is a good default font — and this default font may show up elsewhere. The quickest way to fix this is to edit stylesheet.css and remove the wrong font.

Remove back-arrows from footnotes

Calibre puts a back arrow ← character on every footnote or endnote. This renders fine on most ePub readers, but fails on some old ones. I removed it entirely from the file containing the endnotes. I think the endnotes also look better without the arrows.

Delete calibre_bookmarks.txt

If you use Calibre’s ebook-viewer, it’ll add a file called META-INF/calibre_bookmarks.txt to your ePub. Remove this or epubcheck will complain.
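One way to remove it without unzipping anything is zip -d, which deletes an entry from an existing archive. A minimal sketch, with a hypothetical function name:

```shell
# Delete the viewer's bookmarks file from the ePub, if it is there.
strip_calibre_bookmarks() {   # usage: strip_calibre_bookmarks book.epub
    if unzip -l "$1" | grep -qF 'META-INF/calibre_bookmarks.txt'; then
        zip -d "$1" 'META-INF/calibre_bookmarks.txt'
    fi
}
```

Opening the book in ebook-viewer again may put the file back, so this probably belongs at the very end, just before the final epubcheck run.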

Kindle Previewer

Amazon provides Kindle Previewer for Windows or Mac. It doesn’t work in Wine, so I put it in my Windows 10 VM under VirtualBox — you can just download Windows 10 and run it unactivated. [Amazon; Microsoft Windows 10]

Look over every page of your ePub extremely carefully — this is precisely what Amazon will make of your ePub.

I also checked in ebook-viewer and FBReader. You should check in whichever ePub readers you personally use.

Content issues

Draft2Digital requires that you not have the following:

“Competitor Links: The content contains links to sales channels that are in direct competition with the chosen sales channels.”
“Competitor Reference: The content contains references to sales channels that are in direct competition with the chosen sales channels.”

This means that Apple doesn’t like links to Amazon, or even mentioning it. The only such link was in the bit at the end advertising Attack, so, fine — I removed that line.

Smashwords didn’t want page numbers on the table of contents, so I removed those for the Smashwords upload. They didn’t fuss about links to Amazon, though.

Why should I bother to do all of this?

(You probably shouldn’t. I’m just like this.)

An ePub that passes epubcheck with no errors or warnings is a thing of joy! Probably.

A more robust file will work on more ebook readers, and your customers will be happier.

But mostly, you’ll bother doing this if you can tap your inner reserves of extreme fussiness and perfectionism and wanting to make your beautiful literary baby as well-presented as possible. That works too.

Also, you probably have to be a huge nerd. But at least the book will be pretty and work everywhere.


10 Comments on “Calibre, ePub and epubcheck: the curse of editing XHTML”

Ingvar says:

6th November 2020 at 11:07 am

Don’t forget the “joy” that is ebook page numbering. If you use the “multiple XHTML” files model, at least multiple Kobo models will reset the page counter (and the expected number of pages) every time you change to a new file (usually, I guess, “new chapter”). This makes it quite hard to see how much book is left, as well as making it hard to find a specific page just from page number.

Samuel S Abram says:

9th November 2020 at 12:08 pm

Thank you so much for this! I don’t write books, but I do compile EPUBs, and this is a great resource! Once again, thank you! I’m in the middle of “fixing” a Cory Doctorow book (to wit, it’s Someone Comes to Town, Someone Leaves Town), and fixing all the bugs is a pain in the ass! I’ve even stopped doing it because it was so frustrating.

1. David Gerard says:
  
  10th November 2020 at 12:54 am
  
  If this is a “resource” for you, then you have my sympathies 😉
  
  Real-world ePub seems to be a nightmare format. You can either have a specification-correct ePub that passes epubcheck, or you can have an ePub that works on everything out in the wild. I can see why Apple just went “OK, you have to pass epubcheck, we’ll just make our reader work with ePubs that do” and left it there – they have, after all, only a single reader to worry about.
  
  The real problem is that I’m doing fancy stuff here – footnotes, and the cross-ref footnotes thing is rickety even the way LO and Word do it. Tra la la.
  
Rob says:

9th November 2020 at 12:57 pm

FYI, the smashwords epub, uploaded to google books for reading displays a bullet point at the start of each heading: https://photos.app.goo.gl/4g9MeRRU7NAheiZo7

1. David Gerard says:
  
  9th November 2020 at 7:49 pm
  
  Good lord. Is that the case on all headings?
  
  THERE ISN’T AN <li> ELEMENT ANYWHERE IN THAT XHTML FILE!! WHY IS IT DISPLAYING A SPURIOUS BULLET POINT??? AAAAAAAA
  
  so yeah, did I mention that ePub compatibility is a frickin’ nightmare
  
  1. Rob says:
    
    10th November 2020 at 2:01 am
    
    So, this gets weirder:
    – In play books, bullet point for every heading (web and android)
    – Moon+ reader, it’s just fine (yay)
    – In iBooks on MacOS – Bullets for every heading *that isn’t on the first page of the chapter* as laid out by ibooks (If I shrink text / grow window so headings flow to the first page they lose their bullets).
    
    1. David Gerard says:
      
      10th November 2020 at 9:49 am
      
      what the hell is it even doing, where is it getting these spurious bullets from
      
    2. David Gerard says:
      
      10th November 2020 at 9:53 am
      
      this is the point at which it’s helpful that ebook sales are >90% Kindle, and if you pass the Kindle Previewer you’re mostly good. But the Apple and Smashwords readers are noisy and a bit more influential than their numbers, so I’m trying here …
      
    3. David Gerard says:
      
      10th November 2020 at 3:58 pm
      
      SOLUTION DISCOVERED!
      
      The headings had a style that included “display: list-item;” – remove that from stylesheet.css, and all is well
      
      thanks to Rob and Andrew Hickey for working this one out. Additional paragraph added to the post.
      
David Gerard says:

9th November 2020 at 7:52 pm

Just tried pandoc 2.5 (as included in Ubuntu 20.04). The ODT source is reduced to just the dumpster fire picture at the end of the book, nothing else. The DOCX source is mostly good, but pandoc renders the cross-reference endnote anchors as plain text, and doesn’t pass epubcheck.