Converting LaTeX to ePUB

This is my first ePUB. I knew that the format existed and little else. I do not read ePUBs; I prefer PDFs even on my phone. The ePUB format is based on (X)HTML, so my first hurdle was converting LaTeX to HTML. I tried many converters (Pandoc, htlatex, make4ht, LaTeXML...) and they all failed. For a complex file with 7237 LaTeX commands, the converters crashed and did not produce any output. These are examples of LaTeXML and make4ht crashing

I am not alone in my adventure of trying to convert LaTeX to HTML; see Damien Desfontaines’ detailed round trip [1]. My approach has been different from his.

MAKE4HT

Since all the converters crashed with the more complex file, I tried a simpler file with 1076 LaTeX commands. Htlatex crashed again, but its cousin make4ht produced an output after reporting tons of errors

The output was horrible: low-resolution images for equations, misalignment, block equations only partly converted to images that did not align with the rest of the equation, and formatting lost in some inline equations

Incorrect reference labels and references not linked to the bibliography section at the end of the HTML file

(Yes, make4ht writes three f's when there are only two in the LaTeX file).

Incorrect formatting for some headings

There is no space between the figure counter and the figure caption

Broken formatting of all the section headings

Some broken diagrams

Moreover, semantic markup is lost. For example, inline quotes are converted to ordinary italics enclosed in “”, but make4ht does not even use the italic tag <i> in this case, but a simple <span> tag with an attached fo-style class

As you can see from some of the examples there are encoding problems as well. The –unicode flag solved some encoding problems but not others.

I also tried the MathML option and the problems persisted, with equation images now replaced with parsing errors

and the MathJax option was no better

Moreover, at some point in the conversion, make4ht broke the HTML file entirely, and everything after that point was formatted in a tiny font with center alignment and margins outside the edges of the page

LATEXML

Pandoc and LaTeXML continued to crash for the simple LaTeX file with a thousand commands. From the reported errors, I was able to guess the problem with LaTeXML, and after editing the source LaTeX file to remove tikzcd diagrams, LaTeXML produced an output, although it still reported 10 warnings and 8 errors (removing tikzcd diagrams did not work for the more complex LaTeX file, because the list of errors was long, as you can see above).

LaTeXML worked better than make4ht. It produced nice MathML equations

All sections were correctly identified and formatted

and the references to the bibliography were correctly numbered and linked

but incorrect format for some headings remained. A space was correctly inserted between the figure counter and the figure caption; however, the bold formatting for the counter was lost

Image panes were misaligned

This could be solved by manually adding a <br> tag between both panes in the HTML file.

Semantic markup was still lost, and diagrams were not drawn because I had to remove them so that LaTeXML would not crash. Unicode worked fine out of the box. An interesting issue is the format of the lists. By default, list bullets appear in a different block from the list items

One of the developers recommends correcting this behavior with CSS styles. This is fine when you are setting up your own website, but not when you plan to distribute an ePUB, as some eReaders ignore publishers’ stylesheets. Workarounds are discussed in [2].

MY OWN CONVERTER

Even in those cases where make4ht and LaTeXML convert the LaTeX file and produce an HTML file, both require so much preprocessing and postprocessing to get exactly what I want that it would be easier to convert the file manually. Moreover, few eReaders support MathML, so I had to find an alternative way to render equations. All those difficulties together led me to develop my own LaTeX to HTML converter.

The ePUB format is based on HTML, so the first task is to convert LaTeX to HTML. Actually, the script first produces an XML file for verification purposes and then produces an HTML file from it.

First of all, a brief life lesson. I am familiar with sed (the Linux utility) and wanted to use it; however, every expert I consulted said it was "impossible" to use sed to parse nested tags, because LaTeX, unlike HTML and XML, does not have named end tags:

\foo{... \bar{...} ...}

versus

<foo>... <bar>...</bar> ...</foo>

Their argument is that regular expressions cannot tell which closing brace } matches a given opening brace and, in fact, the code and examples found on the Internet generally fail to parse LaTeX except in simple cases such as \section{Introduction}, which is converted to <H3>Introduction</H3> using the sed expression s#\\section{\([^}]*\)}#<H3>\1</H3>#.

Well, I am using sed. I parse the innermost LaTeX tag using an enhanced regular expression and then iterate through the whole document until there are no more tags to convert. The lesson here is that you should not stop trying something just because some expert tells you that it cannot be done.
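The innermost-first idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the sed script described here: the TAGS table and the convert function are hypothetical names, and the pattern handles only commands of the form \name{...}.

```python
import re

# Hypothetical command-to-tag table, for illustration only.
TAGS = {"section": "h3", "emph": "em", "textbf": "b"}

# \name{body} where body holds no braces: this can only be an innermost command.
INNERMOST = re.compile(r"\\([A-Za-z]+)\{([^{}]*)\}")

def convert(text):
    """Rewrite innermost commands repeatedly until nothing changes."""
    def replace(match):
        tag = TAGS.get(match.group(1))
        if tag is None:
            return match.group(0)  # unknown command: leave it untouched
        return f"<{tag}>{match.group(2)}</{tag}>"

    while True:
        converted = INNERMOST.sub(replace, text)
        if converted == text:  # fixed point: no more tags to convert
            return converted
        text = converted

print(convert(r"\section{A \emph{very \textbf{bold}} claim}"))
```

Each pass only touches commands whose argument contains no braces, so the closing brace is never ambiguous. The example needs three passes: \textbf first, then \emph, then \section. An unknown command is left as it is, which also leaves any command wrapped around it unconverted.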

If I had enough free time (or if someone paid me for this work) I would use a full programming language such as Perl or Python to write a more complete piece of software that could parse any LaTeX file, but my current sed-based script is sufficient for my purposes. It took me only two weeks to write and test it. I now have a script that takes an input file with thousands of LaTeX commands and produces a beautiful HTML file with autonumbered chapters, sections, and figures, with table of contents, mathematics, hyperlinks, front page, and cover.

I will explain briefly how it produces a table of contents with chapters, since this information is hard to find on the Internet. To produce the table of contents, the script first parses the whole document, finds the chapters that do not have the no-toc directive, and writes all their headings to an auxiliary file toc.txt. A second part of the script converts the headings in toc.txt to hyperlinks, and the entire content of this file is inserted into the HTML file at the point where the command \tableofcontents was in the LaTeX file.
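The two passes can be sketched as follows, again in Python rather than sed. Several things here are assumptions made for the illustration: the starred form \chapter*{...} stands in for the no-toc directive, an in-memory list stands in for toc.txt, and the ch1, ch2... anchor names and the insert_toc function are hypothetical.

```python
import re
from html import escape

# \chapter{Title} or \chapter*{Title}; the star plays the role of "no-toc" here.
CHAPTER = re.compile(r"\\chapter(\*?)\{([^{}]*)\}")

def insert_toc(source):
    """Number the chapters, collect their headings, and fill in \\tableofcontents."""
    entries = []  # stands in for the auxiliary file toc.txt
    counter = 0

    def heading(match):
        nonlocal counter
        starred, title = match.group(1), match.group(2)
        if starred:  # excluded from the table of contents, not numbered
            return f"<h2>{escape(title)}</h2>"
        counter += 1
        anchor = f"ch{counter}"
        entries.append((anchor, f"{counter}. {title}"))
        return f'<h2 id="{anchor}">{counter}. {escape(title)}</h2>'

    # First pass: convert the headings and record the ones that go in the TOC.
    body = CHAPTER.sub(heading, source)

    # Second pass: turn the recorded headings into links and insert them.
    links = "\n".join(
        f'<li><a href="#{anchor}">{escape(text)}</a></li>' for anchor, text in entries
    )
    return body.replace(r"\tableofcontents", f"<nav><ul>\n{links}\n</ul></nav>")

sample = "\\tableofcontents\n\\chapter{Intro}\n\\chapter*{Preface}\n\\chapter{Method}"
print(insert_toc(sample))
```

The table of contents has to be built after the whole document has been scanned, because \tableofcontents usually appears before the chapters it lists.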

These are the advantages of my script over the alternatives:

It always works. I mean it always produces output, even in cases where it does not understand part of the original LaTeX file (in which case that part is not converted). My main problem with Pandoc, htlatex, make4ht, LaTeXML... was that they crashed with complex LaTeX input and produced no output.
It is fast: 5 times faster than make4ht and 72 times faster than LaTeXML.
It produces semantic tags. For instance, it takes my LaTeX macro for inline quotes and produces the corresponding semantic code <q>...</q>, whereas the alternatives produced the presentation code "<i>...</i>".
Most e-readers do not support MathML, and the use of images to render mathematics is bloated, ugly, and inaccessible. My script produces a mix of HTML, linearized notation, and CSS tricks to render mathematics. If commercial e-readers had better CSS support, I could produce more beautiful renderings.
It produces lean HTML code. The alternative tools produce bloated code that sometimes rivals the ugly HTML code produced by Microsoft products

[Figure panels: HTML code produced by my script, LaTeXML, and make4ht]

The HTML file is finally converted into ePUB format. This is a video of the conversion.

The script first generates an XML file for both automatic and manual evaluation. It is easier to spot something such as <note>...</note> than the corresponding HTML code <div class="note">...</div> in a sea of </div> tags. Once the XML file is approved, it is converted to HTML along with additional tag and label files; cover and title-page images (png) are also produced. Once the HTML file has been reviewed and approved, the script requests ebook metadata such as BISAC codes, series number, description... and produces the final ePUB file, removing all the intermediate files.
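The XML to HTML step can be pictured with a short Python sketch based on the standard xml.etree.ElementTree module. It is an assumption-laden illustration, not the actual script: the CUSTOM set and the xml_to_html function are hypothetical, and only the <note> element mentioned above is handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of custom elements; each becomes <div class="name">.
CUSTOM = {"note"}

def xml_to_html(xml_text):
    """Check that the XML is well formed, then rename the custom elements."""
    # fromstring raises ET.ParseError on malformed input: a cheap automatic check.
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag in CUSTOM:
            element.set("class", element.tag)
            element.tag = "div"
    return ET.tostring(root, encoding="unicode")

print(xml_to_html("<body><p>Text</p><note>Careful.</note></body>"))
```

Parsing the intermediate file with an XML parser catches unbalanced tags before any HTML is written, which is one way to automate part of the verification.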

The ePUB format is simply a special zip file with all the HTML, CSS, fonts, and images plus additional navigation and metadata files

[Figures: contents of the zip file, the ops folder, and the xhtml folder]
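What makes the zip file "special" can be shown with Python's zipfile module. The container format requires the first entry to be a file named mimetype, stored without compression, containing application/epub+zip, plus a META-INF/container.xml file that points to the package document. Everything else in this sketch is an assumption: the build_epub function, the file names under ops, and the placeholder contents are hypothetical, and a real book needs a complete package document and navigation file.

```python
import io
import zipfile

# Points the reading system at the package document (path is hypothetical).
CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="ops/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def build_epub(target, files):
    """Write an ePUB container to target (a path or a file-like object)."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry must come first and must not be compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER)
        for name, data in files.items():
            archive.writestr(name, data)

# Placeholder contents only; this shows the layout, not a valid book.
buffer = io.BytesIO()
build_epub(buffer, {
    "ops/package.opf": "<package/>",
    "ops/xhtml/chapter1.xhtml": "<html/>",
    "ops/css/style.css": "body { margin: 0; }",
})
print(zipfile.ZipFile(buffer).namelist())
```

An ordinary zip tool may compress the mimetype entry or place it last, so writing the entries explicitly and in order is the safer route.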

This has been my adventure preparing an ePUB from a LaTeX file in order to distribute my thoughts and knowledge in electronic format to people who do not like to read PDFs.

REFERENCES

[1] https://desfontain.es/privacy/latex-to-html.html
[2] https://tex.stackexchange.com/questions/352634/latexml-itemize-creates-weird-ul