From LaTeX to HTML

I am a big LaTeX user.

I am a bit less of a LaTeX fan, but that has to do with those days when you are sitting in front of your editor, struggling to get that one paragraph laid out correctly. You have five browser windows open with Google searches and Stack Exchange results, but it is still not working. And then you discover that it is either something simple that you overlooked, or that you have run into a situation where you need to modify a sparsely documented length. And then asking a friend doesn’t help anymore, as most people in my environment ask me to solve their LaTeX problems. It reminds me of my experiences with CSS: clearly someone thought about it, but somehow he or she was thinking in a different way than I do.

Converting LaTeX to HTML is a similar type of problem. Several tools exist and Google knows a lot about the problem, but there are many voices and many opinions, and nothing looks definitive to me.

I am not a specialist on the problem (maybe not yet :-)), but before actually doing anything, I started by thinking about it.

First thing to think about: how easy would it be? For those not familiar with LaTeX, I include a short example here. The figure below is the resulting output after processing.

\documentclass[12pt,a4paper]{article}
\usepackage{graphicx}
\author{John Doe}
\title{A short example of \LaTeX}
\begin{document}
\maketitle
\section{Introduction}
A short example to show what \LaTeX\ code looks like.

At least, making paragraphs is simple: a blank line in the code is sufficient for that.
\section{Examples}
\subsection{Equations}
$$C_p = \frac{P}{\dot{T}}$$
\subsection{Figures}
\begin{center}
\includegraphics[width=7cm]{figure1.pdf}
\end{center}
\end{document}


It is not a piece of code that will lead to the finely typeset output you’d expect from LaTeX; rather, it was selected to show some essentials. You recognise the presence of a header (everything above \begin{document}, the preamble in LaTeX jargon) and of the actual content (everything in the document environment). Inside, you see commands that generate text: \maketitle generates a title block based on the information in the preamble. You also recognise the way the titles are indicated: their hierarchical nature is given, not the way they need to look.

Thus, to people familiar with HTML, it looks as if, once all those commands like \maketitle are processed, a reasonable mapping of the LaTeX markup (because most of the LaTeX code is essentially markup) onto HTML is possible. The obvious example is to map \section{...} in this case onto <h1>...</h1>.
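
To make the mapping idea concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the function name headings_to_html and the choice of <h1> and <h2> are mine, it handles only unstarred \section and \subsection commands whose titles contain no braces, and a real converter would need a proper TeX parser rather than a regular expression.

```python
import html
import re

# Hypothetical mapping from LaTeX sectioning commands to HTML heading tags.
HEADING_TAGS = {"section": "h1", "subsection": "h2"}

# Matches \section{...} or \subsection{...} with a title that contains no braces.
HEADING_RE = re.compile(r"\\(section|subsection)\{([^{}]*)\}")


def headings_to_html(latex: str) -> str:
    """Replace simple sectioning commands by HTML headings."""

    def replace(match: re.Match) -> str:
        tag = HEADING_TAGS[match.group(1)]
        title = html.escape(match.group(2))  # escape &, < and > in the title
        return f"<{tag}>{title}</{tag}>"

    return HEADING_RE.sub(replace, latex)


print(headings_to_html(r"\section{Examples}"))       # <h1>Examples</h1>
print(headings_to_html(r"\subsection{Equations}"))   # <h2>Equations</h2>
```

Anything the pattern does not recognise, such as \maketitle or the equation, passes through unchanged, which shows how much work is left after this first step.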

There are a few simple problems to resolve. For example, LaTeX does an enormous amount of work for you. Note that I didn’t have to include any numbering for the sections, yet the numbers appear in the output: that is LaTeX at work. A converter has to reproduce such things itself, which requires some extra work, but nothing special.
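
As an illustration of that extra work, the following Python sketch reproduces the section numbering of the article class with two counters. The function name number_headings is hypothetical, and the sketch assumes only unstarred \section and \subsection commands with brace-free titles; it ignores things like \setcounter or \appendix that a real document may use.

```python
import re
from typing import List

# Matches \section{...} or \subsection{...} with a title that contains no braces.
SECTION_RE = re.compile(r"\\(section|subsection)\{([^{}]*)\}")


def number_headings(latex: str) -> List[str]:
    """Return the numbered heading titles in document order."""
    section = 0
    subsection = 0
    numbered = []
    for kind, title in SECTION_RE.findall(latex):
        if kind == "section":
            section += 1
            subsection = 0  # a new section restarts the subsection numbering
            numbered.append(f"{section} {title}")
        else:
            subsection += 1
            numbered.append(f"{section}.{subsection} {title}")
    return numbered


example = r"""
\section{Introduction}
\section{Examples}
\subsection{Equations}
\subsection{Figures}
"""
for heading in number_headings(example):
    print(heading)
```

For the headings of the example document this yields 1 Introduction, 2 Examples, 2.1 Equations and 2.2 Figures.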

Note: one thing I dislike a bit about the current LaTeX2e is that it is a mixture of a markup language and a typesetting language. For example, there are commands to change certain settings locally, explicit whitespace-modifying commands, and more such things. So, as a markup language, it is not as pure as HTML5 is at least intended to be. In that sense, LaTeX resembles HTML 4 a bit more, where there is a <font> tag to change the font, but also CSS to influence all elements systematically.

The real problem lies in the formulas. To say the least, in the world of HTML, mathematics is difficult. You may argue that there exists something like MathML, but hey, have a look at browser support first. Browser programmers don’t like mathematics.

So what are the solutions?

1. Be patient and hope that MathML will be fully supported in the near future. It will be supported eventually, but I do not have much hope for the near future.
2. Use the old-style approach: generate the formulas in your LaTeX environment, cut them out of the resulting PostScript or PDF file, convert them to an image format that browsers understand (like png, but not jpg, but that is a different story) and insert them in the appropriate place in the text.
3. Use some helper library (JavaScript) to display formulas.

Obviously, the first one is the correct one. However, as I noted above, it will take ages before you get the same output quality with MathML as you get with LaTeX, and the latter is the standard that we should strive for. There is some evolution in the implementation of MathML, and it is improving. But, as said, it is not a simple matter, and at the moment I think getting HTML5 and CSS3 in place should take priority over MathML. So, for the time being, we wait, and we look for something else.

The ‘create an image file’ approach is a decent solution: LaTeX-level output is there, the format is readable for everyone and you know what your readers will see. The problems, however, are also numerous. It works quite well for formulas on a separate line, but inserting an image inline in a piece of running text is a different thing. You need to care about the correct size of the image, for a start. But suppose now that someone changes the font size: then your formula size is not correct any more… There are ways around that, I guess (some playing with a CSS class that changes the <img> height to match the font size?), but it is not really nice. Additionally, sight-impaired people using a screen reader will have the sentence interrupted at such a location. The same argument also applies to any formula in an image. And then there is the issue of fonts: your formula will be displayed in the font that was used in LaTeX, which may or may not match the one you are using in the webpage. In summary, there are a few good reasons not to use it if there is a way around it.
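
The CSS idea mentioned above could look roughly like the following Python sketch: the image height is expressed in em units, so that it scales with the surrounding font, and the LaTeX source is put in the alt text so that a screen reader at least has something to read. The function name inline_formula_img, its parameters and the example numbers are all hypothetical; in particular, the height and the depth below the baseline would have to be measured from the LaTeX output, which is not shown here.

```python
import html


def inline_formula_img(src: str, latex: str, height_pt: float,
                       depth_pt: float, font_size_pt: float) -> str:
    """Build an <img> tag whose size follows the surrounding font size.

    height_pt    -- total height of the formula as typeset by LaTeX
    depth_pt     -- part of that height that lies below the baseline
    font_size_pt -- font size used in the LaTeX document (e.g. 12)
    """
    # Dividing by the LaTeX font size turns absolute sizes into em units.
    height_em = height_pt / font_size_pt
    depth_em = depth_pt / font_size_pt
    style = f"height:{height_em:.3f}em;vertical-align:-{depth_em:.3f}em"
    alt = html.escape(latex, quote=True)
    return f'<img src="{html.escape(src, quote=True)}" alt="{alt}" style="{style}">'


# Made-up measurements: 18pt high, 6pt of it below the baseline, 12pt font.
print(inline_formula_img("cp.png", r"C_p = \frac{P}{\dot{T}}", 18.0, 6.0, 12.0))
```

This addresses only the size and the baseline; the mismatch between the fonts remains.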

That brings us to the JavaScript libraries. In a general sense, when they are included in a webpage, they recognise some LaTeX-like character sequence and process it into something that can be reliably rendered by a browser. One example of this is jqMath, which renders text between the standard LaTeX symbols $ or $$ as, respectively, inline or block mathematics. The output is, depending on the actual formula to be rendered, MathML or HTML+CSS.
Another one, which appears to be rapidly becoming popular, is MathJax. That probably has to do with the fact that it is a project of the American Mathematical Society and is backed by, among others, the American Institute of Physics. It works similarly to the above, relying on MathML where possible, or HTML+CSS otherwise.
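
To give an idea of what including such a library amounts to, here is a small Python sketch that wraps an HTML fragment in a page that loads MathJax. It rests on assumptions that are not part of the discussion above: it uses MathJax version 3 from the jsDelivr CDN, it enables single dollar signs as inline delimiters explicitly (version 3 does not do so by default), and the function name wrap_with_mathjax and the @...@ placeholders are made up for this example.

```python
import html

# Assumption: MathJax version 3, loaded from the jsDelivr CDN.
MATHJAX_URL = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"

# The configuration object must be defined before MathJax itself is loaded.
PAGE_TEMPLATE = r"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>@TITLE@</title>
<script>
MathJax = { tex: { inlineMath: [['$', '$'], ['\\(', '\\)']] } };
</script>
<script id="MathJax-script" async src="@MATHJAX_URL@"></script>
</head>
<body>
@BODY@
</body>
</html>
"""


def wrap_with_mathjax(title: str, body_html: str) -> str:
    """Wrap an HTML fragment in a page that loads MathJax."""
    page = PAGE_TEMPLATE.replace("@TITLE@", html.escape(title))
    page = page.replace("@MATHJAX_URL@", MATHJAX_URL)
    # The body goes in last and unescaped, so the TeX in it stays intact.
    return page.replace("@BODY@", body_html)


page = wrap_with_mathjax("Example", r"<p>$$C_p = \frac{P}{\dot{T}}$$</p>")
print(page)
```

The formula travels to the browser as plain TeX, and the rendering happens there.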

As I have used jqMath earlier, with satisfactory results, this time, I am going to test MathJax.

In search of the conversion tool

I already have some experience with this search, and I was aware of the existence of latex2html and tex4ht. The problem with these two is that they are rather old in the context of web design.

The latex2html website is abandoned, and the documentation makes no mention of HTML5: it dates back to 1999, when HTML5 was only a faint whisper from the future. On the other hand, CTAN still lists it, complete with a version marked 2012. A bit of digging into the files reveals that manual.pdf belongs to v99.1 and mentions that HTML 4 support was there in v97.1. However, the Changes file in the 2012 bundle mentions only a v98.1 and then jumps to 2012, mentioning only:

fix warnings in perl 5.14

Thus, it looks as if HTML5 is not really an option with this program. While it may be useful in other cases, it is not for me.

For tex4ht, the situation is different: CTAN lists it as obsolete, mentioning that they are awaiting a new version, and refers to the program’s page at TUG. On the other hand, the program is still part of TeXLive 2014, and there are still changes in the source repository. The program worked quite well when I used it earlier (generating images for the formulas rather than the MathML that it also supports).
However, its latest release dates from before the inception of MathJax, and thus at first sight it cannot serve the purpose.
But deeply hidden in the documentation is an option that generates jsMath output. This means that the compatibility between the syntax of jsMath and that of MathJax should be checked. The documentation of MathJax is reasonably optimistic on the subject.

However, in view of the situation sketched above, I’d like something that converts the LaTeX to HTML but leaves the formulas intact, so they can be processed by MathJax.

LaTeXML from NIST seems a good candidate, rendering LaTeX to XML and then processing it further. However, at first sight it doesn’t seem to be able to do exactly what I need, although it seems to be quite successful at what it was developed for.

Then I came across pandoc, whose LaTeX-to-HTML mode seems to be exactly what I want. Now we have to see how it lives up to the test of tackling a larger, more complex file.
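
For reference, a call to pandoc along these lines can be sketched in Python as below. The file names and the two function names are placeholders of my own; the options are pandoc's --from, --to, --standalone, --mathjax and -o, and whether this combination gives a good result on a real document is exactly what remains to be tested.

```python
import shutil
import subprocess
from typing import List


def pandoc_command(source: str, target: str) -> List[str]:
    """Build a pandoc call that converts LaTeX to a standalone HTML5 page."""
    return [
        "pandoc",
        "--from=latex",
        "--to=html5",
        "--standalone",  # write a complete page, not just a fragment
        "--mathjax",     # leave the formulas as TeX, to be rendered by MathJax
        "-o", target,
        source,
    ]


def convert(source: str, target: str) -> None:
    """Run pandoc on the given files, if pandoc is installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on the PATH")
    subprocess.run(pandoc_command(source, target), check=True)


# Placeholder file names; convert("example.tex", "example.html") would run it.
print(" ".join(pandoc_command("example.tex", "example.html")))
```

Keeping the command construction separate from the actual call makes it easy to inspect or adjust the options before running anything.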