Converting from Microsoft Word to HTML

(covering Word 5, 6, Office95, Office97, and a non-Microsoft product that actually works; also micro write-ups for Quark and Frame users)

for the Web Tools Review by Philip Greenspun


May 2005 Update: I'm not actively maintaining this page anymore but I did just recently try a product called "Word to Web" (WordToWeb 2.5) to see if it would produce less complex and hard-to-edit-manually HTML than Word 2003 does on its own. The answer is "no". In fact, it did a much worse job than Microsoft Word by itself.

Word 5.0 (the bad old days)

There are two tried-and-true ways to get documents from Microsoft Word 5.0 onto the Web:

Word 6.0 (Microsoft Realizes Internet Exists)

Even though I'm generally opposed to making Bill Gates even richer, I decided that I could spare $135 for the academic version of Microsoft Office with Word 6.0. This would, I decided, fix the Text with Layout bug and let me try out Microsoft Internet Assistant.

The Academic version of Office comes only on floppy disk; 32 of them. It took me about two hours and 120 megabytes of disk space to install the complete package.

I started Word 6.0.1, the new fixed zippy version of Word that was supposed to address the program's sluggishness on the Macintosh. PowerMac-native Word 6 ran substantially slower than Word 5 ran in emulation on my PowerMac.

"OK, so it is a major pig," I thought to myself, "and proves once again that C programs beyond a certain complexity are neither fast nor small. Still, it will be nice to be able to Save As Text with Layout without crashing the machine."

I tried saving a simple 10-page paper with no figures or equations or anything special (beyond two footnotes) as Text with Layout. An error box appeared. No output file was produced. My machine crashed a few seconds later.

"Oh well, so two years and all the C programmers Bill Gates could bring in from India weren't enough to squash this bug. At least I can play with Internet Assistant."

It turns out that Internet Assistant only runs on the PC version of Word.

Did I feel like I'd been cheated out of $135? Hell no! Aficionados of viewgraph design claim that Aldus Persuasion is way better, but PowerPoint produced a nice stack of colored viewgraphs for a conference talk. Oh, this PowerMac native C program is real zippy. On a 66 MHz PowerMac, it was barely able to keep up with my typing in certain modes.

Office 95

It took me about a year but I finally got Windows NT to run on a PC. Then I was able to enjoy total Microsoft quality, both operating systems and applications. I installed Office for Win95 and then the very latest Internet Assistant. Now I could just type "Save As HTML" from any Word document. It was incredibly convenient and I bowed down to worship Bill Gates.

As soon as I had rolled up my prayer rug, though, I noticed that the HTML output from Word/Internet Assistant was garbage. For example, it would start to wrap a headline in an H2 tag but then forget to close it, so huge blocks of text were rendered as a headline. There were hundreds of extraneous PRE, BR, and P tags. Worse than useless.
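The unclosed-headline failure described above is easy to check for mechanically. Here is a minimal sketch in Python, using the standard library's html.parser module. It is not anything Internet Assistant or Word provides; the class name HeadingChecker and the function unclosed_headings are made up for illustration, and the check assumes that one heading starting inside another (or the file ending inside a heading) means the first heading was never closed.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChecker(HTMLParser):
    """Collect heading tags that are opened but never closed."""

    def __init__(self):
        super().__init__()
        self.open_headings = []  # stack of (tag, line number where it opened)
        self.problems = []       # (tag, line number) of each unclosed heading

    def handle_starttag(self, tag, attrs):
        if tag not in HEADINGS:
            return
        # Assumption: a new heading inside an open heading means the
        # earlier one was never closed.
        if self.open_headings:
            self.problems.append(self.open_headings.pop())
        self.open_headings.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        # Only a matching close tag clears the currently open heading.
        if self.open_headings and self.open_headings[-1][0] == tag:
            self.open_headings.pop()


def unclosed_headings(html):
    """Return a list of (tag, line) pairs for headings left open."""
    checker = HeadingChecker()
    checker.feed(html)
    checker.close()
    # Anything still open at end of file was never closed either.
    return checker.problems + checker.open_headings
```

The parser lowercases tag names, so Word's upper-case H2 is reported as h2. Finding the problem is the easy part; deciding where the headline was supposed to end still takes a human.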

"At least I can go back to my old way of doing things," I mumbled to myself, "I'll just Save As RTF and then use rtftohtml." No such luck. The latest version of Word, at least with Internet Assistant installed, puts some crud in RTF files that rtftohtml doesn't understand.

Something that Works

What finally worked was a beautiful commercial product called HTML Transit from InfoAccess. This works on whole groups of Word docs, producing index pages and local tables of contents in a highly configurable manner. It is a beautiful program that can almost make up for a day of using Microsoft software. HTML Transit understands Word tables and turns them into HTML tables. It isn't too smart about "smart" quotes, though, and it turns them into "&#146;" which is not part of the legal HTML command set. It happens to be rendered by Netscape on Mac/PC as a fancy single quote but on Unix boxes it means nothing so the user sees "dont" when you wrote "don't". Running perl -i.bak -pe "s/&#146;/'/g" over the output files should fix it up. Less easily fixed are equations. I tried translating my pathetic master's thesis which has a lot of equations in the text. These all ended up mangled. Despite these nits, HTML Transit is by far the best tool available for translating Microsoft Word into HTML. It also understands Interleaf, Frame, WordPerfect, RTF and a bunch of other formats (not PageMaker though). Unfortunately, it only runs on Windows.
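If you don't have Perl handy, the smart-quote repair mentioned above can be sketched in Python. This is a minimal sketch and not a feature of HTML Transit. It assumes the converter leaves curly quotes in the output either as numeric references in the 145 to 148 range (the Windows code points for curly single and double quotes) or as literal curly-quote characters; the function name fix_smart_quotes is hypothetical.

```python
import re

# Assumption: numeric references 145-148 are the Windows code points
# for the curly single and double quotes.
_NUMERIC_REFS = {"145": "'", "146": "'", "147": '"', "148": '"'}

# The same four quotes as literal Unicode characters.
_LITERALS = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

_REF_PATTERN = re.compile(r"&#(14[5-8]);")


def fix_smart_quotes(html):
    """Replace curly quotes, as references or literals, with plain ASCII quotes."""
    html = _REF_PATTERN.sub(lambda m: _NUMERIC_REFS[m.group(1)], html)
    return html.translate(_LITERALS)


print(fix_smart_quotes("don&#146;t"))  # prints: don't
```

Unlike the one-liner, this also handles the double quotes. It makes no attempt to tell body text from attribute values, so a curly double quote inside an attribute would need separate care.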

Office 97

Of course, you'd expect a company like Microsoft, with $8 billion in the bank, to eventually get it right... A friend of mine contributing a review to my on-line photography magazine wrote the original in Office97. He saved it as HTML and the results were remarkable. Where you'd have expected to find an <H3>, instead you got
<B><FONT FACE="Arial" SIZE=4><P>
The document was filled with special 8-bit Windows characters that aren't part of the legal HTML character set, e.g., "smart quotes" and long dashes. There was an extra "<P>&nbsp;</P>" in between all of the paragraphs.
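Two of these complaints, the 8-bit characters and the spacer paragraphs, can be cleaned up after the fact. What follows is a rough sketch in Python, not a tool Microsoft ships. It assumes the saved file is in the Windows-1252 code page, and the function name clean_word_html is made up for illustration. Bytes with no Windows-1252 meaning are replaced with the Unicode replacement character rather than raising an error.

```python
import re

# Matches spacer paragraphs such as <P>&nbsp;</P>, in any letter case,
# along with the whitespace that follows them.
_SPACER = re.compile(r"<p>\s*&nbsp;\s*</p>\s*", re.IGNORECASE)


def clean_word_html(raw):
    """Take the raw bytes of a saved file; return ASCII-only HTML text."""
    # Assumption: the file was written in the Windows-1252 code page.
    text = raw.decode("cp1252", errors="replace")
    # Drop the empty paragraphs inserted between the real ones.
    text = _SPACER.sub("", text)
    # Rewrite every non-ASCII character as a numeric character reference.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

The resulting references use Unicode code points (for example &#8217; for a curly apostrophe), which assumes a browser that understands them. None of this repairs the missing H3: a FONT tag carries no record of what the author meant, so the structure has to be put back by hand.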

In short, a terrible unusable non-standard mess reflecting a complete ignorance of the original point of HTML (that the browser does the formatting).

What About Quark?

Quark does not have native support for writing HTML versions of its documents. There are a couple of plug-ins that you can buy that allegedly output HTML but I haven't been able to get the whole collection to work.

What About Frame?

Adobe's Frame 5.5 has the ability out of the box to output HTML (if you have an older version, you'll have to download a plug-in from the Adobe web site and/or upgrade to 5.5). It takes about 5 minutes to install Frame on a WinNT machine, read in a document, and write out an HTML file. I will be able to tell you more after I get off my butt and finish the on-line edition of Database Backed Web Sites (Macmillan used Frame to produce the final film).

... Well, I've tried Frame 5.5 now. It includes an impressive array of controls for converting documents to HTML and is way better than the plug-in that Adobe distributed with 5.0. I think that the philosophy and overall power are similar to what you get with InfoAccess. One truly impressive thing that Frame will do is simultaneously output a cascading style sheet so that the final HTML stays reasonably clean yet readers with modern browsers can get the benefits of design choices you've made. (Note: I wasn't able to finish the on-line edition of Database Backed Web Sites because Macmillan didn't supply me with something that would load cleanly into Frame and display. But whatever I saw in Frame was ultimately viewable in the final HTML document.)

The bottom line on Frame is that it remains an extremely powerful way to manage a set of documents that you intend to make available simultaneously in print, PDF, and HTML. But in order to get the most out of Frame, you'll have to invest a few days thinking about styles and what they should mean. Frame is a good enough piece of software that there are actually rewards to taking an intelligent and formal approach to your problem. But if you want to be stupid, you can think of Frame as a version of Microsoft Word with most of the bugs taken out.


philg@mit.edu