Blogging with Word in your Jamstack (August Lilleaas' blog)

I wrote a book in Word. That’s 72 000 words in Word.

But I’ve stopped blogging, partly because writing markdown in my code editor just doesn’t vibe with me anymore. These days, I ask myself the question: how can I just get a Word document online as fast as possible?

I like Word! It has nice tooling for grammar and spell checking, and it’s a nice environment for editing prose.

So, I wrote a parser for Word files that slurps them into my blog engine and emits clean HTML – like the post you’re seeing now! Yes, my friends, I now have support for .docx files in my personal blog.

Tldr: Here’s the Word document (called blogging_with_word.docx), and here’s the parser code.

A quick word on tech

Programmers should build their own lightsaber blog.

This blog is a Jamstack-style Clojure app (it used to be a Ruby/Middleman app, and before that, Rails) that reads a bunch of files and outputs static HTML, which I host on CloudFront in AWS. Source code here.

This setup turns out to be the best of both worlds. I get to write some fun code, once, and hopefully when it’s all up and running, I can just paste another docx file into the correct folder and have a fresh blog post online in no time.

Back in the old days I enjoyed hand-typing markdown in my code editor. In fact, for the first years of this blog I hand-typed the HTML. Clearly, I have changed.

Parsing Word docs

I’m glad I use a robust technology like the JVM, which is the platform Clojure runs on. Clojure has great Java interop, so I’m using the good old Apache POI library. It has broad support for the Microsoft Office file formats, and convenient APIs for extracting text from a Word document.

Let’s get into it!

(defn get-body-from-docx [^File file]
  (with-open [doc (XWPFDocument. (io/input-stream file))]
    (->> (iterator-seq (.getParagraphsIterator doc))
         (drop-while #(re-find header-line-re (.getText %)))
         (map (fn [para]
                (if-let [runs (seq (.getRuns para))]
                  {::runs runs
                   ::style (.getStyleID para)
                   ::para para
                   ::doc doc}
                  ::blank)))
         (word-paragraphs-hiccup-seq))))

iterator-seq and the getParagraphsIterator() method are the juicy bits. Together they yield the Word document’s paragraphs as a plain lazy Clojure sequence, which lets me pretend I’m writing code against a neat Clojure library instead of a noun-heavy Java library.

The drop-while gets rid of the initial header lines. My Word document starts with some metadata directly in the text, such as date: 2023-12-07. It also has the title of the post, and so on.

Then I map each empty paragraph to ::blank, and each paragraph with contents to a map holding all the data I need for further processing.
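To see the shape of that pipeline without Apache POI on the classpath, here is a minimal sketch that runs the same steps over a plain java.util.Iterator of strings. The regex is an assumption (the real header-line-re is not shown in this post), and body-from-lines and the ::text key are hypothetical names used only for illustration.

```clojure
(require '[clojure.string :as str])

;; Assumed pattern for "key: value" metadata lines.
;; The actual header-line-re may well look different.
(def header-line-re #"^\w+: .+")

(defn body-from-lines
  "Same shape as get-body-from-docx, but over an iterator of strings
  instead of XWPFParagraph objects."
  [^java.util.Iterator lines]
  (->> (iterator-seq lines)                       ; Java iterator -> lazy seq
       (drop-while #(re-find header-line-re %))   ; skip the leading metadata
       (map (fn [line]
              (if (str/blank? line)
                ::blank
                {::text line})))))

(body-from-lines
  (.iterator (java.util.ArrayList. ["title: Blogging with Word"
                                    "date: 2023-12-07"
                                    ""
                                    "Here is my line"])))
;; evaluates to a seq equal to [::blank {::text "Here is my line"}]
```

Note that drop-while stops at the first line that does not match, so a later body line that happens to look like "key: value" is left alone.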

Chunking paragraphs

Then, the fun process of converting Word paragraphs to HTML begins. Paragraphs in Word documents are not like paragraphs on the web. A paragraph in Word is more like a sequence of <div>s, or lines of text separated by a <br> tag.

Let’s say the Word document looks like this:

Here is my line

Here is my other line
with no space between

Here is my last line

That should result in HTML that looks roughly like this:

<p>Here is my line</p>
<p>Here is my other line<br>with no space between</p>
<p>Here is my last line</p>

In other words, the list of paragraphs from Word needs to be chunked properly.

This is what word-paragraphs-hiccup-seq does. That function is just some looping mechanics; the meat of the implementation is another function, get-next-hiccup-tag, which returns the new HTML for the paragraphs it “consumed”, along with the remaining paragraphs to be processed.
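The looping mechanics themselves are not shown in this post, so the following is only a sketch of how such a loop could look, working on plain strings where "" plays the role of a blank paragraph. Both hiccup-seq and next-paragraph are hypothetical stand-ins, not the blog engine’s actual functions.

```clojure
(defn next-paragraph
  "Stand-in for get-next-hiccup-tag: consumes lines up to the next blank
  one and returns [hiccup-tag remaining-lines]."
  [lines]
  (let [[chunk remaining] (split-with #(not= "" %) lines)]
    [(into [:p] (interpose [:br] chunk)) remaining]))

(defn hiccup-seq
  "Lazily calls next-paragraph until no lines are left."
  [lines]
  (lazy-seq
    (let [lines (drop-while #(= "" %) lines)]   ; skip leading blank lines
      (when (seq lines)
        (let [[tag remaining] (next-paragraph lines)]
          (cons tag (hiccup-seq remaining)))))))

(hiccup-seq ["Here is my line"
             ""
             "Here is my other line"
             "with no space between"
             ""
             "Here is my last line"])
;; => ([:p "Here is my line"]
;;     [:p "Here is my other line" [:br] "with no space between"]
;;     [:p "Here is my last line"])
```

The step function returns both its result and whatever it did not consume, which keeps the loop itself unaware of how many lines a given tag swallows.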

(defn get-next-hiccup-tag [paragraphs]
  (let [paragraphs (->> paragraphs
                        (drop-while #(= % ::blank)))
        {style ::style runs ::runs doc ::doc} (first paragraphs)]
    (case style
      "Code" (let [[code-paras rest] (->> paragraphs
                                          (consume-paragraph-chunk #(= "Code" (::style %))))]
               [(get-code-block-tag code-paras) rest])
      "Heading2" [(into [:h2] (runs-to-hiccup-seq runs doc)) (rest paragraphs)]
      "Heading3" [(into [:h3] (runs-to-hiccup-seq runs doc)) (rest paragraphs)]
      "Heading4" [(into [:h4] (runs-to-hiccup-seq runs doc)) (rest paragraphs)]
      (get-next-hiccup-paragraph paragraphs))))

The first step is to remove any preceding blank paragraphs. If I insert 10 blank lines in a row, I don’t want to end up with two paragraphs separated by 9 (?) <br> tags. This makes my Word documents behave similarly to Markdown, where you also have to do some work to get a bunch of whitespace between paragraphs.

The ::style property of the first paragraph determines what to do. If it’s a code block or a header, I immediately create an element and then return the remaining paragraphs. There’s some logic in code blocks I’ll return to later. But that’s about it.

If the paragraph does not have a style associated with it (meaning it’s just a normal block of text), I invoke the main bulk of the chunking logic.

(defn get-next-hiccup-paragraph [paragraphs]
  (let [[text-paragraphs rest] (->> paragraphs
                                    (split-with
                                      #(and (not= % ::blank)
                                            (nil? (::style %)))))]
    [(into [:p]
           (->> text-paragraphs
                (map #(runs-to-hiccup-seq (::runs %) (::doc %)))
                (interpose [:br])))
     rest]))

First, I take all the blocks of non-styled paragraphs that also aren’t blank lines. Then, I build a <p> tag that contains each paragraph converted into HTML (it reads out bold, italics, links, etc. from the text runs), and I insert a <br> between each line in the paragraph.

The rest is what is returned to the looping code, meaning that the next iteration of the processing loop picks up with the remaining paragraphs.
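The run conversion (runs-to-hiccup-seq) is not shown in this post. As a rough illustration of the idea, the sketch below uses plain maps in place of POI’s run objects; the :text, :bold?, :italic? and :href keys, and the function names, are all assumptions made for this example.

```clojure
(defn run->hiccup
  "Wraps the text of one run in tags according to its formatting flags.
  The map keys are hypothetical stand-ins for what a Word run exposes."
  [{:keys [text bold? italic? href]}]
  (cond->> text
    bold?   (conj [:strong])
    italic? (conj [:em])
    href    (conj [:a {:href href}])))

(defn runs->hiccup-seq [runs]
  (map run->hiccup runs))

(runs->hiccup-seq [{:text "Plain, "}
                   {:text "bold" :bold? true}
                   {:text " and "}
                   {:text "a link" :href "https://example.com"}])
;; => ("Plain, " [:strong "bold"] " and " [:a {:href "https://example.com"} "a link"])
```

Because each run becomes either a string or a small vector, the results can be poured straight into a [:p] with into, as get-next-hiccup-paragraph does.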

Processing code blocks

I invoke a function called consume-paragraph-chunk to build the code blocks. Here’s that function.

(defn consume-paragraph-chunk [pred xs]
  (let [[chunk rest] (split-with #(or (pred %) (= ::blank %)) xs)
        whitespace-tail (->> chunk (reverse) (take-while #(= ::blank %)))]
    [(->> chunk
          (drop-last (count whitespace-tail)))
     (concat whitespace-tail rest)]))

Maybe there’s a smarter way to do this, but at this point I just wanted to get this blog post published. The function takes all lines that are blank or match the predicate (in this case, ::style is "Code"). I don’t want it to include any tails of whitespace after the last chunk of code, so I yank those out and add them to the rest for further processing in the loop.
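A small usage example may make the whitespace handling easier to follow. consume-paragraph-chunk is repeated from above so the snippet runs on its own; code-block-tag is a hypothetical stand-in for get-code-block-tag (which this post does not show), and the ::text key is an assumption, since the real maps carry ::runs.

```clojure
(require '[clojure.string :as str])

;; Copied from the post so that this snippet is self-contained.
(defn consume-paragraph-chunk [pred xs]
  (let [[chunk rest] (split-with #(or (pred %) (= ::blank %)) xs)
        whitespace-tail (->> chunk (reverse) (take-while #(= ::blank %)))]
    [(->> chunk
          (drop-last (count whitespace-tail)))
     (concat whitespace-tail rest)]))

;; Hypothetical: joins code paragraphs with newlines, keeping blank
;; paragraphs inside the block as empty lines.
(defn code-block-tag [code-paras]
  [:pre [:code (->> code-paras
                    (map #(if (= ::blank %) "" (::text %)))
                    (str/join "\n"))]])

(def paras
  [{::style "Code" ::text "(defn f []"}
   ::blank
   {::style "Code" ::text "  42)"}
   ::blank
   ::blank
   {::style nil ::text "Back to prose."}])

(let [[code-paras remaining]
      (consume-paragraph-chunk #(= "Code" (::style %)) paras)]
  [(code-block-tag code-paras) remaining])
;; first element:  [:pre [:code "(defn f []\n\n  42)"]]
;; second element: the two trailing ::blank markers, then the prose paragraph
```

The blank paragraph between the two code lines stays inside the block, while the two blanks after it are handed back to the loop.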

Great success

So that’s about it! I wrote this very blog post in Word. This has increased the likelihood of increased blog output by at least 1%.

My number one fear is that this will turn out to be some kind of churnfest due to incompatibilities with my old blog posts in new versions of Word, or something like that. The docx format has been stable for a while now though, so I sure hope not. At any rate, it was a fun little project to get up and running.

Test area

Because I’m lazy, I don’t want to set up a separate test environment. So here’s a blurb that tests some random stuff.

Here, we should see a continuous block of code, with newlines inside it but not after.

(defn here-is-some-code []
  ;; Test newline stuff

  ;; Ok here we go
  (prn "Hello, World!"))

Here, we should see a set of words separated by <br> tags. And, we should not see an actual line break here, but a single paragraph, and some in-line escaped HTML.

Can we have in-line code? (+ 1 2). This should use a monospace font.

Here
We
Go

We’ve already tested links. But let’s test it again.