In this post I show two ways to document a piece of code on Jekyll, the static site generator powering GitHub Pages (including this blog).

The Story Behind the Story

I tried to explain how this piece of code found that (1+2)!! + 3!^4 - 5 = 2011, but C++ is not famous for being concise and transparent. So I tried to give the general idea, and gave the full source in one block, but it was not very satisfying. Full of remorse, I’m now writing this post about how I should have written the previous post, so I can sleep at night again.

I’ll get some inspiration from literate programming, which I’ll quickly present. Then I’ll show how it could have been used on the previous post. This method won’t work for every language, so I will present another one, based on a pending addition to Jekyll.

Literate Programming

Let me copy-paste the definition from the Wikipedia article about literate programming for you:

The literate programming paradigm, as conceived by Knuth, represents a move away from writing programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

Basically, the documentation and the code are written in the same file, but in the order of natural speech rather than the order expected by the compiler/interpreter, unlike Javadoc-like systems. A preprocessor then extracts the code and reassembles it in the order expected by the compiler/interpreter (hereafter: the processor).
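
To make the extract-and-reassemble step concrete, here is a minimal sketch of such a preprocessor in Ruby. The chunk notation (`<<name>>=` opens a code chunk, `@` closes it, `<<name>>` references another chunk) is loosely modelled on noweb and is an assumption of this example, as are the `collect_chunks`, `expand` and `tangle` names. Real tools do a lot more.

```ruby
# Minimal "tangle" sketch: collect named code chunks from a literate
# document, then expand chunk references starting from a root chunk.
# The <<name>>= / @ notation is an assumption made for this example.

def collect_chunks(document)
  chunks = Hash.new { |hash, key| hash[key] = [] }
  current = nil
  document.each_line(chomp: true) do |line|
    if (m = line.match(/\A<<(.+)>>=\s*\z/))
      current = m[1]              # start of a code chunk
    elsif line.strip == "@"
      current = nil               # back to prose
    elsif current
      chunks[current] << line     # code line inside a chunk
    end
  end
  chunks
end

def expand(name, chunks, indent = "")
  raise KeyError, "unknown chunk: #{name}" unless chunks.key?(name)
  chunks[name].flat_map do |line|
    if (m = line.match(/\A(\s*)<<(.+)>>\s*\z/))
      expand(m[2], chunks, indent + m[1])   # splice the referenced chunk
    else
      [indent + line]
    end
  end
end

def tangle(document, root)
  expand(root, collect_chunks(document)).join("\n") + "\n"
end

LITERATE = <<~'DOC'
  The interesting part is the loop, so it comes first.

  <<loop>>=
  3.times { |i| total += i }
  @

  Only afterwards do we show the boring frame around it.

  <<main>>=
  total = 0
  <<loop>>
  puts total
  @
DOC

puts tangle(LITERATE, "main")
```

The document explains the loop before the frame that contains it, yet the tangled output puts the three lines back in the order the processor wants.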

In order to show the processor that we are serious about not caring about it, literate programming can be done with LaTeX or a word processor. Yes, I spent an internship coding (Spec#) within Word, and yes, I actually enjoyed it. More recently I discovered Sweave, which makes statistical exploration with R enjoyable (for a programmer).

Of course, the rule that any idea has already been implemented (better) and published also applies to literate programming with Markdown (Markdown being the notation used to format this article). In this case, a script comments out all the text that is not marked as code. I guess the author calls this “lightweight literate programming” because the code still has to be written in the order expected by the processor.
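
The comment-out step can be sketched in a few lines of Ruby. This is not the script mentioned above, only an illustration under two assumptions: code blocks are marked by a four-space indent, and the target language uses `# ` for comments (the `comment_prefix` parameter and the `markdown_to_source` name are made up for this example).

```ruby
# Sketch of "lightweight literate programming" for Markdown: every line that
# is not part of an indented code block becomes a comment, so the whole
# document can be fed to the processor unchanged. Assumes code blocks are
# marked by a four-space indent; the comment prefix is a parameter.

def markdown_to_source(markdown, comment_prefix = "# ")
  markdown.each_line(chomp: true).map do |line|
    if line.start_with?("    ")
      line[4..-1]               # code: drop the Markdown indent
    elsif line.strip.empty?
      ""                        # keep blank lines as they are
    else
      comment_prefix + line     # prose: comment it out
    end
  end.join("\n") + "\n"
end

ARTICLE = <<~'MARKDOWN'
  Greeting
  ========

  We start by choosing a name:

      name = "world"

  and then we greet it:

      puts "Hello, #{name}!"
MARKDOWN

puts markdown_to_source(ARTICLE)
```

The code lines come out in exactly the order they appear in the article, which is why this approach stays “lightweight”: nothing is reordered.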

Manual Literate Programming on GitHub’s Jekyll

The idea here is to use the #include directive of the C/C++ preprocessor along with the {% include %} tag of the Liquid Extensions of Jekyll to build the “interconnected ‘webs’ of macros” of Knuth’s vision of literate programming. The extraction step is manual: each macro will be written in a separate file.

For example, the core search function seems complicated at first:

(search.messy.cpp)

template <bool ALL, int VERBOSITY>
void search(Number expectation, const Number epsilon = 0.1) {
  EvaluationResult result;
  long tRemaingCandidates = factorial[NBINOPERATIONS]
                    *pow(NBINOPERATIONTYPES,NBINOPERATIONS)
                    *pow(MAXFRAC+1,NBINOPERATIONS+NOPERANDS-2)
                    *pow(MAXSQRT+1,NBINOPERATIONS+NOPERANDS-1);
  std::cout << tRemaingCandidates << " candidates" << std::endl;
  do {
      result = evaluateCandidate(expectation, epsilon);
      if (result & VERBOSITY) {
          std::cout << decode(result) << ": ";
          printCandidate();
      }

      --tRemaingCandidates;
      if (ALL && !(tRemaingCandidates%100000000)) {
          std::cout << tRemaingCandidates << " remaining candidates"
                  << std::endl;
      }
  } while (generateNextCandidate() && (ALL || result != CORRECT));
}

Actually it is very simple if we extract the code related to printing progress:

(search.cpp)

#include "a-lot-of-code-before.inc"

template <bool ALL, int VERBOSITY>
void search(Number expectation, const Number epsilon = 0.1) {
  EvaluationResult result;
  #include "print-candidate-total.inc"
  do {
      result = evaluateCandidate(expectation, epsilon);
      #include "print-progress.inc"
  } while (generateNextCandidate() && (ALL || result != CORRECT));
}

#include "a-lot-of-code-after.inc"

We can detail separately the formula of the total of candidates, by showing only the content of print-candidate-total.inc:

(print-candidate-total.inc)

long tRemaingCandidates = factorial[NBINOPERATIONS]
                *pow(NBINOPERATIONTYPES,NBINOPERATIONS)
                *pow(MAXFRAC+1,NBINOPERATIONS+NOPERANDS-2)
                *pow(MAXSQRT+1,NBINOPERATIONS+NOPERANDS-1);
std::cout << tRemaingCandidates << " candidates" << std::endl;

And we don’t even need to show the content of print-progress.inc, because it is just uninteresting print instructions.

The Markdown document becomes something like:

The core method is pretty simple; generate a solution and try the next:
{% highlight cpp %}
{% include 2011-12-18/search.cpp %}
{% endhighlight %}

Blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah,
which gives us `(n-1)! * k^n * (s+1)^(n+n-1) * (f+1)^(n+n-1)` candidates: 
{% highlight cpp %}
{% include 2011-12-18/print-candidate-total.inc %}
{% endhighlight %}

This Markdown document is describing how to write a Markdown document describing other files by including them… Way too meta for me. If you too need to lower the level of abstraction, you can have a look at the files I’m talking about:

explanation.markdown is the article describing how the program works.
search.cpp and print-candidate-total.inc are the pieces of code explained in the article.
print-progress.inc, a-lot-of-code-before.inc, and a-lot-of-code-after.inc are the rest of the code.

The whole point of all this is that the code remains compilable (e.g. with g++ -o search search.cpp) without any transformation, just like the monolithic main.cpp, but the explanation can be much clearer. It also means that there is no need to propagate modifications into the description should the code change, or vice versa.
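
To convince yourself that the split files still add up to the monolithic source, the #include lines can be expanded by hand. The Ruby sketch below approximates only that one job of the C preprocessor. The files live in a hash so the example is self-contained, their contents are placeholders rather than the real program, and `expand_includes` is a name invented for this example.

```ruby
# Rough approximation of how the preprocessor stitches the pieces back
# together: each `#include "name"` line is replaced by the named file's
# content. Files live in a hash here so the example is self-contained;
# their contents are placeholders, not the real program.

INCLUDE_LINE = /\A\s*#include\s+"([^"]+)"\s*\z/

def expand_includes(name, files, seen = [])
  raise ArgumentError, "circular include: #{name}" if seen.include?(name)
  content = files.fetch(name) { raise ArgumentError, "missing file: #{name}" }
  content.each_line(chomp: true).flat_map do |line|
    if (m = line.match(INCLUDE_LINE))
      expand_includes(m[1], files, seen + [name])   # splice the included file
    else
      [line]
    end
  end
end

FILES = {
  "search.cpp" => <<~'CPP',
    void search() {
      #include "print-candidate-total.inc"
      do {
        #include "print-progress.inc"
      } while (next());
    }
  CPP
  "print-candidate-total.inc" => "long total = count();\n",
  "print-progress.inc"        => "--total;\n"
}.freeze

puts expand_includes("search.cpp", FILES)
```

Comparing this expanded output with the monolithic file is a cheap way to check that nothing was lost when the code was cut into pieces.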

Reverse Literate Programming on GitHub’s Jekyll

Now the problem is that a lot of programming languages are too elegant to dirty their parsing hands with a preprocessor’s include. Of course, most languages allow importing some notion of modules, but that is just the processor trying, once again, to dictate to us the order in which to describe a program.

Without heavy weaponry like a real literate programming preprocessor, we will have to call a truce. We’ll still write the code in the order of the processor, but explain it in a human order. Anyway, no one actually believed that I wrote the code of the previous section in the presented order.

Instead of writing a story that will be transformed into a program, we will write a program that will be reassembled into a story – hence the reverse literate programming. The feature needed here is to be able to extract parts of the source code. Jekyll’s {% include %} can only extract the whole, but if this patch makes it through, we will be able to use blocks like:

{% extract source.file %}
  {% after //Some marker %}
  {% before //Some other marker %}
{% endextract %}

Where:

source.file is the source to include (similarly to the {% include %} tag already in Jekyll),
after is a tag to specify the line after which the content of the file should be included,
before is a tag to specify the first line from which the content of the file will be ignored again.
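
The after/before semantics described above can be sketched on plain strings, independently of Jekyll and Liquid. The `extract_between` name is made up for this example, and it makes one design choice of its own: it stops at the first line containing the `before` marker.

```ruby
# Sketch of the after/before semantics on plain strings, independent of
# Jekyll: keep the lines strictly between the line containing `after`
# and the next line containing `before`. Returns nil if a marker is missing.

def extract_between(source, after, before)
  lines = source.lines
  start = lines.index { |line| line.include?(after) }
  return nil if start.nil?
  length = lines[(start + 1)..-1].index { |line| line.include?(before) }
  return nil if length.nil?
  lines[start + 1, length].join   # marker lines themselves are excluded
end

SOURCE = <<~'RUBY'
  def area(radius)
    # begin formula
    pi = 3.14159
    pi * radius**2
    # end formula
  end
RUBY

puts extract_between(SOURCE, "# begin formula", "# end formula")
```

Only the two lines of the formula are printed. The marker comments stay in the source file, where they cost nothing, and never reach the article.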

Example

For example, if I were to document the code of the patch, instead of including the whole file containing:

(extract.rb)

module Jekyll

  class ExtractBlock < Liquid::Block

    def unknown_tag(name, content, tokens)
      case name
      when "after"
        @after = content.strip
      when "before"
        @before = content.strip
      else
        super
      end
    end

    def initialize(tag_name, file, tokens)
      super
      @file = file.strip
    end

    def render(context)
      includes_dir = File.join(context.registers[:site].source, '_includes')

      if File.symlink?(includes_dir)
        return "Includes directory '#{includes_dir}' cannot be a symlink"
      end

      if @file !~ /^[a-zA-Z0-9_\/\.-]+$/ || @file =~ /\.\// || @file =~ /\/\./
        return "Include file '#{@file}' contains invalid characters or sequences"
      end

      Dir.chdir(includes_dir) do
        choices = Dir['**/*'].reject { |x| File.symlink?(x) }
        if choices.include?(@file)
          source = File.read(@file)
          #Preceding code is the same as IncludeTag.render
          matchdata = source.match /#{Regexp.escape(@after)}[^\n]*\n(.*)\n.*#{Regexp.escape(@before)}/m
          if matchdata.nil? or matchdata.size < 2
            return "Unable to determine which lines of '#{@file}' "+
            " are between '#{@after}' and '#{@before}'"
          end
          source = matchdata[1]
          #Following code is the same as IncludeTag.render
          partial = Liquid::Template.parse(source)
          context.stack do
            partial.render(context)
          end
        else
          "Included file '#{@file}' not found in _includes directory"
        end
      end
    end
  end

end

Liquid::Template.register_tag('extract', Jekyll::ExtractBlock)

I could simply write a Markdown file containing:

Blablabla, I just copy-pasted `include.rb` and added this processing to the
content read from the file:
{% extract extract.rb %}
  {% after #Preceding code is the same as IncludeTag.render %}
  {% before #Following code is the same as IncludeTag.render %}
{% endextract %}

Which would give:

Blablabla, I just copy-pasted include.rb and added this processing to the content read from the file:

(extract-core.rb)

          matchdata = source.match /#{Regexp.escape(@after)}[^\n]*\n(.*)\n.*#{Regexp.escape(@before)}/m
          if matchdata.nil? or matchdata.size < 2
            return "Unable to determine which lines of '#{@file}' "+
            " are between '#{@after}' and '#{@before}'"
          end
          source = matchdata[1]

Alternatives

I can think of two other ways to specify the part of the file to be included. First, specifying the numbers of the first and last line to keep would be the easiest, but it would mean that the documentation has to be updated each time the file changes.

Second, a from/until pair could specify border lines similarly to the after/before pair, except that the content would include these two lines. The problem here is that premature truncations could happen if we want to stop the inclusion on lines such as } or end.
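
The truncation problem is easy to reproduce. The sketch below implements a hypothetical inclusive from/until extraction (the `extract_from_until` name and its parameters are inventions of this example, not part of the patch) and asks it to stop on `end`:

```ruby
# Sketch of the inclusive from/until variant, to show the premature
# truncation: `until_marker` matches the first `end` it meets, which
# closes the `if`, not the method.

def extract_from_until(source, from, until_marker)
  lines = source.lines
  first = lines.index { |line| line.include?(from) }
  return nil if first.nil?
  offset = lines[first..-1].index { |line| line.include?(until_marker) }
  return nil if offset.nil?
  lines[first..(first + offset)].join   # both border lines are kept
end

SNIPPET = <<~'RUBY'
  def sign(x)
    if x < 0
      return -1
    end
    1
  end
RUBY

puts extract_from_until(SNIPPET, "def sign", "end")
```

The output stops after the `end` that closes the `if`, so the last two lines of the method are missing. A dedicated comment line used as a marker does not have this ambiguity.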

In the end, the only assumption in the after/before pair is that one can add lines only for the sake of documentation, but I have not seen a programming language which does not allow comments, so it is always possible to delimit regions.