0
$\begingroup$

So, I recently run into this bug in the compiler for my programming language that outputs WebAssembly Text Format. Here is what the output looked like:

Improperly indented WebAssembly Text Format

I fixed it by changing in my compiler:

            : context.structureSizes.at(getType(context))) +
            ")\n" +
            std::string(
                convertToInteger32(children.at(0), context).indentBy(3)) +
            "\n\t)\n)",
            AssemblyCode::AssemblyType::i32);

to this:

            : context.structureSizes.at(getType(context))) +
            ")\n" +
            std::string(
               convertToInteger32(children.at(0), context)
                    .indentBy(
                        2)) // Indenting by 2 instead of by 3 because of this:
                            // https://github.com/FlatAssembler/AECforWebAssembly/issues/24
            + "\n\t)\n)",
            AssemblyCode::AssemblyType::i32);

But there are almost certainly more such bugs.

So, how do other compilers that output WebAssembly Text Format make sure the s-expressions are properly indented? Using wat2wasm and then wasm2wat from the WebAssembly Binary Toolkit won't do the trick for two reasons:

  1. We will lose the comments.
  2. We will lose the data about which instructions are put into an S-expression and which ones are put linearly, one after another.

So, is there a better way?

$\endgroup$
8
  • 6
    $\begingroup$ Generally, you write a function to print an S-expression, which accepts an argument for the indentation level of nested nodes. This function will ordinarily be recursive, and recursive calls increment the indentation level. Transforming from explicit recursion to a loop and a stack is an exercise for the reader. $\endgroup$ Commented Aug 7 at 15:13
  • 8
    $\begingroup$ I don't think this really has anything to do with Programming Language Design and Implementation. Laying out text and printing nested structures is a generic problem, independent of programming languages, and a solved one at that. But I will let the community decide. Regarding your literal question, the answer is, in 99% of cases, they don't. For most compilers, their output is for machines, not humans. $\endgroup$ Commented Aug 7 at 15:13
  • 2
    $\begingroup$ @JörgWMittag I think this question is obviously on-topic. Code generation is a standard requirement for compiler implementations, and generating correctly-indented code is a common problem when the target language requires indentation (or has readability benefits from it). Parsing likewise has applications outside of langdev, but is clearly on-topic here. What matters is whether the expertise is relevant to this community and likely to exist within this community, and the answer to both is "yes". $\endgroup$ Commented Aug 7 at 15:19
  • 4
    $\begingroup$ They're basically just asking how to write a pretty-printer. The simplest solution would be to write the compiler in Common Lisp and use its built-in function for this. $\endgroup$ Commented Aug 7 at 15:20
  • 1
    $\begingroup$ @Barmar This problem isn't just relevant for pretty printers (where the issue is readability); I have a compiler which targets (among other languages) Python, where the code generator must correctly indent code for it to work at all. FWIW, here and here are the relevant parts of the source code (in TypeScript). $\endgroup$ Commented Aug 7 at 15:25

3 Answers 3

2
$\begingroup$

There are two main approaches to this problem:

Generating the whole code unformatted, then formatting it later

Seems it's fairly trivial to format WASM text properly based just on the parenthesis. Having the proper indentation in every builtin is, like you found, error prone. Instead, you can first generate a file with no indentation at all. Then have a separate pass where you read this code and apply proper formatting.

Generate indent and dedent tokens

If you cannot remove indentation without changing the meaning of the code, or need manual control, you can insert Indent and Dedent tokens inline, then in a separate pass indent each segment.

For example, my GQ implementation, which transpiles to JS, outputs a data structure like this:

enum OutputToken {
    Text(String),
    LineBreak,
    Indent,
    Dedent
}

fn format(code: Vec<OutputToken>) -> String {

}

This way, your generating function just needs to put the INDENT and DEDENT tokens in the right places, and not need to worry about the final sum of the indentations.

$\endgroup$
2
  • 1
    $\begingroup$ I think there is a third approach, specifically for LISP-like languages, which uses another intermediate representation like type Node = (string | Node)[]. The code generator outputs to this intermediate representation, then there is a final pass to convert it to a string with proper indentation. This is similar to your first option, except there is no need to re-parse the generated code. $\endgroup$ Commented Aug 20 at 15:30
  • $\begingroup$ @kaya3 Feel free to post another answer for that $\endgroup$ Commented Aug 21 at 6:02
3
$\begingroup$

Wasm's S-expression format makes this very easy. The Wasm reference interpreter first generates a simple generic representation of the S-expression tree with the OCaml type

type sexpr = Atom of string | Node of string * sexpr list

It then runs a simple pretty printer over that tree that outputs it or converts it to a string. For each node, it decides whether its subnodes are to be laid out horizontally or vertically, depending on the remaining available line width. To do this efficiently, it uses an intermediate rope structure instead of concatenating strings immediately. The whole thing is about 25 lines of code.

$\endgroup$
2
$\begingroup$

For LISP-like languages, where the syntax is so regular, it may be convenient for your code generator to output a simple intermediate representation (IR), rather than the text format directly. The IR consists of raw strings (like i64.eq and 42), and non-empty lists. The IR is straightforward to convert into text with proper indentation; the upside is that all of the rest of your code doesn't need to concern itself with indentation.

Below is an example in TypeScript:

type Part = string | Part[];

const INDENT_SPACES = 2;

function renderIR(part: Part, indentLevel: number = 0): string {
    if(typeof part === 'string') {
        return part;
    } else {
        let out = '(';
        out += renderIR(part[0]);

        for(let i = 1; i < part.length; ++i) {
            out += '\n' + ' '.repeat(indentLevel + INDENT_SPACES);
            out += renderIR(part[i], indentLevel + INDENT_SPACES);
        }

        if(part.length > 1) {
            out += '\n' + ' '.repeat(indentLevel);
        }
        out += ')';

        return out;
    }
}

This is simplified for the purpose of demonstrating the idea; in reality you probably want to output to some kind of buffer rather than do everything with string concatenation. This example assumes every list element belongs on its own line, so if you want some parts like (i32.const 0) on a single line, your code generator can output that as a string instead of a list ─ or you can write some logic to decide whether to render short lists compactly.

This example also doesn't handle line comments in all positions correctly. These can be dealt with by adding a third type to the union string | Comment | Part[].

$\endgroup$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.