[personal profile] chaource
Can't stop thinking about "parsing expression grammars" (PEGs).

First of all, the choice of the term "PEG" seems rather infelicitous to me. (I would rather say "deterministic top-down parsers".) "PEG" really means "a grammar described through expressions intended for parsing the input text": we are given some input text, and we use some pattern expressions as instructions for processing that input. PEGs are an alternative to the Backus-Naur Form (BNF).

There seem to be relatively few papers about PEGs. I read a paper by Mizushima et al. that explains how Prolog's "cut" operator can be used with PEGs, together with Ierusalimschy's lucid description of an implementation of PEGs through a simple virtual machine. Then I tried to see whether it would be easy to implement the "cut" operator in that virtual machine. It turned out that the "cut" operator is more basic than most of the PEG operators. So I realized that one can describe PEGs by introducing only two easy constructions. What follows is a tutorial about this, which, I hope, is more understandable than other descriptions of PEGs.

[Even though I introduce different (more "low-level") operators than the original PEG papers, I will still call this description a PEG, since the original PEG operators generate the same class of languages. So here goes.]

PEG patterns are instructions for parsing an input string. The input string is a linear array of characters. The parser looks at each character and goes on to the next characters when some given pattern matches. When the parser goes on to the next character, we say that the current character "has been consumed from the input".

In its work, the parser follows a fixed set of rules, as described by the given PEG pattern.

The definition of a PEG pattern is recursive; it builds up larger patterns from smaller ones:

1. GOOD and FAIL are patterns.

GOOD means "go on parsing, all is well". One can say that the pattern GOOD matches the empty string - the pattern GOOD does not check any characters and does not consume any input when parsing.

The FAIL pattern never matches. It fails to match even the empty string, and it fails instantly, without checking any characters and without consuming any input.

2. For all characters c, 'c' is a pattern.

The pattern 'c' matches only the single character 'c' and fails to match any other character. If a character was matched, it is consumed from the input.

3. If A and B are patterns then A B is a pattern.

A B matches if and only if A matches (perhaps consuming some input) and then B matches the rest of the input.

4. If A, B, and C are patterns then the construction IF A THEN B ELSE C is a pattern.

This construction matches in the following way: If A matches (perhaps consuming some input), then B must match the rest of the input. If B now fails, the whole if/then/else construction fails (so in that case the parser does not even attempt to match C). If A did not match initially, then the parser tries to match C. If C now fails, the whole construction fails.

(When the whole construction fails, the input possibly consumed by A is returned to the input string.)

That's all for the rules of building patterns.

For clarity, we may put parentheses around patterns, but parentheses do not change the meaning of patterns - a pattern A is the same as (A).

Here is an example of a pattern built in this way:

'a' 'b' IF (IF 'c' THEN ('d' 'e' FAIL) ELSE GOOD) THEN GOOD ELSE ('a' FAIL)

(I do not intend the meaning of this pattern to be obvious now; it is just an example of our syntax for patterns.)
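To make rules 1-4 concrete, here is a minimal Python sketch of an interpreter for these patterns. This encoding is my own invention for illustration (the names GOOD, FAIL, ch, seq, ite and the position-returning convention are choices of this sketch, not anything from the PEG literature): a pattern is a function from the input string and a position to a new position on a match, or None on failure.

```python
# A pattern is a function: (input string s, position i) -> new position on a match,
# or None on failure. "Consumed no input" simply means "returned the same position".
GOOD = lambda s, i: i           # rule 1: matches the empty string
FAIL = lambda s, i: None        # rule 1: never matches

def ch(c):                      # rule 2: match exactly the character c
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None

def seq(a, b):                  # rule 3: A B
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p

def ite(a, b, c):               # rule 4: IF A THEN B ELSE C
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

# The example pattern from the text:
# 'a' 'b' IF (IF 'c' THEN ('d' 'e' FAIL) ELSE GOOD) THEN GOOD ELSE ('a' FAIL)
inner = ite(ch('c'), seq(ch('d'), seq(ch('e'), FAIL)), GOOD)
example = seq(ch('a'), seq(ch('b'), ite(inner, GOOD, seq(ch('a'), FAIL))))

assert example("abd", 0) == 2      # consumes 'ab'; no 'c' follows, so it matches
assert example("abc", 0) is None   # a 'c' follows, so the whole pattern fails
```

Running it reveals what the opaque-looking example pattern does: it is 'a' 'b' followed by a check that no 'c' comes next.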

5. General rule: when a pattern does not match, it does not consume any input. (When a pattern matches, it may or may not consume any input.) This rule is obeyed by the primitive patterns (1,2) and by the pattern constructions (3,4). Therefore, this rule is obeyed by every possible PEG pattern.

6. General syntactic rule: a PEG pattern can be denoted by a name such as "A", "B", "Xyz" etc., which is written as

A = pattern

Patterns can then refer to such names (also recursively). So, for example, we could define the patterns "A" and "B" by writing B = 'x' A A and A = IF B THEN ('y' A) ELSE FAIL.
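In the function-based Python encoding of rules 1-4 sketched earlier (patterns as functions from (string, position) to a new position or None; my own encoding, not a library), a named pattern is simply a def, and recursion comes for free because a def can refer to itself by name. The grammar below is my own small example, not one from the text:

```python
GOOD = lambda s, i: i
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

# A named, recursive pattern: A = IF 'x' THEN A ELSE GOOD
# (it consumes as many 'x' characters as it can find)
def A(s, i):
    return ite(ch('x'), A, GOOD)(s, i)

assert A("xxxy", 0) == 3   # consumed three 'x', left "y"
assert A("y", 0) == 0      # matched the empty string, consumed nothing
```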

That's it! This set of rules seems to be reasonably short and easy to grasp intuitively.

Now one just needs a few examples to see how PEGs work. In particular, the if/then/else construction is quite interesting.

A = 'x' 'y' 'z'

The pattern A matches 'xyz' and any string that starts with 'xyz' but does not match 'xya' or 'xy'. If the pattern A is used on the input string, say, 'xyzXX', then the initial 'xyz' will be consumed, leaving 'XX' for any further parsing. If the pattern A is used on the string 'xyXXX', then the pattern fails, consuming nothing - further parsing will again see 'xyXXX'. Despite the presence of the initial 'xy', the pattern A either matches entirely or fails entirely. This is the case with all PEG patterns.
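Using the same hypothetical (string, position) encoding of patterns as Python functions, the all-or-nothing behavior of A is easy to check:

```python
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p

A = seq(ch('x'), seq(ch('y'), ch('z')))   # A = 'x' 'y' 'z'

assert A("xyzXX", 0) == 3     # 'xyz' consumed, 'XX' left for further parsing
assert A("xyXXX", 0) is None  # fails wholesale; the initial 'xy' is not consumed
```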

B = IF A THEN GOOD ELSE GOOD

First we try to match A (that is, 'xyz'). If we have 'xyz', we execute GOOD, i.e. do nothing. If we don't have 'xyz', we also do nothing (but in that case, no input was consumed). Therefore, B is a pattern that never fails but consumes 'xyz' if it was found. In regular expression notation, and in PEG notation, this is written as ('xyz')?. This matches 'xyzxyXX', consuming the initial 'xyz' and leaving 'xyXX' unprocessed.
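In the same Python sketch of rules 1-4, B is one line, and the two cases behave as described:

```python
GOOD = lambda s, i: i
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

A = seq(ch('x'), seq(ch('y'), ch('z')))
B = ite(A, GOOD, GOOD)        # B = IF A THEN GOOD ELSE GOOD, i.e. ('xyz')?

assert B("xyzxyXX", 0) == 3   # 'xyz' found and consumed; 'xyXX' left over
assert B("xyXX", 0) == 0      # no 'xyz': B still matches, consuming nothing
```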

C = IF A THEN FAIL ELSE GOOD

First we try to match A, that is, 'xyz'. If we find 'xyz', we will fail (and thus C will not consume any input). If we do not find 'xyz', we have not yet consumed any input, and we then go on to "GOOD", i.e. we consider that the pattern has matched. So C actually never consumes any input! The pattern C succeeds if and only if there is no 'xyz' at the beginning of the input. In PEGs, this pattern is denoted by the "negative lookahead" operator and written as ! 'xyz'.
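The negative lookahead also falls out of the same Python encoding; note that C never moves the position:

```python
GOOD = lambda s, i: i
FAIL = lambda s, i: None
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

A = seq(ch('x'), seq(ch('y'), ch('z')))
C = ite(A, FAIL, GOOD)        # C = IF A THEN FAIL ELSE GOOD, i.e. ! 'xyz'

assert C("xyzQ", 0) is None   # 'xyz' is there, so C fails
assert C("xyQ", 0) == 0       # no 'xyz': C matches and consumes nothing
```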

D = IF A THEN GOOD ELSE FAIL

If A matches, D matches and consumes the same input. If A does not match, D does not match and consumes no input. So actually D is the same as A.

E = IF A THEN FAIL ELSE FAIL

Obviously this is the same as simply FAIL. Whether or not A matches, the pattern E fails and no input is consumed.

F = IF 'a' THEN GOOD ELSE ('b' F)

This is a recursive pattern. It matches if 'a' initially matches (i.e. if the input string begins with 'a'). If 'a' is not found then 'b' must match, and we again require that F matches. Hence, F matches any number of 'b' followed by 'a'. In other words, the pattern F will match 'a', 'ba', 'bba', ... but neither 'bbb' (no 'a' at the end) nor 'ca' (does not start with 'a' or 'b'). In regular expression notation, this is b*a. In the PEG notation, this pattern is equivalent to ('b')* 'a'.
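The recursive pattern F can be written directly in the Python sketch, again using a def so the pattern can refer to itself:

```python
GOOD = lambda s, i: i
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

# F = IF 'a' THEN GOOD ELSE ('b' F): any number of 'b' followed by 'a'
def F(s, i):
    return ite(ch('a'), GOOD, seq(ch('b'), F))(s, i)

assert F("a", 0) == 1
assert F("bba", 0) == 3
assert F("bbb", 0) is None    # no 'a' at the end
assert F("ca", 0) is None     # starts with neither 'a' nor 'b'
```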

The original PEG notation as described by Ierusalimschy actually does not have the if/then/else construction. Instead, PEG uses construction 3 and six other operators. Three of them we have already seen: ?, *, and !. For instance, A* can be defined as N = IF A THEN N ELSE GOOD.

PEGs use three more operators: +, / and &. All these operators can be expressed through the constructions 3 and 4.

The operator + means "one or more repetitions" and is implemented as A+ = A A*. The operator & means "positive lookahead": the pattern &A matches only when A matches, but &A does not consume any input. This can be implemented as !(!A) through the negative lookahead operator, which we already defined above.
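All three derived operators can be checked in the Python sketch of rules 1-4 (the helper names star, plus, nt, amp are mine):

```python
GOOD = lambda s, i: i
FAIL = lambda s, i: None
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

def star(a):                     # A* as N = IF A THEN N ELSE GOOD
    def p(s, i):
        return ite(a, p, GOOD)(s, i)
    return p
def plus(a):                     # A+ = A A*
    return seq(a, star(a))
def nt(a):                       # !A  = IF A THEN FAIL ELSE GOOD
    return ite(a, FAIL, GOOD)
def amp(a):                      # &A  = !(!A)
    return nt(nt(a))

x = ch('x')
assert star(x)("xxy", 0) == 2
assert plus(x)("y", 0) is None   # + requires at least one match
assert plus(x)("xxy", 0) == 2
assert amp(x)("xy", 0) == 0      # &A matches but consumes nothing
assert amp(x)("y", 0) is None
```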

Finally, the / operator of PEG is a "prioritized choice". It works like this: A / B matches if A matches, or if A fails and B matches. Here "A" has priority over B: if "A" matches, "B" will never be attempted for matching. So A / B is not equivalent to B / A. This makes a difference, say, when both "A" and "B" can match the same input but would consume different amounts of the input. But this difference is rather subtle.

The / operator can be implemented like this,

A / B = IF A THEN GOOD ELSE B.
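A quick check of the prioritized choice in the Python sketch makes the subtle point visible: once A matches, B is never attempted, so A / B and B / A can consume different amounts of input.

```python
GOOD = lambda s, i: i
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

def choice(a, b):                 # A / B = IF A THEN GOOD ELSE B
    return ite(a, GOOD, b)

p = choice(ch('a'), seq(ch('a'), ch('b')))   # 'a' / 'a' 'b'
q = choice(seq(ch('a'), ch('b')), ch('a'))   # 'a' 'b' / 'a'

assert p("ab", 0) == 1   # 'a' wins; 'a' 'b' is never tried
assert q("ab", 0) == 2   # 'a' 'b' wins here, so order matters
assert q("ac", 0) == 1   # 'a' 'b' fails, the second alternative is tried
```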

The if/then/else construction can be expressed through the original PEG operators as

IF A THEN B ELSE C = A B / (!A) C,

which in my view is rather more difficult to think about, since this involves two nontrivial PEG operators.

One thing is not allowed in PEGs: left recursion. The pattern A defined by A = A 'x' is not allowed; the parser will go into an infinite loop when trying to match this pattern. Similarly, A = IF A THEN X ELSE Y is not allowed. More generally, no pattern may call itself after consuming no input. GOOD*, A = IF GOOD THEN A ELSE X, or left-recursive mutual recursion (A = B C, B = A D) are not well-formed PEGs.
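In the Python sketch of rules 1-4, the infinite loop caused by left recursion surfaces as a RecursionError (an artifact of this particular encoding; a real PEG implementation would instead reject such a grammar statically):

```python
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p

def A(s, i):                    # A = A 'x': calls itself at the same position
    return seq(A, ch('x'))(s, i)

try:
    A("x", 0)
    looped = False
except RecursionError:          # the left recursion never terminates
    looped = True
assert looped
```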

Clearly, some constructions can be simplified, "optimizing" the grammar. For example, A FAIL = FAIL, GOOD A = A, etc. One of my motivations for describing PEGs in this way is that these constructions act as a low-level implementation language for the grammar. A "program" in this language can be "optimized" after it was written, improving the parsing performance. But I have not yet written out all the optimizations and checked that there is indeed a possibility of performing all those optimizations symbolically, at the level of the definitions 1-4, before translating them to the virtual machine. [Ierusalimschy talks about quite a few interesting optimizations at the level of the virtual machine, but those are really low-level.]

Actually, it is possible to express rule 3 through rule 4:

A B = IF A THEN B ELSE FAIL
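This equivalence is also easy to verify in the Python sketch: a sequence built from if/then/else agrees with the direct rule-3 sequence on both success and failure.

```python
FAIL = lambda s, i: None
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

def seq2(a, b):                  # A B = IF A THEN B ELSE FAIL
    return ite(a, b, FAIL)

assert seq2(ch('x'), ch('y'))("xy", 0) == 2
assert seq(ch('x'), ch('y'))("xy", 0) == 2
assert seq2(ch('x'), ch('y'))("xz", 0) is None
assert seq(ch('x'), ch('y'))("xz", 0) is None
```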

So it seems that reasoning about PEGs could be simplified to reasoning about the if/then/else operator alone!

One can write a PEG grammar using the PEG operators as abbreviations, or the if/then/else construction when required. For some more convenience, we can define character sets such as [a-z] or string matching so that we can write 'hello' instead of 'h' 'e' 'l' 'l' 'o'. These are inessential; the essential operators are defined in 1-4.

[Mizushima et al. attempted to introduce the "cut" operator into PEG, but they figured that it can work only in two contexts: one is the if/then/else construction, and the other is essentially the pattern

N = IF A THEN B N ELSE GOOD,

which matches any number of repetitions of "AB" except when followed by A without B. They had to explain that the "cut" operator may be inserted only at particular places in the grammar, otherwise it was "invalid". The construction explained in this tutorial is simpler: there is no explicit "cut" operator, instead there is the if/then/else construction, which can be used everywhere and allows us to express everything that the "cut" operator does.

So when we have the if/then/else construction, we do not need a separate "cut" operator. The if/then/else construction does not actually increase the power of PEGs; but the description of PEGs becomes, in my view, a lot easier to understand and to reason about.]

The most interesting and subtle aspect of PEGs is the treatment of backtracking. Other grammar formalisms, such as BNF, allow ambiguity; PEGs disallow ambiguity and specify the backtracking behavior quite precisely. For instance, if a PEG pattern A (B/C) D is executed and D fails while A and B matched, then the entire pattern fails. The alternative (B/C) is not explored further; once B matches, C will never be attempted. In BNF notation, A (B|C) D requires that C is attempted when D fails. So PEG can disallow a specific backtracking path and resolve ambiguity in a way that BNF cannot specify.

I think that the if/then/else definition gives a more useful insight into ambiguity resolution than the standard PEG notation. For example, the pattern A* A apparently should match A as many times as possible, as long as there is at least one A. Suppose A = 'x' 'y'. Does A* A match 'xy12'? The PEG rules do specify this unambiguously, of course. However, the precise rules are difficult to remember (at least, for me). [The usual description of the * operator is "zero or more repetitions", but this is not precise, as we will now see.]

Once we try to rewrite A* A using the if/then/else construction, we find that there are two ways of writing a pattern S that kind of looks like A* A at first sight:

1) "lazy"
S1 = IF A THEN GOOD ELSE (A S1)

2) "greedy"
A_star = IF A THEN A_star ELSE GOOD
S2 = A_star A

Let us apply this to the example where A = 'x' 'y' and the input string 'xy12'. The pattern S1 will match 'xy' and stop; '12' remains unconsumed for further parsing. The pattern S2 will start by applying A_star, which will first consume 'xy' and then apply A_star to the remaining '12'. This starts again by applying A = 'xy' to '12', which fails; hence A_star goes to the GOOD clause and succeeds. We are back inside S2, with the input string '12' remaining and A_star having succeeded. S2 continues by applying the final A to '12', which fails. Hence, the entire S2 will consume no input and fail!

So the difference between S1 and S2 is that S2 is "greedy" - its A* consumes as many A's as possible, regardless of what happens later. So A* consumed 'xy' and then the last A failed on '12'. The pattern S1 is "lazy" (not greedy) - it does not try to match A* if the final A is already found. It happens that PEG specifies the greedy matching, as in S2. (This is not explicitly mentioned in the original PEG papers, but Ierusalimschy discusses this issue in detail.)
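Both variants can be run side by side in the Python sketch of rules 1-4 (S1 and S2 follow the definitions in the text; the encoding is my own):

```python
GOOD = lambda s, i: i
def ch(c):
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None
def seq(a, b):
    def p(s, i):
        j = a(s, i)
        return None if j is None else b(s, j)
    return p
def ite(a, b, c):
    def p(s, i):
        j = a(s, i)
        return b(s, j) if j is not None else c(s, i)
    return p

A = seq(ch('x'), ch('y'))

def S1(s, i):                   # "lazy":  S1 = IF A THEN GOOD ELSE (A S1)
    return ite(A, GOOD, seq(A, S1))(s, i)

def A_star(s, i):               # greedy star: A_star = IF A THEN A_star ELSE GOOD
    return ite(A, A_star, GOOD)(s, i)
S2 = seq(A_star, A)             # "greedy": S2 = A* A

assert S1("xy12", 0) == 2       # consumes 'xy', leaves '12'
assert S2("xy12", 0) is None    # A* eats 'xy'; the final A then fails on '12'
assert S2("xyxy", 0) is None    # greedy A* always leaves nothing for the final A
```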

A "standalone" implementation of the lazy * operator is not possible in PEG: if it were possible, e.g. if there were such an operator, denoted, say, by A*?, then we would expect to write A*? B for non-greedy matching of A before B, for any pattern B. However, A*? B must fail when B fails. So A*? must decide when to stop consuming input so that B would possibly succeed. Thus, A*? needs look-ahead of B. So the only way to implement the lazy * operator is to implement the construction A*? B with two given patterns A,B. (We have just seen an example of this construction as the pattern F above.)

I am, however, just starting to dig into these intricacies. Perhaps I will be more enlightened later. There is an interesting discussion of these matters here.

Date: 2012-06-26 10:31 pm (UTC)
From: [identity profile] gds.livejournal.com
not on-topic, but:
- lady ermine (from a jabber conference) is writing PEGs in OCaml: https://github.com/ermine/kombain ; she had to enrich the original specifications found in Some Smart Book (named something like "Parsing Technologies") to be able to bind parsed values to identifiers and to post-process parsed values. Some things in her work could be discussed, though..
- I have some formalization of her work in Coq (which extracts perfectly to OCaml); for example, I proved (!(!A)) = &A, and some other things, but the whole task of formalization looks useless now (I wanted to generate regexps from the PEG grammars, but it's hard and it requires some language that could manipulate matched substrings; so I abandoned it).

Date: 2012-06-27 06:56 am (UTC)
From: [identity profile] chaource.livejournal.com
Thanks for the info!
I was implementing PEGs not through functional combinators but through the virtual machine designed by Ierusalimschy. It would be interesting to compare performance; I actually intended also to implement the combinators.

Why did you want to generate regexps from PEGs? And why the effort of formalizing everything in Coq?

As for Coq, this was also interesting for me some time ago, but right now I'm working in industry and the typical tasks seem to be totally different from those for which Coq is specialized. I am programming mobile devices - iPhone, iPad, Android phones, and tablets. There are almost no algorithms; the whole task is to organize the classes given by Apple / Google, to subclass them, and to push information from a web server into the GUI and back. Sometimes there is additional complexity due to animations and timing. The complexity is mostly in the organization of the information flow and temporal logic, not in some algorithms manipulating strings or numbers. There are no loops, recursive functions, or parsers; there are just many, many classes with integer and string-valued properties, and the whole problem is to organize the copying of information from one class to another and to check some trivial conditions... I had a hope that Coq would help me organize this information flow, but I don't know enough about it to see whether this is so. And, most of all, you are programming within a large and fixed framework of classes in a given programming language. (There is one company, psellos.com, that uses OCaml for developing iPhone applications, and I'll definitely try that as well when time permits.)

Date: 2012-06-28 01:14 am (UTC)
From: [identity profile] gds.livejournal.com
> Why did you want to generate regexps from PEGs?

1. I wanted to learn Coq (it was one of my first coding experiments with it), and implementing these combinators was close to "real world tasks" -- for example, the implementation contains non-structurally-recursive functions (so I had to prove their termination); 2. I chose PEG just because ermine has an OCaml implementation of PEG grammars and parsers, and it was interesting to play with Coq extraction (played successfully -- the extracted code is very similar to the code I could write manually); 3. there was some work task in the distant future, about parsing some text markup in the browser (so, in JS), and afaik the fastest way to work with JS strings is using regexps.

> And why the effort of formalizing everything in Coq?

I wanted to prove some things about PEG to play with proofs. Also I wanted to prove the correctness of some PEG grammar transformations. But there is one more reason why I used Coq here, and it's not about formalizing. Coq extraction has an interesting feature: it allows inlining extracted functions (marked as "inlineable" by the user), so monadic/combinatorial code can benefit a lot from this.

> the whole problem is to organize the copying of information from one class to another and to check some trivial conditions

Pity, I have no ideas here. These copying/checking manipulations can of course be implemented in Coq, but without a good background idea (on structuring these dataflows) the implementation will not be much better than in any other language. Maybe these dataflows can be written in some concise DSL, but I don't know the details well enough to say something definitive.

> There is one company, psellos.com, that uses Ocaml for developing iPhone applications, and I'll definitely try that as well when time permits

I've heard about another company (or individual? I don't remember) that uses OCaml on iPhones too, but I can't give links to that discussion (it was on caml-list). So the whole idea of using OCaml for developing iPhone software is good and has been checked in practice.

Date: 2012-06-28 07:45 am (UTC)
From: [identity profile] chaource.livejournal.com
It would be interesting if Coq could extract at least some code for iPhone applications. I wonder if Coq could formalize a dataflow of the following kind:

- There are classes A, B, C with methods A#a1, A#a2, B#b1, B#b2, C#c1, C#c2, and so on.

- The class A responds to user interface events; which means that the iPhone OS itself calls A#a1, A#a2, etc. when the user clicks on something or other. Class A must be derived from a class provided by the iPhone SDK.

- The class B (provided by the iPhone SDK) can show something on the screen; so we have to call B#b1 when A#a1 is called.

- The class C provides data that we have to show. So when we call B#b1 in response to A#a1, we have to pass a value obtained by calling C#c1.

Skeleton code looks therefore like this:

classA = object
method a1 = B#b1 C#c1
method a2 = B#b2 C#c2
...
end

- Sometimes we need to call C with values obtained from B, e.g. to store the data entered by the user. (C is sometimes a frontend for a database.)

... method a3 = C#c3 B#b3
... method a4 = C#c4 B#b4
...

- Sometimes the calls are asynchronous, so "method a3" must be called within a closure that we pass to some system object S. Say, this is a request to a Web server, or a request to start an animation. The object will call our closure when the result is obtained (when the web server responded, when the animation is finished, etc.).

method a4 = S#start_request (fun x -> match x with | Some result -> A#a4 result | None -> B#show_error)

The system object S is again derived from a class provided by the iPhone SDK.

So I wonder, is it possible to derive, using Coq, the required dataflow and to generate the required method definitions, from a specification of some kind of "theory"? Of course, the programmer will then have to fill in all the details, e.g. how exactly the data is drawn, what kind of animations are added, etc. Also, Coq would have to know about the facilities provided by the iPhone SDK and OS.

The complexity here is that there are literally hundreds of these classes and methods that we have to implement, and each method is more or less just a trivial reshuffling of data. The specification of the iPhone application can say something like "the value X1, stored in the database, is always shown in the entry field F1, and the user can also edit and save this value" - and this translates into about a dozen new but trivial methods in three or four classes. Currently we have to write all this by hand. It is very easy to make mistakes, and writing unit tests would mean writing twice the amount of code, since every piece of the code is by itself trivial. (A unit test for A#a1 can only check that the method A#a1 really did call B#b1 and C#c1...)

Of course, if we have a hundred fields, like F1, F2, ..., all with exactly the same functionality, we can make this automatic, but this does not really solve the problem - we rarely have such a situation. Normally, we have a couple of such fields, then some buttons that cause all kinds of other events, then some graphics to be loaded from the web server and shown, etc.etc. - every part is by itself trivial, but there are lots of these parts, and the dataflow becomes complex quite quickly. It would be good if Coq could help here.
Edited Date: 2012-06-28 08:03 am (UTC)

Date: 2012-07-05 06:31 am (UTC)
From: [identity profile] gds.livejournal.com
I thought about it. (maybe too long.) But I don't see any "classical" solution to this problem.

Coq can be useful only because of its functionality of "filling holes in terms" -- when you have an incomplete term, Coq can write some code automatically, using types. But this is somewhat dangerous to use for writing code -- Coq chooses arbitrary values to fill the holes until the types match; this is useful in proofs (thanks to "Proof Irrelevance"), but it can choose a wrong "computational" value. If you want to [ab]use this feature, you should decorate the real API (the iPhone's, the database's) with good and unique types to prevent accidental building of wrong computational terms.

As for the rest of the task, I think it could be better done with some DSL, where methods would be annotated with some "dataflow" attributes (along with types, of course). This approach is more practical than using Coq -- you can generate code in any language (now it's Objective-C, I suppose), and the generated code can have some "placeholders" for the "real code" (moreover, you can write some logic to store/restore the "real code" when changing class attributes and regenerating the application "skeleton").

Also DSL can have some directives like "field F1 is connected to data value D1 and must use validation function V1 on entry", or "while doing action Ac1, animation An1 must be showing on screen". So the DSL must have at least two layers: one layer for methods connectivity (which method can/should call which method), and one layer for some functionality that is written and checked against this connectivity information.

(As a side note: the presence of two layers, where one of them is used as a "checking mechanism" for the other, gives a hint that the "checking layer" can be moved into the compiler's typechecker. However, here it's not possible to move it completely into types, since sometimes you'll need not only to _check_ connectivity, but also to _browse_ it (iterate over it, generate code from it, look at some internals, rewrite it). The common ML solution here is to _partially_ move functionality to the type checker: use a parametric data type that 1. carries real information useful for "browsing", and 2. has a type parameter useful for typing some operations on this data type with the compiler's type checker.)

For the implementation language of such a DSL I can recommend OCaml. Coq will only add complexity here, since you don't have to prove this code. Here the logic is complex to write and therefore complex to prove. Simple testing will work much better here.

Date: 2012-07-06 07:31 am (UTC)
From: [identity profile] chaource.livejournal.com
Yes, so far I was thinking only along the lines of a DSL that would be embedded directly into Objective-C. My first attempt in this direction was not very successful (the DSL was quite complicated to deal with). Now I am thinking of writing a DSL that will be compiled to Objective-C (or Java, as needed), so that the programmer will only have to write some lower-level modules implementing, say, specific animations or algorithmic computations. But I hoped that Coq could formalize the data flow in some way.

A constraint such as "the field F1 contains only integers and has a value equal to that of data object D1" is perhaps similar to the "object constraint language" (OCL) that is part of UML. However, I can't understand most of those specifications; they are too complicated. It is perhaps easier to design one's own thing, and to use OCaml for implementation, as you say.

Date: 2012-07-06 07:41 am (UTC)
From: [identity profile] gds.livejournal.com
> But I hoped that Coq could formalize the data flow in some way.

I can't see how Coq can help here. Ok, after formalization you'll get some relations, some properties, proofs of dataflow correctness, mechanics to prove that the code respects the dataflow. What's next? You need to generate some code, not to prove it. Yes, you can generate code from Coq too, but I can't imagine why you would need that here. But if you want to generate _and_ check/prove, then Coq is the way to go. A simple DSL without proofs will require less mind-work, though.

Also take a look at the "arrows" abstraction (it's from the Haskell world); maybe it will be useful somehow. (It's useful to understand "arrows", at least.)

UML is a beast. Your own DSL will fit much better.

Date: 2012-07-07 11:01 am (UTC)
From: [identity profile] chaource.livejournal.com
A question for you about OCaml. Is it possible to write a single module consisting of several separate ML files? Say, in C++ I would use #include and thus obtain a class written in parts across different files. In Objective-C and in C# one can define classes in parts - a few methods in one file, a few in another. In OCaml it is hardly possible to split a class across files, since the object type is structural and must be visible all at once. The question is whether a module can be split across files?

Date: 2012-07-07 11:15 am (UTC)
From: [identity profile] gds.livejournal.com
A class, indeed, cannot be split across files. More precisely, in some cases one can try splitting it into independent classes plus inheritance. It will work, but I am not sure it is necessary.
With a module it is quite possible, provided there is no mutual recursion between elements from different files. Describe the needed functionality in separate file-modules, then in the target module write "include Part1 include Part2 ...".
Problems arise in the following cases:
1. if there is mutual recursion between the files - then there is no way, or you have to hack around it, which is sometimes adequate. With a concrete example it would be easier to say how.
2. if the same data type is used in several modules - with include you will get "Error: Multiple definition of the type name xxx". This is worked around as follows: make a file Mymodule_types.ml with the types, in each of the parts do open Mymodule_types, and the resulting module becomes "include Mymodule_types include A include B". I sketched a test example at http://paste.in.ua/4484/ and verified that it works. There T, A, B are meant to stand for files.
