Implement range based DSL with VM

Question

I wanna create DSL for rule modbus devices over the local network. I'll give you some context so that you better understand what I mean.First of all, imagine a device (let's say an oscilloscope), we can control it by sending commands to the device. Usually, all of these devices have similar features. They have input and output channels.We can set the parameters for the channels. (for example, for a power supply unit, we can set the output via channel number 1 to 1 Ampere and 10 Volts)

I have created a simple grammar for this. In the BNF format:

ranges ::= range ( ',' range )*
range  ::= | number '-' number
           | number

More generally, we have this grammar:

expr ::= ranges '=' configuration

I have no problem with the lexer and expression parser stage.To learn, I write everything from scratch. Without using any generators (like bison). I use the book very much for learning.The pseudocode for parsing such expressions is very simple:

def ranges(self):
    self.range()

    while self.match(TokenType.COMMA):
        self.range()
    

def range(self):
    if self.check(TokenType.NUMBER):
        left = int(self.current.lexeme)
        self.advance()

    if self.match(TokenType.MINUS):
        if self.check(TokenType.NUMBER):
            right = int(self.current.lexeme)
            self.advance()
        # have range here
        # can emit it to VM

The problems start further. There are a lot of devices using the modbus protocol. I would like to have some kind of universal solution for them. So I decided to create a virtual machine. We usually create a VM in order not to depend on the target architecture, right? I use a similar principle, but I imagine that instead of the target architecture, I have a set of different devices.

I don't quite understand what to do next between the parser and the VM.

My idea was to simply generate the range itself at the parsing stage (let's say for 1-5 = [1,2,3,4,5] and then just push it to the VM.

I would create something like OP_PUSH_RANGE (Op code) and do push to the stack of the VM.

The problem is when you write something like "1=10V" The stack of the virtual machine will look something like this:

[1, 10,]

I could add special tag values (for example for range values and for configuration values)

Then my VM stack would look like this:

[<rng>, 1, <rng>, <cfg>, 10, <cfg>]

Now I can just start the get stack function and iterate through it and pull out the ranges and settings separately. This sounds like a solution to the problem. But I'm not sure about him yet. My question is still how to distinguish between the values before the '=' character (because they have numeric values) and after. And also how to develop a VM for languages of this type. Yes, I know the language sounds too loud. Let it be a tiny language

I don't understand what the role of your VM/bytecode is. The traditional style of stack bytecode would have the instructions know what the values on top of the stack represent, with likely some kind of pointer equivalent for your variable-size compound range type, and the code generation output instructions and pushes in the right order to make that work, but it seems like you're thinking differently and I'm not sure if that's because of a constraint in your real-world application, which I don't quite understand, or if that just is the answer to your question. You canedit your question. — Michael Homer
– Michael Homer ♦, Commented Dec 5 at 20:48
Your grammar has digit+ for the range parts, but your parser consumes only a single TokenType.NUMBER for each of those parts. Also, if each digit is a separate token (as the naming and the grammar suggest), then digit+ would match text like 4 5 6, which you presumably don't want. I'm guessing that the way you've actually implemented it, 456 is a single "number" token rather than three "digit" tokens, so your grammar should say number instead of digit+. — kaya3
– kaya3, Commented Dec 6 at 10:24
@kaya3 you are right! thanks of course it need to be number here — Karlos Margaritos
– Karlos Margaritos, Commented Dec 6 at 17:28
@MichaelHomer Thank you! At the moment, I don't know how to create a VM that would support a variable-length range. Also, I would like to note that the case “1-4,5” produces [1,2,3,4,5], and the case “1-4,4” produces [1,2,3,4]. — Karlos Margaritos
– Karlos Margaritos, Commented Dec 6 at 17:32

Michael Homer · Accepted Answer · 2025-12-06 22:04:41Z

I'm going to go through what I think would be the most conventional approaches here, but I'm not sure exactly what your use case is to know where it needs to be adjusted to suit.

Typically, a stack-based bytecode virtual machine has a stack of data values, sometimes alongside an explicit return pointer stack, and has operations that take values from the top of the stack, process them, and push the result(s) back onto the top of the stack. The interpretation of a data value on the stack is determined by the operation that pops it off. Sometimes there are also a small number of registers or additional data stacks.

In lower-level virtual machines, the stack entries are all the same size (e.g. an array of 64-bit values), and then different instructions might interpret any given value as an (unsigned) integer, an IEEE binary64 double, packed 8-bit characters, or a pointer into linear memory. These entries can only store a fixed amount of information, and anything further needs to be created in memory and the address stored on the stack for the next operations to use.

In a higher-level VM, everything might be a pointer already; your code looks like Python, so your stack is a list that can have heterogeneous values in it, and as Python objects they have inherent types as well. One stack slot for you can hold any kind of value, including compound values, and you don't have as much of the "reinterpret some bits" to go on.

Your "ranges" construct can be arbitrarily large, since it has an unlimited number of ranges inside. I can see two approaches to work with that if the ranges are meaningful in themselves:

A range is created from a pair of integers, with a create_range opcode, and a pair of ranges can be merged into one "ranges" with a merge_ranges opcode. Your example in the comments of 1-4,5 would compile down to (push 1) (push 4) (create_range) (push 5) (dup) (create_range) (merge_ranges). The stack would be, step-by-step:
- []
- [1]
- [1,4]
- [Range(1,4)]
- [Range(1,4), 5]
- [Range(1,4), 5, 5]
- [Range(1,4), Range(5,5)]
- [Range([Range(1,4), Range(5,5)])]
where Range is a Python class type you create that holds either a pair of lower and upper inclusive bounds or a list of constituent ranges. You could then (push 10) (apply_configuration), with apply_configuration taking the top two items as the range to apply to and the configuration to apply to that range. In this way, you'd have compound values as your stack entries and your operations would be of fixed small arities.

I would think of this as the usual way if the ranges are individually meaningful. You have complex compound types of unknown size in your language and they need to be represented, so you create them in memory and you point at those compound values from the stack.
You expand out your ranges statically and push the individual values to the stack, followed by the number of items you pushed, and have variable-arity opcodes using that value off the stack. In that case, the 1-4,5 = 10 example would turn into (push 1) (push 2) (push 3) (push 4) (push 5) (push 5) (push 10) (apply_configuration), where apply_configuration would first take two items (the configuration, and the number of range entries it will be applied to), and then an additional number of items corresponding to that second value (here 5).

This is not the most common, but variable-arity bytecode systems do exist and it's not wrong to have one if that reflects the underlying domain needs. Here the stack only ever contains numbers, and the meaning of any of those numbers is determined by the bytecode that pops it off.
A third version, if the ranges themselves are not significant and are just a concise way of expressing multiple values, would be to have your code generation handle all of this: statically expand out the ranges as you've described and as above, and then produce the bytecode for an individual configuration for each of them. In that case, 1-4,5 = 10 would generate bytecode for five different single-entry configurations: (push 1) (push 10) (apply_configuration) (push 2) (push 10) (apply_configuration) ....

I would see this as the usual way when the ranges are not in themselves semantically-meaningful values. The role of a compiler is to take code in a human-accessible form and elaborate it out into what the machine needs, which includes generating repetitions that the user doesn't have to write. This gives the simplest bytecodes: apply a single configuration to a single channel. Again, the stack contains only numbers and it's the instruction that knows that one of them is the channel and one is the configuration value; your compiler generates the appropriate pushes in the right order.
If you really do need to have the types separated, you can use multiple stacks: a dedicated stack that holds channel identifiers and/or one that holds configurations, with the main data stack either separate again or unified with one of those. There would be separate instructions to push constants to each stack. Instructions would pull some number of values off each stack, potentially including emptying one of them for the variable-arity cases, and push results into any combination of them. In this case, you could have (channel_push 1) (channel_push 2) ... (push 10) (apply_configuration_all) or (channel_push 1) (push 10) (apply_configuration), where the final opcode pulls either all or a single channel identifier off that dedicated stack, and the configuration value off the main stack.

Again there are real systems that do this (mostly for differently-sized types), and it could make sense in your world to have a strict separation. I would not leap to use this approach if you don't see it as clearly necessary.

I don't think I fully understand the domain you're working in, so I can't say whether one or other of those is more suitable, but it may be obvious to you that one of them isn't. Abstractly, I'd suggest starting with #3 as default and moving to #1 if you need it. The major value a VM might give you is simplicity of reimplementation, and pushing complexity into the compiler rather than runtime is beneficial there.

There is one further note of caution: from your problem description, it's not clear to me that a VM is buying you anything. Crafting Interpreters gets into this because it's one next step that can be taken to gain performance or to learn about compilation, but it's not necessary. A tree-walking interpreter will give you the semantics you need without getting into this, and so you could probably stick to the first half of the book that is describing those.

If your language design is in flux, that's going to be a lot easier to work with than having to update the VM as you go as well. You can still address different systems it's being used with inside the language, or by swapping out the implementing classes, so you shouldn't lose anything but will gain a lot of simplicity if this is your first time building a language implementation. Once you have a more solidified language, then you could come back around to creating a backend VM for it.

Stack Exchange Network

Implement range based DSL with VM

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Implement range based DSL with VM

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions