Stories by Amr Hesham on Medium

How I built a LLVM IR linter using SQL syntax

Amr Hesham — Mon, 10 Feb 2025 00:15:30 GMT

Hello everyone, I am Amr Hesham, a software engineer interested in open source, compilers, and language design. In the previous article, I explained how I built LLQL, a tool that searches for patterns in LLVM IR/BC files with an API similar to the LLVM InstCombine pattern-matching functions.

For example, this query searches for Add with Sub as the left-hand side and Mul as the right-hand side.

SELECT instruction FROM instructions WHERE m_inst(instruction, m_add(m_sub(), m_mul()))

This query will match the following instruction

%sub = sub i32 %a, %b
%mull = mul i32 %a, %b
%add = add i32 %sub, %mull  <------- Matches

Now that the GitQL SDK has improved to support new language features, functions, and SDK customizations, we can take the LLQL idea to the next step.

The idea is to create a linter that detects patterns in the IR/BC that should have been optimized away. If such a pattern still exists in the output IR, the optimizer missed it.

GitQL already supports script mode, so you can put all the queries in one file, execute it, and check for every pattern you want in a single run. Let's start.

Create the linter file that will contain all queries

So first, let's create a simple file, Lint.sql, to hold the queries for all the patterns we want. We can then easily run it using llql -f "Output.ll" -s "Lint.sql".

Creating SQL queries for linting

In this article, I will show you how to build queries that detect real-world patterns (all the patterns here come from LLVM InstCombine), but you can build any pattern you want from the large set of functions supported in LLQL; they are listed in the documentation.

Pattern 1:

The first pattern we want to detect is trunc(binop(Y, ext(X))), because it can be optimized to binop(trunc(Y), X). Here binop matches any binary instruction, for example Add, Sub, Mul, Or, And, etc.; ext means zext (zero-extend an integer to a larger size) or sext (sign-extend an integer to a larger size); and trunc means truncate.

We could count how many times this pattern appears, print the matching instructions, or do something else with them. I would like to print the instructions, so the query looks like this:

SELECT instruction AS "Missing Optimization: trunc (binop (Y, (ext X))) --> binop ((trunc Y), X)"
FROM instructions WHERE m_inst(instruction, m_trunc(m_binop(m_any_inst(), m_zext() || m_sext())));

Notice that the query is almost the pattern itself, with m_ as a prefix and m_zext() || m_sext() in place of ext; a dedicated m_ext matcher could be added in the future.

If you run this query on a simple example like

define i32 @function(ptr %x, ptr %y) {
entry: 
  %load_x = load i8, ptr %x, align 1 
  %load_y = load i32, ptr %y, align 4 
  %ext_X = zext i8 %load_x to i32 
  %add_result = add i32 %load_y, %ext_X 
  %trunc_result = trunc i32 %add_result to i16 
  %ret = zext i16 %trunc_result to i32 
  ret i32 %ret 
}

The result will be the %trunc_result instruction, because its operand is an add whose right-hand side is a zext.

And that is exactly what we want 😋

Pattern 2:

For this and the following patterns, I will explain only the functions that have not been covered yet 😉. Our next pattern is trunc(binop(ext(X), Y)), which can be optimized to binop(X, trunc(Y)). It is very similar to the previous pattern; the only difference is that the operands of binop are swapped. We can detect it in two ways, depending on the output we want.

The first way is to create a new query to detect this pattern and append it to our lint file

SELECT instruction AS "Missing Optimization: trunc (binop (ext X), Y) --> binop (X, (trunc Y))"
FROM instructions WHERE m_inst(instruction, m_trunc(m_binop(m_zext() || m_sext())));

This way we get a specific title for each pattern. If we don’t care about the title, we can replace m_binop with m_c_binop, the commutative version of binop: it matches the binary instruction regardless of the order of its operands, which means we can detect both patterns in one query 🤯.
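To make the commutative idea concrete, here is a minimal sketch that does not use LLVM or LLQL at all. The names Operand, binop_matches and c_binop_matches are hypothetical and only illustrate "try both operand orders"; this is not how LLQL implements m_c_binop.

```rust
// Hypothetical operand kinds, just enough to illustrate the idea.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Operand {
    Ext,   // the result of a zext or sext
    Other, // any other value
}

// A check applied to one operand of a binary instruction.
type Check = fn(Operand) -> bool;

fn is_ext(op: Operand) -> bool {
    op == Operand::Ext
}

fn anything(_op: Operand) -> bool {
    true
}

// Order-sensitive match: lhs must pass `lhs_ok` and rhs must pass `rhs_ok`.
fn binop_matches(lhs: Operand, rhs: Operand, lhs_ok: Check, rhs_ok: Check) -> bool {
    lhs_ok(lhs) && rhs_ok(rhs)
}

// Commutative match: accept the operands in either order.
fn c_binop_matches(lhs: Operand, rhs: Operand, lhs_ok: Check, rhs_ok: Check) -> bool {
    binop_matches(lhs, rhs, lhs_ok, rhs_ok) || binop_matches(rhs, lhs, lhs_ok, rhs_ok)
}
```

With the pattern "any value on the left, ext on the right", binop_matches accepts only (Other, Ext), while c_binop_matches also accepts (Ext, Other), which is why one commutative query can cover both Pattern 1 and Pattern 2.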

Pattern 3:

In this pattern, we want to check for missed constant folding. For example, an instruction like add 2, 4 can easily be replaced by 6 at compile time. The query here is much simpler because there is a matcher for constant integer values:

-- Constant folding
SELECT instruction AS "Missing Optimization: Constants folding"
FROM instructions WHERE m_inst(instruction, m_binop(m_const_int(), m_const_int()));

That means we check for binary instructions whose left and right operands are both constant integers.

I hope you enjoyed the article and the samples. There are tons of matchers supported in LLQL, and the end goal is to support all the matchers in the upstream LLVM pattern matchers.
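As a rough illustration of why this pattern signals a missed optimization, the sketch below folds a binary instruction whose operands are both constants. The Operand, Op and BinOp types are hypothetical stand-ins, not LLVM or LLQL types, and the arithmetic is done on i64 with overflow checks, whereas real LLVM integer arithmetic wraps at the bit width of the type.

```rust
// Hypothetical, simplified operands: a constant integer or a virtual register.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operand {
    Const(i64),
    Reg(u32),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Add,
    Sub,
    Mul,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct BinOp {
    op: Op,
    lhs: Operand,
    rhs: Operand,
}

// Same condition as the query: both sides are constant integers.
fn is_foldable(inst: &BinOp) -> bool {
    matches!((inst.lhs, inst.rhs), (Operand::Const(_), Operand::Const(_)))
}

// Returns the folded constant, or None when an operand is not a constant
// (or when the i64 computation would overflow in this simplified model).
fn fold(inst: &BinOp) -> Option<i64> {
    match (inst.lhs, inst.rhs) {
        (Operand::Const(a), Operand::Const(b)) => match inst.op {
            Op::Add => a.checked_add(b),
            Op::Sub => a.checked_sub(b),
            Op::Mul => a.checked_mul(b),
        },
        _ => None,
    }
}
```

For add 2, 4 this gives Some(6); as soon as one operand is a register, nothing can be folded and the query would not report the instruction either.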

Feel free to join the project as a contributor. There is room for everyone 😉

GitHub - AmrDeveloper/LLQL: LLQL is a tool that allow you to run SQL-like query with Pattern matching functions inspired by LLVM InstCombine Pattern Matchers on LLVM IR/Bitcode files

I am looking forward to your opinion and feedback 😋.

You can find me on: GitHub, LinkedIn, and Twitter.

Enjoy Programming 😋.

How I built a LLVM IR linter using SQL syntax was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

LLQL: Matching patterns in LLVM IR/BC files using SQL query

Amr Hesham — Thu, 31 Oct 2024 19:18:32 GMT

Hello everyone, in the last two months I started contributing to the amazing LLVM project (you can find my patches here). In one interesting patch, the goal was to match a pattern in the LLVM IR and transform it into a more optimized form. To do this work, an LLVM organization member recommended that I look at the InstCombine contributor guide, which can be found here, and I found that LLVM has a very interesting set of pattern-matching functions. For example, if you want to search for a pattern like (A - B) + B, you can write

Value *A, *B;
if (match(V, m_Add(m_Sub(m_Value(A), m_Value(B)), m_Deferred(B)))) {
}

And if you want to search for either (A - B) + B or B + (A - B), you can replace m_Add with m_c_Add; the extra c stands for commutative.

Value *A, *B;
if (match(V, m_c_Add(m_Sub(m_Value(A), m_Value(B)), m_Deferred(B)))) {
}

I finished the patch and it got merged 🥳🥳, so I went back to working on the GitQL query engine. By then I had already created ClangQL, a tool for running SQL queries on your C/C++ code, and one question came to mind: what if I could run SQL queries on LLVM IR and perform pattern matching like in InstCombine 🤔🤔?

The LLQL Project

At the start, I was thinking about how to represent information such as instructions, how to make pattern matching as smooth as it is in LLVM, how to provide clean error messages, and maybe, in the future, how to support operators for those new types.

Representing the LLVM Types and Values

What I want here is to have types that map to LLVM types like i8, i32, pointers, arrays, vectors, etc., and to work with them as if they were primitives in the GitQL query engine.

Thanks to the new GitQL architecture, which allows SDK users to define their own types, values, and functions (you can read the full details of the design in the article GitQL: The data types from the Engine to the SDK), I started to create types that wrap LLVM values. For example, LLVMDataType represents types in LLVM, and I created a value that holds such a type:

pub struct LLVMTypeValue {
    pub llvm_type: LLVMTypeRef,
}

impl Value for LLVMTypeValue {
    ...

    fn data_type(&self) -> Box<dyn DataType> {
        Box::new(LLVMDataType)
    }
}

The same idea is used to represent an LLVMValueRef; the code looks roughly like this

pub struct LLVMInstValue {
    pub llvm_value: LLVMValueRef,
}

impl Value for LLVMInstValue {
    ...

    fn data_type(&self) -> Box<dyn DataType> {
        Box::new(LLVMInstType)
    }
}

After creating the Schema and DataProvider, I can select instructions from IR files like this

SELECT instruction FROM instructions

and got a result like this

│ instruction                      │
╞══════════════════════════════════╡
│   %sub = sub i32 %a, %b          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│   %mull = mul i32 %a, %b         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│   %add = add i32 %sub, %mull     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│   ret i32 %add                   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│   %result = add i32 %a, 0        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤

Now it’s time to find a way to perform pattern matching 🤔.

Building the Pattern Matcher

What I want is to be able to run a query like this

SELECT instruction FROM instructions WHERE m_inst(instruction, m_add(m_sub(), m_mul()))

This matches an Add instruction that has a Sub instruction as its left-hand side and a Mul instruction as its right-hand side. In LLVM IR it can look like this

%sub = sub i32 %a, %b
%mull = mul i32 %a, %b
%add = add i32 %sub, %mull  <------- Like this add

So my idea was to think of this Add as a node with two children, each of which is another node. What I want is a tree of patterns in which each node either matches or does not match the corresponding node in the real IR tree.

So I created new types to represent the matchers, which ensures that the user can only pass the correct type to my new matcher functions. I called them InstMatcherType for instruction matchers and TypeMatcherType for data type matchers. Then I created values for them, along with matcher functions that build and connect the nodes into a full pattern, which m_inst then matches against an instruction. The matcher node looks like this

pub trait InstMatcher: DynClone {
    fn is_match(&self, instruction: LLVMValueRef) -> bool;
}
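To make the tree idea concrete, here is a minimal, self-contained sketch that does not depend on LLVM. Inst, Pattern and is_match are hypothetical stand-ins for LLVMValueRef and the InstMatcher trait; the real matchers in LLQL are trait objects, not an enum.

```rust
// Toy stand-in for LLVM instructions; the real tool walks LLVMValueRef.
#[derive(Debug)]
enum Inst {
    Arg(String), // a function argument such as %a
    Add(Box<Inst>, Box<Inst>),
    Sub(Box<Inst>, Box<Inst>),
    Mul(Box<Inst>, Box<Inst>),
}

// A pattern node: either "anything" or an opcode with two child patterns.
enum Pattern {
    Any,
    Add(Box<Pattern>, Box<Pattern>),
    Sub(Box<Pattern>, Box<Pattern>),
    Mul(Box<Pattern>, Box<Pattern>),
}

// Returns true when the instruction tree has the shape described by the pattern.
fn is_match(pattern: &Pattern, inst: &Inst) -> bool {
    match (pattern, inst) {
        (Pattern::Any, _) => true,
        (Pattern::Add(pl, pr), Inst::Add(l, r))
        | (Pattern::Sub(pl, pr), Inst::Sub(l, r))
        | (Pattern::Mul(pl, pr), Inst::Mul(l, r)) => is_match(pl, l) && is_match(pr, r),
        _ => false,
    }
}

fn arg(name: &str) -> Box<Inst> {
    Box::new(Inst::Arg(name.to_string()))
}

// %add = add (sub %a, %b), (mul %a, %b)
fn sample_add() -> Inst {
    Inst::Add(
        Box::new(Inst::Sub(arg("a"), arg("b"))),
        Box::new(Inst::Mul(arg("a"), arg("b"))),
    )
}

// Rough equivalent of m_add(m_sub(), m_mul()): the children of sub and mul are unconstrained.
fn add_of_sub_and_mul() -> Pattern {
    Pattern::Add(
        Box::new(Pattern::Sub(Box::new(Pattern::Any), Box::new(Pattern::Any))),
        Box::new(Pattern::Mul(Box::new(Pattern::Any), Box::new(Pattern::Any))),
    )
}
```

Calling is_match(&add_of_sub_and_mul(), &sample_add()) returns true, while the same pattern against a plain mul returns false because the root opcodes differ.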

And now I can create BinaryMatchers, which also match their two children, UnaryMatchers, etc.

The first function I created was m_inst, which takes an instruction and a pattern and returns true if the instruction matches the pattern. Then I created the m_add function, which takes two optional InstMatchers for the left and right sides, so the implementation looks like this

// Each m_* function builds a matcher node from its optional left and right child matchers.
fn match_add_inst(values: &[Box<dyn Value>]) -> Box<dyn Value> {
    let (lhs_matcher, rhs_matcher) = binary_matchers_sides(values);
    let matcher = ArithmeticInstMatcher::create_add(lhs_matcher, rhs_matcher);
    Box::new(InstMatcherValue { matcher })
}

fn match_sub_inst(values: &[Box<dyn Value>]) -> Box<dyn Value> {
    let (lhs_matcher, rhs_matcher) = binary_matchers_sides(values);
    let matcher = ArithmeticInstMatcher::create_sub(lhs_matcher, rhs_matcher);
    Box::new(InstMatcherValue { matcher })
}

fn match_mul_inst(values: &[Box<dyn Value>]) -> Box<dyn Value> {
    let (lhs_matcher, rhs_matcher) = binary_matchers_sides(values);
    let matcher = ArithmeticInstMatcher::create_mul(lhs_matcher, rhs_matcher);
    Box::new(InstMatcherValue { matcher })
}

and the signature is like

Signature {
    parameters: vec![
        Box::new(OptionType {
            base: Some(Box::new(InstMatcherType)),
        }),
        Box::new(OptionType {
            base: Some(Box::new(InstMatcherType)),
        }),
    ],
    return_type: Box::new(InstMatcherType),
}

That means we can call it with zero arguments to match any Add instruction, for example

SELECT instruction FROM instructions WHERE m_inst(instruction, m_add())

Or with arguments, to match the add and then match its children too

SELECT instruction FROM instructions WHERE m_inst(instruction, m_add(m_sub(), m_mul()))
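One way to picture how optional arguments can work is to pad the missing sides with a matcher that accepts anything. The sketch below is an assumption about the idea only: Matcher and binary_sides are hypothetical names, and the real binary_matchers_sides helper in LLQL may behave differently.

```rust
// Hypothetical matcher description: accept anything, or require a given opcode.
#[derive(Debug, Clone, PartialEq)]
enum Matcher {
    Any,
    Opcode(&'static str),
}

// Pads the argument list with `Any`:
// no arguments -> (Any, Any), one argument -> (arg, Any), two -> (lhs, rhs).
fn binary_sides(args: &[Matcher]) -> (Matcher, Matcher) {
    let lhs = args.get(0).cloned().unwrap_or(Matcher::Any);
    let rhs = args.get(1).cloned().unwrap_or(Matcher::Any);
    (lhs, rhs)
}
```

Under this assumption, m_add() behaves like m_add(any, any), and a call with a single argument constrains only the left-hand side.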

You can query how many times this pattern exists in each function

SELECT function_name, count() FROM instructions WHERE m_inst(instruction, m_add(m_sub(), m_mul())) GROUP BY function_name

And now you can perform pattern matching against LLVM IR or BC instructions.

What is next?

Currently, you can build your patterns from Arithmetic, ICMP, and FCMP matchers. There are a lot of other matchers that we can support, and we can perform deeper analysis. Feel free to write feedback, report issues, or fix bugs; everyone is most welcome.

The project is free and open source. You can find the LLQL repository on GitHub, and don’t forget to give it a star 🌟

I am looking forward to your opinion and feedback 😋.

I hope you enjoyed my article.

You can find me on: GitHub, LinkedIn, and Twitter.

Enjoy Programming 😋.

GitQL: The data types from the Engine to the SDK

Amr Hesham — Thu, 31 Oct 2024 19:18:29 GMT

Hello everyone, in the last few months the GitQL project has grown and gained a lot of useful features, and some tools, like FileQL and ClangQL, are now built using its SDK. We have reached the point where you can query any kind of data by defining a data Schema and Provider and mapping the data to GitQL built-in types like Integer, Text, Date, Array, etc. You can also extend the standard library by defining your own functions. But what if you want to create a new custom data type 🤔🤔🤔? And why would you need one?

The benefit of having custom data types

Imagine you want to create a tool to run SQL queries on matrices. You can’t write a query like this unless the Engine supports a Matrix type.

SELECT matrix1 * matrix2;

Or, when creating a function that does some calculation on the matrix, what parameter will you use?

SELECT perform_on_matrix(matrix1);

Maybe you think, okay, let’s make the function take Text as a parameter, and we can convert the matrix to a String and pass it. The function constructs the matrix, does the calculation, and returns it as a string again, but what if the end-user passes any string to the function 🤔.

SELECT perform_on_matrix("Hello, World!");

In this case, if you have a type called Matrix, you can easily make sure that the user can only pass a value with Matrix type, or he will get an error message that Function {} expects the type Matrix but got Text. And if there is a way to do operator overloading, you can support using the + operator between two Matrices.

In Programming languages, you can create your custom type using Struct or Class to define the structure, and in some languages like C++, you can overload operators for it, too, but how can we get the same result in GitQL using SQL? How can we create a type that can represent complex values like Audio files, Images, Tree data structure of specific values …etc?

Some database engines, like PostgreSQL, allow the user to define a type composed of other defined types and to overload operators for it using SQL. The goal of the GitQL SDK, however, is to help you easily build your own domain-specific query engine, so I also want to be able to define new types that may not be composed of primitives, like Audio, Video, Image, the Abstract Syntax Tree of a language, or even an Assembly instruction as a type. But how 🤔?

I was inspired by Chris Lattner’s design in Swift, and currently in Mojo, which moves all types, even primitives, from the compiler into the standard library. I found this idea very good: if I could implement it in GitQL and move the types to the SDK, any SDK user could use an interface to define custom types and overload operators for them.

Moving the Types from Engine to SDK

So I moved the old types from the Engine level to the SDK level. Now the Parser, TypeChecker, and Engine deal with an abstract type as an interface; they don’t know what the type represents, but they know what it can do. For example, if I want a type to work with the * operator, I can do this

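impl DataType for MatrixType {
    // Types that are allowed on the right-hand side of `*`.
    fn can_perform_mul_op_with(&self) -> Vec<Box<dyn DataType>> {
        vec![Box::new(MatrixType)]
    }

    // Type of the result of `self * other`.
    fn mul_op_result_type(&self, _other: &Box<dyn DataType>) -> Box<dyn DataType> {
        Box::new(MatrixType)
    }
}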

This means that you can perform the * operator between two Matrices, and the second function defines that the expected type from this operation will be MatrixType. There are similar functions for each operator.

Now, when the parser and type checker find an expression like matrix1 * matrix2, they will call can_perform_mul_op_with to check if this operation is valid or not, and if valid, they will call mul_op_result_type to get the result type from this operation.
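The check described above can be sketched without the GitQL crates. Everything here is a simplified assumption: this DataType trait has only three methods, types are compared by a hypothetical name() method, and check_mul stands in for what the real type checker does.

```rust
// Simplified stand-in for the SDK's DataType trait (the real trait has many more methods).
trait DataType {
    fn name(&self) -> &'static str;

    // Types accepted on the right-hand side of `*`; none by default.
    fn can_perform_mul_op_with(&self) -> Vec<Box<dyn DataType>> {
        Vec::new()
    }

    // Result type of `self * other`; only called after the check above succeeds.
    fn mul_op_result_type(&self, other: &dyn DataType) -> Box<dyn DataType>;
}

struct MatrixType;
struct TextType;

impl DataType for MatrixType {
    fn name(&self) -> &'static str {
        "Matrix"
    }

    fn can_perform_mul_op_with(&self) -> Vec<Box<dyn DataType>> {
        vec![Box::new(MatrixType)]
    }

    fn mul_op_result_type(&self, _other: &dyn DataType) -> Box<dyn DataType> {
        Box::new(MatrixType)
    }
}

impl DataType for TextType {
    fn name(&self) -> &'static str {
        "Text"
    }

    // Never reached in this sketch: Text accepts no right-hand side for `*`.
    fn mul_op_result_type(&self, _other: &dyn DataType) -> Box<dyn DataType> {
        Box::new(TextType)
    }
}

// What a type checker could do for `lhs * rhs`: validate first, then ask for the result type.
fn check_mul(lhs: &dyn DataType, rhs: &dyn DataType) -> Result<Box<dyn DataType>, String> {
    let accepted = lhs.can_perform_mul_op_with();
    if accepted.iter().any(|t| t.name() == rhs.name()) {
        Ok(lhs.mul_op_result_type(rhs))
    } else {
        Err(format!(
            "Operator `*` can't be performed between {} and {}",
            lhs.name(),
            rhs.name()
        ))
    }
}
```

Here check_mul(&MatrixType, &MatrixType) yields a Matrix result type, while mixing Matrix and Text is rejected before any value is ever multiplied.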

Similar to a custom type, we can define custom Values too, for example

pub struct MatrixValue {
    pub matrix: Matrix,
}

impl Value for MatrixValue {
    fn mul_op(&self, other: &Box<dyn Value>) -> Result<Box<dyn Value>, String> {
        // Multiplication is only defined when the right-hand side is also a MatrixValue.
        if let Some(other_matrix) = other.as_any().downcast_ref::<MatrixValue>() {
            let matrix = self.matrix.multiply(&other_matrix.matrix);
            return Ok(Box::new(MatrixValue { matrix }));
        }
        Err("Unexpected type to perform `*` with".to_string())
    }
}
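For readers who want to try the downcast approach in isolation, here is a self-contained sketch. The Value trait below is a two-method simplification of the SDK trait, and Matrix (a fixed 2x2 matrix), its multiply method and TextValue are hypothetical helpers invented for the example.

```rust
use std::any::Any;

// Simplified stand-in for the SDK's Value trait.
trait Value {
    fn as_any(&self) -> &dyn Any;
    fn mul_op(&self, other: &Box<dyn Value>) -> Result<Box<dyn Value>, String>;
}

// Hypothetical 2x2 integer matrix, just enough to multiply something.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Matrix([[i64; 2]; 2]);

impl Matrix {
    fn multiply(&self, other: &Matrix) -> Matrix {
        let mut out = [[0i64; 2]; 2];
        for i in 0..2 {
            for j in 0..2 {
                for k in 0..2 {
                    out[i][j] += self.0[i][k] * other.0[k][j];
                }
            }
        }
        Matrix(out)
    }
}

struct MatrixValue {
    matrix: Matrix,
}

struct TextValue;

impl Value for MatrixValue {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn mul_op(&self, other: &Box<dyn Value>) -> Result<Box<dyn Value>, String> {
        // Recover the concrete type behind the trait object before multiplying.
        if let Some(other_matrix) = other.as_any().downcast_ref::<MatrixValue>() {
            let matrix = self.matrix.multiply(&other_matrix.matrix);
            return Ok(Box::new(MatrixValue { matrix }));
        }
        Err("Unexpected type to perform `*` with".to_string())
    }
}

impl Value for TextValue {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn mul_op(&self, _other: &Box<dyn Value>) -> Result<Box<dyn Value>, String> {
        Err("Text does not support `*`".to_string())
    }
}
```

Multiplying two boxed MatrixValue instances returns a new boxed MatrixValue, while passing a TextValue on the right-hand side falls through to the Err branch.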

Now you can create custom functions that take MatrixType as a parameter or return type to end up with a query like this

SELECT create_matrix([1, 2], [3, 4]) * create_matrix([4, 5], [7, 8])

With this new architecture, I built a new tool called LLQL, which allows users to run SQL queries on LLVM IR or Bitcode. The same idea could be applied to Java bytecode, Assembly, or even machine code. Here is a real example from the LLQL readme.

Imagine we want to check whether there is an Add instruction that has a Sub instruction as its left-hand side and a Mul instruction as its right-hand side

define i32 @function(i32 %a, i32 %b) {
  %sub = sub i32 %a, %b
  %mull = mul i32 %a, %b
  %add = add i32 %sub, %mull    <----- Like this Add(Sub, Mul)
  ret i32 %add
}

You can easily search for this pattern using this SQL query

SELECT instruction FROM instructions WHERE m_inst(instruction, m_add(m_sub(), m_mul()))

In that project, the instruction column has InstructionType, while m_add, m_sub, and m_mul return InstMatcherType. Both are custom types defined using the SDK, without modifying the GitQL engine itself.

You can read more about LLQL design and implementation from this article: LLQL: Matching patterns in LLVM IR/BC files using SQL query

You can find the full detailed documentation on GitQL website.

I am looking forward to your opinion and feedback 😋.

I hope you enjoyed my article.

You can find me on: GitHub, LinkedIn, and Twitter.

Enjoy Programming 😋.

Book Summary: Clang Compiler Frontend

Amr Hesham — Tue, 14 May 2024 10:54:21 GMT

Hello everyone, I am Amr, a software engineer interested in compiler and tools development. In this article, I will try to summarize the Clang Compiler Frontend book so you can get an idea of its content and what to expect from it.

The book contains two parts and two appendices:

Part 1: Clang Setup and Architecture
Part 2: Clang Tools
Appendices: Compilation Database and Build Speed Optimization

Part 1: Clang Setup and Architecture

The first part starts with configuring and building the Clang project with CMake and Ninja. It then moves to the architecture of the Clang compiler, from the Lexer and Parser to CodeGen, with a good level of detail on how each part works, how the preprocessor works, why the C++ parser needs to depend on Sema (semantic analysis), and what a FrontendAction is and how to use it.

I already knew from a previous talk that the C++ parser needs information from Sema to parse code successfully, but I like that the author provides much deeper details and uses LLDB to show what the program is doing at each moment and which component runs first.

After that, the author goes deeper to explain the design of the AST (Abstract Syntax Tree), how it is structured, and how to traverse it in more than one way. This chapter was very important for me in optimizing AST traversal in my side project ClangQL: if, for example, you want only class information but let the visitor recurse into every node, you will get bad performance, because it will traverse fields, methods, etc. that you don’t need right now.

The last chapter in this part introduces the basic internal libraries and tools. For example, the LLVM team has its own testing framework that is used alongside the Google Test framework (GTest), and it has also implemented container data structures similar to the standard library but optimized for some common cases, such as certain String operations, Small Vector, etc.

Part 2: Clang tools

The second part moves from the internal architecture of Clang to the Clang tools. It starts with the internal design of the linter, Clang-Tidy, and how to configure and use it, with a practical example of how to create your own custom check that estimates the complexity of a C++ class based on the number of methods it contains and reports an error through the Diagnostic engine if the complexity exceeds a parameter.

The next chapter introduces advanced code analysis tools based on the Control Flow Graph (CFG), including how to construct a CFG and write your own CFG checks.

The following chapter teaches you how to create your own code modification tool for refactorings such as renaming methods, how to integrate it with the linter to provide a fix for a check, how to configure and use the Clang-Format tool, and how to format your custom modifications.

The last chapter in this part is about IDE integration: how the Clangd project provides many useful features, such as code completion, go to definition, type hints, linting, and code formatting, to your IDE or code editor using the Language Server Protocol. It has enough detail that you can imagine what happens under the hood from the moment a file is opened until you close it, how the code editor or IDE communicates with Clangd, and which optimization tricks Clangd uses to be fast.

Appendix: Compilation Database

In the first appendix, the author talks about the compilation database (CDB): what it is, how to use it in a large project, and how it is used by other tools such as Clang-Tidy and Clangd.

Appendix: Build Speed Optimization

This appendix teaches you how to use Clang features such as module maps, precompiled headers, and modules to optimize your build speed.

Summary

This book was very useful and exciting for me to read because I am already interested in learning more about Clang and LLVM internals. With this information, you can work with tools such as the linter and formatter with a good understanding of how they work, and you can also implement your own tools. Maybe you want to create some checks for your personal or company work, or generate some information from the AST; for example, the ClangQL project extracts AST information so you can run SQL queries on it.

I also like how the author uses the debugger to show you exactly what happens step by step.

I hope you enjoyed my article.

You can sponsor my work using GitHub Sponsor ❤️ from here.

You can find me on: GitHub, LinkedIn, and Twitter.

Thanks for reading and enjoy Programming 😋.

Book Summary: Mastering the Java Virtual Machine An in-depth guide to JVM internals and performance…

Amr Hesham — Fri, 19 Apr 2024 12:36:29 GMT

Book Summary: Mastering the Java Virtual Machine An in-depth guide to JVM internals and performance optimization

Hello everyone, I am Amr, a software engineer interested in compiler and tools development. In this article, I will try to summarize the Mastering the Java Virtual Machine book so you can get an idea of its content and what to expect from it.

The book contains four parts:

Understanding the JVM
Memory Management and Execution
Alternative JVMs
Advanced Java Topics

Part 1: Understanding the JVM

The author starts this part by taking you on a journey through why the JVM exists, what its advantages are, and how it represents your code as bytecode. It teaches you how to map your code to bytecode and how to read and understand the JVM class file structure.

It then explains the most common bytecode instructions: arithmetic and comparison operations, shifts, bitwise operations, conditions such as equal and not equal for primitives and references, the different kinds of method calls, how a value is converted from one type to another (casting), and how the JVM manipulates objects (set and get). In my opinion, the advantage here is that the author provides many practical examples and code snippets, so after this part you can read and understand bytecode for basic cases.

Part 2: Memory Management and Execution

This part is very important for a developer who cares about optimizing an application. It takes you on a journey into the internals of the JVM and how it deals with your code as a VM: variables, operators, the stack, the heap, methods, and native methods, as well as how the JVM can give you more performance using the JIT (just-in-time) compiler. This information is very useful; for example, when you face a StackOverflowError in your recursive methods, you now know exactly what it means and why the stack overflows!

Next, it explains the different garbage collection algorithms: how they work, what the differences between them are, how to configure the GC for your application, what the recommended configurations are, and how to select the right algorithm for your application.

Part 3: Alternative JVMs

This part introduces GraalVM and explains its use cases, comparing it with the JVM to make its advantages and disadvantages clear with real use cases. It also shows how you can create a native image of your application that can be launched without a VM installed on the machine.

Next, it takes you on a journey through the alternative JVM implementations that exist. As we know from the previous part, changing the garbage collector algorithm can lead to different performance and pause times; here you will learn that a different JVM implementation also gives you different trade-offs, and you will learn the differences between the six most popular JVM implementations (Eclipse OpenJ9, Amazon Corretto, Azul Zulu and Zing, IBM Semeru, and Eclipse Temurin).

Part 4: Advanced Java Topics

This part covers advanced topics in Java, starting with metadata and trade-offs in frameworks and then moving to the topics most interesting to me (as a tools developer).

It starts with reflection and Dynamic Proxy: why they are important and what the trade-offs of using them are, with interesting practical examples of both.

Next, it takes you to the Annotation Processor: how to take advantage of compile-time annotations to validate and generate code, why this is different from runtime reflection, and how to recreate the previous example with an annotation processor.

I was already interested in building tools using annotation processors, such as the EasyAdapter library, which helps you build UI adapters easily using only annotations. But this was the first time I learned about Dynamic Proxy, and after some searching I found that many Android libraries are built using it.

Summary

The book was very interesting to me, and I learned a lot about GC and how the JVM works at a very low level. I recommend searching for examples of each concept, installing the ASM library in a new Java project, and playing around with bytecode while reading the first two parts.

I hope you enjoyed my article.

You can find me on: GitHub, LinkedIn, and Twitter.

You can sponsor my work using GitHub Sponsor ❤️ from here.

Thanks for reading and enjoy Programming 😋.

Book Summary: Learn LLVM 17

Amr Hesham — Tue, 20 Feb 2024 18:13:57 GMT

Hello everyone, I am Amr, a software engineer interested in compiler and tools development. In this article, I will try to summarize the Learn LLVM 17 book so you can get an idea of its content and what to expect from it.

The book contains four parts:

The Basics of Compiler Construction with LLVM
From Source to Machine Code Generation
Taking LLVM to the Next Level
Roll Your Own Backend

Part 1: The Basics of Compiler Construction with LLVM

The authors start this part with a step-by-step guide to installing LLVM and its required dependencies on your system, then give you an overview of the common structure of a compiler and the goal of every pass.

After the overview, the authors start the first project, a simple arithmetic expression language. This project teaches you how to define and read a grammar, build a handwritten lexer and parser, and convert source code to an Abstract Syntax Tree (AST). It then moves to the semantic analysis pass, which validates this tree and makes sure the program is valid, for example by checking that types and identifiers are valid and declared before use. Finally comes the code generation pass, which generates LLVM IR from the AST, supports the language with a simple runtime library written in C, and emits calls to it from the generated code.

Part 2: From Source to Machine Code Generation

In this part, the authors start the second project, a compiler for a subset of the Modula-2 programming language. This language supports types, generics, and object-oriented programming. Here is an example from the book.

MODULE Gcd;

PROCEDURE GCD(a, b: INTEGER) : INTEGER;
VAR t: INTEGER;
BEGIN
 IF b = 0 THEN
  RETURN a;
 END;
 WHILE b # 0 DO
  t := a MOD b;
  a := b;
  b := t;
 END;
 RETURN a;
END GCD;

END Gcd.

The authors start by defining the grammar of the language, which is now similar to a real language rather than just arithmetic expressions, and show you the structure of the project. They then prepare file and diagnostics management and build the lexer, the parser, and the AST, which are similar to those of the arithmetic expression language but of course a lot bigger.

They then move to semantic analysis, which now performs more checks, such as making sure that variable or object names are unique in the current scope and not declared twice, resolving the types of constants without declaring them, and making sure that the value in an assignment matches the type of the variable.

In the next chapters of this part, the authors teach you about LLVM IR and SSA by generating IR from C code using the Clang compiler (I also recommend using the Compiler Explorer website), and then start to generate LLVM IR from the program AST and implement optimization passes.

Part 3: Taking LLVM to the Next Level

In this part, the authors teach you about TableGen, the LLVM DSL; the JIT (just-in-time) compiler; how to use LLVM tools for detecting issues, debugging, static analysis, and profiling; and how to write your own tool using LibClang. In my opinion, learning how to use and create tools with LibClang is very important: you can create a lot of good tools with it, or use it to parse C code into an AST so that your language can call C code and vice versa.

Part 4: Roll Your Own Backend

In this part, the authors teach you how to add a new backend target for a CPU architecture not supported by LLVM.

Summary

The book is very good. I liked that the authors start with a simple language first and then move to a subset of a real-world language (Modula-2) with features that you see in most languages, and with well-structured code and compiler components rather than toy ones.

In my opinion, the book is good in both cases: whether you want to start learning about compilers, or you have read some books before and are interested in learning how to create a compiler using the LLVM infrastructure.

I hope you enjoyed my article.

You can find me on: GitHub, LinkedIn, and Twitter.

Enjoy Programming 😋.

The difference between infix function and operator in Kotlin

Amr Hesham — Fri, 07 Jul 2023 13:22:55 GMT

Hello everyone. Many of us use the infix functions that come with the Kotlin standard library as a replacement for operators. Since they have the inline keyword and give us the same runtime performance as operators, you may think of them as just named versions of operators. But they are not exactly the same as operators, and today we will see why!

To help us understand the difference, let’s first talk about how an expression is evaluated in programming languages in general.

Expressions Evaluation

In a programming language, each operator has a precedence in parsing. For example, * and / have higher precedence than + and -, which is why an expression like 2 * 2 + 2 evaluates to 6, not 8: 2 * 2 is evaluated first.

But what if two expressions have the same precedence? Which one will be evaluated first?

In this case, they will be evaluated from left to right; for example, 1 + 2 + 3 will evaluate 1 + 2 first, then 3 + 3.

Now let's get back to our main question about infix functions vs operators.

Infix functions vs operators

An infix function is just syntactic sugar for a member or extension function with a single parameter that can be called like an operator between two expressions. For example, a function b is actually a.function(b), and from the compiler's perspective the infix function name is just a name like any other function. So if you name it and or plus, that doesn't mean the compiler will treat it like the && or + operator. This is the main difference, and forgetting it can cause big problems. Take this example.

true || false && false
true or false and false

They look almost the same, but once you evaluate them you will find that they have a different result. Let’s see how they are evaluated.

The first expression

true || false && false

In this expression, the and (&&) operator has higher precedence than the or (||) operator, so it is evaluated first: false && false evaluates to false. Now our expression looks like true || false, which finally evaluates to true.

The second expression

true or false and false

This expression has no operators. It has two infix functions, and from the compiler's perspective they have the same precedence, so they are evaluated from left to right. First, true or false evaluates to true, so our expression becomes true and false, which evaluates to false.
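The two groupings can be spelled out with ordinary function calls. The sketch below is written in C++ rather than Kotlin, and or_fn and and_fn are hypothetical stand-ins for the or and and infix functions; it only illustrates the order in which the calls are nested.

```cpp
// Stand-ins for Kotlin's `or` and `and` infix functions (hypothetical names).
constexpr bool or_fn(bool a, bool b) { return a || b; }
constexpr bool and_fn(bool a, bool b) { return a && b; }

// Operators: && binds tighter than ||, so the grouping is
// true || (false && false).
constexpr bool with_operators = true || (false && false);

// Infix functions: same precedence, grouped left to right,
// so the nesting is (true or false) and false.
constexpr bool with_infix_functions = and_fn(or_fn(true, false), false);

static_assert(with_operators, "the operator version evaluates to true");
static_assert(!with_infix_functions, "the infix version evaluates to false");
```

Both static_assert lines are checked at compile time, so the different results are verified without running anything.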

Conclusion

When you want to use infix functions, make sure you know exactly which part will be evaluated first. You can control that with a grouping expression (..), because an expression inside parentheses always has the highest precedence and will be evaluated first.

I hope you enjoyed my article. You can find me on: GitHub, LinkedIn, Twitter.

Enjoy Programming 😋.

The difference between infix function and operator in Kotlin was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

How i created a query language for .git files (GQL)

Amr Hesham — Thu, 08 Jun 2023 15:04:09 GMT

I created a query language for .git files (GQL)

Hello everyone. Last month I got interested in the Rust programming language and wanted to discover more about it. So I started to learn the basics and to look at open source projects written in Rust. I also created one PR in the rust-analyzer project; it did not depend on my knowledge of Rust but on my general knowledge of compilers and static analysis. As usual, I love to learn new things by creating new projects around ideas that I am interested in.

The idea

I started to think about small ideas that I love to use, for example, a faster search CLI or some utility apps. But then I got a new cool idea.

While reading the Building Git book (a book about building Git from scratch), I learned what each file inside the .git folder does and how Git stores commits, branches and other data and manages its own database. So what if we had a query language that runs on those files?

The Git Query Language (GQL)

I decided to implement this query language, and I named it GQL. I was very excited to start this project because it was my first time implementing a query language. I decided to implement it from scratch, rather than converting .git files into an SQLite database and running normal SQL queries. And I thought it would be cool if, in the future, I could use the GQL engine as part of a Git client or analyzer.

The implementation of GQL

The goal is to implement it in two parts. The first one converts the GQL query into an AST of nodes. The second one is the engine, which receives the AST and walks it to execute the query as an interpreter; in the future, the AST could instead be compiled into bytecode instructions for a GQL virtual machine.

The engine deals with .git files through the git2 crate (the Rust bindings for libgit2), so it can perform selecting, updating and deleting tasks, and it stores the selected data in a data structure so we can perform filtering or sorting.

To simplify this implementation, I created a struct called GQLObject that can represent a commit, branch, tag or any other object in this engine. It also makes it easy to perform sorting, searching, and filtering with single functions that deal with this one type.

use std::collections::HashMap;

pub struct GQLObject {
  pub attributes: HashMap<String, String>,
}

The GQLObject is just a map with strings as keys and values, so it is general enough to hold the info of any type. Features like comparisons, filtering or sorting can now be implemented easily on this string map.
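To show why a plain string map makes filtering and sorting easy, here is a minimal sketch of the same idea in C++. The project itself is written in Rust; the Object struct and the filter_equals and order_by helpers below are hypothetical names, not part of GQL.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical C++ counterpart of GQLObject: every field is a string.
struct Object {
  std::unordered_map<std::string, std::string> attributes;
};

// Keep only the objects whose `field` equals `value` (like a WHERE clause).
std::vector<Object> filter_equals(const std::vector<Object> &rows,
                                  const std::string &field,
                                  const std::string &value) {
  assert(!field.empty() && "a field name is required");
  std::vector<Object> out;
  for (const Object &row : rows) {
    auto it = row.attributes.find(field);
    if (it != row.attributes.end() && it->second == value) {
      out.push_back(row);
    }
  }
  return out;
}

// Sort objects by the string value of `field` (like an ORDER BY clause).
// Objects that lack the field sort first, as if the value were empty.
void order_by(std::vector<Object> &rows, const std::string &field) {
  auto key = [&field](const Object &object) {
    auto it = object.attributes.find(field);
    return it == object.attributes.end() ? std::string() : it->second;
  };
  std::stable_sort(rows.begin(), rows.end(),
                   [&key](const Object &a, const Object &b) {
                     return key(a) < key(b);
                   });
}
```

Because every value is a string, the same two helpers work for commits, branches and tags alike. The trade-off is that comparisons are textual, so numeric or date fields would need parsing before they can be ordered correctly.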

The current state

Over the last week, I implemented the selecting feature with conditions, filtering and sorting with optional limit and offset so you can write queries like this

select * from commits
select name, email, title, message, time from commits
select * from commits order by name limit 1
select * from commits order by name limit 1 offset 1
select * from branches where ishead = "true"
select * from tags where name contains = "2023"

Version 0.1.0 and 0.2.0 Updates

After publishing this article and sharing the project, I got amazing feedback from many people, along with feature requests, so I started to implement many of them with the goal of supporting more SQL features.

For example, we now have GROUP BY, aggregation functions and column name aliases, so you can write more advanced queries, such as selecting the names of the top n contributors and their number of commits:

SELECT name, count(name) AS commit_num FROM commits GROUP BY name ORDER BY commit_num DESC LIMIT 10

The next step

Now the next step is to optimize the code and start to support more features. For example, imagine a query for deleting all branches except master:

delete * from branches where name != "master"

Or pushing all or some branches to a remote repository using a single query, or maybe grouping and analyzing how many commits each user made this month, and many other things.

The GQL project is free and open source, so everyone is most welcome to contribute, suggest features or report bugs.

GitHub - AmrDeveloper/GQL: Git Query Language (GQL)

I am looking forward to your opinion and feedback 😋.

I hope you enjoyed my article. You can find me on: GitHub, LinkedIn, and Twitter.

Enjoy Programming 😋.

How i created a query language for .git files (GQL) was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to create a C/C++ Static Code analysis tool

Amr Hesham — Fri, 05 May 2023 12:57:14 GMT

Hello everyone. As software engineers, we find static analysis a very useful tool in our day-to-day work. In this article, we will learn how it works and create our own tool, for educational purposes, to analyse C/C++ code according to our own rules.

But first, What is Static Code Analysis?

As the name suggests, it is a tool that analyses source code files without executing them; that is the meaning of static. It can find issues such as syntax errors, security vulnerabilities, coding standard violations, undefined values and other types of violations.

You can check the list of static analysis tools from Wikipedia for almost all popular languages.

How does Static Code Analysis actually work?

Your source code files, whatever they are (C, C++, Java or any other language), are just text files, and it is hard to perform analysis on this code as pure text. So we need to put it in a data structure that is easy to traverse as nodes, such as classes, functions, and variables. But how can we do that?

We can use the same technique that is used in compiling source code. Because the compiler also has the same requirement to traverse code and validate it before converting it to the executable file. This technique is called parsing. It reads the source file as text, and converts it to a tree data structure called Abstract syntax tree (AST). It holds the information about code and the position of every node in source files depending on the grammar of the language.

Now suppose that we created an AST from code, and we want to check that no function name contains _. How can we do that?

Step 1: Traverse the tree and find all nodes that represent functions.
Step 2: For each node, get the function name as a string.
Step 3: If the name contains _, report an error with the position of this function.
Note: Remember that the AST holds the position too.

If you are confused don’t worry, we will do those steps in Code soon.
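Before we reach the Clang version, the three steps above can be sketched on a toy tree. This is only an illustration: the Node and Report types and the check_function_names function are hypothetical, and a real AST such as Clang's has far more node kinds and richer position information.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy AST node (hypothetical); a real AST has many more node kinds.
struct Node {
  enum class Kind { TranslationUnit, Class, Function, Variable };
  Kind kind;
  std::string name;
  int line;    // 1-based position of the name in the source file
  int column;
  std::vector<Node> children;
};

// One reported violation: the offending name and where it was found.
struct Report {
  std::string name;
  int line;
  int column;
};

// Step 1: visit every node. Step 2: read the name of each function node.
// Step 3: record the position when the name contains '_'.
void check_function_names(const Node &node, std::vector<Report> &reports) {
  if (node.kind == Node::Kind::Function &&
      node.name.find('_') != std::string::npos) {
    reports.push_back(Report{node.name, node.line, node.column});
  }
  for (const Node &child : node.children) {
    check_function_names(child, reports);
  }
}

// Usage: a tree for `void function_name() {}` followed by `int main() {}`.
void example() {
  Node unit{Node::Kind::TranslationUnit, "", 0, 0, {}};
  unit.children.push_back(Node{Node::Kind::Function, "function_name", 1, 6, {}});
  unit.children.push_back(Node{Node::Kind::Function, "main", 2, 5, {}});

  std::vector<Report> reports;
  check_function_names(unit, reports);
  assert(reports.size() == 1);
  assert(reports[0].line == 1 && reports[0].column == 6);
}
```

The recursion is the whole traversal: each call handles one node and then visits its children, which is the same idea a visitor applies to a real AST.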

How can we create our C/C++ Static Code Analysis?

Our first task is to parse C/C++ source files into an AST. There are three ways we can do this.

The first solution is to use tools that take a grammar and generate the parser code for us. These tools are called parser generators. Examples are ANTLR, Bison, JavaCC and many others; you can check the list on Wikipedia.

The second solution is to write our parser code from scratch to read text, check the language grammar and build the nodes of the AST; this is called a hand-written parser. The advantage of this solution over the first one is that you can provide more helpful error messages and get better performance.
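To give a feel for what a hand-written parser looks like, here is a minimal recursive descent sketch for a tiny arithmetic grammar. It is an illustration under simplifying assumptions: the grammar and the Parser class are hypothetical, and the parser evaluates the expression directly instead of building AST nodes, to keep the example short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Grammar (hypothetical, for illustration only):
//   expr   := term   ('+' term)*
//   term   := factor ('*' factor)*
//   factor := NUMBER | '(' expr ')'
class Parser {
public:
  explicit Parser(std::string text) : text_(std::move(text)) {}

  // Parse the whole input and return the value of the expression.
  long parse() {
    long value = expr();
    if (peek() != '\0') fail("unexpected trailing input");
    return value;
  }

private:
  long expr() {
    long value = term();
    while (peek() == '+') { ++pos_; value += term(); }
    return value;
  }

  long term() {
    long value = factor();
    while (peek() == '*') { ++pos_; value *= factor(); }
    return value;
  }

  long factor() {
    if (peek() == '(') {
      ++pos_;
      long value = expr();
      if (peek() != ')') fail("expected ')'");
      ++pos_;
      return value;
    }
    if (!is_digit(peek())) fail("expected a number");
    long value = 0;
    while (pos_ < text_.size() && is_digit(text_[pos_])) {
      value = value * 10 + (text_[pos_++] - '0');
    }
    return value;
  }

  // Skip whitespace and return the next character, or '\0' at end of input.
  char peek() {
    while (pos_ < text_.size() &&
           std::isspace(static_cast<unsigned char>(text_[pos_]))) {
      ++pos_;
    }
    return pos_ < text_.size() ? text_[pos_] : '\0';
  }

  static bool is_digit(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
  }

  // A hand-written parser controls its own messages, e.g. the exact offset.
  [[noreturn]] void fail(const std::string &message) const {
    throw std::runtime_error(message + " at offset " + std::to_string(pos_));
  }

  std::string text_;
  std::size_t pos_ = 0;
};

// Usage: one function per grammar rule gives '*' a higher precedence than '+'.
void example_parser() {
  assert(Parser("2 * 2 + 2").parse() == 6);
  assert(Parser("2 * (2 + 2)").parse() == 8);
}
```

Each grammar rule becomes one function, and the fail helper shows where the "more helpful error messages" come from: the parser knows exactly what it expected and where it stopped.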

The third solution is to use a library that parses the code for us, so we can pass source files to it and receive the AST to perform the analysis. The advantage of this solution is that it is easy to use and will be updated if the language has a new version.

In this project, we will go with the third solution. Thanks to the LLVM team, we can use the Clang compiler components as libraries. We will pass the source files to the library, and Clang will handle syntax errors and type checking for us. It then gives us an easy way to traverse the AST using the Visitor design pattern and to report errors.

In this article, we will perform two analysis tasks on our code:
1- Check that no function name contains _ at any position.
2- Check that no class inherits from more than one class.

You can extend the project and do any kind of analysis you want.

Now let's start coding!

In this project, I will use C++20, Clang 15 and CMake version 3.0.0+. To avoid making the article too long, I will not cover installing them. You can easily search for that.

Our CMakeLists.txt file will look like this to use the needed Clang libraries.

cmake_minimum_required(VERSION 3.0.0)
project(checker VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(Clang REQUIRED CONFIG HINTS "${LLVM_DIR}/lib/cmake/clang/")

add_executable(checker main.cpp)

target_include_directories(checker PRIVATE ${CLANG_INCLUDE_DIRS})
target_link_libraries(checker PRIVATE
  clangAST
  clangBasic
  clangFrontend
  clangTooling
)

I named the project checker; you can replace it with your project name. For this article, I wrote all the code inside the main.cpp file.

Now, let's set up the basic code for our project in C++.

First, we need to create a class that inherits from RecursiveASTVisitor so we can visit any nodes we want to analyse.

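#include <clang/AST/RecursiveASTVisitor.h>
#include <clang/Frontend/FrontendAction.h>
#include <clang/Tooling/Tooling.h>

#include <fstream>

using namespace clang;
using namespace clang::tooling;

using namespace llvm;
using namespace llvm::cl;

class CheckerASTVisitor : public RecursiveASTVisitor<CheckerASTVisitor> {};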

Then create a class that inherits from ASTConsumer. It will take the AST and pass it to our visitor class to traverse it.

class CheckerASTConsumer : public ASTConsumer {
public:
  auto HandleTranslationUnit(ASTContext &context) -> void override {
    CheckerASTVisitor visitor;
    visitor.TraverseDecl(context.getTranslationUnitDecl());
  }
};

The last class is CheckerFrontendAction, a frontend action that creates a unique pointer to our AST consumer.

class CheckerFrontendAction : public ASTFrontendAction {
public:
  auto CreateASTConsumer(clang::CompilerInstance &CI, llvm::StringRef InFile)
      -> std::unique_ptr<ASTConsumer> override {
    return std::make_unique<CheckerASTConsumer>();
  }
};

Now in the main function, we will accept a file name from CLI, and pass it to the Clang library using our action

auto main(int argc, const char **argv) -> int {
  if (argc < 2) {
    errs() << "Usage: " << argv[0] << " <file>.cpp\n";
    return 1;
  }

  auto file_name = argv[1];

  // Read source file content to pass to clang
  std::ifstream if_stream(file_name);
  std::string content((std::istreambuf_iterator<char>(if_stream)),
                      (std::istreambuf_iterator<char>()));

  auto action = std::make_unique<CheckerFrontendAction>();
  
  // Pass action, file content and file name that used in error message
  clang::tooling::runToolOnCode(std::move(action), content, file_name);
  return 0;
}

Now our setup is ready, so let's start our checks one by one.

Check 1: Check that no function name contains _ at any position.

In any task, we need to ask ourselves an important question: which node do we need to check in the AST? In this case, it is the function declaration node (FunctionDecl). So let's visit it in our visitor. Our visitor will look like this now.

class CheckerASTVisitor : public RecursiveASTVisitor<CheckerASTVisitor> {
public:
  auto VisitFunctionDecl(FunctionDecl *f) -> bool { return true; }
};

Each visit function gives you the node you want and wants you to return a boolean which is used to indicate whether the traversal should continue or not. Now let's implement our logic inside VisitFunctionDecl

auto VisitFunctionDecl(FunctionDecl *f) -> bool { 
   // Get function name
   const auto name = f->getNameAsString();
   // Check if name contains _ at any position
   if (name.find("_") != std::string::npos) {
      // Get the diagnostic engine
      auto &DE = f->getASTContext().getDiagnostics();
      // Create custom error message
      auto diagID = DE.getCustomDiagID(
          DiagnosticsEngine::Error,
          "Function name contains `_`.");
      // Report our custom error
      DE.Report(f->getLocation(), diagID);
   }
   return true;
}

Now let's create a test file test_name.cpp and test our analysis.

void function_name() {}

Let’s run our checker and see the output.

checker test_name.cpp

The output will look like this

test_name.cpp:1:6: error: Function name contains `_`.
void function_name() {}
     ^
1 error generated.

Our first task is successfully done. Let's start the next check.

Check 2: Check that no class inherits from more than one class.

Now the node we need to analyse is the class declaration, so let's start.

bool VisitCXXRecordDecl(CXXRecordDecl *D) {
  // Forward declarations have no base class list to inspect
  if (!D->hasDefinition()) {
    return true;
  }
  // Get the number of base classes
  auto number_of_bases = D->getNumBases();
  // Check if the number of bases is bigger than one
  if (number_of_bases > 1) {
    // Report a custom error message like the last check
    auto &DE = D->getASTContext().getDiagnostics();
    auto diagID = DE.getCustomDiagID(
        DiagnosticsEngine::Error, "Class inherits from more than one class.");
    DE.Report(D->getLocation(), diagID);
  }
  return true;
}

Now let’s create a test file class_inheritance_test.cpp and test our analysis.

class One {};
class Two {};
class Three : One, Two {};

The output will look like this

class_inheritance_test.cpp:3:7: error: Class inherits from more than one class.
class Three : One, Two {};
      ^
1 error generated.

Our last task is successfully done 😉.

Summary

There are many more checks you can add to extend this project. You can also customize the format of the error messages, support configuration files to customize the analysis like production tools do, and more; those features are left for you to do.

I hope after this article, you understand how static analysis works.

I hope you enjoyed my article. You can find me on: GitHub, LinkedIn, Twitter.

Enjoy Programming 😋.

How to create a C/C++ Static Code analysis tool was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.

I created a programming language and created games with it

Amr Hesham — Sat, 22 Apr 2023 16:26:06 GMT

Hello everyone, today I will share with you the story of my programming language Amun, an open source, low-level, general-purpose language that compiles to machine code using the LLVM framework.

The story begins two years ago, when, as a self-taught software engineer, I was learning the subjects of computer science. At that time, I started to learn about compiler design for the first time from courses and books, and I created many small programming languages, some of them for specific goals like creating bots, drawing shapes, etc.

I really loved the career of compiler engineer and started to learn more about it: reading source code, trying to design new features, and watching live streams about compilers.

Last year, I got the idea to design a high-performance language that should be as fast as C/C++ but very simple and easy to learn, and that also gives you features for creating good libraries and DSLs (domain-specific languages).

I already created a design for a small language with some features and concepts inspired by other languages such as C/C++, Go, Rust, Jai, Swift, Kotlin, and I started to simplify this design and named it Jot.

What I want is a language that is as simple as C and Go, with no preprocessor and no garbage collection (remember, I need a high-performance language).

It also has features such as type inference, generic programming support, compile-time features, lambda expressions, operator overloading, and infix, prefix and postfix functions inspired by Swift, plus some other cool features. For example:

Lambda expressions, with the ability to move a trailing lambda out of the parentheses in function calls and constructors

var lambda = { printf("Hello from lambda!\n"); };

fun let(value *void, callback *() void) {
    if (value != null) {  callback(); }
}

fun main() {
   let(null) { 
       printf("Will never be printed");
   }; 
}

Also, to come up with an easy way to iterate over arrays and strings.

for array { printf("array[%d] = %d\n", it_index, it); } 
for item : array { printf("array[%d] = %d\n", it_index, item); }
for item, index : array { printf("array[%d] = %d\n", index, item); }

Tuples, so you can create a collection of values of different types and use it to write a function that returns more than one value

var tuple = (1, 2, "Hello", 3, 4, "World");

fun max_min(x int64, y int64) (int64, int64) {
   return if (x > y) (x, y) else (y, x);
}

The defer statement, which is useful for deallocating resources or closing streams at the end of the current scope

fun main() {
   var stream = open_stream();
   defer close_stream(stream);

   // Read and write from stream

   // Stream will be closed here at the end of scope
}
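For readers coming from C++, the closest analogy to defer is a small scope guard whose destructor runs at the end of the scope. The sketch below is only that analogy, not how Amun implements defer; the Defer class and the run_with_defer function are hypothetical names, and a log of strings stands in for a real stream.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scope guard: runs the stored callback when the scope ends.
class Defer {
public:
  explicit Defer(std::function<void()> callback)
      : callback_(std::move(callback)) {
    assert(callback_ && "Defer needs a callable");
  }
  ~Defer() { callback_(); }

  // Copying would run the callback twice, so it is disabled.
  Defer(const Defer &) = delete;
  Defer &operator=(const Defer &) = delete;

private:
  std::function<void()> callback_;
};

// Records the order of events so the effect of the guard is visible.
std::vector<std::string> run_with_defer() {
  std::vector<std::string> log;
  {
    log.push_back("open");
    Defer close_stream([&log] { log.push_back("close"); });
    log.push_back("read/write");
  }  // close_stream's destructor runs here, at the end of the scope
  return log;
}
```

The returned log contains "open", "read/write" and then "close", even though the close step was registered before the reads and writes, which is the same ordering the defer statement gives you.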

One of the most important goals is to have a cool and helpful compiler like Rust's: it should give you messages that help you not only fix the bug but also understand more about the problem and the language.

After some research, I found that the easiest way to create a high-performance low-level language is to use the LLVM framework as a backend for the compiler, so it can optimize and generate machine code for most platforms. So I started to learn more about LLVM from books and created a few projects with it.

In June 2022 I started to work on the compiler, and I decided to create it using C++ for many reasons: it has high performance, LLVM itself is written in C++ so I can easily find samples written in C++, and I have some experience using it.

In Jan 2023 the language had most of the features you need to write any program that you can create in C, but without macros, and I started to create simple programs and link them with libraries such as OpenGL and Raylib to create simple GUI applications.

Then I started to work on features that make creating libraries and applications easier, such as tuples, operator overloading, lambdas, type aliases and directives, and I also improved the compiler error messages. I can't cover all the features in this article, but I will write about them later with cool samples.

After testing the features, I ported a Pong game written using C++ and Raylib to my language, and this is the result.

https://medium.com/media/5c012c3de350b027ba3c79885ed7249b/href

Until this step, the language name was Jot, but I was surprised to find that there is already a language with the same name, so I searched for a new name, and I named it Amun. The name is inspired by Ancient Egyptian mythology, in which Amun was the chief deity of the Egyptian Empire.

The language is still in development and everyone is most welcome to contribute to it. You can help with documentation, the compiler, samples, etc.

You can follow the development on the GitHub repository, and there are more than 200 samples that cover all language features, so feel free to star the project if you loved it.

GitHub - AmrDeveloper/Amun: A Statically typed, compiled general purpose low level programming language built using C++ and LLVM Infrastructure framework designed to be simple and fast

Enjoy Programming and creating cool stuff 😇

I created a programming language and created games with it was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.