PoPL/04_semantics_and_running
Bananymous 0508bde07c Add some code examples that I used as test cases during implementation 2024-04-30 01:29:09 +03:00
..
examples Add some code examples that I used as test cases during implementation 2024-04-30 01:29:09 +03:00
.gitkeep spring 2024 git template 2023-12-18 07:45:29 +02:00
README.md Add a text document descibing my implementation of Phase 4 2024-04-30 01:28:10 +03:00
build_ast.py Fix true/false naming on unless_expression 2024-04-28 01:35:48 +03:00
lexer.py Implement proper semantic checking for full AST 2024-04-27 00:05:53 +03:00
main.py Fix check for when stack frame is needed 2024-04-30 00:05:52 +03:00
tree_print.py Implement proper semantic checking for full AST 2024-04-27 00:05:53 +03:00

README.md

Phase 4

I implemented a basic compiler for the phase 4. This compiler translates the given program file into GNU assembly that is then compiled using GCC and linked against system libc. Libc is used for printing values and time manipulation.

Semantic checking

Before the compilation check, full semantic check is performed on the code. All types are statically checked to be valid.

Levels of interpretation

  1. All arithmetic expressions and print statements are implemented
  • Arithmetic expressions are checked during semantic checking.
    • Valid addition expressions
      • [int] + [int] -> [int]
      • [date] + [int] -> [date]
    • Valid subtraction expression
      • [int] - [int] -> [int]
      • [date] - [int] -> [date]
      • [date] - [date] -> [int]
    • Valid multiplication expressions
      • [int] * [int] -> [int]
    • Valid division expressions
      • [int] / [int] -> [int]
    • Valid comparisons
      • [int] = [int] -> [bool]
      • [date] = [date] -> [bool]
      • [int] < [int] -> [bool]
      • [date] < [date] -> [bool]
  • Print statement can take a list of [int], [date] or [string] types
  1. Global variables can be defined and they are accessible from anywhere in the code
  2. For [date] variables, reading a field is possible. Writing is not because of limitations of assembly language.
  • Readable fields include
    • [date]'day -> [int], day number in month (1-31)
    • [date]'month -> [int], month number in year (1-12)
    • [date]'year -> [int], year number
    • [date]'weekday -> [int], number of the day in week (1-7)
    • [date]'weeknum -> [int], number of the week in year (1-53)
  • Writable fields include
    • [date].day -> [int], day number in month (1-31)
    • [date].month -> [int], month number in year (1-12)
    • [date].year -> [int], year number
  • Writing to date will pass semantic checking but fail during compilation
  1. Unless-statement, unless-expression and until-loop all work as expected
  2. Function and procedure definitions and calls are properly handled and work as expected
  • Function and procedure return types are validated during semantic check
  • Auto keyword works. In this case the return type is determined using the type of returned value.
  1. There is no runtime checking done as my compiler has static typing. All variables and functions are typed, so there is no need for doing it during runtime.
  2. Recursion and proper local variables are implemented. This happens pretty much automatically when local variables and function arguments are kept on stack.

Test cases

All tests in public_examples/04_semantics_and_running/running_examples pass semantic checking and output the correct values.

All tests in public_examples/04_semantics_and_running/semantic_error_examples fail during semantic checking.

Compiler ABI and generated code

Some insights on how different data types are handled and how the compiler works in general.

  • Integer literals are stored as 64 bit values and arithmetic is done using basic assembly instructions
  • Date literals are stored as an Unix timestamp (seconds since 1970). To add or subtract days, the right operand is multiplied by 86400 (seconds in 24 hours) for the timestamp to work expecedly. Subtracting dates is done using basic integer arithmetic and dividing the result by 86400 which gives number of days as the difference.
  • String literals are stored in the .data section with automatically generated labels.
  • Printing
    • Integers are printed using printf("%lld", value), so printf takes care of the formatting
    • Strings are printed using printf("%s", label), where label is the strings read-only label in .data section
    • Dates are printed by first calling localtime(timestamp) to get struct tm* pointer containing broken down time. Then dates are formatted into a fixed buffer in .bss section using strftime(date_buffer, date_buffer_size, "%Y-%m-%d", tm), to format the string into the wanted form. After the date buffer has formatted string, it can printed using printf("%s", date_buffer).
  • Today() builtin function calls __builtin_today which calls time(NULL) from libc. This function returns current unix timestamp, so it is really suitable for this usage.
  • Reading attributes from date objects is kind of hacky. It is mostly done as printing date objects. First date is converted into struct tm* with localtime(timestamp), and formatted to wanted attribute (day -> "%d", month -> "%m", ...). This is then formatted into the date buffer using strftime. After formatting date buffer should contain only the asked part of the date. This can be converted into integer using the atoi() function.
  • Writing date attributes is not implemented as that would require modifying struct tm* which is not feasible in pure assembly as the actual structure of it is unknown. There doesn't seem to be easy way do determine size and offset of a field in structure in GNU assembly.
  • Conditional statements are done comparing the low byte of rax. After which a conditional jump happens to a label defined in the code.
  • Calling convension of generated assembly code is as follows
    • Arguments are pushed to stack before function call
    • Local variables are initialized to proper values in the stack after as the first step in function call
    • Rest of the code is generated
    • Using either arguments or local variables happens through offset to rbp which contains the current stack frame.
    • Return value of each operation of pythons compile_ast() is stored in rax register
    • Stack is always aligned to 16-byte boundary according to System V ABI. This is because called libc functions might use sse-instructions that depend on this alignment.
    • Global variables are stored in .bss section. The first thing main function does, is initialize global variables to proper values.

Optimization

There is also an optional -O flag for the "compiler". This does a simple optimization pass over the generated assembly and optimizes some obvious instructions. These are the current optimizations implemented

  • Moving from a register to the same register is removed
    • movq %rax, %rax => nop
  • Moving into rax and then pusing into rax is optimized to a single push instruction
    • movq $1, %rax; pushq %rax => pushq $1
  • Moving into rax ant then moving out of rax is optimized to single move instruction without accessing rax
    • movq $1, %rax; movq %rax, %rcx => movq $1, %rcx
  • Repeated addition and subtraction instruction with immediate values on the same register are combined into a signle instruction
    • addq $5, %rax; subq $2, %rax => addq $3, %rax
  • Moving immediate into register followed by add or subtract on the same register is optimized to single move
    • movq $5, %rax; subq $2, %rax => movq $3, %rax
  • Adding or subtracting immediate 1 is optimized to corresponding inc or dec instruction
    • addq $1, %rax => incq %rax
  • Moving immediate zero to register is optimized to xor instruction
    • movq $0, %rax => xorq %rax, %rax

These optimizations are not all something that can be done in all assembly language, but because the way my compiler generates code it does not expect value to stay in register. For example normally movq $4, %rax; movq %rax, %rcx cannot be optimized to movq $4, %rcx since this assumes rax is not used afterwards. In my compiler this kind of assumptions are not done, so this is a valid optimization.

Depending on the code that is to be compiled, optimizations might not do anything meaningful, so this is left as a optional flag.

Usage of the program

usage: main.py [-h] [-d] (--who | -f FILE) [-o OUTPUT] [-a ASSEMBLY] [-O] [-r]

options:
  -h, --help            show this help message and exit
  -d, --debug           debug?
  --who                 print out student IDs and NAMEs of authors
  -f FILE, --file FILE  filename to process
  -o OUTPUT, --output OUTPUT
                        output filename for compiled code. default (a.out)
  -a ASSEMBLY, --assembly ASSEMBLY
                        output filename for generated assembly code
  -O, --optimize        run simple optimization steps on the generated assembly code
  -r, --run             run the compiled code after compilation

By default when the program is run, it compiles file pointed by -f into binary called a.out. If you want the binary to be named something different, you can use the -o flag to specify output file for the generated binary.

If you are interested in the generated assembly, you can use the -a flag. This flag names a file where the generated assembly is dumped. Without this flag, no assembly is written to any file.

If you want to run the optimization pass, you can specify the -O flag. This flag takes no arguments. If both this flag and the -a flag are specified, the dumped assembly will the optimized verision.

To run the program automatically after compilation, you can specify the -r flag. This flag calls the output binary specified by -o flag and executes it.

Requirements

For this program to function properly, you must have GCC installed and the required PLY library for python. Without GCC compilation step will fail. Assembly will be dumped even if there is no GCC available, as that happens before the invocation of GCC.