Add a text document descibing my implementation of Phase 4

This commit is contained in:
Bananymous 2024-04-30 01:28:10 +03:00
parent 3fa0e5f655
commit 83c560b2a9
1 changed files with 132 additions and 0 deletions

View File

@ -0,0 +1,132 @@
## Phase 4
I implemented a basic compiler for the phase 4. This compiler translates the given program file into GNU assembly that is then compiled using GCC and linked against system libc. Libc is used for printing values and time manipulation.
### Semantic checking
Before the compilation check, full semantic check is performed on the code. All types are statically checked to be valid.
#### Levels of interpretation
1. All arithmetic expressions and print statements are implemented
- Arithmetic expressions are checked during semantic checking.
- Valid addition expressions
- [int] + [int] -> [int]
- [date] + [int] -> [date]
- Valid subtraction expression
- [int] - [int] -> [int]
- [date] - [int] -> [date]
- [date] - [date] -> [int]
- Valid multiplication expressions
- [int] * [int] -> [int]
- Valid division expressions
- [int] / [int] -> [int]
- Valid comparisons
- [int] = [int] -> [bool]
- [date] = [date] -> [bool]
- [int] < [int] -> [bool]
- [date] < [date] -> [bool]
- Print statement can take a list of [int], [date] or [string] types
2. Global variables can be defined and they are accessible from anywhere in the code
3. For [date] variables, reading a field is possible. Writing is not because of limitations of assembly language.
- Readable fields include
- [date]'day -> [int], day number in month (1-31)
- [date]'month -> [int], month number in year (1-12)
- [date]'year -> [int], year number
- [date]'weekday -> [int], number of the day in week (1-7)
- [date]'weeknum -> [int], number of the week in year (1-53)
- Writable fields include
- [date].day -> [int], day number in month (1-31)
- [date].month -> [int], month number in year (1-12)
- [date].year -> [int], year number
- Writing to date will pass semantic checking but fail during compilation
4. Unless-statement, unless-expression and until-loop all work as expected
5. Function and procedure definitions and calls are properly handled and work as expected
- Function and procedure return types are validated during semantic check
- Auto keyword works. In this case the return type is determined using the type of returned value.
6. There is no runtime checking done as my compiler has static typing. All variables and functions are typed, so there is no need for doing it during runtime.
7. Recursion and proper local variables are implemented. This happens pretty much automatically when local variables and function arguments are kept on stack.
#### Test cases
All tests in _public\_examples/04\_semantics\_and\_running/running\_examples_ pass semantic checking and output the correct values.
All tests in _public\_examples/04\_semantics\_and\_running/semantic\_error\_examples_ fail during semantic checking.
### Compiler ABI and generated code
Some insights on how different data types are handled and how the compiler works in general.
- Integer literals are stored as 64 bit values and arithmetic is done using basic assembly instructions
- Date literals are stored as an Unix timestamp (seconds since 1970). To add or subtract days, the right operand is multiplied by 86400 (seconds in 24 hours) for the timestamp to work expecedly. Subtracting dates is done using basic integer arithmetic and dividing the result by 86400 which gives number of days as the difference.
- String literals are stored in the .data section with automatically generated labels.
- Printing
- Integers are printed using `printf("%lld", value)`, so `printf` takes care of the formatting
- Strings are printed using `printf("%s", label)`, where label is the strings read-only label in .data section
- Dates are printed by first calling `localtime(timestamp)` to get `struct tm*` pointer containing broken down time. Then dates are formatted into a fixed buffer in .bss section using `strftime(date_buffer, date_buffer_size, "%Y-%m-%d", tm)`, to format the string into the wanted form. After the date buffer has formatted string, it can printed using `printf("%s", date_buffer)`.
- `Today()` builtin function calls `__builtin_today` which calls `time(NULL)` from libc. This function returns current unix timestamp, so it is really suitable for this usage.
- Reading attributes from date objects is kind of hacky. It is mostly done as printing date objects. First date is converted into `struct tm*` with `localtime(timestamp)`, and formatted to wanted attribute (day -> "%d", month -> "%m", ...). This is then formatted into the date buffer using `strftime`. After formatting date buffer should contain only the asked part of the date. This can be converted into integer using the `atoi()` function.
- Writing date attributes is not implemented as that would require modifying `struct tm*` which is not feasible in pure assembly as the actual structure of it is unknown. There doesn't seem to be easy way do determine size and offset of a field in structure in GNU assembly.
- Conditional statements are done comparing the low byte of rax. After which a conditional jump happens to a label defined in the code.
- Calling convension of generated assembly code is as follows
- Arguments are pushed to stack before function call
- Local variables are initialized to proper values in the stack after as the first step in function call
- Rest of the code is generated
- Using either arguments or local variables happens through offset to rbp which contains the current stack frame.
- Return value of each operation of pythons `compile_ast()` is stored in rax register
- Stack is always aligned to 16-byte boundary according to _System V ABI_. This is because called libc functions might use sse-instructions that depend on this alignment.
- Global variables are stored in .bss section. The first thing main function does, is initialize global variables to proper values.
### Optimization
There is also an optional `-O` flag for the "compiler". This does a simple optimization pass over the generated assembly and optimizes some obvious instructions. These are the current optimizations implemented
- Moving from a register to the same register is removed
- `movq %rax, %rax` => `nop`
- Moving into rax and then pusing into rax is optimized to a single push instruction
- `movq $1, %rax; pushq %rax` => `pushq $1`
- Moving into rax ant then moving out of rax is optimized to single move instruction without accessing rax
- `movq $1, %rax; movq %rax, %rcx` => `movq $1, %rcx`
- Repeated addition and subtraction instruction with immediate values on the same register are combined into a signle instruction
- `addq $5, %rax; subq $2, %rax` => `addq $3, %rax`
- Moving immediate into register followed by add or subtract on the same register is optimized to single move
- `movq $5, %rax; subq $2, %rax` => `movq $3, %rax`
- Adding or subtracting immediate 1 is optimized to corresponding inc or dec instruction
- `addq $1, %rax` => `incq %rax`
- Moving immediate zero to register is optimized to xor instruction
- `movq $0, %rax` => `xorq %rax, %rax`
These optimizations are not all something that can be done in all assembly language, but because the way my compiler generates code it does not expect value to stay in register.
For example normally `movq $4, %rax; movq %rax, %rcx` cannot be optimized to `movq $4, %rcx` since this assumes rax is not used afterwards. In my compiler this kind of assumptions are not done, so this is a valid optimization.
Depending on the code that is to be compiled, optimizations might not do anything meaningful, so this is left as a optional flag.
### Usage of the program
```
usage: main.py [-h] [-d] (--who | -f FILE) [-o OUTPUT] [-a ASSEMBLY] [-O] [-r]
options:
-h, --help show this help message and exit
-d, --debug debug?
--who print out student IDs and NAMEs of authors
-f FILE, --file FILE filename to process
-o OUTPUT, --output OUTPUT
output filename for compiled code. default (a.out)
-a ASSEMBLY, --assembly ASSEMBLY
output filename for generated assembly code
-O, --optimize run simple optimization steps on the generated assembly code
-r, --run run the compiled code after compilation
```
By default when the program is run, it compiles file pointed by `-f` into binary called _a.out_. If you want the binary to be named something different, you can use the `-o` flag to specify output file for the generated binary.
If you are interested in the generated assembly, you can use the `-a` flag. This flag names a file where the generated assembly is dumped. Without this flag, no assembly is written to any file.
If you want to run the optimization pass, you can specify the `-O` flag. This flag takes no arguments. If both this flag and the `-a` flag are specified, the dumped assembly will the optimized verision.
To run the program automatically after compilation, you can specify the `-r` flag. This flag calls the output binary specified by `-o` flag and executes it.
### Requirements
For this program to function properly, you must have GCC installed and the required PLY library for python. Without GCC compilation step will fail. Assembly will be dumped even if there is no GCC available, as that happens before the invocation of GCC.