In my previous blog post, I discussed the PIC10F200/202/204/206 series of microcontrollers and explained the structure of their opcodes. In this post, we will create an assembler together.
What is an Assembler?
An assembler converts human-readable assembly code into machine language, or machine-readable code. For example, consider the following assembly line:
GOTO 0x03
We can understand this code, but our CPU or microcontroller cannot. The assembler takes this assembly line and converts it into machine code, like so:
GOTO 0x03
101 000000011
Each opcode (operation code) has an identifier. In this case, the identifier is 0b101, and it may be followed by one or more operands, like 0x03 in our example, which is represented as 0b000000011. The assembler combines these values into 0b101000000011, forming a 12-bit binary value. This means every assembly line results in a 12-bit binary, and these binary values are then concatenated and exported as a binary or executable file that the CPU or machine can understand. This is the core function of an assembler.
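The combination step described above can be sketched in a few lines of C. This is only an illustration: the names GOTO_ID, encode_goto, and word_to_binary are hypothetical and are not part of the assembler built later in this post.

```c
#include <stdint.h>

#define GOTO_ID 0x5u /* 0b101, the 3-bit GOTO identifier from the text */

/* Place the identifier in the top 3 bits and the operand in the low 9 bits.
   Operand bits above the 9th are masked off. */
uint16_t encode_goto(uint16_t address) {
    return (uint16_t)((GOTO_ID << 9) | (address & 0x1FFu));
}

/* Write the 12 bits of 'word' into 'out' as '0'/'1' characters,
   most significant bit first. 'out' needs room for 13 bytes. */
void word_to_binary(uint16_t word, char out[13]) {
    for (int bit = 11; bit >= 0; --bit) {
        out[11 - bit] = ((word >> bit) & 1u) ? '1' : '0';
    }
    out[12] = '\0';
}
```

Calling encode_goto(0x03) yields 0xA03, and word_to_binary turns that into the string 101000000011, the same value shown above.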
Why Create an Assembler?
Now that we understand what an assembler does, why create one? Creating an assembler is an excellent learning experience for junior programmers. It helps deepen your understanding of both the microcontroller’s architecture and the programming language you’re working with. In this case, we will write our assembler in C, as it is fast and performance is important for our needs.
Challenges
When creating an assembler, you’ll encounter several challenges. For instance, not all opcodes have a single operand—some may have two operands, while others might not require any. We must design a program capable of understanding these variations and generating the corresponding machine code.
Breaking It Down: Our Methodology
We will start by writing a simple program that generates binary code for a single opcode. Once that works, we will expand it to support additional opcodes.
In my previous blog post, I explained the 33 opcodes available in the PIC10F series of microcontrollers. Let’s revisit the structure of the GOTO opcode:
GOTO:
          ┌─ Literal Value (9 bits)
     ─────┴─────
0b101k kkkk kkkk
  ─┬─
   └─ Identifier (3 bits)
The top three bits (the most significant ones) are the GOTO identifier, and the remaining nine bits hold the 9-bit operand.
Simple Program to handle GOTO opcode
Let’s start our journey by creating a simple program that generates the binary for the GOTO opcode.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSTR 32
int main(void){
    char line[2][MAXSTR] = { "GOTO", "42" };
    int machine_code[10] = { 0 };
    int midx = 0;
    // Check whether the opcode matches "GOTO"
    if(strcmp(line[0], "GOTO") == 0){
        // strtol stores a pointer to the first unconverted character here
        char *endptr;
        // Extract the numeric value of line[1] and store it in 'result' (base 10)
        long result = strtol(line[1], &endptr, 10);
        // The operand is valid only if at least one digit was read, nothing
        // follows the number, and the value fits in 9 bits (this also accepts 0)
        if(endptr != line[1] && *endptr == '\0' && result >= 0 && result <= 0x1FF){
            // Note: 0b... literals are a GCC/Clang extension (standard since C23)
            int code = 0b101000000000 | (int)result;
            machine_code[midx++] = code;
        } else {
            // Exit the program if the operand is not a valid number
            printf("Invalid operand\n");
            exit(1);
        }
    } else {
        printf("Unsupported opcode \"%s\"\n", line[0]);
        exit(1);
    }
    printf("Generated machine code\n");
    for(int i = 0; i < midx; ++i){
        printf("0x%.3X\n", machine_code[i]);
    }
    return 0;
}
Let’s see what this code does, step by step:
- It checks if the first element in our array is equal to GOTO; otherwise, it exits the program.
- It extracts the numeric value of the GOTO operand and stores it in result. If the conversion fails, the program exits.
- If converting the operand of GOTO to a number is successful, it generates the machine code using the identifier of GOTO, which is 0b101000000000, and result.
- It saves the generated machine code in our machine_code list.
- Using a for loop, we iterate through all the codes and print their hexadecimal values.
This program has several issues. First, the input to the assembler isn’t an array like { "GOTO", "42" }; it’s a file containing lines of code. Second, we don’t want to handle just GOTO; there are 33 opcodes that need to be managed. A binary view of the generated machine code would also be helpful. Another issue is that the assembler should handle the case where no operand is provided for GOTO, for example by reporting an error. Additionally, the operand for GOTO can be in various formats, such as 0x0F or 06H, and the assembler must be capable of detecting and processing these. Lastly, we should have flags to allow for different output options, like -v to view the generated binary. Let’s address these issues one by one and expand our simple assembler program.
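To make the operand-format issue concrete, here is a minimal sketch of a parser that accepts decimal (42), 0x-prefixed hexadecimal (0x0F), H-suffixed hexadecimal (06H), and 0b-prefixed binary (0b101). The name parse_operand and its return convention are assumptions made for this illustration, and range checking is left out.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* parse_operand (hypothetical helper): returns 1 and stores the value in
   *out on success; returns 0 if 'text' is not a number in a known format. */
int parse_operand(const char *text, long *out) {
    size_t len = strlen(text);
    char buf[64];              /* Scratch copy used to drop an 'H' suffix */
    const char *digits = text; /* Where the digits start */
    int base = 10;
    char *end;

    if (len == 0 || len >= sizeof(buf)) {
        return 0;
    }
    if (len > 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16;             /* 0x0F style */
        digits = text + 2;
    } else if (len > 1 && (text[len - 1] == 'h' || text[len - 1] == 'H')) {
        memcpy(buf, text, len - 1); /* 06H style: copy without the suffix */
        buf[len - 1] = '\0';
        base = 16;
        digits = buf;
    } else if (len > 2 && text[0] == '0' && (text[1] == 'b' || text[1] == 'B')) {
        base = 2;              /* 0b101 style */
        digits = text + 2;
    }
    /* strtol would skip leading whitespace and accept a sign; reject both */
    if (!isxdigit((unsigned char)digits[0])) {
        return 0;
    }
    long value = strtol(digits, &end, base);
    if (*end != '\0') {
        return 0;              /* Trailing characters that are not digits */
    }
    *out = value;
    return 1;
}
```

The H suffix is checked before the 0b prefix so that an operand such as 0BH is read as hexadecimal 0x0B rather than as a malformed binary number. Returning success separately from the value means an operand of 0 is not mistaken for a failed conversion.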
Before we can read an assembly file, we must know where it is; let’s address this challenge first.
Challenge 1: CLI Flags
For our assembler to work, we don’t want to hard-code the instructions as arrays of strings in the program; instead, we want to assemble an .asm file by providing the path to the input file. Let’s write a function that stores our CLI (Command Line Interface) options in a predefined structure called GFLAGS, for Global Flags.
#define MAX_PATH 512
typedef struct {
int verbose;
char input[MAX_PATH];
char output[MAX_PATH];
} GFLAGS;
We have three items in our structure. The first is a variable called verbose, which we enable when the -v flag is provided. This helps us see the generated binary output. The second is input, a character array (string) used to store the input file path. The last item, also a character array called output, is for specifying the output file path. However, we don’t always need to define an output path, so output, like verbose, is optional. The input field is mandatory for our assembler because it holds the path to the assembly code.
Now that we have defined our structure, let’s write a function that captures argc and argv from the main function and updates the given pointer to GFLAGS.
/* Update global flags */
void update_gflags(GFLAGS *gflags, int argc, char *argv[]) {
// Exit the program if there are not enough arguments
if (argc < 2) {
// Print usage instructions
printf("%s \"<filename>\" -[options]\n", argv[0]);
exit(0); // Terminate the program
}
int i; // Loop variable for arguments
int j; // Loop variable for characters in an argument
// Initialize global flags structure
gflags->verbose = 0; // Default verbose mode off
memset(gflags->input, 0, sizeof(gflags->input)); // Clear input file path
memset(gflags->output, 0, sizeof(gflags->output)); // Clear output file path
// Set default input and output file paths
// (snprintf truncates an overlong path instead of overflowing the buffer)
snprintf(gflags->input, sizeof(gflags->input), "%s", argv[1]); // Input file path from first argument
strcpy(gflags->output, "asm_out.bin"); // Default output file path
int save = 0; // Flag to indicate the next argument is an output file path
// Iterate over program arguments starting from the second one
for (i = 2; i < argc; i++) {
// Save output path if '-o' option was found
if (save) {
snprintf(gflags->output, sizeof(gflags->output), "%s", argv[i]); // Store output file path
save = 0; // Reset save flag
continue; // This argument is a path, so do not scan it for options
}
// Process each character of the current argument
for (j = 0; j < (int)strlen(argv[i]); j++) {
if (argv[i][0] == '-') { // Check if it's an option
switch (argv[i][j]) {
case 'v': // Verbose mode
gflags->verbose = 1;
break;
case 'o': // Output file option
save = 1; // Indicate the next argument is the output path
break;
default:
break; // Ignore unknown options
}
}
}
}
// Check if '-o' was used without specifying an output path
if (save) {
// Error message
printf("No output file!\nAfter '-o' output path needed\n");
exit(0); // Terminate the program
}
}
We create a function called update_gflags. This function takes three arguments: the first is GFLAGS *gflags, which is a pointer used to update the given GFLAGS structure; the second is argc, which contains the number of command-line arguments; and the last is argv, which holds the arguments themselves.
This function saves the second argument in input (the first one is the program itself) and loops through each subsequent argument. If an argument starts with -, it checks for possible flags like -v for enabling the verbose flag. If it encounters the -o flag, it enables the save variable, and the next argument is stored in the output path. If there is no -o flag, the output path is already set to the default value of asm_out.bin.
We can use update_gflags in our main function like this:
int main(int argc, char *argv[]){
GFLAGS gflags;
update_gflags(&gflags, argc, argv);
// ...
}
We can test our flags and observe the different outputs:
./assembler ./test.asm -v -o ./output.bin
The resulting data would be:
input: "./test.asm"
output: "./output.bin"
verbose: 1
Now that we have the input file path thanks to the update_gflags function, we need to read the input file, if possible, and then proceed to read each line and process it for our assembler.
Challenge 2: Read Input File
It’s easier for our assembler to have each line as a string (char *) before breaking it down into words and extracting the operands. So, we need a structure to store our lines, and a table would be ideal. We will store the lines in a fixed-size two-dimensional char array, but we don’t know how many of its rows are in use, so we will use an int to keep track of how many lines of code are stored in our table. Let’s start by writing our Table structure and calling it TBL.
#define MAX_STR 256 // Maximum length of a single line
#define ASM_BUFF 1024 // Maximum number of lines in the table
// Structure to store lines of assembly code
typedef struct {
// Array to store the lines (up to ASM_BUFF lines, each up to MAX_STR characters)
char lines[ASM_BUFF][MAX_STR];
// Variable to keep track of the number of lines stored in the table
int len;
} TBL;
A function for copying TBL would be nice to help us store the original lines afterward.
void copytbl(TBL *dst, TBL *src){
dst->len = src->len;
for(int i = 0; i < src->len; ++i){
strcpy(dst->lines[i], src->lines[i]);
}
}
Now that we have our table, we can write a function to read each line of the input file. We already have the input file path stored in our GFLAGS, so let’s write this function and call it io_read.
The function io_read takes two arguments: the first is a pointer to our predefined TBL structure, since we want to store each line in our table, and the second argument is the input file’s path.
/* Read the file at 'path' and load it into 'tbl' (if an error occurs, exit the program) */
void io_read(TBL *tbl, char path[]){
    FILE *fp;
    // Clear the lines array in the table
    memset(tbl->lines, 0, sizeof(tbl->lines));
    tbl->len = 0;
    // Open the file for reading
    fp = fopen(path, "r");
    // If the file cannot be opened, print an error message and exit
    if(fp == NULL){
        printf("Could not open file \"%s\"!\n", path);
        exit(1);
    }
    char buff[MAX_STR] = { 0 };
    // Read each line from the file, stopping once the table is full
    while(tbl->len < ASM_BUFF && fgets(buff, sizeof(buff), fp) != NULL){
        // Copy the current line to the table and increment the line count
        strcpy(tbl->lines[tbl->len], buff);
        tbl->len++;
    }
    // Close the file after reading
    fclose(fp);
}
- We define a file pointer called fp (FILE *fp).
- We clear our table (tbl).
- We open the file and check if fp is NULL; if yes, we exit the program because it means we couldn’t read the file.
- We read each line and store it into buff, then copy buff to our table’s line.
- We close the file.
Now that we have the io_read function, let’s add it to our main function, just below the update_gflags call.
int main(int argc, char *argv[]){
GFLAGS gflags;
update_gflags(&gflags, argc, argv);
TBL file;
io_read(&file, gflags.input);
// ...
}
Thanks to update_gflags and io_read, we now have the input file read and stored in our TBL structure, named file. Now, we can process each line.
Challenge 3: Breaking Down the Lines
The io_read function helped us read the input file and store it in our TBL structure, where each line is stored in an array like this:
{
// ...
"GOTO 42",
"NOP",
"CLRW"
// ...
}
However, earlier in our example, we had an array for each instruction. For example, "GOTO 42" was stored as { "GOTO", "42" }. That form was useful because the opcode and the operand were already separated, unlike in a single line such as "GOTO 42", so we need a function to convert lines like "GOTO 42" into arrays like { "GOTO", "42" }. This will make processing much easier.
Now that we understand why we need a function to break down a string, let’s write one and call it str_break. However, we also need a structure to store our data. It should be something similar to TBL, which we already defined, but smaller, as the table is too large for this purpose. We need a compact structure to store our operands, so let’s define a structure and call it OPR, to store the result of the str_break function.
// Maximum number of operands
#define MAX_OPERAND 5
typedef struct {
char lines[MAX_OPERAND][MAX_STR];
int len;
} OPR;
So, now that we have our OPR structure, let’s write the str_break function. The str_break function requires a character array input (char *) and a pointer to the OPR structure to store the data.
But there is a problem: we don’t want the str_break function to behave like split() in other languages. It needs to be smart enough to detect quoted characters, like 'A', 'B', etc., which are enclosed in single quotes. This is useful because not all instruction operands are integers; they may be characters, as in the previous “Hello, World” example. Therefore, we need to track quotes as well.
void str_trim(char buff[]); // Prototype: str_trim is defined later, in Challenge 5
void str_break(char input[], OPR *tbl) {
    int q = 0;  // Flag to track if inside quotes
    int bi = 0; // Index of the current token in the table
    int f = 0;  // Index for characters in the current token
    memset(tbl->lines, 0, sizeof(tbl->lines)); // Initialize the lines array in the table to 0
    int was_space = 1; // Starts at 1 so leading spaces do not produce an empty first token
    // Iterate through each character of the input string
    // (tokens beyond MAX_OPERAND are dropped instead of overflowing the table)
    while (*input && bi < MAX_OPERAND) {
        // If the current character is not a space or we're inside quotes, add it to the current token
        if (*input != ' ' || q == 1) {
            if (f < MAX_STR - 1) { // Leave room for the null terminator
                tbl->lines[bi][f++] = *input; // Add character to the current token
                tbl->lines[bi][f] = '\0';     // Null-terminate the token
            }
            was_space = 0; // Reset the space flag
        } else if (was_space == 0) {
            // First space after a token: move to the next token
            bi++;          // Move to the next token
            f = 0;         // Reset the character index for the new token
            was_space = 1; // Set space flag
        }
        // If the character is a quote, toggle the inside-quote flag
        if (*input == '\'') q = q ? 0 : 1;
        input++; // Move to the next character
    }
    int size = sizeof(tbl->lines) / sizeof(tbl->lines[0]); // Get the number of lines available in the table
    tbl->len = 0; // Initialize the token count to 0
    // Loop through all lines in the table
    for (int i = 0; i < size; ++i) {
        str_trim(tbl->lines[i]); // Trim whitespace from the token
        if (strcmp(tbl->lines[i], "") == 0) { // If the token is empty, stop processing
            break;
        } else {
            tbl->len++; // Increment the count for non-empty tokens
        }
    }
}
The str_break function is able to break down a given string (character array char *) into tokens and store them in the OPR structure. It tracks the single quote character ' with a flag so that spaces inside quotes do not split a token. After that, it counts the non-empty tokens and updates the len field in the OPR structure.
Challenge 4: Assemble function
Now, thanks to str_break, we are able to break down a given line, which is very helpful for processing the operands of an opcode. Next, we need a function to process each line for us. Since this is an assembler program, let’s call the function assemble. We already have a TBL structure for the extracted lines from the input (using io_read).
Now, we need another structure to help the assemble function store its data. Remember earlier when we used printf and a for loop to see the result of each code in hexadecimal? Wouldn’t it be better to already have the processed line stored? This is useful when the verbose flag (-v) is set. Since an assembler generates an executable output, we also need to store each numeric value of an opcode in an array. This is essential for writing the executable file.
But what if there’s an invalid opcode? The assemble function should be able to handle that and return a proper error message with enough details to help us locate the issue, such as the line number or even the invalid line itself. Lastly, since the PIC10F family has limited program memory (256 words on the PIC10F200/204 and 512 words on the PIC10F202/206), it’s useful to track the number of generated words and the used addresses.
Now that we know what the assemble function needs to do, let’s create an appropriate structure for it and call it ASMBL, which will be responsible for storing the processed data from the assemble function. This structure will also use ASM_LEN and ASM_ERR to keep track of word length and any errors that occur.
First, let’s start by defining ASM_ERR. This structure has four variables. The first is a variable to store the line number (let’s call it lnum). Next, we need a character array to store the message (msg). We also need another character array called line to store the error line itself. Lastly, we need an object (obj), which should also be a character array. The obj will store the offending part of the line, such as the opcode, to help the user pinpoint where the issue occurred, whether it’s with the opcode, the operands, or something else.
#define ASM_LINE 128
// Structure to store error information
typedef struct {
int lnum; // Line number where the error occurred
char msg[MAX_STR]; // Message describing the error
char line[ASM_LINE]; // The line where the error occurred
char obj[MAX_STR]; // The specific object (opcode, operand, etc.) causing the error
} ASM_ERR;
The second structure we need to create to assist in writing ASMBL is ASM_LEN, which will help the assemble function and ASMBL keep track of memory usage and the generated words from opcodes.
typedef struct {
int mem; // Number of used memory locations
int words; // Total number of generated words
} ASM_LEN;
Now that we’ve written both ASM_ERR and ASM_LEN, let’s define ASMBL:
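#define MAX_CODE 512 // Assumed value (not given in the text): 512 words is the largest program memory in this family
typedef struct {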
int mcode[MAX_CODE]; // Machine codes
char lines[MAX_STR][ASM_LINE]; // lines (verbose)
ASM_ERR err; // Error struct
ASM_LEN len; // Length struct
int ecode; // exit code
} ASMBL;
We’ve already discussed the ASM_ERR and ASM_LEN structures, as well as the machine code (mcode) and verbose lines (lines) in the ASMBL structure. However, it would also be useful to have an exit code (ecode) parameter. This will allow us to handle different types of errors. For example, 0 could indicate that everything is fine, 1 could represent a general error, and other values could be used to indicate specific issues, such as an incorrect number of operands.
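One way to keep such exit codes readable is an enum with a small lookup function. In the sketch below only the values 0 and 1 come from the text; the other values, the enum name asm_ecode_t, and the function ecode_to_string are assumptions made for illustration.

```c
/* Exit codes: 0 and 1 follow the text; the remaining values are
   hypothetical examples of more specific errors. */
typedef enum {
    ASM_OK = 0,
    ASM_ERR_GENERAL = 1,
    ASM_ERR_OPERAND_COUNT = 2,  /* Assumed value */
    ASM_ERR_UNKNOWN_OPCODE = 3  /* Assumed value */
} asm_ecode_t;

/* Map an exit code to a short description for error output. */
const char *ecode_to_string(int ecode) {
    switch (ecode) {
    case ASM_OK:                 return "ok";
    case ASM_ERR_GENERAL:        return "general error";
    case ASM_ERR_OPERAND_COUNT:  return "wrong number of operands";
    case ASM_ERR_UNKNOWN_OPCODE: return "unknown opcode";
    default:                     return "unrecognized exit code";
    }
}
```

Because ecode is stored as a plain int in ASMBL, these constants could be assigned to it directly without changing the structure.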
Now that we have defined the three structures (ASMBL, ASM_ERR, and ASM_LEN), let’s define a function for each of them to initialize the structures. Since C does not initialize local (automatic) variables, a freshly declared structure holds indeterminate junk values, so we need to set its fields to 0. Each function will take a pointer to its respective type to initialize it.
/* initialize ASM_ERR */
void empty_err(ASM_ERR *err){
err->lnum = 0;
memset(err->msg, 0, sizeof(err->msg));
memset(err->obj, 0, sizeof(err->obj));
memset(err->line, 0, sizeof(err->line));
}
/* initialize ASM_LEN */
void empty_asmlen(ASM_LEN *len){
len->mem = 0;
len->words = 0;
}
/* initialize ASMBL */
void empty_asm(ASMBL *asmbl){
asmbl->ecode = 0;
empty_err(&asmbl->err);
empty_asmlen(&asmbl->len);
memset(asmbl->mcode, 0, sizeof(asmbl->mcode));
memset(asmbl->lines, 0, sizeof(asmbl->lines));
}
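As an aside, C can also zero a structure at the point of declaration with the = { 0 } initializer, which sets every member, including nested structures and arrays, to zero. The sketch below uses reduced, hypothetical stand-ins (demo_err_t and demo_asm_t) rather than the real ASM_ERR and ASMBL; the dedicated functions above remain useful when an existing structure has to be reset.

```c
/* Hypothetical stand-ins for ASM_ERR and ASMBL, reduced to a few fields. */
typedef struct {
    int lnum;
    char msg[32];
} demo_err_t;

typedef struct {
    int mcode[8];
    demo_err_t err;
    int ecode;
} demo_asm_t;

/* Return a structure whose members are all zero. */
demo_asm_t make_empty_asm(void) {
    demo_asm_t a = { 0 }; /* Members not listed are zero-initialized too */
    return a;
}
```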
Now that we have defined the ASMBL structure, let’s write the assemble function. We need an input, which we already have from the io_read function, of type TBL that stores each line. We also need a pointer to the ASMBL structure to help the assemble function load its data into it. The assemble function will use a for loop to go through each line and process it.
void assemble(ASMBL *asmbl, TBL *input_tbl){
    empty_asm(asmbl); // Initialize the 'ASMBL' struct
    TBL tbl; // Working copy of the input, so later steps can edit lines without changing the caller's table
    copytbl(&tbl, input_tbl);
    OPR oprs; // Operands
    for(int i = 0; i < tbl.len; ++i){
        str_break(tbl.lines[i], &oprs); // Load the line's tokens into 'oprs'
        // Define and load the opcode (bounded copy, so a long token cannot overflow)
        char opcode[20];
        snprintf(opcode, sizeof(opcode), "%s", oprs.lines[0]);
        asmbl->err.lnum = i + 1; // Set the lnum in ASM_ERR to the current line
        // Load the current line into ASM_ERR (truncated to ASM_LINE if needed)
        snprintf(asmbl->err.line, sizeof(asmbl->err.line), "%s", tbl.lines[i]);
        // Check whether the opcode is "GOTO"
        if(strcmp(opcode, "GOTO") == 0){
            char *endptr;
            // Extract the numeric value of oprs.lines[1] (base 10)
            long result = strtol(oprs.lines[1], &endptr, 10);
            // Valid only if digits were read, nothing follows them, and the value fits in 9 bits
            if(endptr != oprs.lines[1] && *endptr == '\0' && result >= 0 && result <= 0x1FF){
                int code = 0b101000000000 | (int)result;
                asmbl->mcode[asmbl->len.words++] = code;
            } else {
                // Load ASM_ERR and stop at the first error
                strcpy(asmbl->err.msg, "Invalid Operand");
                strcpy(asmbl->err.obj, oprs.lines[1]);
                asmbl->ecode = 1;
                return;
            }
        } else {
            // Load ASM_ERR and stop at the first error
            strcpy(asmbl->err.msg, "Invalid Opcode");
            strcpy(asmbl->err.obj, opcode);
            asmbl->ecode = 1;
            return;
        }
    }
}
This is what the assemble function looks like, but there are some problems. First, it only handles the GOTO opcode. There is no dedicated function to update the ASM_ERR field in the ASMBL structure, so we have to handle it manually for each error occurrence. Additionally, the GOTO opcode only processes base-10 integers, but the assembler must be able to handle different formats, such as hexadecimal or even binary. Lastly, if there is an empty line or a comment, the assembler fails with an "Invalid Opcode" error because it cannot skip such lines.
First, let’s start by handling useless lines and comments.
Challenge 5: Useless parts
In assembly language, everything after ; is treated as a comment, but only if the ; is not between two single quotes.
; This is a comment in assembly language
It would be useful to have a function that updates the given line by trimming all its whitespace and removing comments, allowing us to easily detect and compare lines using strcmp. Another advantage is that if the comment appears after the opcode, this method ensures the comment will not pass into str_break, resulting in cleaner operands.
// Detect empty line
if(strcmp(line, "") == 0){
continue;
}
So, let’s write a function that removes all leading and trailing whitespace from the given line (character array char *) and updates the line accordingly.
// Note: isspace() requires #include <ctype.h>
void str_trim(char buff[]) {
// If the buffer is NULL, exit the function
if (buff == NULL) {
return;
}
// Trim leading whitespace
char *start = buff; // Pointer to traverse the beginning of the string
while (isspace((unsigned char)*start)) {
start++; // Move the pointer forward while encountering whitespace
}
// If leading whitespace is found, shift the string to remove it
if (start != buff) {
char *dst = buff; // Pointer to write the trimmed string
while (*start) {
*dst++ = *start++; // Copy characters from start to destination
}
*dst = '\0'; // Null-terminate the trimmed string
}
// Trim trailing whitespace
size_t len = strlen(buff); // Length after the leading whitespace was removed
while (len > 0 && isspace((unsigned char)buff[len - 1])) {
buff[--len] = '\0'; // Move backwards and replace trailing whitespace with null terminators
}
}
The str_trim function will help us achieve this. Now, we need a function to remove comments from the given character array (char *).
/* skip_comment: remove comments */
void skip_comment(char buff[]) {
int i = 0; // Index to traverse the character array
int quote = 0; // Flag to track if inside a quote
str_trim(buff); // Trim leading and trailing whitespace from the input string
// Traverse the string character by character
while (buff[i] != '\0') {
// Toggle the quote flag if a single quote is encountered
if (buff[i] == '\'') {
quote = quote == 0; // Toggle quote flag
}
// If a semicolon is found outside of quotes, terminate the string
if (buff[i] == ';' && quote == 0) {
buff[i] = '\0'; // Replace the semicolon with a null terminator
break; // Exit the loop, as the comment has been removed
}
i++; // Move to the next character
}
}
The skip_comment function will remove everything after the ; character, trimming all the whitespace beforehand using the str_trim function.
Now, we just need to add the str_trim and skip_comment functions into the assemble function’s loop.
//...
for(i = 0; i < tbl.len; ++i){
skip_comment(tbl.lines[i]);
str_trim(tbl.lines[i]);
if(strcmp(tbl.lines[i], "") == 0){ continue; }
//...
Now that we are able to remove comments, it would be useful to remove commas (,) as well, because multi-operand opcodes separate their operands with commas. Replacing all commas outside of quotes with whitespace will greatly simplify processing for str_break. Additionally, if a line contains only commas, it will be ignored due to the str_trim and strcmp logic we already added. Let’s implement a function to replace specific characters.
/* char_replace: replaces all occurrences of 'src' with 'dst' in the given string 'buff',
but skips characters inside single quotes. Returns 0 after completion. */
int char_replace(char buff[], char src, char dst) {
int i = 0; // Index to traverse the character array
int quote = 0; // Flag to track if inside a quote
str_trim(buff); // Trim leading and trailing whitespace from the input string
// Traverse the string character by character
while (buff[i] != '\0') {
if (buff[i] == '\'') {
quote = quote == 0; // Toggle the quote flag when a single quote is encountered
}
// Replace 'src' with 'dst' if found outside of quotes
if (buff[i] == src && quote == 0) {
buff[i] = dst; // Perform the replacement
}
i++; // Move to the next character
}
return 0; // Return 0 after the operation is complete
}
The char_replace function takes a buffer of type char *, a src character, and a dst character. It loops through the buffer and replaces any occurrences of the src character with the dst character, but only if the src character is not between single quotes.
Now that we have implemented the char_replace function, let’s add it to the loop in the assemble function.
//...
for(i = 0; i < tbl.len; ++i){
char_replace(tbl.lines[i], ',', ' '); // Replace commas with whitespace
skip_comment(tbl.lines[i]);
str_trim(tbl.lines[i]);
if(strcmp(tbl.lines[i], "") == 0){ continue; }
//...
Now we have managed to remove all the unnecessary parts of our code using these functions.
Challenge 6: Labels and EQUs
If you look at how “GOTO” behaves in the previous post, you’ll notice that we can pass labels for the GOTO address!
start:
; Do something
GOTO end
end:
GOTO start
A label contains an address, just like 0x06 or 42. However, notice that GOTO should not only be able to use an address defined before it (e.g., start), but also reference addresses that come after it (e.g., end), even though we haven’t reached them in the loop yet!
To handle this, we need to determine these addresses before processing.
The same applies to EQU, but with a key difference: we don’t need the values of EQU until we encounter them in the code.
To achieve this, we need a set of functions to store and retrieve these addresses and EQU values from a list (or array). Let’s write these functions first, so we can implement a for loop before the main processing loop to handle labels and EQU definitions.
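Before building the real storage functions, the idea of a first pass can be sketched in isolation. The example below assumes that every line is already trimmed and free of comments, that a label is a line ending in :, and that every other non-empty line is an instruction occupying one 12-bit word (EQU lines are left out). The names label_t, collect_labels, and find_label are hypothetical and are not the functions developed below.

```c
#include <string.h>

#define MAX_LABELS 16
#define NAME_LEN 32

typedef struct {
    char name[NAME_LEN];
    int address;
} label_t;

/* First pass: record the address of every "name:" line.
   Returns the number of labels stored. */
int collect_labels(const char *lines[], int count, label_t labels[], int max_labels) {
    int address = 0; /* Address of the next instruction word */
    int found = 0;
    for (int i = 0; i < count; ++i) {
        size_t len = strlen(lines[i]);
        if (len > 1 && lines[i][len - 1] == ':') {
            /* A label points at the next instruction, so it takes no word itself */
            if (found < max_labels && len - 1 < NAME_LEN) {
                memcpy(labels[found].name, lines[i], len - 1);
                labels[found].name[len - 1] = '\0';
                labels[found].address = address;
                found++;
            }
        } else if (len > 0) {
            address++; /* Assumption: one word per instruction line */
        }
    }
    return found;
}

/* Second pass helper: returns the label's address, or -1 if it is unknown. */
int find_label(const label_t labels[], int count, const char *name) {
    for (int i = 0; i < count; ++i) {
        if (strcmp(labels[i].name, name) == 0) {
            return labels[i].address;
        }
    }
    return -1;
}
```

For the four lines start:, GOTO end, end:, and GOTO start, the first pass stores start at address 0 and end at address 1, so the forward reference in GOTO end can be resolved when the main loop runs.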
First, let’s define a struct that allows us to store data like a dictionary, with a key and a value.
typedef struct {
char key[MAX_STR];
int value;
} DICT;
The DICT allows us to associate a value with a specific key. This is particularly useful for storing all the labels and EQUs.
We also need an enum to specify whether we want to store a label or an EQU, and we’ll call it elem_t, meaning element type.
typedef enum {
EQU_ELEMENT,
LABEL_ELEMENT,
} elem_t;
Now that we have defined our enum and structures, let’s define some static global variables, one for storing labels and one for storing EQUs.
We also need to define two int variables to keep track of the length of each array.
/* EQU */
static DICT equ_arr [128];
static int equ_arr_len = 0;
/* LABEL */
static DICT label_arr [128];
static int label_arr_len = 0;
Now that the variables are defined, let’s start by writing our save_element function. But there’s an issue: nothing stops us from saving two labels with the same name, which would make lookups ambiguous later. Wouldn’t it be nice to have a function that helps us determine if the element already exists? Let’s write this function and call it elem_contains, which will take only the type (elem_t) and name as arguments. The value won’t matter for us because we only want to ensure that each name is unique.
/* element contains: checks if an element with the given name exists in the specified array */
int elem_contains(elem_t type, char name[]) {
int i;
// If the element type is EQU_ELEMENT, search in the equ_arr array
if (type == EQU_ELEMENT) {
for (i = 0; i < equ_arr_len; i++) {
// If the name matches an existing key in equ_arr, return 1
if (strcmp(equ_arr[i].key, name) == 0) {
return 1;
}
}
} else {
// Otherwise, search in the label_arr array
for (i = 0; i < label_arr_len; i++) {
// If the name matches an existing key in label_arr, return 1
if (strcmp(label_arr[i].key, name) == 0) {
return 1;
}
}
}
return 0; // Return 0 if the element with the given name is not found
}
The elem_contains function looks at the specified array based on elem_t. If the name exists, it returns 1 (TRUE); if the name doesn’t exist, it returns 0 (FALSE).
Now that we can check for duplicates, let’s write a function that stores the elements in the specified array based on type. The function will take name and value as input. If the element already exists in the array, the function will return 1 (indicating failure). Otherwise, it will return 0 (indicating the element has been saved correctly).
int save_element(elem_t type, char name[], int value) {
// Check if the element already exists in the corresponding array (either equ_arr or label_arr)
if (elem_contains(type, name)) {
return 1; // Return 1 if the element already exists
}
// If the element type is EQU_ELEMENT, store the element in the equ_arr array
if (type == EQU_ELEMENT) {
strcpy(equ_arr[equ_arr_len].key, name); // Copy the name to the key field
equ_arr[equ_arr_len].value = value; // Set the value for the element
equ_arr_len++; // Increment the length of the equ_arr array
} else {
// If the element type is not EQU_ELEMENT, store it in the label_arr array
strcpy(label_arr[label_arr_len].key, name); // Copy the name to the key field
label_arr[label_arr_len].value = value; // Set the value for the element
label_arr_len++; // Increment the length of the label_arr array
}
return 0; // Return 0 to indicate the element has been successfully saved
}
Now that we can store our labels and EQUs correctly, it’s important to have a way to retrieve these elements. Let’s write the get_element function, which will allow us to fetch the value associated with a given name from the specified array by type. To do this, we can use a pointer to return the value, or we can return the value directly. Since a label or an EQU might have a value of 0, returning 0 for a missing element would be indistinguishable from a valid value. Instead, we will return -1 to indicate that the element doesn’t exist in the specified array, as -1 is a value that won’t be used in valid addresses or EQU values.
/* get_element: returns -1 if the element with the given name does not exist */
int get_element(elem_t type, char name[]) {
// If the element does not exist in the corresponding array, return -1
if (elem_contains(type, name) == 0) {
return -1;
}
// Determine the maximum length based on the element type (either equ_arr or label_arr)
int max = type == EQU_ELEMENT ? equ_arr_len : label_arr_len;
// Loop through the appropriate array (either equ_arr or label_arr)
for (int i = 0; i < max; ++i) {
// If the element is of type EQU_ELEMENT, compare with the equ_arr array
if (type == EQU_ELEMENT) {
if (strcmp(equ_arr[i].key, name) == 0) { // Check if the name matches
return equ_arr[i].value; // Return the value if found
}
} else {
// If the element is not EQU_ELEMENT, compare with the label_arr array
if (strcmp(label_arr[i].key, name) == 0) { // Check if the name matches
return label_arr[i].value; // Return the value if found
}
}
}
return -1; // Return -1 if the element is not found in the array
}
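The pointer-based alternative mentioned above can be sketched as follows. This is only an illustration: entry_t and find_value are hypothetical names that are not part of the assembler in this post.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical key/value entry, similar in spirit to the EQU and label arrays. */
typedef struct {
    char key[32];
    int value;
} entry_t;

/* find_value: returns 1 and writes *out when name is found, 0 otherwise.
 * The result travels through the pointer, so no int has to be reserved
 * as a "not found" marker. */
int find_value(const entry_t *arr, int len, const char *name, int *out) {
    assert(arr != NULL && name != NULL && out != NULL);
    for (int i = 0; i < len; ++i) {
        if (strcmp(arr[i].key, name) == 0) {
            *out = arr[i].value;
            return 1;
        }
    }
    return 0;
}
```

The trade-off is an extra parameter at every call site, which is why returning -1 directly is the simpler choice when negative values are never valid.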
Now that we have functions for storing and retrieving EQUs and labels, let’s detect them in our assemble function.
Challenge 7: Preprocessing for Labels and EQUs
Because processing each line requires the EQUs and labels to be known already, it makes sense to add another for loop to the assemble function, just before the main processing loop (the one used for GOTO). This loop behaves like the main loop (skipping empty lines, comments, etc.), but it looks for the EQU keyword and checks for a : at the end of the line to determine whether the line is a label. If the line is neither a label nor an EQU, we skip it; otherwise, we attempt to store it. If storing fails (returns 1), we report an error and terminate, because duplicate labels or EQUs are not allowed.
Since we are going to report detailed errors, it helps to have a small function, update_err, that fills in the ASM_ERR data inside the ASMBL structure.
void update_err(ASMBL *asmbl, const char *msg, const char *obj){
// Set message (msg) if possible
if(msg != NULL){
strcpy(asmbl->err.msg, msg);
}
// Set object (obj) if possible
if(obj != NULL){
strcpy(asmbl->err.obj, obj);
}
// Set exit code to 1 (error)
asmbl->ecode = 1;
}
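Because strcpy does not check the size of the destination, a very long message or object name could overflow the error fields. A minimal sketch of a bounded variant, assuming a hypothetical err_sketch_t structure and a field size of 64 (the real ASM_ERR layout may differ):

```c
#include <assert.h>
#include <stdio.h>

#define ERR_FIELD 64 /* assumed field size, for illustration only */

/* Hypothetical stand-in for the error fields of ASMBL. */
typedef struct {
    char msg[ERR_FIELD];
    char obj[ERR_FIELD];
    int ecode;
} err_sketch_t;

/* set_err: same idea as update_err, but snprintf truncates long input
 * instead of writing past the end of the field. */
void set_err(err_sketch_t *e, const char *msg, const char *obj) {
    assert(e != NULL);
    if (msg != NULL) {
        snprintf(e->msg, sizeof e->msg, "%s", msg);
    }
    if (obj != NULL) {
        snprintf(e->obj, sizeof e->obj, "%s", obj);
    }
    e->ecode = 1; /* mark failure */
}
```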
First, let’s implement the detection for EQU because it’s simpler and doesn’t require checking for : at the end. For this, we use the strstr function from the standard C library (declared in string.h).
The process is straightforward: each EQU line follows the same structure. First, there is the name, followed by the EQU keyword, and finally, the value, like so:
GPIO EQU 6
For this, we can use the str_break function we created earlier to extract the operands from the given line. Now it’s clear how useful the str_break function is.
// ...
for(i = 0; i < tbl.len; ++i){
// Normalize the line, then skip it if it is empty
char_replace(tbl.lines[i], ',', ' ');
skip_comment(tbl.lines[i]);
str_trim(tbl.lines[i]);
if(strcmp(tbl.lines[i], "") == 0){ continue; }
// Check for EQU
if(strstr(tbl.lines[i], " EQU ") != NULL){
str_break(tbl.lines[i], &oprs);
int value = atoi(oprs.lines[2]); // Convert array of char to int
int failed = save_element(EQU_ELEMENT, oprs.lines[0], value);
if(failed){
update_err(asmbl, "EQU already exists", oprs.lines[0]);
return;
}
continue;
}
// Check for Label
// ...
}
// ...
for(i = 0; i < tbl.len; ++i){
// ...
Notice that we used atoi, a standard C function that only converts decimal strings to integers, such as "255" to 255. However, the value of an EQU can also be binary, like 0b00000110; hexadecimal in either form, such as 0x06 or 06H; or an ASCII character, such as 'H' or 'A' (which is why we check for quotes, as explained before). Operands have the same problem: in a line like MOVWF GPIO, GPIO is a previously defined EQU constant that must be resolved to its value.
Wouldn’t it be nice to have a function for that? A function that behaves similarly to the get_element function, returning a negative value like -1 if any error occurs and a valid value (>= 0) otherwise. Let’s call this function extract_value. However, before we start writing it, we must implement a set of functions to detect each numeric type as explained.
Let’s start with detecting characters, as it’s simple. We check the length and use the standard sscanf function to extract the character. If it’s an escape sequence like \n, we use a switch-case statement to map it to the character it represents. If no character is found, we return '\0', which has the value 0.
Let’s call the function quoted_letter.
char quoted_letter(char *str) {
char result = '\0';
char temp;
if(sscanf(str, "'%c'", &temp) == 1 && (int)strlen(str) == 3){
result = temp;
} else if(sscanf(str, "'\\%c'", &temp) == 1 && (int)strlen(str) == 4){
switch (temp) {
case 'n':
result = '\n';
break;
case 't':
result = '\t';
break;
case '\\':
result = '\\';
break;
default:
result = '\0';
break;
}
}
return result;
}
Now that we know how to detect character values, let’s move on to hex. We can create a function called hsti, which stands for “hex string to integer.” It handles the suffix form, such as 06H: it converts every character except the trailing H with hcti, which maps a single hex character to its integer value, and returns the combined result.
/* hsti: converts a hexadecimal string to an integer */
int hsti(const char *hexstr) {
int result = 0; // Stores the final converted integer
int length = strlen(hexstr); // Get the length of the hexadecimal string
// Iterate through each character in the string except the last one
for (int i = 0; i < length - 1; i++) {
int digit = hcti(hexstr[i]); // Convert the current hexadecimal character to its integer value
result = (result << 4) | digit; // Shift result by 4 bits and add the new digit
}
return result; // Return the converted integer
}
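Note that hsti trusts its input: it does not verify that each character is a hex digit. A minimal sketch of a validating variant is shown below. The names hex_digit and parse_hex_suffix are hypothetical, and hex_digit is only assumed to resemble the post’s hcti, which is not shown in this section.

```c
#include <assert.h>
#include <string.h>

/* hex_digit: value of one hex character, or -1 if it is not a hex digit. */
int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* parse_hex_suffix: parses strings such as "06H".
 * Returns 0..255 on success and -1 on malformed or too-large input. */
int parse_hex_suffix(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    if (len < 2 || s[len - 1] != 'H') return -1; /* need digits plus 'H' */
    int result = 0;
    for (size_t i = 0; i + 1 < len; ++i) {
        int d = hex_digit(s[i]);
        if (d < 0) return -1;          /* reject non-hex characters */
        result = (result << 4) | d;
        if (result > 0xFF) return -1;  /* keep the value within 8 bits */
    }
    return result;
}
```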
Next, we need to detect an 8-bit binary value that starts with 0b and extract its value. During detection, we check for 0b at the beginning and verify that exactly eight characters follow, each of which is either 0 or 1. A function to handle this is useful and can be extended later if needed. We also need another function to convert the binary string into an integer value.
Let’s write the detect_8bit_binary function, which returns 1 (TRUE) if the given string (char *) is a valid binary and 0 (FALSE) if it’s not. After that, we can create the btoi function to extract the integer value.
int detect_8bit_binary(char *input) {
int i;
// Check if the input starts with "0b"
if (strncmp(input, "0b", 2) != 0){
return 0;
}
// Check if the remaining part is 8 bits
if (strlen(input) - 2 != 8){
return 0;
}
// Check if all characters are either '0' or '1'
for (i = 2; i < (int)strlen(input); i++){
if (input[i] != '0' && input[i] != '1'){
return 0;
}
}
return 1; // Valid 8-bit binary pattern
}
Now that we are able to detect the string, let’s write a function to extract the value called btoi, which stands for binary to integer.
int btoi(const char *input){
int result = 0;
int power = 0;
input += 2; // Skip "0b"
for (int i = strlen(input) - 1; i >= 0; i--) {
if (input[i] == '1'){
result |= (1 << power); // Use bitwise OR to accumulate the value
}
power++;
}
return result;
}
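As an alternative to a hand-written loop, the standard strtol function accepts a base of 2. The sketch below combines detection and conversion in one function; parse_bin8 is a hypothetical name, not part of the assembler in this post.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* parse_bin8: parses "0b" followed by exactly 8 binary digits.
 * Returns 0..255 on success, -1 otherwise. */
int parse_bin8(const char *s) {
    assert(s != NULL);
    if (strncmp(s, "0b", 2) != 0 || strlen(s) != 10) return -1;
    if (strspn(s + 2, "01") != 8) return -1; /* only '0' and '1' allowed */
    return (int)strtol(s + 2, NULL, 2);      /* base 2 does the conversion */
}
```

The strspn check matters because strtol on its own would also accept a sign or leading whitespace.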
Now that we have all of the necessary functions, let’s write extract_value. This function takes a string (char *) and a flag, allow_equ, that controls whether the string may be resolved as an EQU name. This is useful because when we parse the value of an EQU definition, we don’t want previously defined EQU names to interfere.
int extract_value(char *inpt, int allow_equ) {
// An empty operand can never be a valid value
if (inpt == NULL || inpt[0] == '\0') {
return -1;
}
if (allow_equ) {
// Check if the input can be found as an EQU element
int result = get_element(EQU_ELEMENT, inpt);
if (result >= 0) {
return result; // Return the value if found in EQU
}
}
// Try interpreting as a quoted letter
char ch = 0;
if ((ch = quoted_letter(inpt)) != '\0') {
return (int)ch; // Return the ASCII value of the quoted letter
}
// Try interpreting as a hexadecimal number (ending with 'H')
int len = (int)strlen(inpt);
if (inpt[len - 1] == 'H') {
return hsti(inpt); // Convert the hex string to an integer
}
// Try interpreting as a decimal integer
char *endptr;
int num;
num = strtol(inpt, &endptr, 10);
if (strcmp(endptr, "") == 0 && (num >= 0 && num <= 255)) {
return num; // Return the decimal value if valid
}
// Try interpreting as an 8-bit binary number
if (detect_8bit_binary(inpt)) {
return btoi(inpt); // Convert the binary string to an integer
}
// Try interpreting as a hexadecimal number with '0X' prefix
num = strtol(inpt, &endptr, 16);
if (strcmp(endptr, "") == 0 && (num >= 0 && num <= 255)) {
return num; // Return the hexadecimal value if valid
}
return -1; // Return -1 if none of the conditions match
}
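The strtol calls above rely on endptr to detect trailing garbage. A slightly stricter pattern also checks that at least one digit was consumed. The sketch below shows it in isolation; parse_u8 is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* parse_u8: parses s in the given base; returns 0..255, or -1 on failure. */
int parse_u8(const char *s, int base) {
    assert(s != NULL);
    char *end;
    errno = 0;
    long v = strtol(s, &end, base);
    if (end == s) return -1;         /* no digits consumed (e.g. empty string) */
    if (*end != '\0') return -1;     /* trailing characters such as "12x" */
    if (errno == ERANGE) return -1;  /* value did not fit in a long */
    if (v < 0 || v > 255) return -1; /* outside the 8-bit range */
    return (int)v;
}
```

With base 16, strtol accepts an optional 0x or 0X prefix, which is why the last branch of extract_value can handle 0x06.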
Now, with the help of extract_value, we can update the first loop in the assemble function from using atoi to using extract_value.
// ...
// Check for EQU
if(strstr(tbl.lines[i], " EQU ") != NULL){
str_break(tbl.lines[i], &oprs);
/* Detect EQU value using extract_value function */
int value = extract_value(oprs.lines[2], 0);
if(value < 0){
update_err(asmbl, "Invalid EQU value", oprs.lines[2]);
return;
}
int failed = save_element(EQU_ELEMENT, oprs.lines[0], value);
if(failed){
update_err(asmbl, "EQU already exists", oprs.lines[0]);
return;
}
continue;
}
// ...
The detection for EQU is officially done. Now, we need to be able to detect labels in the loop.
The structure of labels is simple: a word followed by a : sign. The unique part is the : sign. We must check if a word contains : at the end. If it does, we detect it as a label; otherwise, it is not a label.
; label start
start:
; ...
; label end
end:
; ...
Now that we understand how labels work, we need a function to help us detect them.
Let’s write a function called char_contains that takes a buffer (char *) as the first argument and a char as the second argument, and checks if the char is contained in the buffer by looping through it.
int char_contains(char buff[], char c) {
int i = 0; // Index for iterating through the buffer
int quote = 0; // Flag to track if inside a quoted section
str_trim(buff); // Remove leading and trailing whitespace from the buffer
while (buff[i] != '\0') {
if (buff[i] == '\'') {
// Toggle the quote flag when encountering a single quote
quote = quote == 0;
}
if (buff[i] == c && quote == 0) {
return i; // Return the index of the first match outside of quotes
}
i++; // Move to the next character in the buffer
}
return -1; // Return -1 if the character is not found
}
The char_contains function returns the index of the first occurrence of the specified character in the given buffer, ignoring characters inside single quotes. If the character does not exist in the buffer, the function returns -1. Combined with the standard strlen function, this lets us check whether the : is the last character in the buffer.
Now, there is a problem: this function only tells us whether the : sign is present, so we would store the whole string, including the :, in the labels array. We don’t want the : at the end of the label name. For example, we want to write GOTO start, not GOTO start:. Therefore, we must remove the last character (:) from the line before storing it in the array.
Let’s call this simple function str_end. It removes end characters from the end of the string by writing '\0' at index length - end.
void str_end(char *buff, int end){
int len = (int)strlen(buff);
buff[len - end] = '\0';
}
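To see detection and trimming work together, here is a minimal sketch that does both in one step. The name strip_label is hypothetical, and unlike char_contains this simplified version does not skip quoted text.

```c
#include <assert.h>
#include <string.h>

/* strip_label: if line ends with ':' and has a name before it, removes the
 * ':' in place and returns 1. Otherwise returns 0 and leaves line untouched. */
int strip_label(char *line) {
    assert(line != NULL);
    size_t len = strlen(line);
    if (len < 2 || line[len - 1] != ':') return 0;
    /* The first ':' must also be the last character of the line. */
    if (strchr(line, ':') != &line[len - 1]) return 0;
    line[len - 1] = '\0';
    return 1;
}
```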
Now we can effectively detect labels in our preprocess loop within the assemble function. However, there is a problem: we don’t know the address of the detected label to provide its value to the save_element function. To solve this, we need a way to track valid parts of the code that result in machine code (mcode).
To achieve this, let’s define an int variable called codes. This variable counts the lines that produce an instruction: lines that are not comments, not empty, and neither EQU definitions nor labels. When we reach a label, the current value of codes is the address of the next instruction, which is exactly the value the label should have.
// ...
int codes = 0; // Keep track of valid codes (for label address)
for(i = 0; i < tbl.len; ++i){
// ...
// Detect EQU
// ...
// Check for label
int idx = char_contains(tbl.lines[i], ':'); // -1 when the line has no ':'
if(idx >= 0){
// Make sure that the last character is equal to ':'
if(idx != (int)strlen(tbl.lines[i]) - 1){
update_err(asmbl, "Invalid label syntax", tbl.lines[i]);
return;
}
str_break(tbl.lines[i], &oprs);
str_end(oprs.lines[0], 1);
int failed = save_element(LABEL_ELEMENT, oprs.lines[0], codes);
if(failed){
update_err(asmbl, "Label already exists", oprs.lines[0]);
return;
}
continue;
}
codes++; // Add to 'codes' by 1, meaning 1 more valid code
}
// ...
for(i = 0; i < tbl.len; ++i){
// ...
By using strlen(), we can check whether the : is at the end of the string. If it’s not, we update the ASM_ERR structure with the appropriate error. After that, we save the label (without the : at the end) to the labels array using save_element. If save_element returns 1, we terminate with the provided error. This prevents a label from being defined twice, just as we did for EQU.
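The address calculation can be illustrated in isolation. The sketch below is a simplified stand-in for the first pass, assuming the lines are already trimmed and free of comments; label_address is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* label_address: returns the program address of label `name` (given without
 * the ':'), or -1 if the label is not defined in lines[0..n-1]. */
int label_address(const char *lines[], int n, const char *name) {
    assert(lines != NULL && name != NULL);
    int codes = 0; /* instructions seen so far */
    size_t nlen = strlen(name);
    for (int i = 0; i < n; ++i) {
        const char *ln = lines[i];
        size_t len = strlen(ln);
        if (len == 0) continue;                    /* empty line */
        if (strstr(ln, " EQU ") != NULL) continue; /* constant definition */
        if (ln[len - 1] == ':') {                  /* label line */
            if (len - 1 == nlen && strncmp(ln, name, nlen) == 0) return codes;
            continue;
        }
        codes++; /* a real instruction occupies one program word */
    }
    return -1;
}
```

For the sample program used later in this post, start resolves to address 0 because no instruction precedes it.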
Challenge 8: Better Way To Handle Opcodes
Now that we have all the labels and EQUs, it’s time to process more opcodes, not just the GOTO opcode. We used the second loop in the assemble function to process opcodes, but we must expand it if we want to handle more opcodes. However, there is a problem: if we use strcmp with if-else statements to detect each opcode, the code will become messy. A cleaner solution is to provide an array of names with their corresponding handlers. When we find a matching name, we call its handler to get the machine code. This approach is far cleaner and easier to extend than the if-else method. Let’s create such a structure.
First, we need a structure to help us with this task. It should have a label and a function pointer to allow us to call the correct handler.
But what parameters should we give to the function pointer (handler)? Since we need to update errors, we require the ASMBL structure. Therefore, we will pass a pointer to ASMBL. We also have operands, which are generated by str_break, so we need to pass those as well. However, there’s a problem. The str_break function breaks down the entire line, but we only need the operands. To address this, we must write a function that shifts all operands one place to the left and removes the first operand (which is the opcode). We already store the opcode in the opcode character array (char *).
Additionally, we need a TBL to store the unmodified, exact same lines. This will allow us to update the line number (lnum) and line error (line) in the ASM_ERR field of the ASMBL structure.
So before we dive in further, let’s define the structure to make things clearer, and we will call it OP_HNDL.
typedef struct OP_HNDL {
char *lable;
int (*func)(ASMBL *, OPR *);
} OP_HNDL;
Now that we have the OP_HNDL structure, let’s create a simple handler for GOTO. However, before that, we need to provide operands by removing the first element, which is the opcode itself, leaving only the operands.
void shift_lines_left(OPR *tbl) {
if (tbl == NULL || tbl->len <= 0) return; // Handle null pointer or empty lines
for (int i = 1; i < tbl->len; i++) {
memcpy(tbl->lines[i - 1], tbl->lines[i], MAX_STR); // Move line i to i-1
}
memset(tbl->lines[tbl->len - 1], 0, MAX_STR); // Clear the last line
tbl->len--; // Decrease the length of lines
}
void copy_shift_oprs(OPR *dst, OPR *src) {
int i;
for(i = 0; i < src->len; ++i) {
strcpy(dst->lines[i], src->lines[i]); // Copy lines from src to dst
}
dst->len = src->len; // Set length of dst to match src
shift_lines_left(dst); // Shift lines in dst to the left
}
There are two functions here: shift_lines_left and copy_shift_oprs. The shift_lines_left function shifts all of the lines to the left by one and decrements len in the OPR structure. The copy_shift_oprs function copies the src OPR into the dst OPR and then calls shift_lines_left on dst. This removes the first element of the operands, which is the opcode itself, while leaving src untouched.
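Because the rows of a two-dimensional char array are contiguous in memory, the shift can also be done with a single memmove. The sketch below uses a hypothetical opr_sketch_t type with assumed sizes; it is not the post’s OPR definition.

```c
#include <assert.h>
#include <string.h>

#define SK_MAX_LINES 4  /* assumed capacity, for illustration only */
#define SK_MAX_STR   32 /* assumed line size, for illustration only */

/* Hypothetical stand-in for the OPR structure. */
typedef struct {
    char lines[SK_MAX_LINES][SK_MAX_STR];
    int len;
} opr_sketch_t;

/* drop_first: removes lines[0] by sliding the remaining rows down. */
void drop_first(opr_sketch_t *o) {
    assert(o != NULL);
    if (o->len <= 0) return;
    /* One memmove shifts rows 1..len-1 into rows 0..len-2. */
    memmove(o->lines[0], o->lines[1], (size_t)(o->len - 1) * SK_MAX_STR);
    memset(o->lines[o->len - 1], 0, SK_MAX_STR); /* clear the vacated row */
    o->len--;
}
```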
Before we update the assemble function, let’s create a handler for GOTO. We can add more handlers later in the post for each opcode.
/* {GOTO} */
int handle_goto(ASMBL *asmbl, OPR *operands){
char *label = operands->lines[0];
int lvalue = get_element(LABEL_ELEMENT, label);
if(lvalue >= 0){
return 0xA00 | lvalue; // 0b101000000000
}
lvalue = extract_value(label, 1);
if(lvalue < 0){
update_err(asmbl, "Invalid label", label);
return -1;
}
return 0xA00 | lvalue; // 0b101000000000
}
It helps that all of our handlers follow the same naming scheme: handle_ followed by the opcode’s name, like handle_goto. Handlers also use the same return convention as get_element: -1 indicates an error, and a value >= 0 is valid machine code.
The handle_goto function first checks the labels array. If the label is not found (e.g., GOTO start where start is not defined as a label), it uses the extract_value function to check for different types of values such as hex, binary, or decimal. Finally, it returns the generated opcode.
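The encoding step itself is a single bitwise OR. As a sketch, the helper below (encode_goto, a hypothetical name) also checks that the address fits in the 9-bit field that follows the 0b101 identifier in a 12-bit word.

```c
#include <assert.h>

/* encode_goto: builds the 12-bit GOTO word (101k kkkk kkkk).
 * Returns -1 when addr does not fit in the 9-bit address field. */
int encode_goto(int addr) {
    if (addr < 0 || addr > 0x1FF) return -1;
    return 0xA00 | addr;
}

/* Usage: GOTO 0x03 assembles to 0b101000000011. */
void encode_goto_example(void) {
    assert(encode_goto(0x03) == 0xA03);
    assert(encode_goto(0x200) == -1);
}
```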
Now let’s use the handle_goto in the assemble function. Just after processing the labels and the EQU loop, and before the main process loop, let’s create an array of type OP_HNDL to store our handlers like below, and a value called oplen to help us determine the length of our array.
// ...
OP_HNDL hndls[] = {
{"GOTO", handle_goto},
// more handlers...
};
int oplen = sizeof(hndls) / sizeof(hndls[0]); // length of handlers array
// ...
Now we update our second (main) loop!
static TBL tbl; // Original lines
static OPR opr; // Operands
void assemble(ASMBL *asmbl, TBL *input_tbl){
// Clear 'tbl' and load 'input_tbl' to 'tbl'
tbl.len = 0;
memset(tbl.lines, 0, sizeof(tbl.lines));
copytbl(&tbl, input_tbl);
int i = 0;
OPR oprs;
// The label and EQU preprocessor
OP_HNDL hndls[] = {
{"GOTO", handle_goto},
// More opcodes
};
int oplen = sizeof(hndls) / sizeof(hndls[0]); // length of handlers array
for(i = 0; i < tbl.len; ++i){
skip_comment(tbl.lines[i]);
str_trim(tbl.lines[i]);
if(strcmp(tbl.lines[i], "") == 0){ continue; } // Skip empty line
if(strstr(tbl.lines[i], " EQU ") != NULL){ continue; } // Skip EQU
if(char_contains(tbl.lines[i], ':') >= 0){ continue; } // Skip label (-1 means no ':')
int j; // For checking opcodes
int opfound = 0; // any OPcode FOUND
// Update 'lnum' and 'line' in ASM_ERR
strcpy(asmbl->err.line, input_tbl->lines[i]);
asmbl->err.lnum = i + 1;
// define variable 'opcode'
str_break(tbl.lines[i], &oprs);
char opcode[20];
strcpy(opcode, oprs.lines[0]);
// A loop for checking opcodes
for(j = 0; j < oplen; j++){
// check the opcode
if(strcmp(hndls[j].lable, opcode) == 0){
opfound = 1; // Set `opfound` to 1 (match opcode found)
// clear 'opr' and remove the first item (opcode)
opr.len = 0;
memset(opr.lines, 0, sizeof(opr.lines));
copy_shift_oprs(&opr, &oprs);
// Call the handler
int instruction = hndls[j].func(asmbl, &opr);
if(instruction >= 0){
// Add machine code to 'mcode'
asmbl->mcode[asmbl->len.words] = instruction;
asmbl->len.words++;
} else {
// an error happened
update_err(asmbl, "Failed to process opcode", opcode);
asmbl->ecode = 1;
return;
}
}
}
// End the program if no matching opcode was found
if(opfound == 0){
update_err(asmbl, "Invalid opcode", oprs.lines[0]);
return;
}
}
}
The opfound variable tracks whether any handler matched the opcode. If none did, the word is not a valid instruction, and we report an error after the inner loop. When a handler does match, we check the returned instruction: if it is 0 or positive, we store it in mcode and increment words; otherwise, we report that the opcode could not be processed.
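The table-driven lookup can be shown on its own with a simplified handler signature. In this sketch, handler_fn, op_entry_t, sk_table, and find_handler are hypothetical names; the real handlers take ASMBL * and OPR * instead of a single int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*handler_fn)(int operand); /* simplified handler signature */

typedef struct {
    const char *name;
    handler_fn func;
} op_entry_t;

static int sk_goto(int addr) { return 0xA00 | (addr & 0x1FF); }
static int sk_nop(int unused) { (void)unused; return 0x000; }

static const op_entry_t sk_table[] = {
    { "GOTO", sk_goto },
    { "NOP",  sk_nop  },
};

/* find_handler: linear search by mnemonic; returns NULL for unknown opcodes. */
handler_fn find_handler(const char *mnemonic) {
    assert(mnemonic != NULL);
    for (size_t i = 0; i < sizeof sk_table / sizeof sk_table[0]; ++i) {
        if (strcmp(sk_table[i].name, mnemonic) == 0) return sk_table[i].func;
    }
    return NULL;
}
```

Adding an opcode then means adding one row to the table, with no change to the lookup code.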
Challenge 9: Providing Verbose Log
In our assemble function, we used the following line to add machine code:
asmbl->mcode[asmbl->len.words] = instruction;
But the ASMBL structure also has a lines property. Wouldn’t it be nice to update the lines too, now that we have all the operands, machine code, etc.?
So, let’s provide a set of functions to help us do that!
Input file:
GPIO EQU 0x06
start:
BSF GPIO, 0
NOP
BCF GPIO, 0
GOTO start
Output (verbose log):
BSF 0x06 0                 0b010100000110
NOP                        0b000000000000
BCF 0x06 0                 0b010000000110
GOTO 0x00                  0b101000000000
I think for viewing the binary, having the code itself along with it would be much more helpful. So, let’s write a function that attaches the operands and converts the first one to its numeric value.
#include <stdarg.h>
/* sstrcatf: strcat with printf-style formatting, built on stdarg.h */
void sstrcatf(char* dst, const char * frmt, ...){
char tmp[MAX_STR];
va_list arglist;
va_start(arglist, frmt);
vsnprintf(tmp, sizeof(tmp), frmt, arglist); // bounded write, never overflows tmp
va_end(arglist);
strcat(dst, tmp);
}
void strfy_inst(OPR *ops, char buff[]){
// Check for numeric value
int first = extract_value(ops->lines[0], 1);
if(first == -1){
// Check for label name
first = get_element(LABEL_ELEMENT, ops->lines[0]);
}
// Set to 0 if it's not label and it's not a numeric value
if(first == -1){ first = 0; }
// Update the buffer using 'sstrcatf'
if(ops->len == 1){
sstrcatf(buff, "0x%.2X", first);
} else if(ops->len == 2){
sstrcatf(buff, "0x%.2X %s", first, ops->lines[1]);
}
}
The strfy_inst function takes a pointer to the operands (OPR *) and a buffer (char buff[]), joins the operands together, and appends them to the buffer. The first operand is converted to its numeric value because that’s more helpful. The sstrcatf function appends to a string like strcat, but takes a format like printf, using the standard stdarg.h header to do that.
Now we need to convert the machine code to a 12-bit binary string. Let’s write a function to do that. Note that the string starts with 0b and ends with '\0', so we need a buffer of 2 + 12 + 1 = 15 characters. Let’s call it itob, for integer to binary.
/* itob: integer to binary */
void itob(int num, char *binary) {
binary[0] = '0';
binary[1] = 'b';
for (int i = 11; i >= 0; i--) {
binary[13 - i] = (num & (1 << i)) ? '1' : '0';
}
binary[14] = '\0'; // Null-terminate the string
}
This function will convert the given number num to binary and update the buffer binary. For example, if the input is 255, the updated buffer will be: 0b000011111111.
Now we can use these functions to generate verbose logs, but the output is not aligned. For example, if the input is:
GPIO EQU 0x06
start:
BSF GPIO, 0
NOP
BCF GPIO, 0
GOTO start
The output will be:
BSF 0x06 0 0b010100000110
NOP 0b000000000000
BCF 0x06 0 0b010000000110
GOTO 0x00 0b101000000000
It would be nice to pad every instruction string with spaces up to a fixed width.
Let’s write a function to do that. We can call it fill_space; it takes a buffer and a numeric value that specifies the width to fill.
void fill_space(char *buff, int len){
for(int i = 0; i < len; i++){
if(buff[i] == '\0'){
buff[i] = ' ';
}
}
}
We initialize our line buffers with = { 0 }, which sets every byte to zero. fill_space then replaces any '\0' bytes among the first len positions with spaces. Because the rest of the buffer is still zero, the string remains null-terminated.
Now, let’s update the assemble function’s loop where it updates the machine code to also update the lines.
// ...
if(instruction >= 0){
asmbl->mcode[asmbl->len.words] = instruction;
// Update verbose line
char line[MAX_STR] = { 0 };
char bin[15] = { 0 };
strfy_inst(&opr, line);
itob(instruction, bin);
char prefix[MAX_STR] = { 0 };
sstrcatf(prefix, "%s %s", opcode, line);
fill_space(prefix, 20);
sprintf(asmbl->lines[asmbl->len.words], "%s %20s", prefix, bin);
asmbl->len.words++;
} else {
// ...
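As an aside, printf-style formatting can do the padding by itself: a - flag in the width specifier left-justifies the text and pads on the right. A minimal sketch of that alternative, where format_log_line is a hypothetical name:

```c
#include <assert.h>
#include <stdio.h>

/* format_log_line: left-justifies text in a 20-character column, then
 * appends the binary string. Returns the number of characters written. */
int format_log_line(char *out, size_t size, const char *text, const char *bin) {
    assert(out != NULL && text != NULL && bin != NULL);
    /* "%-20s" pads on the right with spaces, so no manual fill is needed. */
    return snprintf(out, size, "%-20s %s", text, bin);
}
```

This is a design alternative rather than a replacement: fill_space keeps the padding logic explicit, which is useful when the buffer is reused in other ways.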
Challenge 10: Handler for Opcodes with No Operands
Some opcodes don’t need any operands, and their machine code is a fixed number every time: for example, NOP, CLRW, and SLEEP.
Writing handlers for them is easy, since each one just returns a value. Our handlers must still accept an ASMBL * and an OPR * so that they match the function pointer type in OP_HNDL, so we declare both parameters but never use them.
/* {CLRWDT} */
int handle_clrwdt(ASMBL *_, OPR *__){
return 0x04; // 0b000000000100
}
/* {NOP} */
int handle_nop(ASMBL *_, OPR *__){
return 0x000; // 0b000000000000
}
/* {SLEEP} */
int handle_sleep(ASMBL *_, OPR *__){
return 0x003; // 0b000000000011
}
/* {CLRW} */
int handle_clrw(ASMBL *_, OPR *__){
return 0x040; // 0b000001000000
}
/* {OPTION} */
int handle_option(ASMBL *_, OPR *__){
return 0x002; // 0b000000000010
}
We can add them to the array of handlers in the assemble function.
OP_HNDL hndls[] = {
// ...
{"NOP", handle_nop},
{"SLEEP", handle_sleep},
{"CLRW", handle_clrw},
{"CLRWDT", handle_clrwdt},
{"OPTION", handle_option},
// ...
};
Challenge 10: Handler for Opcodes with Destination
Some opcodes take a register address and a destination bit. They all share the same structure, so in this section we’ll write one function that builds the machine code from a given identifier and reuse it for each opcode.
However, there is another thing to remember. Earlier in the post, we defined the ASM_LEN structure with two attributes: words, and mem for memory usage. To fill in mem, we need to track the unique register addresses used by the program, with a function that behaves like save_element. The opcodes with a destination also take a register address, so this is a good place to add that tracking.
So first, let’s write a set of functions to help us track memory. We need a function that returns the index of a value if it exists in the array and a negative value otherwise (the same mechanism as before); let’s call it get_mem_idx. We need another function that saves an address into the memory list if it isn’t there yet; let’s call it add_to_mem. This is the one that we’ll use in our handlers. Lastly, we need the total amount of memory used for mem in the ASM_LEN structure, so let’s call the last function get_used_mem.
Let’s start by writing the get_mem_idx function and defining some global variables for that.
static int used_mem[MAX_STR] = { 0 };
static int used_mem_idx = 0;
/* get_mem_idx: return negative if failed */
int get_mem_idx(int val){
for(int i = 0; i < used_mem_idx; ++i){
if(val == used_mem[i]){
return i;
}
}
return -1;
}
Because every address is numeric, we defined a static integer array called used_mem and a counter, used_mem_idx, to track how many entries it holds.
The second function, called add_to_mem, is responsible for adding the unique address to the array. It uses get_mem_idx, and if the result is -1 (negative), it adds the address to the memory.
void add_to_mem(char *v){
int result = extract_value(v, 1);
if(result >= 0){
int midx = get_mem_idx(result);
if(midx == -1){
used_mem[used_mem_idx++] = result;
}
}
}
And lastly, we need a function to get the total number of unique addresses, which we will call get_used_mem. This function simply returns the used_mem_idx.
int get_used_mem(void){
return used_mem_idx;
}
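If register addresses are limited to 5 bits (0 to 31), which is an assumption of this sketch, the set of used registers also fits in a single 32-bit mask. The names sk_used_regs, mark_reg, and count_regs are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t sk_used_regs = 0; /* bit n set means register n was referenced */

/* mark_reg: records a register address; ignores values outside 0..31. */
void mark_reg(int addr) {
    if (addr >= 0 && addr < 32) {
        sk_used_regs |= (uint32_t)1 << addr;
    }
}

/* count_regs: number of distinct registers recorded so far. */
int count_regs(void) {
    int count = 0;
    for (uint32_t m = sk_used_regs; m != 0; m &= m - 1) {
        count++; /* each iteration clears the lowest set bit */
    }
    return count;
}

/* Usage: duplicates are counted once. */
void mark_reg_example(void) {
    mark_reg(0x06);
    mark_reg(0x06);
    mark_reg(0x10);
    assert(count_regs() == 2);
}
```

Setting a bit twice has no effect, so no separate lookup such as get_mem_idx is needed in this variant.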
Now that we have add_to_mem, we can write the handlers for opcodes that have a destination. For the destination bit itself, we accept 0, W, or w as 0, and 1, F, or f as 1. A function to help us with this would be useful.
We can simply use a switch/case for this and call the function check_dist. If the output is -1 (negative), it indicates an invalid destination bit.
int check_dist(char *inpt){
if((int)strlen(inpt) != 1){
return -1;
}
switch (inpt[0]){
case '1': case 'F': case 'f':
return 1;
case '0': case 'W': case 'w':
return 0;
default:
return -1;
}
return -1;
}
Finally, the handler for opcodes that have a destination would be something like this:
int check_op_num(ASMBL *asmbl, OPR *operands, int len){
if(operands->len != len){
update_err(asmbl, "Incorrect amount of operands", "");
return 1;
}
return 0;
}
int set_dist_code(ASMBL *asmbl, OPR *operands, int code){
if(check_op_num(asmbl, operands, 2)){ return -1; }
int addr;
if((addr = extract_value(operands->lines[0], 1)) < 0){
update_err(asmbl, "Invalid register", operands->lines[0]);
return -1;
}
int dist;
if((dist = check_dist(operands->lines[1])) < 0){
update_err(asmbl, "Invalid destination", operands->lines[1]);
return -1;
}
add_to_mem(operands->lines[0]);
return code | (dist << 5) | addr;
}
This function takes a pointer to ASMBL so it can update ASM_ERR if needed, the OPR operands that we extracted and shifted, and a code, which is the identifier for our machine code. As mentioned before, all opcodes with a destination share the same structure and differ only in their identifiers, so we pass the identifier in as code. The function returns a value >= 0 if everything goes fine; otherwise, it returns a negative value, because a valid instruction is never negative. This matches the convention for OP_HNDL handlers described earlier.
To check that the number of operands is exactly 2 (no fewer and no more), we use the helper check_op_num. It returns 1 if the number of operands doesn’t match the expected count, and 0 if it does.
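The final line of set_dist_code packs three fields into one word: the identifier, the destination bit at position 5, and the register address in the low 5 bits. The sketch below isolates that step with range checks; encode_byte_op is a hypothetical name.

```c
#include <assert.h>

/* encode_byte_op: combines an opcode identifier, the destination bit d,
 * and a 5-bit register address f. Returns -1 on out-of-range input. */
int encode_byte_op(int base, int d, int f) {
    if (d != 0 && d != 1) return -1;
    if (f < 0 || f > 0x1F) return -1;
    return base | (d << 5) | f;
}

/* Usage: ADDWF 0x10, F and MOVF 0x06, W. */
void encode_byte_op_example(void) {
    assert(encode_byte_op(0x1C0, 1, 0x10) == 0x1F0); /* 0b000111110000 */
    assert(encode_byte_op(0x200, 0, 0x06) == 0x206); /* 0b001000000110 */
}
```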
Now we can write our handlers for each opcode that contains a destination bit.
/* {DECF} */
int handle_decf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x0C0); // 0b000011000000
}
/* {DECFSZ} */
int handle_decfsz(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x2C0); // 0b001011000000
}
/* {INCF} */
int handle_incf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x280); // 0b001010000000
}
/* {INCFSZ} */
int handle_incfsz(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x3C0); // 0b001111000000
}
/* {ADDWF} */
int handle_addwf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x1C0); // 0b000111000000
}
/* {ANDWF} */
int handle_andwf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x140); // 0b000101000000
}
/* {COMF} */
int handle_comf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x240); // 0b001001000000
}
/* {IORWF} */
int handle_iorwf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x100); // 0b000100000000
}
/* {MOVF} */
int handle_movf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x200); // 0b001000000000
}
/* {RLF} */
int handle_rlf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x340); //0b001101000000
}
/* {RRF} */
int handle_rrf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x300); // 0b001100000000
}
/* {SUBWF} */
int handle_subwf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x080); // 0b000010000000
}
/* {SWAPF} */
int handle_swapf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x380); // 0b001110000000
}
/* {XORWF} */
int handle_xorwf(ASMBL *asmbl, OPR *operands){
return set_dist_code(asmbl, operands, 0x180); // 0b000110000000
}
Also, we can add them to the array of handlers that we wrote earlier.
OP_HNDL hndls[] = {
// ...
{"DECF", handle_decf},
{"DECFSZ", handle_decfsz},
{"INCF", handle_incf},
{"INCFSZ", handle_incfsz},
{"ADDWF", handle_addwf},
{"ANDWF", handle_andwf},
{"COMF", handle_comf},
{"IORWF", handle_iorwf},
{"MOVF", handle_movf},
{"RLF", handle_rlf},
{"RRF", handle_rrf},
{"SUBWF", handle_subwf},
{"SWAPF", handle_swapf},
{"XORWF", handle_xorwf},
// ...
};
And finally, since we have get_used_mem, we can use it at the end of our assemble function.
void assemble(ASMBL *asmbl, TBL *input_tbl){
// ...
asmbl->len.mem = get_used_mem();
}
Challenge 11: Handler for Bit Manipulation Opcodes
We have four opcodes that test or manipulate bits. BTFSS and BTFSC test a bit, while BSF and BCF set or clear one. The opcodes in each pair share the same structure and differ only in their identifiers. We could write each handler separately, or we could write a function that takes the identifier as a parameter, as we did for the opcodes with destinations.
However, unlike the destination bit, the second argument of these opcodes can be an EQU. That’s simple to support because we already have the extract_value function. So, let’s start with BTFSS and BTFSC by writing a common function for them, called get_tst_op.
int get_tst_op(ASMBL *asmbl, OPR *operands, int code){
if(check_op_num(asmbl, operands, 2)){ return -1; }
int addr;
if((addr = extract_value(operands->lines[0], 1)) < 0){
update_err(asmbl, "Invalid register", operands->lines[0]);
return -1;
}
int bit;
if((bit = extract_value(operands->lines[1], 1)) < 0){
if((bit = is_number(operands->lines[1])) < 0){
update_err(asmbl, "Invalid bit", operands->lines[1]);
return -1;
}
}
add_to_mem(operands->lines[0]);
return code | (bit << 5) | addr;
}
The structure of get_tst_op is similar to set_dist_code, so the BTFSS and BTFSC handlers only need to call it with their own identifiers.
/* {BTFSS} */
int handle_btfss(ASMBL *asmbl, OPR *operands){
return get_tst_op(asmbl, operands, 0x700); // 0b011100000000
}
/* {BTFSC} */
int handle_btfsc(ASMBL *asmbl, OPR *operands){
return get_tst_op(asmbl, operands, 0x600); // 0b011000000000
}
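To make the packing step concrete, here is a minimal standalone sketch of the last line of get_tst_op. The name encode_bit_op is hypothetical and not part of the assembler; it assumes the layout implied by the return statement above, with a 3-bit bit number sitting above a 5-bit register address.

```c
/* encode_bit_op (hypothetical helper): pack a bit-oriented instruction.
 * Assumed layout: identifier | bbb (bits 7-5) | fffff (bits 4-0). */
int encode_bit_op(int code, int reg, int bit){
    return code | ((bit & 0x7) << 5) | (reg & 0x1F);
}
```

For example, encode_bit_op(0x700, 0x06, 3) gives 0x766, which is the word for BTFSS 0x06, 3 under this layout.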
To write handlers for BSF and BCF, we can again use a common function. Let’s call it bit_man_codes; it generates machine code based on the provided identifier. However, we must ensure that the register address and the bit number in the assembly code are valid, so we also need a function for that. Let’s call it check_bit_reg.
The code below checks the provided register address and bit number. It returns 0 if both are valid and 1 otherwise.
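int check_bit_reg(ASMBL *asmbl, int reg, int bit, char *regstr){
int bbb_size = 3; // width of the bit-number field
int fff_size = 5; // width of the register field (matches the 0x01F mask and the bit << 5 shift)
// The bit number must fit in 3 bits (0..7)
if (bit > (1 << bbb_size) - 1){
char buff[20];
itoar(bit, buff);
update_err(asmbl, "Invalid bit", buff);
return 1;
}
// The register address must fit in 5 bits (0x00..0x1F)
if(reg > (1 << fff_size) - 1){
update_err(asmbl, "Invalid register", regstr);
return 1;
}
return 0;
}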
And for the BSF and BCF handlers, we need a common function to generate the machine code based on the provided identifier.
/* bit_man_codes: bit manipulation codes */
int bit_man_codes(ASMBL *asmbl, OPR *operands, int code){
if(check_op_num(asmbl, operands, 2)){ return -1; }
int result = extract_value(operands->lines[0], 1);
int bit;
// A bit number must be in the range 0..7
if((bit = extract_value(operands->lines[1], 1)) < 0 || bit > 7){
update_err(asmbl, "Invalid bit number", operands->lines[1]);
return -1;
}
// check_bit_reg expects the register first, then the bit
if(check_bit_reg(asmbl, result, bit, operands->lines[0]) != 0){
return -1;
}
if(result >= 0){
add_to_mem(operands->lines[0]);
return code | (bit << 5) | result;
}
update_err(asmbl, "Failed to handle", operands->lines[0]);
return -1;
}
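The handle_bsf and handle_bcf functions themselves are not listed here. A minimal sketch, assuming they follow the same pattern as handle_btfss and handle_btfsc, with 0x500 and 0x400 as the BSF and BCF identifiers. The ASMBL and OPR types and the bit_man_codes body below are simplified stand-ins so that the sketch compiles on its own; the len field and the number-only parsing are assumptions, not the real definitions.

```c
#include <stdlib.h>

/* Stand-in types: only what this sketch needs (assumed, not the real structs). */
typedef struct { int ecode; } ASMBL;
typedef struct { char *lines[2]; int len; } OPR;

/* Simplified stand-in for bit_man_codes: accepts plain numbers only. */
int bit_man_codes(ASMBL *asmbl, OPR *operands, int code){
    (void)asmbl;
    if(operands->len != 2){ return -1; }
    long reg = strtol(operands->lines[0], NULL, 0);
    long bit = strtol(operands->lines[1], NULL, 0);
    if(reg < 0 || reg > 0x1F || bit < 0 || bit > 7){ return -1; }
    return code | ((int)bit << 5) | (int)reg;
}

/* {BSF} */
int handle_bsf(ASMBL *asmbl, OPR *operands){
    return bit_man_codes(asmbl, operands, 0x500); // 0b010100000000
}

/* {BCF} */
int handle_bcf(ASMBL *asmbl, OPR *operands){
    return bit_man_codes(asmbl, operands, 0x400); // 0b010000000000
}
```

With the operands 0x06 and 7, these return 0x5E6 and 0x4E6, which match the BSF and BCF lines in the example output at the end of this post.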
Now we can add these four handlers to our handlers array.
OP_HNDL hndls[] = {
// ...
{"BTFSS", handle_btfss},
{"BTFSC", handle_btfsc},
{"BSF", handle_bsf},
{"BCF", handle_bcf},
// ...
};
Challenge 12: Single Operand Opcodes
There are 9 unhandled opcodes left, and they need to be handled differently. Some work with a literal, some with an address, and others with a unique address. So, let’s write handlers for each of them, starting with MOVWF.
#define SET_BY_MASK(inst, mask, val) (((inst) & ~(mask)) | ((val) & (mask)))
/* {MOVWF} */
int handle_movwf(ASMBL *asmbl, OPR *operands){
if(check_op_num(asmbl, operands, 1)){ return -1; }
int result;
if((result = extract_value(operands->lines[0], 1)) >= 0){
add_to_mem(operands->lines[0]);
return SET_BY_MASK(0x020, 0x01F, result); // 0b000000100000, 0b000000011111
}
return -1;
}
The handler for MOVWF is simple. It extracts the address value and passes it to a macro called SET_BY_MASK, which builds the machine code. The macro takes an identifier inst, a mask mask, and a value val; it clears the masked bits of inst and fills them with the masked bits of val.
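As a quick worked example of the macro, the sketch below wraps it in a hypothetical movwf_word helper (not part of the assembler). It uses a fully parenthesized form of the macro so that expression arguments expand safely.

```c
/* Fully parenthesized form of SET_BY_MASK. */
#define SET_BY_MASK(inst, mask, val) (((inst) & ~(mask)) | ((val) & (mask)))

/* movwf_word (hypothetical helper): MOVWF identifier 0x020 with a 5-bit register field. */
int movwf_word(int reg){
    return SET_BY_MASK(0x020, 0x01F, reg);
}
```

Here movwf_word(0x06) evaluates to 0x026 (0b000000100110). Bits of the value outside the mask are silently dropped, so movwf_word(0x26) also evaluates to 0x026.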
The next one is the CLRF handler, which is called handle_clrf. The structure of this handler is quite similar to the handle_movwf function.
/* {CLRF} */
int handle_clrf(ASMBL *asmbl, OPR *operands){
if(check_op_num(asmbl, operands, 1)){ return -1; }
int result;
if((result = extract_value(operands->lines[0], 1)) >= 0){
add_to_mem(operands->lines[0]);
return SET_BY_MASK(0x060, 0x01F, result); // 0b000001100000, 0b000000011111
}
return -1;
}
The handler for TRIS is similar to the previous handlers, with one difference: the value of TRIS can only be 6 or 7. We must check that the value is one of these; otherwise, we report an error using update_err and return -1.
/* {TRIS} */
int handle_tris(ASMBL *asmbl, OPR *operands){
if(check_op_num(asmbl, operands, 1)){ return -1; }
int value;
if((value = extract_value(operands->lines[0], 1)) < 0){
update_err(asmbl, "Invalid literal value", operands->lines[0]);
return -1;
}
if(value == 6 || value == 7){
return 0x00 | value; // 0b000000000000
}
char buff[20] = { 0 };
itoar(value, buff);
update_err(asmbl, "Invalid \"TRIS\" value", buff);
return -1;
}
The opcodes MOVLW, ANDLW, IORLW, RETLW, and XORLW only take a literal, so they can share a common function like before. Let’s call this function extract_literal.
int extract_literal(ASMBL *asmbl, OPR *operands, int code, int uerr){
if(check_op_num(asmbl, operands, 1)){ return -1; }
int val;
if((val = extract_value(operands->lines[0], 1)) < 0){
if(uerr){
update_err(asmbl, "Invalid literal value", operands->lines[0]);
}
return -1;
}
return code | val;
}
The extract_literal function wraps extract_value: it also checks the number of operands and, if uerr is non-zero, records an invalid literal in the ASM_ERR structure. It then generates the machine code using the provided code identifier.
Now the handlers for these opcodes are nearly identical: each one calls extract_literal with its own identifier. This avoids redundant code and keeps the handlers consistent.
/* {MOVLW} */
int handle_movlw(ASMBL *asmbl, OPR *operands){
return extract_literal(asmbl, operands, 0xC00, 1); // 0b110000000000
}
/* {ANDLW} */
int handle_andlw(ASMBL *asmbl, OPR *operands){
return extract_literal(asmbl, operands, 0xE00, 1); // 0b111000000000
}
/* {IORLW} */
int handle_iorlw(ASMBL *asmbl, OPR *operands){
return extract_literal(asmbl, operands, 0xD00, 1); // 0b110100000000
}
/* {RETLW} */
int handle_retlw(ASMBL *asmbl, OPR *operands){
return extract_literal(asmbl, operands, 0x800, 1); // 0b100000000000
}
/* {XORLW} */
int handle_xorlw(ASMBL *asmbl, OPR *operands){
return extract_literal(asmbl, operands, 0xF00, 1); // 0b111100000000
}
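One thing to keep in mind is that extract_literal ORs the value into the identifier as it is, so a literal above 0xFF would overwrite identifier bits. A minimal sketch of a guard, using a hypothetical literal_word helper that is not part of the assembler:

```c
/* literal_word (hypothetical helper): combine a literal-opcode identifier
 * with an 8-bit literal; returns -1 when the value does not fit in 8 bits. */
int literal_word(int code, int val){
    if(val < 0 || val > 0xFF){ return -1; }
    return code | val;
}
```

For instance, literal_word(0x800, 0x48) gives 0x848 (RETLW 0x48), while literal_word(0xC00, 0x100) is rejected instead of silently turning into a different opcode.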
The last remaining opcode is CALL. The CALL opcode is similar to GOTO, so let’s create a common function for both and call it set_by_label.
int set_by_label(ASMBL *asmbl, OPR* operands, int code){
if(check_op_num(asmbl, operands, 1)){ return -1; }
char *label = operands->lines[0];
// First, try to resolve the operand as a label
int lvalue = get_element(LABEL_ELEMENT, label);
if(lvalue >= 0){
return code | lvalue; // identifier | label address
}
// Otherwise, treat the operand as a literal address
lvalue = extract_value(label, 1);
if(lvalue < 0){
update_err(asmbl, "Invalid label", label);
return -1;
}
return code | lvalue; // identifier | literal address
}
The set_by_label function first looks the operand up in the label array. If no label matches, it falls back to extract_value and treats the operand as a literal address. Now that we have this function, let’s update handle_goto and create the handle_call function.
/* {GOTO} */
int handle_goto(ASMBL *asmbl, OPR *operands){
return set_by_label(asmbl, operands, 0xA00); // 0b101000000000
}
/* {CALL} */
int handle_call(ASMBL *asmbl, OPR *operands){
return set_by_label(asmbl, operands, 0x900); // 0b100100000000
}
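The lookup order can be sketched on its own as follows. This is an illustration only: the LABEL table, resolve_target, and jump_word are hypothetical names, and the real assembler resolves labels through get_element. The sketch also adds a range check that set_by_label does not perform, assuming a 9-bit address field for GOTO and an 8-bit one for CALL.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical label table, standing in for the assembler's label array. */
typedef struct { const char *name; int addr; } LABEL;
static const LABEL labels[] = { {"start", 0x00}, {"loop", 0x03} };

/* resolve_target: label lookup first, numeric literal as fallback; -1 on failure. */
int resolve_target(const char *operand){
    size_t n = sizeof(labels) / sizeof(labels[0]);
    for(size_t i = 0; i < n; i++){
        if(strcmp(labels[i].name, operand) == 0){ return labels[i].addr; }
    }
    char *end;
    long v = strtol(operand, &end, 0);
    if(end == operand || *end != '\0' || v < 0){ return -1; }
    return (int)v;
}

/* jump_word: OR the target into the identifier, rejecting targets wider
 * than the field. max_addr is 0x1FF for GOTO and 0xFF for CALL. */
int jump_word(int code, int max_addr, const char *operand){
    int target = resolve_target(operand);
    if(target < 0 || target > max_addr){ return -1; }
    return code | target;
}
```

With this table, jump_word(0xA00, 0x1FF, "loop") yields 0xA03, and jump_word(0x900, 0xFF, "0x100") returns -1 because the target does not fit in the CALL field.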
Now we can add their handlers to our array. By doing this, we have handled all of the 33 opcodes and completed our assemble function. The next step is to provide some functions for getting output or generating a binary file.
OP_HNDL hndls[] = {
// ...
{"MOVWF", handle_movwf},
{"CLRF", handle_clrf},
{"TRIS", handle_tris},
{"MOVLW", handle_movlw},
{"ANDLW", handle_andlw},
{"IORLW", handle_iorlw},
{"RETLW", handle_retlw},
{"XORLW", handle_xorlw},
{"CALL", handle_call}
};
Challenge 13: Generating Output
Our ASM_ERR and ASM_LEN structures have already been updated by the assemble function. Now we just need to turn them into output.
The ecode in the ASMBL structure tells us whether the assemble process failed or succeeded. Let’s first imagine the process failed. We’ll create a function that fills a given buffer with error diagnostics.
void show_err(ASM_ERR *err, char buffer[]){
char obj_buff[MAX_STR + 10] = { 0 };
if(strcmp(err->obj, "") != 0){
str_trim(err->obj);
sprintf(obj_buff, " (%s)", err->obj);
}
str_trim(err->line);
sprintf(buffer, "%s%s:\n %-3d| %s\n |\n", err->msg, obj_buff, err->lnum, err->line);
}
The show_err function fills the given buffer, char buffer[], using the fields of the ASM_ERR pointer passed to it. It creates a message that we can show in the terminal output or other places, such as when using WASM.
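Because sprintf does not know how large the destination is, a long message or source line can run past the end of the buffer. A minimal sketch of a bounded variant, assuming a hypothetical format_err function that takes the fields as plain arguments and the buffer size as a parameter (neither is part of the assembler):

```c
#include <stdio.h>

/* format_err (hypothetical): same layout as show_err, but bounded by the
 * caller's buffer size. Returns the value returned by snprintf, i.e. the
 * length the full message needs, even if it was truncated. */
int format_err(char *out, size_t size, const char *msg, const char *obj,
               int lnum, const char *line){
    if(obj[0] != '\0'){
        return snprintf(out, size, "%s (%s):\n %-3d| %s\n |\n", msg, obj, lnum, line);
    }
    return snprintf(out, size, "%s:\n %-3d| %s\n |\n", msg, lnum, line);
}
```

A caller can compare the return value with the buffer size to detect truncation.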
Now, let’s imagine the process succeeded. We need to create an output file using the provided output path from GFLAGS, the mcode from ASMBL, and the total number of words in ASM_LEN.
This function is the opposite of io_read, so let’s call it io_write.
/* io_write: Write into external files */
void io_write(char *path, int buff[], int len) {
FILE *fp;
// Attempt to open the file in binary write mode
if((fp = fopen(path, "wb+")) == NULL){
// If the file cannot be opened, report the error and exit with a failure status
fprintf(stderr, "Failed to write in \"%s\"\n", path);
exit(1);
}
unsigned char bytes[2];
// Loop through each value in the buffer
for (int i = 0; i < len; i++) {
// Extract the MSB (Most Significant Byte)
bytes[0] = (buff[i] >> 8) & 0xFF;
// Extract the LSB (Least Significant Byte)
bytes[1] = buff[i] & 0xFF;
// Write both bytes to the file
fwrite(bytes, 1, sizeof(bytes), fp);
}
// Close the file after writing
fclose(fp);
}
The io_write function is responsible for converting our machine codes into a binary file using the provided path. It works by splitting each 12-bit word into its MSB (Most Significant Byte) and LSB (Least Significant Byte) and writing them to the file in that order, so every word takes two bytes.
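The byte split can be tried in isolation with the sketch below. The helper names split_word and join_word are hypothetical and not part of the assembler; join_word shows how a reader of the file would rebuild each word from the two bytes.

```c
/* split_word (hypothetical helper): the same split io_write performs. */
void split_word(int word, unsigned char bytes[2]){
    bytes[0] = (word >> 8) & 0xFF; /* high byte is written first */
    bytes[1] = word & 0xFF;        /* low byte is written second */
}

/* join_word (hypothetical helper): rebuild the word from the two bytes. */
int join_word(const unsigned char bytes[2]){
    return (bytes[0] << 8) | bytes[1];
}
```

For the word 0xA03 (GOTO 0x03), split_word produces the bytes 0x0A and 0x03, and join_word turns them back into 0xA03.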
Now that we have all of the essential functions, let’s update our main function and finish writing our program.
int main(int argc, char *argv[]){
// Already Added:
GFLAGS gflags;
update_gflags(&gflags, argc, argv);
TBL file;
io_read(&file, gflags.input);
ASMBL asmbl;
assemble(&asmbl, &file);
// New parts:
if(asmbl.ecode){
static char err_buff[MAX_STR] = { 0 };
show_err(&asmbl.err, err_buff);
printf("%s\n", err_buff);
return 1;
} else {
io_write(gflags.output, asmbl.mcode, asmbl.len.words);
if(gflags.verbose){
for(int i = 0; i < asmbl.len.words; ++i){
printf("%s\n", asmbl.lines[i]);
}
printf("\n\n");
}
printf("Total Words: %d\nNumber of Used Memory: %d\n",
asmbl.len.words, asmbl.len.mem);
}
return 0;
}
By using asmbl.ecode, we check if the process failed or not. If it failed, we use show_err to print the error and return 1, indicating failure. Otherwise, we create the binary file using io_write. If the verbose flag (-v) is set, we loop through each line and print it. Finally, on success we print the total number of words (asmbl.len.words) and the total usage of memory (asmbl.len.mem), whether or not the verbose flag is set. And that completes our assembler program.
Example
In the previous post, we created an example program. Now, let’s assemble the assembly code with the -v flag on to see the output.
MOVWF 0x06 0b000000100110
BSF 0x06 7 0b010111100110
BCF 0x06 7 0b010011100110
MOVLW 0x01 0b110000000001
INCF 0x0A 1 0b001010101010
ADDWF 0x0A 0 0b000111001010
ADDWF 0x02 1 0b000111100010
NOP 0b000000000000
NOP 0b000000000000
RETLW 0x48 0b100001001000
RETLW 0x65 0b100001100101
RETLW 0x6C 0b100001101100
RETLW 0x6C 0b100001101100
RETLW 0x6F 0b100001101111
RETLW 0x2C 0b100000101100
RETLW 0x20 0b100000100000
RETLW 0x57 0b100001010111
RETLW 0x6F 0b100001101111
RETLW 0x72 0b100001110010
RETLW 0x6C 0b100001101100
RETLW 0x64 0b100001100100
RETLW 0x21 0b100000100001
RETLW 0x0A 0b100000001010
CLRF 0x0A 0b000001101010
CLRF 0x06 0b000001100110
CLRW 0b000001000000
GOTO 0x00 0b101000000000
Total Words: 27
Number of Used Memory: 3
If we create an invalid program, like the code below, we will have a different output and no binary file. The error message will be displayed, and the program will not proceed to generate the binary file.
GPIO EQU 0x06
MOVLW 'A'
BSF GPIO, 7
INVALID ;; Err
CLRF GPIO
The output that we will get is the error message generated by show_err, which names the error, the offending object, and the line it was found on.
Invlaid opcode (INVALID):
6 | INVALID ;; Err
|
You can find all of the code for this project, many more examples, and a WASM version of it in this GitHub repository.