Journey Of Creating An Assembler

In my previous blog post, I discussed the PIC10F200/202/204/206 series of microcontrollers and explained the structure of their opcodes. In this post, we will create an assembler together.

What is an Assembler?

An assembler converts human-readable assembly code into machine language, or machine-readable code. For example, consider the following assembly line:

ASM
GOTO 0x03

We can understand this code, but our CPU or microcontroller cannot. The assembler takes this assembly line and converts it into machine code, like so:

ASM
GOTO       0x03  
101        000000011  

Each opcode (operation code) has an identifier. In this case, the identifier is 0b101, and it may be followed by one or more operands, like 0x03 in our example, which is represented as the 9-bit value 0b000000011. The assembler combines these values into 0b101000000011, forming a 12-bit binary value. On this microcontroller, every instruction assembles to one 12-bit word; these words are then concatenated and exported as a binary file that the CPU can execute. This is the core function of an assembler.
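As a minimal sketch of that combining step, the identifier can be shifted into the top three bits and OR-ed with the operand. The function name encode_goto is hypothetical and only used for illustration; it is not part of the assembler we build below.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): packs the 3-bit GOTO
   identifier and a 9-bit address into one 12-bit instruction word. */
uint16_t encode_goto(unsigned address) {
    const unsigned identifier = 0x5u;   /* 0b101 */

    /* Shift the identifier above the 9 operand bits, then add the operand.
       The mask keeps an oversized address from spilling into the identifier. */
    return (uint16_t)((identifier << 9) | (address & 0x1FFu));
}
```

Calling encode_goto(0x03) gives 0xA03, which is the same value as 0b101000000011 above.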

Why Create an Assembler?

Now that we understand what an assembler does, why create one? Creating an assembler is an excellent learning experience for junior programmers. It helps deepen your understanding of both the microcontroller’s architecture and the programming language you’re working with. In this case, we will write our assembler in C, as it is fast and performance is important for our needs.

Challenges

When creating an assembler, you’ll encounter several challenges. For instance, not all opcodes have a single operand—some may have two operands, while others might not require any. We must design a program capable of understanding these variations and generating the corresponding machine code.

Breaking It Down: Our Methodology

We will start by writing a simple program that generates binary code for a single opcode. Once that works, we will expand it to support additional opcodes.

In my previous blog post, I explained the 33 opcodes available in the PIC10F series of microcontrollers. Let’s revisit the structure of the GOTO opcode:

GOTO:  
            ┌─ Literal Value (9 bits)  
     ───────┴───  
0b101K kkkk kkkk  
  ─┬─  
   └─ Identifier (3 bits)  

The first three bits (the most significant ones) are the GOTO identifier, and the remaining nine bits hold the 9-bit operand.
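To make the layout visible, here is a small sketch that prints a 12-bit word with a gap between the identifier and the operand. The helper format_goto_word is a made-up name for this illustration only.

```c
/* Hypothetical helper: writes a 12-bit word as "iii ooooooooo"
   (3-bit identifier, one space, 9-bit operand).
   'out' must have room for 14 characters including the terminator. */
void format_goto_word(unsigned word, char out[14]) {
    int pos = 0;

    /* Walk from the most significant bit (11) down to bit 0 */
    for (int bit = 11; bit >= 0; --bit) {
        out[pos++] = ((word >> bit) & 1u) ? '1' : '0';
        if (bit == 9) {
            out[pos++] = ' ';   /* gap between identifier and operand */
        }
    }
    out[pos] = '\0';
}
```

For the word 0xA03 this produces the text "101 000000011", matching the two fields in the diagram.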

Simple Program to handle GOTO opcode

Let’s start our journey by creating a simple program that generates the binary for the GOTO opcode.

C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXSTR 32

int main(void){
    char line[2][MAXSTR] = { "GOTO", "42" };

    int machine_code[10] = { 0 };
    int midx = 0;

    // Check whether the opcode matches "GOTO"
    if(strcmp(line[0], "GOTO") == 0){
        // Pointer to the first character that strtol could not convert
        char *endptr;

        // Extract the numeric value of line[1] (base 10)
        long result = strtol(line[1], &endptr, 10);

        // The conversion succeeded only if at least one digit was read and
        // nothing is left over; unlike a "!= 0" test, this also accepts "0"
        if(endptr != line[1] && *endptr == '\0'){
            // Binary literals are a GCC/Clang extension before C23;
            // the mask keeps the operand within its 9 bits
            int code = 0b101000000000 | (int)(result & 0x1FF);
            machine_code[midx++] = code;

        } else {

            // Exit the program if the operand is not a number
            printf("Invalid operand\n");
            exit(1);
        }

    } else {
        printf("Unsupported opcode \"%s\"\n", line[0]);
        exit(1);
    }

    printf("Generated machine code\n");

    for(int i = 0; i < midx; ++i){
        printf("0x%.3X\n", machine_code[i]);
    }

    return 0;
}

Let’s see what this code does, step by step:

  1. It checks if the first element in our array is equal to GOTO; otherwise, it exits the program.
  2. It extracts the numeric value of the GOTO operand and stores it in result. If the conversion fails, the program exits.
  3. If converting the operand of GOTO to a number is successful, it generates the machine code using the identifier of GOTO, which is 0b101000000000, and result.
  4. It saves the generated machine code in our machine_code list.
  5. Using a for loop, we iterate through all the codes and print their hexadecimal values.

This program has several issues. First, the input to the assembler isn’t an array like { "GOTO", "42" }; it’s a file containing lines of code. Second, we don’t want to handle just GOTO; there are 33 opcodes that need to be managed. A binary view of the generated machine code would also be helpful. Another issue is that if no operand is provided for GOTO, the assembler should still handle the case. Additionally, the operand for GOTO can be in various formats, such as 0x0F or 06H, and the assembler must be capable of detecting and processing these. Lastly, we should have flags to allow for different output options, like -v to view the generated binary. Let’s address these issues one by one and expand our simple assembler program.
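To give an idea of what handling several number formats can look like, here is a minimal sketch. The function parse_number and its helper digit_value are hypothetical names, and the rules are assumptions for this example: a 0x prefix or an H suffix means hexadecimal, a 0b prefix means binary, and an H-suffixed number must start with a decimal digit so that it cannot be confused with a name.

```c
#include <ctype.h>
#include <string.h>

/* Value of one digit character, or -1 if it is not a hex digit. */
static int digit_value(char c) {
    if (isdigit((unsigned char)c)) return c - '0';
    if (isxdigit((unsigned char)c)) return tolower((unsigned char)c) - 'a' + 10;
    return -1;
}

/* Hypothetical helper: parses "42", "0x0F", "0b101" or "06H".
   Returns 1 and stores the value in *out on success, 0 on failure. */
int parse_number(const char *text, long *out) {
    size_t len = strlen(text);
    size_t start = 0;   /* index of the first digit */
    size_t end = len;   /* index one past the last digit */
    int base = 10;
    long value = 0;

    if (len > 1 && (text[len - 1] == 'h' || text[len - 1] == 'H')
            && isdigit((unsigned char)text[0])) {
        base = 16; end = len - 1;   /* 06H */
    } else if (len > 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16; start = 2;       /* 0x0F */
    } else if (len > 2 && text[0] == '0' && (text[1] == 'b' || text[1] == 'B')) {
        base = 2; start = 2;        /* 0b101 */
    }

    if (start >= end) return 0;     /* no digits at all */

    for (size_t i = start; i < end; ++i) {
        int d = digit_value(text[i]);
        if (d < 0 || d >= base) return 0;   /* not a digit in this base */
        value = value * base + d;
        if (value > 0xFFFFFF) return 0;     /* far beyond any operand we need */
    }

    *out = value;
    return 1;
}
```

Because success is reported through the return value, an operand of 0 is no longer mistaken for a failed conversion.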

Before we can read an assembly file, we must know where it is; let’s address this challenge first.

Challenge 1: CLI Flags

Our assembler should not need its source lines hard-coded as character arrays in the program; instead, we want to assemble an .asm file by passing its path on the command line. Let’s write a function that stores our CLI (Command Line Interface) options in a predefined structure called GFLAGS, short for Global Flags.

C
#define MAX_PATH 512

typedef struct {
    int verbose;
    char input[MAX_PATH];
    char output[MAX_PATH];
} GFLAGS;

We have three items in our structure. The first is a variable called verbose, which we enable when the -v flag is provided. This helps us see the generated binary output. The second is input, a character array (string) used to store the input file path. The last item, also a character array called output, is for specifying the output file path. However, we don’t always need to define an output path, so output, like verbose, is optional. The input field is mandatory for our assembler because it contains the assembly code.

Now that we have defined our structure, let’s write a function that captures argc and argv from the main function and updates the given pointer to GFLAGS.

C
/* Update global flags */
void update_gflags(GFLAGS *gflags, int argc, char *argv[]) {

    // Exit the program if there are not enough arguments
    if (argc < 2) {
        // Print usage instructions
        printf("%s \"<filename>\" -[options]\n", argv[0]);
        exit(0); // Terminate the program
    }

    int i;  // Loop variable for arguments
    int j;  // Loop variable for characters in an argument

    // Initialize global flags structure
    gflags->verbose = 0;                                // Default verbose mode off
    memset(gflags->input, 0, sizeof(gflags->input));    // Clear input file path
    memset(gflags->output, 0, sizeof(gflags->output));  // Clear output file path

    // Set default input and output file paths (bounded copies, so a long path cannot overflow the buffers)
    snprintf(gflags->input, sizeof(gflags->input), "%s", argv[1]);         // Input file path from first argument
    snprintf(gflags->output, sizeof(gflags->output), "%s", "asm_out.bin"); // Default output file path

    int save = 0; // Flag to indicate the next argument is an output file path

    // Iterate over program arguments starting from the second one
    for (i = 2; i < argc; i++) {

        // Save output path if '-o' option was found
        if (save) {
            snprintf(gflags->output, sizeof(gflags->output), "%s", argv[i]); // Store output file path
            save = 0; // Reset save flag
            continue; // The path itself must not be scanned for options
        }

        // Process each character of the current argument
        for (j = 0; j < (int)strlen(argv[i]); j++) {
            if (argv[i][0] == '-') { // Check if it's an option
                switch (argv[i][j]) {
                    case 'v': // Verbose mode
                        gflags->verbose = 1;
                        break;

                    case 'o': // Output file option
                        save = 1; // Indicate the next argument is the output path
                        break;

                    default:
                        break; // Ignore unknown options
                }
            }
        }
    }

    // Check if '-o' was used without specifying an output path
    if (save) {
        // Error message
        printf("No output file!\nAfter '-o' output path needed\n");
        exit(0); // Terminate the program
    }
}

We create a function called update_gflags. This function takes three arguments: the first is GFLAGS *gflags, which is a pointer used to update the given GFLAGS structure; the second is argc, which contains the number of command-line arguments; and the last is argv, which holds the arguments themselves.

This function saves the second argument (argv[1]) in input (the first one, argv[0], is the program name) and loops through each subsequent argument. If an argument starts with -, it checks for possible flags like -v for enabling the verbose flag. If it encounters the -o flag, it enables the save variable, and the next argument is stored in the output path. If there is no -o flag, the output path keeps its default value of asm_out.bin.

We can use update_gflags in our main function like this:

C
int main(int argc, char *argv[]){
    GFLAGS gflags;
    update_gflags(&gflags, argc, argv);
    // ...
}

We can test our flags and observe the different outputs:

Bash
./assembler ./test.asm -v -o ./output.bin

The resulting data would be:

Plaintext
input: "./test.asm"  
output: "./output.bin"  
verbose: 1  

Now that we have the input file path thanks to the update_gflags function, we need to read the input file, if possible, and then proceed to read each line and process it for our assembler.

Challenge 2: Read Input File

It’s easier for our assembler to have each line as a string before breaking it down into words and extracting the operands. So, we need a structure to store our lines, and a table would be ideal. We will store the lines in a two-dimensional character array, but since we don’t know how many of its rows are in use, we will use an int to keep track of how many lines of code are stored in our table. Let’s start by writing our Table structure and calling it TBL.

C
#define MAX_STR 256    // Maximum length of a single line
#define ASM_BUFF 1024  // Maximum number of lines in the table

// Structure to store lines of assembly code
typedef struct {
    // Array to store the lines (up to ASM_BUFF lines, each up to MAX_STR characters)
    char lines[ASM_BUFF][MAX_STR];
    // Variable to keep track of the number of lines stored in the table
    int len;
} TBL;

A function for copying a TBL is also useful: it lets us keep an untouched copy of the original lines before later steps modify them.

C
void copytbl(TBL *dst, TBL *src){
    dst->len = src->len;
    for(int i = 0; i < src->len; ++i){
        strcpy(dst->lines[i], src->lines[i]);
    }
}

Now that we have our table, we can write a function to read each line of the input file. Since we already have the input file path stored in our GFLAGS, let’s call this function io_read.

The function io_read takes two arguments: the first is a pointer to our predefined TBL structure, since we want to store each line in our table, and the second argument is the input file’s path.

C
/* Read the file at 'path' and load it into 'tbl' (if an error occurs, finish the program) */
void io_read(TBL *tbl, char path[]){
    FILE *fp;

    // Clear the lines array in the table
    memset(tbl->lines, 0, sizeof(tbl->lines));
    tbl->len = 0;

    // Open the file for reading
    fp = fopen(path, "r");

    // If the file cannot be opened (for example, it doesn't exist), print an error message and exit
    if(fp == NULL){
        printf("File \"%s\" could not be opened!\n", path);
        exit(1);
    }

    char buff[MAX_STR] = { 0 };

    // Read each line from the file, stopping when the table is full
    while(tbl->len < ASM_BUFF && fgets(buff, sizeof(buff), fp) != NULL){
        // Copy the current line to the table and increment the line count
        strcpy(tbl->lines[tbl->len], buff);
        tbl->len++;
    }

    // Close the file after reading
    fclose(fp);
}

  1. We declare a file pointer, FILE *fp.
  2. We clear our table (tbl).
  3. We open the file and check whether fp is NULL; if it is, we end the program because the file could not be opened.
  4. We read each line into buff, then copy buff into the next free row of our table.
  5. We close the file.
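One detail worth knowing: fgets keeps the line break at the end of each line it reads. In this post the later trimming step takes care of it, but if you want clean lines right away, a sketch like the following could be applied to buff before copying it. The name strip_newline is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: cuts the string at the first '\r' or '\n',
   so both Unix and Windows line endings are removed. */
void strip_newline(char line[]) {
    /* strcspn returns the length of the part that contains none of "\r\n" */
    line[strcspn(line, "\r\n")] = '\0';
}
```

A line without any line break (such as the last line of a file) is left unchanged.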

Now that we have the io_read function, let’s add it to our main function, just below the update_gflags.

C
int main(int argc, char *argv[]){
    GFLAGS gflags;
    update_gflags(&gflags, argc, argv);

    TBL file;
    io_read(&file, gflags.input);

    // ...
}

Thanks to update_gflags and io_read, we now have the input file read and stored in our TBL structure, named file. Now, we can process each line.

Challenge 3: Breaking Down the Lines

The io_read function helped us read the input file and store it in our TBL structure, where each line is stored in an array like this:

C
{
    // ...
    "GOTO 42",
    "NOP",
    "CLRW"
    // ...
}

However, in our first example, each instruction was already split into an array: "GOTO 42" was stored as { "GOTO", "42" }. That form is easier to process than a single string, so we need a function that converts lines like "GOTO 42" into arrays like { "GOTO", "42" }.

Now that we understand why we need a function to break down a string, let’s write one and call it str_break. However, we also need a structure to store our data. It should be something similar to TBL, which we already defined, but smaller, as the table is too large for this purpose. We need a compact structure to store our operands, so let’s define a structure and call it OPR, to store the result of the str_break function.

C
// Maximum number of operands
#define MAX_OPERAND 5

typedef struct {
    char lines[MAX_OPERAND][MAX_STR];
    int len;
} OPR;

So, now that we have our OPR structure, let’s write the str_break function. The str_break function requires a character array input (char *) and a pointer to the OPR structure to store the data.

But there is a problem: we don’t want the str_break function to behave like split() in other languages. It has to be smart enough to detect quoted characters, like 'A', 'B', or even ' ', which are enclosed in single quotes. This is useful because not all operands of instructions are integers; they may be characters, like in the previous “Hello, World” example. Therefore, we need to track quotes as well.

C
void str_break(char input[], OPR *tbl) {
    int q = 0;  // Flag to track if inside quotes
    int bi = 0; // Index for the line in the table
    int f = 0;  // Index for characters in the current line
    memset(tbl->lines, 0, sizeof(tbl->lines)); // Initialize the lines array in the table to 0
    int was_space = 0; // Flag to track if the previous character was a space

    // Iterate through each character of the input string
    while (*input) {
        // If the current character is not a space or we're inside quotes, add it to the current token
        if (*input != ' ' || q == 1) {
            // Only store the character if it still fits in the table
            if (bi < MAX_OPERAND && f < MAX_STR - 1) {
                tbl->lines[bi][f++] = *input; // Add character to the current token
                tbl->lines[bi][f] = '\0';     // Null-terminate right after it
            }
            was_space = 0; // Reset the space flag
        } else {
            // If we encounter a space and were not previously inside a space, move to the next line
            if (was_space == 0) {
                bi++;  // Move to the next line
                f = 0; // Reset the character index for the new line
                was_space = 1; // Set space flag
            }
        }

        // If the character is a quote, toggle the inside-quote flag
        if (*input == '\'') q = q ? 0 : 1;

        input++; // Move to the next character
    }

    int size = sizeof(tbl->lines) / sizeof(tbl->lines[0]); // Get the number of lines available in the table

    tbl->len = 0; // Initialize the line count to 0

    // Loop through all lines in the table
    for (int i = 0; i < size; ++i) {
        str_trim(tbl->lines[i]); // Trim whitespace from the line
        if (strcmp(tbl->lines[i], "") == 0) { // If the line is empty, stop processing
            break;
        } else {
            tbl->len++; // Increment the line count for non-empty lines
        }
    }
}

The str_break function breaks a given string into tokens and stores them in the OPR structure. It splits on spaces, except inside single quotes: each ' toggles a flag, and while the flag is set, spaces are kept as part of the current token. After that, it trims each token with str_trim (which we will write in Challenge 5), counts the non-empty tokens, and updates the len field in the OPR structure.
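To see the quote handling in isolation, here is a simplified, self-contained sketch of the same idea. It is not the str_break function itself: split_tokens, TOK_MAX, and TOK_LEN are names and limits assumed for this example.

```c
#include <string.h>

#define TOK_MAX 5    /* assumed limit, mirrors MAX_OPERAND */
#define TOK_LEN 32   /* assumed token size for this sketch */

/* Hypothetical, simplified tokenizer: splits on spaces, but keeps spaces
   that sit inside single quotes. Returns the number of tokens in 'out'. */
int split_tokens(const char *input, char out[TOK_MAX][TOK_LEN]) {
    int count = 0;      /* tokens completed so far */
    int pos = 0;        /* write position inside the current token */
    int in_quote = 0;   /* 1 while between two single quotes */

    memset(out, 0, sizeof(char) * TOK_MAX * TOK_LEN);

    for (; *input != '\0'; ++input) {
        if (*input == '\'') {
            in_quote = !in_quote;
        }
        if (*input == ' ' && !in_quote) {
            if (pos > 0) {   /* a space ends the current token */
                count++;
                pos = 0;
            }
            continue;        /* repeated spaces are skipped */
        }
        if (count < TOK_MAX && pos < TOK_LEN - 1) {
            out[count][pos++] = *input;
        }
    }

    if (pos > 0 && count < TOK_MAX) {
        count++;             /* the last token has no trailing space */
    }
    return count;
}
```

With the input MOVLW ' ' this yields two tokens, MOVLW and ' ', whereas a plain split on spaces would tear the quoted space apart.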

Challenge 4: Assemble function

Now, thanks to str_break, we are able to break down a given line, which is very helpful for processing the operands of an opcode. Next, we need a function to process each line for us. Since this is an assembler program, let’s call the function assemble. We already have a TBL structure for the extracted lines from the input (using io_read).

Now, we need another structure to help the assemble function store its data. Remember earlier when we used printf and a for loop to see the result of each code in hexadecimal? Wouldn’t it be better to already have the processed line stored? This is useful when the verbose flag (-v) is set. Since an assembler generates an executable output, we also need to store each numeric value of an opcode in an array. This is essential for concatenating the executable file.

But what if there’s an invalid opcode? The assemble function should be able to handle that and return a proper error message with enough details to help us locate the issue, such as the line number or even the invalid line itself. Lastly, since these chips have very little program memory (256 words on the PIC10F200/204 and 512 words on the PIC10F202/206), it’s useful to track the number of generated words and the used addresses.

Now that we know what the assemble function needs to do, let’s create an appropriate structure for it and call it ASMBL, which will be responsible for storing the processed data from the assemble function. This structure will also use ASM_LEN and ASM_ERR to keep track of word length and any errors that occur.

First, let’s start by defining ASM_ERR. This structure has four fields. The first stores the line number (let’s call it lnum). Next, we need a character array to store the message (msg). We also need another character array called line to store the offending line itself. Lastly, we need an object (obj), which is also a character array. The obj field stores the invalid part of the line, such as the opcode, to help the user pinpoint where the issue occurred, whether it’s with the opcode, the operands, or something else.

C
#define ASM_LINE 128

// Structure to store error information
typedef struct {
    int lnum;               // Line number where the error occurred
    char msg[MAX_STR];      // Message describing the error
    char line[ASM_LINE];     // The line where the error occurred
    char obj[MAX_STR];      // The specific object (opcode, operand, etc.) causing the error
} ASM_ERR;

The second structure we need to create to assist in writing ASMBL is ASM_LEN, which will help the assemble function and ASMBL keep track of memory usage and the generated words from opcodes.

C
typedef struct {
    int mem;      // Amount of memory used
    int words;    // Total number of generated words
} ASM_LEN;

Now that we’ve written both ASM_ERR and ASM_LEN, let’s define ASMBL:

C
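// MAX_CODE (the maximum number of machine words) is not defined earlier in this post;
// 512 is an assumed value that matches the largest program memory in this family
#define MAX_CODE 512

typedef struct {
    int mcode[MAX_CODE];            // Machine codes
    char lines[MAX_STR][ASM_LINE];  // Processed lines (for verbose output)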
    ASM_ERR err;                    // Error struct
    ASM_LEN len;                    // Length struct
    int ecode;                      // exit code
} ASMBL;

We’ve already discussed the ASM_ERR and ASM_LEN structures, as well as the machine code (mcode) and verbose lines (lines) in the ASMBL structure. However, it would also be useful to have an exit code (ecode) parameter. This will allow us to handle different types of errors. For example, 0 could indicate that everything is fine, 1 could represent a general error, and other values could be used to indicate specific issues, such as an incorrect number of operands.
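As a sketch of how such exit codes could be organized, an enum gives each value a readable name. Only the meanings of 0 and 1 come from the paragraph above; the type ecode_t, the function ecode_name, and the values 2 and 3 are assumptions made up for this example.

```c
/* Hypothetical exit codes for the 'ecode' field. */
typedef enum {
    ECODE_OK = 0,              /* everything is fine */
    ECODE_GENERAL = 1,         /* general error */
    ECODE_OPERAND_COUNT = 2,   /* wrong number of operands (assumed value) */
    ECODE_BAD_OPCODE = 3       /* unknown opcode (assumed value) */
} ecode_t;

/* Returns a short description for an exit code. */
const char *ecode_name(int ecode) {
    switch (ecode) {
        case ECODE_OK:            return "ok";
        case ECODE_GENERAL:       return "general error";
        case ECODE_OPERAND_COUNT: return "wrong number of operands";
        case ECODE_BAD_OPCODE:    return "unknown opcode";
        default:                  return "unknown exit code";
    }
}
```

Named codes make comparisons like asmbl.ecode == ECODE_OPERAND_COUNT easier to read than bare numbers.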

Now that we have defined the three structures (ASMBL, ASM_ERR, and ASM_LEN), let’s define a function for each of them to initialize the structures. Since C does not initialize local variables automatically, a freshly declared structure holds indeterminate (junk) values, so we need to set its fields to 0. Each function will take a pointer to its respective type to initialize it.

C
/* initialize ASM_ERR */
void empty_err(ASM_ERR *err){
    err->lnum = 0;
    memset(err->msg, 0, sizeof(err->msg));
    memset(err->obj, 0, sizeof(err->obj));
    memset(err->line, 0, sizeof(err->line));
}


/* initialize ASM_LEN */
void empty_asmlen(ASM_LEN *len){
    len->mem = 0;
    len->words = 0;
}

/* initialize ASMBL */
void empty_asm(ASMBL *asmbl){
    asmbl->ecode = 0;
    empty_err(&asmbl->err);
    empty_asmlen(&asmbl->len);
    memset(asmbl->mcode, 0, sizeof(asmbl->mcode));
    memset(asmbl->lines, 0, sizeof(asmbl->lines));
}

Now that we have defined the ASMBL structure, let’s write the assemble function. We need an input, which we already have from the io_read function, of type TBL that stores each line. We also need a pointer to the ASMBL structure to help the assemble function load its data into it. The assemble function will use a for loop to go through each line and process it.

C
void assemble(ASMBL *asmbl, TBL *input_tbl){
    empty_asm(asmbl);  // Initialize the 'ASMBL' struct
    OPR oprs;          // Tokens of the current line

    for(int i = 0; i < input_tbl->len; ++i){
        str_break(input_tbl->lines[i], &oprs);  // Load the line's tokens into 'oprs'

        // Define and load the opcode (bounded copy, so a long token cannot overflow it)
        char opcode[20];
        snprintf(opcode, sizeof(opcode), "%s", oprs.lines[0]);

        asmbl->err.lnum = i + 1;   // Set the lnum in ASM_ERR to the current line
        // Load the current line into ASM_ERR (truncated if longer than ASM_LINE)
        snprintf(asmbl->err.line, sizeof(asmbl->err.line), "%s", input_tbl->lines[i]);

        // Check whether the opcode is "GOTO"
        if(strcmp(opcode, "GOTO") == 0){
            char *endptr;

            // Extract the numeric value of lines[1] (base 10)
            long result = strtol(oprs.lines[1], &endptr, 10);

            // Valid only if digits were read and nothing is left over ("0" is accepted)
            if(endptr != oprs.lines[1] && *endptr == '\0'){
                int code = 0b101000000000 | (int)(result & 0x1FF);
                asmbl->mcode[asmbl->len.words++] = code;
            } else {

                // Load ASM_ERR and stop, so later lines cannot overwrite the error
                strcpy(asmbl->err.msg, "Invalid Operand");
                strcpy(asmbl->err.obj, oprs.lines[1]);
                asmbl->ecode = 1;
                return;
            }
        } else {

            // Load ASM_ERR and stop, so later lines cannot overwrite the error
            strcpy(asmbl->err.msg, "Invalid Opcode");
            strcpy(asmbl->err.obj, opcode);
            asmbl->ecode = 1;
            return;
        }
    }
}

This is what the assemble function looks like, but there are some problems. First, it only handles the GOTO opcode. There is no dedicated function to update the ASM_ERR field in the ASMBL structure, so we have to handle it manually for each error occurrence. Additionally, the GOTO opcode only processes base-10 integers, but the assembler must be able to handle different formats, such as hexadecimal or even binary. Lastly, an empty line or a comment is reported as an invalid opcode, because the assembler cannot yet skip such lines.
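As an illustration of the second problem, a dedicated error helper could look roughly like the sketch below. The function set_error is a hypothetical name, and the struct is repeated here only so that the example compiles on its own.

```c
#include <stdio.h>

#define MAX_STR 256
#define ASM_LINE 128

/* Same layout as the ASM_ERR structure defined earlier */
typedef struct {
    int lnum;
    char msg[MAX_STR];
    char line[ASM_LINE];
    char obj[MAX_STR];
} ASM_ERR;

/* Hypothetical helper: fills every field of an ASM_ERR in one call.
   snprintf truncates instead of overflowing when a string is too long. */
void set_error(ASM_ERR *err, int lnum, const char *msg,
               const char *line, const char *obj) {
    err->lnum = lnum;
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
    snprintf(err->line, sizeof(err->line), "%s", line);
    snprintf(err->obj, sizeof(err->obj), "%s", obj);
}
```

Each error branch in assemble would then shrink to a single call such as set_error(&asmbl->err, i + 1, "Invalid Opcode", input_tbl->lines[i], opcode), followed by setting ecode.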

First, let’s start by handling useless lines and comments.

Challenge 5: Useless parts

In assembly language, everything after ; is treated as a comment, unless the ; appears inside single quotes.

ASM
; This is a comment in assembly language

It would be useful to have a function that updates the given line by trimming all its whitespaces and removing comments, allowing us to easily detect and compare lines using strcmp. Another advantage is that if the comment appears after the opcode, this method ensures the comment will not pass into str_break, resulting in cleaner operands.

C
// Detect empty line
if(strcmp(line, "") == 0){
    continue;
}

So, let’s write a function that removes all leading and trailing whitespaces from the given line (character array char *) and updates the line accordingly.

C
#include <ctype.h>  // isspace
#include <string.h> // strlen

void str_trim(char buff[]) {
    // If the buffer is NULL, exit the function
    if (buff == NULL) { 
        return; 
    }

    // Trim leading whitespace
    char *start = buff; // Pointer to traverse the beginning of the string
    while (isspace((unsigned char)*start)) { 
        start++; // Move the pointer forward while encountering whitespace
    }

    // If leading whitespace is found, shift the string to remove it
    if (start != buff) { 
        char *dst = buff; // Pointer to write the trimmed string
        while (*start) { 
            *dst++ = *start++; // Copy characters from start to destination
        }
        *dst = '\0'; // Null-terminate the trimmed string
    }

    // Trim trailing whitespace
    size_t len = strlen(buff); // Length after the leading whitespace was removed
    while (len > 0 && isspace((unsigned char)buff[len - 1])) {
        buff[--len] = '\0'; // Move backwards and replace trailing whitespace with null terminators
    }
}

The str_trim function will help us achieve this. Now, we need a function to remove comments from the given character array (char *).

C
/* skip_comment: remove comments */
void skip_comment(char buff[]) {
    int i = 0; // Index to traverse the character array
    int quote = 0; // Flag to track if inside a quote

    str_trim(buff); // Trim leading and trailing whitespace from the input string

    // Traverse the string character by character
    while (buff[i] != '\0') {
        // Toggle the quote flag if a single quote is encountered
        if (buff[i] == '\'') { 
            quote = quote == 0; // Toggle quote flag
        }

        // If a semicolon is found outside of quotes, terminate the string
        if (buff[i] == ';' && quote == 0) {
            buff[i] = '\0'; // Replace the semicolon with a null terminator
            break; // Exit the loop, as the comment has been removed
        }
        i++; // Move to the next character
    }
}

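The skip_comment function cuts the line at the first ; that is not inside single quotes, which removes the comment. It trims the line with str_trim beforehand; any whitespace left in front of the removed comment is cleaned up by calling str_trim again afterward.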

Now, we just need to add the str_trim and skip_comment functions into the assemble function’s loop.

C
    //...

    for(int i = 0; i < input_tbl->len; ++i){
        skip_comment(input_tbl->lines[i]);
        str_trim(input_tbl->lines[i]);
        if(strcmp(input_tbl->lines[i], "") == 0){ continue; }

    //...

Now that we are able to remove comments, it would be useful to remove commas (,) as well, because multi-operand opcodes separate their operands with commas. Replacing every comma outside single quotes with whitespace will greatly simplify processing for str_break. Additionally, if a line contains only commas, it will be ignored due to the str_trim and strcmp logic we already added. Let’s implement a function to replace specific characters.

C
/* char_replace: replaces all occurrences of 'src' with 'dst' in the given string 'buff',
   but skips characters inside single quotes. Returns 0 after completion. */
int char_replace(char buff[], char src, char dst) {
    int i = 0;        // Index to traverse the character array
    int quote = 0;    // Flag to track if inside a quote

    str_trim(buff);   // Trim leading and trailing whitespace from the input string

    // Traverse the string character by character
    while (buff[i] != '\0') {
        if (buff[i] == '\'') { 
            quote = quote == 0; // Toggle the quote flag when a single quote is encountered
        }

        // Replace 'src' with 'dst' if found outside of quotes
        if (buff[i] == src && quote == 0) {
            buff[i] = dst; // Perform the replacement
        }
        i++; // Move to the next character
    }

    return 0; // Return 0 after the operation is complete
}

The char_replace function takes a buffer of type char *, a src character, and a dst character. It loops through the buffer and replaces any occurrences of the src character with the dst character, but only if the src character is not between single quotes.

Now that we have implemented the char_replace function, let’s add it to the loop in the assemble function.

C
    //...

    for(int i = 0; i < input_tbl->len; ++i){
        char_replace(input_tbl->lines[i], ',', ' ');  // Replace commas with whitespace
        skip_comment(input_tbl->lines[i]);
        str_trim(input_tbl->lines[i]);
        if(strcmp(input_tbl->lines[i], "") == 0){ continue; }

    //...

Now we have managed to remove all the unnecessary parts of our code using these functions.
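To check the combined effect of these steps on a single line, here is a compact, self-contained sketch that does the same cleanup in one pass. It is an alternative written for illustration, not the code used in this post, and clean_line is a hypothetical name.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical one-pass variant of the steps above: outside single quotes,
   commas become spaces and a ';' ends the line; the result is then trimmed
   on both sides. */
void clean_line(char line[]) {
    int in_quote = 0;   /* 1 while between two single quotes */
    size_t start = 0;   /* index of the first non-space character */
    size_t len;

    for (size_t i = 0; line[i] != '\0'; ++i) {
        if (line[i] == '\'') {
            in_quote = !in_quote;
        } else if (!in_quote && line[i] == ',') {
            line[i] = ' ';
        } else if (!in_quote && line[i] == ';') {
            line[i] = '\0';   /* cut the comment off */
            break;
        }
    }

    /* Trim trailing whitespace */
    len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1])) {
        line[--len] = '\0';
    }

    /* Trim leading whitespace by shifting the rest to the front */
    while (isspace((unsigned char)line[start])) {
        start++;
    }
    memmove(line, line + start, len - start + 1);
}
```

For example, a line such as MOVLW ';' ; load a semicolon keeps its quoted ; and loses only the comment, and a line that holds nothing but a comment becomes an empty string that the loop can skip.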

Challenge 6: Labels and EQUs

If you look at how “GOTO” behaves in the previous post, you’ll notice that we can pass labels for the GOTO address!

ASM
start:
    ; Do something
    GOTO end

end:
    GOTO start

A label stands for an address, just like 0x06 or 42. Notice, however, that GOTO must be able to reference not only a label defined before it (e.g., start) but also one defined after it (e.g., end), even though we haven’t reached that line in the loop yet!

To handle this, we need to determine these addresses before processing.

The same applies to EQU, but with a key difference: we don’t need the values of EQU until we encounter them in the code.

To achieve this, we need a set of functions to store and retrieve these addresses and EQU values from a list (or array). Let’s write these functions first, so we can implement a for loop before the main processing loop to handle labels and EQU definitions.
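The idea of such a first pass can be sketched on its own before we build the real storage functions. Everything in this example is an assumption made for illustration: the names LABEL and collect_labels, the limits, and the simplification that every cleaned line is either a label on its own line (ending in :) or an instruction that produces exactly one word.

```c
#include <string.h>

#define MAX_LABELS 16   /* assumed limit for this sketch */
#define NAME_LEN 32     /* assumed name length for this sketch */

typedef struct {
    char name[NAME_LEN];
    int address;
} LABEL;

/* Hypothetical first pass: records the address of every "name:" line and
   counts every other non-empty line as one word.
   Returns the number of labels found. */
int collect_labels(const char *lines[], int count, LABEL out[MAX_LABELS]) {
    int address = 0;   /* address of the next instruction word */
    int found = 0;

    for (int i = 0; i < count; ++i) {
        size_t len = strlen(lines[i]);
        if (len == 0) {
            continue;                       /* skip empty lines */
        }
        if (lines[i][len - 1] == ':') {
            /* A label takes no space; it names the next instruction */
            if (found < MAX_LABELS && len - 1 < NAME_LEN) {
                memcpy(out[found].name, lines[i], len - 1);
                out[found].name[len - 1] = '\0';
                out[found].address = address;
                found++;
            }
        } else {
            address++;                      /* one word per instruction */
        }
    }
    return found;
}
```

Run on the example above (with the comment replaced by one instruction), start gets address 0 and end gets address 2, so GOTO end can be resolved even though end is defined later in the file.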

First, let’s define a struct that allows us to store data like a dictionary, with a key and a value.

C
typedef struct {
    char key[MAX_STR];
    int value;
} DICT;

The DICT allows us to associate a value with a specific key. This is particularly useful for storing all the labels and EQUs.

We also need an enum to specify whether we want to store a label or an EQU, and we’ll call it elem_t, meaning element type.

C
typedef enum {
    EQU_ELEMENT,
    LABEL_ELEMENT,
} elem_t;

Now that we have defined our enum and structures, let’s define some static global variables, one for storing labels and one for storing EQUs.

We also need to define two int variables to keep track of each array.

C
/* EQU */
static DICT equ_arr [128];
static int equ_arr_len = 0;

/* LABEL */
static DICT label_arr [128];
static int label_arr_len = 0;

Now that the variables are defined, let’s start by writing our save_element function. But there’s an issue: nothing stops us from saving two labels with the same name, which would make lookups ambiguous later. Wouldn’t it be nice to have a function that helps us determine if the element already exists? Let’s write this function and call it elem_contains, which will take only the type (elem_t) and name as arguments. The value won’t matter for us because we only want to ensure that each name is unique.

C
/* element contains: checks if an element with the given name exists in the specified array */
int elem_contains(elem_t type, char name[]) {
    int i;

    // If the element type is EQU_ELEMENT, search in the equ_arr array
    if (type == EQU_ELEMENT) {
        for (i = 0; i < equ_arr_len; i++) {
            // If the name matches an existing key in equ_arr, return 1
            if (strcmp(equ_arr[i].key, name) == 0) {
                return 1;
            }
        }
    } else {
        // Otherwise, search in the label_arr array
        for (i = 0; i < label_arr_len; i++) {
            // If the name matches an existing key in label_arr, return 1
            if (strcmp(label_arr[i].key, name) == 0) {
                return 1;
            }
        }
    }

    return 0; // Return 0 if the element with the given name is not found
}

The elem_contains function looks at the specified array based on elem_t. If the name exists, it returns 1 (TRUE); if the name doesn’t exist, it returns 0 (FALSE).

Now that we can detect duplicate names, let’s write a function that stores an element in the specified array based on type. The function will take name and value as input. If the element already exists in the array, or the array is full, the function will return 1 (indicating failure). Otherwise, it will return 0 (indicating the element has been saved correctly).

C
int save_element(elem_t type, char name[], int value) {
    // Check if the element already exists in the corresponding array (either equ_arr or label_arr)
    if (elem_contains(type, name)) { 
        return 1; // Return 1 if the element already exists
    }

    // If the element type is EQU_ELEMENT, store the element in the equ_arr array
    if (type == EQU_ELEMENT) {
        if (equ_arr_len >= (int)(sizeof(equ_arr) / sizeof(equ_arr[0]))) {
            return 1; // Return 1 if the array is full
        }
        strcpy(equ_arr[equ_arr_len].key, name); // Copy the name to the key field
        equ_arr[equ_arr_len].value = value;     // Set the value for the element
        equ_arr_len++;                          // Increment the length of the equ_arr array
    } else {
        // If the element type is not EQU_ELEMENT, store it in the label_arr array
        if (label_arr_len >= (int)(sizeof(label_arr) / sizeof(label_arr[0]))) {
            return 1; // Return 1 if the array is full
        }
        strcpy(label_arr[label_arr_len].key, name); // Copy the name to the key field
        label_arr[label_arr_len].value = value;     // Set the value for the element
        label_arr_len++;                            // Increment the length of the label_arr array
    }

    return 0; // Return 0 to indicate the element has been successfully saved
}

Now that we can store our labels and EQUs correctly, we need a way to retrieve them. Let’s write the get_element function, which fetches the value associated with a given name from the array selected by type. We could hand the value back through a pointer, or return it directly. A label or an EQU can legitimately have a value of 0, so 0 cannot double as a “not found” signal. Instead, we return -1 when the element doesn’t exist in the specified array, because valid addresses and EQU values are never negative.

C
/* get_element: returns -1 if the element with the given name does not exist */
int get_element(elem_t type, char name[]) {
    // If the element does not exist in the corresponding array, return -1
    if (elem_contains(type, name) == 0) { 
        return -1; 
    }

    // Determine the maximum length based on the element type (either equ_arr or label_arr)
    int max = type == EQU_ELEMENT ? equ_arr_len : label_arr_len;

    // Loop through the appropriate array (either equ_arr or label_arr)
    for (int i = 0; i < max; ++i) {
        // If the element is of type EQU_ELEMENT, compare with the equ_arr array
        if (type == EQU_ELEMENT) {
            if (strcmp(equ_arr[i].key, name) == 0) { // Check if the name matches
                return equ_arr[i].value; // Return the value if found
            }
        } else {
            // If the element is not EQU_ELEMENT, compare with the label_arr array
            if (strcmp(label_arr[i].key, name) == 0) { // Check if the name matches
                return label_arr[i].value; // Return the value if found
            }
        }
    }

    return -1; // Return -1 if the element is not found in the array
}
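
The pointer-based alternative mentioned above could be sketched roughly as follows. This is only an illustration and not the version used in the rest of this post: the DICT layout (a key buffer plus an int value), the KEY_MAX size, and the name find_element are all assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define KEY_MAX 64  /* assumed key size; the real DICT may differ */

/* Assumed layout of a dictionary entry (illustrative only) */
typedef struct {
    char key[KEY_MAX];
    int value;
} DICT;

/* find_element (hypothetical): writes the value through 'out' and
 * returns 1 if 'name' exists in 'arr', otherwise returns 0. */
int find_element(const DICT *arr, int len, const char *name, int *out) {
    for (int i = 0; i < len; i++) {
        if (strcmp(arr[i].key, name) == 0) {
            if (out != NULL) {
                *out = arr[i].value;
            }
            return 1;
        }
    }
    return 0;
}
```

With this shape, the return value only says whether the name was found, so a stored value of 0 (or even a negative one) is never confused with “not found”. The price is an extra argument at every call site, which is why returning -1 directly is the simpler choice here.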

Now that we have our functions related to storing EQU and labels, let’s detect them in our assemble function.

Challenge 7: Preprocessing for Labels and EQUs

Because processing each line requires having the EQUs and labels, it makes sense to use another for loop in the assemble function, just before the main processing loop (the one used for GOTO). The loop should behave similarly to the main loop (skipping empty lines, comments, etc.), but it must detect the EQU keyword and check for : at the end of the line to determine if it’s a label. If it’s not a label or an EQU, we skip it; otherwise, we attempt to store it. If storing fails (returns 1), we throw an error and terminate the program because duplicate labels or EQUs with the same name are not allowed.

Since we are going to report detailed errors, a small helper is useful. Let’s call it update_err: it updates the ASM_ERR in the ASMBL structure and sets the exit code.

C
void update_err(ASMBL *asmbl, const char *msg, const char *obj){
    // Set message (msg) if possible
    if(msg != NULL){
        strcpy(asmbl->err.msg, msg);
    }
    // Set object (obj) if possible
    if(obj != NULL){
        strcpy(asmbl->err.obj, obj);
    }
    // Set exit code to 1 (error)
    asmbl->ecode = 1;
}

First, let’s implement the detection for EQU because it’s simpler and doesn’t require checking for : at the end. For this, we use the strstr function from the C standard library (declared in string.h).

The process is straightforward: each EQU line follows the same structure. First, there is the name, followed by the EQU keyword, and finally, the value, like so:

GPIO EQU 6

For this, we can use the str_break function we created earlier to extract the operands from the given line. Now it’s clear how useful the str_break function is.

C
    // ...

    for(i = 0; i < tbl.len; ++i){

        // Skip empty lines
        char_replace(tbl.lines[i], ',', ' ');
        skip_comment(tbl.lines[i]);
        str_trim(tbl.lines[i]);
        if(strcmp(tbl.lines[i], "") == 0){ continue; }

        // Check for EQU
        if(strstr(tbl.lines[i], " EQU ") != NULL){
            str_break(tbl.lines[i], &oprs);
            int value = atoi(oprs.lines[2]);  // Convert array of char to int
            int failed = save_element(EQU_ELEMENT, oprs.lines[0], value);
            if(failed){
                update_err(asmbl, "EQU already exists", oprs.lines[0]);
                return;
            }
            continue;
        }

        // Check for Label
        // ...
    }

    // ...

    for(i = 0; i < tbl.len; ++i){
        // ...

Notice that we used atoi, a standard C function that only converts decimal strings to integers, like "255" to 255. However, the value of an EQU can also be binary, like 0b00000110; hexadecimal in more than one form, such as 0x06 or 06H; or plain decimal. Sometimes it can even be an ASCII character, such as 'H' or 'A' (which is why we check for quotes, as explained before). Additionally, an operand may itself be an EQU name, as in MOVWF GPIO, where GPIO is a previously defined EQU constant.
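
To make the limitation concrete, here is a small sketch. atoi stops at the first non-digit and gives no indication that it did, so "0x06" silently becomes 0. Using strtol with its end pointer lets us reject such input instead. The helper name parse_decimal is hypothetical and only serves this example.

```c
#include <assert.h>
#include <stdlib.h>

/* parse_decimal (hypothetical helper): returns 1 and stores the value in
 * 'out' only if the whole string is a decimal number; returns 0 otherwise. */
int parse_decimal(const char *str, long *out) {
    char *end;
    long value = strtol(str, &end, 10);
    if (end == str || *end != '\0') {
        return 0;  /* no digits at all, or trailing characters such as "x06" */
    }
    *out = value;
    return 1;
}
```

Here parse_decimal("255", &v) succeeds, while parse_decimal("0x06", &v) and parse_decimal("", &v) both fail. By contrast, atoi("0x06") simply returns 0, which is indistinguishable from a real zero.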

Wouldn’t it be nice to have a function for that? A function that behaves similarly to the get_element function—returning a negative value like -1 if any error occurs and a valid value (>= 0) otherwise. Let’s call this function extract_value. However, before we start writing it, we must implement a set of functions to detect each numeric type as explained.

Let’s start with detecting characters, as it’s simple. We check the length and use the sscanf function from the C standard library to extract the character. If it’s an escape sequence like \n, we use a switch-case statement to map it to the corresponding character value. If no character is found, we return '\0', which is a value of 0.

Let’s call the function quoted_letter.

C
char quoted_letter(char *str) {
    char result = '\0';
    char temp;

    if(sscanf(str, "'%c'", &temp) == 1 && (int)strlen(str) == 3){
        result = temp;
    } else if(sscanf(str, "'\\%c'", &temp) == 1 && (int)strlen(str) == 4){
        switch (temp) {
            case 'n':
                result = '\n';
                break;

            case 't':
                result = '\t';
                break;

            case '\\':
                result = '\\';
                break;

            default:
                result = '\0';
                break;
        }
    }

    return result;
}

Now that we know how to detect character values, let’s dive into hex. We can create a function called hsti, which stands for “hex string to integer.” It expects the H-suffixed form, such as 06H: the loop stops one character before the end of the string so that the trailing H is skipped.

C
/* hsti: converts a hexadecimal string to an integer */
int hsti(const char *hexstr) {
    int result = 0;              // Stores the final converted integer
    int length = strlen(hexstr); // Get the length of the hexadecimal string

    // Iterate through each character in the string except the last one
    for (int i = 0; i < length - 1; i++) {
        int digit = hcti(hexstr[i]);    // Convert the current hexadecimal character to its integer value
        result = (result << 4) | digit; // Shift result by 4 bits and add the new digit
    }

    return result; // Return the converted integer
}
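
hsti calls a helper named hcti (“hex character to integer”) that is not listed in this section. A minimal sketch of such a helper is shown below. The convention of returning -1 for a character that is not a hex digit is an assumption made for this example; the version used in the project may behave differently.

```c
#include <assert.h>

/* hcti: hex character to integer; returns -1 for a non-hex character
 * (the -1 convention is an assumption made for this sketch) */
int hcti(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}
```

With a convention like this, hsti could check each digit and return -1 as soon as hcti fails, so that a word that merely ends in H is not mistaken for a hex number.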

Next, we need to detect an 8-bit binary value that starts with 0b and extract its value. Detection works a bit differently here: we check for 0b at the beginning, make sure exactly eight characters follow, and verify that each of them is either 0 or 1. Keeping detection in its own function makes it easy to extend later. A second function then converts the binary string into an integer.

Let’s write the detect_8bit_binary function, which returns 1 (TRUE) if the given string (char *) is a valid binary and 0 (FALSE) if it’s not. After that, we can create the btoi function to help extract the integer value.

C
int detect_8bit_binary(char *input) {
    int i;

    // Check if the input starts with "0b"
    if (strncmp(input, "0b", 2) != 0){
        return 0;
    }

    // Check if the remaining part is 8 bits
    if (strlen(input) - 2 != 8){
        return 0;
    }

    // Check if all characters are either '0' or '1'
    for (i = 2; i < (int)strlen(input); i++){
        if (input[i] != '0' && input[i] != '1'){
            return 0;
        }
    }

    return 1; // Valid 8-bit binary pattern
}

Now that we are able to detect the string, let’s write a function to extract the value called btoi, which stands for binary to integer.

C
int btoi(const char *input){
    int result = 0;
    int power = 0;
    input += 2;  // Skip "0b"
    for (int i = strlen(input) - 1; i >= 0; i--) {
        if (input[i] == '1'){
            result |= (1 << power); // Use bitwise OR to accumulate the value
        }
        power++;
    }
    return result;
}
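
As a side note, the same conversion can be done with strtol and base 2, which also reports where parsing stopped. The sketch below is an alternative, not the function used in the rest of this post, and the name btoi_strtol is hypothetical. Unlike detect_8bit_binary, it does not insist on exactly eight digits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* btoi_strtol (hypothetical name): converts "0b" followed by binary digits;
 * returns -1 if the prefix is missing or a non-binary character appears. */
int btoi_strtol(const char *input) {
    if (strncmp(input, "0b", 2) != 0 || input[2] == '\0') {
        return -1;  /* no "0b" prefix, or nothing after it */
    }
    char *end;
    long value = strtol(input + 2, &end, 2);
    if (*end != '\0') {
        return -1;  /* parsing stopped before the end of the string */
    }
    return (int)value;
}
```

For example, btoi_strtol("0b00000110") gives 6, and btoi_strtol("0b0000011x") gives -1 because parsing stops at the x.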

Now that we have all of the necessary functions, let’s write extract_value. It takes a string (char *) and a flag, allow_equ, that controls whether the string may be resolved as an EQU name. This is useful because when we extract the value of an EQU definition itself, we don’t want previously defined EQU names to interfere.

C
int extract_value(char *inpt, int allow_equ) {
    if (allow_equ) {
        // Check if the input can be found as an EQU element
        int result = get_element(EQU_ELEMENT, inpt);
        if (result >= 0) {
            return result; // Return the value if found in EQU
        }
    }

    // Try interpreting as a quoted letter
    char ch = 0;
    if ((ch = quoted_letter(inpt)) != '\0') {
        return (int)ch; // Return the ASCII value of the quoted letter
    }

    // Try interpreting as a hexadecimal number (ending with 'H')
    int len = (int)strlen(inpt);
    if (len == 0) {
        return -1; // An empty string is never a valid value
    }
    if (inpt[len - 1] == 'H') {
        return hsti(inpt); // Convert the hex string to an integer
    }

    // Try interpreting as a decimal integer
    char *endptr;
    int num;
    num = strtol(inpt, &endptr, 10);
    if (strcmp(endptr, "") == 0 && (num >= 0 && num <= 255)) {
        return num; // Return the decimal value if valid
    }

    // Try interpreting as an 8-bit binary number
    if (detect_8bit_binary(inpt)) {
        return btoi(inpt); // Convert the binary string to an integer
    }

    // Try interpreting as hexadecimal (strtol with base 16 accepts an optional 0x/0X prefix)
    num = strtol(inpt, &endptr, 16);
    if (strcmp(endptr, "") == 0 && (num >= 0 && num <= 255)) {
        return num; // Return the hexadecimal value if valid
    }

    return -1; // Return -1 if none of the conditions match
}

Now, with the help of extract_value, we can update the first loop in the assemble function from using atoi to using extract_value.

C
        // ...

        // Check for EQU
        if(strstr(tbl.lines[i], " EQU ") != NULL){
            str_break(tbl.lines[i], &oprs);

            /* Detect EQU value using extract_value function */
            int value = extract_value(oprs.lines[2], 0);
            if(value < 0){
                update_err(asmbl, "Invalid EQU value", oprs.lines[2]);
                return;
            }

            int failed = save_element(EQU_ELEMENT, oprs.lines[0], value);
            if(failed){
                update_err(asmbl, "EQU already exists", oprs.lines[0]);
                return;
            }
            continue;
        }

        // ...

The detection for EQU is officially done. Now, we need to be able to detect labels in the loop.

The structure of labels is simple: a word followed by a : sign. The unique part is the : sign. We must check if a word contains : at the end. If it does, we detect it as a label; otherwise, it is not a label.

ASM
; label start
start:
    ; ...

; label end
end:
    ; ...

Now that we understand how labels work, we need a function to help us detect them.
Let’s write a function called char_contains that takes a buffer (char *) as the first argument and a char as the second argument, and checks if the char is contained in the buffer by looping through it.

C
int char_contains(char buff[], char c) {
    int i = 0;       // Index for iterating through the buffer
    int quote = 0;   // Flag to track if inside a quoted section

    str_trim(buff);  // Remove leading and trailing whitespace from the buffer

    while (buff[i] != '\0') { 
        if (buff[i] == '\'') { 
            // Toggle the quote flag when encountering a single quote
            quote = quote == 0;
        }
        if (buff[i] == c && quote == 0) {
            return i;  // Return the index if the character is found outside of quotes
        }
        i++;  // Move to the next character in the buffer
    }

    return -1;  // Return -1 if the character is not found
}

The char_contains function returns the index of the first occurrence of the specified character in the given buffer. If the character does not exist in the buffer, the function returns -1. This is useful for detecting if the last character in the buffer is equal to : by utilizing the strlen function provided by ANSI C.

Now, there is a problem: if we use only this function, we are merely detecting whether the : sign is present, and we store the whole string, including the :, in the labels array. The problem is that we don’t need the : at the end of the label. For example, we don’t want to use GOTO start: to set the value of GOTO. Therefore, we must remove the last character (:) from the line before storing it in the array.

Let’s call this simple function str_end. It removes end characters from the end of the string by inserting '\0' at the calculated position (length of string - end).

C
void str_end(char *buff, int end){
    int len = (int)strlen(buff);
    buff[len - end] = '\0';
}
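
Putting the two steps together (check that : is the last character, then cut it off), a compact sketch could look like this. The helper name label_name is hypothetical, and unlike char_contains it does not skip over quoted sections; it is only meant to illustrate the idea on an already-trimmed line.

```c
#include <assert.h>
#include <string.h>

/* label_name (hypothetical helper): if 'line' ends with ':' and has a name
 * before it, copies the name (without ':') into 'out' and returns 1.
 * Returns 0 otherwise. Quoted sections are not handled here. */
int label_name(const char *line, char *out, size_t outsz) {
    size_t len = strlen(line);
    if (len < 2 || line[len - 1] != ':') {
        return 0;           /* empty, a lone ':', or no trailing ':' */
    }
    if (strchr(line, ':') != line + len - 1) {
        return 0;           /* a ':' also appears before the end */
    }
    if (len - 1 >= outsz) {
        return 0;           /* name would not fit in 'out' */
    }
    memcpy(out, line, len - 1);
    out[len - 1] = '\0';    /* same effect as str_end(out, 1) */
    return 1;
}
```

Given "start:", this stores "start" in the output buffer and returns 1. Given "GOTO start" or "a:b:", it returns 0.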

Now we can effectively detect labels in our preprocess loop within the assemble function. However, there is a problem: we don’t know the address of the detected label to provide its value to the save_element function. To solve this, we need a way to track valid parts of the code that result in machine code (mcode).

To achieve this, let’s define an int variable called codes. This variable will keep track of non-comment, non-empty lines, and lines that are neither EQU nor labels. This will help us determine the value associated with each label.

C
    // ...

    int codes = 0;  // Keep track of valid codes (for label address)

    for(i = 0; i < tbl.len; ++i){

        // ...
        // Detect EQU
        // ...

        // Check for label
        int idx = 0;
        if((idx = char_contains(tbl.lines[i], ':')) >= 0){

            // Make sure that the last character is equal to ':'
            if(idx != (int)strlen(tbl.lines[i]) - 1){
                update_err(asmbl, "Invalid label syntax", tbl.lines[i]);
                return;
            }

            str_break(tbl.lines[i], &oprs);
            str_end(oprs.lines[0], 1);
            int failed = save_element(LABEL_ELEMENT, oprs.lines[0], codes);
            if(failed){
                update_err(asmbl, "Label already exists", oprs.lines[0]);
                return;
            }
            continue;
        }

        codes++;  // Add to 'codes' by 1, meaning 1 more valid code
    }

    // ...

    for(i = 0; i < tbl.len; ++i){
        // ...

By using strlen(), we can check whether the : is at the end of the string. If it’s not, we update the ASM_ERR structure with the appropriate error. After that, we save the label (without the : at the end) to the labels array using save_element. If the save_element function returns 1, we terminate the program with the provided error. This process prevents the double definition of labels, similar to how we handled EQU.
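
To see how the codes counter turns into label addresses, here is a self-contained sketch of the same counting idea. It works on an array of lines that are assumed to be already cleaned (trimmed, comments removed, commas replaced). The helper name label_address is hypothetical and is not part of the assembler itself.

```c
#include <assert.h>
#include <string.h>

/* label_address (hypothetical helper): returns the instruction address that
 * 'label' refers to in an array of already-cleaned lines, or -1 if the
 * label is not defined. EQU lines and label lines do not occupy an address. */
int label_address(const char *lines[], int count, const char *label) {
    int codes = 0;                       /* instructions seen so far */
    size_t want = strlen(label);

    for (int i = 0; i < count; i++) {
        size_t len = strlen(lines[i]);
        if (len == 0 || strstr(lines[i], " EQU ") != NULL) {
            continue;                    /* produces no machine code */
        }
        if (lines[i][len - 1] == ':') {  /* a label line */
            if (len - 1 == want && strncmp(lines[i], label, want) == 0) {
                return codes;            /* address of the next instruction */
            }
            continue;
        }
        codes++;                         /* a real instruction */
    }
    return -1;
}
```

For the lines "GPIO EQU 0x06", "start:", "BSF GPIO 0", "NOP", "end:", "GOTO start", the label start resolves to address 0 and end resolves to address 2, because only BSF and NOP were counted before it.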

Challenge 8: Better Way To Handle Opcodes

Now that we have all the labels and EQUs, it’s time to process more opcodes, not just the GOTO opcode. We used the second loop in the assemble function to process opcodes, but we must expand it if we want to handle more opcodes. However, there is a problem: if we use strcmp with if-else statements to detect each opcode, the code will become messy. A more efficient solution would be to provide an array of names with their corresponding handlers. When we reach a name, we can call the handler to get the machine code. This approach is far cleaner and more efficient than the if-else method. Let’s create such a structure.

First, we need a structure to help us with this task. It should have a label and a function pointer to allow us to call the correct handler.

But what parameters should we give to the function pointer (handler)? Since we need to update errors, we require the ASMBL structure. Therefore, we will pass a pointer to ASMBL. We also have operands, which are generated by str_break, so we need to pass those as well. However, there’s a problem. The str_break function breaks down the entire line, but we only need the operands. To address this, we must write a function that shifts all operands one place to the left and removes the first operand (which is the opcode). We already store the opcode in the opcode character array (char *).

Additionally, we need a TBL to store the unmodified, exact same lines. This will allow us to update the line number (lnum) and line error (line) in the ASM_ERR field of the ASMBL structure.

So before we dive in further, let’s define the structure to make things clearer, and we will call it OP_HNDL.

C
typedef struct OP_HNDL {
    char *lable;
    int (*func)(ASMBL *, OPR *);
} OP_HNDL;

Now that we have the OP_HNDL structure, let’s create a simple handler for GOTO. However, before that, we need to provide operands by removing the first element, which is the opcode itself, leaving only the operands.

C
void shift_lines_left(OPR *tbl) {
    if (tbl == NULL || tbl->len <= 0) return; // Handle null pointer or empty lines
    for (int i = 1; i < tbl->len; i++) {
        memcpy(tbl->lines[i - 1], tbl->lines[i], MAX_STR); // Move line i to i-1
    }
    memset(tbl->lines[tbl->len - 1], 0, MAX_STR); // Clear the last line
    tbl->len--; // Decrease the length of lines
}


void copy_shift_oprs(OPR *dst, OPR *src) {
    int i;
    for(i = 0; i < src->len; ++i) {
        strcpy(dst->lines[i], src->lines[i]); // Copy lines from src to dst
    }
    dst->len = src->len; // Set length of dst to match src

    shift_lines_left(dst); // Shift lines in dst to the left
}

There are two functions here: shift_lines_left and copy_shift_oprs. The shift_lines_left function shifts all of the lines to the left by one and decrements len in the OPR structure. The copy_shift_oprs function copies the src OPR to the dst OPR and then calls shift_lines_left on dst. This removes the first element of the operands, which is the opcode itself.

Now that we are able to update the assemble function, let’s create a handler for GOTO first. We can add more handlers later in the post for each opcode.

C
/* {GOTO} */
int handle_goto(ASMBL *asmbl, OPR *operands){
    char *label = operands->lines[0];
    int lvalue = get_element(LABEL_ELEMENT, label);
    if(lvalue >= 0){
        return 0xA00 | lvalue;  // 0b101000000000
    }
    lvalue = extract_value(label, 1);
    if(lvalue < 0){
        update_err(asmbl, "Invalid label", label);
        return -1;
    }
    return 0xA00 | lvalue;  // 0b101000000000
}

It’s good that all of our handlers follow the same naming scheme: handle_ followed by the opcode’s name, like handle_goto. They also share the return convention we already used for get_element: -1 indicates an error, and a value >= 0 is a valid instruction.

The handle_goto function first checks the labels array. If the label is not found (e.g., GOTO start where start is not defined as a label), it uses the extract_value function to check for different types of values such as hex, binary, or decimal. Finally, it returns the generated opcode.
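
The encoding step itself is just a bitwise OR of the identifier and the address. The sketch below isolates that step and adds a range check for the 9-bit address field. The helper name encode_goto and the range check are assumptions for illustration; handle_goto above does not perform this check.

```c
#include <assert.h>

/* encode_goto (hypothetical helper): builds the 12-bit GOTO word from a
 * 9-bit address; returns -1 if the address does not fit in 9 bits. */
int encode_goto(int addr) {
    if (addr < 0 || addr > 0x1FF) {
        return -1;
    }
    return 0xA00 | addr;  /* 0b101 in the top three bits, address below */
}
```

For example, encode_goto(0x03) yields 0xA03, which is 0b101000000011, the same value as the GOTO 0x03 example at the beginning of this post.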

Now let’s use the handle_goto in the assemble function. Just after processing the labels and the EQU loop, and before the main process loop, let’s create an array of type OP_HNDL to store our handlers like below, and a value called oplen to help us determine the length of our array.

C
    // ...

    OP_HNDL hndls[] = {
        {"GOTO", handle_goto},
        // more handlers...
    };

    int oplen = sizeof(hndls) / sizeof(hndls[0]);  // length of handlers array

    // ...

Now we update our second (main) loop!

C
static TBL tbl;  // Original lines
static OPR opr;  // Operands


void assemble(ASMBL *asmbl, TBL *input_tbl){

    // Clear 'tbl' and load 'input_tbl' to 'tbl'
    tbl.len = 0;
    memset(tbl.lines, 0, sizeof(tbl.lines));
    copytbl(&tbl, input_tbl);

    int i = 0;
    OPR oprs;

    // The label and EQU preprocessor

    OP_HNDL hndls[] = {
        {"GOTO", handle_goto},
        // More opcodes
    };
    int oplen = sizeof(hndls) / sizeof(hndls[0]);  // length of handlers array


    for(i = 0; i < tbl.len; ++i){
        skip_comment(tbl.lines[i]);
        str_trim(tbl.lines[i]);
        if(strcmp(tbl.lines[i], "") == 0){ continue; }          // Skip empty line
        if(strstr(tbl.lines[i], " EQU ") != NULL){ continue; }  // Skip EQU
        if(char_contains(tbl.lines[i], ':') >= 0){ continue; }  // Skip label


        int j;            // For checking opcodes
        int opfound = 0;  // any OPcode FOUND

        // Update 'lnum' and 'line' in ASM_ERR
        strcpy(asmbl->err.line, input_tbl->lines[i]);
        asmbl->err.lnum = i + 1;

        // define variable 'opcode'
        str_break(tbl.lines[i], &oprs);
        char opcode[20];
        strcpy(opcode, oprs.lines[0]);

        // A loop for checking opcodes
        for(j = 0; j < oplen; j++){

            // check the opcode
            if(strcmp(hndls[j].lable, opcode) == 0){
                opfound = 1;  // Set `opfound` to 1 (match opcode found)

                // clear 'opr' and remove the first item (opcode)
                opr.len = 0;
                memset(opr.lines, 0, sizeof(opr.lines));
                copy_shift_oprs(&opr, &oprs);

                // Call the handler
                int instruction = hndls[j].func(asmbl, &opr);

                if(instruction >= 0){
                    // Add machine code to 'mcode'
                    asmbl->mcode[asmbl->len.words] = instruction;
                    asmbl->len.words++;
                } else {
                    // An error happened
                    update_err(asmbl, "Failed to process opcode", opcode);
                    asmbl->ecode = 1;
                    return;
                }

            }
        }

        // End the program if no matching opcode was found
        if(opfound == 0){
            update_err(asmbl, "Invalid opcode", oprs.lines[0]);
            return;
        }

    }
}

We define a variable called opfound to track whether the opcode on the current line matched any handler. If it is still 0 after the inner loop, the word is not a valid instruction and we report an error. When a handler does match, we call it and check the result: if the instruction is 0 or positive, we store it in mcode and increment words. If it is negative, we report that the opcode could not be processed.

Challenge 9: Providing Verbose Log

In our assemble function, we used the following line to add machine code:

C
asmbl->mcode[asmbl->len.words] = instruction;

But the ASMBL structure also has a lines property. Wouldn’t it be nice to update the lines too, now that we have all the operands, machine code, etc.?

So, let’s provide a set of functions to help us do that!

Input file:

ASM
GPIO EQU 0x06
start:
    BSF GPIO, 0
    NOP
    BCF GPIO, 0
    GOTO start

Output (verbose log):

ASM
BSF 0x06 0                 0b010100000110
NOP                        0b000000000000
BCF 0x06 0                 0b010000000110
GOTO 0x00                  0b101000000000

When viewing the binary, it is much more helpful to see the instruction it came from next to it. So, let’s write a function that joins the operands into one string and converts the first one to its numeric value.

C
#include <stdarg.h>

/* sstrcatf: formatted strcat using stdarg.h */
void sstrcatf(char* dst, const char * frmt, ...){
    char tmp[MAX_STR];
    va_list arglist;
    va_start(arglist, frmt);
    vsnprintf(tmp, sizeof(tmp), frmt, arglist);  // Bounded, so 'tmp' cannot overflow
    va_end(arglist);
    strcat(dst, tmp);
}


void strfy_inst(OPR *ops, char buff[]){
    // Check for numeric value
    int first = extract_value(ops->lines[0], 1);

    if(first == -1){
        // Check for label name
        first = get_element(LABEL_ELEMENT, ops->lines[0]);
    }

    // Set to 0 if it's not label and it's not a numeric value
    if(first == -1){ first = 0; }

    // Update the buffer using 'sstrcatf'
    if(ops->len == 1){
        sstrcatf(buff, "0x%.2X", first);
    } else if(ops->len == 2){
        sstrcatf(buff, "0x%.2X %s", first, ops->lines[1]);
    }
}

The strfy_inst function takes a pointer to the operands (OPR *) and a buffer (char buff[]), joins the operands together, and appends the result to the buffer. The first operand is converted to its numeric value because that is more helpful in a log. The sstrcatf function appends to a string like strcat, but accepts a format like printf; it uses the standard stdarg.h header to handle the variable arguments.

Now we need to convert the machine code to a 12-bit binary string. Let’s write a function to do that. The string consists of the 0b prefix, 12 binary digits, and the terminating \0, so we need a buffer with a size of 2 + 12 + 1, totaling 15. Let’s call the function itob, for integer-to-binary.

C
/* itob: integer to binary */
void itob(int num, char *binary) {
    binary[0] = '0';
    binary[1] = 'b';
    for (int i = 11; i >= 0; i--) {
        binary[13 - i] = (num & (1 << i)) ? '1' : '0';
    }
    binary[14] = '\0'; // Null-terminate the string
}

This function will convert the given number num to binary and update the buffer binary. For example, if the input is 255, the updated buffer will be: 0b000011111111.

Now we use these functions to generate some verbose logs, but we don’t get aligned output. For example, if the input is:

ASM
GPIO EQU 0x06
start:
    BSF GPIO, 0
    NOP
    BCF GPIO, 0
    GOTO start

The output will be:

ASM
BSF 0x06 0                 0b010100000110
NOP                    0b000000000000
BCF 0x06 0                 0b010000000110
GOTO 0x00                 0b101000000000

It would be nice to pad every instruction string with spaces up to a fixed width, so the binary column always starts at the same position.

Let’s write a function to do that. We can call it fill_space; it takes a buffer and a numeric value that specifies the width to fill.

C
void fill_space(char *buff, int len){
    for(int i = 0; i < len; i++){
        if(buff[i] == '\0'){
            buff[i] = ' ';
        }
    }
}

We initialize our line buffers with = { 0 }, which sets every byte to '\0'. fill_space then replaces the '\0' characters in the first len positions with spaces. Because the rest of the buffer is still zero, the string stays null-terminated, as long as the buffer is larger than len.
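
As an aside, the printf family can produce the same padding on its own: a - flag left-justifies a field, and * takes the field width from an argument. The sketch below shows this alternative. The helper name format_log_line is hypothetical, and this is not the approach used in the rest of this post.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* format_log_line (hypothetical helper): left-justifies 'text' in a column
 * of 'width' characters and appends 'bin' after a single space. */
void format_log_line(char *out, size_t outsz, const char *text,
                     int width, const char *bin) {
    /* "%-*s" pads on the right with spaces up to 'width' characters */
    snprintf(out, outsz, "%-*s %s", width, text, bin);
}
```

Calling it with the text "NOP", a width of 8, and the binary string "0b000000000000" produces NOP followed by enough spaces to fill 8 columns, one separating space, and then the binary value. Writing fill_space by hand, as we do here, has the advantage of showing exactly what happens to the buffer.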

Now, let’s update the assemble function’s loop where it updates the machine code to also update the lines.

C
            // ...

            if(instruction >= 0){

                asmbl->mcode[asmbl->len.words] = instruction;

                // Update verbose line
                char line[MAX_STR] = { 0 };
                char bin[15] = { 0 };
                strfy_inst(&opr, line);
                itob(instruction, bin);
                char prefix[MAX_STR] = { 0 };
                sstrcatf(prefix, "%s %s", opcode, line);
                fill_space(prefix, 20);
                sprintf(asmbl->lines[asmbl->len.words], "%s %20s", prefix, bin);

                asmbl->len.words++;
            } else {
                // ...

Challenge 10: Handler for Opcodes with No Operands

There are some opcodes that don’t actually need any operands, and the machine code is just a fixed number every time. For example, NOP, CLRW, SLEEP, etc.

So writing handlers for them shouldn’t be that hard. Each handler just returns a fixed value. However, every handler must still accept an ASMBL * and an OPR * so that it matches the function pointer type in OP_HNDL. We declare these parameters but never use them; without them, the handlers could not be stored in the handlers array.

C
/* {CLRWDT} */
int handle_clrwdt(ASMBL *_, OPR *__){
    return 0x04;  // 0b000000000100
}


/* {NOP} */
int handle_nop(ASMBL *_, OPR *__){
    return 0x000;  // 0b000000000000
}


/* {SLEEP} */
int handle_sleep(ASMBL *_, OPR *__){
    return 0x003;  // 0b000000000011
}


/* {CLRW} */
int handle_clrw(ASMBL *_, OPR *__){
    return 0x040;  // 0b000001000000
}

/* {OPTION} */
int handle_option(ASMBL *_, OPR *__){
    return 0x002;  // 0b000000000010
}

We can add them to the array of handlers in the assemble function.

C
    OP_HNDL hndls[] = {
        // ...

        {"NOP", handle_nop},
        {"SLEEP", handle_sleep},
        {"CLRW", handle_clrw},
        {"CLRWDT", handle_clrwdt},
        {"OPTION", handle_option},

        // ...
    };
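
Since these five handlers differ only in the number they return, the same information could also be kept in a small lookup table. The sketch below shows that alternative using the opcode values listed above. The FIXED_OP type and the fixed_opcode function are hypothetical names introduced for this example; the rest of this post keeps using one handler per opcode.

```c
#include <assert.h>
#include <string.h>

/* One entry per opcode that takes no operands (values as listed above) */
typedef struct {
    const char *name;
    int code;
} FIXED_OP;

static const FIXED_OP fixed_ops[] = {
    {"NOP",    0x000},
    {"OPTION", 0x002},
    {"SLEEP",  0x003},
    {"CLRWDT", 0x004},
    {"CLRW",   0x040},
};

/* fixed_opcode (hypothetical helper): returns the machine code for a
 * no-operand opcode, or -1 if 'name' is not in the table. */
int fixed_opcode(const char *name) {
    int count = (int)(sizeof(fixed_ops) / sizeof(fixed_ops[0]));
    for (int i = 0; i < count; i++) {
        if (strcmp(fixed_ops[i].name, name) == 0) {
            return fixed_ops[i].code;
        }
    }
    return -1;
}
```

The trade-off is uniformity: with separate handlers, every opcode goes through the same OP_HNDL mechanism, which keeps the main loop free of special cases.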

Challenge 11: Handler for Opcodes with Destination

Some opcodes take a register address and a destination bit. They all share the same layout, so instead of repeating the logic in every handler, it would be nice to have one function that builds the machine code from an identifier.

There is one more thing to remember. Earlier in the post, we defined the ASM_LEN structure with two attributes: words, and mem for memory usage. Opcodes with a destination also carry a register address, so this is the right place to count how many unique addresses the program uses. For that, we need something that behaves like save_element and only stores an address once.

So first, let’s write a set of functions to track memory. We need a function that returns the index of a value if it exists in the array and a negative value otherwise (the same mechanism as before); let’s call it get_mem_idx. We need another function that saves an address if it is not stored yet; let’s call it add_to_mem. This is the one we’ll use in our handlers. Lastly, we need the total number of stored addresses for mem in the ASM_LEN structure, so let’s call the last function get_used_mem.

Let’s start by writing the get_mem_idx function and defining some global variables for that.

C
static int used_mem[MAX_STR] = { 0 };
static int used_mem_idx = 0;

/* get_mem_idx: return negative if failed */
int get_mem_idx(int val){
    for(int i = 0; i < used_mem_idx; ++i){
        if(val == used_mem[i]){
            return i;
        }
    }
    return -1;
}

Because every address is numeric, we defined a static integer array called used_mem and a counter called used_mem_idx that tracks how many entries are in use.

The second function, called add_to_mem, is responsible for adding the unique address to the array. It uses get_mem_idx, and if the result is -1 (negative), it adds the address to the memory.

C
void add_to_mem(char *v){
    int result = extract_value(v, 1);
    if(result >= 0){
        int midx = get_mem_idx(result);
        if(midx == -1 && used_mem_idx < MAX_STR){  // Store only new addresses, and only while there is room
            used_mem[used_mem_idx++] = result;
        }
    }
}

And lastly, we need a function to get the total number of unique addresses, which we will call get_used_mem. This function simply returns the used_mem_idx.

C
int get_used_mem(void){
    return used_mem_idx;
}
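
To see the three functions working together, here is a minimal, self-contained sketch. It takes already-parsed integer addresses instead of operand strings, so it does not call extract_value. The names find_addr, track_addr, tracked_count, and TRACK_MAX are hypothetical stand-ins and not part of the assembler.

```c
#define TRACK_MAX 32  /* assumed capacity: a 5-bit address has 32 possible values */

static int tracked[TRACK_MAX];
static int tracked_len = 0;

/* find_addr: return the index of addr, or -1 if it was never stored */
int find_addr(int addr){
    for(int i = 0; i < tracked_len; ++i){
        if(tracked[i] == addr){ return i; }
    }
    return -1;
}

/* track_addr: store addr only if it is new and there is room left */
void track_addr(int addr){
    if(addr >= 0 && find_addr(addr) == -1 && tracked_len < TRACK_MAX){
        tracked[tracked_len++] = addr;
    }
}

/* tracked_count: number of unique addresses stored so far */
int tracked_count(void){
    return tracked_len;
}
```

Feeding it 0x06, 0x0A, 0x06, and 0x02 leaves three entries, because the repeated 0x06 is ignored.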

Now that we have add_to_mem, we can write the handlers that have a destination. For the destination bit itself, we must accept 0, W, or w as 0, and 1, F, or f as 1. A function to help us with this would be useful.

We can simply use a switch/case for this and call the function check_dist. If the output is -1 (negative), it indicates an invalid destination bit.

C
int check_dist(char *inpt){
    if((int)strlen(inpt) != 1){
        return -1;
    }
    switch (inpt[0]){
        case '1': case 'F': case 'f':
            return 1;
        case '0': case 'W': case 'w':
            return 0;
        default:
            return -1;
    }
    return -1;
}

Finally, the handler for opcodes that have a destination would be something like this:

C
int check_op_num(ASMBL *asmbl, OPR *operands, int len){
    if(operands->len != len){
        update_err(asmbl, "Incorrect amount of operands", "");
        return 1;
    }
    return 0;
}


int set_dist_code(ASMBL *asmbl, OPR *operands, int code){
    if(check_op_num(asmbl, operands, 2)){ return -1; }

    int addr;
    if((addr = extract_value(operands->lines[0], 1)) < 0){
        update_err(asmbl, "Invalid register", operands->lines[0]);
        return -1;
    }

    int dist;
    if((dist = check_dist(operands->lines[1])) < 0){
        update_err(asmbl, "Invalid destination", operands->lines[1]);
        return -1;
    }

    add_to_mem(operands->lines[0]);
    return code | (dist << 5) | addr;
}

This function takes a pointer to ASMBL so it can update ASM_ERR if needed, the OPR list that we extracted and shifted, and lastly, a code, which is the identifier for our machine code. As mentioned before, all of the opcodes that contain a destination share the same structure and differ only in their identifiers, so it is helpful to pass the identifier in as code. Finally, the function returns a value >= 0 if everything goes fine; otherwise, it returns a negative value, because an instruction can never be negative, as also mentioned before for the behavior of our OP_HNDL structure.

To check if the number of operands is exactly 2 (and not less or more), a function to help us detect this would be awesome. The function check_op_num will help us to detect such cases and return 1 if the number of operands doesn’t match, and 0 if it does.
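
The packing step at the end of set_dist_code can be tried on its own. The sketch below is a simplified stand-in that takes numbers instead of operand strings. The name encode_fd is hypothetical, and the range check on the address is an extra assumption that set_dist_code itself does not make.

```c
/* encode_fd: pack identifier, destination bit, and 5-bit address into one
 * 12-bit word. Returns -1 when d or f is out of range (hypothetical helper). */
int encode_fd(int code, int d, int f){
    if(d != 0 && d != 1){ return -1; }
    if(f < 0 || f > 0x1F){ return -1; }
    return code | (d << 5) | f;
}
```

For example, INCF 0x0A, 1 uses the identifier 0x280, so the word is 0x280 | (1 << 5) | 0x0A, which is 0x2AA (0b001010101010).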

Now we can write our handlers for each opcode that contains a destination bit.

C
/* {DECF} */
int handle_decf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x0C0);  // 0b000011000000
}

/* {DECFSZ} */
int handle_decfsz(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x2C0);  // 0b001011000000
}


/* {INCF} */
int handle_incf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x280);  // 0b001010000000
}

/* {INCFSZ} */
int handle_incfsz(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x3C0);  // 0b001111000000
}

/* {ADDWF} */
int handle_addwf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x1C0);  // 0b000111000000
}

/* {ANDWF} */
int handle_andwf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x140);  // 0b000101000000
}


/* {COMF} */
int handle_comf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x240);  // 0b001001000000
}

/* {IORWF} */
int handle_iorwf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x100);  // 0b000100000000
}


/* {MOVF} */
int handle_movf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x200);  // 0b001000000000
}


/* {RLF} */
int handle_rlf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x340);  // 0b001101000000
}

/* {RRF} */
int handle_rrf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x300);  // 0b001100000000
}

/* {SUBWF} */
int handle_subwf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x080);  // 0b000010000000
}

/* {SWAPF} */
int handle_swapf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x380);  // 0b001110000000
}

/* {XORWF} */
int handle_xorwf(ASMBL *asmbl, OPR *operands){
    return set_dist_code(asmbl, operands, 0x180);  // 0b000110000000
}

Also, we can add them to the array of handlers that we wrote earlier.

C
    OP_HNDL hndls[] = {
        // ...

        {"DECF", handle_decf},
        {"DECFSZ", handle_decfsz},
        {"INCF", handle_incf},
        {"INCFSZ", handle_incfsz},
        {"ADDWF", handle_addwf},
        {"ANDWF", handle_andwf},
        {"COMF", handle_comf},
        {"IORWF", handle_iorwf},
        {"MOVF", handle_movf},
        {"RLF", handle_rlf},
        {"RRF", handle_rrf},
        {"SUBWF", handle_subwf},
        {"SWAPF", handle_swapf},
        {"XORWF", handle_xorwf},

        // ...
    };
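
Since these fourteen handlers differ only in the identifier they pass, the same information could also live in a lookup table. The following is an alternative sketch, not what the assembler does. FD_OP, fd_ops, and lookup_fd_code are hypothetical names, and the identifiers are copied from the handlers above.

```c
#include <string.h>

/* FD_OP: hypothetical pairing of a mnemonic with its 12-bit identifier */
typedef struct {
    const char *name;
    int code;
} FD_OP;

static const FD_OP fd_ops[] = {
    {"DECF", 0x0C0}, {"DECFSZ", 0x2C0}, {"INCF", 0x280}, {"INCFSZ", 0x3C0},
    {"ADDWF", 0x1C0}, {"ANDWF", 0x140}, {"COMF", 0x240}, {"IORWF", 0x100},
    {"MOVF", 0x200}, {"RLF", 0x340}, {"RRF", 0x300}, {"SUBWF", 0x080},
    {"SWAPF", 0x380}, {"XORWF", 0x180},
};

/* lookup_fd_code: return the identifier for name, or -1 if it is unknown */
int lookup_fd_code(const char *name){
    size_t n = sizeof(fd_ops) / sizeof(fd_ops[0]);
    for(size_t i = 0; i < n; ++i){
        if(strcmp(fd_ops[i].name, name) == 0){ return fd_ops[i].code; }
    }
    return -1;
}
```

The trade-off is that one function per opcode fits the existing OP_HNDL array without changes, while a table keeps all identifiers in one place.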

And finally, since we have get_used_mem, we can use it at the end of our assemble function.

C
void assemble(ASMBL *asmbl, TBL *input_tbl){
    // ...
    asmbl->len.mem = get_used_mem();
}

Challenge 11: Handler for Bit Manipulation Opcodes

We have 4 opcodes that test or manipulate bits. BTFSS and BTFSC test a bit, while BSF and BCF set or clear one. Each pair of opcodes has the same structure but different identifiers. One way is to write each handler by itself, and the other way is to write a function that takes the identifier as a parameter, like we did for opcodes with destinations.

However, unlike the destination bit, the second argument of these opcodes can be an EQU value. That is simple to support because we already have the extract_value function. So, let’s start by writing handlers for BTFSS and BTFSC by providing a common function to handle them, and let’s call it get_tst_op.

C
int get_tst_op(ASMBL *asmbl, OPR *operands, int code){
    if(check_op_num(asmbl, operands, 2)){ return -1; }
    int addr;
    if((addr = extract_value(operands->lines[0], 1)) < 0){
        update_err(asmbl, "Invalid register", operands->lines[0]);
        return -1;
    }

    int bit;
    if((bit = extract_value(operands->lines[1], 1)) < 0){
        if((bit = is_number(operands->lines[1])) < 0){
            update_err(asmbl, "Invalid bit", operands->lines[1]);
            return -1;
        }
    }

    add_to_mem(operands->lines[0]);

    return code | (bit << 5) | addr;
}

The structure of the function get_tst_op is similar to set_dist_code, and the BTFSS and BTFSC handlers just need to call it with their own identifiers.

C
/* {BTFSS} */
int handle_btfss(ASMBL *asmbl, OPR *operands){
    return get_tst_op(asmbl, operands, 0x700);  // 0b011100000000
}


/* {BTFSC} */
int handle_btfsc(ASMBL *asmbl, OPR *operands){
    return get_tst_op(asmbl, operands, 0x600);  // 0b011000000000
}
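
The word layout for the bit-test opcodes is identifier, then a 3-bit bit number, then a 5-bit address. The sketch below packs those fields from plain integers. The name encode_bit_op is hypothetical, and the range checks are an assumption of this sketch, since get_tst_op as written does not limit the bit number.

```c
/* encode_bit_op: pack identifier, 3-bit bit number, and 5-bit address.
 * Returns -1 if either field is out of range (hypothetical helper). */
int encode_bit_op(int code, int bit, int f){
    if(bit < 0 || bit > 7){ return -1; }
    if(f < 0 || f > 0x1F){ return -1; }
    return code | (bit << 5) | f;
}
```

With the BTFSS identifier 0x700, testing bit 2 of register 0x03 gives 0x700 | (2 << 5) | 0x03, which is 0x743. A bit number of 8 would spill into the identifier bits, which is why the sketch rejects it.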

And if we want to write handlers for BSF and BCF, we can write a common function to help us, like before. Let’s call it bit_man_codes, which generates machine code based on the provided identifier. However, we must ensure that the register address and bit number are valid, so we also need a function for that. Let’s call it check_bit_reg.

To check whether the provided register and bit are valid (returning 0 if they are, and 1 after reporting an error if they are not), we use the code below.

C
int check_bit_reg(ASMBL *asmbl, int reg, int bit, char *regstr){
    int bbb_size = 3;  // width of the bit-number field (bbb)
    int fff_size = 5;  // width of the register address field (f ffff)

    if (bit > (1 << bbb_size) - 1){
        char buff[20];
        itoar(bit, buff);
        update_err(asmbl, "Invalid bit", buff);
        return 1;
    }
    if(reg > (1 << fff_size) - 1){
        update_err(asmbl, "Invalid register", regstr);
        return 1;
    }
    return 0;
}
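
The expression (1 << n) - 1 used in check_bit_reg is the largest value an n-bit field can hold: a 3-bit field holds 0 to 7, and a 5-bit field holds 0 to 31. A small sketch makes the arithmetic concrete. The name fits_in_bits is hypothetical.

```c
/* fits_in_bits: 1 if value fits in an unsigned field of `width` bits, else 0 */
int fits_in_bits(int value, int width){
    int max = (1 << width) - 1;   /* e.g. width 3 gives 0b111, which is 7 */
    return value >= 0 && value <= max;
}
```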

And for the BSF and BCF handlers, we need a common function to generate the machine code based on the provided identifier.

C
/* bit_man_codes: bit manipulation codes */
int bit_man_codes(ASMBL *asmbl, OPR *operands, int code){
    if(check_op_num(asmbl, operands, 2)){ return -1; }

    int reg;
    if((reg = extract_value(operands->lines[0], 1)) < 0){
        update_err(asmbl, "Failed to handle", operands->lines[0]);
        return -1;
    }

    int bit;
    if((bit = extract_value(operands->lines[1], 1)) < 0){
        update_err(asmbl, "Invalid bit number", operands->lines[1]);
        return -1;
    }

    // check_bit_reg expects (reg, bit) in that order and reports the error itself
    if(check_bit_reg(asmbl, reg, bit, operands->lines[0]) != 0){
        return -1;
    }

    add_to_mem(operands->lines[0]);
    return code | (bit << 5) | reg;
}

/* {BSF} */
int handle_bsf(ASMBL *asmbl, OPR *operands){
    return bit_man_codes(asmbl, operands, 0x500);  // 0b010100000000
}

/* {BCF} */
int handle_bcf(ASMBL *asmbl, OPR *operands){
    return bit_man_codes(asmbl, operands, 0x400);  // 0b010000000000
}

Now we can add these four handlers to our handlers array.

C
    OP_HNDL hndls[] = {
        // ...
        {"BTFSS", handle_btfss},
        {"BTFSC", handle_btfsc},
        {"BSF", handle_bsf},
        {"BCF", handle_bcf},
        // ...
    };

Challenge 12: Single Operand Opcodes

There are 9 unhandled opcodes left, and each of them needs to be handled differently. Some work with a literal, some with an address, and others with a unique address. So, let’s write handlers for each of them. First, let’s start with MOVWF.

C
#define SET_BY_MASK(inst, mask, val) (((inst) & ~(mask)) | ((val) & (mask)))

/* {MOVWF} */
int handle_movwf(ASMBL *asmbl, OPR *operands){
    if(check_op_num(asmbl, operands, 1)){ return -1; }

    int result;
    if((result = extract_value(operands->lines[0], 1)) >= 0){
        add_to_mem(operands->lines[0]);
        return SET_BY_MASK(0x020, 0x01F, result);  // 0b000000100000, 0b000000011111
    }

    update_err(asmbl, "Invalid register", operands->lines[0]);
    return -1;
}

The handler for MOVWF is simple. It only extracts the value of the address and uses a macro called SET_BY_MASK to create the machine code. The macro takes an identifier inst, a mask, and a value val: it clears the masked bits of inst and fills them with the matching bits of val.
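
A short worked example shows what the mask does. The sketch below uses a fully parenthesized variant of the macro and wraps it in a hypothetical function, movwf_word, so it can be called with plain integers.

```c
/* Parenthesized variant of the macro, so expression arguments expand safely */
#define SET_BY_MASK(inst, mask, val) (((inst) & ~(mask)) | ((val) & (mask)))

/* movwf_word: build the MOVWF word for a file address (hypothetical helper) */
int movwf_word(int f){
    return SET_BY_MASK(0x020, 0x01F, f);
}
```

For address 0x06 the result is 0x026. Note that the mask silently drops any bits above the low five, so 0x26 also produces 0x026. If out-of-range addresses should be reported instead, they need a separate check before the macro is applied.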

The next one is the CLRF handler, which is called handle_clrf. The structure of this handler is quite similar to the handle_movwf function.

C
/* {CLRF} */
int handle_clrf(ASMBL *asmbl, OPR *operands){
    if(check_op_num(asmbl, operands, 1)){ return -1; }

    int result;
    if((result = extract_value(operands->lines[0], 1)) >= 0){
        add_to_mem(operands->lines[0]);
        return SET_BY_MASK(0x060, 0x01F, result);  // 0b000001100000, 0b000000011111
    }

    update_err(asmbl, "Invalid register", operands->lines[0]);
    return -1;
}

The handler for TRIS is similar to the previous handlers, but with one difference: the value of TRIS can only be 6 or 7. We must check that the value is in the correct range; otherwise, we throw an error using update_err and return -1.

C
/* {TRIS} */
int handle_tris(ASMBL *asmbl, OPR *operands){
    if(check_op_num(asmbl, operands, 1)){ return -1; }

    int value;
    if((value = extract_value(operands->lines[0], 1)) < 0){
        update_err(asmbl, "Invalid literal value", operands->lines[0]);
        return -1;
    }


    if(value == 6 || value == 7){
        return 0x00 | value;  // 0b000000000000
    }

    char buff[20] = { 0 };
    itoar(value, buff);
    update_err(asmbl, "Invalid \"TRIS\" value", buff);
    return -1;
}

The opcodes of MOVLW, ANDLW, IORLW, RETLW, and XORLW only take a literal, so they need a common function like before. Let’s call this function extract_literal.

C
int extract_literal(ASMBL *asmbl, OPR *operands, int code, int uerr){
    if(check_op_num(asmbl, operands, 1)){ return -1; }

    int val;
    if((val = extract_value(operands->lines[0], 1)) < 0){
        if(uerr){
            update_err(asmbl, "Invalid literal value", operands->lines[0]);
        }
        return -1;
    }

    return code | val;
}

The extract_literal function wraps extract_value: it also checks the number of operands and updates the ASM_ERR structure on failure (when uerr is set). It then generates the machine code using the provided code identifier.

Now the handlers for each of these opcodes look alike: each one calls extract_literal with its own identifier, which avoids redundant code and keeps the handlers consistent.

C
/* {MOVLW} */
int handle_movlw(ASMBL *asmbl, OPR *operands){
    return extract_literal(asmbl, operands, 0xC00, 1); // 0b110000000000
}

/* {ANDLW} */
int handle_andlw(ASMBL *asmbl, OPR *operands){
    return extract_literal(asmbl, operands, 0xE00, 1);  // 0b111000000000
}

/* {IORLW} */
int handle_iorlw(ASMBL *asmbl, OPR *operands){
    return extract_literal(asmbl, operands, 0xD00, 1);  // 0b110100000000
}

/* {RETLW} */
int handle_retlw(ASMBL *asmbl, OPR *operands){
    return extract_literal(asmbl, operands, 0x800, 1);  // 0b100000000000
}

/* {XORLW} */
int handle_xorlw(ASMBL *asmbl, OPR *operands){
    return extract_literal(asmbl, operands, 0xF00, 1);  // 0b111100000000
}
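
These five opcodes all carry an 8-bit literal in the low byte of the word. The sketch below shows the packing with plain integers. The name encode_literal is hypothetical, and the 0 to 0xFF check is an assumption of this sketch, since extract_literal as shown does not limit the value.

```c
/* encode_literal: OR an 8-bit literal into the identifier.
 * Returns -1 if k does not fit in 8 bits (hypothetical helper). */
int encode_literal(int code, int k){
    if(k < 0 || k > 0xFF){ return -1; }
    return code | k;
}
```

RETLW 0x48 therefore becomes 0x800 | 0x48, which is 0x848 (0b100001001000).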

The last remaining opcode is CALL. The CALL opcode is similar to GOTO, so let’s create a common function for it and call it set_by_label.

C
int set_by_label(ASMBL *asmbl, OPR* operands, int code){
    if(check_op_num(asmbl, operands, 1)){ return -1; }
    char *label = operands->lines[0];

    // First, try to resolve the operand as a known label
    int lvalue = get_element(LABEL_ELEMENT, label);
    if(lvalue >= 0){
        return code | lvalue;
    }

    // Otherwise, fall back to a literal address
    lvalue = extract_value(label, 1);
    if(lvalue < 0){
        update_err(asmbl, "Invalid label", label);
        return -1;
    }
    return code | lvalue;
}

The set_by_label function first checks the label array; if the operand is not a known label, it tries to extract a literal address instead. Now that we have this function, let’s update handle_goto and create the handle_call function.

C
/* {GOTO} */
int handle_goto(ASMBL *asmbl, OPR *operands){
    return set_by_label(asmbl, operands, 0xA00);  // 0b101000000000
}

/* {CALL} */
int handle_call(ASMBL *asmbl, OPR *operands){
    return set_by_label(asmbl, operands, 0x900);  // 0b100100000000
}
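
The two steps, resolving a label and then packing the address, can be sketched separately. LABEL, resolve_target, and encode_jump are hypothetical names, and the label table here is a plain array rather than the assembler’s own label storage. The address-width check is an assumption of this sketch: GOTO carries a 9-bit address and CALL an 8-bit one.

```c
#include <string.h>

/* LABEL: hypothetical name/address pair standing in for the label array */
typedef struct {
    const char *name;
    int addr;
} LABEL;

/* resolve_target: return the address of the label named operand,
 * or -1 when no label matches. */
int resolve_target(const LABEL *labels, int count, const char *operand){
    for(int i = 0; i < count; ++i){
        if(strcmp(labels[i].name, operand) == 0){ return labels[i].addr; }
    }
    return -1;
}

/* encode_jump: OR a target into the identifier after checking that it
 * fits in addr_bits (9 for GOTO, 8 for CALL). Returns -1 otherwise. */
int encode_jump(int code, int addr_bits, int target){
    if(target < 0 || target >= (1 << addr_bits)){ return -1; }
    return code | target;
}
```

With a label loop at address 0x03, GOTO loop resolves to 0x03 and encodes as 0xA03, while a CALL to address 0x100 is rejected because it does not fit in 8 bits.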

Now we can add their handlers to our array. By doing this, we have handled all of the 33 opcodes and completed our assemble function. The next step is to provide some functions for getting output or generating a binary file.

C
    OP_HNDL hndls[] = {
        // ...
        {"MOVLW", handle_movlw},
        {"ANDLW", handle_andlw},
        {"IORLW", handle_iorlw},
        {"RETLW", handle_retlw},
        {"XORLW", handle_xorlw},
        {"CALL", handle_call}
    };

Challenge 13: Generating Output

We have already updated our ASM_ERR and ASM_LEN structures, thanks to the assemble function. Now, we just need to turn them into output.

The ecode field of the ASMBL structure tells us whether the assemble process failed or succeeded. Let’s first imagine the process failed. We’ll create a function to update the given buffer with error diagnostics.

C
void show_err(ASM_ERR *err, char buffer[]){
    char obj_buff[MAX_STR + 10] = { 0 };
    if(strcmp(err->obj, "") != 0){
        str_trim(err->obj);
        sprintf(obj_buff, " (%s)", err->obj);
    }

    str_trim(err->line);
    sprintf(buffer, "%s%s:\n %-3d| %s\n    |\n", err->msg, obj_buff, err->lnum, err->line);
}

The show_err function will update the given buffer char buffer[] by using the properties of the ASM_ERR pointer that we provided for it. It will create a message that we can show in the terminal output or other places, such as when using WASM.
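
show_err writes with sprintf and trusts the caller’s buffer to be large enough. A bounded variant could look like the sketch below. The name format_err and its signature are hypothetical; it takes the individual fields as arguments instead of an ASM_ERR pointer so that it stands alone.

```c
#include <stdio.h>

/* format_err: bounded variant of show_err (hypothetical signature).
 * Returns 1 if the message fit in out, 0 if it was truncated. */
int format_err(char *out, size_t size, const char *msg, const char *obj,
               int lnum, const char *line){
    int n;
    if(obj[0] != '\0'){
        /* Include the offending object in parentheses when there is one */
        n = snprintf(out, size, "%s (%s):\n %-3d| %s\n    |\n", msg, obj, lnum, line);
    } else {
        n = snprintf(out, size, "%s:\n %-3d| %s\n    |\n", msg, lnum, line);
    }
    return n >= 0 && (size_t)n < size;
}
```

The caller can then decide what to do when the function returns 0, instead of risking a write past the end of the buffer.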

Now, let’s imagine the process succeeded. We need to create an output file using the provided output path from GFLAGS, the mcode from ASMBL, and the total number of words in ASM_LEN.

Since it is the opposite of io_read, let’s call this one io_write.

C
/* io_write: Write into external files */ 
void io_write(char *path, int buff[], int len) {
    FILE *fp;

    // Attempt to open the file in binary write mode
    if((fp = fopen(path, "wb+")) == NULL){
        // If the file cannot be opened, print an error message and exit with a failure status
        fprintf(stderr, "Failed to write in \"%s\"\n", path);
        exit(1);
    }

    unsigned char bytes[2];

    // Loop through each value in the buffer
    for (int i = 0; i < len; i++) {
        // Extract the MSB (Most Significant Byte)
        bytes[0] = (buff[i] >> 8) & 0xFF;
        // Extract the LSB (Least Significant Byte)
        bytes[1] = buff[i] & 0xFF;
        // Write both bytes to the file
        fwrite(bytes, 1, sizeof(bytes), fp);
    }

    // Close the file after writing
    fclose(fp);
}

The io_write function is responsible for converting our machine codes into a binary file at the provided path. It splits each 12-bit word into its MSB (Most Significant Byte) and LSB (Least Significant Byte) and writes the two bytes to the file, high byte first.
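
The byte split can be checked in isolation. In the sketch below, split_word mirrors the two shifts inside io_write, and join_word reverses them. Both names are hypothetical.

```c
/* split_word: store a 12-bit word as two bytes, high byte first */
void split_word(int word, unsigned char out[2]){
    out[0] = (word >> 8) & 0xFF;  /* upper 4 bits of the 12-bit word */
    out[1] = word & 0xFF;         /* lower 8 bits */
}

/* join_word: rebuild the word from the two bytes written by split_word */
int join_word(const unsigned char in[2]){
    return (in[0] << 8) | in[1];
}
```

For GOTO 0x03, the word 0b101000000011 (0xA03) is stored as the bytes 0x0A and 0x03.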

Now that we have all of the essential functions, let’s update our main function and finish writing our program.

C
int main(int argc, char *argv[]){

    // Already Added:
    GFLAGS gflags;
    update_gflags(&gflags, argc, argv);

    TBL file;
    io_read(&file, gflags.input);

    ASMBL asmbl;
    assemble(&asmbl, &file);


    // New parts:
    if(asmbl.ecode){
        static char err_buff[MAX_STR] = { 0 };
        show_err(&asmbl.err, err_buff);
        printf("%s\n", err_buff);
        return 1;

    } else {

        io_write(gflags.output, asmbl.mcode, asmbl.len.words);

        if(gflags.verbose){
            for(int i = 0; i < asmbl.len.words; ++i){
                printf("%s\n", asmbl.lines[i]);
            }
            printf("\n\n");
        }

        printf("Total Words: %d\nNumber of Used Memory: %d\n",
               asmbl.len.words, asmbl.len.mem);
    }
    return 0;
}

By using asmbl.ecode, we check whether the process failed. If it failed, we use show_err to print the error and return 1, indicating failure. Otherwise, we create the binary file using io_write. If the verbose flag (-v) is set, we loop through each line and print it. Finally, we print the total number of words (asmbl.len.words) and the total memory usage (asmbl.len.mem), regardless of which flags are set. And that’s it for our assembler program.

Example

In the previous post, we created an example program. Now, let’s assemble the assembly code with the -v flag on to see the output.

ASM
MOVWF 0x06                 0b000000100110
BSF 0x06 7                 0b010111100110
BCF 0x06 7                 0b010011100110
MOVLW 0x01                 0b110000000001
INCF 0x0A 1                0b001010101010
ADDWF 0x0A 0               0b000111001010
ADDWF 0x02 1               0b000111100010
NOP                        0b000000000000
NOP                        0b000000000000
RETLW 0x48                 0b100001001000
RETLW 0x65                 0b100001100101
RETLW 0x6C                 0b100001101100
RETLW 0x6C                 0b100001101100
RETLW 0x6F                 0b100001101111
RETLW 0x2C                 0b100000101100
RETLW 0x20                 0b100000100000
RETLW 0x57                 0b100001010111
RETLW 0x6F                 0b100001101111
RETLW 0x72                 0b100001110010
RETLW 0x6C                 0b100001101100
RETLW 0x64                 0b100001100100
RETLW 0x21                 0b100000100001
RETLW 0x0A                 0b100000001010
CLRF 0x0A                  0b000001101010
CLRF 0x06                  0b000001100110
CLRW                       0b000001000000
GOTO 0x00                  0b101000000000


Total Words: 27
Number of Used Memory: 3

If we create an invalid program, like the code below, we will get a different output: an error message is displayed, and no binary file is generated.

ASM
GPIO EQU 0x06

MOVLW 'A'
BSF GPIO, 7

INVALID  ;; Err

CLRF GPIO

The output is the error message that show_err wrote into the buffer, pointing at the offending line:

Plaintext
Invlaid opcode (INVALID):
 6  | INVALID  ;; Err
    |

You can find all of the code for this project, many more examples, and a WASM version of it in this GitHub repository.

Mastering PIC10F200/202/204/206: A Beginner’s Guide to Writing Assembly Code

Introduction

Microchip has a series of microcontrollers called PIC10F200/202/204/206. All of them use the same assembly language structure. The assembly language for them has 33 opcodes, which we will discuss in this post. After that, we will create an assembler, emulator, and compiler to demonstrate how they work. So, let’s begin by explaining the opcodes.

Every instruction is 12 bits in size in binary. Each instruction has different parts, except for one common element: an identifier that makes each instruction unique. Some instructions also include a destination bit, address, literals, etc.

Instructions affect certain bits of specific registers in the microcontroller, which varies for each opcode.

ADDWF f, d

Add W to f.

ADDWF adds the value of register W to register f. If d (destination) is 1, the result will be stored in register f; otherwise, the result remains in register W.
Since this microcontroller is 8-bit, the result cannot exceed 255 (0xFF). This means that if we add two values and the result exceeds 255, it wraps around to start from 0 due to overflow.

ADDWF:

          ┌─ Destination (1 bit)

 0b0001 11df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Let’s look at an example of this:

ASM
; [0x06] = [0x06] + W  
ADDWF 0x06, 1 ; Add W to register 0x06 and store the result in register 0x06

; W = [0x06] + W  
ADDWF 0x06, 0 ; Add W to register 0x06 and store the result in register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.
  • Bit DC in the STATUS register: If a carry from the 4th low-order bit of the result occurs, it is set to 1; otherwise, it is set to 0.
  • Bit C in the STATUS register: It is set to 1 if a carry occurred; otherwise, it is set to 0.
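
The wrap-around and the three flags can be modeled in a few lines of C. This is only a sketch of the behavior described above, not emulator code from this project. FLAGS and addwf are hypothetical names.

```c
#include <stdint.h>

/* FLAGS: hypothetical container for the three STATUS bits ADDWF touches */
typedef struct {
    int z, dc, c;
} FLAGS;

/* addwf: add w to f as an 8-bit ALU would, filling in Z, DC, and C */
uint8_t addwf(uint8_t f, uint8_t w, FLAGS *fl){
    unsigned sum = (unsigned)f + (unsigned)w;
    uint8_t result = (uint8_t)(sum & 0xFF);       /* wrap around past 255 */
    fl->c  = sum > 0xFF;                          /* carry out of bit 7 */
    fl->dc = ((f & 0x0F) + (w & 0x0F)) > 0x0F;    /* carry out of bit 3 */
    fl->z  = result == 0;
    return result;
}
```

Adding 0x01 to 0xFF wraps to 0x00 and sets Z, DC, and C, while adding 0x01 to 0x0F gives 0x10 and sets only DC.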

ANDWF f, d

And W with f.

This instruction performs an AND (&) operation between the value of the given address f and the value of register W. If d (destination) is set to 1, the result will be stored in f; otherwise, the result remains in register W.

ANDWF:

          ┌─ Destination (1 bit)

 0b0001 01df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Let’s look at an example for ANDWF:

ASM
; [0x06] = [0x06] & W
ANDWF 0x06, 1 ; Perform AND operation between W and the value of register 0x06, store the result in register 0x06

; W = [0x06] & W
ANDWF 0x06, 0 ; Perform AND operation between W and the value of register 0x06, store the result in register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

CLRF f

Clear f.

This instruction clears the value of the given address f, meaning that the value of the given address f becomes 0 (0b00000000).

CLRF:

            ┌─ Address (5 bit)
           ─┴────
 0b0000 011f ffff
   ─┬──────
    └─ Identifier (7 bit)

This is a simple example for CLRF:

ASM
; [0x06] = 0x00
CLRF 0x06 ; Clear the value of register 0x06

Affected Bits

  • Bit Z in the STATUS register: It is always set to 1, because the result of clearing is always 0.

CLRW

Clear W.

CLRW works exactly like CLRF but only clears the value of register W and sets its value to 0 (0b00000000). CLRW has no operand.

CLRW:

 0b0000 0100 0000
   ─┬────────────
    └─ Identifier (12 bit)

The usage is very simple:

ASM
; W = 0
CLRW

Affected Bits

  • Bit Z in the STATUS register: It is always set to 1, because the result of clearing is always 0.

COMF f, d

Complement f.

This instruction complements (~) the value of the given register f, meaning that it converts all the zeros in binary to ones and all the ones to zeros. You can say it inverts the bits. Depending on bit d (destination), if it is set to 1, the result will be stored in f; otherwise, if it is set to 0, the result stays in register W.

COMF:

          ┌─ Destination (1 bit)

 0b0010 01df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

This is a basic example for the COMF command on address 0x06:

ASM
; [0x06] = ~[0x06]
COMF 0x06, 1 ; Complement the value of register 0x06 and store the result in register 0x06

; W = ~[0x06]
COMF 0x06, 0 ; Complement the value of register 0x06 and store the result in register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

DECF f, d

Decrement f.

This instruction decrements the value of the given register f by 1, and if bit d (destination) is set to 1, it stores the result back in register f; otherwise, the result stays in register W.

If the value of the given address f is already 0, because this is an 8-bit CPU, the result wraps around and becomes 0xFF (255).

DECF:

          ┌─ Destination (1 bit)

 0b0000 11df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

The following code is an example for the DECF command:

ASM
; [0x06] = [0x06] - 1
DECF 0x06, 1 ; Decrement the value of register 0x06 and store the result in register 0x06

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

DECFSZ f, d

Decrement f, Skip if 0.

This is the first opcode we are talking about that can take 2 cycles. This instruction decrements the value of the given register f, and if the result becomes 0, it skips the next instruction and executes a NOP (No Operation) instead, which costs the extra cycle. Otherwise, if the result is not 0, it behaves like DECF and does not skip any instruction.
If d (destination) is set to 0, the result stays in register W; otherwise, the result is placed back into address f.

DECFSZ:

          ┌─ Destination (1 bit)

 0b0010 11df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

You can use DECFSZ like this:

ASM
; [0x06] = [0x06] - 1

DECFSZ 0x06, 1 ; Decrement the value of 0x06 by 1 (place back into 0x06) and skip the next instruction if the result is 0  
COMF 0x06, 1 ; Complement the value of register 0x06 and place it back into register 0x06 (skipped if the result was 0)  
ADDWF 0x06, 0 ; Add W to the value of register 0x06 and store the result in register W

The DECFSZ instruction does not affect any bits in the STATUS register.
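
The decrement-and-skip behavior can be modeled as a function that returns the new value and reports whether the next instruction would be skipped. This is a sketch of the semantics only, and decfsz is a hypothetical name.

```c
#include <stdint.h>

/* decfsz: decrement value (0 wraps to 0xFF) and report through *skip
 * whether the next instruction would be skipped (result == 0). */
uint8_t decfsz(uint8_t value, int *skip){
    uint8_t result = (uint8_t)(value - 1);
    *skip = (result == 0);
    return result;
}
```

This is what makes DECFSZ a natural loop counter: starting from n, the skip happens on the n-th decrement.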


INCF f, d

Increment f.

The value of register f will be incremented by 1, and if d (destination) is 1, the result is placed back into the given address f; otherwise, the result stays in register W.

If the value of the given register is already 0xFF (255), because this is an 8-bit CPU, the result will wrap around and become 0.

INCF:

          ┌─ Destination (1 bit)

 0b0010 10df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Let’s look at two examples for INCF with different destinations:

ASM
; [0x06] = [0x06] + 1
INCF 0x06, 1  ; Increment the value of register 0x06 and store the result in register 0x06

; W = [0x06] + 1
INCF 0x06, 0  ; Increment the value of register 0x06 and store the result in register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

INCFSZ f, d

Increment f, Skip if 0.

Just like INCF, this instruction increments the value of address f by 1, and if the d (destination) is set to 1, the result is placed back into the given address; otherwise, the result stays in register W.

INCFSZ, like DECFSZ, can take 2 cycles: if the result of the operation becomes 0 (that is, if the value of the given address was 255 and wrapped around), it skips the next instruction and executes a NOP (No Operation).

INCFSZ:

          ┌─ Destination (1 bit)

 0b0011 11df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Let’s look at some examples for INCFSZ:

ASM
; [0x06] = [0x06] + 1

INCFSZ 0x06, 1 ; Increment the value of 0x06 by 1 (place back into 0x06) and skip the next instruction if the result is 0  
COMF 0x06, 1 ; Complement the value of register 0x06 and place it back into register 0x06 (skipped if the result was 0)  
ADDWF 0x06, 0 ; Add W to the value of register 0x06 and store the result in register W

The INCFSZ instruction does not affect any bits in the STATUS register.


IORWF f, d

Inclusive OR W with f.

The IORWF instruction performs an Inclusive OR (|) between the value of register W and the value of address f.
If d (destination) is set to 1, it means that the result will be placed back into address f; otherwise, the result stays in register W.

IORWF:

          ┌─ Destination (1 bit)

 0b0001 00df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

We can use this opcode like this:

ASM
; W = [0x06] | W  
IORWF 0x06, 0 ; Inclusive OR between W and 0x06, the result is placed back into register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

MOVF f, d

Move f.

This instruction is very useful, especially when writing a compiler. The MOVF instruction moves the value of the given address f to the given destination d. If d is set to 0, it means that the value of the given address f is loaded into register W. Otherwise, if d is set to 1, the value is placed back into the given address f. This is useful for checking a value because MOVF affects the Z bit in the STATUS register, which helps us determine whether the value of address f is 0 or not.

MOVF:

          ┌─ Destination (1 bit)

 0b0010 00df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Let’s see an example for this:

ASM
; W = [0x06]  
MOVF 0x06, 0 ; Move the value of register 0x06 to register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

MOVWF f

Move W to f.

Another useful instruction is MOVWF. This instruction moves the value of register W to the given address f. This is kind of the opposite of MOVF with destination (d) 0.
MOVWF only takes an address f.

MOVWF:

            ┌─ Address (5 bit)
           ─┴────
 0b0000 001f ffff
   ─┬──────
    └─ Identifier (7 bit)

Let’s look at a simple example for MOVWF:

ASM
; [0x06] = W  
MOVWF 0x06 ; Move the value of register W to register 0x06

MOVWF does not affect any bits in the STATUS register.


NOP

No Operation.

The NOP instruction does nothing and is primarily used to create a delay or to skip over other instructions, allowing NOP to execute in their place.

NOP:

 0b0000 0000 0000
   ─┬────────────
    └─ Identifier (12 bit)

Using NOP is simple:

ASM
NOP  ; No Operation

NOP does not affect any of the STATUS register bits.


RLF f, d

Rotate Left f Through Carry.

Each bit in the given address f is shifted one position to the left, with the leftmost bit moved to the carry flag, and the previous carry bit placed in the least significant bit. If d is set to 0, the result stays in the W register; otherwise, if d is set to 1, the result is stored in the given register address.

Rotate Left Through Carry:

 ┌─────────────────────────────────────────────────┐
 │                                                 │
 │    Carry    Binary                              │
 │    ┌───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐   │
 └──  │ 0 │ << │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ <─┘
      └───┘    └───┴───┴───┴───┴───┴───┴───┴───┘
       C       7   6   5   4   3   2   1   0

The structure for this instruction is as follows:

RLF:

          ┌─ Destination (1 bit)

 0b0011 01df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Using RLF is simple; you can rotate the content of the given address to the left by 1.

ASM
; [0x06] = ([0x06] << 1) | C, and C = old bit 7 of [0x06]
RLF 0x06, 1 ; Rotate the contents of register 0x06 one bit to the left through carry

Affected Bits

  • Bit C in the STATUS register: It is loaded with the most significant bit (MSB) of f, which is rotated out.

RRF f, d

Rotate Right f Through Carry.

Each bit in the given address f is shifted one position to the right, with the rightmost bit moved to the carry flag, and the previous carry bit placed in the most significant bit. If d is set to 0, the result stays in the W register; otherwise, if d is set to 1, the result is stored in the given register address.

Rotate Right Through Carry:

 ┌─────────────────────────────────────────────────┐
 │                                                 │
 │    Carry    Binary                              │
 │    ┌───┐    ┌───┬───┬───┬───┬───┬───┬───┬───┐   │
 └──> │ 1 │ >> │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ ──┘
      └───┘    └───┴───┴───┴───┴───┴───┴───┴───┘
       C       7   6   5   4   3   2   1   0

The structure for this instruction is as follows:

RRF:

          ┌─ Destination (1 bit)

 0b0011 00df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Using RRF is simple; you can shift the content of the given address to the right by 1.

ASM
; [0x06] = [0x06] >> 1
RRF 0x06, 1 ; Rotate the contents of register 0x06 one bit to the right

Affected Bits

  • Bit C in the STATUS register: It is loaded with the least significant bit (LSB, bit 0) that was rotated out of the register.
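
The same idea in the other direction can be sketched in C. Again, rotate_right_through_carry is a made-up name for this example and only models the behavior described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate `value` right through the carry flag.
 * `carry` points to the current C bit (0 or 1) and receives the new one. */
uint8_t rotate_right_through_carry(uint8_t value, uint8_t *carry)
{
    assert(*carry <= 1);
    uint8_t old_carry = *carry;
    *carry = (uint8_t)(value & 1u);                     /* bit 0 moves into C     */
    return (uint8_t)((value >> 1) | (old_carry << 7));  /* old C moves into bit 7 */
}
```

With the values from the diagram (C = 1 and the byte 0b00000000), the function returns 0b10000000 and leaves C = 0.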

SUBWF f, d

Subtract W from f.

This is one of the most used opcodes when it comes to writing a compiler. The SUBWF command subtracts the value of register W from the value of the given address f, and depending on d (destination), if set to 1, the result is placed back into the given address; otherwise, the result stays in register W.

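The subtraction is performed using the two's complement method. If the value of W is larger than the value at the given address, the 8-bit result wraps around (for example, 0 - 1 gives 0xFF), and the STATUS bits described below record the borrow.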

SUBWF:

          ┌─ Destination (1 bit)

 0b0000 10df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

You can subtract the value of register W from the given address.

ASM
; [0x06] = [0x06] - W  
SUBWF 0x06, 1 ; Subtract the value of register W from register 0x06 and store the result back into register 0x06

; W = [0x06] - W  
SUBWF 0x06, 0 ; Subtract the value of register W from register 0x06 and store the result back into register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.
  • Bit DC in the STATUS register: If a borrow from the 4th low-order bit of the result did not occur, it is set to 1; otherwise, it is set to 0.
  • Bit C in the STATUS register: It is set to 1 if a borrow did not occur; otherwise, it is set to 0.
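
The three flags are easy to mix up, so here is a small C sketch that computes them for f - W. The helper name subwf and the way STATUS is passed in are assumptions for this example; the bit positions (C = bit 0, DC = bit 1, Z = bit 2) follow the STATUS register layout of this microcontroller family.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* STATUS bit masks used by SUBWF. */
#define STATUS_C  (1u << 0)
#define STATUS_DC (1u << 1)
#define STATUS_Z  (1u << 2)

/* Hypothetical helper: returns f - W and updates Z, DC and C in *status. */
uint8_t subwf(uint8_t f, uint8_t w, uint8_t *status)
{
    assert(status != NULL);
    uint8_t result = (uint8_t)(f - w);   /* wraps around on underflow */
    unsigned flags = 0;

    if (result == 0)              flags |= STATUS_Z;  /* result is zero              */
    if ((f & 0x0F) >= (w & 0x0F)) flags |= STATUS_DC; /* no borrow out of low nibble */
    if (f >= w)                   flags |= STATUS_C;  /* no borrow out of the byte   */

    /* Keep the other STATUS bits, replace only Z, DC and C. */
    *status = (uint8_t)((*status & ~(STATUS_Z | STATUS_DC | STATUS_C)) | flags);
    return result;
}
```

For example, 0x10 - 0x01 gives 0x0F with C = 1 (no borrow overall) but DC = 0, because the low nibble had to borrow.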

SWAPF f, d

Swap f.

The four leftmost bits of the given address f are swapped with the four rightmost bits of the same address. If d is 0, the result is placed in the W register. If d is 1, the result is placed in the given address.

SWAP (MSB & LSB)

            ┌────────────────┐
    ────────┴────────        +
    ┌───┬───┬───┬───┬───┬───┬───┬───┐
    │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
    └───┴───┴───┴───┴───┴───┴───┴───┘
      7   6   5   4   3   2   1   0
            +       ────────┬────────
            └───────────────┘

The structure for this instruction is as follows:

SWAPF:

          ┌─ Destination (1 bit)

 0b0011 10df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

Let’s see an example of using SWAPF:

ASM
SWAPF 06H, 1 ; Swap the nibbles of register 0x06 and place the result back into the same register

SWAPF does not affect any of the STATUS register bits.
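
Here is a minimal C sketch of the nibble swap, including the role of d. The 32-entry register array and the swapf function are assumptions for this example (the 5-bit address field allows 32 addresses); a real emulator may organize its state differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: `regs` is a 32-entry register file, `w` is the W register.
 * d = 0 stores the swapped value in W, d = 1 stores it back into regs[f]. */
void swapf(uint8_t regs[32], uint8_t *w, unsigned f, unsigned d)
{
    assert(f < 32);
    assert(d <= 1);
    uint8_t swapped = (uint8_t)((regs[f] << 4) | (regs[f] >> 4));
    if (d)
        regs[f] = swapped;
    else
        *w = swapped;
}
```

With 0xA5 stored at address 0x06, calling swapf with d = 0 puts 0x5A in W and leaves the register untouched, while d = 1 overwrites the register with 0x5A.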


XORWF f, d

Exclusive OR between W and f.

This instruction is used for an Exclusive OR (XOR ^) operation between the value of register W and the given address f. If d (destination) is set to 0, the result stays in register W; otherwise, the result is placed back into the given address f.

XORWF:

          ┌─ Destination (1 bit)

 0b0001 10df ffff
   ─┬───── ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (6 bit)

You can use XORWF the same way as other destination-based instructions.

ASM
; [0x06] = W ^ [0x06]
XORWF 0x06, 1 ; Exclusive OR W with 0x06, and place the result back into register 0x06

; W = W ^ [0x06]
XORWF 0x06, 0 ; Exclusive OR W with 0x06, and place the result back into register W

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.
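
RLF, RRF, SUBWF, SWAPF and XORWF all share the same layout: a 6-bit identifier, the destination bit d, and a 5-bit address f. One way an assembler could encode them is sketched below in C. The enum and function names are assumptions for this example; the identifier values are copied from the diagrams above.

```c
#include <assert.h>
#include <stdint.h>

/* 6-bit identifiers taken from the instruction layouts above. */
typedef enum {
    ID_SUBWF = 0x02, /* 0b000010 */
    ID_XORWF = 0x06, /* 0b000110 */
    ID_RRF   = 0x0C, /* 0b001100 */
    ID_RLF   = 0x0D, /* 0b001101 */
    ID_SWAPF = 0x0E  /* 0b001110 */
} ByteOpId;

/* Hypothetical encoder: builds the 12-bit word "iiii iidf ffff". */
uint16_t encode_byte_op(ByteOpId id, unsigned f, unsigned d)
{
    assert(f <= 0x1F); /* the address field is 5 bits wide */
    assert(d <= 1);    /* the destination is a single bit  */
    return (uint16_t)(((unsigned)id << 6) | (d << 5) | f);
}
```

For instance, XORWF 0x06, 1 becomes 0b0001 1010 0110 (0x1A6), and XORWF 0x06, 0 becomes 0x186.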

BCF f, b

Bit Clear f.

The BCF opcode sets the given bit (b) of the given address (f) to 0. Basically, it clears the selected bit.

BCF:

         ┌─ Bit (3 bit)
        ─┴─
 0b0100 bbbf ffff
   ─┬──    ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (4 bit)

This is an example of how to use it:

ASM
BCF 0x06, 7 ; Clear bit number 7 (starting from 0) in register 0x06

BCF does not affect any of the STATUS register bits.


BSF f, b

Bit set f.

The BSF opcode sets the given bit (b) of the given address (f) to 1. Basically, it activates the selected bit.

BSF:

         ┌─ Bit (3 bit)
        ─┴─
 0b0101 bbbf ffff
   ─┬──    ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (4 bit)

This is an example of how to use it:

ASM
BSF 0x06, 7 ; Set bit number 7 (starting from 0) in register 0x06

BSF does not affect any of the STATUS register bits.
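
BCF and BSF use a different layout from the destination-based instructions: a 4-bit identifier, a 3-bit bit number, and a 5-bit address. A possible C encoder for this shape is sketched below; encode_bit_op and the enum names are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit identifiers taken from the BCF and BSF layouts above. */
enum { ID_BCF = 0x4, ID_BSF = 0x5 };

/* Hypothetical encoder: builds the 12-bit word "iiii bbbf ffff". */
uint16_t encode_bit_op(unsigned id, unsigned f, unsigned b)
{
    assert(id <= 0xF);
    assert(f <= 0x1F); /* 5-bit address    */
    assert(b <= 7);    /* 3-bit bit number */
    return (uint16_t)((id << 8) | (b << 5) | f);
}
```

BSF 0x06, 7 encodes to 0b0101 1110 0110 (0x5E6), and BCF 0x06, 7 to 0x4E6.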


BTFSC f, b

Bit Test f, Skip if Clear.

The BTFSC opcode takes two parameters: the first is the address (f), and the second is the bit number (b). It tests whether the specified bit of the register at that address is 0. If it is, the next instruction is skipped: the instruction already fetched during the current execution is discarded, and a NOP is executed in its place, which makes BTFSC a 2-cycle instruction in that case. Otherwise, the next instruction is executed normally.

BTFSC:

         ┌─ Bit (3 bit)
        ─┴─
 0b0110 bbbf ffff
   ─┬──    ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (4 bit)

In the first line, bit 5 of register 0x06 is cleared to 0. In the second line, BTFSC checks whether bit 5 is 0 (which it is), causing the next instruction, INCF, to be skipped (executed as a NOP). The instruction in line 4 is then executed.

ASM
BCF 0x06, 5   ; Clear bit 5 of register 0x06
BTFSC 0x06, 5 ; Test bit 5 of register 0x06, if 0 skip the next instruction  
INCF 0x06, 1  ; Increment value of 0x06 by 1 (place back into the register)  
NOP           ; No operation

BTFSC does not affect any of the STATUS register bits.


BTFSS f, b

Bit Test f, Skip if Set.

Just like BTFSC, the BTFSS instruction takes two parameters: an address (f) and a bit number (b). It checks if the specified bit at the given address is set (is 1). If it is, the next instruction is skipped (a NOP is executed instead); otherwise, the next instruction is executed normally.

If bit b in address f is 1, the next instruction is skipped. The instruction fetched during the current execution is discarded, and a NOP is executed instead, making this a 2-cycle instruction.

BTFSS:

         ┌─ Bit (3 bit)
        ─┴─
 0b0111 bbbf ffff
   ─┬──    ──┬───
    │        └─ Address (5 bit)

    └─ Identifier (4 bit)

In the first line, bit 5 of register 0x06 is set to 1. In the second line, BTFSS checks whether bit 5 is 1 (which it is), causing the next instruction, DECF, to be skipped (executed as a NOP). The instruction in line 4 is then executed.

ASM
BSF 06H, 5    ; Set bit 5 of register 0x06 to 1
BTFSS 06H, 5  ; Test bit 5 of register 0x06, if 1 skip the next instruction
DECF 06H, 1   ; Decrement value of register 0x06 by 1 (place back into the register)
NOP           ; No Operation

BTFSS does not affect any of the STATUS register bits.
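
From an emulator's point of view, both skip instructions only decide whether the program counter moves forward by one or by two. The C sketch below shows that decision. The function name and parameters are assumptions for this example, and wrap-around at the end of program memory is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulator step for BTFSC/BTFSS located at address `pc`.
 * reg_value   : current content of register f
 * b           : bit number to test (0..7)
 * skip_if_set : 1 for BTFSS, 0 for BTFSC
 * Returns the address of the next instruction that really executes. */
uint16_t next_pc_after_bit_test(uint16_t pc, uint8_t reg_value,
                                unsigned b, int skip_if_set)
{
    assert(b <= 7);
    int bit_is_set = (reg_value >> b) & 1;
    int skip = skip_if_set ? bit_is_set : !bit_is_set;
    return (uint16_t)(pc + (skip ? 2 : 1));
}
```

In the BTFSS example above, the test sits at address 1 and bit 5 is set, so the function returns 3: DECF at address 2 is skipped and the NOP at address 3 runs next.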


ANDLW k

AND literal with W.

The value of register W is ANDed (&) with the given 8-bit literal k, and the result is placed back into the W register.

ANDLW:

             ┌─ Literal Value (8 bit)
        ─────┴───
 0b1110 kkkk kkkk
   ─┬──
    └─ Identifier (4 bit)

Here’s an example for ANDLW with a literal parameter:

ASM
ANDLW 0x0F  ; Perform AND operation between W and the literal 0x0F, store the result back in W

In this example, the value in register W is ANDed with the literal 0x0F, and the result is stored back in register W.

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

CALL k

Call subroutine.

The CALL instruction works with the STACK in the microcontroller, which we discuss in the emulator post. The CALL instruction takes an 8-bit literal k, changes the program counter to the address k, and saves the address of the next instruction after the CALL to the STACK. This opcode is useful for calling certain subroutines, and after finishing (reaching RETLW, which we will talk about), the program counter moves back to the instruction after the CALL, similar to a function call.

CALL:

             ┌─ Literal Value (8 bit)
        ─────┴───
 0b1001 kkkk kkkk
   ─┬──
    └─ Identifier (4 bit)

The program starts executing from the beginning, and in line 2, a CALL instruction is used on the label increase, which calls the increase subroutine. This causes the INCF instruction to execute. Since there is no RETLW (which we haven’t discussed yet), the CALL in line 2 functions somewhat like a GOTO (we will discuss GOTO further in the post).

ASM
start:
      CALL increase  ; Instead of a label name, an address can also be used.

skip:  ; Define label 'skip'
      NOP  ; No Operation
      NOP  ; No Operation

increase:  ; Define label 'increase'
      INCF 0x06, 0  ; Increment value of register 0x06 by 1 and store the result in register W

CALL does not affect any of the STATUS register bits.


CLRWDT

Clear Watchdog Timer.

This opcode, which takes no arguments (like NOP), is used to clear the watchdog timer.

CLRWDT:

 0b0000 0000 0100
   ─┬────────────
    └─ Identifier (12 bit)

There is no argument needed when using CLRWDT to clear the microcontroller’s watchdog timer.

ASM
CLRWDT  ; Clear Watchdog timer

Affected Bits

  • Bit TO in the STATUS register: CLRWDT sets TO to 1. It is cleared to 0 when a watchdog timer time-out occurs.
  • Bit PD in the STATUS register: CLRWDT sets PD to 1, as does powering up the microcontroller. It is cleared to 0 when a SLEEP instruction is executed (which will be discussed later).

GOTO k

Unconditional Branch.

The GOTO opcode takes a 9-bit literal k and changes the microcontroller’s program counter (PC) to point to the given address k, causing the execution of the instructions at that address. This is very helpful for creating unconditional branches.

GOTO:

             ┌─ Literal Value (9 bit)
      ───────┴───
 0b101k kkkk kkkk
   ─┬──
    └─ Identifier (3 bit)

Let’s create an infinite loop using GOTO.

ASM
start:
    INCF 0x06, 1  ; Increment value of 0x06 by 1 and store the result in the same register
    GOTO start    ; Jump to 'start' and create an infinite loop

GOTO does not affect any of the STATUS register bits.
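
GOTO is the only instruction with a 3-bit identifier and a 9-bit literal, so it needs its own encoder. A minimal C sketch, assuming a hypothetical function named encode_goto:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder: builds the 12-bit word "101k kkkk kkkk". */
uint16_t encode_goto(unsigned k)
{
    assert(k <= 0x1FF); /* the literal is 9 bits wide */
    return (uint16_t)((0x5u << 9) | k);
}
```

This reproduces the example from the beginning of the post: GOTO 0x03 becomes 0b101000000011 (0xA03). When a label such as start is used, the assembler first has to replace it with its address and can then call the same function.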


IORLW k

Inclusive OR Literal with W.

The contents of register W are OR’ed (|) with the 8-bit literal k. The result is placed in the W register.

IORLW:

             ┌─ Literal Value (8 bit)
        ─────┴───
 0b1101 kkkk kkkk
   ─┬──
    └─ Identifier (4 bit)

It is very easy to perform an inclusive OR (|) with the contents of register W and the literal k (0xFF).

ASM
IORLW 255 ; Perform inclusive OR between the contents of register W and literal 255 

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.

MOVLW k

Move Literal to W.

Another useful instruction for creating a compiler is MOVLW, which takes an 8-bit literal value and places it into the W register.

MOVLW:

             ┌─ Literal Value (8 bit)
        ─────┴───
 0b1100 kkkk kkkk
   ─┬──
    └─ Identifier (4 bit)

Let’s load register 0x06 with the value 0xFF using MOVLW.

ASM
MOVLW 0xFF   ; Load the W register with the value 0xFF
MOVWF 0x06   ; Move the value of W (0xFF) into register 0x06

MOVLW does not affect any of the STATUS register bits.


OPTION

Load OPTION register.

The content of register W is loaded into the OPTION register. The OPTION opcode does not take any arguments, just like NOP.

OPTION:

 0b0000 0000 0010
   ─┬────────────
    └─ Identifier (12 bit)

To use this opcode to load the OPTION register with the value of register W, simply use OPTION.

ASM
; OPTION = W
OPTION ; Loads the value of register W into the OPTION register

OPTION does not affect any of the STATUS register bits.


RETLW k

Return, Place Literal in W.

Another helpful opcode is RETLW. This opcode usually (but not always) works with the CALL opcode to perform function calls in Assembly. Just like CALL, RETLW interacts with the STACK. However, unlike CALL, this opcode pops one address off the stack, changes the program counter (PC) to the popped address (which is the address of the next instruction after the CALL that was pushed onto the stack), and loads the literal k into register W, similar to the MOVLW opcode. This makes RETLW very useful.

The W register is loaded with the 8-bit literal k, and the program counter is loaded from the top of the stack (the return address). This is a 2-cycle instruction.

RETLW:

             ┌─ Literal Value (8 bit)
        ─────┴───
 0b1000 kkkk kkkk
   ─┬──
    └─ Identifier (4 bit)

Let’s perform a function call using CALL and RETLW.

ASM
start:
    CALL myFunction    ; Call the subroutine 'myFunction'
    NOP                ; No operation (RETLW returns to this instruction)
    GOTO start         ; Loop back to 'start'

myFunction:
    INCF 0x06, 1       ; Increment value of register 0x06 and place the result back
    RETLW 0x00         ; Return from the function, load W with 0x00

RETLW does not affect any of the STATUS register bits.
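
The interplay between CALL, RETLW and the stack can be sketched in C as follows. The two-level stack depth comes from the PIC10F200 datasheet; the Cpu struct, the function names, and the assumption that the stack starts out filled with zeros are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU state for the sketch. */
typedef struct {
    uint16_t pc;       /* address of the instruction being executed */
    uint16_t stack[2]; /* two-level hardware stack, [0] is the top  */
    uint8_t  w;        /* W register                                */
} Cpu;

/* CALL k: push the return address, then jump to k. */
void do_call(Cpu *cpu, uint8_t k)
{
    assert(cpu != NULL);
    cpu->stack[1] = cpu->stack[0];            /* level 1 moves down to level 2  */
    cpu->stack[0] = (uint16_t)(cpu->pc + 1);  /* instruction after the CALL     */
    cpu->pc = k;
}

/* RETLW k: load W with k, then pop the return address into the PC. */
void do_retlw(Cpu *cpu, uint8_t k)
{
    assert(cpu != NULL);
    cpu->w = k;
    cpu->pc = cpu->stack[0];
    cpu->stack[0] = cpu->stack[1];            /* level 2 moves up to level 1    */
}
```

In the example above, CALL myFunction sits at address 0 and myFunction starts at address 3, so do_call stores 1 as the return address, and the later RETLW 0x00 brings the PC back to the NOP at address 1 with W = 0.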


SLEEP

Go into Standby mode.

When we use the SLEEP opcode (which takes no arguments), the CPU stops working (enters standby mode) and needs to be restarted to begin working again. When SLEEP is used, any instructions after it will not be executed.

SLEEP:

 0b0000 0000 0011
   ─┬────────────
    └─ Identifier (12 bit)

The program finishes when it reaches SLEEP.

ASM
MOVLW 128   ; Set value of register W to 128  
MOVWF 0x06  ; Move value of register W to register 0x06  
SLEEP       ; Enter sleep mode

Affected Bits

  • Bit TO in the STATUS register: SLEEP sets TO to 1. It is cleared to 0 when a watchdog timer time-out occurs.
  • Bit PD in the STATUS register: SLEEP clears PD to 0. It is set to 1 again after the microcontroller powers up or when CLRWDT is executed.

TRIS f

Load TRIS register.

The TRIS opcode takes one argument as an address (f = 0x06 or 0x07). The value of register W is loaded into the TRIS register. If a bit of the TRIS register is set to 1, the corresponding pin is configured as an input. Conversely, if a bit is set to 0, the pin is configured as an output.

TRIS:

               ┌─ Address (3 bit)
              ─┴─
 0b0000 0000 0fff
   ─┬─────────
    └─ Identifier (9 bit)

The four low-order bits are set to 0, meaning that those pins are configured as outputs, while the four high-order bits are set to 1, indicating that those pins are configured as inputs.

ASM
MOVLW 0b11110000     ; Set 0b11110000 (240 or 0xF0) to W register
TRIS 0x06            ; Load the TRIS register at address 0x06 (GPIO) with W

TRIS does not affect any of the STATUS register bits.
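
NOP, OPTION, SLEEP and CLRWDT take no operands, so an assembler can handle them with a simple lookup table, and TRIS only adds its 3-bit address to an all-zero identifier. The C sketch below shows one way to do this. The table, the function names, and the choice to accept only upper-case mnemonics are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table of opcodes whose 12-bit word is fixed. */
typedef struct {
    const char *mnemonic;
    uint16_t    word;
} FixedOpcode;

static const FixedOpcode FIXED_OPCODES[] = {
    { "NOP",    0x000 },
    { "OPTION", 0x002 },
    { "SLEEP",  0x003 },
    { "CLRWDT", 0x004 },
};

/* Returns the 12-bit word, or -1 if `mnemonic` is not in the table. */
int lookup_fixed_opcode(const char *mnemonic)
{
    assert(mnemonic != NULL);
    size_t count = sizeof FIXED_OPCODES / sizeof FIXED_OPCODES[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(FIXED_OPCODES[i].mnemonic, mnemonic) == 0)
            return FIXED_OPCODES[i].word;
    }
    return -1;
}

/* TRIS f: the 9 identifier bits are all zero, so the word is just f. */
uint16_t encode_tris(unsigned f)
{
    assert(f <= 7); /* the address field is 3 bits wide */
    return (uint16_t)f;
}
```

TRIS 0x06 therefore encodes to 0b0000 0000 0110 (0x006), and looking up SLEEP gives 0x003.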


XORLW k

Exclusive OR Literal with W.

This opcode performs an Exclusive OR (XOR ^) between the value of register W and the given 8-bit literal k, and stores the result in the W register.

XORLW:

             ┌─ Literal Value (8 bit)
        ─────┴───
 0b1111 kkkk kkkk
   ─┬──
    └─ Identifier (4 bit)

You can simply use XORLW.

ASM
; W = W ^ 0xFF
XORLW 0xFF ; Exclusive OR between the value of register W and the literal (0xFF)

Affected Bits

  • Bit Z in the STATUS register: If the result is 0, it is set to 1; otherwise, Z is set to 0.
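
All the literal instructions we have seen (ANDLW, CALL, IORLW, MOVLW, RETLW and XORLW) share one shape: a 4-bit identifier followed by an 8-bit literal. A single C function can encode all of them, as sketched below. The enum and function names are assumptions for this example; the identifier values come from the diagrams above.

```c
#include <assert.h>
#include <stdint.h>

/* 4-bit identifiers taken from the instruction layouts above. */
typedef enum {
    OP_RETLW = 0x8, /* 0b1000 */
    OP_CALL  = 0x9, /* 0b1001 */
    OP_MOVLW = 0xC, /* 0b1100 */
    OP_IORLW = 0xD, /* 0b1101 */
    OP_ANDLW = 0xE, /* 0b1110 */
    OP_XORLW = 0xF  /* 0b1111 */
} LiteralOp;

/* Hypothetical encoder: builds the 12-bit word "iiii kkkk kkkk". */
uint16_t encode_literal_op(LiteralOp op, unsigned k)
{
    assert(k <= 0xFF); /* the literal is 8 bits wide */
    return (uint16_t)(((unsigned)op << 8) | k);
}
```

XORLW 0xFF encodes to 0xFFF, ANDLW 0x0F to 0xE0F, and MOVLW 0xFF to 0xCFF.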

Examples

Now that we understand how the 33 opcodes of this type of microcontroller work, let’s write a program using them.

Before writing the program, it’s important to know how a program starts executing. The microcontroller has a PC (Program Counter) or PCL register. The value of the PC can range from 0x00 to 0xFF (an 8-bit value), which points to the ROM address. For example, if the PC is set to 4, the microcontroller executes the instruction at offset 4. After executing that instruction, the PC normally advances to the next one. Skip instructions such as BTFSC and BTFSS can make it step over an instruction, and opcodes like GOTO, CALL, and RETLW load the PC (register PCL) with a new address so that execution continues from there. This is how a program runs.

Let’s start by writing a simple Hello, World program.

To set an address for a label, you can use the EQU command. For example, let’s set GPIO to address 0x06.

ASM
GPIO EQU 0x06

Now, let’s print a character to the console:
The PIC10F200 microcontroller checks the first 7 bits (starting from 1) for a character value, and if bit number 8 (the last one) is set, it flushes the character to the console. Knowing this, let’s create a Hello, World program.

To print a character to the console, you need to do three things:

  1. Load the value of the character into the GPIO.
  2. Set bit 7 (the last one) of GPIO to 1.
  3. Clear the value of GPIO (to prevent printing the character again).

ASM
GPIO EQU 0x06    ; Set value of GPIO to 0x06
MOVLW 'H'        ; Move value of 'H' (0x48) to register W
MOVWF GPIO       ; Move value of register W to register GPIO (0x06)
BSF GPIO, 7      ; Set bit number 7 (last one, starting from 0) to 1
CLRF GPIO        ; Clear GPIO and set its value to 0x00
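
To see why these three steps print exactly one character, it helps to look at the other side of the pin. The C sketch below shows how an emulator could react to each write to GPIO, based on the behavior described above (bit 7 triggers the flush, the low 7 bits are the character). The function gpio_write and its buffer parameters are assumptions for this example and are not taken from the actual emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulator hook, called on every write to GPIO.
 * If bit 7 of `value` is set and there is room, the low 7 bits are stored
 * in out[0] as a character. Returns the number of characters written. */
size_t gpio_write(uint8_t value, char *out, size_t out_len)
{
    assert(out != NULL);
    if ((value & 0x80) == 0 || out_len == 0)
        return 0;                     /* flush bit clear, or no room left */
    out[0] = (char)(value & 0x7F);    /* keep only the character bits     */
    return 1;
}
```

Following the program above: MOVWF GPIO writes 0x48 (nothing is printed), BSF GPIO, 7 turns it into 0xC8 (the character 'H' is printed), and CLRF GPIO writes 0x00 (nothing is printed).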

Now let’s improve this program by using CALL and RETLW to call a subroutine that is meant to flush one character at a time into the console.

ASM
GPIO EQU 0x06

main:
    MOVLW 'H'     ; Move value of 'H' (0x48) to register W
    CALL flush    ; Flush a character into the console
    SLEEP         ; Stop here so execution does not fall through into flush

flush:
    MOVWF GPIO    ; Move value of register W to GPIO
    BSF GPIO, 7   ; Set 7th bit of register GPIO to 1 (flush the character)
    CLRF GPIO     ; Clear value of register GPIO and set it to 0x00
    RETLW 0       ; Return and load 0x00 to register W

In this block of code, we moved the value of 'H' into register W and called flush, which makes the character be written into the console. We can use this method to print “Hello, World”.

ASM
GPIO EQU 0x06

main:
    MOVLW 'H'   ; Move 'H' to register W
    CALL flush  ; Flush a character into the console

    MOVLW 'e'
    CALL flush

    MOVLW 'l'
    CALL flush

    MOVLW 'l'
    CALL flush

    MOVLW 'o'
    CALL flush

    MOVLW ' '
    CALL flush

    MOVLW 'W'
    CALL flush

    MOVLW 'o'
    CALL flush

    MOVLW 'r'
    CALL flush

    MOVLW 'l'
    CALL flush

    MOVLW 'd'
    CALL flush

    SLEEP       ; Stop here so execution does not fall through into flush

flush:
    MOVWF GPIO    ; Move value of register W to GPIO
    BSF GPIO, 7   ; Set 7th bit of register GPIO to 1 (flush the character)
    CLRF GPIO     ; Clear value of register GPIO and set it to 0x00
    RETLW 0       ; Return and load 0x00 to register W

But this method is inefficient because it takes two opcodes for every character.
There is a better method that uses RETLW and the value of register PCL (PC), which needs only one opcode per character plus a small fixed loop.

ASM
GPIO EQU 0x06   ; set address 0x06 for GPIO (Input/Output pin)
RAM EQU 0x0A    ; Use an address from RAM (random access memory)
PC EQU 0x02     ; Set PC to the address of register PCL

MOVWF GPIO      ; Move value of register W to register GPIO
BSF GPIO, 7     ; set 7th bit of register GPIO to 1
BCF GPIO, 7     ; clear the 7th bit of register GPIO


MOVLW 1         ; Move value 0x01 to register W
INCF RAM, 1     ; Increment value of RAM (store in RAM)
ADDWF RAM, 0    ; Add value of register W to RAM (store in register W)
ADDWF PC, 1     ; Add value of register W to PC (store in register PCL)

LOOP:
  NOP
  NOP
  RETLW 'H'     ; Return and Load register W with letter value of 'H'
  RETLW 'e'
  RETLW 'l'
  RETLW 'l'
  RETLW 'o'
  RETLW ','
  RETLW ' '
  RETLW 'W'
  RETLW 'o'
  RETLW 'r'
  RETLW 'l'
  RETLW 'd'
  RETLW '!'
  RETLW '\n'
  CLRF RAM     ; Clear 'RAM'
  CLRF GPIO    ; Clear register GPIO
  CLRW         ; Clear register W
  GOTO 0x00    ; Start Over again (infinite loop)

This code uses an array of characters to print Hello, World!\n. Implementing an array with these opcodes is an interesting trick. Writing to register PCL makes the program counter (PC) of the microcontroller point to a different instruction. That is what we do here: we place a sequence of RETLW instructions, one per letter, and use INCF on a temporary address (RAM) so that each pass adds a larger offset to the PC and lands on the next letter. Since we never pushed anything onto the microcontroller’s stack, RETLW pops a return address of 0, which sends the PC back to the beginning (the place where we push one character at a time) with the letter in register W. Finally, when the PC lands on the instruction after the last RETLW, we clear all the values and start the loop over. This creates an infinite loop that prints Hello, World!\n in the console.
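
The address arithmetic behind this jump can be checked with a small C sketch. It assumes that the EQU lines take no space in program memory, so MOVWF GPIO sits at address 0, ADDWF PC, 1 at address 6, and the first RETLW at address 9. It also assumes that PCL already holds the address of the following instruction (7) when the addition happens. All names in the code are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum {
    ADDWF_PC_ADDRESS    = 6,  /* assumed address of "ADDWF PC, 1"   */
    FIRST_RETLW_ADDRESS = 9,  /* assumed address of "RETLW 'H'"     */
    TABLE_LENGTH        = 14  /* number of RETLW entries            */
};

static const char MESSAGE[TABLE_LENGTH + 1] = "Hello, World!\n";

/* Address the program jumps to on the given pass (1 = first pass). */
unsigned jump_target(unsigned pass)
{
    assert(pass >= 1);
    unsigned ram = pass;               /* INCF RAM, 1 has run `pass` times      */
    unsigned w = 1 + ram;              /* MOVLW 1, then ADDWF RAM, 0            */
    return (ADDWF_PC_ADDRESS + 1) + w; /* PCL already points past the ADDWF     */
}

/* Character loaded by the RETLW at `address`, or -1 outside the table. */
int table_char(unsigned address)
{
    if (address < FIRST_RETLW_ADDRESS ||
        address >= FIRST_RETLW_ADDRESS + TABLE_LENGTH)
        return -1;
    return MESSAGE[address - FIRST_RETLW_ADDRESS];
}
```

On the first pass, RAM is 1 and W is 2, so the jump lands on address 9, which is RETLW 'H'; this is also why the two NOP lines are needed in front of the table. On pass 14 the target is address 22 (the '\n' entry), and on pass 15 it is address 23, the CLRF RAM that resets the loop.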