top of page
Search
daveor

Writing a PDP-11 "decompiler"

I have been taking a look at the PDP-11 BASIC code and I thought it might be interesting to see if I could figure out how it works. I don't expect that I will be able to get a full picture, but it'll be interesting to see what I learn along the way.


The code consists of about 4,000 instructions when converted from absolute tape format, which is far too many to work through and manually figure out what each instruction is. Therefore, my first step on this journey was to write some code to make a first pass through the opcodes and attempt to figure out what they mean. The good news is that I have already had some success with this approach, including discovering the interesting approach to using the TRAP instructions that I wrote about yesterday.


Instructions patterns

The first thing I did was take a look through the available instructions and I noted that they can be sorted into a number of categories based on their structure:

  1. Zero operand instructions, such as HALT or WAIT.

  2. Single operand instructions, such as INC, DEC or TST.

  3. Double operand instructions, such as MOV or CMP.

  4. Instructions that take a register plus an operand, such as JSR, MUL or DIV.

  5. Branch instruction

  6. TRAP and EMT

  7. Operands that take a register argument only, such as RTS or FADD.

All instructions consist of a one word opcode plus zero or more additional words depending on the nature of the operands (see below).


Zero operand instructions are easy to identify because the entire word of the opcode represents the instruction. For example

  1. 000000 = HALT

  2. 000001 = WAIT

  3. 000002 = RTI

and so on.


All of the others require a bit more interpretation because the opcode consists of several parts. In the case of single operand instructions, for example, the first four octal digits of the opcode represent the instruction, with the remaining two digits representing the operand. Similarly, for two operand instructions, the first two octal digits represent the instruction, the next two digits are the source operand and the final two digits are the destination operand.


Some single operand examples would be:

  1. 0001XX = JMP XX

  2. 0005XX = CLR XX

  3. 0052XX = INC XX

and some double operand examples would be:

  1. 01SSDD = MOV SS, DD

  2. 02SSDD = CMP SS, DD

I will describe below how to interpret the operands, but let's keep focussing on the instruction part for now.


Two of the other categories of instructions take either a register argument, or a register argument plus an operand. In these cases, the register is represented by a single octet value (0-7). So you could have instructions like:

  1. 004RDD = JSR R, DD

  2. 070RDD = MUL R, DD

  3. 00020R = RTS R

  4. 07500R = FADD R

Then we have the branch instructions that work slightly differently. These are structured as a base value plus an offset, where the base value represents the opcode and the offset represents where to branch to. The high byte of a branch opcode represents the instruction and the low byte represents the branch value. Byte values and octal values don't line up perfectly, so I find it easier to think about the branch instructions in hex:


  1. 0x01NN = octal 000400 + NN = BR NN

  2. 0x02NN = octal 001000 + NN = BNE NN

  3. 0x03NN = octal 001400 + NN = BEQ NN

The branch value in these cases is a signed 2's complement value, allowing branching both forwards and backwards in the code from the instruction location.


I have already extensively described the TRAP and EMT instructions in a previous post, so briefly, they consist of a ranges of zero operand instructions:

  1. opcodes in the range 104000-104377 = EMT

  2. opcodes in the range 104400-104777 = TRAP

Operand patterns

Instruction operands are generally represented by two octal digits. The second of the two octal digits represents a register and the first digit indicates how to use the value of that register to get the operand (i.e. the addressing mode). This structure is how the various PDP-11 addressing modes are implemented, so if you are not familiar with the addressing modes, you can have a look on wikipedia or in any one of a multitude of documents such as the PDP-11/70 Handbook (chapter 3) or my favourite reference, the PDP-11 Programming Card.


Here are the meanings of the addressing modes:


  1. Register addressing: An addressing mode of "0" means that the operand is the register itself. For example "01" would mean that the operand is the value in R1. This is represented in code as "R1".

  2. Register deferred addressing: An addressing mode of "1" means that the memory address of the operand is contained in the register. For example "11" would mean that the operand can be found at the memory address contained in register R1. This is represented in code as (R1) or sometimes @R1.

  3. Autoincrement addressing: An addressing mode of "2" means that the memory address of the operand is contained in the register, and that address will be incremented after it is used. For example "21" would mean that the operand can be found at the memory address contained in R1, and that value will be autoincremented after it is used to point at the next memory address. This is represented in code as (R1)+. Note that the "+" is after the register value, indicating that the register is read and then incremented afterwards.

  4. Autoincrement deferred: An addressing mode of "3" means that the memory address of the operand is contained in the memory address contained in the register. Also, the operand is incremented after it is used. For example "31" would mean that the value of R1 is a memory address, and in that memory address there is another memory address. That second memory address, in turn, points at the operand. This is represented in code as @(R1)+. Note again the "+" indicating autoincrement after use.

  5. Autodecrement: An addressing mode of "4" means that the memory address of the operand is contained in the register, and that the address will be decremented before use. For example "11" would mean that the operand can be found at the memory address contained in R1, and that value will be decremented before it is used. This is represented in code as -(R1). Note that the "-" is before the register value, indicating that the register is decremented first and then read afterwards.

  6. Autodecrement deferred: An addressing mode of "5" means that the memory address of the operand is contained in the memory address contained in the register. Also, the operand is decremented before it is used. For example "51" would mean that the value of R1 is a memory address, and in that memory address there is another memory address. That second memory address, in turn, points at the operand. This is represented in code as @-(R1). Note again the "-" before the register indicating decrement before use.

  7. Index: An addressing mode of "6" means that the address is found at a fixed offset from the memory address stored in the register. For example "61" would mean that the next word in memory after the opcode contains a value that should be added to the value contained in R1. The operand will be found at the resulting memory address location. This is represented in code as "X(R1)" where the value X is found in the next word in memory after the opcode.

  8. Index deferred: An addressing mode of "7" means that a fixed offset is added to the memory address stored in the register. At this location there will be a memory address and at that address, in turn, will be the operand. For example "71" means that the next word in memory contains a value that should be added to the value contained in R1. At the resulting memory location there will be a memory address where the operand will be found. This is represented in code as "@X(R1)" where the value of X is found in the next word in memory after the opcode.

Although register 7 (i.e. the Program Counter) has a special purpose, these memory addressing modes can still be used with the program counter. This is so useful, and done so frequently, that the use of some of the addressing modes with the PC have different names and representations in code:


  1. Immediate: An operand value of "27" (i.e. addressing mode "2", register "7") means that the operand is contained in the next word of memory, immediately after the instruction. This is represented in code as "#X", where X is the value contained in the next word of memory.

  2. Absolute: An operand value of "37" means that the operand is contained at the memory address that can be found in the next word in memory. This is represented in code as "@#X", where X is the address that can be found in the next word of memory.

  3. Relative: An operand value of "67" means that the operand is contained in the memory address obtained by adding the value of the next word in memory to the program counter. This is usually represented in code as just the value of the memory address being referred to, for example the bare value "123".

  4. Relative deferred: An operand value of "77" means that the address of the operand is contained in the memory address obtained by adding the value of the next word in memory to the program counter. This is usually represented in code as "@123" where "123" is the result of adding the value of the next word to the value of the progam counter at the point this instruction would be executed.

If you think through each of these program counter specific modes carefully, you will see that they are exactly the same as the addressing modes applied to any other register, but because the progam counter has a special meaning , the consequences of applying the addressing modes are unique.


Special areas of memory

One final point of background information before moving on to the "decompiler" concept. There are certain areas of memory where the contents of those addresses should not be interpreted as instructions. The code that I have written considers one such block of memory and it is the addresses from 000000 to 000100. These are interrupt vectors and not areas that typically contain code.


Therefore, when I was writing the "decompiler" it skips addresses below "000 100" and does not (yet!) attempt to interpret them in any way.


"Decompiler" code concept

Finally, after all that, let me introduce you to the "decompiler" concept that I implemented. The general approach that I adopted was to have a "matcher" and a "parser" for each category of instructions that I mentioned earlier:

  1. Zero operand instructions, such as HALT or WAIT.

  2. Single operand instructions, such as INC, DEC or TST.

  3. Double operand instructions, such as MOV or CMP.

  4. Instructions that take a register plus an operand, such as JSR, MUL or DIV.

  5. Branch instruction

  6. TRAP and EMT

  7. Operands that take a register argument only, such as RTS or FADD.

I then read a memory address and word from input and check whether the memory address is in a special area (as mentioned above) and therefore should not be interpreted as an instruction. In these cases, for the time being, I just added a comment to the output indicating that this is a special memory address and not interpreted as an instruction.


If the input is not for a special memory area, the "matcher" for each category of instructions is run on the address until a matcher is found. If no appropriate matcher is found, some default text (in my case the word "NOTHING" is output). When a matcher is found, the corresponding "parser" is run to generate the code representing the instruction that is being considered. The input memory address and instruction are written to the output file, along with the code representing the instruction.


Sometimes, when parsing operands, it may be necessary to consume another word of input, which will be done as needed based on the instruction operands.


This process will be repeated until there are no lines left in the input file.


Implementation snippets

I implemented the code in C and I will just provide some key excerpts here. If anyone is interested in the full code, let me know in the comments and I will make it available somewhere.


Instructions are parsed into an "Instruction" structure consisting of the character representation of the memory address and the character representation of the instruction. I found that since the instructions were already strings, it was relatively easy to interpret them in their string form, in most cases.

typedef struct Instruction {
   char *address;
   char *instruction;
} Instruction;

Each "matcher" and "parser" are paired together into an InstructionParser:

typedef struct InstructionParser {
   int (*matchFunction)(Instruction *);
   char * (*parserFunction)(Instruction *, FILE *);
} InstructionParser;

The matchFunction() takes an Instruction pointer and returns 1 if the Instruction matches this parser, or 0 otherwise. The parserFunction() takes the Instruction pointer and the input FILE pointer (in case additional words need to be read from the file due to the nature of the operands) and returns a string pointer to an interpretation of the Instruction for inclusion in the output stream.


The match functions are based, for the most part, on string comparisons. Here is the zero argument match instruction and parser instruction, for example:

char identifiedInstruction[256];

int zeroOperandMatchFunction(Instruction *theInstruction) {
   char *instruction = theInstruction->instruction;
   if(strcmp(instruction, "000000") == 0) {
      strcpy(identifiedInstruction, "HALT");
      return 1;
   } else if(strcmp(instruction, "000001") == 0) {
      strcpy(identifiedInstruction, "WAIT");
      return 1;
   } else if(strcmp(instruction, "000002") == 0) {
      strcpy(identifiedInstruction, "RTI");
      return 1;
   } else if(strcmp(instruction, "000003") == 0) {
      strcpy(identifiedInstruction, "BPT");
      return 1;
   } else if(strcmp(instruction, "000004") == 0) {
      strcpy(identifiedInstruction, "IOT");
      return 1;
   } else if(strcmp(instruction, "000005") == 0) {
      strcpy(identifiedInstruction, "RESET");
      return 1;
   } else if(strcmp(instruction, "000006") == 0) {
      strcpy(identifiedInstruction, "RTT");
      return 1;
   } else if(strcmp(instruction, "000240") == 0) {
      strcpy(identifiedInstruction, "NOP");
      return 1;
   } else if(strcmp(instruction, "000241") == 0) {
      strcpy(identifiedInstruction, "CLC");
      return 1;
   } else if(strcmp(instruction, "000242") == 0) {
      strcpy(identifiedInstruction, "CLV");
      return 1;
   } else if(strcmp(instruction, "000244") == 0) {
      strcpy(identifiedInstruction, "CLZ");
      return 1;
   } else if(strcmp(instruction, "000250") == 0) {
      strcpy(identifiedInstruction, "CLN");
      return 1;
   } else if(strcmp(instruction, "000257") == 0) {
      strcpy(identifiedInstruction, "CCC");
      return 1;
   } else if(strcmp(instruction, "000261") == 0) {
      strcpy(identifiedInstruction, "SEC");
      return 1;
   } else if(strcmp(instruction, "000262") == 0) {
      strcpy(identifiedInstruction, "SEV");
      return 1;
   } else if(strcmp(instruction, "000264") == 0) {
      strcpy(identifiedInstruction, "SEZ");
      return 1;
   } else if(strcmp(instruction, "000270") == 0) {
      strcpy(identifiedInstruction, "SEN");
      return 1;
   } else if(strcmp(instruction, "000277") == 0) {
      strcpy(identifiedInstruction, "SCC");
      return 1;
   }
   return 0;
}

char *zeroOperandParserFunction(Instruction *theInstruction, FILE *filep) {
   return identifiedInstruction;
}

The match function in this case checks whether the target instruction is equal to the opcode for any of the zero argument instructions. If it is, the corresponding mnemonic (e.g. "HALT") is stored in a variable called identifiedInstruction. This value is then returned when the zero argument parser function is called.


The other match/parser pairs use the same general approach.


Operand parsing is handled through a shared support function. Here, for example is the single operand parse function:

char *singleOperandParserFunction(Instruction *theInstruction, FILE *filep) {
   char *instruction = theInstruction->instruction;
   char operand[3];
   strncpy(operand, instruction+4, 2);
   operand[2]=0;
   char *interpretedOperand = parseOperand(operand, filep);
   strcat(identifiedInstruction, interpretedOperand);
   free(interpretedOperand);
   return identifiedInstruction;
}

At this point, the corresponding singleOperandMatchFunction() has determined that the current instruction is a single operand instruction. The last two octal digits are placed in a temporary variable called "operand" and passed to the parseOperand() function. This returns a string called "interpretedOperand" which is appended to the identified instruction mnemonic and the resulting value is returned.


The parseOperand() function is quite long, so I won't copy the whole thing here. By way of example, here is the code for handling the register direct addressing mode:


char *parseOperand(char *operand, FILE *filep) {
   char *interpretedOperand = malloc(32);
   memset(interpretedOperand, 0, 32);
   if(operand[0]=='0') {
      if(strncmp("7", operand+1, 1) == 0) {
         strcat(interpretedOperand, "PC");
      } else if(strncmp("6", operand+1, 1) == 0) {
         strcat(interpretedOperand, "SP");
      } else {
         strcat(interpretedOperand, "R");
         strcat(interpretedOperand, operand+1);
      }
   } else if(operand[0]=='1') {
   ...

Some space is allocated for the result and then filled with zero byte values. Then the first byte of the operand is compared to zero. If it equals zero that means the operand is a register direct operand, so the output should be "R1", "R2", etc. The code looks at the second byte of the operand and appends it to the letter "R". However since "R7" is more commonly known as "PC" and "R6" is more commonly known as "SP", these two special cases are handled and appropriate output is generated. There are branches like this for each of the addressing modes.


Example output

Here is some example output:

...   
016442 005767 TST 17452
016446 003407 BLE 16466
016450 012767 MOV #177560, 13704
016456 012767 MOV #177564, 13706
016464 000406 BR 16502
016466 012767 MOV #177550, 13704
016474 012767 MOV #177554, 13706
016502 016701 MOV 17462, R1
...

On each line, the first item is the memory location, then the instruction and finally the "decompiled" instruction.


I have found this to be a very helpful technique for working out how code works. The piece of code above, for example, implements the following pseudo-code:

if (<something>) {
    move the memory address of the TTY receive CSW into 13704
    move the memory address of the TTY transmit CSW into 13706
} else {
    move the memory address of the tape reader CSW into 13704
    move the memory address of the tape reader CSW into 13706
}

It is not apparent from the code above what is being tested here (that's why I put in <something> as a placeholder) but it is clear that the value of memory address 17452 is being examined. PDP-11 BASIC is able to read BASIC programs from both TTY and tape so perhaps this code is used to configure where to read from, and address 17452 is, elsewhere in the code, set to a value representing the user-selected input source.


Possible Improvements

  1. As currently implemented the code takes as input a set of memory addresses and the word located at those addresses in memory, in a format produced by the absolute tape loader parser that I wrote previously. I don't think it wouldn't be too difficult for the "decompiler" to take input in a different format, such as a binary format.

  2. Another problem with the code, as written, is that everything is interpreted as an instruction, but that is not always the case. Sometimes there are byte strings and other non-code data contained in programs. Unfortunately there are no general characteristics that I have been able to identify so far to accurately distinguish non-code data from code.

  3. More meaningful output could be produced for the interrupt vector memory addresses.

338 views0 comments

Recent Posts

See All

Comments


bottom of page