Project 2: The SCRAM Tools

Projects are designed to test your mastery of course material as well as your programming skills; think of them as “take-home exams” and don’t communicate with anyone about possible solutions. This project focuses on development tools for the SCRAM architecture.

General Expectations

Problem 1: Disassembler (20%)

Your first job is to write a SCRAM disassembler called dis that can be used to comfortably examine a SCRAM object file. You can see what this means by looking at the example files in the archive posted on Piazza. The loop.scram object file contains a sequence of “raw” SCRAM instructions, the very bit patterns that would be in the memory (and thus the instruction register) of a SCRAM machine. So that’s a mess of zeros and ones: without a specialized tool such as dis, the only sensible way to look at this file is with a hexdump tool like xxd.

$ ls -la loop.scram
-rw-rw-r-- 1 phf phf 6 May 26  2015 loop.scram
$ xxd -g 1 loop.scram
0000000: 14 55 34 70 00 01                                .U4p..
$ xxd -b loop.scram
0000000: 00010100 01010101 00110100 01110000 00000000 00000001  .U4p..

As you can see, the object file contains only 6 bytes. The first xxd command displays those 6 bytes in hexadecimal notation, the second in binary notation. Now look at that first byte carefully: It’s actually an LDA instruction that loads from address 4! (Recall how the instruction encoding of the SCRAM works: upper 4 bits for the opcode, lower 4 bits for the address.) The SCRAM disassembler translates those bytes back into SCRAM assembly language, at least approximately:

$ ./dis <loop.scram
0: LDA 4
1: ADD 5
2: STA 4
3: JMP 0
4: HLT 0
5: HLT 1
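
To make the encoding concrete, here is a minimal C sketch of the nibble split that produces a listing like this one. The function names are made up for illustration; nothing in the project requires them.

```c
#include <stdint.h>

/* Upper 4 bits of an instruction byte: the opcode. */
static unsigned scram_opcode(uint8_t byte)
{
    return byte >> 4;
}

/* Lower 4 bits of an instruction byte: the address. */
static unsigned scram_address(uint8_t byte)
{
    return byte & 0x0Fu;
}
```

For the first byte of loop.scram, 0x14, scram_opcode yields 1 and scram_address yields 4, which matches the LDA 4 in the listing.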

Don’t get confused by the HLT instructions! The disassembler is allowed to be a bit simplistic: It doesn’t have to figure out that addresses 4 and 5 are not actually used as instructions in this program, they only hold data. (Note that the presence of indirect addressing on the SCRAM would indeed make that very hard to figure out in general.) Instead it always assumes an instruction even if the SCRAM never fetches that location for execution.

As you saw above, the disassembler reads SCRAM object code from standard input and writes the disassembly to standard output. Of course it has no way of knowing whether it is really looking at SCRAM object code or not:

$ ./dis
Hi!
0: STI 8
1: SUB 9
2: LDI 1
3: HLT a

Here the user typed Hi! followed by the RETURN key and CTRL-D to end the input. The disassembler reads byte after byte and dutifully prints the equivalent SCRAM instructions. It’s all just zeros and ones after all! Of course there might be inputs that don’t actually correspond to any valid SCRAM instruction, but dis simply prints ??? for an unknown opcode:

$ ./dis
Hö!
0: STI 8
1: ??? 3
2: ??? 6
3: LDI 1
4: HLT a

(If you want to understand why ö results in two bytes, you’ll need to learn about the UTF-8 encoding of Unicode, something that has nothing to do with this project.) For the empty input dis should print nothing at all.
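
One way to produce such an output line is a small lookup table indexed by opcode. Treat the sketch below with care: the eight mnemonics are only what can be read off the transcripts in this section (opcodes 0 through 7), and every other opcode is printed as ???. The source code handed out with the project remains the authority on the full opcode list. The name format_instruction is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Mnemonics inferred from the example transcripts only (assumption);
 * opcodes not listed here are shown as "???". */
static const char *const mnemonics[] = {
    "HLT", "LDA", "LDI", "STA", "STI", "ADD", "SUB", "JMP"
};

/* Write "addr: MNEMONIC operand" (hex digits, no newline) into buf.
 * Returns what snprintf returns: the length of the formatted text. */
static int format_instruction(char *buf, size_t size,
                              unsigned addr, uint8_t byte)
{
    unsigned opcode = byte >> 4;
    unsigned operand = byte & 0x0Fu;
    size_t known = sizeof mnemonics / sizeof mnemonics[0];
    const char *name = opcode < known ? mnemonics[opcode] : "???";

    return snprintf(buf, size, "%x: %s %x", addr, name, operand);
}
```

With address 1 and the byte 0xC3 (the first of the two UTF-8 bytes of ö) this fills the buffer with "1: ??? 3", as in the transcript above.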

There is only one “error condition” for dis, namely that the input is longer than 16 bytes, the capacity of the SCRAM:

$ ./dis
This will be too long for dis!
0: ADD 4
1: SUB 8
2: SUB 9
3: JMP 3
4: LDI 0
5: JMP 7
6: SUB 9
7: SUB c
8: SUB c
9: LDI 0
a: SUB 2
b: SUB 5
c: LDI 0
d: JMP 4
e: SUB f
f: SUB f
dis: Program too long, truncated to 16 bytes.
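
A hedged sketch of how the 16-byte limit could be enforced while reading: keep at most 16 bytes and remember whether a 17th one showed up. The names read_program and SCRAM_SIZE are illustrative assumptions, not part of the handout.

```c
#include <stdint.h>
#include <stdio.h>

#define SCRAM_SIZE 16   /* bytes of memory on the SCRAM */

/* Read at most SCRAM_SIZE bytes from in into mem. Returns the number
 * of bytes stored and sets *too_long to 1 if more input was waiting. */
static size_t read_program(FILE *in, uint8_t mem[SCRAM_SIZE], int *too_long)
{
    size_t n = 0;
    int c;

    *too_long = 0;
    while ((c = getc(in)) != EOF) {
        if (n == SCRAM_SIZE) {
            *too_long = 1;   /* a 17th byte exists: stop reading */
            break;
        }
        mem[n++] = (uint8_t)c;
    }
    return n;
}
```

The caller would then print the first n instructions and, if too_long is set, the message shown in the transcript. This section does not say which output stream that message belongs on, so the sketch leaves the decision to the caller.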

That’s all you need to know to write your dis.c implementation, always assuming you also read the source code we gave you. Just be careful not to simply clone sim.c because it does way more than what is needed here. If you hand in a program that’s overly complicated we might take points off for that!

Problem 2: Assembler (80%)

Nobody wants to write SCRAM programs by hand-crafting byte sequences like the above. Instead, we’d like to at least be able to write something like this (see loop.z in the archive):

	LDA	4
	ADD	5
	STA	4
	JMP	0
	DAT	0
	DAT	1

The assembler is a program that translates a textual description like the above into an “equivalent” 6-byte SCRAM object file.

Note the pseudo-instruction DAT! This is not a SCRAM instruction, rather it’s a way for the programmer to tell the assembler that a given byte is supposed to hold a certain data value. While the addresses after LDA, JMP, … can only be 4 bits long, the value after a DAT can be up to 8 bits long since that value is written directly into the object file without a 4 bit opcode before it!
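
The difference between the two ranges shows up directly when a byte is assembled. The following is a minimal sketch under the assumption that the opcode has already been looked up as a number and the operand already parsed; encode_instruction and encode_data are hypothetical helper names.

```c
#include <stdint.h>

/* Combine a 4-bit opcode and a 4-bit address into one byte.
 * Returns 0 on success, -1 if either field does not fit in 4 bits. */
static int encode_instruction(unsigned opcode, unsigned address, uint8_t *out)
{
    if (opcode > 0xFu || address > 0xFu)
        return -1;
    *out = (uint8_t)((opcode << 4) | address);
    return 0;
}

/* A DAT value is written as-is, so it may use all 8 bits.
 * Returns 0 on success, -1 if the value does not fit in a byte. */
static int encode_data(unsigned value, uint8_t *out)
{
    if (value > 0xFFu)
        return -1;
    *out = (uint8_t)value;
    return 0;
}
```

Opcode 1 with address 4 gives 0x14, the first byte of loop.scram. An address of 16 is rejected, while a DAT value of 16 (or 255) is fine.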

However, even that notation is not really comfortable because the programmer still has to manually track the addresses of all the various instructions and data bytes. What we would really like the assembler to take as input is a file like this (see loop.s in the archive):

# Simple counter program in SCRAM assembly.

start:	LDA	count
	ADD	one
	STA	count
	JMP	start

count:	DAT	0	# counter variable
one:	DAT	1	# constant 1

There are two innovations here: comments that can be used to explain pieces of a program and labels that can be used to automate address computations. Most assembly languages, including this one, are line-oriented which means that they are processed line-by-line by the assembler. In general, the structure of a line is as follows:

label: instruction # comment

All of the components are optional: we can have lines that are empty, lines that are only comments, lines that only define a label, and lines that only contain an instruction; or any combination thereof.

Comments start with the character “#” and continue to the end of the current line; anything between “#” and the end of the line is ignored by the assembler. Instructions consist of OPCODE address pairs separated by whitespace; to the list of actual SCRAM opcodes we add the DAT pseudo-opcode as described above; opcodes are always fully capitalized; addresses can be non-negative integers or label references. (Note that for the HLT instruction we could leave out the address part (why?); however, in the interest of making the assembler a little easier to write, we still require an address even for HLT.) Label definitions are sequences of letters (upper or lower case, but case matters!) that end with a colon; if a label definition is present, it has to come before the instruction in the same line (if any).
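
A rough sketch of how one line could be taken apart along these rules: cut the line at the first “#”, then split what is left on whitespace. It assumes that whitespace separates a label definition from the instruction that follows it (the text above does not say whether that is required), and it does not check that labels consist of letters only or that opcodes are known. The name split_line is hypothetical.

```c
#include <string.h>

/* Split one source line in place. Each output is set to NULL when that
 * part is absent. Returns 0 on success, -1 if an opcode has no address
 * or if extra tokens follow the address. */
static int split_line(char *line, char **label, char **opcode, char **operand)
{
    static const char ws[] = " \t\r\n";
    char *hash = strchr(line, '#');
    char *tok;
    size_t len;

    if (hash != NULL)
        *hash = '\0';                     /* drop the comment */
    *label = *opcode = *operand = NULL;

    tok = strtok(line, ws);
    if (tok == NULL)
        return 0;                         /* empty or comment-only line */

    len = strlen(tok);                    /* strtok tokens are never empty */
    if (tok[len - 1] == ':') {            /* label definition */
        tok[len - 1] = '\0';
        *label = tok;
        tok = strtok(NULL, ws);
        if (tok == NULL)
            return 0;                     /* label-only line */
    }

    *opcode = tok;
    *operand = strtok(NULL, ws);
    if (*operand == NULL)
        return -1;                        /* address is required, even for HLT */
    if (strtok(NULL, ws) != NULL)
        return -1;                        /* trailing junk after the address */
    return 0;
}
```

Because strtok modifies its argument, the line has to live in a writable buffer; the returned pointers point into that same buffer.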

The way addresses are assigned to labels is straightforward: We start at address 0 for the first instruction; each (pseudo-)instruction will advance the current address by 1 since each of them occupies 1 byte of memory. So in the program above, the start label is 0 because there are no instructions preceding it. The instructions following (LDA, ADD, STA, JMP) are each 1 byte long, meaning that the count label is address 4 whereas the one label is address 5.

The only problem with all of this is that a label may be referenced before it has been defined as is the case in the example program: When we process “LDA count” for the first time, we don’t know yet that count is actually 4. Most assemblers use a two-pass process to get around this: In the first pass they only process the addresses and not the instructions themselves, meaning the first pass determines what all the labels will be but doesn’t actually generate the finished object file yet. The second pass then uses this information to fill in the correct bit patterns for all instructions and write the final object file.
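
The two-pass idea can be sketched on statements that have already been split into label, opcode, and operand. Everything named here (struct stmt, struct sym, pass1, resolve) is an illustrative assumption, and the sketch deliberately leaves out duplicate-label detection, the 16-byte program limit, and the 4-bit versus 8-bit range checks.

```c
#include <stdlib.h>
#include <string.h>

struct stmt { const char *label, *opcode, *operand; };  /* NULL = absent */
struct sym  { const char *name; unsigned addr; };

/* Return the address of a label, or -1 if it is not in the table. */
static int lookup(const struct sym *tab, size_t nsyms, const char *name)
{
    for (size_t i = 0; i < nsyms; i++)
        if (strcmp(tab[i].name, name) == 0)
            return (int)tab[i].addr;
    return -1;
}

/* Pass 1: count bytes and record the address of every label.
 * tab must have room for n entries. Returns the number of labels. */
static size_t pass1(const struct stmt *s, size_t n, struct sym *tab)
{
    size_t nsyms = 0;
    unsigned addr = 0;

    for (size_t i = 0; i < n; i++) {
        if (s[i].label != NULL) {
            tab[nsyms].name = s[i].label;
            tab[nsyms].addr = addr;
            nsyms++;
        }
        if (s[i].opcode != NULL)
            addr++;              /* every (pseudo-)instruction is 1 byte */
    }
    return nsyms;
}

/* Pass 2 helper: turn an operand into a number, whether it is a label
 * reference or a decimal literal. Returns -1 if it is neither. */
static int resolve(const struct sym *tab, size_t nsyms, const char *operand)
{
    char *end;
    long v;
    int addr = lookup(tab, nsyms, operand);

    if (addr >= 0)
        return addr;
    v = strtol(operand, &end, 10);
    if (end == operand || *end != '\0' || v < 0 || v > 255)
        return -1;
    return (int)v;
}
```

Run on loop.s, pass1 records start as 0, count as 4, and one as 5, so by the time the second pass reaches LDA count the operand resolves to 4.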

Your job is to write the SCRAM assembler. So you need to write a program that given something like loop.s on the standard input will produce a SCRAM object file like loop.scram on the standard output. Nothing more and nothing less. We suggest that you first write a version of the assembler that works only for inputs like loop.z and then extend it with comments and labels (and empty lines!) so it can also process inputs like loop.s. (If you’re feeling particularly mighty, you can of course also try to immediately hack the full version; but that’s not recommended.)

Please call your executable assembler sas (short for “Scram ASsembler”). Of course you actually submit the complete source code, not the executable.

Note on Error Messages: Your assembler should do error checking for the input program. So if a label is used that has never been defined, there should be an error. If a label is defined twice, there should be an error. If a label is out of range, or if the program is too long for the SCRAM, or if a number is too large (4-bit unsigned for addresses, 8-bit unsigned for data) there should be an error. If an unknown opcode is used or if an address is missing, there should be an error. And so on, and so forth! Please include a line number (starting at 1) with your error message to help the programmer correct their code, and please make sure you print error messages to standard error and not to standard output! No input, however cleverly crafted, should make your assembler crash!
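
As one illustration of defensive checking, here is a sketch of a number parser that rejects anything that is not a plain non-negative decimal within a given limit, plus a tiny reporting helper that writes to standard error. The function names and the exact message layout ("sas: line N: ...") are assumptions; the handout only asks for a line number and for standard error.

```c
#include <ctype.h>
#include <stdio.h>

/* Parse a non-negative decimal number no larger than max (intended for
 * max = 15 or max = 255). Returns 0 and stores the value on success,
 * -1 on empty input, non-digit characters, or a value above max. */
static int parse_number(const char *text, unsigned long max, unsigned long *out)
{
    unsigned long value = 0;

    if (*text == '\0')
        return -1;
    for (const char *p = text; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return -1;
        value = value * 10 + (unsigned long)(*p - '0');
        if (value > max)
            return -1;   /* bail out early; with small max, value cannot overflow */
    }
    *out = value;
    return 0;
}

/* Report an error for the given source line (counted from 1). */
static void report_error(unsigned line, const char *message)
{
    fprintf(stderr, "sas: line %u: %s\n", line, message);
}
```

Calling parse_number with a limit of 15 covers addresses, and with a limit of 255 it covers DAT values. Arbitrarily long digit strings are rejected as soon as the running value exceeds the limit.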

Note on Line/String Lengths: As a special concession to using C as the implementation language for this assignment, you may assume that a line of input has at most 128 characters, excluding the final LF (line feed) character. Similarly, you may assume that a label has at most 32 characters, excluding the final : (colon) in case of a definition. However, you still need to stop with an error message if there is a longer line/label; your assembler must not crash!
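
One possible way to honor the line limit without risking a buffer overrun is to read with fgets into a buffer that has room for 128 characters, the LF, and the terminating NUL, and to treat a completely filled buffer without an LF as an overlong line. This is only a sketch; read_line and MAX_LINE are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 128   /* characters per line, not counting the LF */

/* Read one line into buf and strip its LF.
 * Returns 1 if a line was read, 0 at end of input, -1 if the line has
 * more than MAX_LINE characters. */
static int read_line(FILE *in, char buf[MAX_LINE + 2])
{
    size_t len;

    if (fgets(buf, MAX_LINE + 2, in) == NULL)
        return 0;
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';     /* complete line that fit in the buffer */
        return 1;
    }
    if (len == MAX_LINE + 1)
        return -1;               /* buffer full and still no LF: too long */
    return 1;                    /* last line of the input, no final LF */
}
```

A label length check can then be a plain strlen comparison against 32 on the already extracted label.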

General Hints

Deliverables

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no object files, no executable files, etc.), but allows building all required derived files. Make sure to include a Makefile that sets the appropriate compiler flags as detailed on Piazza and builds all programs by default.

Include a plain text README file (not README.txt or README.docx or whatnot) that briefly explains what your programs do and contains any other notes you want us to check out before grading. Your answers to written problems should be in your README file as well! Make sure to include explanatory notes and detailed derivations that tell us how you solved the problem in question (and convince us that you really did the work).

Finally, make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Grading

For reference, here is a short explanation of the grading criteria; some of the criteria don’t apply to all problems, and not all of the criteria are used on all projects.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for projects on Piazza.

Style refers to both programming and presentation style. Programming style includes things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Style also includes proper modularization of your code (into functions, modules, etc.), proper use of static and extern, etc. Simple, clean, readable code is what you should be aiming for. For C (and, if allowed, C++) programs, make sure you follow the style guide posted on Piazza! Presentation style refers to your README file and (possibly!) your PDF files for diagrams. Your presentation should be clear, structured problem-by-problem, broken into sections (and paragraphs!) as appropriate. Lines should be at most 80 characters in length, broken by UNIX linefeeds. (You may use Markdown format if you so choose, but everything must still be perfectly readable without rendering Markdown to another format.) Diagrams should be clearly labeled, cleanly laid out, and generally a pleasure to look at.

Performance refers to how fast/with how little memory your programs or circuits can produce the required results compared to other submissions.

Functionality refers to your programs or circuits being able to do what they should according to the specification given above; if the specification is ambiguous, ask for clarification! (It also refers to you simply doing the required work, beyond programming or circuit design!)

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using the required compiler options given on Piazza we will take off 10% (except if you document a very good reason). If your programs cannot be built using make we will take off 10%. If valgrind detects memory errors in your programs, we will take off 10%. If your programs fail miserably even once, i.e. terminate with an exception of any kind or dump core, we will take off 10% (for each such case).