Project 5: The Power of Caches

Projects are designed to test your mastery of course material as well as your programming skills; think of them as “take-home exams” and don’t communicate with anyone about possible solutions. This project focuses on simulating and evaluating caches.

Overview

We’ll give you a number of memory traces from real benchmark programs. You’ll implement a program to simulate how a cache would perform on these traces given a variety of configuration parameters. You’ll then use your program and the given traces to determine the best overall cache configuration.

Programming Languages

You can use either C or C++ for this assignment. You’re allowed to use the standard library of your chosen language as much as you would like to, but you are not allowed to use any additional (non-standard) libraries.

We highly recommend that you opt for C++ because the C++ standard library contains many useful data structures that you would have to write from scratch in C! Don’t waste your time on data structures; spend it on the actual problem! However, if you pick C++ you must write real C++ code, not just “C with some classes added”!

We must be able to build your program on the Lubuntu 16.04 LTS reference system using make with no additional arguments, so obviously you need to include a working Makefile with your source code. Make sure you use the C/C++ compiler flags posted on Piazza and follow the style guide posted there for your code.

Problem 1: Cache Simulator (90%)

You will design and implement a cache simulator that can be used to study and compare the effectiveness of various cache configurations. Your simulator will read a memory access trace from a given file, simulate what a cache based on certain parameters would do in response to these memory access patterns, and finally produce some summary statistics. Let’s start with the file format of the memory access traces:

s 0x1fffff50 1
l 0x1fffff58 1
l 0x1fffff88 6
l 0x1fffff90 2
l 0x1fffff98 2
l 0x200000e0 2
l 0x200000e8 2
l 0x200000f0 2
l 0x200000f8 2
l 0x30031f10 3
s 0x3004d960 0
s 0x3004d968 1
s 0x3004caa0 1
s 0x3004d970 1
s 0x3004d980 6
l 0x30000008 1
l 0x1fffff58 4
l 0x3004d978 4
l 0x1fffff68 4
l 0x1fffff68 2
s 0x3004d980 9
l 0x30000008 1

As you can see, each memory access performed by a program is recorded on a separate line. There are three “fields” separated by white space. The first field is either l or s depending on whether the processor is “loading” from or “storing” to memory. The second field is a 32-bit memory address given in hexadecimal; the 0x at the beginning means “the following is hexadecimal” and is not itself part of the address. You can ignore the third field for this assignment.
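The trace format described above can be read one line at a time. The following is a minimal sketch, assuming C++11; the names Access and parse_access are hypothetical and not part of the assignment, and rejecting malformed lines by returning false is just one possible design.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <sstream>
#include <string>

// One parsed trace record; the third field of each line is ignored.
struct Access {
    bool is_store;          // true for 's', false for 'l'
    std::uint32_t address;  // 32-bit address parsed from hexadecimal
};

// Parses a line such as "s 0x1fffff50 1" into `out`.
// Returns false (leaving `out` untouched) if the line is malformed.
bool parse_access(const std::string& line, Access& out) {
    std::istringstream in(line);
    char kind = 0;
    std::string addr_text;
    if (!(in >> kind >> addr_text)) return false;
    if (kind != 'l' && kind != 's') return false;
    try {
        std::size_t used = 0;
        // Base 16 accepts an optional leading "0x".
        unsigned long value = std::stoul(addr_text, &used, 16);
        if (used != addr_text.size() || value > 0xFFFFFFFFul) return false;
        out.is_store = (kind == 's');
        out.address = static_cast<std::uint32_t>(value);
        return true;
    } catch (const std::exception&) {
        return false;  // not a number, or out of range
    }
}
```

A caller would typically loop over the trace with std::getline and pass each line to parse_access.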

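Your cache simulator will be configured with six cache design parameters, given as command-line arguments (see below): the number of sets in the cache, the number of blocks in each set, the number of bytes in each block, whether stores write-allocate (1) or not (0), whether stores are write-through (1) or write-back (0), and whether evictions are least-recently-used (1) or FIFO (0).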

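Note that certain combinations of these design parameters account for direct-mapped, set-associative, and fully associative caches: a cache with one block per set is direct-mapped, a cache with a single set holding all of its blocks is fully associative, and a cache with several sets of several blocks each is set-associative.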

The smallest cache you must be able to simulate has 1 set with 1 block with 4 bytes; this cache can only remember a single 4-byte memory reference and nothing else; it can therefore only be beneficial if consecutive memory references in a trace go to the exact same address. You should probably use this tiny cache for basic sanity testing.
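To decide which set an address belongs to, the address is usually split into a tag, a set index, and a block offset. The sketch below assumes that the number of sets and the block size are powers of two (an assumption; check the parameter constraints you are given). The names AddressParts, log2_exact, and split_address are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct AddressParts {
    std::uint32_t tag;     // identifies the block within its set
    std::uint32_t index;   // which set the block maps to
    std::uint32_t offset;  // byte position inside the block
};

// Returns log2(n); n must be a power of two.
unsigned log2_exact(std::uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while ((n >> bits) != 1) ++bits;
    return bits;
}

AddressParts split_address(std::uint32_t addr, std::uint32_t num_sets,
                           std::uint32_t block_bytes) {
    unsigned offset_bits = log2_exact(block_bytes);
    unsigned index_bits = log2_exact(num_sets);
    AddressParts p;
    p.offset = addr & (block_bytes - 1);
    p.index = (addr >> offset_bits) & (num_sets - 1);
    // Two separate shifts keep each shift amount below 32.
    p.tag = (addr >> offset_bits) >> index_bits;
    return p;
}
```

For the tiny cache (1 set, 4-byte blocks) the index is always 0 and the tag is simply the address divided by 4, so two accesses hit the same block only if they fall into the same aligned 4-byte word.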

A brief reminder about the other three parameters:

The write-allocate parameter determines what happens for a cache miss during a store: if true (1), then a store brings the relevant memory block into the cache before it proceeds; if false (0), a cache miss during a store does not modify the cache; this parameter interacts with the following one.

The write-through parameter determines whether a store always writes to memory immediately or not: if true (1), then a store writes to the cache as well as to memory; if false (0), then a store writes to the cache only and marks the block dirty; if the block is evicted later, it has to be written back to memory before being replaced. It doesn’t make sense to combine no-write-allocate with write-back because we wouldn’t be able to actually write to the cache for the store!
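The interaction of the two write parameters can be summarized as a small decision function. This is only a sketch of one reading of the rules above; StoreEffect and store_effect are hypothetical names, and the assertion encodes the combination that the text rules out.

```cpp
#include <cassert>

// What a single store does to the cache and to memory.
struct StoreEffect {
    bool fetch_block;   // bring the block into the cache first
    bool write_cache;   // update the cached copy
    bool write_memory;  // write to memory right away
    bool mark_dirty;    // defer the memory write until eviction
};

StoreEffect store_effect(bool hit, bool write_allocate, bool write_through) {
    // no-write-allocate combined with write-back makes no sense.
    assert(write_allocate || write_through);
    StoreEffect e;
    e.fetch_block = !hit && write_allocate;
    e.write_cache = hit || write_allocate;
    e.write_memory = write_through;
    e.mark_dirty = e.write_cache && !write_through;
    return e;
}
```

For example, a store miss with write-allocate and write-back fetches the block, writes the cached copy, and marks it dirty; a store miss with no-write-allocate and write-through touches memory only.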

The last parameter is only relevant for associative caches: in direct-mapped caches there is no choice for which block to evict! The least-recently-used policy (1) evicts the block that has gone the longest without being accessed; the FIFO policy (0) evicts the block that has been in the cache the longest.
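One simple way to model both eviction policies is to keep two timestamps per block: when it was loaded and when it was last used. The sketch below shows a single set; Slot and CacheSet are hypothetical names, and a real simulator might prefer a different bookkeeping scheme (for instance a list ordered by age).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    bool valid = false;
    std::uint32_t tag = 0;
    std::uint64_t loaded_at = 0;  // FIFO looks at this
    std::uint64_t used_at = 0;    // LRU looks at this
};

class CacheSet {
public:
    CacheSet(std::size_t ways, bool lru) : slots_(ways), lru_(lru) {
        assert(ways > 0);
    }

    // Returns true on a hit. On a miss the tag is installed, evicting
    // the oldest block (by the chosen policy) if the set is full.
    bool access(std::uint32_t tag) {
        ++clock_;
        for (Slot& s : slots_) {
            if (s.valid && s.tag == tag) {
                s.used_at = clock_;
                return true;
            }
        }
        Slot* victim = &slots_[0];
        for (Slot& s : slots_) {
            if (!s.valid) { victim = &s; break; }  // free slot wins
            if (age_key(s) < age_key(*victim)) victim = &s;
        }
        victim->valid = true;
        victim->tag = tag;
        victim->loaded_at = clock_;
        victim->used_at = clock_;
        return false;
    }

private:
    std::uint64_t age_key(const Slot& s) const {
        return lru_ ? s.used_at : s.loaded_at;
    }

    std::vector<Slot> slots_;
    bool lru_;
    std::uint64_t clock_ = 0;
};
```

With two ways and the tag sequence 1, 2, 1, 3, the LRU set evicts tag 2 (tag 1 was touched more recently), while the FIFO set evicts tag 1 (it was loaded first).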

Your cache simulator should assume that loads/stores from/to the cache take one processor cycle; loads/stores from/to memory take 100 processor cycles for each 4-byte quantity that is transferred. There are plenty of things about caches in real processors that you do not have to simulate, for example write buffers or smart ways to fill cache blocks; implementing all the options above correctly is already somewhat challenging, so we’ll leave it at that.
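As a worked example of the cost model above, transferring a whole block costs 100 cycles per 4 bytes, so a 16-byte block costs 400 cycles. The sketch below uses hypothetical names; whether a miss additionally pays the 1-cycle cache access, as load_miss_cycles assumes, is one possible interpretation and not something the text settles.

```cpp
#include <cassert>

const unsigned long kCacheCycles = 1;            // one cache access
const unsigned long kMemoryCyclesPerWord = 100;  // per 4 bytes moved

// Cycles to move one whole block between memory and the cache.
unsigned long block_transfer_cycles(unsigned long block_bytes) {
    assert(block_bytes >= 4 && block_bytes % 4 == 0);
    return kMemoryCyclesPerWord * (block_bytes / 4);
}

// Assumed cost of a load miss: fetch the block, then read the cache.
unsigned long load_miss_cycles(unsigned long block_bytes) {
    return block_transfer_cycles(block_bytes) + kCacheCycles;
}
```

Under these assumptions a load miss with 16-byte blocks costs 401 cycles, and evicting a dirty block first would add another 400.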

We expect to be able to run your simulator as follows:

This would simulate a cache with 256 sets of 4 blocks each (aka a 4-way set-associative cache), with each block containing 16 bytes of memory; the cache performs write-allocate but no write-through (so it does write-back instead), and it evicts the least-recently-used block if it has to. (As an aside, note that this cache has a total size of 16384 bytes (16 kB) if we ignore the space needed for tags and other meta-information.)

After the simulation is complete, your cache simulator is expected to print the following summary information in exactly the format given below:

Total loads: 318197
Total stores: 197486
Load hits: 314798
Load misses: 3399
Store hits: 188250
Store misses: 9236
Total cycles: 9344483

Note that there may be a bug in these results. You should probably compare amongst yourselves by posting test cases on Piazza and discussing them…
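Because the summary has to match the format above exactly, it can help to build it in one place. The following is a minimal sketch; Summary and format_summary are hypothetical names, and the field labels are copied from the sample output.

```cpp
#include <sstream>
#include <string>

// Counters collected during a simulation run.
struct Summary {
    unsigned long total_loads;
    unsigned long total_stores;
    unsigned long load_hits;
    unsigned long load_misses;
    unsigned long store_hits;
    unsigned long store_misses;
    unsigned long total_cycles;
};

// Renders the counters as seven newline-terminated lines.
std::string format_summary(const Summary& s) {
    std::ostringstream out;
    out << "Total loads: " << s.total_loads << '\n'
        << "Total stores: " << s.total_stores << '\n'
        << "Load hits: " << s.load_hits << '\n'
        << "Load misses: " << s.load_misses << '\n'
        << "Store hits: " << s.store_hits << '\n'
        << "Store misses: " << s.store_misses << '\n'
        << "Total cycles: " << s.total_cycles << '\n';
    return out.str();
}
```

Writing the result to std::cout at the end of main keeps the output code separate from the simulation logic. Note that hits plus misses should always add up to the corresponding total, which makes for a cheap self-check.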

Hints

Your simulation is only concerned with hits and misses; at no point do you need the actual data that’s stored in the cache. That’s why the trace files do not contain that information in the first place.

Don’t try to implement all the options right away. Start by writing a simulator that can only run direct-mapped caches with write-through and no-write-allocate. Once you have that working, extend it step by step to make the other design parameters work. Also, sanity-check your simulator frequently with simple, hand-crafted traces for which you can still derive manually what the behavior should be.
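The suggested starting point might look roughly like the sketch below: a direct-mapped cache with write-through and no-write-allocate that only counts hits and misses (no cycle accounting). Counts and DirectMappedCache are hypothetical names, and this is an illustration of the idea rather than a reference solution.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Counts {
    unsigned long load_hits = 0, load_misses = 0;
    unsigned long store_hits = 0, store_misses = 0;
};

// Direct-mapped: exactly one block per set.
class DirectMappedCache {
public:
    DirectMappedCache(std::uint32_t num_sets, std::uint32_t block_bytes)
        : num_sets_(num_sets), block_bytes_(block_bytes),
          valid_(num_sets, false), tags_(num_sets, 0) {
        assert(num_sets > 0 && block_bytes > 0);
    }

    void load(std::uint32_t addr) {
        if (lookup(addr)) { ++counts.load_hits; return; }
        ++counts.load_misses;
        valid_[index_of(addr)] = true;  // a load miss fills the block
        tags_[index_of(addr)] = tag_of(addr);
    }

    void store(std::uint32_t addr) {
        if (lookup(addr)) ++counts.store_hits;
        else ++counts.store_misses;  // no-write-allocate: cache unchanged
    }

    Counts counts;

private:
    std::uint32_t index_of(std::uint32_t addr) const {
        return (addr / block_bytes_) % num_sets_;
    }
    std::uint32_t tag_of(std::uint32_t addr) const {
        return (addr / block_bytes_) / num_sets_;
    }
    bool lookup(std::uint32_t addr) const {
        std::uint32_t i = index_of(addr);
        return valid_[i] && tags_[i] == tag_of(addr);
    }

    std::uint32_t num_sets_;
    std::uint32_t block_bytes_;
    std::vector<bool> valid_;
    std::vector<std::uint32_t> tags_;
};
```

A hand-crafted trace for the tiny cache (1 set, 4-byte blocks) could be: load 0x10, load 0x10, store 0x10, store 0x20, load 0x10, load 0x14, load 0x10. Working it out by hand gives 2 load hits, 3 load misses, 1 store hit, and 1 store miss, which is easy to compare against the simulator.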

Problem 2: Best Cache? (10%)

For the second problem, you’ll use the memory traces as well as your simulator to determine which cache configuration has the best overall effectiveness. You should take a variety of properties into account: hit rates, miss penalties, total cache size (including overhead), etc. In your README describe in detail what experiments you ran (and why!), what results you got (and how!), and what, in your opinion, is the best cache configuration of them all.
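When comparing configurations by total size, the overhead per block can be estimated from the address split. The sketch below assumes 32-bit addresses, power-of-two parameters, one valid bit per block, and one dirty bit per block for write-back caches; it ignores any bits needed for LRU or FIFO bookkeeping. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Returns log2(n); n must be a power of two.
unsigned log2_exact(std::uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while ((n >> bits) != 1) ++bits;
    return bits;
}

// Estimated storage in bits: data + tag + valid (+ dirty) per block.
std::uint64_t total_cache_bits(std::uint32_t num_sets,
                               std::uint32_t blocks_per_set,
                               std::uint32_t block_bytes,
                               bool write_back) {
    unsigned index_bits = log2_exact(num_sets);
    unsigned offset_bits = log2_exact(block_bytes);
    assert(index_bits + offset_bits <= 32);
    unsigned tag_bits = 32 - index_bits - offset_bits;
    std::uint64_t per_block =
        8ull * block_bytes + tag_bits + 1 + (write_back ? 1 : 0);
    return per_block * num_sets * blocks_per_set;
}
```

For the 256-set, 4-way, 16-byte write-back example, each block needs 128 data bits, 20 tag bits, a valid bit, and a dirty bit, so the estimate is 150 bits times 1024 blocks, or 153600 bits (19200 bytes) compared to 16384 bytes of data alone.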

Credits

The memory traces above come from a similar programming assignment by Steven Swanson at the University of California, San Diego. Thank you Steven!

Deliverables

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no object files, no executable files, etc.), but allows building all required derived files. Make sure to include a Makefile that sets the appropriate compiler flags as detailed on Piazza and builds all programs by default.

Include a plain text README file (not README.txt or README.docx or whatnot) that briefly explains what your programs do and contains any other notes you want us to check out before grading. Your answers to written problems should be in your README file as well! Make sure to include explanatory notes and detailed derivations that tell us how you solved the problem in question (and convince us that you really did the work).

Finally, make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Grading

For reference, here is a short explanation of the grading criteria; some of the criteria don’t apply to all problems, and not all of the criteria are used on all projects.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for projects on Piazza.

Style refers to both programming and presentation style. Programming style includes things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Style also includes proper modularization of your code (into functions, modules, etc.), proper use of static and extern, etc. Simple, clean, readable code is what you should be aiming for. For C (and, if allowed, C++) programs, make sure you follow the style guide posted on Piazza! Presentation style refers to your README file and (possibly!) your PDF files for diagrams. Your presentation should be clear, structured problem-by-problem, broken into sections (and paragraphs!) as appropriate. Lines should be at most 80 characters in length, broken by UNIX linefeeds. (You may use Markdown format if you so choose, but everything must still be perfectly readable without rendering Markdown to another format.) Diagrams should be clearly labeled, cleanly laid out, and generally a pleasure to look at.

Performance refers to how fast/with how little memory your programs or circuits can produce the required results compared to other submissions.

Functionality refers to your programs or circuits being able to do what they should according to the specification given above; if the specification is ambiguous, ask for clarification! (It also refers to you simply doing the required work, beyond programming or circuit design!)

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using the required compiler options given on Piazza we will take off 10% (except if you document a very good reason). If your programs cannot be built using make we will take off 10%. If valgrind detects memory errors in your programs, we will take off 10%. If your programs fail miserably even once, i.e. terminate with an exception of any kind or dump core, we will take off 10% (for each such case).