On this page:
Starting Implementation
Assumptions
Evaluation
Decoding Machine Code
Tips
Test Cases and Test Harness

Linking Assignment

Due: Wednesday, October 25, 11:59pm

This assignment is about understanding how the ELF format supports run-time linking for shared libraries on x86_64, including the way that references to global variables and functions are implemented and represented. Your task is implement redact, which copies an ELF-format shared library and

For example, given a shared library that is the compiled form of

  int a1[3] = { 1, 2, 3 };

  int a2[3] = { 4, 5, 6 };

  

  int f1(int v) {

    if (v > 0)

      return a1[0];

    else

      return a2[0];

  }

  

  int f0(int u) {

    if (u == 1)

      return f1(u);

    else

      return 5;

  }

  

  int g(int u) {

    return a1[2];

  }

then the shared library has the following initial behaviors:

A copy of the shared library produced by redact, however, will have the following behaviors:

Start with linklab-handout.zip.

Starting Implementation

The starting code "redact.c" handles copying a shared-library file to a destination file based on the command-line arguments. The redact function—which doesn’t do anything in the starting code—receives a pointer to the in-memory copy of the destination file. Your job is to fill in the implementation of redact so that it changes the destination file.

To determine function and variable information, your redact function will read the destination ELF file as it exists in memory. To replace certain variable accesses and function calls with a crash, your redact function will modify the in-memory copy of the ELF file. Through the magic of of mmap (which we will cover later in the semester), those in-memory changes will be reflected in the destination file.

The initial "redact.c" includes a get_secrecy_level function that takes a string for a function or variable name and returns an integer for the secrecy level that is embedded in the name. A secrecy level is computed by converting a sequence of digits at the end of the name string. For example, "f13" has secrecy level 13, while "g" has secrecy level 0 (since it doesn’t end in any digits). Although you’re allowed to modify anything in "redact.c", you shouldn’t have to modify get_secrecy_level.

A key step in the implementation of redact is to decode the machine code that implements a function. Since decoding x86_64 machine code is tedious, the starting implementation includes "decode.c" and "decode.h", where "decode.c" implements a decode function. The decode function is sufficient for functions in the shared libraries that we will use to test your implementation. In addition, "decode.c" provides replace_with_crash, which can replace an instruction with one to provoke a crash through a trace/breakpoint interrupt. See Decoding Machine Code for more information on using decode and replace_with_crash.

Your submitted solution will take the form of a single C file, "redact.c", which must be implemented in ANSI C. You must not modify "decode.c" or "decode.h", since you submit only your "redact.c" file. We’ll compile and link redact.c in the same way as in "Makefile", so it must be self-contained other than its uses of "decode.h", "decode.c", "elf.h", limited Unix headers and libraries for open and mmap, and ANSI C headers and libraries.

Assumptions

To simplify the your task, you can make several assumptions:

Evaluation

To earn 100 points, your implementation must correctly replace each reference or call to a higher-secrecy variable or function with a trace/breakpoint interrupt.

You can earn up to 80 points by not handling function calls, but still replacing references to higher-secrecy variables with a trace/breakpoint interrupt.

When grading, we will infer a level of completion based on your program’s behavior on the tests that are included with linklab-handout.zip (see Test Cases and Test Harness). We will then scale your grade based on the number of our test cases that pass.

Your redact can print out any information that you find useful. Our tests will look only at the shared library produced by redact.

Decoding Machine Code

The decode function provided by "decode.c" has the prototype

  void decode(instruction_t *ins, code_t *code_ptr, Elf64_Addr code_addr);

where code_ptr refers to the machine code to decode, and code_addr is the address where that code will reside at run time. (The code_addr must be provided so that decode can tell the destination of relative references in the machine code.) The type code_t is an alias for unsigned char. Use code_t* for a pointer to machine code, so that pointer arithmetic works in terms of bytes, since decode reports an instruction length in bytes.

The ins argument receives the result of decoding one instruction, so pass the address of a locally declared instruction_t to decode. The instruction_t struct type has three fields: op, length, and addr. The decode function always fills in ins->op and ins->length, and it fills in ins->addr for some values of ins->op.

The possible values for op are (as defined in "decode.h"):

For all operations, the ins->length field is set to the length of the machine-code instruction in bytes, so code_ptr + ins->length is a pointer to the next instruction, if any. The assumptions allowed for redact means that both branches of a conditional (i.e., the target of a MAYBE_JMP_TO_ADDR_OP instruction and the immediately following instruction) can be explored by using a recursive call to an instruction-traversal procedure for one or both of the branches.

If decode does not recognize the content at code_ptr as a machine-code instruction, it reports an error and exits. We will use test inputs where public functions conform to the constraints of decode, so you do not need to handle machine code that decode does not recognize.

The "decode.c" library also provides replace_with_crash:

  void replace_with_crash(code_t *code_ptr, instruction_t *ins);

Pass code_ptr for an instruction that you want to replace with a trace/breakpoint interrupt. Also pass ins as filled with information about the existing instruction, so that replace_with_crash knows how many bytes to replace (filling with “no-op” instruction bytes as necessary to fully replace the original instruction).

Tips

The information about ELF that you need to complete this assignment is mostly covered by Videos: ELF. The map slides should be helpful to find your way through a file; see also a video explaining the map. More generally, you can use "/usr/include/elf.h" as a reference to find relevant structures, fields, and macros. You might also consult any number of other ELF references on the web.

Use readelf to get a human-readable form of the content of an ELF file to get an idea of what your program should find. Use objdump -d to disassemble functions to get an idea of what your program will recognize using decode and to compare your output shared library to the input shared library.

All of the information that you need from the ELF file can be found via section headers and section content, so you will not need to use program headers. You’ll need to consult the .dynsym, .dynstr, .rela.dyn, .plt, and .rela.plt sections, at least.

When working with ELF content, you have to keep track of which references are in terms of file offsets and which are in terms of (tentative) run-time addresses where the library will eventually run. When working with ELF content that is mmapped into memory (as in the starting "redact.c"), then you have one more way of referencing things: a pointer in memory at present. Be careful to keep in mind which kind of reference you have at any time. Again, the maps should help.

Don’t confuse “symbol” with “string,” and keep in mind that they are referenced in different ways. Symbols are referenced by an index that corresponds to the symbol’s position in an array of symbols. Strings are referenced by an offset in bytes within the relevant string section. Every symbol has a string for its name.

Take small steps. Start out by printing the symbol index for every function that is provided by the shared library. Then, print the name instead of the symbol index. Then, print the address where each function’s implementation is found, and so on. Make redaction of variables work before attempting to implement redaction of function calls.

Avoid parsing ELF information into your own set of data structures. An ELF file in memory can be viewed as a data structure already, based on the types that "elf.h" defines. Following different kinds of references is not always trivial, but it’s not difficult, and you can write little functions to make access more convenient and readable. The task and examples are small enough that there’s no need for extra tables or data structures to speed up searches.

As a lower bound, a complete solution can fit in about 250 lines of C code, including the 120 lines in the provided in the starting "redact.c". Depending on whitespace (fine), comments (good), error checking (commendable), and duplicated code instead abstracting into a function (bad), many solutions will be the range of 300-400 lines.

Test Cases and Test Harness

The archive linklab-handout.zip provides a "Makefile" where the default target builds

For example, after running make with the initial files, you can check that calling "ex1.so"’s f0 with the argument 0 produces the result 5 and with the argument 1 produces the result 1:

  $ call ex1.so f0 0

  5

  $ call ex1.so f0 1

  1

After you have completed redact for redacting variable references, then

  $ redact ex1.so dest.so

  $ call dest.so f0 0

  5

  $ call dest.so f0 1

  Trace/breakpoint trap (core dumped)

Alternatively, you can use objdump -d dest.so and compare to objdump -d ex1.so to check whether your redact has replaced an access of a variable address with an int $0x3 instruction that triggers a trace/breakpoint interrupt.

The test makefile target runs your redact on the "exn.so" shared libraries (overwriting "dest.so" for each attempt) and checks that the result behaves as predicted at the top of the "exn.c" source.

You are not required to use the test files, but for grading purposes, we expect your redact to work correctly on "ex1.c" for a chance at more than 80 points (because "ex1.c" checks only variable references). For grading, we will run your program on additional shared-library files. We may also compile your program with different optimization or debugging options; as always, your program must build with gcc on CADE machines with no language-adjusting command-line flags.