Linker Basics

CS 301 Lecture, Dr. Lawlor, 2005/12/05

So you've got some C/C++ program files. You want to make them into an executable. How does this happen?

Step 1 is to compile the source code. If you've got lots of different source files, you want to build each one not into an entire program, but into a little piece called an "object file". Object files consist of compiled machine code, plus special hooks that allow them to be combined with other object files into a single executable.

On Linux, object files have extension ".o", and you make them with the "-c" flag. So this will create "foo.o" from "foo.c":

	gcc foo.c -c

On Windows, object files have extension ".obj", and you make them with the "/c" flag. So this creates "foo.obj" from "foo.c":

	cl foo.c /c

Step 2 is optional. If you've got a zillion object files that are related, you can put them together into a "library". For now, we'll only look at statically linked libraries, not dynamically-linked libraries (DLLs).

On Linux, static library files have extension ".a", and you make them with the "ar cr" tool. So this will create "foo.a" from "foo.o" (list more object files at the end to include them too):

	ar cr foo.a foo.o

On Windows, static library files have extension ".lib", and you make them with the "link /lib" tool. So this creates "foo.lib" from "foo.obj":

	link /lib /out:foo.lib foo.obj

Step 3 is to combine all your object files and libraries into a single executable. This step is called "linking".

On Linux, executables have no filename extension. You specify the executable name with the "-o" flag, so this links "bar.o" and "foo.a" into "bar" (you can also build executables yourself with "ld", but it's trickier):

	gcc -o bar bar.o foo.a

On Windows, executables have extension ".exe", and you specify the executable name with the "/Fe" flag (the older "/o" flag is deprecated):

	cl /Febar.exe bar.obj foo.lib

Often, you don't call these programs yourself. Instead, you let the IDE (e.g., MS Visual C++) call them for you. Or you write a "Makefile" and let "make" call the programs needed.

Stupid bugs in the linker

If you accidentally define the same subroutine name in two object files, the linker will complain about "multiply defined symbols". This is good, because it lets you catch and fix your error.

If you accidentally define the same subroutine name in two library files, the linker takes the definition from the file listed first on the command line! Any subsequent definitions of that subroutine are ignored; any subsequent uses of that subroutine find the first subroutine. This is horrible, because it's unlikely that two subroutines named "doit" are interchangeable just because the names are the same!

If code inside an object file calls a subroutine, the linker will search everywhere for that subroutine. If code inside a library file calls a subroutine, the linker only searches that library and subsequent libraries on the command line! For example, "gcc my.o foo.a bar.a" errors out if bar.a requires anything from foo.a beyond what my.o uses. This is stupid, because the linker is perfectly capable of searching foo.a again; it just doesn't want to. If two libraries each depend on routines in the other, you may have to list them several times on the command line, like "gcc my.o foo.a bar.a foo.a". That second foo.a picks up the things in foo that bar needs.

(These bugs are present in both the UNIX and Windows linkers. Some code actually depends on these bugs to function.)

The problem here is that writing a library name on the link line is just shorthand for a whole set of object files. As it walks the list of libraries, the linker uses a simple pruning algorithm to decide which object files it can ignore--if nobody seen so far still needs a subroutine (or other symbol) listed in the object file, the object file is permanently ignored.

Generally speaking, you've got to be very careful to manage dependencies between libraries, and careful with the order things are listed on the link line.

Guts of object/executable files:

There are lots of different things inside an object file or executable (see page 543 of the textbook for a complete list):

Machine code, like compiled subroutines. These are always stored in a section called ".text", which is silly because they aren't text!
Read-only data, like strings, tables, and other constants. These are stored in a section called ".rodata".
Read-write initialized data, like something declared as "static int arr[3]={4,7,9};". These are in a section called just ".data".
Read-write uninitialized data, like most global variables. These don't need to be stored (since they're initialized to zero), but they're still listed in a section called ".bss", another silly historical name (the book suggests the mnemonic "Better Save Space").

You can look at these things inside an object or executable file using the GNU/Linux tool objdump:

    objdump -hdrC foo.o

or on Windows using the Microsoft tool dumpbin:

    dumpbin /disasm foo.obj

(both objdump and dumpbin have zillions of additional parameters and options.)

When you're writing C, the compiler is smart enough to put everything into the right places. But when you're writing assembly (especially when writing a standalone .S assembly source file), you often have to explicitly say:

	.section ".text"

before writing assembler instructions, or

	.section ".rodata"

before defining read-only strings or tables.

For example, you can write a string constant in GNU assembler with ".ascii", although you've got to be careful to add the terminating 0!

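.section ".rodata"
my_str:
	.ascii "Wazzup?\n\0"	# explicit \0 terminator; .ascii does not add one

.section ".text"
foo:
	push $my_str	# 32-bit x86: pass the string's address as printf's argument
	call printf
	pop %eax	# remove the argument from the stack again
	ret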