C++ Compilation Process, Resolving of Dependencies, and Project Files

By Justin Poirier

This document will explain the concepts of resolving dependencies in C++ code, the C++ compilation and linking process, and C++ project files. These three topics are best explained together, since each depends on the other two.

Contents of a compiled program

When a fully compiled C++ program is loaded into memory, its memory block consists of binary data for the instruction code, the data used, the stack, and so on. The CPU is aware at all times of the address of the next instruction, and never runs into the addresses reserved for the data, stack, etc., thanks to special instructions that redirect it as necessary. Each function call or other such redirection in the source code will have been converted to instructions that transfer execution to the address of the machine code generated for the destination. This is complicated by the fact that we don't know the exact address of the destination, since we don't know where the program will reside in memory. On modern systems with virtual memory, we know even less about which locations in physical memory the program will occupy, because all addresses are expressed as locations in an imaginary address space, and each is converted to its corresponding address in real memory just before use. Usually each process has its own virtual address space. The CPU's memory management unit (MMU) and the operating system take care of converting virtual addresses to real addresses, allowing us to think of a process's virtual memory as if it were real. But even within this virtual memory we don't know where instructions really are, because we don't know where the code segment begins (for various reasons, including Address Space Layout Randomization).

For these reasons, systems with and without virtual memory both express addresses as offsets relative to the starting point of the appropriate segment. More precisely, the offset is relative to the segment base, which may not be the literal starting address. For example, on 8086 CPUs, 20-bit addresses had to be represented using 16-bit values in order to fit in registers. Segment bases could only fall on multiples of 16, so the low 4 bits of each base were always 0 and did not need to be stored.
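The 8086 arithmetic described above can be sketched in a few lines. This is only an illustration of the address calculation; the function name physical_address is an assumption made for this example and is not part of any real API.

```cpp
#include <cstdint>

// Real-mode 8086 address translation (illustrative sketch).
// The 16-bit segment value is the segment base divided by 16, so the
// 20-bit base is recovered by shifting the segment left by 4 bits.
std::uint32_t physical_address(std::uint16_t segment, std::uint16_t offset) {
    std::uint32_t base = static_cast<std::uint32_t>(segment) << 4;  // segment * 16
    // The 8086 has 20 address lines, so the sum wraps around at 1 MiB.
    return (base + offset) & 0xFFFFFu;
}
```

For instance, segment 0x1234 with offset 0x0010 gives a base of 0x12340 and a physical address of 0x12350.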

Resolving dependencies

Compilation of a C++ program is done one file at a time (we will later discuss how the calling of code in a certain file from code in another file is implemented). In order for a file to be compiled, the code has to follow certain rules governing its order. There are three basic cases which necessitate this structure.

  1. a line of code might attempt to create a variable of a type that has not been defined yet
  2. a line of code might attempt to create a variable that is a pointer to a type that has not been defined or declared yet
  3. a call might be made to a function that has not been defined or declared yet

In trivial cases the solution to such problems is to move the definition of the class or function (or the entire class containing the function) that is undefined at the point of the problematic code, up in the document so that it comes before the problematic code. However, in a program with multiple classes, this solution ceases to work due to the network of dependencies concerning which classes need the compiler to know about which other classes in order to be defined. Moving a definition up in the code might resolve one problem but cause others, if the definition also moves ahead of other classes that it was meant to stay behind. This would occur in the following example, in a program of only two classes.

class A {
	B * bMemberVariable;	// error: B has not been declared yet
};
class B {
	A * aMemberVariable;
};

In the above code, A contains a pointer to an object of class B, and B contains a pointer to an object of class A. The violation here is of type 2; B has not been declared or defined yet at the point where a pointer of type B* is declared to be a member of A. Moving B's definition above that of A won't work here, since B has as a member variable a pointer of type A*. This would cause another violation of the same type. Clearly we need solutions other than moving entire definitions around, and a set of rules defining when to use each type of solution to resolve a program's entire network of dependencies. We will argue that for any sensible collection of classes, all dependencies can be accounted for.

In the above example, the answer is to place only a declaration of B (a forward declaration, written "class B;") above A, leaving the definition of B where it is. In C++ a pointer to a type can be created as soon as that type has been declared; the definition can appear later. This solution can be used for any violation of type 2, and never causes another violation to occur. If there is a group of more than two classes linked by member variables pointing to other classes in the group, the declarations for all classes in the group can simply be placed earlier in the code than any of the classes' actual definitions, and there will be no type 2 violations.
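Applied to the two-class example, the fix might look like the sketch below. The public access specifiers and the nullptr initializers are additions made here so that the snippet can be exercised from outside the classes; they are not part of the original example.

```cpp
class B;  // declaration only: tells the compiler that B names a class

class A {
public:
    B* bMemberVariable = nullptr;  // a pointer to a declared but undefined type is allowed
};

class B {
public:
    A* aMemberVariable = nullptr;  // A is already fully defined at this point
};
```

With this ordering, an A object and a B object can be created and pointed at each other without either definition having to move.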

Violations of type 3 can also be resolved without changing the order of class definitions. For stand-alone functions, the order of class definitions is not affected by function order, and these violations can be resolved by placing a declaration of the called function somewhere above each calling function that must come before it in the code. For groups of stand-alone functions linked by calls, the declarations of all functions in the group can simply be placed before any of the definitions. For functions that are members of classes, the case to resolve is a call from a function in a class which must appear before the called function's class. (Calls between members of the same class do not cause a violation: C++ compiles member function bodies written inside a class as if they appeared just after the end of the class definition, so they can use members declared later in the same class.) The solution is to move the definition of the calling function so that it comes later than the involved class(es), leaving only a declaration where it previously was. This way, the called function will already have been declared or defined when the call occurs in the code. The C++ notation used to place a function definition outside of the definition of its class is:

return-type class-name::function-name (parameters) { function contents }

For groups of member functions linked by calls, the definitions of all in the group can be moved below the last of the definitions of the classes involved, and no code will call an undeclared function.
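As a rough sketch of this technique, consider two classes whose member functions call each other. The classes Even and Odd and their check functions are hypothetical names invented for this example. Each class holds only a declaration of its function, and both bodies are written after the last class definition.

```cpp
class Odd;  // declared so that Even can mention it in a parameter list

class Even {
public:
    // Declaration only; the body calls a member of Odd, which is not known yet.
    bool check(unsigned n, const Odd& odd) const;
};

class Odd {
public:
    bool check(unsigned n, const Even& even) const;
};

// Both classes are complete from here on, so either body may call the other.
bool Even::check(unsigned n, const Odd& odd) const {
    return n == 0 ? true : odd.check(n - 1, *this);
}

bool Odd::check(unsigned n, const Even& even) const {
    return n == 0 ? false : even.check(n - 1, *this);
}
```

Had the body of Even::check been written inside the definition of Even, the call to odd.check would have been rejected, because Odd would still be an incomplete type at that point.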

Violations of type 1 can be resolved by changing the order of class definitions. When a class contains a member variable with the type of another class, the definition of the other class must appear before that of the first class, so that the compiler will know how much memory to allot for it when calculating the size of instances of the first class. We must consider the question of whether placing a certain class before another class could create a similar violation in the class being moved, or in an intermediate class. This would happen if the class being moved also contained a variable of the type of the class it's being placed in front of, or of a third class which in turn contained a variable of this class, or a similar situation with an arbitrary number of classes in the chain. It should be obvious that this is not a sensible or intuitive class structure, since an instance of any of the involved classes would be an infinitely large structure and its constructor would never finish executing. C++ compilers reject such a structure, because at least one of the classes is still incomplete at the point where a member of its type is declared.
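The ordering rule for by-value members can be illustrated with a small sketch. The classes Engine and Car, their members, and the two helper functions are hypothetical and exist only for this example.

```cpp
#include <cstddef>

// Car stores an Engine by value, so Engine must be defined first.
class Engine {
public:
    int cylinders = 4;
    double displacementLitres = 2.0;
};

class Car {
public:
    Engine engine;  // the full definition of Engine is needed to compute sizeof(Car)
    int doors = 4;
};

// A Car physically contains an Engine, so it can never be smaller than one.
static_assert(sizeof(Car) >= sizeof(Engine), "Car must have room for its Engine");

// Small helpers that expose the sizes at run time.
std::size_t car_size() { return sizeof(Car); }
std::size_t engine_size() { return sizeof(Engine); }
```

Swapping the two definitions would make the line declaring the engine member fail to compile, and a forward declaration of Engine would not help, since a declaration alone says nothing about the size of the type.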

We have shown that violations of types 2 and 3 can be resolved without changing the order of class definitions, and that violations of type 1 can be resolved by rearranging class definitions. Therefore a general solution can be conceived for structuring any program without violations: classes are ordered purely so that each class comes before any class that contains its type as a member, and dependencies regarding pointer members and function calls are resolved as described above, without affecting the order of class definitions. This is how C++ programs are structured in common practice.

Since resolving dependencies regarding function calls involves moving function definitions to a later point than any of the involved class definitions, in this system function definitions tend to migrate to the bottom of a program, with class definitions and declarations all together in a block at the top.
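A minimal sketch of the resulting layout is shown below, assuming a made-up pair of classes (Price and Order) and a made-up stand-alone function (applyTax). The tax factor of 1.25 is an arbitrary value chosen for the illustration.

```cpp
// 1. Declarations of stand-alone functions.
double applyTax(double amount);

// 2. Class definitions, ordered so that by-value members are defined first.
class Price {
public:
    explicit Price(double amount) : amount_(amount) {}
    double withTax() const;  // declared here, defined at the bottom
private:
    double amount_;
};

class Order {
public:
    Order(double unitAmount, int quantity)
        : unitPrice_(unitAmount), quantity_(quantity) {}
    double total() const;    // declared here, defined at the bottom
private:
    Price unitPrice_;        // by-value member: Price had to be defined above
    int quantity_;
};

// 3. Function definitions, after every class they mention is complete.
double applyTax(double amount) { return amount * 1.25; }

double Price::withTax() const { return applyTax(amount_); }

double Order::total() const { return unitPrice_.withTax() * quantity_; }
```

Only the relative order of Price and Order is forced by the language here; the three function definitions at the bottom could appear in any order.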

C++ Project Files and the Linking Stage

Before a C++ program is compiled, the pre-processor inserts the contents of .h files into the .cpp files in which they are included. It is only the .cpp files (including inserted contents of .h files) that get compiled (except in the case of precompiled header files). Each .cpp file is compiled separately, into an object file. For this reason it is expected that each .cpp file resolve all dependencies as described above.

This independent compilation of .cpp files is complicated because, as described above, calls to a function are replaced with execution-transferring machine code instructions. However, the compiler has no way of knowing what offset within the code segment a destination instruction will have, since it does not know how much code will come before it in the code segment: code for functions belonging to types defined in other .cpp files, or even for functions declared in the present .cpp file but defined elsewhere. For this reason, the compiler leaves this information incomplete. Another program, called a linker, is used to combine all the object files, reconciling the offsets of all destination points with each other in a way that works.
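The situation can be pictured with the sketch below. The file names caller.cpp and math.cpp and the functions add and compute are hypothetical. In a real project the two parts would be separate files compiled into separate object files; they are shown in one listing here only so that the example is self-contained.

```cpp
// ---- caller.cpp (hypothetical file) ----
int add(int a, int b);  // declaration only: the body lives in another file

int compute() {
    // The compiler can check this call against the declaration, but it cannot
    // know the final address of add, so the object file records an unresolved
    // reference for the linker to fill in.
    return add(2, 3);
}

// ---- math.cpp (hypothetical file, normally compiled separately) ----
int add(int a, int b) { return a + b; }
```

If math.cpp were left out of the link step, caller.cpp would still compile on its own, and the error would only appear when the linker failed to find a definition of add.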

This system, in which .cpp files get compiled and .h files merely store code that can be inserted, gives us a mechanism for inserting the same code in multiple .cpp files. While this causes the types and functions described by the code to be defined or declared multiple times, the redundancy does not cause problems, provided the header is written with care. Class definitions, and functions defined inside a class (which are implicitly inline), may appear in multiple object files; linkers are able to identify such definitions and only create the function once. An error occurs when a compiler sees a definition twice within one .cpp file, or when the linker finds an ordinary (non-inline) function defined in more than one object file. The benefit is that the code in each .cpp file will be able to use the types defined, and functions declared, in the .h file. It is common practice to make .h files contain only class definitions and declarations, with actual function contents in .cpp files. This is made possible by the fact that, as mentioned earlier in our discussion of resolving dependencies, class definitions and declarations are commonly all together in source code, before actual definitions of functions begin. The class definitions and declarations can be placed in a project's header files and, because include statements are typically at the top of .cpp files, will always wind up before the function definitions in the .cpp file just prior to compiling.
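A sketch of this header and source split follows, using a hypothetical Counter class and hypothetical file names counter.h and counter.cpp. The include guard (the #ifndef, #define and #endif lines) is a common convention, added here as an assumption, that prevents the class from being defined twice if the header is inserted into the same .cpp file more than once. Both files are shown in a single listing so that the example is self-contained.

```cpp
// ---- counter.h (hypothetical header) ----
#ifndef COUNTER_H
#define COUNTER_H

// Class definition with member function declarations only.
class Counter {
public:
    void increment();
    int value() const;
private:
    int count_ = 0;
};

#endif  // COUNTER_H

// ---- counter.cpp (hypothetical source file) ----
// In a real project this file would begin with: #include "counter.h"
// The preprocessor would then paste the header text above at that point.

void Counter::increment() { ++count_; }

int Counter::value() const { return count_; }
```

Any other .cpp file that includes counter.h can create and use Counter objects, while the function bodies are compiled exactly once, in counter.cpp.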