

\chapter{Intercepting Function Calls}\label{ch:intercepting-function-calls}

This chapter discusses the steps taken in this work to intercept function calls.
An example of what the resulting interception looks like may be found in Section~\ref{sec:intercepting-example}.
Furthermore, an overview of how to test given programs is presented in Section~\ref{sec:automated-testing-on-intercepted-function-calls}.
How these function calls may be manipulated is discussed in Chapter~\ref{ch:manipulating-function-calls}.


\section{Identified Methods for Intercepting Function and System Calls}\label{sec:methods-for-intercepting}

First, one has to answer the question of \textit{how exactly} to intercept function or system calls.
At the beginning of this work, it was not yet determined whether the interception of function calls, system calls, or both should be used to achieve the overarching goal (see Section~\ref{sec:motivation-and-goal}).
This first section lists the relevant methods for intercepting function or system calls but does not claim to be exhaustive.
The order of the following subsections roughly follows the thought process of finding the most appropriate method for this work.


\subsection{\texttt{ptrace} System Call}\label{subsec:ptrace}

The first result one encounters when researching how to intercept system calls on Linux is the \texttt{ptrace} (``process trace'') system call.
This system call allows one process to observe and control the execution of another process (including memory and registers).
The control is handed from the traced process to the tracing process each time any signal is delivered.
\cite{ptrace.2}

A command that makes use of this system call already exists (see Subsection~\ref{subsec:strace}).


\subsection{\texttt{strace} Command}\label{subsec:strace}

The \texttt{strace} (``system call/signal trace'') command may be used to run a specified command and to thereby intercept and record the system calls which are made.
Each system call is recorded as a line and either written to the standard error output or a specified file.
\cite{strace.1}

Listings~\ref{lst:main.c} and~\ref{lst:strace} give a simple example of what this output looks like.
It is clearly visible that only (``pure'') system calls are recorded, and calls to library functions (like \texttt{malloc} or \texttt{free}) do not appear.
Also note that arguments to the calls are displayed in a ``pretty'' way.
For example, string arguments would be simple pointers, but \texttt{strace} displays them as C-like strings.

\begin{listing}[htbp]
  \inputminted[linenos]{c}{src/listings/main.c}
  \caption{Contents of \texttt{main.c}.}
  \label{lst:main.c}
\end{listing}

\begin{listing}[htbp]
  \begin{minted}{text}
execve("./main", ["./main"], 0x7ffd63b32bb0 /* 71 vars */) = 0
[-- 32 lines omitted --]
write(1, "Hello World!\n", 13)          = 13
write(1, "String: Abc123\n", 15)        = 15
exit_group(0)                           = ?
+++ exited with 0 +++
  \end{minted}
  \caption{Output of \texttt{strace ./main}.}
  \label{lst:strace}
\end{listing}

This approach works well for debugging and other use cases, but intercepting system calls alone does not satisfy the requirements of this work.


\subsection{\texttt{ltrace} Command}\label{subsec:ltrace}

The \texttt{ltrace} (``library call trace'') command may be used to trace dynamic library calls instead of system calls.
It works similarly to \texttt{strace} (see Subsection~\ref{subsec:strace}).
\cite{ltrace.1}

Listings~\ref{lst:main.c} and~\ref{lst:ltrace} illustrate what the output of \texttt{ltrace} looks like.
In contrast to the output of \texttt{strace}, only ``real'' calls to library functions are now included in the output.
Therefore, a lot less ``noise'' is generated (see omitted lines in Listing~\ref{lst:strace}).
Again, the function arguments are displayed in a ``pretty'' way.
This command uses function prototypes read from configuration files~\cite{ltrace.conf.5} to format function arguments.

\begin{listing}[htbp]
  \begin{minted}{text}
malloc(10)                                     = 0x55624164b2a0
printf("Hello World!\nString: %s\n", "Abc123") = 28
free(0x55624164b2a0)                           = <void>
+++ exited (status 0) +++
  \end{minted}
  \caption{Output of \texttt{ltrace ./main}.}
  \label{lst:ltrace}
\end{listing}

This method fits the requirements for this work a lot better than \texttt{strace} (see Subsection~\ref{subsec:strace}),
but it is not very flexible and offers no means to modify the intercepted function calls.

\subsection{Kernel Module}\label{subsec:kernel-module}

Another possibility to intercept system calls is to intercept them directly in the kernel via a kernel module.
However, this work did not explore this approach further due to time constraints and other, better-fitting alternatives.
See~\cite[Section~7.2]{netsectools2005} for more details on how to intercept system calls using kernel modules.


\subsection{Wrapper Functions in gcc}\label{subsec:wrapper-functions}

A different approach to intercepting function calls is to tell the compiler directly which functions should be intercepted.
The compiler, and the linker respectively, then directly link calls to the specified functions to wrapper functions.
(See Subsection~\ref{subsec:preloading} for more details.)

The default linker \texttt{ld} includes such a feature.
See the ld(1) Linux manual page~\cite[Section OPTIONS]{ld.1}:

\begin{quote}
  \begin{description}
    \item[\texttt{-{}-wrap=\textit{symbol}}]
    Use a wrapper function for \texttt{\textit{symbol}}.
    Any undefined reference to \texttt{\textit{symbol}} will be resolved to \texttt{\_\_wrap\_\textit{symbol}}.
    Any undefined reference to \texttt{\_\_real\_\textit{symbol}} will be resolved to \texttt{\textit{symbol}}.

    This can be used to provide a wrapper for a system function.
    The wrapper function should be called \texttt{\_\_wrap\_\textit{symbol}}.
    If it wishes to call the system function, it should call \texttt{\_\_real\_\textit{symbol}}.
    \lbrack\dots\rbrack
  \end{description}
\end{quote}

The gcc compiler also supports this by allowing options to be passed to the linker.
See the gcc(1) Linux manual page~\cite[Section OPTIONS]{gcc.1}:

\begin{quote}
  \begin{description}
    \item[\texttt{-Wl,\textit{option}}]
    Pass \texttt{\textit{option}} as an option to the linker.
    If \texttt{\textit{option}} contains commas, it is split into multiple options at the commas.
    You can use this syntax to pass an argument to the option.
    For example, \texttt{-Wl,-Map,output.map} passes \texttt{-Map output.map} to the linker.
    When using the GNU linker, you can also get the same effect with \texttt{-Wl,-Map=output.map}.
    \lbrack\dots\rbrack
  \end{description}
\end{quote}

This means, by specifying \texttt{-Wl,-{}-wrap=\textit{symbol}} when compiling using gcc,
all calls from the currently compiled program to \texttt{\textit{symbol}} are redirected to \texttt{\_\_wrap\_\textit{symbol}}.
To call the real function inside the wrapper, \texttt{\_\_real\_\textit{symbol}} may be used.
Listings~\ref{lst:wrap.c} and~\ref{lst:wrap} illustrate this by overriding the \texttt{malloc} function of the C standard library.

\begin{listing}[htbp]
  \inputminted[linenos]{c}{src/listings/wrap.c}
  \caption{Contents of \texttt{wrap.c}.}
  \label{lst:wrap.c}
\end{listing}

\begin{listing}[htbp]
  \begin{minted}{shell}
gcc -o main_wrapped main.c wrap.c -Wl,--wrap=malloc
./main_wrapped
  \end{minted}
  \caption{Compile \texttt{main.c} and \texttt{wrap.c} and run the resulting program.}
  \label{lst:wrap}
\end{listing}

This approach allows wrapping any function in a relatively clean way.
However, it is not possible to override functions in an arbitrary, already linked binary.
The program has to be re-compiled (or at least re-linked) to use this feature of ld.
Therefore, the source code (or the corresponding \texttt{*.o} object files) needs to be available.
Note that only calls from the re-linked code are redirected; calls from other libraries are not.

Theoretically, it should be possible to re-link a given binary without having access to its source code.
However, because more straightforward methods exist (see Subsection~\ref{subsec:preloading}), this has not been investigated further.


\subsection{Preloading using \texttt{LD\_PRELOAD}}\label{subsec:preloading}

To execute dynamically linked binaries on Linux systems, a dynamic linker is needed at runtime.
(Binaries that were statically linked at compile time do not need one.)
Usually, \texttt{ld.so} and \texttt{ld-linux.so} are used as dynamic linkers.
They find and load the shared objects (shared libraries) needed by a program, prepare the program and finally run it.
\cite{ld.so.8}

As the overwhelming majority of programs are dynamically linked,
most function calls to other libraries (such as the C standard library) reference a shared object, which has to be loaded by the linker at runtime.
Therefore, it is possible to ``hijack'' (or intercept) these function calls
if the linker allows loading other functions in place of the proper ones.

Luckily, \texttt{ld.so} allows this so-called ``preloading''.
See the ld.so(8) Linux manual page~\cite[Section ENVIRONMENT]{ld.so.8}:

\begin{quote}
  \begin{description}
    \item[\texttt{LD\_PRELOAD}]
    A list of additional, user-specified, ELF shared objects to be loaded before all others.
    This feature can be used to selectively override functions in other shared objects.
    \lbrack\dots\rbrack
  \end{description}
\end{quote}

This means, by setting the environment variable \texttt{LD\_PRELOAD}, it is possible to override specific functions.
Listings~\ref{lst:preload.c} and~\ref{lst:preload} illustrate this by overriding the \texttt{malloc} function of the C standard library.

\begin{listing}[htbp]
  \inputminted[linenos]{c}{src/listings/preload.c}
  \caption{Contents of \texttt{preload.c}.}
  \label{lst:preload.c}
\end{listing}

\begin{listing}[htbp]
  \begin{minted}{shell}
# ./main is already compiled and ready
gcc -shared -fPIC -o preload.so preload.c
LD_PRELOAD="$(pwd)/preload.so" ./main
  \end{minted}
  \caption{Compile \texttt{preload.c} and run a program with \texttt{LD\_PRELOAD}.}
  \label{lst:preload}
\end{listing}

The function \texttt{dlsym} is used to retrieve the original address of the \texttt{malloc} function.
\texttt{RTLD\_NEXT} instructs \texttt{dlsym} to find the next occurrence of \texttt{malloc} in the search order after the current object.
\cite{dlsym.3}

By using this method, it is possible to override, and therefore wrap, any function as long as the targeted binary was not statically linked.
However, one must be aware that not only function calls inside the targeted binary, but also calls inside other libraries (e.g., \texttt{malloc}) are redirected to the overriding function.


\subsection{Conclusion}\label{subsec:methods-for-intercepting-conclusion}

During the research on different approaches to intercepting system and function calls,
it has been found that the most reliable way to achieve the goals of this work (see Section~\ref{sec:motivation-and-goal}) is to intercept function calls instead of system calls.
This is because, as long as the programs under test are dynamically linked, intercepting function calls captures many more calls and does so in a more flexible way.
Therefore, from now on this work only considers function calls and does not consider system calls directly.

In this work, preloading (see Subsection~\ref{subsec:preloading}) was chosen
because it is simple to use (``clean'' source code, easy to compile and run programs with it) and allows arbitrary code to be executed whenever an intercepted function call is redirected.
The following sections describe what else is needed to create a powerful ``interceptor''.


\section{Fundamental Project Structure}\label{sec:fundameltal-project-structure}

After deciding to use the preloading method to intercept function calls, a more detailed plan is needed to continue developing.
It was decided to have one single \texttt{intercept.so} file as a resulting artifact which then may be loaded via the \texttt{LD\_PRELOAD} environment variable.
The easiest and most straightforward way to structure the source code was to put all code in one single C file.
Listing~\ref{lst:intercept-preload.c} gives an overview of the underlying code structure.
Each additional function that should be intercepted simply has to be declared and defined in the same way as \texttt{malloc}.

\begin{listing}[htbp]
  \inputminted[linenos]{c}{src/listings/intercept-preload.c}
  \caption{Contents of \texttt{intercept-preload.c}.}
  \label{lst:intercept-preload.c}
\end{listing}


\section{Retrieving Function Argument Values}\label{sec:retrieving-function-argument-values}

Now that the first steps have been done, one needs to think about what exactly to record when intercepting.
A simple notification that a given function was called would not be sufficient.
Within the following subsections, effort is put into getting as much information as possible from each function call.

As already mentioned, \texttt{ltrace} uses function prototypes to format function arguments.
This allows \texttt{ltrace} to ``dynamically'' display function arguments for any new or unknown functions without the need for recompilation.
\cite{ltrace.conf.5}

However, because of the implementation complexity and the need for ``complex'' return types for string/buffer and structure values (see Section~\ref{sec:retrieving-function-return-values}), a statically compiled approach was used for this work.
This means that each function formats its arguments and return values itself without any configuration option.

The reason for retrieving as much information as possible from each function call is to make it possible to completely reconstruct the exact function calls and their sequence at a later point in time.
This allows analysis of these records to be performed independently of the corresponding execution of the program.
It should always be possible to fully parse the recorded calls without any knowledge of specific functions, their argument types, or their return value types.


\subsection{Numbers}\label{subsec:retrieving-numbers}

The simplest types of arguments are plain numbers, like integers (\texttt{int}, \texttt{long}, \ldots) or floating-point numbers (\texttt{float}, \texttt{double}).
(In fact, \textit{all} arguments are ultimately represented as numbers.
See the following subsections for examples.)
Plain numbers may simply be formatted as they are, in base 10 notation, or with a prefix like \texttt{0x} for hexadecimal or \texttt{0} for octal representation.

Example: \texttt{malloc(123)} (or \texttt{malloc(0x7B)}).

\subsection{Unspecific Pointers}\label{subsec:retrieving-unspecific-pointers}

Pointers about which no further information is known (like \texttt{void *}) are essentially integers.
Therefore, they may be treated as such.

Example: \texttt{free(0x55624164b2a0)}.

\subsection{Strings and Buffers}\label{subsec:retrieving-strings-buffers}

Strings in C are simply pointers to a null-terminated sequence of bytes in memory.
This means that a string ends at the first occurrence of the null byte (\texttt{0x00}).
To distinguish unspecific pointers from pointers to strings, a colon (\texttt{:}) is placed after the numerical value of the pointer.
The colon is followed by the contents of the string enclosed in double quotes (\texttt{"}).
Special characters inside the string are escaped with a backslash.

Example: \texttt{sem\_unlink(0x1234:"/test-semaphore")}.

Another type of string-like data in C is a buffer with a known length.
When buffers are used, usually another argument is passed to the function which indicates the length of the buffer.
This fact may be used to print out the contents of the buffer in the same way as normal C strings.

Example: \texttt{write(3, 0x1234:"Test\textbackslash{}x00ABC", 8)}.
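
The format described above can be sketched in a few lines of C.
The following is only a minimal illustration and not necessarily the implementation used in this work: the helper name \texttt{format\_buffer} is hypothetical, and the exact set of escape sequences (backslash, double quote, \texttt{\textbackslash{}n}, and \texttt{\textbackslash{}xNN} for all other non-printable bytes) is an assumption.

\begin{minted}{c}
#include <ctype.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes `0x<ptr>:"<escaped contents>"` into out.
 * Returns the number of characters written, or -1 if out is too small. */
int format_buffer(char *out, size_t out_size, const void *buf, size_t len) {
    const unsigned char *bytes = buf;
    int n = snprintf(out, out_size, "0x%" PRIxPTR ":\"", (uintptr_t)buf);
    if (n < 0 || (size_t)n >= out_size) return -1;
    size_t pos = (size_t)n;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = bytes[i];
        char esc[5];  /* longest escape is "\xNN" plus the terminating NUL */
        if (c == '"' || c == '\\') snprintf(esc, sizeof esc, "\\%c", c);
        else if (c == '\n')        snprintf(esc, sizeof esc, "\\n");
        else if (isprint(c))       snprintf(esc, sizeof esc, "%c", c);
        else                       snprintf(esc, sizeof esc, "\\x%02x", (unsigned)c);
        size_t esc_len = strlen(esc);
        /* keep room for the closing quote and the terminating NUL */
        if (pos + esc_len + 2 > out_size) return -1;
        memcpy(out + pos, esc, esc_len);
        pos += esc_len;
    }
    if (pos + 2 > out_size) return -1;
    out[pos++] = '"';
    out[pos] = '\0';
    return (int)pos;
}
\end{minted}

A null-terminated string can be formatted with the same helper by passing \texttt{strlen(str)} as the length.
Under the assumptions above, the eight bytes of the buffer from the \texttt{write} example are rendered as the pointer value followed by \texttt{:"Test\textbackslash{}x00ABC"}.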

\subsection{Flags}\label{subsec:retrieving-flags}

Some functions dedicate one of their arguments to flags, which may be combined by bitwise OR\@.
These arguments are also of an integer type.
To distinguish flag arguments from others, a pipe symbol (\texttt{|}) is used after the colon, between the flags, and after the last flag.

Example: \texttt{open(0x1234:"test.txt", 0102:|O\_CREAT|O\_RDWR|, 0644)}.
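
As an illustration, the flag notation for \texttt{open} could be produced as sketched below.
The function name \texttt{format\_open\_flags} and the short flag table are hypothetical, and printing the access mode last is an assumption made only to mirror the example above.

\begin{minted}{c}
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Appends src to the string in dst if it fits; returns 0 on success. */
static int append(char *dst, size_t dst_size, const char *src) {
    size_t used = strlen(dst), need = strlen(src);
    if (used + need + 1 > dst_size) return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Hypothetical helper: formats the flags argument of open() as
 * `<octal value>:|FLAG|FLAG|`. Only a few flags are listed here. */
int format_open_flags(char *out, size_t out_size, int flags) {
    static const struct { int bit; const char *name; } table[] = {
        { O_CREAT, "O_CREAT" }, { O_EXCL, "O_EXCL" }, { O_TRUNC, "O_TRUNC" },
        { O_APPEND, "O_APPEND" }, { O_NONBLOCK, "O_NONBLOCK" },
    };
    int n = snprintf(out, out_size, "%#o:|", (unsigned)flags);
    if (n < 0 || (size_t)n >= out_size) return -1;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if ((flags & table[i].bit) == 0) continue;
        if (append(out, out_size, table[i].name) || append(out, out_size, "|"))
            return -1;
    }
    /* The access mode is not a bit flag (O_RDONLY is 0), so it is
     * extracted with the O_ACCMODE mask instead of being tested bitwise. */
    const char *mode = "O_RDONLY";
    if ((flags & O_ACCMODE) == O_WRONLY) mode = "O_WRONLY";
    else if ((flags & O_ACCMODE) == O_RDWR) mode = "O_RDWR";
    if (append(out, out_size, mode) || append(out, out_size, "|")) return -1;
    return 0;
}
\end{minted}

On x86-64 Linux, where \texttt{O\_CREAT} is \texttt{0100} and \texttt{O\_RDWR} is \texttt{02}, calling this helper with \texttt{O\_CREAT | O\_RDWR} produces \texttt{0102:|O\_CREAT|O\_RDWR|}.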

\subsection{Constants}\label{subsec:retrieving-constants}

Some functions expect named constants as arguments.
These constants are typically defined as C macros.
This makes the source code more readable (and portable).
Constants are represented as an integer, again followed by a colon and the name of the constant, this time without any special characters to distinguish them from other types.

Example: \texttt{socket(2:AF\_INET, 1:SOCK\_STREAM, 6)}.
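
A minimal sketch of this notation for the first argument of \texttt{socket} is given below.
The function names are hypothetical and only three address families are covered; falling back to the plain number for unknown values is an assumption that mirrors the third argument in the example above.

\begin{minted}{c}
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical lookup of a few address-family constants. */
const char *address_family_name(int domain) {
    switch (domain) {
        case AF_UNIX:  return "AF_UNIX";
        case AF_INET:  return "AF_INET";
        case AF_INET6: return "AF_INET6";
        default:       return NULL;
    }
}

/* Formats `<value>:<NAME>`, or only `<value>` if the name is unknown. */
int format_address_family(char *out, size_t out_size, int domain) {
    const char *name = address_family_name(domain);
    int n = name != NULL ? snprintf(out, out_size, "%d:%s", domain, name)
                         : snprintf(out, out_size, "%d", domain);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
\end{minted}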

\subsection{Pointers to Arrays}\label{subsec:retrieving-pointers-to-arrays}

Sometimes arrays are used as arguments.
Arrays in C work similarly to strings: they are either terminated by a sentinel element (an element of value 0), or their length is given explicitly.
To represent them, the elements are enclosed in square brackets (\texttt{[]}) and separated by commas (\texttt{,}).
Each element may be represented as an ``argument'' on its own (as illustrated by the example).

Example: \\
\texttt{getopt(2, 0x7f0b8:[0x7feb3:"./main", 0x7fee6:"arg"], 0x123:"v")}.
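
For an array terminated by a null pointer, such as the argument vector passed to \texttt{getopt}, the notation could be generated as sketched below.
The function name \texttt{format\_string\_array} is hypothetical, and escaping of the element contents is omitted for brevity.

\begin{minted}{c}
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a NULL-terminated array of C strings as
 * `0x<ptr>:[0x<ptr>:"elem", 0x<ptr>:"elem"]`. */
int format_string_array(char *out, size_t out_size, char *const array[]) {
    int n = snprintf(out, out_size, "0x%" PRIxPTR ":[", (uintptr_t)array);
    if (n < 0 || (size_t)n >= out_size) return -1;
    size_t pos = (size_t)n;
    for (size_t i = 0; array[i] != NULL; i++) {
        /* every element is printed like a string argument of its own */
        n = snprintf(out + pos, out_size - pos, "%s0x%" PRIxPTR ":\"%s\"",
                     i > 0 ? ", " : "", (uintptr_t)array[i], array[i]);
        if (n < 0 || (size_t)n >= out_size - pos) return -1;
        pos += (size_t)n;
    }
    n = snprintf(out + pos, out_size - pos, "]");
    return (n < 0 || (size_t)n >= out_size - pos) ? -1 : 0;
}
\end{minted}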

\subsection{Pointers to Structures}\label{subsec:retrieving-pointers-to-structures}

In rare cases, pointers to structures (\texttt{struct}) are used as arguments.
Two curly brackets (\texttt{\{\}}) are used to indicate structures.
The field names are displayed plainly, each followed by a colon and the value of that field.
Commas are used to separate the fields.

Example: \texttt{\tiny connect(2, 0x123:\{sa\_family: 2:AF\_INET, sin\_addr: "1.1.1.1", sin\_port: 80\}, 16)}.
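
The following sketch shows how an IPv4 socket address could be rendered in this notation.
The function name \texttt{format\_sockaddr\_in} is hypothetical; the field names are taken from the example above, and only the \texttt{AF\_INET} case is handled.

\begin{minted}{c}
#include <arpa/inet.h>
#include <inttypes.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: formats an IPv4 socket address as
 * `0x<ptr>:{sa_family: 2:AF_INET, sin_addr: "a.b.c.d", sin_port: N}`. */
int format_sockaddr_in(char *out, size_t out_size,
                       const struct sockaddr_in *addr) {
    char ip[INET_ADDRSTRLEN];
    if (addr->sin_family != AF_INET) return -1;
    if (inet_ntop(AF_INET, &addr->sin_addr, ip, sizeof ip) == NULL) return -1;
    int n = snprintf(out, out_size,
                     "0x%" PRIxPTR ":{sa_family: %d:AF_INET, "
                     "sin_addr: \"%s\", sin_port: %u}",
                     (uintptr_t)addr, (int)addr->sin_family, ip,
                     /* the port is stored in network byte order */
                     (unsigned)ntohs(addr->sin_port));
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
\end{minted}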


\section{Retrieving Function Return Values}\label{sec:retrieving-function-return-values}

It might seem that retrieving return values of functions is as straightforward as retrieving their arguments, but this is not entirely the case.
Many libc functions return \texttt{-1} (or \texttt{NULL} in the case of functions returning a pointer) on error and set \texttt{errno} to indicate the exact type of error.
Other functions (like \texttt{read}, \texttt{pipe}, or \texttt{sem\_getvalue}) additionally store their output in memory referenced by a pointer that was passed to them as an argument.
The following examples illustrate how this challenge was solved.

Example (\texttt{malloc}): \\
\texttt{return 0x1234; errno 0}, \\
\texttt{return 0; errno ENOMEM}.

Some libc functions return their results via a pointer that was previously passed to them as an argument.
The \texttt{pipe} function is called with an \texttt{int} array of size two as an argument and stores the file descriptors of the two pipe ends in this array.
The \texttt{read} function is called with a pointer to a buffer and a corresponding size and stores the data it has read in this buffer.

Example (\texttt{pipe}): \\
\texttt{return 0; errno 0; fildes=[3,4]}, \\
\texttt{return -1; errno ENFILE}.

Example (\texttt{read}): \\
\texttt{return 12; errno 0; buf=0x7fff70:"Hello World!"}, \\
\texttt{return -1; errno EINTR}.
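
The record format of the \texttt{pipe} example can be sketched as follows.
The names \texttt{errno\_name} and \texttt{format\_pipe\_result} are hypothetical, the table of error names is deliberately incomplete, and printing an unknown error as a plain number is an assumption.

\begin{minted}{c}
#include <errno.h>
#include <stdio.h>

/* Hypothetical lookup for a few errno values; unknown values yield NULL. */
const char *errno_name(int err) {
    switch (err) {
        case EINTR:  return "EINTR";
        case ENOMEM: return "ENOMEM";
        case EMFILE: return "EMFILE";
        case ENFILE: return "ENFILE";
        default:     return NULL;
    }
}

/* Formats the outcome of pipe(): return value, errno and, on success,
 * the two descriptors that pipe() stored in its output argument. */
int format_pipe_result(char *out, size_t out_size, int ret, int err,
                       const int fildes[2]) {
    const char *name = errno_name(err);
    int n;
    if (ret == 0)
        n = snprintf(out, out_size, "return 0; errno %d; fildes=[%d,%d]",
                     err, fildes[0], fildes[1]);
    else if (name != NULL)
        n = snprintf(out, out_size, "return %d; errno %s", ret, name);
    else
        n = snprintf(out, out_size, "return %d; errno %d", ret, err);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
\end{minted}

A wrapper using such a helper would have to copy \texttt{errno} into a local variable immediately after calling the real function, because any later library call may overwrite it.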


\section{Determining Function Call Location}\label{sec:determining-function-call-location}

Besides argument values and return values, it is useful to know where inside the intercepted program the function call originated.
At first, this seems hardly possible.
However, a function always knows at least its return address, i.e., the address to which the instruction pointer is set when the function finishes.
With this information, it may be estimated where the call to the current function came from.

\subsection{Return Address and Relative Position}\label{subsec:return-address-and-relative-position}

As already mentioned, the return address of a function is vital for estimating where the call came from.
Luckily, GCC provides the means to get the return address of the current function.
See the GCC manual~\cite[Section~7.6]{gcc}:

\begin{quote}
  \begin{description}
    \item[\texttt{void *\_\_builtin\_return\_address(unsigned int \textit{level})}] \ \

    This function returns the return address of the current function, or of one of its callers.
    The \textit{level} argument is number of frames to scan up the call stack.
    A value of \texttt{0} yields the return address of the current function, a value of \texttt{1} yields the return address of the caller of the current function, and so forth.
    \lbrack\dots\rbrack
  \end{description}
\end{quote}

The return address on its own is of limited use, among other things because of Address Space Layout Randomization (ASLR), which is enabled for almost all modern programs.
ASLR is a security feature that randomly places shared objects (libraries) in the virtual memory of a program on each execution.
In contrast to always positioning the same object at the same address, this makes it harder to exploit internal memory structures.

Fortunately, the dynamic linking library includes a function to translate a given virtual memory address to symbolic information without having to worry about ASLR and other obstacles.
See the dladdr(3) Linux manual page~\cite{dladdr.3}:

\begin{quote}
  \begin{description}
    \item[\texttt{int dladdr(const void *addr, Dl\_info *info)}] \ \

    The function \texttt{dladdr()} determines whether the address specified in \textit{addr} is located in one of the shared objects loaded by the calling application.
    If it is, then \texttt{dladdr()} returns information about the shared object and symbol that overlaps \textit{addr}.
    This information is returned in a \texttt{Dl\_info} structure:

    \begin{minted}{C}
typedef struct {
  const char *dli_fname; /* Pathname of shared object
                            that contains address */
  void       *dli_fbase; /* Base address at which
                            shared object is loaded */
  const char *dli_sname; /* Name of symbol whose
                            definition overlaps addr */
  void       *dli_saddr; /* Exact address of symbol
                            named in dli_sname */
} Dl_info;
    \end{minted}

    \lbrack\dots\rbrack
  \end{description}
\end{quote}

Using the information from \texttt{Dl\_info}, it is possible to determine exactly which (shared) object the call came from (\texttt{dli\_fname}).
Furthermore, the relative position inside this (shared) object can be calculated by subtracting \texttt{dli\_fbase} from the return address.
Keep in mind that the return address only allows an estimate of the origin of the call.
Especially in heavily optimized programs, calls on different code paths might share the same return address.
Optionally, the name of the ``symbol'' (function) from which the call came may be retrieved (\texttt{dli\_sname}).
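
The combination of \texttt{\_\_builtin\_return\_address} and \texttt{dladdr} can be sketched as follows.
The structure \texttt{call\_origin} and both function names are hypothetical and serve only as an illustration; the sketch assumes GCC (or a compatible compiler) and glibc on Linux.

\begin{minted}{c}
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* required by glibc to declare dladdr() and Dl_info */
#endif
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record of where a call came from. */
struct call_origin {
    const char *object;  /* path of the object containing the address */
    uintptr_t   offset;  /* return address relative to the load base */
    const char *symbol;  /* enclosing exported symbol, or NULL if unknown */
};

/* Resolves a return address. Returns 0 on success and -1 if the address
 * does not belong to any loaded object. */
int resolve_origin(const void *ret_addr, struct call_origin *origin) {
    Dl_info info;
    if (dladdr(ret_addr, &info) == 0) return -1;
    origin->object = info.dli_fname;
    origin->offset = (uintptr_t)ret_addr - (uintptr_t)info.dli_fbase;
    origin->symbol = info.dli_sname;
    return 0;
}

/* Stand-in for an intercepted function. It must not be inlined because
 * an inlined function has no return address of its own. */
__attribute__((noinline)) int where_was_i_called(struct call_origin *origin) {
    return resolve_origin(__builtin_return_address(0), origin);
}
\end{minted}

With glibc versions older than 2.34, a program using \texttt{dladdr} has to be linked with \texttt{-ldl}.
Because \texttt{dladdr} only considers symbols in the dynamic symbol table, \texttt{symbol} frequently remains \texttt{NULL} for calls that originate from the main executable.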


\subsection{Source File and Line Number}\label{subsec:source-file-and-line-number}

DWARF is a file format used for storing debugging information (like source file, line number) inside compiled binaries.
This allows debuggers and other analysis programs to give better feedback to the user.
\cite{dwarfstd.org}

This also helps to find the origin of a given function call.
When a program is compiled with GCC using the flags \texttt{-g} or \texttt{-gdwarf}, GCC includes the DWARF debug sections in the resulting binary.
Using the \texttt{readelf} tool, it is possible to make use of these debug sections.
See the readelf(1) Linux manual page~\cite[Section OPTIONS]{readelf.1}:

\begin{quote}
  \begin{description}
    \item[\texttt{-{}-debug-dump}]
    Displays the contents of the DWARF debug sections in the file, if any are present.
    [\dots]
    The letters and words refer to the following information:
    \begin{description}
      \item {}[\dots]
      \item[\texttt{=rawline}] Displays the contents of the \texttt{.debug\_line} section in a raw format.
      \item[\texttt{=decodedline}] Displays the interpreted contents of the \texttt{.debug\_line} section.
      \item {}[\dots]
    \end{description}
  \end{description}
\end{quote}

The resulting output relates relative addresses to source files and line numbers, which makes it possible to look up both for any given relative address inside the binary.
If this information is present, it is printed within the meta-information of the function call (see Section~\ref{sec:intercepting-example}).


\section{\texttt{intercept.so} Library}\label{sec:intercept.so-library}

The time has come for putting it all together.
As mentioned in Section~\ref{sec:fundameltal-project-structure}, almost the whole project exists in one source file, \texttt{intercept.c}.
This file is compiled to \texttt{intercept.so}, which may be preloaded using \texttt{LD\_PRELOAD} and controlled with other environment variables.
These other environment variables are described in the following:

\begin{description}
  \item[\texttt{INTERCEPT}]
    This variable has to be set to enable function call interception.
    The value decides where to output/print/write/send the recorded function calls.
    Values may be \texttt{stdout}, \texttt{stderr}, \texttt{file:\textit{<path>}}, \texttt{unix:\textit{<path>}}.
  \item[\texttt{INTERCEPT\_VERBOSE}]
    This variable indicates whether the contents of strings and structures should be printed in full or left empty.
    Possible values are \texttt{0} and \texttt{1} (default).
  \item[\texttt{INTERCEPT\_FUNCTIONS}]
    This variable is used to specify which function calls should be intercepted.
    It is a list separated by commas, colons, or semicolons.
    Wildcards (\texttt{*}) at the end of function names are possible.
    A prefix of \texttt{-} indicates that the following function should not be intercepted.
    Example: \texttt{*,-sem\_*} intercepts all functions except those which start with \texttt{sem\_}.
    By default, all (implemented) functions are intercepted.
  \item[\texttt{INTERCEPT\_LIBRARIES}]
    This variable is used to specify which libraries' function calls should be intercepted.
    It is a list separated by commas, colons, or semicolons.
    Wildcards (\texttt{*}) at the end of library paths are possible.
    A prefix of \texttt{-} indicates that the following library path should not be intercepted.
    Example: \texttt{*,-/lib*,-/usr/lib*} intercepts only function calls originating from binaries outside \texttt{/lib*} or \texttt{/usr/lib*}, which in most cases means the executed program itself.
    By default, function calls from everywhere are intercepted.
\end{description}
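
How such a filter list could be evaluated is sketched below.
This is not necessarily how \texttt{intercept.so} implements it: the function names are hypothetical, and it is assumed that the entries are evaluated from left to right, that the last matching entry decides, and that a name matching no entry is not selected.

\begin{minted}{c}
#include <stdbool.h>
#include <string.h>

/* Returns true if name matches the pattern of the given length;
 * a trailing '*' matches any (possibly empty) suffix. */
static bool pattern_matches(const char *pattern, size_t len, const char *name) {
    if (len > 0 && pattern[len - 1] == '*')
        return strncmp(pattern, name, len - 1) == 0;
    return strlen(name) == len && strncmp(pattern, name, len) == 0;
}

/* Hypothetical evaluation of a filter list such as "*,-sem_*". */
bool is_selected(const char *list, const char *name) {
    bool selected = false;
    while (*list != '\0') {
        size_t len = strcspn(list, ",:;");  /* length of the current entry */
        bool exclude = len > 0 && list[0] == '-';
        const char *pattern = list + (exclude ? 1 : 0);
        size_t pattern_len = len - (exclude ? 1 : 0);
        if (pattern_len > 0 && pattern_matches(pattern, pattern_len, name))
            selected = !exclude;            /* the last match wins */
        list += len;
        if (*list != '\0') list++;          /* skip the separator */
    }
    return selected;
}
\end{minted}

Under these assumptions, the list \texttt{*,-sem\_*} selects \texttt{malloc} but not \texttt{sem\_wait}.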

The shared object currently supports intercepting the following functions:
\texttt{malloc}, \texttt{calloc}, \texttt{realloc}, \texttt{reallocarray}, \texttt{free}, \texttt{getopt}, \texttt{exit},
\texttt{read}, \texttt{pread}, \texttt{write}, \texttt{pwrite}, \texttt{close}, \texttt{sigaction}, \texttt{sem\_init},
\texttt{sem\_open}, \texttt{sem\_post}, \texttt{sem\_wait}, \texttt{sem\_trywait}, \texttt{sem\_timedwait}, \texttt{sem\_getvalue},
\texttt{sem\_close}, \texttt{sem\_unlink}, \texttt{sem\_destroy}, \texttt{shm\_open}, \texttt{shm\_unlink}, \texttt{mmap},
\texttt{munmap}, \texttt{ftruncate}, \texttt{fork}, \texttt{wait}, \texttt{waitpid}, \texttt{execl}, \texttt{execlp},
\texttt{execle}, \texttt{execv}, \texttt{execvp}, \texttt{execvpe}, \texttt{execve}, \texttt{fexecve}, \texttt{pipe},
\texttt{dup}, \texttt{dup2}, \texttt{dup3}, \texttt{socket}, \texttt{bind}, \texttt{listen}, \texttt{accept}, \texttt{connect},
\texttt{getaddrinfo}, \texttt{freeaddrinfo}, \texttt{send}, \texttt{sendto}, \texttt{sendmsg}, \texttt{recv}, \texttt{recvfrom},
\texttt{recvmsg}, \texttt{getline}, \texttt{getdelim}.

\section{\texttt{intercept} Command}\label{sec:intercept-command}

To make the usage of the aforementioned shared object easier, a simple Python script has been written.
This script may be used as a command line tool.
See Listing~\ref{lst:intercept}.

\begin{listing}[htbp]
  \inputminted[linenos]{python}{../proj/intercept/intercept}
  \caption{Contents of \texttt{intercept}.}
  \label{lst:intercept}
\end{listing}

The synopsis of the command is as follows:
\begin{minted}{text}
intercept [-h] [-F FUNCTIONS] [-s] [-o | -L LIBRARIES] \
  [-l LOG | -i INTERCEPT] [--] COMMAND [ARGS...]
\end{minted}

\begin{description}
  \item[\texttt{-F}, \texttt{-{}-functions}]
    A list of functions to intercept.
    See Section~\ref{sec:intercept.so-library} for more details.
    Default value is \texttt{*}.
  \item[\texttt{-s}, \texttt{-{}-sparse}]
    Indicates that strings and structures should be printed empty to save bandwidth.
  \item[\texttt{-o}, \texttt{-{}-only-own}]
    A shorthand for \texttt{-L *,-/lib*,-/usr/lib*}.
    This has the effect that only function calls from the executed binary itself are recorded.
  \item[\texttt{-L}, \texttt{-{}-libraries}]
    A list of library paths to intercept function calls from.
    See Section~\ref{sec:intercept.so-library} for more details.
    Default value is \texttt{*} (except when \texttt{-o} is present).
  \item[\texttt{-l}, \texttt{-{}-log}]
    Used to specify in which file the recorded function calls should be logged.
    Shorthand for \texttt{-i file:\textit{<arg>}}.
  \item[\texttt{-i}, \texttt{-{}-intercept}]
    Decides where to output/print/write/send the recorded function calls.
    Values may be \texttt{stdout}, \texttt{stderr}, \texttt{file:\textit{<path>}}, \texttt{unix:\textit{<path>}}.
    See Section~\ref{sec:intercept.so-library} for more details.
\end{description}


\section{Example}\label{sec:intercepting-example}

To illustrate the output format, Listing~\ref{lst:intercept-client} provides some recorded function calls.
Most lines had to be broken up into multiple lines for better readability.
The recorded calls stem from a program written by the author as a solution for an assignment in the Operating Systems course at university.
It is a simple HTTP client.
The program was invoked using \texttt{./intercept -o -{}- ./client http://www.complang.tuwien.ac.at/}.

The first number on each line is the Unix time with nanosecond precision.
The second and third numbers correspond to the process ID and thread ID, respectively.
Each line contains either a recorded call to a function or a recorded return from a function.
After the arguments of each function call, a colon (\texttt{:}) indicates the beginning of the meta-information.
This information always includes the return address, i.e., the address to which execution returns when the function completes.
If available, the interpretation of the return address is also provided.
This includes the offset relative to the calling binary and, if the binary was compiled using \texttt{gcc -g} or \texttt{gcc -gdwarf}, a combination of source file and line number.

\begin{listing}[htbp]
  \inputminted[fontsize=\tiny]{text}{src/listings/intercept-client.txt}
  \caption{Recorded function calls from \texttt{./client}.}
  \label{lst:intercept-client}
\end{listing}


\section{Automated Testing on Intercepted Function Calls}\label{sec:automated-testing-on-intercepted-function-calls}

The recorded function calls of a program run may now be used to perform checks and tests on them.
It is trivially possible to check which functions were called and in what order.
Furthermore, it is possible to check various pre- and post-conditions for each function call.
This is beneficial because many library functions in C rely on these pre- and post-conditions, which are not enforced by the compiler or in any other way.

For example, the \texttt{malloc} function has the post-condition that the returned value later needs to be passed to \texttt{free} to avoid memory leaks.
The \texttt{free} function, on the other hand, has the pre-condition that the passed value was previously acquired using \texttt{malloc} and has not yet been freed.
Any violation of such pre- and post-conditions may be reported as non-compliant behavior.
\cite{malloc.3}

This means that intercepted function calls allow a tester to check whether programmers use library functions in compliance with their specifications.
Other checks may include detecting calls to ``forbidden'' functions or verifying that a specific function is called a required number of times (for example, exactly three times).
Another important post-condition of most library functions is the return value, which in most cases indicates success or failure of an operation.
However, intercepting calls alone cannot verify whether a program actually checks the return value of a function and acts accordingly.
Chapter~\ref{ch:manipulating-function-calls} shows how this problem may be solved.


\subsection{Validating Memory Management}\label{subsec:testing-memory-management}

The most basic memory management functions in the C standard library are the following.

\begin{description}
  \item[\texttt{malloc}, \texttt{calloc}]
    Allocate memory. \cite{malloc.3}
  \item[\texttt{realloc}, \texttt{reallocarray}]
    Change the size of a previously allocated memory block and possibly move the block to another position in virtual memory. \cite{malloc.3}
  \item[\texttt{free}]
    Free previously allocated memory. \cite{malloc.3}
  \item[\texttt{getaddrinfo}]
    Allocate and initialize a linked list of \texttt{addrinfo} structures. \cite{getaddrinfo.3}
  \item[\texttt{freeaddrinfo}]
    Free the linked list that was previously allocated by \texttt{getaddrinfo}. \cite{getaddrinfo.3}
  \item[\texttt{getline}, \texttt{getdelim}]
    Read a line (or a record terminated by a given delimiter) from a stream.
    Allocate memory on their own, which must be freed afterward. \cite{getline.3}
\end{description}

By intercepting only these functions, it is possible to check if all allocated memory blocks in a simple program were properly allocated and freed.
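
Such a check can be sketched as a replay of the recorded calls.
The types and the function \texttt{check\_memory} below are hypothetical and heavily simplified: every record is reduced to either an allocation that returned a pointer or a release of a pointer, and \texttt{realloc} is not modeled.

\begin{minted}{c}
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record of one intercepted call. */
enum event_kind { EVENT_ALLOC, EVENT_FREE };
struct event { enum event_kind kind; uintptr_t ptr; };

struct report { size_t leaks; size_t invalid_frees; };

#define MAX_LIVE 1024  /* arbitrary limit of simultaneously live blocks */

/* Replays the events and counts blocks that were never freed as well as
 * frees of pointers that were not live (double free or unknown pointer).
 * Returns false if more than MAX_LIVE blocks are live at once. */
bool check_memory(const struct event *events, size_t count,
                  struct report *report) {
    uintptr_t live[MAX_LIVE];
    size_t live_count = 0;
    report->leaks = 0;
    report->invalid_frees = 0;
    for (size_t i = 0; i < count; i++) {
        uintptr_t ptr = events[i].ptr;
        if (ptr == 0) continue;  /* failed allocation or free(NULL) */
        if (events[i].kind == EVENT_ALLOC) {
            if (live_count == MAX_LIVE) return false;
            live[live_count++] = ptr;
            continue;
        }
        size_t j = 0;
        while (j < live_count && live[j] != ptr) j++;
        if (j == live_count) {
            report->invalid_frees++;
        } else {
            live_count--;
            live[j] = live[live_count];  /* remove by moving the last entry */
        }
    }
    report->leaks = live_count;
    return true;
}
\end{minted}

For the sequence \texttt{malloc} returning \texttt{0x1000}, \texttt{malloc} returning \texttt{0x2000}, and two calls to \texttt{free(0x1000)}, this sketch reports one leak and one invalid free.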


\subsection{Validating Resource Management}\label{subsec:validating-resource-management}

In addition to memory management, the proper use of other resources, most notably file descriptors, can also be checked.
Many functions in the C standard library rely on file descriptors.
It may be checked if file descriptors were properly acquired, if only previously acquired file descriptors are used, and if these file descriptors are closed after their use.
Semaphores are also relevant for this work because their API does not rely on file descriptors.
Due to time restrictions, no detailed list for validating resource management has been put together.