# Autopledge

Abstract

Even in following good coding practices, arbitrary code execution bugs can still exist. By leveraging pledge(2) system calls and a static analysis framework, we attempt to mitigate these bugs by automatically inserting pledge statements. Although an algorithm was devised to do this, time limitations prevented its full implementation.

#Introduction

##Motivation

System and high performance computing code are often written in low level languages such as C and C++. Low level languages provide high performance but are open to memory corruption attacks due to the lack memory safety and manual memory management. Memory corruption attacks (Szekeres 2013) manifests through memory errors such as overflows, underflows or dangling pointers. One main type of memory corruption attack is control-flow hijacking that takes control of the entire program by using memory errors to enable the write and execution of desired instructions. The classic control-flow attack that works by overwriting in memory programs with new desired instructions can prevented with non-executable data policies. However, attacks can use Return Oriented Programming (ROP) to bypass data policies by chaining together in-memory code. The library used for ROP attacks is libc, these attacks are referred "return-to-libc" attacks. Libc is ideal for attacks because it likely to be already residing in memory and there are enough instruction sequences in libc's functions that make it Turing complete by chaining together these sequences(Tran 2011). Currently there are no policies that prevent ROP because it executes valid code from memory, but mitigation techniques (Li 2010, Pappas 2012, Raadt) can be applied to reduce the effectiveness of such attacks.

Based on the observation that ROP uses many libraries that make system calls, De Raadt proposed a new mitigation technique against control-flow attacks _Pledge_(Raadt) for the OpenBSD operating system. To reduce attack surfaces, programs in OpenBSD can be annotated with pledge(pledge — restrict sys...) promises to limit the number of systems calls the program can make. Promises used in pledges are groupings of system calls that are related. For example, the stdio promise includes system calls required for IO such as fsync, read, write, pipe, etc. When running a pledged program, the kernel can enforce the pledged promises and report an error if the program attempts to make system calls that are not allowed.

To pledge a program, a list of all the system calls that the program can use is required. Over-pledged programs can malfunction while under-pledged programs can leave more attack surfaces than necessary. Therefore, a full understanding of the program's behavior and the interfaces it uses is required to correctly identify the list of all the system calls a program can use. Once all the system calls for a program is determined, they can be converted to promises for use in pledge.

Augmenting programs with pledges is non-trivial and time consuming even for simple UNIX utility programs. Pledging large programs is more challenging for developers due to difficulty in accurately determining behaviors of complex systems and the lack of understanding in low level libraries. Similar to permissioning in mobile operating systems and web browsers, these complications can lead to overpermissioning in real world applications (Felt 2011)(Jackson 2009). To solve this problem, we develop an automated system Autopledge to analyse C programs and insert pledges in optimal locations.

##Contributions

This paper make the following contributions:

1. An approach to find all system calls that are used by a C program through dataflow analysis.
2. An algorithm for determining optimal locations to insert pledge calls.
3. An implmentation to insert pledge statements to C source files.

Smartphones have large amounts of personal information such as sensor data, photos, and contacts. To protect user privacy and safety, granular application permissions are an essential part of mobile operating systems such as Android and iOS. Developers are likely to over permission their applications as a study (Felt 2011) found that about one-third of all Android applications are over-permissioned. Vidas and his colleges created a static analysis tool (Vidas) to help developers find the least amount of permissions. In contrast to Android permissioning, pledge allows for additional pledging anywhere in the code, further reducing the attack surface. We take advantage of this and add pledges to different parts of the program.

#Analysis

##Overview

A control-flow hijacking attack is considered successful when malicious code begins executing. A meaningful attack often requires the execution of system calls (Szekeres 2013). Pledges can be applied to mitigate the effectiveness of attacks through security vulnerabilities by reducing the available resources, i.e. types of system calls, to the attacker.

Security vulnerability attacks can possibly occur at any buffer through a buffer overflow. The ideal location to add a pledge call is directly after a system call so it cannot be used by any security vulnerability attack following that system call. However, if there is a subsequent use of that same system call later in the execution of the program, the pledge should only be placed after that last execution. If the pledge were placed after the first execution of a system call, the normal execution of the program would fail on any subsequent execution of the same system call. Therefore, the analysis must know the set of system calls that will execute in the future so that they are not pledged until they are no longer in that set.

##Notation

A basic block is a sequence of code with no branches in or out except from the entry and exit. It follows that there is no way to enter or jump from anywhere inside a basic block and all instructions inside the basic block execute only once and in order. Basic blocks can then be seen as the largest unit of code which preserves control flow for an analysis. Basic blocks make up functions and will therefore be denoted by $$f_0$$ to $$f_n$$ for a function $$f$$ with $$n$$ basic blocks.

Function _f_ has two basic blocks, the first has system call 1 and the second has system call 2. The first basic block in _f_ also calls function _g_ which has a single basic block that also has system call 2.

To determine where pledge calls can be inserted into the program shown in Figure 1, knowledge of function $$f$$ calling function $$g$$, $$f$$ calling system call 1, and both functions calling system call 2 is required. That knowledge would allow the determination that a pledge on system call 2 can only occur at the end of the second basic block in $$f$$ instead of after $$g$$ is called. To make that determination, two sets can be used to represent which system calls have executed so far and which system calls have yet to execute for each function, $$A$$ and $$A'$$, respectively. In order to form $$A$$ and $$A'$$ for a function $$f$$ with a call to another function $$g$$, the set of all system calls in $$g$$ will have to be known which will be called $$G$$. Finally, for each basic block $$f_i$$ in a function $$f$$, the set of system calls will be denoted by $$F_i$$

##Hypotheses

Upon executing the analysis, the set of system calls in each basic block $$F_i$$ of every executable function $$f$$ must be known. With this information, the set of system calls that have been executed $$A$$ and the set of system calls that remain to be executed $$A'$$ can be found for every basic block. To determine the earliest point in a program to remove a system call from those that are pledged, from the program's starting point, look at which types of system calls are in that first basic block of function $$f$$ and add those to $$A$$. The types of system calls that are in the remaining basic blocks of $$f$$, as well as those from functions which may be called from a call sites in $$f$$, $$G_j$$, are added to $$A'$$. By that point, all system calls for every $$G_j$$, even from functions they call, are included in that set. When a call site is added to the scope of $$A$$, all system calls from the functions associated with it do too.

$$A'$$ is then set subtracted from $$A$$ to determine if a system call in $$A$$ will not be executed in the future of the program's execution. Since a pledge is a whitelist, the result of $$A$$ \ $$A'$$ is subtracted from $$F$$ to determine the set of system calls to be pledged. If that result has changed since the last basic block, a new pledge call is needed to remove the system call that should no longer be used from being executed. The analysis continues by unioning the next basic block's set of system calls $$F_{i+1}$$ to $$A$$ and computing $$A'$$ without that set.

It then follows that in a function $$f$$ with $$n$$ basic blocks at basic block $$f_j$$ whos set of system calls is $$F_j$$:

$F=\bigcup_{i \in I}F_i$

$A=\bigcup_{i=0}^{j}F_i$

If function $$f$$ can be called by function $$g$$, the system calls belonging to the set of basic blocks that would be executed after the call site which could call $$f$$ in $$g$$ are denoted by $$B'$$. Then:

$A'=[ \bigcup_{i=j+1}^{n}F_i ] \cup B'$

If no function can call $$f$$, $$B'$$ would just be the empty set.

(1) follows from the fact that sets of system calls for each basic block $$F_i$$ form their containing function $$f$$ while (2) and (3) are made from two partitions of $$F_i$$ for $$i..n$$. However, (3) would require extra types of system calls if $$f$$ were called from $$g$$. Because the analysis is done on a per function basis, but considers system calls from other functions, $$g$$ would have it's own sets of executed and to be executed system calls which we'll call $$B$$ and $$B'$$. The context of $$B'$$ is required because if it contained a system call $$s$$ that was also in $$A$$ but not $$A'$$, then a pledge discluding $$s$$ would be made after the latest added basic block in $$A$$. With that context unioned with $$A'$$, $$s$$ will not be removed from subsequent pledges within $$f$$ since it will be required after $$f$$ has completed in $$g$$. It's also important to note that if an execution of a program has a call stack with multiple functions $$g_i$$ called before $$f$$, only the context of the function call before it is required because each function would union its context into its subsequent called function. This concept is shown in Figures 2 and 3.

$$A$$ contains the system calls of the first basic block in $$f$$ and $$A'$$ contains the system calls of the second basic block in $$f$$ as well as all system calls of $$g$$. The analysis will find that system call 1 can be removed from a pledge call after that first basic block in $$f$$, unless another function calling $$f$$ later requires it.

$$A$$ is increased by including all system calls available to function $$g$$ due to including a call site in $$f$$. However, a pledge is still not created to disclude system call 2 because it's still used later in $$f$$.

The notion of a call stack is a dynamic one that would be normally be seen in an execution trace even though this analysis is static. The information needed to complete the analysis meaningfully can be derived from a call graph. A call graph can be used to determine which functions can possibly be called from any function. That determination is used to derive any recursion as well as the necessary contexts that will be needed when solving for $$A'$$ so breaking pledges are not made.

#Autopledge

##Overview

An ideal pledge call is inserted at exactly the points in a program where static analysis has concluded the keywords it lists are the minimal set necessary for future operation. A conceptual overview of the algorithm which computes these points is given in section 3, however, the handling of both indirect calls, and recursion and looping constructs has not been handled.

Cycles, whether at the function or basic block level, are a concern. A valid analysis must ensure that the possibility of executing a code segment multiple times is taken into account, however, since reasoning about such behaviour would necessitate a form of symbolic execution, the authors propose a method of precision loss instead. Function and basic block summaries are computed such that all code segments in a cycle are considered to make the same set of system calls. This is implemented by representing the calling elements as nodes in a graph, where every node contains a bitvector of possible system calls. Then, for every node, the bitvector of the caller and their own are met and assigned to the caller's node. Since the bitvector and meet operator over bitvectors forms a lattice, application of this algorithm may be shown to a reach a fixpoint in maximally quadratic time in node cardinality.

Without extensive synthesis of guard statements, it is often the case that an indirect call will be made to multiple functions. A simplistic approach was taken: the call site is considered to make the union of system calls made by all possible functions. During pledge call insertion, if any of the possible functions do not have a valid location to insert a pledge (e.g. if the function in question does not make the call at all), it must be inserted following the indirect call to avoid a potential lapse in blacklist coverage.

##Setup

Some form of call graph must be generated, as this analysis hinges upon interprocedural information. This is provided by CIL in the form of a callgraph module implemented as an extension. Another data structure to encode system call information for every code segment, derived by the algorithm defined in section 3 in addition to the addendums made in the overview for section 4, is generated by visiting all functions and blocks, as necessary, in the analyzed program.

## Finding Pledge Call Locations

The core of our analysis performs an in-order traversal of the call graph per code segment. A set $$A$$ of system calls is generated by querying a single call in a precomputed table; the set $$A'$$, by taking the union of a global context (passed in recursively) and the local context, computed by taking the union of all function local-instructions. Whenever $$A \cup A' < A'$$, the function descends and attempts to discover another location in the code which would allow for a pledge call of higher granularity. If it fails, a pledge call is inserted after the initial call. Due to the previous passes, this will never insert a pledge call inside a function that may be called later on or inside of a cycle.

#Results

Due to time constraints, only a part of the analysis was completed by the time of submission. This included the computation of system calls on a per-function level and the propagation of said calls between all callers. As CIL lacks the same abstractions as LLVM/Clang, performing propagation on the block level was cumbersome; instead, it was performed on a per-statement level.

#Discussion

##Threats to validity

In order to create as simple of an analysis as possible, we have intentionally overlooked a number of possible, valid program behaviours. These include a lack of support for threads and child processes, which would be of particular issue in interpreters attempting to adapt our work. Furthermore, we use a context insensitive analysis: beyond any context information used indirectly by the call graph module within CIL, we do not consider the possibility of improving precision by determining which parts of a called function will be executed.

##Lessons learned

The field of program analysis is certainly complex, however, this is not aided by the degree of complexity of the C++ language. This complexity, expressed through the variety of data structures and unstructured program flow possible (and utilized) by many applications makes analysis difficult. Although this can be alleviated to an extent, the underlying complexity eventually becomes clear: few examples and resources are available to the prospective users of analysis frameworks, and a lack of shared terminology further complicates learning. An inherent complexity of the domain causes complexity of the API and a lack of popular usage prevents the iterative simplification of the API which in sum makes it difficult to make progress at any reasonable rate.

#Conclusions and future work

##Summary and Future Work

Although the implementation of Autopledge was incomplete, the algorithm proposed by this paper appears sound and may be applicable to further work. We have provided an overview of the previous work in this area, as well as a more formal definition of the constraints implied by pledge and by extension, other implementations of the notion of capabilities. Future work may expand upon the partial implementation provided by this paper, or investigate improvements to the algorithm itself.

##Future work

### References

1. pledge — restrict system operations. (2016).

2. Adrienne Porter Felt, Erika Chin, Steve Hanna, Dawn Song, David Wagner. Android permissions demystified. In Proceedings of the 18th ACM conference on Computer and communications security - CCS 11. Association for Computing Machinery (ACM), 2011. Link

3. Collin Jackson, Adam Barth, Andrew Bortz, Weidong Shao, Dan Boneh. Protecting browsers from DNS rebinding attacks. ACM Trans. Web 3, 1–26 Association for Computing Machinery (ACM), 2009. Link

4. Jinku Li, Zhi Wang, Xuxian Jiang, Michael Grace, Sina Bahram. Defeating return-oriented rootkits with Return-Less kernels. In Proceedings of the 5th European conference on Computer systems - EuroSys 10. Association for Computing Machinery (ACM), 2010. Link

5. Vasilis Pappas, Michalis Polychronakis, Angelos D. Keromytis. Smashing the Gadgets: Hindering Return-Oriented Programming Using In-place Code Randomization. In 2012 IEEE Symposium on Security and Privacy. Institute of Electrical & Electronics Engineers (IEEE), 2012. Link

8. Minh Tran, Mark Etheridge, Tyler Bletsch, Xuxian Jiang, Vincent Freeh, Peng Ning. On the Expressiveness of Return-into-libc Attacks. 121–141 In Lecture Notes in Computer Science. Springer Science $$\mathplus$$ Business Media, 2011. Link