Changes to the Zend interpreter to support concolic execution

Both the Ardilla and Apollo projects utilize a concolic execution for PHP. A concolic execution computes both a concrete and a symbolic value for every variable. The concrete value is just as in ordinary execution. The symbolic value is an expression, in terms of the program's input, that shows how that value was computed; you can think of it as a sort of record of the program's execution, or as an unfolding of the program text. See below for an example. Concolic execution is one aspect of the so-called "concolic testing" approach to test generation.

The concolic execution is implemented as a straightforward extension of the Zend interpreter. This document describes that extension. It was originally made by a student while he was an intern at IBM, so the work is owned by IBM, and IBM is not willing to release it. Due to IBM's ownership of the intellectual property in the original implementation, we describe the changes only at a high level. From this description another person could create a clean-room re-implementation of the changes. That would let us release the rest of Ardilla, which builds on the extension. (The other parts of Ardilla, besides the Zend extension, are not owned by IBM.)

Before making the change, you should have read the Ardilla paper. You need to understand concolic execution. You do not need to understand concolic testing, though you may find it helpful.

The origin field

Zend's basic structure value is zval. The extension is to add to zval an additional field, origin, that tracks the symbolic expression. The origin field is non-empty only for variables that depend, directly or indirectly, from user input.

When a global parameter is first accessed (via the global parameters tables $_GET, $_POST, $_REQUEST, $_COOKIE), the origin field is initialized to a new symbolic variable.

Whenever a computation is performed, it is performed on the concrete value and also on the symbolic value (if any). As a contrived example, suppose that Zend executes the statement "z = x + y" when:

Then after the statement, variable z has concrete value 25 and symbolic value "quantity + (kilometers * 1000)".

(TODO: What is the symbolic expression if one argument of a computation has an origin, but the other does not?)

Whenever a conditional statement (if, switch, ...) is executed on a value with an origin, the modified interpreter also outputs the result of the conditional statement

Many of the changes are in the file zend_vm_def.h, which is a template that is used to create zend_vm_execute.h. Thus, it is important to regenerate zend_vm_execute.h every time that zend_vm_def.h changes.

Code changes, organized by file

zend.c
Trace errors
zend.h
Change zval to have a field origin to track parameter flow. Update parameters' origin on assignment.
zend_compile.c
Initialize origin in new zval. Trace function calls.
zend_execute.c
Update origin on copy. Trace access to zval (with non null origin). Trace access to global parameters and global hash-tables.
zend_execute.h
Trace comparing zval to boolean.
zend_ini_parser.c
Initialize origin in new zval.
zend_ini_scanner.c
Initialize origin in new zval.
zend_language_scanner.c
Trace function calls. Initialize origin in new zval.
zend_operators.c
Initialize origin in new zval. Copy origin on assignment. Trace comparison between parameters and values.
zend_vm_def.h
Find all globals from request hash-table. Trace access to global parameters. Trace execution errors. Trace compare to a boolean. Trace entry and exit to a function. Trace db calls with symbolic parameters. Give new origins to function return values if any parameter was symbolic. Trace calls to exit.
zend_vm_execute.skl
Trace statement execution.
zend_vm_gen.php
Trace comparison operators.
array.c
Trace accessing global hashtables.
php_variables.c
Create infrastructure to allow tracing access to global hashtables. Add origin to input parameters.
php_variables.h
Create infrastructure to allow tracing access to global hashtables.
cgi_main.c
Start tracing on call to cgi.