The approach we used to document provenance combines the data-oriented and process-oriented models. In the process-oriented model, binary provenance describes how a piece of software was compiled. It consists of two parts: a description of the environment and a description of the binary itself. The environment description includes the operating system, environment variables, compiler used, and libraries installed. The binary description includes configuration flags and any modifications made to configuration files or makefiles. Our goal is to enable the user to reproduce not only the binary, but also the environment in which it was run.
A fundamental difference between executables is the hardware platform on which they were compiled. Differences in floating-point performance across architectures can have a profound impact on the outcome of a calculation and have been widely publicized in the popular media (Halfhill, 1995). The XSD captures not only the architecture, but also the specific processor and the flags that are enabled on it.
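As a rough illustration of how the processor model and its enabled flags might be gathered, the sketch below parses text in the format of Linux's /proc/cpuinfo. It assumes an x86 Linux host, where the "model name" and "flags" fields are present. The function name `parse_cpuinfo`, the dictionary keys, and the sample text are hypothetical and are not taken from the provenance XSD.

```python
def parse_cpuinfo(text):
    """Extract the processor model and feature flags from cpuinfo-style text.

    Only the first processor entry is recorded. The returned keys are
    illustrative names, not element names from the provenance XSD.
    """
    info = {"model_name": None, "flags": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and lines without a "key : value" pair
        key = key.strip()
        if key == "model name" and info["model_name"] is None:
            info["model_name"] = value.strip()
        elif key == "flags" and not info["flags"]:
            info["flags"] = value.split()
    return info


# Hypothetical sample in the /proc/cpuinfo layout (not real hardware output).
sample = (
    "processor\t: 0\n"
    "model name\t: Example CPU 2.0GHz\n"
    "flags\t\t: fpu sse sse2\n"
)
cpu = parse_cpuinfo(sample)
```

On a real Linux system the same function could be applied to the contents of /proc/cpuinfo. Other platforms expose this information differently and would need their own collection step.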
Capturing pertinent details about the operating system is complicated, especially for Linux distributions, since each distribution contains many individually updated components. Essential information must be captured, such as the operating system name, version, distribution, kernel name, and kernel version. For example, an application running on Ubuntu Dapper Drake (http://www.ubuntu.com) must have the following operating system metadata: Linux, 6.06, Ubuntu Desktop, #1 PREEMPT, 2.6.15-27-386; whereas an application built on the Mac OS X Leopard platform must have the following operating system metadata: Mac OS, 10.5.1, n/a, Darwin, 9.10.0.
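The operating system fields listed above could be collected along the following lines. This is a minimal sketch using Python's standard `platform` module. The dictionary keys and the mapping of uname fields onto the kernel name and kernel version are assumptions made for illustration, not the element names of the provenance XSD. Because the distribution name and release are not reported uniformly across systems, the sketch takes them as arguments supplied by the person recording the provenance.

```python
import platform


def collect_os_metadata(distribution=None, os_version=None):
    """Gather operating system metadata for a provenance record.

    distribution and os_version are supplied by the caller, since they
    are not reported uniformly across operating systems.
    """
    uname = platform.uname()
    return {
        "name": uname.system,             # e.g. "Linux" or "Darwin"
        "version": os_version,            # e.g. "6.06", entered manually
        "distribution": distribution,     # e.g. "Ubuntu Desktop", entered manually
        "kernel_name": uname.version,     # kernel build string (assumed mapping)
        "kernel_version": uname.release,  # kernel release (assumed mapping)
        "architecture": uname.machine,    # e.g. "i686" or "x86_64"
    }


metadata = collect_os_metadata(distribution="Ubuntu Desktop", os_version="6.06")
```

Passing the distribution fields explicitly mirrors the manual recording step described in this section. An automated collector would need a platform-specific source for them.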
The compiler used and the libraries linked during compilation are crucial aspects of the environment. In addition to the compiler name and version, a list of the updates that have been applied is also captured. This section of the provenance metadata also records which flags were used when the compiler was invoked, with architecture and optimization flags being of special interest. Libraries used for compilation are described in the same way as the binary itself, and the description is recursive: if a library is in turn linked to other libraries, those libraries are also captured in its provenance.
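The recursive nature of library provenance can be pictured as a tree in which each library record holds the records of the libraries it links to. The sketch below is one possible in-memory representation. The class name `LibraryProvenance`, its fields, and the example libraries ("libA", "libB", "libC") are hypothetical placeholders and do not come from the provenance XSD.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class LibraryProvenance:
    """Provenance for one library, including the libraries it links to."""
    name: str
    version: str
    compiler_flags: List[str] = field(default_factory=list)
    linked_libraries: List["LibraryProvenance"] = field(default_factory=list)


def walk(library: LibraryProvenance, depth: int = 0) -> Iterator[Tuple[int, LibraryProvenance]]:
    """Yield (depth, library) pairs depth first, starting with the library itself."""
    yield depth, library
    for dependency in library.linked_libraries:
        yield from walk(dependency, depth + 1)


# Hypothetical chain: libC links to libB, which links to libA.
lib_a = LibraryProvenance("libA", "1.0")
lib_b = LibraryProvenance("libB", "2.1", ["-O2"], [lib_a])
lib_c = LibraryProvenance("libC", "0.9", ["-O2", "-march=i686"], [lib_b])

chain = [(depth, lib.name) for depth, lib in walk(lib_c)]
```

Walking the tree visits every library that contributes to the binary, however deep in the chain, which is the property the recursive description is meant to preserve.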
Binaries can also be configured prior to compilation. Some packages are distributed in a format for use with the GNU build system, or Autotools (Vaughan, 2000). Modification of the configure script or the makefile can yield substantially different results after compilation. The provenance XSD captures the flags passed to the configure script as well as modifications made to configure scripts and makefiles.
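To make the configuration record concrete, the following sketch serializes configure flags and file modifications to XML with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`configuration`, `configureFlag`, `modification`, `file`) and the example values are invented for illustration. The actual names are defined by the provenance XSD, which is not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_configuration_element(configure_flags, modifications):
    """Build an XML element recording configure flags and file modifications.

    modifications is a list of (file_name, description) pairs.
    Element names are hypothetical, not those of the provenance XSD.
    """
    config = ET.Element("configuration")
    for flag in configure_flags:
        ET.SubElement(config, "configureFlag").text = flag
    for file_name, description in modifications:
        mod = ET.SubElement(config, "modification", {"file": file_name})
        mod.text = description
    return config


# Hypothetical example values.
element = build_configuration_element(
    ["--prefix=/usr/local", "--enable-shared"],
    [("Makefile", "changed CFLAGS from -O2 to -O3")],
)
xml_text = ET.tostring(element, encoding="unicode")
```

Recording each modification with the file it applies to keeps changes to the configure script and to the makefile distinguishable in the resulting record.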
The concept of provenance can extend to knowledge of the behavior of executables, such as a description of their function. The Brain Surface Extractor (BSE) (Shattuck and Leahy, 2002), the Brain Extraction Tool (BET) (Smith, 2002), and MRI Watershed (Dale et al., 1999) are all brain extraction algorithms. However, their function may not be evident to a naive user, especially since they are commonly referred to by their abbreviations. This information, in addition to a short description of the executable, is also captured in the XSD and may be added to provenance XML files.
Executable provenance need only be collected once, when a binary is compiled or when a script is written. At present it must be recorded manually and then appended to the provenance XML. The LONI Pipeline is currently being extended to store and display executable provenance, which will eliminate the need for manual file editing in the future.