Source code analysis typically requires the use of expensive and difficult-to-configure tools that support only a predefined set of standard checks. In some cases, it is possible to add customized checks, but doing so can be laborious and requires a deep understanding of each tool’s internal data structures, which means that it is rarely done.

The Cobra infrastructure simplifies and speeds up the analysis of general code patterns, common coding flaws, and coding rule compliance by storing the key information of source code in an extremely simple data structure. The tool uses a small built-in lexical analyzer that recognizes the main lexical units of C, C++, or Java source code and categorizes them as tokens. The resulting token sequence is straightforward to query interactively to identify patterns of interest in the code, and it makes it possible to quickly build customized checkers and analyzers that match such patterns in large repositories of code.

For example, to find all declarations and uses of a variable named ‘i’, the user would write the query ‘match i; display;’, which can be abbreviated to ‘m i; d;’.
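To make the idea of a token store concrete, the following is a minimal sketch in C of a lexer that splits source text into a doubly linked list of tokens, together with a function that counts the tokens matching a given text, roughly what a query such as ‘match i’ has to do. The names Token, tokenize, and count_matches are hypothetical and chosen for illustration; they are not Cobra's own data structure or API, and the lexer ignores comments, strings, and multi-character operators.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token record; the field names are illustrative, not Cobra's. */
typedef struct Token {
    char *text;                 /* token text, e.g. "i" or "=" */
    const char *kind;           /* coarse category: "ident", "number", "op" */
    int line;                   /* source line on which the token starts */
    struct Token *prev, *next;  /* neighbors in the token sequence */
} Token;

/* Append a token holding a copy of src[0..len-1]; returns the new tail. */
static Token *append(Token *tail, const char *src, size_t len,
                     const char *kind, int line)
{
    Token *t = malloc(sizeof *t);
    if (!t) return NULL;
    t->text = malloc(len + 1);
    if (!t->text) { free(t); return NULL; }
    memcpy(t->text, src, len);
    t->text[len] = '\0';
    t->kind = kind;
    t->line = line;
    t->prev = tail;
    t->next = NULL;
    if (tail) tail->next = t;
    return t;
}

/* Split src into identifiers, numbers, and single-character operators.
 * Returns the head of the token list (NULL for empty input). */
Token *tokenize(const char *src)
{
    Token *head = NULL, *tail = NULL;
    int line = 1;

    assert(src != NULL);
    while (*src) {
        const char *start = src;
        const char *kind;

        if (*src == '\n') { line++; src++; continue; }
        if (isspace((unsigned char)*src)) { src++; continue; }

        if (isalpha((unsigned char)*src) || *src == '_') {
            while (isalnum((unsigned char)*src) || *src == '_') src++;
            kind = "ident";
        } else if (isdigit((unsigned char)*src)) {
            while (isdigit((unsigned char)*src)) src++;
            kind = "number";
        } else {
            src++;              /* everything else is a one-character token */
            kind = "op";
        }
        tail = append(tail, start, (size_t)(src - start), kind, line);
        if (!tail) break;       /* out of memory: return what was built */
        if (!head) head = tail;
    }
    return head;
}

/* Count the tokens whose text equals the given string. */
int count_matches(const Token *head, const char *text)
{
    int n = 0;
    for (const Token *t = head; t; t = t->next)
        if (strcmp(t->text, text) == 0) n++;
    return n;
}
```

Because every token is an ordinary list node, a query reduces to a single linear pass over the list, which is one plausible reason why such a simple representation can be searched quickly even for large code bases.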

The Cobra tool can be used in one of three modes: (1) as an interactive query engine that matches patterns expressed in a simple query language, (2) as a platform for running inline Cobra programs that can contain arbitrary branching and iteration over the lexical tokens in the input stream to identify more complex types of patterns, and (3) as an infrastructure for building more elaborate standalone checkers in C that are compiled separately and linked with the Cobra preprocessor that builds the central data structure. Navigation operators that can be used to traverse that data structure are predefined, and frequently used meta-information about the token sequence is precomputed (e.g., links between pairs of matching braces, brackets, or parentheses, the nesting level, and the length of identifiers). The tool supports the use of multi-threading to increase the speed of query processing for large code bases.
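The kind of precomputed meta-information mentioned above can be illustrated with a short sketch in C. It makes one pass over an array of tokens with a stack, recording for each token its nesting depth and, for each brace, bracket, or parenthesis, a link to its matching partner. The Tok record, its depth and jump fields, and the function link_brackets are assumptions made for this illustration and are not taken from the Cobra sources.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 256   /* illustrative limit on nesting depth */

/* Hypothetical token record; field names are illustrative only. */
typedef struct Tok {
    const char *text;   /* token text */
    int depth;          /* number of enclosing open brackets of any kind */
    struct Tok *jump;   /* matching bracket partner, or NULL */
} Tok;

/* Return the closing character for an opening bracket, or 0 otherwise. */
static char closer_for(char open)
{
    switch (open) {
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default:  return 0;
    }
}

/* Fill in depth and jump for all n tokens.
 * Returns 0 if all brackets balance, -1 on a mismatch or excessive nesting
 * (in which case the annotations are only partially filled in). */
int link_brackets(Tok *toks, size_t n)
{
    Tok *stack[MAX_DEPTH];
    int top = 0;

    assert(toks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        Tok *t = &toks[i];
        char c = t->text[0];
        int single = (c != '\0' && t->text[1] == '\0');

        t->jump = NULL;
        if (single && closer_for(c)) {
            if (top == MAX_DEPTH) return -1;
            t->depth = top;          /* an opener sits at the outer level */
            stack[top++] = t;
        } else if (single && (c == ')' || c == ']' || c == '}')) {
            if (top == 0 || closer_for(stack[top - 1]->text[0]) != c)
                return -1;
            Tok *open = stack[--top];
            t->depth = top;          /* same level as its opener */
            t->jump = open;          /* link the pair in both directions */
            open->jump = t;
        } else {
            t->depth = top;
        }
    }
    return top == 0 ? 0 : -1;
}
```

With such links in place, a checker can skip from an opening brace directly to its closing partner in constant time instead of rescanning the tokens in between.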

The tool infrastructure assumes a standard UNIX or UNIX-like (Linux, OS X, Cygwin, Solaris) environment, and the availability of a standard C compiler like gcc or clang. A collection of queries, inline Cobra programs, and standalone checkers has been developed as examples. The set includes checkers that can create a function call graph, build a control flow graph for a function, or compute the cyclomatic complexity of functions. Sample queries issue warnings about suspicious use of binary operators when insufficient parentheses are used to fix the order of evaluation (a frequent source of errors in C code). Most queries or checkers can be written and tested in minutes.
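As an illustration of a purely lexical check of this kind, the sketch below, in C, counts places where a bitwise operator (&, |, ^) and a comparison operator appear at the same parenthesis level of one expression, as in ‘x & MASK == 0’, which C parses as ‘x & (MASK == 0)’. This is only one assumed instance of a "suspicious binary operator" rule, not a reproduction of the sample queries distributed with Cobra. The function count_suspicious and its flag names are hypothetical, and because the check looks at token text alone it will also flag a unary ‘&’ (address-of) that shares a level with a comparison.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64    /* illustrative limit on parenthesis nesting */

enum { SAW_BITWISE = 1, SAW_COMPARE = 2, REPORTED = 4 };

static int is_bitwise(const char *s)
{
    return !strcmp(s, "&") || !strcmp(s, "|") || !strcmp(s, "^");
}

static int is_compare(const char *s)
{
    return !strcmp(s, "==") || !strcmp(s, "!=") || !strcmp(s, "<") ||
           !strcmp(s, ">")  || !strcmp(s, "<=") || !strcmp(s, ">=");
}

/* Count expressions that mix bitwise and comparison operators at the same
 * parenthesis level. Returns -1 if nesting exceeds MAX_DEPTH. */
int count_suspicious(const char *const *toks, size_t n)
{
    unsigned seen[MAX_DEPTH] = {0};   /* operator flags per nesting level */
    int depth = 0, warnings = 0;

    assert(toks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const char *s = toks[i];

        if (strcmp(s, "(") == 0) {
            if (depth + 1 == MAX_DEPTH) return -1;
            seen[++depth] = 0;        /* fresh flags for the inner level */
        } else if (strcmp(s, ")") == 0) {
            if (depth > 0) depth--;   /* resume the enclosing level */
        } else if (strcmp(s, ";") == 0) {
            depth = 0;                /* end of statement: start over */
            seen[0] = 0;
        } else if (strcmp(s, ",") == 0) {
            seen[depth] = 0;          /* next argument or expression */
        } else {
            if (is_bitwise(s)) seen[depth] |= SAW_BITWISE;
            if (is_compare(s)) seen[depth] |= SAW_COMPARE;
            if ((seen[depth] & (SAW_BITWISE | SAW_COMPARE | REPORTED))
                    == (SAW_BITWISE | SAW_COMPARE)) {
                seen[depth] |= REPORTED;   /* warn once per level */
                warnings++;
            }
        }
    }
    return warnings;
}
```

Adding explicit parentheses, as in ‘(x & MASK) == 0’, moves the bitwise operator to its own nesting level, so the same token scan no longer reports it.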

This work was done by Gerard J. Holzmann of Caltech for NASA’s Jet Propulsion Laboratory. This software is available for license through the Jet Propulsion Laboratory, and you may request a license at: NPO-50050