Nokome Bentley, 19 June 2019

From blobs to trees: semantic code analysis

What and why?

Semantic analysis of code aims to determine its meaning, i.e. what the code does, as opposed to its shape (which is syntactic analysis). Within Stencila we do semantic analysis of code for two main purposes: (1) code cell dependency analysis, and (2) package requirements analysis.

We do code cell dependency analysis to enable a reactive programming model. This model, familiar to any user of a spreadsheet, requires a graph of the dependencies between code cells. Currently, we do this analysis within execution contexts by analysing each code cell to see which variables are inputs and which are outputs. For example, the stencila/r package does this by piggybacking on the CodeDepends package.
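To make that concrete, here is a minimal sketch in TypeScript (with made-up names like Cell and dependencyGraph, not Stencila's actual code) of how the input and output variables of each cell can be turned into a dependency graph:

interface Cell {
  id: string
  inputs: string[]   // variables the cell reads
  outputs: string[]  // variables the cell writes
}

// Map each variable to the cell that produces it, then link every cell
// to the cells that produce its inputs.
function dependencyGraph(cells: Cell[]): Map<string, string[]> {
  const producers = new Map<string, string>()
  for (const cell of cells) {
    for (const name of cell.outputs) producers.set(name, cell.id)
  }
  const graph = new Map<string, string[]>()
  for (const cell of cells) {
    const dependsOn: string[] = []
    for (const name of cell.inputs) {
      const producer = producers.get(name)
      if (producer !== undefined && producer !== cell.id) dependsOn.push(producer)
    }
    graph.set(cell.id, dependsOn)
  }
  return graph
}

// Cell 'b' reads x, which cell 'a' writes, so 'b' must re-run whenever 'a' does
dependencyGraph([
  { id: 'a', inputs: [], outputs: ['x'] },
  { id: 'b', inputs: ['x'], outputs: ['y'] }
])
// Map { 'a' => [], 'b' => ['a'] }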

We do package requirements analysis (i.e. working out which packages your code requires) within Dockta so that we can build Docker images for a project. We do this by detecting package import statements. We wanted to keep that code within JavaScript, rather than having to rely on being able to spawn an R or Python session just to determine which packages some code needs. For R and Python we do that using regular expressions (e.g. for Python). For JavaScript, we use the detective package, which parses the entire source code to find require() calls.
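As an illustration of the regex approach (the pattern below is a deliberately simplified sketch, not the expression Dockta actually uses), detecting Python requirements from TypeScript can be as little as:

// Simplified sketch: match `import pkg` and `from pkg import name` at the
// start of a line and collect the package names.
const PYTHON_IMPORT = /^\s*(?:import|from)\s+(\w+)/gm

function pythonRequirements(code: string): string[] {
  const packages = new Set<string>()
  for (const match of code.matchAll(PYTHON_IMPORT)) {
    packages.add(match[1])
  }
  return [...packages]
}

pythonRequirements('import numpy\nfrom pandas import DataFrame')
// [ 'numpy', 'pandas' ]

A pattern like this inevitably misses edge cases (multiple imports on one line, imports inside strings, dynamic imports), which is part of why a real parse tree is attractive.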

Trees from blobs

Instead of doing these different forms of semantic analysis using different methods (i.e. regexes v parsing and AST walking) within different environments (i.e. within R v within Python v within JavaScript), we are planning to consolidate this functionality within Encoda. It's a natural fit for Encoda because that's the place where we handle the parsing of "blobs" of content (e.g. a binary blob of a .docx, or a text blob of Markdown) into a tree of structured, semantic document nodes. Our plan is to treat source code in the same way as these other formats: taking blobs of source code (e.g. a Python script, or an R code cell) and parsing them into a tree of nodes that can then be analysed for different purposes (e.g. package requirements analysis or cell dependency analysis).
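To sketch what that could look like (the node types below are hypothetical and only for illustration, not Encoda's actual schema), a code blob might be parsed into typed nodes that both analyses can walk:

// Hypothetical node types, for illustration only
interface Import {
  type: 'Import'
  package: string
  alias?: string
}

interface CodeChunk {
  type: 'CodeChunk'
  language: string    // e.g. 'python', 'r'
  imports: Import[]   // feeds package requirements analysis
  uses: string[]      // variables read: feeds cell dependency analysis
  assigns: string[]   // variables written
}

// The Python cell `from mod import a as b` might become:
const chunk: CodeChunk = {
  type: 'CodeChunk',
  language: 'python',
  imports: [{ type: 'Import', package: 'mod', alias: 'b' }],
  uses: [],
  assigns: ['b']
}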

The "Lake Tree", Wanaka, NZ. Bernard Spragg, CC0 1.0 Universal.

Github's Semantic Library

Given that, last week I was interested to see Github introduce a "jump-to-definition" feature. Under the hood, that feature uses Github's own Haskell library for semantic code analysis, Semantic.

As the README says, Semantic basically:

  1. Reads blobs [of raw source code].
  2. Generates parse trees for those blobs with tree-sitter (an incremental parsing system for programming tools).
  3. Assigns those trees into a generalized representation of syntax.
  4. Performs analysis, computes diffs, or just returns parse trees.
  5. Renders output in one of many supported formats.

The part of that I was most excited about was "generalized representation of syntax". It would be great if we had a program that we could leverage in Encoda for semantic analysis of source code, in the same way that we use Pandoc for text documents. I had previously looked at Babelfish and its Universal Abstract Syntax Tree (UAST) for this but hadn't yet experimented with it.

Test driving Semantic

I was interested to see what the tree generated by Semantic looked like: specifically, how it parsed import statements, and whether it was a viable replacement for the current regex-based approach we use in Dockta. So, I took it for a test drive...

After a brief, failed attempt at building locally (on Ubuntu 16.04), I opted to build Semantic's Docker image instead:

git clone git@github.com:github/semantic
cd semantic
docker build --tag semantic .

You can then run it against some test files using Docker's run command:

docker run -v $(pwd):/work semantic parse /work/test.py

Let's unpack that command a little:

  • the -v $(pwd):/work option mounts the current working directory into the container at /work (it's a useful way to run a containerized executable on local files without having to copy them into the container)
  • the semantic parse /work/test.py part tells Semantic to parse test.py and print its parse tree (as S-expressions by default)

Currently, Semantic supports Ruby, JavaScript, TypeScript, Python, Go, PHP, Java, JSON, JSX, Haskell and Markdown. I created three test files for Python, JavaScript and Go, three languages with similar import statement semantics.

test.py

import mod
from mod import a as b

test.js

import mod from 'mod'
import {a as b} from 'mod'

test.go

import "mod"
import alias "mod"

I then parsed each of these files into S-expressions (Semantic's default output) to examine the basic structure of the generated trees:

> docker run -v $(pwd):/work semantic parse /work/test.js
(Statements
 (Import)
 (Import))
> docker run -v $(pwd):/work semantic parse /work/test.py
(Statements
  (QualifiedImport
    (Identifier))
  (Import
    (Alias
      (Identifier)
      (Identifier))))
> docker run -v $(pwd):/work semantic parse /work/test.go
(Statements
  (QualifiedImport
    (Identifier))
  (QualifiedImport
    (Identifier)))

I was hoping Semantic would return a relatively simple tree using its "generalized representation of syntax" and that the trees would be similar across languages. Instead, I was surprised at how much variation there was in node types in each of the trees.

To look further into these trees, I generated JSON output using the --json flag. If we were to use Semantic in Encoda, this JSON would be our primary interface between the two. The results (in the appendix below) confirm the diversity of node types. We could write code to normalize these trees, but that seems to defeat the purpose of using Semantic.
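As a rough sketch (based only on the Python and JavaScript outputs shown in the appendix below), even a first cut at pulling package names out of that JSON already needs per-language handling of the different term and property names:

// Rough sketch: extract package names from Semantic's --json output,
// handling only the node shapes seen in the appendix below.
function packagesFromSemantic(output: any): string[] {
  const packages = new Set<string>()
  for (const { tree } of output.trees) {
    for (const node of tree.statements ?? []) {
      if (node.term === 'QualifiedImport') {
        // Python `import mod`: a list of Identifier nodes
        for (const id of node.qualifiedImportFrom ?? []) packages.add(id.name)
      } else if (node.term === 'Import') {
        const from = node.importFrom ?? {}
        // Python `from mod import ...`: a QualifiedName with `paths`
        if (Array.isArray(from.paths)) for (const p of from.paths) packages.add(p)
        // JavaScript `import ... from 'mod'`: a path object with `unPath`
        if (typeof from.unPath === 'string') packages.add(from.unPath)
      }
    }
  }
  return [...packages]
}

Even with just two languages and two import forms, the function already encodes language-specific knowledge, which is exactly the normalization we were hoping the "generalized representation of syntax" would take care of.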

We'll definitely keep an eye on Semantic, but in the short term, at least for package requirements analysis, we'll probably stick with regexes, despite their potential problems.

Appendix

For brevity, the JSON outputs below don't include sourceRange and sourceSpan properties that provide information on the location of a node in the source file.

JSON output from parsing test.py

{
  "trees": [
    {
      "path": "/work/test.py",
      "language": "Python",
      "tree": {
        "term": "Statements",
        "statements": [
          {
            "term": "QualifiedImport",
            "qualifiedImportFrom": [
              {
                "term": "Identifier",
                "name": "mod"
              }
            ]
          },
          {
            "term": "Import",
            "importFrom": {
              "tag": "QualifiedName",
              "paths": [
                "mod"
              ]
            },
            "importSymbols": [
              {
                "term": "Alias",
                "aliasName": {
                  "term": "Identifier",
                  "name": "b"
                },
                "aliasValue": {
                  "term": "Identifier",
                  "name": "a"
                }
              }
            ]
          }
        ]
      }
    }
  ]
}

JSON output from parsing test.js

{
  "trees": [
    {
      "path": "/work/test.js",
      "language": "JavaScript",
      "tree": {
        "term": "Statements",
        "statements": [
          {
            "term": "Import",
            "importFrom": {
              "pathIsRelative": "NonRelative",
              "unPath": "mod"
            },
            "importSymbols": [
              {
                "aliasName": "mod",
                "aliasValue": "mod"
              }
            ]
          },
          {
            "term": "Import",
            "importFrom": {
              "pathIsRelative": "NonRelative",
              "unPath": "mod"
            },
            "importSymbols": [
              {
                "aliasName": "b",
                "aliasValue": "a"
              }
            ]
          }
        ]
      }
    }
  ]
}