Within 7 months of the launch of the Center for Causal Discovery (CCD), Dr. Joseph Ramsey and his colleagues in the Algorithm Development group successfully scaled up several Directed Acyclic Graph (DAG) search algorithms, including a robust version of Greedy Equivalence Search (GES), to recover sparse linear structures on up to 1 million variables from 1,000 samples of simulated data. This is an exceptional achievement, given that these algorithms had previously been restricted to a few thousand variables.

Dr. Ramsey combined algorithmic improvements, implementation improvements, and parallelization to achieve this milestone with excellent accuracy: 3% false positives and 5% false negatives. The million-variable search takes about 2 days to run on a node with 40 processors and 384 GB of RAM at the Pittsburgh Supercomputing Center; a 50,000-variable search runs on a 4-core laptop in less than 30 minutes.
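
To make the accuracy figures concrete, the following minimal Python sketch shows one common way such error rates can be computed when the true graph is known, as it is with simulated data. It is an illustration only and not the CCD implementation. The function name `edge_error_rates`, the toy edge lists, and the definitions used here are assumptions: the false positive rate is taken as the share of estimated adjacencies absent from the true graph, and the false negative rate as the share of true adjacencies that were missed. The announcement does not state which definitions were used.

```python
def edge_error_rates(true_edges, estimated_edges):
    """Return (false_positive_rate, false_negative_rate) for adjacencies.

    Assumed definitions (not taken from the announcement):
      false positive rate = estimated adjacencies not in the true graph
                            / all estimated adjacencies
      false negative rate = true adjacencies missing from the estimate
                            / all true adjacencies
    Edge direction is ignored; only adjacencies are compared.
    """
    # Represent each edge as an unordered pair so (A, B) matches (B, A).
    true_adj = {frozenset(edge) for edge in true_edges}
    est_adj = {frozenset(edge) for edge in estimated_edges}

    false_positives = est_adj - true_adj
    false_negatives = true_adj - est_adj

    fp_rate = len(false_positives) / len(est_adj) if est_adj else 0.0
    fn_rate = len(false_negatives) / len(true_adj) if true_adj else 0.0
    return fp_rate, fn_rate


# Hypothetical toy graphs, for illustration only.
true_edges = [("X1", "X2"), ("X2", "X3"), ("X3", "X4"), ("X4", "X5")]
estimated_edges = [("X2", "X1"), ("X2", "X3"), ("X3", "X4"), ("X1", "X5")]

fp_rate, fn_rate = edge_error_rates(true_edges, estimated_edges)
print(f"false positives: {fp_rate:.0%}, false negatives: {fn_rate:.0%}")
```

In this toy case one of the four estimated adjacencies is spurious and one of the four true adjacencies is missed, so both rates come out at 25%.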

Ongoing work focuses on further speeding up the search and on ensuring that accuracy is maintained for denser graphs and for non-linear, non-Gaussian data distributions.

“This achievement is remarkable given where we were just six months ago. It will help make causal discovery a routine scientific tool in the very near future,” notes Dr. Clark Glymour, who leads the Algorithm Development effort for the CCD.

These algorithms, and the application programming interfaces (APIs) through which they can be run, will be made available on the CCD GitHub site as they are hardened for applied use in biomedical research.