About Cure


Big Data 4 Better Health




Introduction to the BD4BH project

The BD4BH research project is funded by a five-million-dollar CURE grant from the Pennsylvania Department of Health. This three-year project started in 2015 and includes over 40 researchers from Carnegie Mellon University, the University of Pittsburgh, the University of Pittsburgh Medical Center (UPMC), and the Pittsburgh Supercomputing Center (PSC).

The overall goal of the project is to develop new machine learning methods that improve the analysis of the large amounts of data (big data) now being obtained on patients with cancer. The particular focus is on methods for better predicting cancer outcomes (e.g., the probability that a specific treatment will prevent cancer recurrence) using clinical data in combination with molecular data that have been gathered from a patient's cells, such as gene-mutation data.

The project is exploring proof-of-principle approaches that are designed to lead to clinical trials. The main focus of the project is on breast and lung cancer; however, the methods developed are applicable to other forms of cancer, as well as other diseases.

Highlights of our work so far

Curating and analyzing breast and lung cancer patient data

We have access to detailed molecular data from over 1000 breast cancer patients, a subset of whom (150 patients) were treated at UPMC. For this group, we have analyzed medical records to extract a rich set of clinical information to use in defining and predicting cancer outcomes. In addition, we have clinical and molecular data on approximately 500 patients with lung cancer, which we are also in the process of analyzing.

Developing efficient methods that find informative patterns in the data

We have developed methods that analyze the curated cancer data to find clinical and molecular patterns that are predictive of cancer outcomes. Given the size and complexity of the data in this project (several billion pieces of information per patient), we have had to develop machine-learning methods that are highly efficient and that utilize the computational power of the PSC. Doing so allows informative patterns to be discovered in the data in a matter of minutes, rather than the hours or even days previously required.

Using patterns in the data to predict cancer outcomes

We have developed machine learning algorithms that use the patterns found in the data to predict cancer outcomes. For example, there are currently several possible drugs for treating breast cancer patients, and determining which drug to prescribe is often based on clinical trial-and-error, which can take considerable time and affect overall treatment success. The methods we are developing are designed to tailor treatments based on an individual patient's clinical and molecular profiles. To date we have tested these methods on data from Western Pennsylvania patients who were treated with two popular drugs, Arimidex and Tamoxifen. The results support the view that these methods can improve patient response compared to initial treatments that are based on clinical variables alone. We are now testing some of the predictions of the methods; in particular, we are performing cancer cell line experiments in collaboration with cancer experimentalists and physicians.

Educational outreach

An important part of the CURE BD4BH project is educational outreach. We are collaborating with faculty and students at Lincoln University (Oxford, PA), which is the first degree-granting historically black university in the U.S. Overall, seventeen undergraduate students from Lincoln University have participated in 10-week internships in Pittsburgh on projects related to the analysis of big biomedical data. These students conducted cutting-edge research under the guidance of BD4BH faculty who have expertise in biology, biomedical informatics, and machine learning.

Health Research Annual Progress Report

This annual progress report covers the period from July 1, 2017 to June 30, 2018. It explains what the Nonformula Grant is and gives the purpose of the project, a project overview, and a summary of the research completed.