New statistical framework to increase the reliability of data-driven biomedical research on biomolecules

Biomolecules such as genes and proteins are basic elements that regulate the functions of cells. Among the numerous biomolecules, scientists are especially interested in those that exhibit different abundances between two conditions (called “differential biomolecules”) and those that have higher abundances in the condition of interest than in the negative control condition (called “enriched biomolecules”). For example, if a gene is abundantly expressed in patients with a disease but has low expression in healthy people, it is reasonable to hypothesize that this gene is associated with the disease. The discovery of these differential or enriched biomolecules (called “interesting biomolecules”) is usually an indispensable step in data-driven biomedical research and in the formation of new hypotheses. A critical statistical question is whether the discoveries are reliable or contain too many false signals.

Now, three UCLA researchers have come up with a statistical framework that increases the reliability of identifying differential or enriched biomolecules from high-throughput experiments that measure thousands of biomolecules under two conditions. The research was published in October in the journal Genome Biology and has been accessed more than 4,000 times since then.

“The identified interesting biomolecules are subject to downstream experimental validation, which is often expensive and laborious,” said Jingyi “Jessica” Li, the study’s corresponding author and a UCLA associate professor of statistics. “Thus, researchers demand reliable discoveries that contain few false discoveries.”

False discovery rate (FDR) is a statistical concept that describes the reliability of discoveries identified by a method. The smaller the FDR, the more reliable the discoveries. This concept has been widely used as a criterion in bioinformatics tools for analyzing various biological data, such as genomics and proteomics data, and identifying interesting biomolecules.
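
As a rough illustration of the idea, the FDR is the expected value of the false discovery proportion: the fraction of reported discoveries that are not truly interesting. The sketch below computes that proportion for a single analysis. The function name and the gene labels are hypothetical and chosen only for illustration, and the true labels are assumed to be known, which is the case only in simulations or validated benchmarks.

```python
def false_discovery_proportion(discovered, truly_interesting):
    """Fraction of discoveries that are false; 0.0 if nothing was discovered."""
    discovered = set(discovered)
    if not discovered:
        return 0.0
    # A false discovery is anything reported that is not truly interesting.
    false_hits = discovered - set(truly_interesting)
    return len(false_hits) / len(discovered)


# Hypothetical example: 5 biomolecules reported, 4 of them truly interesting.
fdp = false_discovery_proportion(
    discovered=["g1", "g2", "g3", "g4", "g5"],
    truly_interesting=["g1", "g2", "g3", "g4", "g9"],
)
print(fdp)  # 0.2
```

A method that controls the FDR at 5% keeps the average of this proportion, over repeated experiments, at or below 0.05.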

Existing methods that claim to control the FDR (that is, to keep the FDR of their discoveries below a pre-specified threshold) primarily rely on “p-values,” whose calculation requires either strong statistical assumptions about the data or large sample sizes. However, because experiments are expensive, sample sizes are often small, so the assumptions cannot be verified and become questionable, Li said.
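
The article does not name a specific p-value-based procedure. As one well-known representative, the sketch below implements the Benjamini-Hochberg procedure, which takes one p-value per biomolecule and a target FDR `q`. The function name and the example p-values are illustrative assumptions, and the guarantee holds only if the p-values themselves are valid, which is exactly what small sample sizes put in doubt.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return sorted indices of discoveries at target FDR q (Benjamini-Hochberg)."""
    m = len(p_values)
    # Rank the hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Everything ranked at or below k_max is declared a discovery.
    return sorted(order[:k_max])


# Made-up p-values for five biomolecules.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
hits = benjamini_hochberg(p, q=0.05)
print(hits)  # [0, 1]
```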

“We looked into some existing bioinformatics tools, including some highly cited ones, and had a surprising finding that many of them discovered many more false discoveries than expected,” said Li, who is also head of the Junction of Statistics and Biology laboratory. “These unreliable analysis results would inevitably lead to problematic scientific findings.”

Li, together with Xinzhou Ge and Yiling “Elaine” Chen, two Ph.D. graduates of the UCLA department of statistics, has designed a statistical framework to address this reliability crisis. Their framework, called “Clipper,” is able to control the FDR of its discoveries without requiring a large sample size or making strong statistical assumptions.

Unlike existing methods, which control the FDR by calculating p-values for potential discoveries, Clipper introduces a new concept, the “contrast score,” which summarizes each biomolecule's measurements under the two conditions and describes how interesting that biomolecule is. Clipper sets a cutoff on all contrast scores and, as its name indicates, selects a small portion of interesting discoveries out of a large number of biomolecules. The advantages of Clipper are “flexibility and robustness,” Li said. The contrast scores of Clipper can take different forms so that it can handle different data characteristics and analysis tasks.
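
The article does not spell out how the contrast score or the cutoff is computed, so the following is only a minimal sketch under stated assumptions, not the Clipper package itself. It assumes the contrast score is the difference between the mean measurement in the condition of interest and the mean in the control, and that the cutoff is the smallest value at which the count of strongly negative scores, used as a stand-in for false signals, is small relative to the count of strongly positive scores. The function names and the toy measurements are hypothetical.

```python
from statistics import mean


def contrast_scores(experimental, background):
    """Difference of mean measurements per biomolecule (one assumed score format)."""
    return [mean(e) - mean(b) for e, b in zip(experimental, background)]


def select_by_cutoff(scores, q=0.05):
    """Pick the smallest cutoff t whose estimated false discovery proportion is <= q.

    Scores at or below -t serve as an estimate of how many false signals
    sit at or above t. Returns (cutoff, selected indices); cutoff is None
    if no candidate cutoff meets the target.
    """
    candidates = sorted({abs(s) for s in scores if s != 0})
    for t in candidates:
        n_pos = sum(1 for s in scores if s >= t)
        n_neg = sum(1 for s in scores if s <= -t)
        if (1 + n_neg) / max(n_pos, 1) <= q:
            return t, [i for i, s in enumerate(scores) if s >= t]
    return None, []


# Toy data: 8 biomolecules, 3 replicates per condition (made-up numbers).
experimental = [[10, 11, 12], [9, 10, 11], [8, 9, 10], [7, 8, 9],
                [6, 7, 8], [5, 6, 7], [1, 2, 3], [2, 3, 4]]
background = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3],
              [1, 2, 3], [1, 2, 3], [2, 3, 4], [1, 2, 3]]

scores = contrast_scores(experimental, background)
cutoff, selected = select_by_cutoff(scores, q=0.2)
print(cutoff, selected)
```

With only eight biomolecules, a target as strict as 0.05 could never be met in this sketch, because at least 20 positive scores would be needed; the toy example therefore uses `q=0.2`. Real experiments measure thousands of biomolecules, which is the setting the article describes.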

In comprehensive tests on both simulated and actual data, all of which provide empirical evidence for what Li calls truly interesting discoveries, Clipper outperformed existing methods, consistently controlling the FDR while achieving good statistical power. Clipper ensures reliability by dropping strong assumptions and making the analysis more transparent and even simpler, she said.

“We hope to advocate our method to the broad scientific research community to improve the reliability of data analysis,” Li said.

The open-source software is freely available online as a package for R, a widely used programming environment for statistical analysis.

The researchers’ next goal is to extend Clipper into stand-alone bioinformatics methods for specific data analyses.

This research was supported by grants from the National Science Foundation, the National Institute of General Medical Sciences, the Alfred P. Sloan Foundation, the W.M. Keck Foundation, and Johnson & Johnson.

Article by Professor Jingyi Jessica Li, Xinzhou Ge and Stuart Wolpert