New Machine Learning Model Accurately Identifies Metabolites

A new open-source machine learning model called LC-MS²Struct can identify metabolites more accurately than existing methods for analyzing liquid chromatography-mass spectrometry (LC-MS) data. By making identifications more reliable, the model may benefit drug discovery, diagnostics, and fundamental research. The research was recently published in Nature Machine Intelligence.

Metabolites are small molecules that transfer energy and relay cellular information around the human body. Because they are so small, they are difficult to tell apart with traditional blood sample analysis techniques such as LC-MS. In LC-MS, the sample is first run through a liquid chromatography (LC) column, which separates the metabolites because each one takes a different amount of time, known as its retention time, to pass through. A mass spectrometer then sorts the metabolites by mass. This two-part process lets researchers reliably identify some of the metabolites in a sample, but the method is far from perfect. “Even the best methods can’t identify more than 40 percent of the molecules in samples without making some assumptions about the candidate molecules,” says study author Juho Rousu, professor at Aalto University.

This new metabolite identification model, developed by researchers from Aalto University and the University of Luxembourg, sidesteps the issues associated with LC-MS identification because it was trained on information harvested from dozens of laboratories worldwide. Until now, metabolite data could not be compared between laboratories because the retention times measured in one laboratory differ from those measured in another. Eric Bach, study co-author and doctoral student at Aalto, devised a solution to this problem. “Our research shows that while absolute retention times may vary, the retention order is stable across measurements by different labs,” Bach explained in a news release. “This allowed us to merge all publicly available data on metabolites for the first time ever and feed it into our machine learning model.”
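To make Bach's point about retention order concrete, here is a minimal Python sketch. It is not the LC-MS²Struct method or code. The function name, the compound labels, and the retention times are all made up for illustration. It only checks, for two hypothetical labs, what fraction of compound pairs come out of the column in the same order, even though the absolute times differ.

```python
from itertools import combinations


def retention_order_agreement(lab_a, lab_b):
    """Return the fraction of shared compound pairs that elute in the
    same order in both labs (1.0 means identical retention order)."""
    shared = sorted(set(lab_a) & set(lab_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two compounds measured by both labs")
    # A pair agrees when the sign of the time difference matches in both labs.
    concordant = sum(
        (lab_a[x] - lab_a[y]) * (lab_b[x] - lab_b[y]) > 0 for x, y in pairs
    )
    return concordant / len(pairs)


# Hypothetical retention times in minutes; the values are illustrative only.
lab_a = {"compound_1": 1.2, "compound_2": 3.4, "compound_3": 5.9, "compound_4": 8.1}
lab_b = {"compound_1": 2.0, "compound_2": 6.5, "compound_3": 7.1, "compound_4": 15.3}

agreement = retention_order_agreement(lab_a, lab_b)
print(agreement)
```

In this made-up example no compound has the same retention time in both labs, yet every pair elutes in the same order, which is the kind of stability that lets measurements from different labs be combined.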

Because it was built on such a wide breadth of data from so many labs, LC-MS²Struct can accurately distinguish between stereochemical variants, such as mirror-image molecules, a capability that no other metabolite identification program has. This ability is “expected to open up new avenues in drug design and other fields.”

“This new open-source model offers the whole research community an enriched view of small molecules. It will help research into methods to identify metabolic disorders, such as diabetes, or even cancer,” Rousu added.

The LC-MS²Struct source code is available on GitHub.