Machine learning came along at just the right time. The world is now awash in more data than ever before, and computer algorithms that can learn and improve as they perform data analysis promise to help scientists handle that information overload.
Yet researchers who think that machine learning by itself can help solve complex problems in science, engineering, and medicine should strive for a more balanced approach, says Roman Grigoriev. He is part of a School of Physics team whose new research suggests a hybrid approach to conducting science that blends new-era technologies, old-school experimentation, and theoretical analysis. The research suggests faster solutions to complex, data-intensive riddles involving such issues as cancer, earthquakes, weather forecasts, and climate change.
“It’s a combination of existing theoretical understanding, as well as experimental data, with machine learning,” says Grigoriev, Physics professor and lead investigator of the Dynamics and Control Group. “Oftentimes people who do machine learning kind of forget about theoretical understanding and almost rely totally on data. It’s relatively simple, but when there’s a lot of data and not enough structure in that data, that approach is bound to fail.” Grigoriev explains that there’s often just too much data to meaningfully analyze, at which point “the problem becomes intractable. Essentially, harnessing appropriate domain knowledge is critical for finding structure in the data.”
The study, “Robust learning from noisy, incomplete, high-dimensional experimental data via physically constrained symbolic regression,” was published in May in Nature Communications. Fellow School of Physics researchers involved in the study are Michael Schatz, professor and the School’s interim chair; graduate research assistant Logan Kageorge; and former graduate research assistant Patrick A.K. Reinbold.
The problem with high-dimensional data
Machine learning uses computer algorithms to find patterns in data, but “most popular machine learning approaches present results in a form that is hard to interpret and explain,” Grigoriev says. “Unless you understand the how and the why, you can’t really say you understand a problem.”
Understanding and predicting complicated behaviors by crunching a lot of dense, rich data can help with fundamental and practical problems in scientific arenas like weather forecasting and characterizing cardiac arrhythmias. The problem is that most of those arenas involve “high-dimensional” data, which means exactly what it sounds like: data with a lot of dimensions or variables, sometimes millions of them.
The dimensionality of the data is so large that “you get lost and it’s hard to see any trends,” Grigoriev says.
His team has come up with a hybrid approach that blends machine learning with elements of the traditional process of scientific discovery. That means formulating a theoretical description, making observations, designing experiments to test the description, and “then going back and forth between improving the theories, and designing new experiments. That’s been the traditional approach for hundreds of years.”
Scientific understanding and progress rest on that scientific method, the combination of theory and experimentation. “They’re not developed just based on the data. They are developed using both existing knowledge as well as some general fundamental laws.”
An approach that spotlights the beauty of equations
Constraining the data to include just those variables that pertain directly to the experiment in question is vital in working with high-dimensional data, Grigoriev says.
“What this approach allows you to do is identify a simpler model that uses the variables you need. It’s a simplified description that applies to a particular situation, but obtained using data that’s computational or experimental. It can do both.”
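To make the idea of identifying "a simpler model that uses the variables you need" concrete, here is a minimal Python sketch of one generic technique for doing that, sequentially thresholded least squares. It is not the team's method, and everything in it is an assumption made for illustration: the toy logistic system, the noise level, the candidate terms, and the hypothetical helper names `logistic` and `sparse_fit`. The code fits a lightly noisy signal against a small library of candidate terms and discards those whose coefficients are negligible.

```python
import numpy as np

def logistic(t, r=1.5, K=3.0, u0=0.1):
    """Exact solution of the toy system du/dt = r*u - (r/K)*u**2 (an assumption)."""
    return K / (1.0 + (K - u0) / u0 * np.exp(-r * t))

def sparse_fit(library, target, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    coef = np.linalg.lstsq(library, target, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0                      # drop negligible candidate terms
        keep = ~small
        if not keep.any():
            break
        # Refit using only the surviving terms
        coef[keep] = np.linalg.lstsq(library[:, keep], target, rcond=None)[0]
    return coef

rng = np.random.default_rng(0)
t = np.linspace(0.0, 6.0, 601)
dt = t[1] - t[0]
# Synthetic "measurement": exact solution plus a small amount of noise
u = logistic(t) + rng.normal(scale=5e-4, size=t.size)

# Finite-difference estimate of the time derivative (amplifies noise)
dudt = np.gradient(u, dt)

# Library of candidate terms; only u and u^2 appear in the true equation
names = ["1", "u", "u^2", "u^3"]
library = np.column_stack([np.ones_like(u), u, u**2, u**3])

coef = sparse_fit(library, dudt)
terms = [f"{c:+.2f}*{n}" for c, n in zip(coef, names) if c != 0.0]
print("du/dt =", " ".join(terms))
```

In this toy setup the true equation contains only the `u` and `u^2` terms, so a successful fit keeps those two and zeroes out the constant and cubic candidates. Finite differences amplify measurement noise, so this sketch is only suitable for nearly clean data.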
The result is represented in a mathematical model, Grigoriev says, and “once you see those equations, you understand what the variables are. The equations certainly help explain the essence of a physical problem.” His team’s approach was validated in the research with a fluid dynamics experiment. A thin layer of liquid was suspended in a rectangular tank, with magnetic and electrical fields shot through it to create what physicists call a turbulent flow — irregular shifts happening within the fluid layer that can rapidly change direction and magnitude.
Grigoriev and his team used their hybrid approach to analyze the accessible data, in this case the velocity of the water. Subsequently, they were able to reconstruct variables that couldn’t be measured directly, like water pressure and force.
This is the beauty of the equations — how much they allow you to do, Grigoriev says.
“What we do get is an equation, or set of equations, which are in a familiar form. We know how to explain, how to solve the problem using these equations. This is the nice thing about this approach. We’re working with variables whose meaning we understand; we know how to interpret them.”
The team believes the study’s results will lead to advances like faster, more accurate ways to make predictions of complicated behavior in those large, real-world problems in science, engineering, and medicine. For example, as the team’s paper states, “the ability to identify and quantify important patterns and sequences in atmospheric turbulence should enable weather forecasts that are better and more rapid than those currently possible today.”
This material is based upon work supported by the National Science Foundation under Grants No. CMMI-1725587 and CMMI-2028454. The experimental data used in this work was produced by Jeff Tithof. The magnetic field measurements were performed with assistance from Charles Haynes.
DOI: https://doi.org/10.1038/s41467-021-23479-0