When it comes to scientific circles, data science may be a new kid on the block, but it’s rapidly become everyone’s best friend. A highly interdisciplinary field that blends statistics, computing, algorithms, applied mathematics, and visualization, data science uses automated methods to gather and extract knowledge from very large or complex sets of data. “Data science is difficult to explain because any way you define it, you’re usually excluding something that is critically important,” said Dana Randall, a professor in Georgia Tech’s School of Computer Science.
Granted, people have been collecting and crunching numbers for a long time, but over the past decade several things have changed, noted Charles Isbell, senior associate dean and a professor in Georgia Tech’s College of Computing. “A lot of data became ubiquitous, algorithmic sophistication has increased dramatically, we can construct complicated models to predict things — and we have the machinery to make it happen. Put all this together, and data science suddenly matters.”
Indeed, instead of being relegated to some niche fields, “data science is becoming pervasive,” agreed Steve McLaughlin, chair of Georgia Tech’s School of Electrical and Computer Engineering (ECE). “There are very few fields in sciences, engineering, humanities, or business that aren’t being drastically impacted by data.”
Take Jeffrey Skolnick, who is leveraging big data and high-performance computing to advance drug development. “Today I’m not only working differently than 20 years ago, I’m working differently than five years ago,” said Skolnick, a professor in Georgia Tech’s School of Biological Sciences and director of the Center for the Study of Systems Biology. “The widespread access to extremely large data sets to learn, train, and test on is a sea change. We’ve figured out ways of using predictive structures rather than experimental ones, which can save time and money.”
Pharma companies typically spend more than $1 billion and take 10 to 15 years to develop a new drug. Yet only one in 5,000 compounds actually makes it from the lab to the medicine chest. On the brighter side, the U.S. Food and Drug Administration has approved some 1,500 drugs for consumer use, so finding new uses for existing drugs could dramatically accelerate translational medicine — which is Skolnick’s bailiwick.
His research group has built a unique knowledge base by developing algorithms that predict possible structures for 86 percent of human proteins from DNA sequencing. Known as Dr. PRODIS (DRugome, PROteome and DISeasome), this knowledge base can suggest alternative uses for FDA-approved drugs for each protein associated with a disease. “It’s not perfect yet, but the database is very useful for giving you a short list of things to try if existing treatments aren’t working,” Skolnick said. In one success story, the group suggested that a drug originally developed to combat nausea be used to treat a child suffering from a rare form of chronic fatigue.
Drug repurposing is possible because of the fundamental design properties of proteins and the “promiscuity” of drugs, Skolnick said. His researchers have shown there are a limited number (fewer than 500) of ligand-binding pockets, the sites where a drug molecule can bind to a human protein. “So even if you design a drug to target a ligand-binding pocket in one protein, unintended interactions with similar pockets in other proteins can occur because the number of pocket choices is small,” Skolnick explained.
In addition to drug repurposing, Dr. PRODIS can identify human protein targets for new chemical entities along with possible side effects. “We can help pharma companies by suggesting early in the game if their drug has off-target interactions that could cause it to be withdrawn or fail clinical trials,” Skolnick said, noting that Dr. PRODIS’ success rate is about 44 percent. “Again, this isn’t perfect, but it gives you a good tool.”
What’s more, it’s a fast tool. If a pharma company gives Skolnick a drug molecule and wants to know its side effects, he can produce results within an hour — something that five years ago wasn’t possible in any time frame.
Skolnick credits the treasure trove of data now at his disposal for making this kind of network analysis possible. “When you have very large datasets that are diverse, the statistical likelihood of generalizing the data and getting meaningful results is far higher,” he said. “Then, high-performance computing enables you to manipulate the data and learn from it. The goal is to build an algorithm that behaves the same way in the real world as in a controlled environment.”
<drinking from the fire hose>
With unprecedented amounts of data suddenly on tap, the challenge many researchers face is how to consume it.
For example, inexpensive sensor technology has made it easy for power companies to collect data on critical high-value assets such as generators and turbines. Yet analytical technology has lagged behind, inhibiting their ability to make sense out of it, said Nagi Gebraeel, associate professor in Georgia Tech’s School of Industrial and Systems Engineering (ISyE) and associate director of the Strategic Energy Institute.
In response, Gebraeel’s research group is developing a new computational platform to provide detection and predictive analytics for the energy industry. This platform remotely assesses the health and performance of equipment in real time and monitors trends to determine such things as:
- The best time to perform maintenance.
- When to order new parts so they don’t linger in inventory, costing money and possibly becoming obsolete.
- How shutting down one piece of equipment will affect the entire network.
“The latter is especially important because any slack caused by shutting down one generator has to be picked up by the rest of the generators,” Gebraeel said. “Now their lifetime has to be re-evaluated because they are working in overload. That’s where optimization and analytics intersect.”
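Gebraeel’s point about slack can be made concrete with a toy calculation. The sketch below is only an illustration, not the group’s platform: the generator names, the megawatt figures, and the rule of sharing the lost output in proportion to each unit’s capacity are all assumptions made for the example.

```python
def redistribute_load(loads_mw, capacities_mw, offline):
    """Share the offline unit's load among the remaining units.

    Assumption for this sketch: each remaining unit picks up a share
    of the lost output proportional to its rated capacity.
    """
    shed = loads_mw[offline]
    remaining = {g: c for g, c in capacities_mw.items() if g != offline}
    total_capacity = sum(remaining.values())
    return {
        g: loads_mw[g] + shed * cap / total_capacity
        for g, cap in remaining.items()
    }


def utilization(loads_mw, capacities_mw):
    """Return load divided by rated capacity for each unit."""
    return {g: loads_mw[g] / capacities_mw[g] for g in loads_mw}


# Hypothetical three-generator network (all values in megawatts).
loads = {"G1": 90.0, "G2": 60.0, "G3": 40.0}
capacities = {"G1": 100.0, "G2": 100.0, "G3": 50.0}

# Take G3 down for maintenance and see who ends up above rated capacity.
after = redistribute_load(loads, capacities, offline="G3")
overloaded = [g for g, u in utilization(after, capacities).items() if u > 1.0]

print(after)       # new load on each remaining generator
print(overloaded)  # units now running above rated capacity
```

In this made-up network, taking G3 offline pushes G1 past its rated capacity, which is the situation in which the remaining units’ expected lifetimes would need to be re-evaluated.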
By integrating detection, prediction, and optimization capabilities, the new platform could help power companies achieve significant savings. Indeed, a preliminary study shows a 40 to 45 percent reduction in maintenance costs alone.
In the past, there’s been a lot of unnecessary preventative maintenance, Gebraeel pointed out. “Companies do it because of safety, which is rational, but they are being too conservative because they don’t have enough visibility into their assets.”
Key to creating the computational platform is re-engineering older statistical algorithms that were developed in the context of limited data, Gebraeel said. Today’s algorithms must be executed on processing platforms that can handle terabytes and petabytes of data, deployed across a large number of computer nodes.
<breaking the bottleneck>
Similar analytical challenges exist in life sciences.
Over the past decade, the throughput of sequencing DNA (rate at which DNA can be sampled) has increased by a factor of more than 1 million while costs have decreased by a factor of 1 million. “The raw data contains valuable things, yet we don’t know what they are until we analyze it,” said Srinivas Aluru, a professor in Georgia Tech’s School of Computational Science and Engineering (CSE). “The data comes really fast, so you need the ability to analyze it quickly — otherwise it just sits in storage. Analysis is the bottleneck.”
With funding from the National Science Foundation (NSF), Aluru’s research group is developing techniques that can leverage high-performance computing to analyze data as rapidly as it is generated. For example, Patrick Flick, one of Aluru’s graduate students, created parallel algorithms that build an index of the entire human genome on distributed-memory machines in less than 10 seconds, work that won Flick a prestigious “Best Student Paper” award at the 2015 Supercomputing Conference.
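To give a sense of what “indexing” a genome means, here is a minimal single-machine sketch of one common kind of index, a suffix array. It assumes the index is suffix-array-like and uses a naive sort, so it is not the parallel distributed-memory algorithm described above and would not scale to a human genome. The function names and the toy sequence are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction for illustration only: it compares whole suffixes,
    which is far too slow for genome-sized inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every position where pattern occurs, via two binary searches."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


# Toy "genome": build the index once, then answer queries quickly.
genome = "GATTACAGATTACA"
index = build_suffix_array(genome)
print(find_occurrences(genome, index, "TTA"))
```

Once the index exists, each query needs only a logarithmic number of comparisons against the sorted suffixes, which is why fast index construction matters when data arrives quickly.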
Another milestone: The researchers have created a method to predict networks at the entire genome scale — a feat not done before — enabling them to explain how different genes work together in a biological process. This can be used for a wide variety of life-science applications, from determining causes of cancer to advances in plant biology, Aluru said. “For example, we’re working on a biological pathway responsible for nutritional content in plants to see if we can manipulate and improve it.”
Emerging data science tools and techniques are dramatically changing the scale of problems that researchers can tackle, Aluru observed. “Instead of just looking at a single pathway or a few genes, we can go after whole genome scale.”
“There’s no way to pursue any challenging applications of quantum chemistry unless you have access to high-performance computers,” said David Sherrill, a professor in Georgia Tech’s School of Chemistry and Biochemistry who focuses on intermolecular interactions.
He recalls his days as a grad student, when quantum chemistry calculations were extremely hard to do and papers were based on a handful of calculations and data points. “Today it’s a different story,” Sherrill said. “With more sophisticated algorithms, better hardware, and larger clusters of computers, a typical paper is based on hundreds or thousands of quantum calculations, which enables our multiscale models to be more accurate and appropriate.”
Not just users of high-performance computing, Sherrill’s research team is also designing the next generation of quantum chemistry software. “In the last five years, we’ve seen a lot of innovation on the hardware side, such as graphics processing units,” he explained. “That’s forcing us to be smarter about writing software so it easily adapts to different kinds of hardware — something we didn’t worry about 10 years ago.” In light of this change, Sherrill is starting to send his postdocs to computing conferences in addition to traditional chemistry convocations.
Sherrill’s researchers are part of a multiuniversity team designing Psi4, an open-source suite of quantum chemistry software for high-accuracy simulations of molecular properties. The software has a wide range of applications, from understanding how drugs bind to proteins to how crystals pack into a solid.
In addition, Sherrill is one of six principal investigators developing new paradigms for software interoperability, a $3.6 million project funded by NSF. The goal is to create “reusable” software libraries where new features can be used by many different computational chemistry programs. “In the past, different codes have competed with each other, so if one person added a new feature, then everyone had to,” Sherrill explained. “Yet it’s too hard to operate this way. By creating an interoperable library, you’d have much more impact and avoid reinventing the wheel. One small team could add a feature that quickly gets into a variety of different programs.”
Data science is also accelerating the development — and deployment — of new materials, which is key to solving challenges in everything from energy and climate change to health care and security. “Almost every technology is dependent on new materials,” pointed out Dave McDowell, a Regents Professor who holds joint appointments in Georgia Tech’s School of Mechanical Engineering and School of Materials Science and Engineering (MSE).
Due to an emphasis on empirical methods, it has historically taken an average of 15 to 20 years after discovering a material with interesting properties to commercialize it. “Yet thanks to accelerated modeling and simulation protocols, we’re able to assess candidate materials more rapidly for applications,” McDowell said.
He points to fatigue, a major problem in metallic aerospace and automotive structures. New computational techniques can now take a material down to the micron scale, represent its structure digitally, and reproduce the kinds of scatter, variability, and properties seen in laboratory experiments. Even better, this can be done in a couple of weeks versus years in a lab.
Georgia Tech’s materials engineers are unique in their focus on hierarchical materials informatics — a special branch of data science that extracts and communicates knowledge at multiple scales as opposed to only considering a material’s chemical composition.
“This is important because you can have the same chemical composition, but completely different properties at different scales due to how atoms are arranged,” explained Surya Kalidindi, a professor with joint appointments in CSE and MSE, who has written a new textbook on the subject. “Respecting the details at different scales allows you to understand what arrangements are causing the material to respond in a particular way. By changing the arrangement, you can alter the material, making it stronger or weaker or tweaking electronic, magnetic, and thermal properties.”
Beyond solving fundamental problems, Georgia Tech is also helping manufacturers make better decisions. Its Institute for Materials (IMat), launched in 2013, is creating a collaborative materials innovation ecosystem among researchers, industry, national labs, and other universities to link basic research with product development, manufacturing scale-up, and process selection.
Typically, companies haven’t recorded information related to their material recipes or processing, explained McDowell, IMat’s founding and executive director. “Instead, the information resided with ‘seasoned experts,’ causing it to get lost or reinvented. By helping companies digitally track workflows and incorporate modern data science tools, we can enable them to determine, for example, if replacing a material in their production line has enough value to offset economic loss due to downtime.”
<expanding data footprint>
In the past decade, Georgia Tech has been rapidly establishing itself as a leader in data science on a number of fronts.
In 2005 it established the School of Computational Science and Engineering (CSE) to educate students in advanced computing and data analysis combined with other disciplines. Since 2012 Georgia Tech and its collaborators have won more than $15 million in federal awards from the Obama administration’s National Big Data Research and Development Initiative. And last November, Georgia Tech was named one of four NSF Big Data Regional Innovation Hubs in partnership with the University of North Carolina.
Led by Aluru at Georgia Tech, the South Big Data Hub will build public-private partnerships across 16 states and the District of Columbia. “The goal is to leverage data science and foster community efforts to tackle regional, national, and societal challenges,” Aluru said. “We’ll begin by focusing on five areas: health care, coastal hazards, industrial big data, materials and manufacturing, and habitat planning.”
“The Hub firmly places Georgia Tech in the national spotlight for big data analysis,” said David Bader, chair of CSE. “We have become the go-to place for data science … a place where problems are solved in much broader context than traditional top-tier research universities.”
Case in point: Bader’s research group has been pioneering massive-scale graph analytics, a technology that can help prevent disease in human populations, thwart cyberattacks, and bolster the electric power grid, to name a few applications. Graph analytics uncover relationships and extract insights from huge volumes of data, and the CSE researchers have designed parallel algorithms that run extremely fast, keeping up with edge-arrival rates of 3 million per second, even when graphs have billions or even trillions of vertices.
With these cutting-edge algorithms, the researchers have developed a collection of open-source software, known as STINGER (Spatio-Temporal Interaction Networks and Graphs, Extensible Representation), which can capture analytics on streaming graphs. “In the past, analysts needed to know the size and range of entities before creating a graph,” Bader explained. “Yet STINGER can track a dynamic graph even when future relationships are not known. Running analytics fast and ingesting a streaming firehose of edges simultaneously is like having new engines installed on your plane — while you’re flying.”
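The idea of running analytics while edges keep arriving can be sketched in a few lines. What follows is a toy single-threaded illustration, not STINGER’s data structure or API: the class name, its methods, and the choice of triangle counting as the running analytic are all assumptions made for the example.

```python
from collections import defaultdict


class StreamingGraph:
    """Toy undirected graph that updates an analytic as edges arrive.

    Vertices do not need to be declared in advance: a vertex is created
    the first time an edge mentions it.
    """

    def __init__(self):
        self.neighbors = defaultdict(set)
        self.triangles = 0  # running count, updated on every insertion

    def insert_edge(self, u, v):
        """Add edge (u, v); ignore self-loops and duplicate edges."""
        if u == v or v in self.neighbors[u]:
            return
        # Each neighbor shared by u and v closes one new triangle.
        self.triangles += len(self.neighbors[u] & self.neighbors[v])
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

    def degree(self, v):
        """Number of distinct neighbors of v (0 for an unseen vertex)."""
        return len(self.neighbors.get(v, ()))


# Hypothetical edge stream; the last edge repeats the first one.
graph = StreamingGraph()
for u, v in [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("a", "b")]:
    graph.insert_edge(u, v)

print(graph.triangles, graph.degree("c"))
```

Because the triangle count is adjusted on each insertion rather than recomputed from scratch, the answer is always current, which is the property a streaming system needs at much larger scale.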
Increasing its investment in data science and interdisciplinary research, Georgia Tech will be the anchor tenant in a new 750,000-square-foot, mixed-use property in Midtown Atlanta. Developed by Portman Holdings, the project has been christened “Coda” and will include a 21-story building with 620,000 square feet of office space and 40,000 square feet for retail and restaurants. In addition, an 80,000-square-foot data center will provide advanced cyber infrastructure and national data repositories.
Georgia Tech will occupy about half of the office space, bringing faculty from the data sciences together with a cross-section of basic and applied researchers. The other half of the building will be devoted to industry.
“Data science brings together multiple areas of expertise to solve big, crucial problems — and the building is meant to reflect that,” said Isbell, explaining that Coda will be organized around areas of interest rather than departments or specific disciplines.
Indeed, CSE will be the only academic department to be entirely relocated to Coda. Many faculty members across campus will relocate to Coda permanently; others will reside there temporarily, depending on the length of projects, and then return to their home unit.
“The building will be a living laboratory and provide the largest gathering of data science experts in one place of any university in the country,” said McLaughlin, who served on a committee with Randall and Isbell to determine faculty needs and maximize benefits of the new building.
“Midtown is going to be transformed by this building,” Randall said. “Coda will be an outward-looking face for declaring ourselves a mecca for data science.”
Skolnick looks forward to the unique collaborations Coda will make possible. “Serendipity is very important in science, and random interactions are the most exciting ones,” he observed. “Most of the important science and engineering discoveries are done at the interface of disciplines. Having people with different abilities and expertise in one place will accelerate that process.”
The new building will also be home for Georgia Tech’s Institute for Data Engineering and Science (IDEAS), a new interdisciplinary research institute led by Randall and Aluru.
IDEAS has a two-pronged mission, Aluru explained: improving the foundations of data science, and advancing different fields that use data-science tools, such as health care, energy, materials development, finance, and business analytics. The institute will:
- Enable one-stop shopping for industry. IDEAS will make it easier to connect companies with students and faculty to support short- or long-term collaborative projects.
- Generate excitement among students. Data science will ultimately impact every discipline at Georgia Tech, so even students who don’t specialize in it will need to be educated about it.
- Increase communication and collaboration among faculty. Building an academic community around data science will help with everything from winning funding to sharing equipment and expertise.
Breaking down silos is critical, Randall said. “Take algorithms. Often scientists come to mathematicians with very difficult, specific questions. We might look at a question and say it can’t be solved efficiently — end of story. However, if we know where the question is coming from, the bigger context, we might realize the researcher doesn’t need an exact solution. An approximation would be just as good for their purposes, and could be done efficiently. Yet if you just hand off these encapsulated questions, you miss the crux of where the magic can happen.”
For example, breakthroughs in theoretical computer science have been made by looking at problems as a physicist would, Randall added. “And along the way, we’ve introduced techniques from computing perspectives that have solved long-standing physics problems. This is happening more and more. Academics can no longer operate the way they used to, being very domain specific. You need the instincts, intuitions, and techniques that come from multiple fields to really push the boundaries.”
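Randall’s remark that an approximation can be “just as good” has a classic textbook illustration, chosen here as an example rather than taken from her work. Finding the smallest set of vertices that touches every edge of a graph (minimum vertex cover) is NP-hard, yet the simple greedy pass below always returns a cover at most twice the optimal size. The function name and the sample graph are hypothetical.

```python
def approx_vertex_cover(edges):
    """Return a vertex cover at most twice the size of the smallest one.

    Greedy maximal matching: whenever an edge has neither endpoint in
    the cover yet, add both endpoints.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


# Path graph 1-2-3-4: the smallest cover is {2, 3}, of size 2.
edges = [(1, 2), (2, 3), (3, 4)]
cover = approx_vertex_cover(edges)

print(cover)
```

On this small path the greedy answer uses four vertices where two would do, which is within the factor-of-two guarantee and takes a single pass over the edges.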
T.J. Becker is a freelance writer based in Michigan. She writes about business and technology issues.