Georgia Tech’s data center simulator uses lasers, wireless sensors, and other equipment to study air flow and cooling in server racks. Shown is Yogendra Joshi, a professor in Georgia Tech’s School of Mechanical Engineering.
In the struggle to improve the performance of mobile devices such as smartphones, extending battery life is just one part of the effort.
System designers must increasingly worry about removing heat, an unwanted byproduct of watching a YouTube video, shooting a selfie, or updating a Facebook page.
In the same way that physical limits on the size of transistors may throttle the performance growth promised by Moore’s Law (the expectation that computer processing power will double about every two years), the challenge of removing heat from ever-smaller transistors also poses a threat to continued efficiency improvements. The resulting tradeoffs will affect everything that relies on integrated circuits — from mobile phones and tablets all the way up to high-performance computers and data centers the size of football fields.
At Georgia Tech, researchers are addressing these thermal challenges in broad and bold ways. Their efforts include designing chips that operate with less power, providing new forms of cooling, and optimizing data center operations.
“The challenges on the small scale are very different from the challenges at the large scale,” said Yogendra Joshi, a professor in Georgia Tech’s George W. Woodruff School of Mechanical Engineering, whose research group studies thermal challenges in a comprehensive way. “Everyone wants more capabilities in the devices they are using, but there are tradeoffs to be made at each level.”
Sudhakar Yalamanchili, a Regents Professor in Georgia Tech’s School of Electrical and Computer Engineering, is studying how design can address thermal issues in integrated circuits.
Photo: Fitrah Hamid
Designing Chips for Thermal Management
The whirring fan of a laptop computer is the closest most consumers come to the challenge of thermal management in electronic devices. But the issue really begins much deeper in whatever system they are using, with the design of integrated circuits. In these ICs, billions of transistors carry out computer operations using electrical charges, producing heat that must be removed.
Designers must prevent chip temperatures from going beyond levels that can cause a silicon meltdown. But even temperatures below the damage threshold can cause current leakage and reduce performance, so thermal issues have become a critical component of modern IC design.
“Until recently, whenever transistors became smaller, they required correspondingly less energy, so you could double the number of transistors on an integrated circuit and the power density remained roughly constant,” noted Sudhakar Yalamanchili, a Regents Professor in Georgia Tech’s School of Electrical and Computer Engineering. “Around the middle part of the last decade, this changed for reasons rooted in physics and technology. Now, as we double the number of transistors, the on-chip power density increases. This is not sustainable because eventually we will get to the point where we cannot cool the devices.”
The world’s information technology industry has grown accustomed to continual performance increases that boost productivity. Researchers like Yalamanchili are looking at new computing techniques to continue that beneficial trend.
“There is only so much heat that you can extract from a device cost-effectively, and that is how much power you can burn in that much volume,” he said. “The amount of power you can burn, in turn, determines how much the transistors can consume, which controls how many transistors you can operate concurrently. And the number of active transistors determines how much performance you can get.”
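The chain Yalamanchili describes — cooling capacity bounds power, power bounds how many transistors can switch, and switching transistors bound performance — can be sketched with a back-of-the-envelope calculation. All numbers below are illustrative assumptions, not measurements from this research:

```python
# Illustrative sketch of the cooling -> power -> performance chain.
# Every constant here is an assumption chosen for round numbers.

COOLING_LIMIT_W = 100.0        # heat a package can shed cost-effectively
ENERGY_PER_SWITCH_J = 1e-15    # assumed energy per transistor switch (1 fJ)
CLOCK_HZ = 2e9                 # assumed 2 GHz clock

# The power budget caps how many transistors can switch every cycle.
max_switching_transistors = COOLING_LIMIT_W / (ENERGY_PER_SWITCH_J * CLOCK_HZ)
print(f"Transistors that can switch concurrently: {max_switching_transistors:.2e}")

# A chip with more transistors than that must leave some idle
# ("dark silicon") to stay within its thermal envelope.
chip_transistors = 5e9
dark_fraction = max(0.0, 1 - max_switching_transistors / chip_transistors)
print(f"Fraction forced idle: {dark_fraction:.0%}")
```

Under these assumed values, only a small fraction of a five-billion-transistor chip could be active at once — the unsustainability Yalamanchili points to.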
There are strategies for getting the most out of the available energy. One is increasing the use of special-purpose accelerators that are more efficient than general-purpose chips for certain applications; for example, rewriting code to use graphics processing units (GPUs), the more energy-efficient processors originally developed to handle graphics. Another is reducing the movement of data on chips, a strategy of special interest to Yalamanchili.
“Moving a data bit will soon take more energy than the computing operations performed with it,” he explained. “We have to minimize data movement, and this will be a fundamental shift in how computing is done. To continue performance scaling with Moore’s Law, we are going to have to redesign systems to be centered around data and memory systems rather than the CPU.”
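A rough comparison makes the data-movement point concrete. The figures below are order-of-magnitude estimates of the kind commonly cited for silicon of roughly this era; they are assumptions for illustration, not numbers from this article:

```python
# Assumed order-of-magnitude energy costs, in picojoules, for moving
# versus computing on a 64-bit word. Illustrative values only.
ENERGY_PJ = {
    "64-bit floating-point operation": 20.0,
    "read 64 bits from on-chip SRAM": 5.0,
    "move 64 bits across the chip (~10 mm)": 60.0,
    "read 64 bits from off-chip DRAM": 1300.0,
}

flop = ENERGY_PJ["64-bit floating-point operation"]
for action, pj in ENERGY_PJ.items():
    print(f"{action}: {pj:7.1f} pJ  ({pj / flop:5.1f}x a floating-point op)")
```

With numbers like these, fetching an operand from off-chip memory costs tens of times more energy than computing with it — which is why redesigning systems around data and memory pays off.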
Other strategies may involve alternative computing models such as neural networks, inspired by our understanding of how the brain operates. New materials and devices such as conducting films and carbon nanotubes may also replace traditional complementary metal-oxide-semiconductor (CMOS) systems.

Ultimately, the future of computing will depend on a different set of tradeoffs, with energy use — governed by cooling — an increasingly important driver.
“It’s now an interdisciplinary research need,” Yalamanchili said. “You have to be able to understand the characteristics of devices, the design of architectures, the demands of applications, and the physics of the overall environment. Industry wants to keep that performance scaling going, and to do that, we are going to have to be more cross-disciplinary.”
Cooling Mobile Devices
It seems there’s now a smartphone in nearly every pocket or purse. These handheld computers can run basic business applications, shoot video, give directions, play games, browse the Web, gather weather updates, send email — and even make phone calls.
Battery life for these mobile devices can be a major issue for heavy users, but addressing the power challenge is much more complex than it seems. Smartphones and tablet computers have only the most rudimentary passive cooling capabilities: Heat flows to the case, where it dissipates to the environment — or to the user’s body. So having more battery power won’t necessarily translate into more performance.
“The thermal management options for these small devices, both phones and tablets, are extremely limited,” Joshi noted. “You can’t have a fan and you can’t have a heat sink. There are some real physical limits on what you can do related to the amount of physical space available and how tightly the components are packed. That limits the performance you can get.”
The temperature of the device case must be kept low enough — less than about 45 degrees Celsius — to avoid alarming users, while internal temperatures have to remain low enough to avoid damage. Only a few watts of power can heat phones to those limits.
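A simple lumped thermal-resistance model shows why a few watts is all it takes. The resistance value below is an assumption for a typical fanless handset, not a measured figure:

```python
# Minimal sketch of a phone's passive thermal budget, assuming a single
# lumped thermal resistance from components to ambient. Values are
# illustrative assumptions.
AMBIENT_C = 25.0
CASE_LIMIT_C = 45.0          # case comfort limit cited in the article
R_CASE_TO_AMBIENT = 10.0     # assumed K/W for a fanless handheld device

# Steady state: temperature rise = power x thermal resistance
max_sustained_power_w = (CASE_LIMIT_C - AMBIENT_C) / R_CASE_TO_AMBIENT
print(f"Sustainable dissipation: {max_sustained_power_w:.1f} W")  # → 2.0 W
```

With no fan or heat sink, the entire sustainable power budget of the device under these assumptions is about two watts — consistent with the article's "only a few watts."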
One possible solution has been developed in a laboratory led by Joshi and Saibal Mukhopadhyay, where then-Ph.D. students Wen Yueh and Zhimin Wan have achieved what is believed to be the first microfluidic cooling of a commercial system-on-chip for a mobile device. Using deionized water circulated by a tiny piezoelectric pump, the experimenters showed that liquid cooling could reduce energy use by 15 to 20 percent — even after accounting for the pump power — by keeping the chip running cooler.
Though the cooling system isn’t yet fully integrated within the device case, it is serving as a test bed for liquid cooling in mobile platforms, said Mukhopadhyay, who is a professor in Georgia Tech’s School of Electrical and Computer Engineering.
“For the first time, we were able to show that cooling a mobile processor does help it with overall energy efficiency,” he said. “It helps in terms of performance, and in total power consumed. This fully controlled system can help us understand how active cooling can help with small devices.”
Beyond the performance measures, the cooling system also provided reliability benefits. By controlling the processor operating temperature, the technique kept the chip from shutting down or scaling back its performance even during the highest operational loads. Maintaining lower temperatures should also provide better long-term reliability, Mukhopadhyay said.
Among future challenges are miniaturizing the cooling system, developing a control system to turn it on and off, and understanding the implications of using liquids in small electronic devices. Cost issues, however, could slow the transfer of this technology into consumer devices such as smartphones.
“What we have done so far is to show that there is a pathway for bringing microfluidics into this mobile cooling environment, but there is a tremendous amount of improvement left to be done,” Mukhopadhyay added.
Beyond mobile devices, the work could have implications for robotic vision systems, drones, and other devices that use power-constrained chips in systems with small form factors. The research was supported by Sandia National Laboratories, the Semiconductor Research Corporation, and Qualcomm.
Graduate student Thomas Sarvey demonstrates an experimental setup that provided liquid cooling directly on an operating high-performance CMOS chip. He worked with School of Electrical and Computer Engineering Professor Muhannad Bakir to implement the technology on a stock field-programmable gate array (FPGA) device.
Photo: Rob Felt
Liquid Cooling for FPGA Chips
Using microfluidic passages cut directly into the backs of field-programmable gate array (FPGA) devices, another Georgia Tech research team has put liquid cooling just a few hundred microns from where the transistors are operating.
The new technology could allow development of denser and more powerful integrated electronic systems that would no longer require heat sinks or cooling fans on top of the integrated circuits. Working with 28-nanometer FPGA devices, the researchers demonstrated a monolithically cooled chip that can operate at temperatures more than 60 percent below those of similar air-cooled chips.
In addition to enabling more processing power, the lower temperatures can mean longer device life and less current leakage. The cooling comes from simple deionized water flowing through microfluidic passages that replace the massive air-cooled heat sinks normally placed on the backs of chips.
“We believe we have eliminated one of the major barriers to building high-performance systems that are more compact and energy efficient,” said Muhannad Bakir, a professor in Georgia Tech’s School of Electrical and Computer Engineering. “We believe that reliably integrating microfluidic cooling directly on the silicon will be a disruptive technology for a new generation of electronics.”
Supported by the Defense Advanced Research Projects Agency (DARPA), the research is believed to be the first example of liquid cooling directly on an operating high-performance CMOS chip.
To make their liquid cooling system, Bakir and graduate student Thomas Sarvey removed the heat sink and heat-spreading materials from the backs of stock Altera FPGA chips. They then etched cooling passages into the silicon, incorporating silicon cylinders approximately 100 microns in diameter to improve heat transmission into the liquid. A silicon layer was then placed over the flow passages, and ports were attached for the connection of water tubes.
With a water inlet temperature of approximately 20 degrees Celsius, the liquid-cooled FPGA operated at a temperature of less than 24 degrees Celsius, compared to an air-cooled device that operated at 60 degrees Celsius.
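The reported temperatures imply a large drop in thermal resistance between the chip and its coolant. The power figure below is a hypothetical assumption (the article does not state the chip's dissipation), but the improvement ratio cancels it out:

```python
# Sketch of the thermal resistance implied by the reported temperatures,
# assuming both chips dissipate the same (hypothetical) power and see
# the same 20 C inlet/ambient temperature.
POWER_W = 20.0          # assumed dissipation; not stated in the article
INLET_C = 20.0
LIQUID_CHIP_C = 24.0    # reported liquid-cooled operating temperature
AIR_CHIP_C = 60.0       # reported air-cooled operating temperature

# Thermal resistance = temperature rise / power
r_liquid = (LIQUID_CHIP_C - INLET_C) / POWER_W   # K/W
r_air = (AIR_CHIP_C - INLET_C) / POWER_W         # K/W
print(f"Liquid: {r_liquid:.2f} K/W, air: {r_air:.2f} K/W, "
      f"improvement: {r_air / r_liquid:.0f}x")
```

Because the assumed power appears in both numerator and denominator, the roughly tenfold improvement holds regardless of its actual value.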
Cooling Massive Data Centers
In a one-story brick building in Georgia Tech’s North Avenue Research Area, cooling fans in banks of computer servers whine as a large air-conditioning system blows cool air into the raised floor below them. The cooled air rises through servers and into an air return built into the ceiling. This data center simulator operates much like the massive facilities that host cloud operations for companies such as Facebook or Microsoft, as well as for innumerable smaller organizations.
But as much as half of the power consumed by such data centers doesn’t go to operate computers. Instead, it’s consumed by the huge air-conditioning systems that carry off the heat generated by the computers. Depending on utility rates and other factors, annual energy bills for such facilities can total several million dollars, providing a research agenda for scientists like Joshi, who studies a broad range of data center energy issues.
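The industry's standard efficiency metric, Power Usage Effectiveness (PUE), captures this overhead directly. The facility numbers below are illustrative:

```python
# Power Usage Effectiveness: total facility power divided by the power
# actually delivered to IT equipment. An ideal facility scores 1.0.
def pue(it_power_kw: float, cooling_kw: float, other_overhead_kw: float = 0.0) -> float:
    return (it_power_kw + cooling_kw + other_overhead_kw) / it_power_kw

# "As much as half the power" going to cooling corresponds to a PUE of ~2.0.
print(pue(1000, 1000))   # → 2.0
# A highly optimized hyperscale facility might approach 1.1.
print(pue(1000, 100))    # → 1.1
```

Cutting a multi-megawatt facility's PUE from 2.0 toward 1.1 nearly halves its total energy bill, which is why air-flow management research carries such economic weight.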
“Air flow management is a very important issue inside these facilities,” he noted. “We are studying how air comes up from the perforated tile floor, where it goes, and how we can change that direction. Our simulator allows us to study the critical air flow issues.”
In theory, cooled air flows up through the perforated floor tiles, into the server cabinets, and then into the ceiling return. Cold air is kept separate from warm air, and each machine stays within a certain range of operating temperatures.
“In reality, you get all sorts of problems with short-circuiting, in which hot air ends up in the cold aisle,” Joshi noted. “We use laser diagnostic equipment, wireless sensors, and other techniques to study how to minimize that with careful air flow control. Using that information, we’re developing techniques for improved air-flow management.”
Most commercial data centers use air cooling, and big data center operators use a range of techniques to cut their energy bills, including locating facilities in cities with cold climates, such as Chicago or Buffalo, where outside air can replace air conditioning for significant parts of the year.
Unfortunately, the need to provide rapid response — critical to many business applications — dictates that data centers be located close to where the data is needed, so Joshi and others are studying how to use these cooling dollars most effectively. Development of high-performance computer centers, with their growing appetite for energy, adds urgency to that effort.
“I expect to see a segmented marketplace,” Joshi said. “There is going to be a large class of applications where people will just use air cooling. They may not be the most efficient from the perspective of energy use, but they will be simple. You will also see facilities designed for high-performance computing that will look very different and include liquid cooling and advanced air cooling.”
Co-Design of Computing, Software, and Cooling
Computer servers, system software, and cooling equipment are now designed independently and brought together in the data center. Ada Gavrilovska would like to change that.
A senior research faculty member in Georgia Tech’s College of Computing and the Center for Experimental Research in Computer Systems (CERCS), Gavrilovska sees integration as the way to control energy costs, especially as more high-performance computing systems come online and computing continues its move to the cloud.
Only about 5 percent of data centers are operated by companies such as Google, which can boost efficiency through advanced cooling solutions and careful control of operating conditions because they know their equipment and what’s being processed. Other data centers still carry tremendous cooling overhead: they often serve multiple clients, run many highly dynamic applications, and don’t know what’s generating heat inside the server racks.
“Traditionally, on the system side, we only focused on managing the compute allocation, the storage allocation, and the provisioning of different services in the data center,” Gavrilovska said. “The primary driver was optimizing the utilization of the machines and guaranteeing performance. In many cases, there was so much focus on performance that it didn’t matter how much it cost for cooling.”
On today’s commercial websites, a single click — a search for a specific product, for example — can generate hundreds or even thousands of actions. A database operation renders the page, while an algorithm suggests related products, information about user interests is aggregated, and fraud detection software is launched.
In Georgia Tech’s data center simulator, researchers study air flow as part of efforts to reduce energy consumption. Cooling can account for as much as half of the energy data centers consume.
Photo: Rob Felt
Modeling being done by Gavrilovska and her colleagues focuses on how thermal needs fluctuate based on operations like these, the software stack in use, the time of day, and other factors. But cooling now tends to be allocated without accounting for those factors, meaning as much as 30 percent of energy expenditures may be unnecessary. Addressing that issue will require more communication between data center operators and tenants — and better modeling.
“We’ve been building a fine-grain, closed-loop system that brings in a lot of data from different levels, including the hardware, the system software stacks, and the applications,” Gavrilovska explained. “We are also building a metering capability so we can account for the overall energy implications of individual applications.”
Server cooling must be allocated to prevent sporadic heavy-use “hot spots” from overheating machines, so excess cooling is often provided. System designers can help by distributing workload among servers to avoid these hot spots, and by consolidating operations where possible, allowing unused machines to be shut down. Still, Gavrilovska pointed out, there is considerable opportunity to close the gap further through a better understanding of the workload and its implications for energy use, heat generation, and cooling demand, so that energy-saving decisions deliver benefits with minimal risk.
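The consolidation idea can be sketched as a simple placement heuristic: pack jobs onto as few servers as fit under a per-server thermal cap, so the rest can be powered down. This greedy first-fit-decreasing sketch is purely illustrative of the concept, not Georgia Tech's actual system:

```python
# Illustrative sketch of thermally capped workload consolidation.
# Jobs and the cap are hypothetical wattages.
def place_jobs(job_watts, server_cap_w):
    """Assign each job to the first server that stays under its thermal
    cap, powering up a new server only when necessary."""
    servers = []  # current wattage on each powered-on server
    for w in sorted(job_watts, reverse=True):  # largest jobs first
        for i, load in enumerate(servers):
            if load + w <= server_cap_w:
                servers[i] += w
                break
        else:
            servers.append(w)  # no room anywhere: power up another server
    return servers

loads = place_jobs([120, 80, 60, 200, 40, 90], server_cap_w=250)
print(loads)  # → [240, 210, 140]: three servers instead of six
```

The cap stands in for the thermal budget; a real scheduler would also weigh performance guarantees and the dynamic heat profile Gavrilovska's models capture.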
The research is supported by the National Science Foundation, the U.S. Department of Energy, and several companies with interest in data centers. With the growing demand for high-performance computing, such coordination and integration can’t come soon enough, Gavrilovska said.
“The cost of energy is becoming very significant,” she said. “If we don’t change the way we are doing things, a simple loop operation on an exascale computer could require megawatts of power. Computing at this scale will very quickly become impractical.”
John Toon is editor of Research Horizons magazine and director of research news at Georgia Tech.