The power and cooling demands of AI processing are far beyond what standard hardware configurations can deliver, according to Schneider Electric.
Schneider Electric is warning that the power and cooling demands of AI are beyond what standard data center designs can handle, and it says new designs are necessary.
That may be expected from a company like Schneider, which makes power and cooling systems used in data centers. But that doesn't mean Schneider is wrong. AI is a different kind of workload than standard server-side applications, such as databases, and the old ways just don't cut it anymore.
Schneider’s white paper notes that AI needs an ample supply of three things: power, cooling, and bandwidth. GPUs are the most popular AI processors and the most power intensive. Whereas CPUs from Intel and AMD draw about 300 to 400 watts, Nvidia’s newest GPUs draw 700 watts per processor, and they are often deployed in clusters of eight.
This leads to greater rack density. In the past, rack density of around 10kW to 20kW was standard and easily handled by air cooling (heatsinks and fans). But above roughly 30kW per rack, air cooling is no longer viable. At that point, liquid cooling has to be considered, and liquid cooling is not an easy retrofit.
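The arithmetic is easy to sketch. The following back-of-the-envelope estimate uses the figures above plus my own illustrative assumptions (1kW of host overhead per server; the 20kW and 30kW thresholds from the preceding paragraph), not numbers from the white paper:

```python
# Back-of-the-envelope rack power estimate (illustrative assumptions, not Schneider figures).
GPU_WATTS = 700             # per Nvidia-class GPU, per the figures above
GPUS_PER_SERVER = 8         # typical 8-GPU AI server
HOST_OVERHEAD_WATTS = 1000  # assumed allowance for CPUs, memory, fans, NICs per server

def server_power_kw() -> float:
    """Estimated draw of one 8-GPU server, in kW."""
    return (GPU_WATTS * GPUS_PER_SERVER + HOST_OVERHEAD_WATTS) / 1000

def cooling_approach(rack_kw: float) -> str:
    """Map rack density to the cooling thresholds discussed above."""
    if rack_kw <= 20:
        return "air cooling"
    if rack_kw <= 30:
        return "air cooling at its limits -- evaluate liquid"
    return "liquid cooling"

for servers in (1, 2, 3, 4):
    kw = servers * server_power_kw()
    print(f"{servers} x 8-GPU servers: {kw:.1f} kW per rack -> {cooling_approach(kw)}")
```

Even a rack that is only half full of 8-GPU servers lands well beyond the density range that legacy air-cooled designs were built for.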
“AI start-ups, enterprises, colocation providers, and internet giants must now consider the impact of these densities on the design and management of the data center physical infrastructure,” the authors of the paper wrote.
Schneider projects that the total cumulative data center power consumption worldwide will be 54GW this year and hit 90GW by 2028. In that time, AI processing will go from accounting for 8% of all power use this year to 15% to 20% by 2028.
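Working through those projections (the totals and percentages are Schneider's; the midpoint, the five-year span, and the rounding are mine):

```python
# AI share of total data center power, per the Schneider projection cited above.
total_gw_now, ai_share_now = 54, 0.08
total_gw_2028, ai_share_2028 = 90, (0.15 + 0.20) / 2   # midpoint of the 15-20% range

ai_gw_now = total_gw_now * ai_share_now        # ~4.3 GW
ai_gw_2028 = total_gw_2028 * ai_share_2028     # ~15.8 GW at the midpoint

years = 5  # assumed span between the two figures
cagr = (ai_gw_2028 / ai_gw_now) ** (1 / years) - 1
print(f"AI load grows from ~{ai_gw_now:.1f} GW to ~{ai_gw_2028:.1f} GW, roughly {cagr:.0%} per year")
```

In other words, the AI slice of the pie roughly triples even as the pie itself grows by two thirds.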
While power and cooling have been top of mind among data center builders, another consideration that is often overlooked is network throughput and connectivity. For AI training, each GPU needs its own network port with very high throughput.
However, GPUs have greatly outpaced network ports. For example, pairing GPUs that pull data from memory at 900 Gbps with a 100 Gbps compute fabric slows the GPUs down, because they sit idle waiting for the network to deliver the data. Alternatively, InfiniBand is much faster than traditional copper cabling, but it’s also roughly 10 times more expensive.
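To see why that mismatch matters, consider how long it takes to move a fixed chunk of data at each rate. This is a simplified, order-of-magnitude sketch using the two figures above; it ignores protocol overhead and the many parallel links a real cluster uses:

```python
# Order-of-magnitude comparison: GPU memory bandwidth vs. a 100 Gbps compute fabric.
# Ignores protocol overhead and parallel links; the point is the size of the gap,
# not a precise benchmark.
DATA_TB = 1                       # hypothetical chunk of training data
bits = DATA_TB * 8 * 10**12       # 1 TB = 8 x 10^12 bits

for label, gbps in (("GPU memory path (900 Gbps)", 900),
                    ("compute fabric  (100 Gbps)", 100)):
    seconds = bits / (gbps * 10**9)
    print(f"{label}: {seconds:5.1f} s to move {DATA_TB} TB")
```

A ninefold gap in transfer time is a ninefold stretch of wall-clock time in which an expensive GPU is doing nothing useful.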
One way to avoid extreme heat density is to physically spread out the hardware: don’t fill the racks, leave space between them, and so on. But doing so introduces latency, given the terabytes of data that have to move between nodes, and latency is the enemy of performance.
Suggestions and solutions
Schneider offers a number of suggestions. The first calls for replacing 120/208V power distribution with 240/415V systems to reduce the number of circuits within high-density racks. It also recommends using multiple power distribution units (PDUs) to deliver adequate power.
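The circuit-count math behind that recommendation is straightforward. Here is a sketch; the 30A breakers and 80% continuous-load derating are my illustrative assumptions, not figures from the white paper:

```python
import math

# Usable power per three-phase circuit: V(line-to-line) x amps x sqrt(3) x derating.
def circuit_kw(line_to_line_volts: float, amps: float = 30, derating: float = 0.8) -> float:
    return line_to_line_volts * amps * math.sqrt(3) * derating / 1000

RACK_KW = 40  # hypothetical high-density AI rack
for volts in (208, 415):
    per_circuit = circuit_kw(volts)
    circuits = math.ceil(RACK_KW / per_circuit)
    print(f"{volts} V three-phase: {per_circuit:.1f} kW per circuit -> "
          f"{circuits} circuits for a {RACK_KW} kW rack")
```

Roughly doubling the distribution voltage roughly doubles the power each circuit can carry, which means fewer breakers, fewer whips, and less copper crammed into an already dense rack.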
Setting a threshold of 20kW per rack for air cooling is another suggestion. Going beyond 20kW, Schneider recommends using liquid cooling. Given that air cooling maxes out at 30kW, I believe Schneider is being a bit conservative about the limits of air cooling. Or trying to sell liquid cooling hardware.
There are multiple forms of liquid cooling, but Schneider advocates direct liquid cooling. A copper plate is attached to the processor, much as a heatsink is in an air-cooled system, but it has two pipes: cool water comes in through one, absorbs heat from the plate, and exits through the other to be circulated and cooled back down.
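How much coolant that loop needs follows from the standard heat-transfer relation Q = ṁ · c_p · ΔT. A quick sketch for a dense rack; the 30kW load and 10°C inlet-to-outlet temperature rise are illustrative assumptions, not Schneider's figures:

```python
# Heat carried by the coolant loop: Q = m_dot * c_p * delta_T.
HEAT_LOAD_W = 30_000     # hypothetical 30 kW rack
CP_WATER = 4186          # J/(kg*K), specific heat of water
DELTA_T_K = 10           # assumed inlet-to-outlet temperature rise

mass_flow_kg_s = HEAT_LOAD_W / (CP_WATER * DELTA_T_K)   # ~0.72 kg/s
liters_per_min = mass_flow_kg_s * 60                    # water is roughly 1 kg per liter

print(f"~{mass_flow_kg_s:.2f} kg/s of water, about {liters_per_min:.0f} L/min")
```

A few dozen liters per minute is modest plumbing, but it still means pipes, pumps, manifolds, and leak detection in racks that were never designed for any of it.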
Schneider doesn’t seem to be a fan of immersion cooling, as the dielectric fluids used for immersion contain fluorocarbons, which can be pollutants.
Schneider also warns that there is a general lack of standardization in liquid cooling, so a thorough infrastructure assessment – done by experts experienced with the equipment – is important. That’s assuming that a facility can even be retrofitted in the first place. Most data centers using liquid cooling add the infrastructure when the center is being built, not afterwards.
The white paper includes a number of other recommendations and further guidance.