Nvidia claims near 50% boost in AI storage speed

News | Feb 05, 2025 | 3 mins | Cloud Storage

Vendor’s speed claim rests on Nvidia Spectrum-X networking accelerating storage access for large language model workloads

Credit: JLStock / Shutterstock

Nvidia is touting a nearly 50% improvement in storage read bandwidth, thanks to intelligence built into its Spectrum-X Ethernet networking equipment, according to a technical blog post from the vendor.

Spectrum-X is a combination of the company’s Spectrum-4 Ethernet switch and BlueField-3 SuperNIC smart networking card, which supports RoCE v2 for remote direct memory access (RDMA) over Converged Ethernet.


The Spectrum-4 SN5000 switch provides 64 ports of 800 Gbps Ethernet for up to 51.2 Tbps of total bandwidth. Nvidia says it has added RoCE extensions for adaptive routing and congestion control, so data packets are sent across the least congested network routes, easing congestion and steering traffic around outages, as the sketch below illustrates.
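To make the adaptive-routing idea concrete, here is a simplified, hypothetical Python sketch: each packet is sent down whichever path currently reports the least congestion, instead of pinning an entire flow to one hashed path. The `Path` class, path names, and queue-depth metric are illustrative assumptions, not Nvidia APIs.

```python
from dataclasses import dataclass

@dataclass
class Path:
    """One of several equal-cost routes through the switch fabric (hypothetical model)."""
    name: str
    queued_bytes: int = 0  # stand-in for a real-time congestion/telemetry metric
    healthy: bool = True   # False models a link outage

def pick_path(paths: list[Path]) -> Path:
    """Adaptive routing: send each packet down the least congested healthy path,
    rather than pinning a whole flow to one path as classic ECMP hashing does."""
    candidates = [p for p in paths if p.healthy]
    return min(candidates, key=lambda p: p.queued_bytes)

paths = [Path("spine-1"), Path("spine-2"), Path("spine-3")]
paths[0].queued_bytes = 4096   # spine-1 is busy
paths[2].healthy = False       # spine-3 is down

for seq in range(4):
    path = pick_path(paths)
    path.queued_bytes += 1500  # account for the packet just enqueued
    print(f"packet {seq} -> {path.name}")
```

Because consecutive packets of the same flow can take different paths, they may arrive out of order, which is where the receiving DPU comes in.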

Adaptively routed packets can arrive at the destination out of sequence, but the BlueField-3 DPU knows the correct order of the packets and can reassemble them properly. Under legacy Ethernet, out-of-order arrival "would require that many packets be retransmitted," the blog stated.

Because adaptive routing is able to mitigate flow collisions and increase bandwidth efficiency, the storage system’s performance is much higher than with standard RoCE v2, Nvidia claims.

“With Spectrum-X, the SuperNIC or data processing unit (DPU) in the destination host knows the correct order of the packets, placing them in order in the host memory and keeping the adaptive routing transparent to the application. This enables higher fabric utilization for higher effective bandwidth and predictable, consistent outcomes for checkpoint, data fetching, and more,” the blog said.
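A rough sketch of the in-order placement the blog describes: the receiver tracks the next expected sequence number, buffers early arrivals, and delivers data in order so the application never sees the reordering. The class and method names here are illustrative assumptions, not BlueField firmware.

```python
class InOrderPlacer:
    """Buffers out-of-order packets and releases them in sequence order,
    loosely modeling what the SuperNIC/DPU does transparently in host memory."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}  # seq -> payload, held until its turn

    def receive(self, seq: int, payload: bytes) -> list[bytes]:
        """Accept one packet; return any payloads now deliverable in order."""
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

placer = InOrderPlacer()
# Packets 0..3 arrive shuffled by adaptive routing:
for seq in (2, 0, 3, 1):
    for chunk in placer.receive(seq, f"chunk-{seq}".encode()):
        print(chunk.decode())  # prints chunk-0 .. chunk-3 in order
```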

Storage is an often overlooked element of AI, overshadowed by the emphasis on processors, namely GPUs. Large language models (LLMs) measure in the terabytes, and all of that data needs to be moved around to be processed. The faster data can move, the less time GPUs sit idle waiting to be fed.

Nvidia says it has tested these Spectrum-X features on its Israel-1 AI supercomputer. The testing measured the read and write bandwidth generated by Nvidia HGX H100 GPU server clients accessing the storage, first with the network configured as a standard RoCE v2 fabric, and then with Spectrum-X adaptive routing and congestion control turned on, Nvidia stated.

Tests were run with a range of GPU server clients, from 40 to 800 GPUs. In every case, the enhanced Spectrum-X networking performed better than the standard version, with read bandwidth improving by 20% to 48% and write bandwidth by 9% to 41% over standard RoCE networking, according to Nvidia.
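For clarity, those percentages are relative gains over the RoCE v2 baseline. The short calculation below spells out the arithmetic; the raw bandwidth figures are made up for illustration, since Nvidia's blog reports only the percentages.

```python
def improvement_pct(baseline_gbps: float, spectrum_x_gbps: float) -> float:
    """Relative gain of the adaptive-routing run over the standard RoCE v2 baseline."""
    return (spectrum_x_gbps - baseline_gbps) / baseline_gbps * 100

# Illustrative numbers only -- not measurements from Nvidia's tests.
print(f"{improvement_pct(100.0, 148.0):.0f}% read gain")   # 48% read gain
print(f"{improvement_pct(100.0, 141.0):.0f}% write gain")  # 41% write gain
```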

Another method for improving efficiency is checkpointing, in which the state of a processing job is saved periodically, so that if a training run fails for any reason, it can be restarted from the last saved checkpoint rather than from the beginning.
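A minimal sketch of that checkpoint/restart pattern, in plain Python with pickle; real training frameworks have their own checkpoint APIs, and the file path, save interval, and state layout here are assumptions for illustration.

```python
import os
import pickle

CKPT_PATH = "train_state.ckpt"  # hypothetical path

def save_checkpoint(step: int, state: dict) -> None:
    """Periodically persist the job state so a failed run can resume."""
    with open(CKPT_PATH, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def load_checkpoint() -> tuple[int, dict]:
    """Resume from the last saved state, or start fresh if none exists."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {}

start_step, state = load_checkpoint()
for step in range(start_step, 1000):
    state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
    if step % 100 == 0:
        save_checkpoint(step + 1, state)  # a crash after this resumes from step + 1
```

The faster the fabric can flush multi-terabyte checkpoints to storage, the less often training has to pause, which is why Nvidia calls out checkpointing as a beneficiary of the bandwidth gains.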

Storage vendors DDN, VAST Data, and WEKA are partnering with Nvidia to integrate and optimize their solutions for Spectrum-X.