Michael Cooney
Senior Editor

Industry groups drive Ethernet upgrades for AI, HPC

News Analysis
Mar 29, 2024 · 8 mins
Data Center, High-Performance Computing, Networking

AI networking and bulkier data center applications are sparking advancements in Ethernet-based communication technologies.

AI workloads, high-performance computing (HPC) requirements, and sustainability initiatives are spurring tech industry efforts to rework the venerable Ethernet ecosystem.

There’s pressure to increase the scale, stability, and reliability of Ethernet, and, among other outcomes, that pressure has led to wider interest in the Ultra Ethernet Consortium.

AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft formed the Ultra Ethernet Consortium in July 2023. Its charter is to bring together industry leaders to build a complete Ethernet-based communication stack architecture for high-performance networking. Since November, when it began accepting new members, 45 companies have joined the UEC. There are now 715 industry experts engaged in the UEC’s eight working groups.

“There is a strong desire to have an open, accessible, Ethernet-based network specifically designed to accommodate AI and HPC workload requirements,” said J Metz, chair of the UEC steering committee, in a statement. “This level of involvement is encouraging; it helps us achieve the goal of broad interoperability and stability.”

The idea behind the UEC wasn’t to come up with a new technology and then make everyone wait seven years for something to show up in the marketplace, said Uri Elzur, chair of the UEC technical advisory committee. 

With the approach UEC is taking, “customers could use existing Ethernet switches that everybody is deploying today, and UEC technologies will work on top of them and take advantage of all of the innovation Ethernet already has at the link and endpoint level,” Elzur said. “So, the next time they are making a buying decision, they can consider some optional features that we’ll offer that will be fully compliant with Ethernet, and they’ll be able to use all of their existing tools to work with it.”

UEC version 1.0

Work on the UEC specifications is following what the group calls a very aggressive timeline, with version 1.0 slated to be released in the third quarter of 2024. The UEC 1.0 Overview explains some of the group’s priorities for the forthcoming specification.

“Even when considering the advantages of using Ethernet, improvements can and should be made,” the UEC stated. “Networks must evolve to better deliver this unprecedented performance for the increased scale and higher bandwidth of networks of the future. Paramount is the need to have the network support delivery of messages to all participating endpoints as quickly as possible, without long delays for even a few endpoints.”

For example, the UEC cites the need to minimize “tail latency” in the training of AI models: “Training consists of frequent computation and communications phases, where the initiation of the next phase of the training is dependent on the completion of the communication phase across the suite of GPUs. The last message to arrive gates the progress of all GPUs. This tail latency – measured by the arrival time of the last message in the communication phase – is a critical metric in system performance.”
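The gating effect the UEC describes can be illustrated with a minimal Python sketch. It is not drawn from the UEC specification: the function names, the sample arrival times, and the assumption that each message is delayed independently with a fixed probability are all hypothetical, chosen only to show why the slowest message matters more than the average one as clusters grow.

```python
def step_time(arrival_times_ms):
    """A synchronized communication phase finishes when the LAST message arrives."""
    return max(arrival_times_ms)


def prob_any_straggler(num_messages, p_delay):
    """Chance that at least one of num_messages is delayed.

    Assumes (hypothetically) that each message is delayed independently
    with probability p_delay.
    """
    return 1.0 - (1.0 - p_delay) ** num_messages


# Illustrative arrival times in milliseconds: three prompt messages, one straggler.
arrivals = [1.02, 1.05, 1.01, 6.30]

mean_ms = sum(arrivals) / len(arrivals)
tail_ms = step_time(arrivals)
print(f"mean arrival: {mean_ms:.2f} ms, phase completes at: {tail_ms:.2f} ms")

# The same per-message delay probability hurts far more at scale.
for n in (8, 1024):
    print(f"{n} messages: P(at least one straggler) = {prob_any_straggler(n, 0.01):.4f}")
```

Under these assumptions, a delay that affects only 1% of messages is rare in a small job but nearly certain to hit every phase of a job exchanging a thousand messages, which is why tail latency rather than mean latency gates progress.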

To achieve low tail latency, the UEC specification will address critical networking requirements for the next generation of applications, including:

  • Multi-pathing and packet spraying
  • Flexible delivery order
  • Modern congestion control mechanisms
  • End-to-end telemetry
  • Larger scale, stability, and reliability

“This last point places an extra burden on all of the previous ones,” the UEC stated. “High-performance systems leave little margin for error, which compounds in a larger network. Determinism and predictability become more difficult as systems grow, necessitating new methods to achieve holistic stability.”

Another of the challenges UEC is working to address for AI and high-performance networks is setting up the ability to support multiple pathways for communications between clusters.

 “While we have multi-path communications today, we typically only use one highway to interconnect. So, if there’s a problem there, the whole system slows down,” Elzur said. “We need to enable a system that has many network highways that we can use all of the time and all of them always need to be clear.”
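To make the "one highway" problem concrete, the following Python sketch compares two generic load-distribution strategies: pinning each flow to a single path by hashing its identifier (as flow-based equal-cost multipath schemes commonly do) and spraying individual packets across all paths. This is a simplified illustration, not the UEC's mechanism; the flow names, packet counts, path count, and the CRC32 hash are assumptions made for the example.

```python
import zlib


def flow_hash_loads(flows, num_paths):
    """Pin every packet of a flow to one path chosen by hashing the flow ID."""
    loads = [0] * num_paths
    for flow_id, packets in flows.items():
        path = zlib.crc32(flow_id.encode()) % num_paths
        loads[path] += packets
    return loads


def spray_loads(flows, num_paths):
    """Spread packets over all paths in round-robin order, ignoring flow boundaries."""
    loads = [0] * num_paths
    next_path = 0
    for packets in flows.values():
        for _ in range(packets):
            loads[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return loads


# Hypothetical workload: three large flows sharing four equal-cost paths.
flows = {"gpu0->gpu8": 4000, "gpu1->gpu9": 4000, "gpu2->gpu10": 4000}
NUM_PATHS = 4

pinned = flow_hash_loads(flows, NUM_PATHS)
sprayed = spray_loads(flows, NUM_PATHS)
print("per-path packets, flow hashing:  ", pinned)
print("per-path packets, packet spraying:", sprayed)
```

With three flows and four paths, flow hashing necessarily leaves at least one path idle while another carries a full flow or more, whereas spraying evens out the load. The trade-off is that sprayed packets can arrive out of order, which is one reason flexible delivery order appears alongside packet spraying in the UEC's list of requirements.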

Ethernet Alliance releases 2024 roadmap

AI networking also made its way into another group’s annual roadmap. In the newly released 2024 Ethernet Roadmap, the Ethernet Alliance called AI/ML the new killer app for the Ethernet industry.

Ethernet is evolving to meet the market demands for AI/ML services and other applications with its continued progression towards higher speed interfaces, the widening variety of interconnect options, and advancements in power efficiency, according to Peter Jones, chairman of the Ethernet Alliance.

There are a number of questions around the use of Ethernet over current InfiniBand operations, Jones added. “The real question for me is: How much of the old stuff do you need to be an effective replacement or alternative to what’s being done today? Ideally, the new technology does everything the old technology did, better and cheaper,” Jones said. 

“The bigger changes here will be around doing things like load balancing, and the protocols that run over the top to make things work together,” Jones said.  

Sustainability is also a hot topic for the Ethernet industry, and the Ethernet Alliance highlighted it in its 2024 roadmap. If you look at data center equipment and the network, the proportion of the energy bill being used by the network has been increasing, Jones said. “You get to the stage where you can’t fault the rack because you’ve already consumed all of your power and all of your cooling,” Jones said. “The big issue for Ethernet is going to be how do we increase the service and reduce the power? Ultimately focusing on efficiency and effectiveness is going to give us better product.”

2024 Optical Fiber Communication conference

At this week’s 2024 Optical Fiber Communication (OFC) conference and exhibition, members showed off a couple of the Ethernet Alliance’s core roadmap directions: multivendor interoperability and reliability at speeds up to 800 Gigabit Ethernet (GbE).

The Alliance’s installation at the OFC conference incorporated a wide range of switches, routers and interconnects from Arista, Cisco, Juniper, Marvell, Spirent, Synopsys and others. Interfaces included OSFP, QSFP-DD, QSFP, and SFP pluggable form factors. The demo also featured test and measurement offerings, including physical layer and traffic generation tools, for ensuring Ethernet’s capacity to accommodate even the most demanding applications, the group stated.

And there are data center applications demanding such Ethernet changes, according to Kevin Wollenweber, senior vice president and general manager of Cisco’s networking, data center and provider connectivity organization.

“There is no doubt that there is an unrelenting expansion of data center traffic that is fueling the demand for high capacity and highly intelligent data center networking solutions,” Wollenweber said. “And with Ethernet being ubiquitous in enterprise data centers and evolving with increased speeds of 400G, 800G and ultimately 1.6T on the horizon, it continues to be the one network enterprises will use to run nearly all applications. We see AI, both generative and training of models, to be some of the biggest drivers of growth. Beyond AI/ML, there are a number of applications that can take full advantage of the higher Ethernet speeds. These include high-performance compute and applications and storage in particular.”

According to Wollenweber, other future Ethernet growth drivers include:

  • Media content providers and broadcasters can leverage Ethernet to meet the evolving demand for more content and rich media experiences, including more camera feeds, higher resolutions with 4K and 8K video, and virtual reality capabilities.
  • The convergence of application and IP storage (NFS, iSCSI or similar) traffic on the same network is increasing bandwidth requirements. Enterprises are using more IP storage and converging their data networks into one modern data center network, powered by Ethernet, capable of dealing with congestion and providing non-blocking bandwidth for their applications.
  • With the increase in graphics resolutions, Ethernet will also support emerging multiplayer cloud gaming and real-time video translation applications.

The 650 Group said recently that Ethernet networking speeds will continue to increase at a rapid pace to keep up with AI and machine learning workloads. Early 2024 demonstrations of 1.6 TbE show that Ethernet is keeping pace with AI/ML networking requirements, and 650 Group projected that 1.6 TbE will be the dominant port speed by 2030.

Likewise, the Dell’Oro Group says nearly half of data center switch ports will be driven by 400G speeds and higher by 2027, and 800G is expected to eclipse 400G by 2025.