by Michael Cooney

Senior Editor

Cisco, Arista, HPE, Intel lead consortium to supersize Ethernet for AI infrastructures

News Analysis

Jul 20, 2023 | 5 mins

Cisco Systems, Generative AI, Microsoft

Backed by the Linux Foundation, the new Ultra Ethernet Consortium aims to increase the scale, stability, and reliability of Ethernet networks to satisfy AI’s high-performance networking requirements.

AI workloads are expected to put unprecedented performance and capacity demands on networks, and a handful of networking vendors have teamed up to enhance today’s Ethernet technology in order to handle the scale and speed required by AI.

AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft announced the Ultra Ethernet Consortium (UEC), a group hosted by the Linux Foundation that’s working to develop physical, link, transport and software layer Ethernet advances.

The industry celebrated Ethernet’s 50th anniversary this year. The hallmark of Ethernet has been its flexibility and adaptability, and the venerable technology will undoubtedly play a critical role when it comes to supporting AI infrastructures. But there are concerns that today’s traditional network interconnects cannot provide the required performance, scale and bandwidth to keep up with AI demands, and the consortium aims to address those concerns.

“AI workloads are demanding on networks as they are both data- and compute-intensive. The workloads are so large that the parameters are distributed across thousands of processors. Large Language Models (LLMs) such as GPT-3, Chinchilla, and PALM, as well as recommendation systems like DLRM [deep learning recommendation] and DHEN [Deep and Hierarchical Ensemble Network] are trained on clusters of many 1000s of GPUs sharing the ‘parameters’ with other processors involved in the computation,” wrote Arista CEO Jayshree Ullal in a blog about the new consortium. “In this compute-exchange-reduce cycle, the volume of data exchanged is so significant that any slowdown due to a poor/congested network can critically impact the AI application performance.”
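Ullal’s point about the compute-exchange-reduce cycle can be illustrated with a toy model. The sketch below assumes a synchronous step in which every worker computes, then exchanges the same amount of data, and the step cannot finish until the slowest exchange completes. The function name `step_time`, the 1 GB exchange size, and the link speeds are illustrative assumptions, not figures from Arista or the UEC.

```python
def step_time(compute_s, bytes_per_worker, link_gbps):
    """Duration of one synchronous step: compute, then wait for the slowest exchange.

    link_gbps holds one (assumed) effective link speed per worker, in Gbps.
    """
    transfer_s = [bytes_per_worker * 8 / (gbps * 1e9) for gbps in link_gbps]
    return compute_s + max(transfer_s)


# Eight workers, 0.10 s of compute, 1 GB exchanged per worker (all assumed values).
healthy = step_time(0.10, 1e9, [400.0] * 8)
one_slow = step_time(0.10, 1e9, [400.0] * 7 + [100.0])  # one congested link

print(f"healthy step:  {healthy:.3f} s")
print(f"one slow link: {one_slow:.3f} s")
print(f"slowdown:      {one_slow / healthy:.2f}x")
```

With these assumed numbers, a single link running at a quarter of the speed of the others stretches the step from about 0.12 seconds to about 0.18 seconds, even though the other seven links are healthy.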

Historically, the only options for connecting processor cores and memory have been interconnects such as InfiniBand, PCI Express, Remote Direct Memory Access (RDMA) over Ethernet, and other protocols that connect compute clusters with offloads but have limitations when it comes to AI workload requirements.

“Arista and Ultra Ethernet Consortium’s founding members believe it is time to reconsider and replace RDMA limitations. Traditional RDMA, as defined by InfiniBand Trade Association (IBTA) decades ago, is showing its age in highly demanding AI/ML network traffic. RDMA transmits data in chunks of large flows, and these large flows can cause unbalanced and over-burdened links,” Ullal wrote.

“It is time to begin with a clean slate to build a modern transport protocol supporting RDMA for emerging applications,” Ullal wrote. “The [consortium’s] UET (Ultra Ethernet Transport) protocol will incorporate the advantages of Ethernet/IP while addressing AI network scale for applications, endpoints and processes, and maintaining the goal of open standards and multi-vendor interoperability.”

The UEC wrote in a white paper that it will advance an Ethernet specification featuring a number of core technologies and capabilities, including:

Multi-pathing and packet spraying to ensure AI workflows can use multiple paths to a destination simultaneously.
Flexible delivery order to make sure Ethernet links are optimally balanced; ordering is only enforced when the AI workload requires it in bandwidth-intensive operations.
Modern congestion-control mechanisms to ensure AI workloads avoid hotspots and evenly spread the load across multiple paths. These mechanisms can be designed to work in conjunction with multipath packet spraying, enabling reliable transport of AI traffic.
End-to-end telemetry to manage congestion. Information originating from the network can advise the participants of the location and cause of the congestion. Shortening the congestion signaling path and providing more information to the endpoints allows more responsive congestion control.
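
To see why packet spraying matters, consider a rough comparison with per-flow hashing, the common approach in which every packet of a flow follows the same path. The sketch below is a simplified model, not the UEC’s algorithm: the function names, the CRC32 hash, the round-robin spraying, and the flow sizes are all assumptions chosen for illustration.

```python
import zlib


def per_flow_hashing(flows, n_links):
    """Pin each flow to one link chosen by a hash of its ID (ECMP-style)."""
    load = [0] * n_links
    for flow_id, size in flows:
        link = zlib.crc32(flow_id.encode()) % n_links
        load[link] += size
    return load


def packet_spraying(flows, n_links, packet_size=4096):
    """Split each flow into packets and spread them round-robin over all links."""
    load = [0] * n_links
    next_link = 0
    for _flow_id, size in flows:
        remaining = size
        while remaining > 0:
            chunk = min(packet_size, remaining)
            load[next_link] += chunk
            next_link = (next_link + 1) % n_links
            remaining -= chunk
    return load


def imbalance(load):
    """Ratio of the busiest link's load to the average load (1.0 is perfect)."""
    return max(load) / (sum(load) / len(load))


# Four large flows of 1 MiB each crossing eight parallel links (assumed values).
flows = [(f"gpu{i}->gpu{i + 4}", 1 << 20) for i in range(4)]
hashed = per_flow_hashing(flows, n_links=8)
sprayed = packet_spraying(flows, n_links=8)

print("per-flow hashing imbalance:", imbalance(hashed))
print("packet spraying imbalance: ", imbalance(sprayed))
```

In this model, four large flows can occupy at most four of the eight links under per-flow hashing, so the busiest link carries at least twice the average load, while spraying spreads the same bytes evenly. Spraying also delivers packets out of order across paths, so it only works if the receiver tolerates reordering, which is the property the flexible delivery order item describes.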

The UEC said it will increase the scale, stability, and reliability of Ethernet networks along with improved security.

“The UEC transport incorporates network security by design and can encrypt and authenticate all network traffic sent between computation endpoints in an AI training or inference job. The UEC will develop a transport protocol that leverages the proven core techniques for efficient session management, authentication, and confidentiality from modern encryption methods like IPSec and PSP,” the UEC wrote.

“As jobs grow, it is necessary to support encryption without ballooning the session state in hosts and network interfaces. In service of this, UET incorporates new key management mechanisms that allow efficient sharing of keys among tens of thousands of compute nodes participating in a job. It is designed to be efficiently implemented at the high speeds and scales required by AI training and inference,” the UEC stated.
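
A back-of-the-envelope calculation shows why session state matters at this scale. The sketch below compares the keys needed if every pair of nodes in a job keeps its own key with a hypothetical scheme in which the whole job shares one key. The shared-key scheme is an illustrative extreme, not a description of UET’s actual key management, and the function names are made up for this example.

```python
def pairwise_key_state(nodes):
    """Keys needed when every pair of nodes keeps its own key.

    Returns (keys held by each node, distinct keys across the job).
    """
    per_node = nodes - 1
    total = nodes * (nodes - 1) // 2
    return per_node, total


def shared_job_key_state(nodes):
    """Hypothetical extreme: one key shared by every node in the job."""
    return 1, 1


nodes = 10_000  # assumed job size, in line with "tens of thousands" of nodes
per_node, total = pairwise_key_state(nodes)
print(f"pairwise: {per_node:,} keys per node, {total:,} keys in total")
print("shared job key:", shared_job_key_state(nodes))
```

Under the pairwise assumption, state per node grows linearly with job size and total state grows quadratically, which is the kind of ballooning the UEC says it wants to avoid.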

“This isn’t about overhauling Ethernet,” said Dr. J Metz, chair of the Ultra Ethernet Consortium, in a statement. “It’s about tuning Ethernet to improve efficiency for workloads with specific performance requirements. We’re looking at every layer – from the physical all the way through the software layers – to find the best way to improve efficiency and performance at scale.”

The need for improved AI connectivity technology is beginning to emerge. For example, in its most recent “Data Center 5-Year July 2023 Forecast Report,” the Dell’Oro Group stated that 20% of Ethernet data center switch ports will be connected to accelerated servers to support AI workloads by 2027. The rise of new generative AI applications will help fuel more growth in an already robust data center switch market, which is projected to exceed $100 billion in cumulative sales over the next five years, said Sameh Boujelbene, vice president at Dell’Oro.

In another recently released report, the 650 Group stated that AI/ML places tremendous bandwidth and performance requirements on the network, and that AI/ML is one of the major growth drivers for data center switching over the next five years.

“With bandwidth in AI growing, the portion of Ethernet switching attached to AI/ML and accelerated computing will migrate from a niche today to a significant portion of the market by 2027. We are about to see record shipments in 800Gbps based switches and optics as soon as products can reach scale in production to address AI/ML,” said Alan Weckel, founder and technology analyst at 650 Group.

Michael Cooney is a Senior Editor with Network World who has written about the IT world for more than 25 years. He can be reached at michael_cooney@foundryco.com.
