Monday, December 23, 2024

How AWS engineers infrastructure to power generative AI


Delivering low-latency, large-scale networking

Generative AI models require massive amounts of data to train and run efficiently. The larger and more complex the model, the longer the training time. Longer training times not only increase operating costs but also slow down innovation. Traditional networks cannot deliver the low latency and large scale that generative AI model training needs.

We’re constantly working to reduce network latency and improve performance for customers. Our approach is unique in that we have built our own network devices and network operating systems for every layer of the stack: the network interface card, the top-of-rack switch, the data center network, the internet-facing router, and our backbone routers. This approach not only gives us greater control over improving security, reliability, and performance for customers, but also enables us to innovate faster than others. For example, in 2019, we introduced Elastic Fabric Adapter (EFA), a network interface custom-built by AWS that provides operating system bypass capabilities to Amazon EC2 instances. This enables customers to run applications requiring high levels of inter-node communication at scale. EFA uses Scalable Reliable Datagram (SRD), a high-performance, lower-latency network transport protocol that was designed specifically by AWS, for AWS.

More recently, we moved fast to deliver a new network for generative AI workloads. Our first-generation UltraCluster network, built in 2020, supported 4,000 graphics processing units, or GPUs, with a latency of eight microseconds between servers. The new network, UltraCluster 2.0, supports more than 20,000 GPUs with a 25% reduction in latency. It was built in just seven months, a pace that would not have been possible without our long-term investment in custom network devices and software. Internally, we call UltraCluster 2.0 the “10p10u” network, as it delivers tens of petabits per second of throughput, with a round-trip time of less than 10 microseconds. The new network results in at least a 15% reduction in the time to train a model.
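
To put the UltraCluster figures above side by side, here is a minimal Python sketch of the arithmetic. It rests on two assumptions that the article does not state: that the 25% latency reduction applies to the eight-microsecond server-to-server figure, and that a training run takes 30 days on the older network (a purely hypothetical number chosen for illustration). The helper name `apply_reduction` is also made up for this example.

```python
def apply_reduction(value, percent):
    """Return `value` after reducing it by `percent` percent."""
    return value * (100 - percent) / 100


# Figures quoted in the article.
GEN1_GPUS = 4_000
GEN1_LATENCY_US = 8            # microseconds between servers
GEN2_GPUS = 20_000             # "more than 20,000", so treat as a lower bound
LATENCY_REDUCTION_PCT = 25
TRAINING_REDUCTION_PCT = 15    # "at least 15%", so treat as a lower bound

# Assumption: the 25% reduction is measured against the 8 microsecond figure.
gen2_latency_us = apply_reduction(GEN1_LATENCY_US, LATENCY_REDUCTION_PCT)

# Minimum growth in supported GPUs between the two generations.
gpu_scale = GEN2_GPUS / GEN1_GPUS

# Assumption: a hypothetical 30-day training run on the older network.
HYPOTHETICAL_RUN_DAYS = 30
new_run_days = apply_reduction(HYPOTHETICAL_RUN_DAYS, TRAINING_REDUCTION_PCT)

print(f"Implied server-to-server latency: {gen2_latency_us} microseconds")
print(f"GPU scale-up: at least {gpu_scale}x")
print(f"Hypothetical 30-day run: at most {new_run_days} days")
```

Under these assumptions, the implied latency is about 6 microseconds, the supported GPU count grows at least fivefold, and the hypothetical 30-day run shortens to at most 25.5 days.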
