Intel's chief architect Olivier Franza has revealed details of the Aurora supercomputer, a project set to redefine the future of supercomputing. Housed at Argonne National Laboratory in Illinois, Aurora is one of the most anticipated and highly visible projects in Intel's entire system portfolio and marks a significant milestone for the company.
Aurora's Unprecedented Performance
Aurora is expected to be the world's first supercomputer with a peak performance reaching two exaflops, equivalent to 2x10^18 floating-point operations per second. It is also the world's largest GPU cluster. Delivering a system of that scale has put immense pressure on Franza, a 22-year Intel veteran who became the chief architect in 2021.
Franza's responsibilities include defining the overall system architecture, setting overall performance targets and the power envelope, and specifying essential features such as reliability, availability, and serviceability. The architecture encompasses everything from individual nodes to the complete system, including the networking fabric and storage components.
A Shift in Roadmap Leads to Innovation
The initial plan for Aurora was based on a collection of Intel technologies. However, changes to Intel's product roadmap, including the end of the Xeon Phi and Omni-Path product families, required a restart. This led to the design of the Intel® Data Center GPU Max Series (code-named Ponte Vecchio).
Aurora's design has informed Intel's strategy and product portfolio, addressing scale and performance at the highest level. The architecture and concept for the Intel® Xeon® CPU Max Series were inspired by features from the Intel Xeon Phi platform. The need for high performance drove advances across all subsystems, including storage.
Intel even architected a completely new storage concept, DAOS (Distributed Asynchronous Object Storage), an open-source software ecosystem for high-speed storage. Aurora will be among the first and largest systems to use it.
Collaboration and Design Complexity
The Aurora project required system-level thinking and broad collaboration among various Intel business units, Argonne scientists, and engineers at Hewlett Packard Enterprise. The system is made up of 10,624 compute blades, 63,744 Intel Max Series GPUs, and 21,248 Intel Xeon Max CPUs across 166 racks, occupying an area the size of four tennis courts.
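The component counts above imply a regular layout, which a few lines of Python can check. This is only a back-of-the-envelope sketch: the totals come from the article, while the helper name per_unit and the assumption that blades are spread evenly across racks are illustrative and not something Intel has stated here.

```python
# Component totals as quoted in the article.
BLADES = 10_624
GPUS = 63_744
CPUS = 21_248
RACKS = 166


def per_unit(total: int, units: int) -> int:
    """Return total / units, insisting that the split is exact.

    Hypothetical helper for this sketch: an uneven split would mean the
    'same number per blade (or rack)' assumption does not hold.
    """
    quotient, remainder = divmod(total, units)
    if remainder:
        raise ValueError(f"{total} does not divide evenly by {units}")
    return quotient


gpus_per_blade = per_unit(GPUS, BLADES)     # GPUs on each compute blade
cpus_per_blade = per_unit(CPUS, BLADES)     # CPUs on each compute blade
blades_per_rack = per_unit(BLADES, RACKS)   # assumes an even spread over racks

print(f"{gpus_per_blade} GPUs and {cpus_per_blade} CPUs per blade, "
      f"{blades_per_rack} blades per rack")
```

All three divisions come out exact, so the quoted totals are consistent with every blade carrying the same mix of GPUs and CPUs.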
The final blade was installed in June, but the project continues through testing, stabilization, and validation at scale. Franza leads a large team working on system bring-up, validation, stabilization, optimization, and full-system performance workloads.
A Once-in-a-Lifetime Effort for Impactful Research
What drives the team is the opportunity to build an extraordinary machine that will power impactful research, including cancer research and generative AI. Aurora will enable one of the biggest large language models planned to date, the 1-trillion-parameter Aurora GenAI project.
Franza emphasizes the teamwork and perseverance required for such an immense challenge, saying it calls for a marathon mentality. The accomplishment, he says, is something that very few can claim to have achieved.