Introduction

Leveraging massive training datasets, models with extremely high parameter counts, and powerful computational capabilities to abstract information from data has become the mainstream path toward Artificial General Intelligence (AGI). This makes the efficient use of computing power particularly urgent and calls for a systematic upgrade of cloud computing software and hardware architectures.

In 2009, Alibaba Cloud proposed the visionary concept that “A data center is a single computer.” In today’s AI era, this technical vision has become even more essential. Acting as a supercomputer, the cloud can efficiently integrate heterogeneous computing resources, break through the performance limits of individual chips, and collaboratively accomplish large-scale intelligent computing tasks.

A Brand-New AI-Oriented Cloud Computing Infrastructure Has Been Developed

Compared with the traditional IT era, the AI era demands higher performance and efficiency from infrastructure. The computing architecture, once dominated by CPUs, has rapidly shifted toward a GPU-centric AI computing system. Alibaba Cloud is rebuilding its underlying hardware, computing, storage, networking, databases, and big data systems with AI at the core, integrating and adapting these elements organically to AI scenarios. This accelerates model development and application, building cloud computing infrastructure for the AI era.

The newly launched Panjiu AI server supports up to 16 GPUs per machine. The high-performance network architecture HPN 7.0, designed for AI, can stably connect over 100,000 GPUs, improving end-to-end training performance by more than 10%. Alibaba Cloud’s CPFS (Cloud Parallel File System) delivers 20 TB/s of data throughput, offering exponentially scalable storage for AI workloads. The new generation of intelligent computing products, Lingjun, is built on a novel computing, storage, and networking technology stack, providing cluster-level accelerated computing services for AI applications. The container computing service has introduced GPU container computing power for the first time, enhancing compute affinity and performance through topology-aware scheduling. The artificial intelligence platform PAI (Platform for AI) has achieved integrated elastic scheduling for training and inference at the ten-thousand-GPU level, with effective AI computing power utilization exceeding 90%.
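The core idea behind topology-aware scheduling can be sketched as follows. This is a generic illustration, not Alibaba Cloud’s implementation: the `domain_of` mapping and allocation policy are assumptions, standing in for whatever interconnect topology (NVLink, PCIe switch, rail) a real scheduler would discover.

```python
from collections import defaultdict

def topology_aware_allocate(free_gpus, domain_of, num_needed):
    """Pick `num_needed` free GPUs, preferring GPUs that share one
    interconnect domain (e.g. the same NVLink/PCIe switch) so that
    collective communication stays on the fastest links.

    free_gpus : list of free GPU ids
    domain_of : dict mapping GPU id -> interconnect-domain label
    """
    by_domain = defaultdict(list)
    for gpu in free_gpus:
        by_domain[domain_of[gpu]].append(gpu)

    # Best case: a single domain can satisfy the whole request.
    # Scan domains smallest-first so large domains stay free for big jobs.
    for _, gpus in sorted(by_domain.items(), key=lambda kv: len(kv[1])):
        if len(gpus) >= num_needed:
            return gpus[:num_needed]

    # Otherwise spill across domains, taking the fullest domains first
    # to minimize how many domains (slow cross-domain hops) are involved.
    chosen = []
    for _, gpus in sorted(by_domain.items(), key=lambda kv: -len(kv[1])):
        chosen.extend(gpus)
        if len(chosen) >= num_needed:
            return chosen[:num_needed]
    return None  # not enough free GPUs
```

A 2-GPU request here lands on the smallest domain that fits it whole, while an oversized request is packed into as few domains as possible; production schedulers layer on fragmentation avoidance and per-rail bandwidth modeling on top of this basic affinity rule.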

This project has gained industry recognition, appearing in analytical reports from well-known firms such as Forrester and Gartner, while advances in its core technologies have been published at several top academic venues, including the ACM and IEEE conferences SIGCOMM, HPCA, and ICDE.

Supporting the Innovative Development of the AI Industry

The product solutions built on this project’s outcomes serve over 50% of China’s large model enterprises and institutions: in China, 80% of technology companies and over 50% of large model startups use Alibaba Cloud’s AI infrastructure services. The aggregate market capitalization of the AI-native enterprises served exceeds 84 billion RMB, and open-source models from upper-layer enterprise users have been downloaded more than 32 million times in total, with multiple models ranking at the top of relevant performance leaderboards. Alibaba Cloud’s AI infrastructure not only provides strong technical support but also promotes innovation and development in large model applications through means such as building an open ecosystem and participating in international exchanges.

From Academic Breakthroughs to Industry Empowerment

Driving Research and Innovation in AI Infrastructure: The flow control algorithm HPCC, aimed at large-scale, high-complexity scenarios such as AI training and storage, was accepted at SIGCOMM 2019. The design of an artificial intelligence cluster architecture was accepted at HPCA 2020, and the AI cluster network HPN 7.0, which scales to hundreds of thousands of nodes, was accepted at SIGCOMM 2024. These contributions set a new high-performance benchmark for the design and practice of global AI network infrastructure.

Promoting Innovation in AI Applications: In 2022, we collaborated with Xiaopeng Motors to build the autonomous driving intelligent computing center “Fuyiao”, cutting the training time of Xiaopeng’s core autonomous driving model from 7 days to under 1 hour, a roughly 170-fold increase in iteration speed. In 2023, we helped Geely establish the Xingrui intelligent computing center, improving the overall R&D efficiency of Geely’s autonomous driving business by 20%. Also in 2023, we supported Fudan University in building the CFFF research computing platform, which received numerous accolades, including the “Innovation Pioneer” case award at the 2023 China Computing Power Conference and the “Smart Empowerment Leading Case” award at the 2023 China International Trade in Services Expo. At the 2024 Paris Olympics, Alibaba Cloud not only supported the full migration of video broadcasting to the cloud but also reshaped the viewing experience with comprehensive AI technology capabilities.

The World Internet Conference (WIC) was established as an international organization on July 12, 2022, and is headquartered in Beijing, China. It was jointly initiated by the Global System for Mobile Communications Association (GSMA), the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT), the China Internet Network Information Center (CNNIC), Alibaba Group, Tencent, and Zhijiang Lab.