On December 2nd, Baidu released X-MAN3.0, a super AI computing platform optimized for deep neural networks, at the 2018 Conference on Neural Information Processing Systems (NeurIPS) held in Montreal, Canada. Jointly developed by Inspur and Baidu, the X-MAN3.0 solution can achieve 2000 trillion deep neural network operations per second.
As a leading global conference in artificial intelligence, NeurIPS covers 156 fields including deep learning, neuroscience, cognitive science, psychology, computer vision, statistical linguistics and information theory. As a main driving force behind the development of AI, deep learning has been one of the most talked-about topics at the conference. Innovation in computing technologies, alongside data and algorithms, is one of the most important factors that have propelled the advancement of deep learning.
Two-level PCIe Switches and Pooling of GPU Resources
As one of Baidu’s most important strategic partners in the field of data center computing and storage infrastructure, Inspur has long been working with Baidu to develop AI-specific computing platforms, including X-MAN3.0, a specialized platform for ultra-large-scale AI training. The first generation of the product was released in 2016, and the platform is now in its third generation.
The 8U X-MAN3.0 consists of two independent 4U AI modules, each supporting 8 of the latest NVIDIA V100 GPUs. The two AI modules are connected by high-speed interconnect backplanes with 48 NVLink links. The GPUs can communicate directly through Switch, and the overall unidirectional bandwidth among all GPUs is up to 2400 GB/s.
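The headline figures can be cross-checked with simple arithmetic. The sketch below is only a plausibility check: the per-GPU values are assumptions taken from NVIDIA's published V100 (SXM2) specifications, which this article does not state, and the function names are hypothetical. Under those assumptions, 16 GPUs yield both the 2000 trillion operations per second and the 2400 GB/s quoted here, although how Baidu and Inspur actually derived their numbers is not stated.

```python
# Back-of-the-envelope check of the X-MAN3.0 headline figures.
# The per-GPU numbers are ASSUMPTIONS from NVIDIA's public V100 SXM2
# specifications; they do not come from the article itself.

MODULES = 2              # independent 4U AI modules (from the article)
GPUS_PER_MODULE = 8      # V100 GPUs per module (from the article)

TENSOR_TFLOPS_PER_GPU = 125   # assumed: peak Tensor Core throughput per V100
NVLINK_LINKS_PER_GPU = 6      # assumed: NVLink links per V100
GB_PER_S_PER_LINK = 25        # assumed: GB/s per link, in one direction


def total_gpus(modules=MODULES, per_module=GPUS_PER_MODULE):
    """Number of GPUs in the full 8U system."""
    return modules * per_module


def peak_tflops(gpus, per_gpu=TENSOR_TFLOPS_PER_GPU):
    """Aggregate peak throughput in trillions of operations per second."""
    return gpus * per_gpu


def unidirectional_bandwidth(gpus, links=NVLINK_LINKS_PER_GPU,
                             per_link=GB_PER_S_PER_LINK):
    """Aggregate one-way NVLink bandwidth across all GPUs, in GB/s."""
    return gpus * links * per_link


if __name__ == "__main__":
    gpus = total_gpus()
    print("GPUs:", gpus)
    print("Peak trillion ops/s:", peak_tflops(gpus))
    print("Unidirectional GB/s:", unidirectional_bandwidth(gpus))
```

These are theoretical peaks; sustained training throughput depends on the model, precision and communication pattern.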
X-MAN3.0 is also equipped with two levels of PCIe switches supporting interconnection among CPUs, AI accelerators and other I/O devices. The logical relationship between CPUs and GPUs can be set in a software-defined manner, so as to flexibly support diversified AI workloads without system bottlenecks. This is a significant difference between X-MAN3.0 and other products in the industry.
Super AI Computing Platform Optimized for Deep Neural Networks
Today, large-scale distributed training poses growing challenges for computing platforms. To improve the accuracy of AI models, the average size of training datasets has increased more than 300-fold. By the end of 2017, the number of labelled images in Google’s Open Images dataset had reached 9 million. Model complexity has surged so quickly that some Internet companies’ AI models have reached 100 billion parameters.
This surge in data requires users to deploy larger GPU computing platforms with greater scale-up capability to cope with the growing demands of communication between GPUs. For example, the three-dimensional Fast Fourier Transform, an algorithm commonly used in AI models, requires one global communication for every three operations in GPU parallel processing, which makes it heavily dependent on the communication bandwidth between GPUs.
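One way to read the "one global communication for every three operations" pattern is the common slab decomposition of a 3D transform: each device runs two local passes of 1D transforms, all devices exchange data once, and a third local pass finishes the job. The sketch below illustrates that reading under several assumptions: devices are simulated with plain Python lists, a naive DFT stands in for a real FFT, and all function names are hypothetical. It is not a description of how X-MAN3.0 or any particular library implements the algorithm.

```python
import cmath


def dft(xs):
    """Naive O(n^2) 1D discrete Fourier transform (stand-in for an FFT)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n)
                for j, x in enumerate(xs))
            for k in range(n)]


def transform_plane(plane):
    """Transform one [y][z] plane along z, then along y."""
    along_z = [dft(row) for row in plane]
    along_y = [dft(list(col)) for col in zip(*along_z)]  # indexed [z][y]
    return [list(row) for row in zip(*along_y)]          # back to [y][z]


def distributed_dft3d(cube, devices):
    """3D transform of cube[x][y][z] using a simulated slab decomposition."""
    n = len(cube)
    per = n // devices  # planes per device; assumes devices divides n

    # Each simulated device owns a contiguous block of x-planes.
    slabs = [cube[d * per:(d + 1) * per] for d in range(devices)]

    # Operations 1 and 2: local transforms along z and y, no communication.
    slabs = [[transform_plane(p) for p in slab] for slab in slabs]

    # Global communication: every device sends part of its slab to every
    # other device, so each ends up with all x values for its block of y.
    planes = [p for slab in slabs for p in slab]  # stands in for all-to-all
    blocks = [[[planes[x][y] for x in range(n)]
               for y in range(d * per, (d + 1) * per)]
              for d in range(devices)]

    # Operation 3: local transform along x, then write back as [x][y][z].
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for d, block in enumerate(blocks):
        for y_local, rows in enumerate(block):  # rows is indexed [x][z]
            y = d * per + y_local
            for z in range(n):
                column = dft([rows[x][z] for x in range(n)])
                for kx in range(n):
                    out[kx][y][z] = column[kx]
    return out


def dft3d_direct(cube):
    """Reference 3D DFT computed straight from the definition."""
    n = len(cube)
    idx = range(n)
    return [[[sum(cube[x][y][z]
                  * cmath.exp(-2j * cmath.pi * (kx * x + ky * y + kz * z) / n)
                  for x in idx for y in idx for z in idx)
              for kz in idx] for ky in idx] for kx in idx]
```

The all-to-all step moves most of the data set between devices, which is why the bandwidth between GPUs, rather than raw compute, tends to bound this kind of workload.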
X-MAN3.0 supports the largest number of GPUs among today’s computing platforms. With Switch, the platform can alleviate communication bottlenecks, delivering greater-than-expected application value for Internet companies’ ultra-large-scale AI training.
With the rapid development of deep learning, silicon chip giants as well as start-ups are developing new AI accelerators, which are expected to be deployed in late 2019. This gives large Internet companies more choices. In light of this, X-MAN3.0 is designed around modular hardware components, standard interfaces and flexible topologies, which provides a key technical foundation for Baidu to adopt more competitive AI training solutions quickly and efficiently.
Achieving 6 Industry Records in AI Computing Platforms
As a pioneer in deep learning research and application, Baidu has successfully launched 3 generations of its AI computing platform X-MAN, achieving 6 industry records with the series. X-MAN1.0 was first released in Q2 2016 and achieved 5 records: a single compute node supporting 16 AI accelerators, system scalability to support 4/8/16/32/64 cards, a hardware disaggregation architecture between CPUs and AI accelerators, a PCIe Fabric architecture that dynamically allocates AI accelerators based on specific workload requirements, and peer-to-peer communication capability with native performance even in virtual machines. X-MAN2.0, released in Q3 2017, achieved another record by solving the thermal challenge with liquid cooling technology, and was thus able to support a larger number of more powerful AI accelerators.
In recent years, through its JDM (joint design manufacturing) mode, Inspur has been providing innovative and customized computing platforms for Internet companies. In addition to the X-MAN series, Inspur and Baidu have also jointly developed a number of industry-leading products, such as the ABC all-in-one compute and storage platform, the SR-AI Rack, a scalable platform which supports up to 64 GPUs in a single physical cluster, and the Scorpio Rack Standard-based cold storage server. Each of these has been deployed at scale in Baidu, greatly improving the computing power and scalability of Baidu data centers.
About Inspur: As the world’s leading AI computing-power provider, Inspur is committed to creating an agile, efficient and optimized AI infrastructure at four layers: computing platform, management suite, framework optimization, and application acceleration. Inspur has become the most important AI server supplier for top players in the Internet field and is also working closely with leading AI companies in systems and applications, joining with partners such as IFLYTEK, SenseTime, Face++, Toutiao and DiDi to help achieve order-of-magnitude performance increases in speech, image, video, search and network applications.