SBIR-STTR Award

Accelerating MPI and PGAS Libraries for HPC and Deep Learning Applications on Exascale Systems with Programmable Smart NICs
Award last edited on: 12/23/2020

Sponsored Program
SBIR
Awarding Agency
DOE
Total Award Amount
$1,845,800
Award Phase
2
Solicitation Topic Code
02b
Principal Investigator
Donglai Dai

Company Information

X-ScaleSolutions LLC

750 Deer Run Drive
Columbus, OH 43230
(614) 316-4209
contactus@x-scalesolutions.com
www.x-scalesolutions.com
Location: Single
Congr. District: 03
County: Franklin

Phase I

Contract Number: DE-SC0020554
Start Date: 2/18/2020    Completed: 2/17/2021
Phase I year
2020
Phase I Amount
$206,500
The extremely high compute and communication capabilities offered by modern CPUs/GPUs and high-performance interconnects have led to the creation of HPC platforms with dense many-core CPUs, multiple GPUs, and high-performance interconnects. State-of-the-art implementations of popular parallel programming libraries like MPI and SHMEM must be enhanced for various science domains in traditional HPC and DL (e.g., molecular dynamics, lattice QCD, seismology, image classification, fusion research, and computer vision) to take advantage of these emerging platforms. Unfortunately, state-of-the-art, production-quality implementations of the popular programming models lack appropriate support for next-generation programmable Network Interface Controllers (NICs) like Mellanox Bluefield, and so cannot deliver the best performance and scalability for applications on such dense many-core CPU/GPU systems. The proposed SMART-CAM product will build upon existing and recognized capabilities to enhance state-of-the-art, production-quality implementations of the popular MPI and SHMEM programming models. It will take advantage of the next-generation networking and storage technologies available with Mellanox Bluefield network adapters to deliver the best possible scale-up and scale-out for HPC and DL applications on emerging dense many-core CPU/GPU systems.
We will develop the following innovations in SMART-CAM to enable scale-up and scale-out of the driving science domains indicated above on emerging dense many-core CPU/GPU platforms: 1) Efficient designs for asynchronous progress; 2) Offloading rendezvous communication; 3) In-network collective communication; 4) Novel datatype processing to improve application performance; 5) Offloaded datatype processing; 6) Optimized RMA operations; 7) Accelerated I/O and checkpoint-restart; 8) Online data compression and coalescing; and 9) Carry out integrated development and evaluation to ensure proper integration of proposed designs with the MPI and PGAS libraries. Tasks 1, 2, 3, and relevant portions of 9 will be carried out as part of Phase I activities. The transformative impact of the proposed SMART-CAM product will be to achieve scalability and performance for HPC and DL frameworks/applications on emerging dense many-core CPU/GPU platforms with programmable NICs like Bluefield. We expect that the solutions SMART-CAM provides can reduce the CPU utilization of popular communication middleware by up to 30%. This can result in a significant boost to application-level performance by making more CPU time available to the application. Further, the acceleration of various communication and I/O operations made possible by newer technologies introduced by the Bluefield series of network adapters, like RDMA over NVMeoF, can reduce the I/O processing time by a factor of up to 5 for I/O-intensive workloads like DL training. Furthermore, the availability of programmable ARM cores on the Bluefield SoC can reduce the processing overhead for activities like datatype processing, data compression, and collective communication by up to a factor of 3 for traditional HPC as well as emerging Deep Learning applications.

Phase II

Contract Number: DE-SC0020554
Start Date: 5/3/2021    Completed: 5/2/2023
Phase II year
2021
Phase II Amount
$1,639,300
State-of-the-art, production-quality implementations of popular parallel programming libraries like MPI and SHMEM must be enhanced to support next-generation programmable Network Interface Controllers (SMART NICs) like NVIDIA/Mellanox Bluefield and Broadcom Stingray for various science domains in HPC and DL to take advantage of emerging HPC platforms with extremely high compute and communication capabilities. We will develop the following innovations in SMART-CAM to enable scale-up and scale-out of various driving science domains on emerging smart-NIC-powered dense many-core CPU/GPU platforms: 1) Efficient designs for asynchronous progress; 2) Offloading rendezvous communication; 3) In-network collective communication; 4) Offloaded datatype processing; 5) Optimized RMA operations; 6) Accelerated I/O and checkpoint-restart; 7) Online data compression and coalescing; and 8) Carry out integrated development and evaluation to ensure proper integration of proposed designs with the MPI and PGAS libraries. We have demonstrated the feasibility of the SMART-CAM product by working on selected activities under goals #1, #2, and #3. The initial prototype can run successfully on 32 nodes with 32 Bluefield-2 adapters and up to 1,024 MPI processes for a set of representative MPI applications. Initial customer engagements with the prototype product have taken place with various organizations including NVIDIA/Mellanox, Broadcom, LANL, NSSC-Singapore, and UO. These customers are enthusiastic about using the new product (please see attached letters) to accelerate their HPC and DL applications with SMART-CAM. We aim to build on the success of Phase I to complete the SMART-CAM product while primarily focusing on the major goals (4, 5, 6, 7, and 8), remaining parts of the other goals, in-depth Q&A testing, and commercialization.
The transformative impact of SMART-CAM will be to achieve extreme scalability and performance for HPC and DL frameworks/applications while taking advantage of SMART NICs. SMART-CAM can reduce the CPU utilization of popular communication middleware by up to 30%, resulting in a significant boost to application-level performance by making more CPU time available to the application. The acceleration of communication and I/O operations made possible by the NVMeoF technology introduced by Bluefield/Stingray smart NICs can reduce the I/O processing time by a factor of up to 5 for I/O-intensive workloads like DL training. Furthermore, the availability of programmable ARM cores on the Bluefield/Stingray SoC can reduce the processing overhead for activities like datatype processing, data compression, and collective communication by up to a factor of 3 for HPC and DL applications.