Phase II Amount
$1,639,300
State-of-the-art, production-quality implementations of popular parallel programming libraries like MPI and SHMEM must be enhanced to support next-generation programmable Network Interface Controllers (SMART NICs) like NVIDIA/Mellanox Bluefield and Broadcom Stingray, so that various science domains in HPC and DL can take advantage of emerging HPC platforms with extremely high compute and communication capabilities. We will develop the following innovations in SMART-CAM to enable scale-up and scale-out of various driving science domains on emerging smart NIC powered dense many-core CPU/GPU platforms: 1) Efficient designs for asynchronous progress; 2) Offloading rendezvous communication; 3) In-network collective communication; 4) Offloaded datatype processing; 5) Optimized RMA operations; 6) Accelerated I/O and checkpoint-restart; 7) Online data compression and coalescing; and 8) Integrated development and evaluation to ensure proper integration of the proposed designs with the MPI and PGAS libraries. We have demonstrated the feasibility of the SMART-CAM product by working on selected activities under goals #1, #2, and #3. The initial prototype runs successfully on 32 nodes with 32 Bluefield-2 adapters and up to 1,024 MPI processes for a set of representative MPI applications. Initial customer engagements with the prototype product have taken place with various organizations including NVIDIA/Mellanox, Broadcom, LANL, NSSC-Singapore, and UO. These customers are enthusiastic about using SMART-CAM (please see attached letters) to accelerate their HPC and DL applications. Building on the success of Phase I, we aim to deliver the complete SMART-CAM product, focusing primarily on the major goals (#4, #5, #6, #7, and #8), the remaining parts of the other goals, in-depth QA testing, and commercialization.
The transformative impact of SMART-CAM will be to achieve extreme scalability and performance for HPC and DL frameworks/applications while taking advantage of SMART NICs. SMART-CAM can reduce the CPU utilization of popular communication middleware by up to 30%, making more CPU time available to the application and thereby significantly boosting application-level performance. The acceleration of communication and I/O operations made possible by the NVMe-oF technology supported by Bluefield/Stingray smart NICs can reduce the I/O processing time by a factor of up to 5 for I/O-intensive workloads like DL training. Furthermore, the availability of programmable ARM cores on the Bluefield/Stingray SoC can reduce the processing overhead for activities like datatype processing, data compression, and collective communication by a factor of up to 3 for HPC and DL applications.