SBIR-STTR Award

A Unified Profiling Infrastructure and Tool for Extreme-scale Deep Learning
Award last edited on: 12/23/2020

Sponsored Program
SBIR
Awarding Agency
DOE
Total Award Amount
$206,500
Award Phase
1
Solicitation Topic Code
07b
Principal Investigator
Donglai Dai

Company Information

X-ScaleSolutions LLC

750 Deer Run Drive
Columbus, OH 43230
   (614) 316-4209
   contactus@x-scalesolutions.com
   www.x-scalesolutions.com
Location: Single
Congr. District: 03
County: Franklin

Phase I

Contract Number: DE-SC0020551
Start Date: 00/00/00    Completed: 00/00/00
Phase I year
2020
Phase I Amount
$206,500
Efficiently exploiting the high compute and communication capabilities of modern CPUs/GPUs and high-performance interconnects poses many challenges for DL scientists, middleware developers, and end users, primarily because it is difficult to understand the impact of, and interplay between, these hardware elements and the different portions of distributed DL training. The state-of-the-art approach to monitoring, analyzing, and understanding the performance of DL applications and frameworks on HPC systems requires a plethora of tools, and significant manual effort is needed to understand the interplay between the various components. This is primarily due to the lack of a holistic performance monitoring and analysis tool for the emerging DL area. The proposed DeepIntrospect (DI) product will build upon existing, recognized capabilities to holistically monitor, analyze, and understand the performance of DL applications and frameworks on next-generation HPC systems. X-ScaleSolutions proposes to create DeepIntrospect (DI), a highly efficient and easy-to-use tool for performance engineering of extreme-scale deep learning applications.
To realize our vision for the DI tool, we propose the following key development goals: 1) Develop a performance monitoring framework for CPUs, GPUs, interconnects (intra-node and inter-node), and file systems; 2) Design an integrated profiler that captures low-level communication statistics and offers high-level insights into the application layer's performance; 3) Develop a unified log parser with basic and advanced modules to combine various sources of performance data; 4) Design a responsive GUI that can act as a one-stop dashboard for all DI users; 5) Develop a high-performance DataStore that supports text-based logs as well as a structured database for efficient access; 6) Design the connectors API for internal use as well as for interoperability with external tools; and 7) Develop a plugin architecture for the GUI and implement one plugin as an example for external developers. As part of Phase I activities, we will work on parts of goals 1, 2, 3, and 4 and demonstrate the significance and benefits of the DI tool for a representative set of DL applications. The market for Deep Learning was sizable at $3.18 billion in 2018 and is expected to grow to $18.16 billion by the end of 2023, according to MarketsandMarkets research. DeepIntrospect has the potential to fundamentally simplify the vast majority of tasks performed by researchers and scientists across the world. Deep Learning is a relatively new area, and its ever-changing software libraries and domain-specific hardware present a unique challenge. Beginners entering the field can end up wasting not only many man-hours but also large numbers of GPU/CPU cycles in training DNNs. DeepIntrospect will allow beginner scientists and researchers to better align their efforts by letting them easily introspect and analyze the behavior of their models in a simplified fashion.
For advanced users and program/product managers, the benefit lies more in offloading the cumbersome parts of performance monitoring and analysis to developers and scientists. DeepIntrospect will greatly increase developer and scientist productivity, thereby helping senior and middle management focus more on the design and marketing of DL products and less on low-level details.

Phase II

Contract Number: ----------
Start Date: 00/00/00    Completed: 00/00/00
Phase II year
----
Phase II Amount
----