Enabling Performance Observability for Heterogeneous HPC Workflows with SOMA

Heterogeneous workflows represent a promising approach for overcoming traditional application performance limitations and accelerating scientific insight on high-performance computing (HPC) platforms. As HPC platforms grow in size and complexity, managing and optimizing workflow resources while maximizing scientific output assumes vital importance. Optimal workflow resource allocation requires high-quality and timely information about the state of the hardware resources, the status of the pending tasks, the performance of the tasks that have already been executed, and the current status of the workflow itself. A robust performance observability framework that captures and delivers this information can fundamentally improve the quality of decision-making within the workflow system, setting the stage for the adaptive execution of workflow tasks. We propose the use of SOMA, a service-based performance observability framework for such HPC workflows. With the RADICAL-Pilot runtime system as a development vehicle, SOMA demonstrates that service-based architectures coupled with an appropriate data model can serve the performance monitoring needs of large-scale ensemble workflows in a low-overhead fashion. Effective observability of workflow performance requires exporting, storing, and analyzing several types of performance data from across the application and workflow software stacks. Our study finds significant benefits in integrating observability frameworks as first-class citizens within an HPC workflow software stack. In this paper, we demonstrate how SOMA can simultaneously observe the performance states of the individual tasks, system hardware, and the workflow as a whole. Such information can then be employed to calculate better resource allocation and task configuration.

Citation

Yokelson, Dewi, et al. "Enabling performance observability for heterogeneous HPC workflows with SOMA." Proceedings of the 53rd International Conference on Parallel Processing. 2024.

Authors from IE Research Datalab