A Scheduling-Based Framework for Efficient Massively Parallel Execution

Status: Completed

Start Date: 2016-04-20

End Date: 2020-02-18

Description: Modeling and simulation on high-end computing systems has grown increasingly complex in recent years as both models and computer systems continue to advance. Most coding and debugging time is spent not on defining the problem physics but on balancing computations across multiple heterogeneous devices, handling data communication, managing distributed memory systems, and providing fault tolerance. The resulting programs are often barely readable because the details of the work being performed are obscured by the hardware-specific setup and communication code that dominates the codebase. Even worse, the code used to balance computation, manage data communication, and provide fault tolerance is re-implemented in each piece of an application even though it performs the same tasks across those sections of the software. This makes software more difficult to maintain and upgrade, and hinders porting to new hardware platforms as they become available. The time spent improving, modifying, or debugging these device-specific code paths and common code sections could be better spent improving kernel performance or adding new features.

To separate the physical science from the computing science, we are developing a solution that decouples the problem definition from the platform-specific implementation details. This is accomplished by dividing the computation into distinct tasks, each of which takes some defined input data and produces some output data. These tasks can then be connected into a task graph by defining their dependencies on each other. The task graph describing a particular code can then be used to automatically manage data and schedule work across heterogeneous devices without requiring further user intervention. Therefore, to make use of new hardware, the user need only port the tasks that might take advantage of it, and all required scheduling, data management, and synchronization are handled automatically.
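
The task-graph idea described above can be illustrated with a minimal Python sketch. This is not the project's actual framework or API: the `Task` class, the `build_graph` and `run` helpers, and the data names are all hypothetical, chosen only to show how declaring each task's inputs and outputs lets dependencies and execution order be derived automatically. A real scheduler would additionally dispatch tasks to devices and move data between them, which this sketch does not attempt.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Task:
    """Hypothetical task: reads named input data, writes named output data."""

    def __init__(self, name, inputs, outputs, fn):
        self.name = name
        self.inputs = inputs    # names of data items the task consumes
        self.outputs = outputs  # names of data items the task produces
        self.fn = fn            # callable returning a tuple of outputs


def build_graph(tasks):
    """Map each task name to the set of task names that produce its inputs."""
    producer = {out: t.name for t in tasks for out in t.outputs}
    return {
        t.name: {producer[i] for i in t.inputs if i in producer}
        for t in tasks
    }


def run(tasks, data):
    """Execute tasks in a dependency-respecting order; return that order."""
    by_name = {t.name: t for t in tasks}
    order = list(TopologicalSorter(build_graph(tasks)).static_order())
    for name in order:
        task = by_name[name]
        results = task.fn(*(data[i] for i in task.inputs))
        data.update(zip(task.outputs, results))
    return order


# Tasks are listed out of order on purpose: the graph decides when each runs.
tasks = [
    Task("combine", ["a2", "b2"], ["total"], lambda x, y: (x + y,)),
    Task("square_a", ["a"], ["a2"], lambda x: (x * x,)),
    Task("square_b", ["b"], ["b2"], lambda x: (x * x,)),
]
data = {"a": 3, "b": 4}
order = run(tasks, data)
```

In this sketch the user only states what each task consumes and produces. The two squaring tasks have no dependency on each other, so a scheduler would be free to run them concurrently or on different devices, while `combine` must wait for both.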

Benefits: These tools could be used to reduce software development and maintenance time and to improve the computational performance and scalability of a variety of high-performance computing applications. Specifically, we intend to focus initially on applying this technology to GEOS-5 for earth modeling, although the framework can also benefit other earth modeling packages. Another application area is CFD solvers such as Fun3D and OVERFLOW. The technology can also serve as an enabler for the High-End Computing Capability (HECC) project by enhancing both the usability and the performance of applications that can take advantage of heterogeneous compute architectures. Additionally, it permits more flexibility in hardware design and purchasing for high-end computing systems by reducing the effort required to port applications to new hardware architectures, such as GPUs and Xeon Phis.

Most HPC software will be able to benefit from this technology, particularly applications meant to scale to large computer systems or to target heterogeneous hardware configurations. Expected application domains include electromagnetics simulations, computational chemistry, oil and gas exploration, and financial modeling. They also include any domain that involves large-scale sparse linear algebra operations or large-scale image processing, as well as other physics-based and multidisciplinary modeling applications.

Lead Organization: EM Photonics, Inc.