Jun 8 – 10, 2020
Indico / zoom
Europe/Berlin timezone

A job management system for utilization of idle supercomputer resources

Jun 10, 2020, 10:00 AM
15m
Indico / zoom

https://zoom.us/j/98141351045?pwd=SHlYK1VOSk1WdTBwbmhoamhJZndQUT09 Password: DLC-2020 Meeting ID: 981 4135 1045

Speaker

Elena Fedotova (SINP MSU)

Description

We propose a system for executing low-priority non-parallel jobs on idle supercomputer resources in order to increase their effective load. The jobs run inside containers, so a checkpoint mechanism can be used to save the state of a job during execution and resume it later, possibly on a different node. Because the execution of each low-priority job is split into separate shorter intervals, the system can utilize idle computational nodes with little impact on the performance of regular jobs.
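The idea of splitting one job across several idle intervals can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `Checkpoint`, `run_interval` and `run_on_idle_windows`, the abstract "work units", and the fixed per-interval `overhead` for restoring and saving state are all hypothetical, and no real container software is called.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a saved container state."""
    job_id: str
    done: int  # work units completed so far


def run_interval(cp: Checkpoint, total: int, window: int, overhead: int = 1) -> Checkpoint:
    """Resume from a checkpoint, compute within one idle window, save again.

    `overhead` is an assumed cost (in work units) of restoring and
    checkpointing; it is subtracted from the usable part of the window.
    """
    usable = max(window - overhead, 0)
    done = min(cp.done + usable, total)
    return Checkpoint(cp.job_id, done)


def run_on_idle_windows(job_id: str, total: int, windows: list[int], overhead: int = 1):
    """Advance a job through successive idle windows until it completes."""
    cp = Checkpoint(job_id, 0)
    used = 0
    for window in windows:
        if cp.done >= total:
            break  # job finished, remaining windows are not needed
        cp = run_interval(cp, total, window, overhead)
        used += 1
    return cp, used


final, intervals = run_on_idle_windows("job-1", total=10, windows=[5, 3, 6, 4])
print(final.done, intervals)
```

In this toy model each window may belong to a different node, since the only state carried between intervals is the checkpoint itself.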

The system consists of two components. The first component is a control program that maintains a queue of low-priority non-parallel jobs, assigns both the new jobs and the jobs saved as checkpoints to computational nodes, tracks their status, and manages the checkpoints. It also interacts with the supercomputer scheduler. The second component is an agent program that is executed on computational nodes and interacts with container software by starting the assigned jobs inside containers and saving the progress as checkpoints before the allotted time is over.
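The division of labour between the two components can be sketched as follows. This is a minimal in-memory model, assuming a simple FIFO queue and abstract work units; the classes `Job`, `Agent` and `Controller`, the method `on_idle_node`, and the checkpoint label format are hypothetical names introduced for illustration. The interaction with the supercomputer scheduler and with real container software is replaced by placeholders.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    remaining: int                    # work units still to be done
    checkpoint: Optional[str] = None  # hypothetical checkpoint label


class Agent:
    """Runs on a computational node; placeholder for the container interaction."""

    def __init__(self, node: str):
        self.node = node

    def run(self, job: Job, allotted: int) -> Job:
        # Stand-in for: start (or restore) the container, let it run,
        # and save a checkpoint before the allotted time is over.
        work = min(job.remaining, allotted)
        job.remaining -= work
        job.checkpoint = f"{job.job_id}@{self.node}" if job.remaining > 0 else None
        return job


class Controller:
    """Maintains the queue of low-priority jobs and tracks their status."""

    def __init__(self):
        self.queue = deque()
        self.finished = []

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def on_idle_node(self, node: str, allotted: int) -> Optional[Job]:
        """Assumed to be called when the scheduler reports an idle node."""
        if not self.queue:
            return None
        job = Agent(node).run(self.queue.popleft(), allotted)
        if job.remaining > 0:
            self.queue.append(job)  # checkpointed job waits for another node
        else:
            self.finished.append(job.job_id)
        return job


ctl = Controller()
ctl.submit(Job("a", 5))
ctl.submit(Job("b", 2))
for node in ["n1", "n2", "n3"]:
    ctl.on_idle_node(node, allotted=3)
print(ctl.finished)
```

New jobs and checkpointed jobs share one queue here, so both are assigned to nodes by the same path; a real control program might prioritize them differently.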

According to our estimates, the proposed system can effectively utilize 40% to 89% of the idle resources, depending on the assumptions made.
