Online Scheduling for Exploratory Training Jobs in Deep Learning Clusters
Abstract
Resource management for Deep Learning (DL) clusters is essential for system efficiency and model training quality. Existing schedulers provided by DL frameworks are mostly adapted from traditional HPC clusters and usually optimize job makespan, assuming that DL training jobs run to completion. Unfortunately, many training jobs are exploratory and often finish unsuccessfully (over 30% in production clusters). Existing DL cluster schedulers that rely on offline algorithms are ill-suited to exploratory jobs, whose unexpected early terminations can cause noticeable resource waste. Moreover, DL training jobs are iterative and usually yield diminishing returns as they progress, so allocating resources equally across training iterations is inefficient. The fundamental goal of a DL training job is to improve model quality, usually indicated by the loss reduction (job profit) of a DNN model. This paper introduces a novel scheduling problem for exploratory jobs that seeks to maximize the overall profit of a DL cluster. To solve it, we propose a solution based on the primal-dual framework, coupled with a resource price function that emphasizes the ratio of job profit to resource consumption, resulting in a competitive ratio of 2α, which is in O(ln n). We design an efficient online algorithm, ExplSched, which integrates Dynamic Programming (DP) and heuristic algorithms to jointly consider scheduling performance and overhead, with a time complexity of O(nE_j). Experimental results show that ExplSched achieves an average system utility improvement of 83.82% compared with related work.
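To make the idea of a profit-aware resource price concrete, the following is a minimal sketch of a generic online primal-dual admission rule. It is not ExplSched and not the price function defined in this paper: the exponential form of `price`, the bounds `low` and `high` on profit per unit of resource, the `Job` fields, and the example numbers are all assumptions chosen for illustration. The rule admits an arriving job only when its profit exceeds the current cost of the resources it requests, so that scarce capacity is held back for jobs with a higher profit-to-resource ratio.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    demand: int    # resource units requested (hypothetical, e.g. GPUs)
    profit: float  # expected loss reduction if the job runs (hypothetical)


def price(used: int, capacity: int, low: float, high: float) -> float:
    """Assumed exponential price: `low` on an idle cluster, `high` on a full one."""
    return low * (high / low) ** (used / capacity)


def admit_online(jobs, capacity, low, high):
    """Process jobs in arrival order; admit a job only if profit > price * demand."""
    used, admitted = 0, []
    for job in jobs:
        if used + job.demand > capacity:
            continue  # not enough free resource units left
        cost = price(used, capacity, low, high) * job.demand
        if job.profit > cost:
            used += job.demand
            admitted.append(job.name)
    return admitted, used


# Illustrative arrivals: (name, demand, profit)
jobs = [Job("a", 4, 8.0), Job("b", 4, 3.0), Job("c", 2, 9.0), Job("d", 4, 5.0)]
admitted, used = admit_online(jobs, capacity=8, low=1.0, high=16.0)
print(admitted, used)
```

In this toy run, job "b" is rejected because the price has risen after "a" was admitted and its profit no longer covers its cost, while the smaller, more profitable job "c" is still accepted. The unit price rising with utilization is what ties admission decisions to the profit-to-resource ratio.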