# Tiresias: A GPU Cluster Manager for Distributed Deep Learning

### Background and Motivation

The authors identify three primary challenges that distributed deep learning (DDL) schedulers face in production.

#### Unpredictable job duration

Some existing schedulers try to predict DL training times by assuming that DL jobs have smooth loss curves. However, because DL jobs are trial-and-error in nature, their loss curves are often not as smooth as the curve of the best model ultimately picked at the end of exploration. The scheduler therefore should not rely on the loss curve to predict a job's completion time.

#### Over-aggressive job consolidation

Because DL jobs are sensitive to GPU locality, many existing solutions assign all components of a job to the same server or to the minimum number of servers. As a result, jobs often wait when they cannot be consolidated, even if there are enough spare resources elsewhere in the cluster.

#### Time overhead of preemption

Unlike preemption on CPUs, preempting a job on GPUs is slow: it usually takes tens of seconds.

### Tiresias

The main objectives of Tiresias are 1) minimizing the average job completion time (JCT), 2) achieving high GPU utilization, and 3) avoiding starvation.

![](/files/-M3iGeSoFTNFopB5psXm)

To address the aforementioned challenges, Tiresias uses an age-based scheduler called the **Two-dimensional Attained Service-Based Scheduler** (2DAS). 2DAS assigns each job a priority based on its **attained service**. The attained service of a job is calculated from the number of GPUs it uses and the amount of time it has been running so far.
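As a rough illustration of the two dimensions, the sketch below computes attained service as the product of GPU count and running time. The `Job` class and its field names are hypothetical and not taken from the Tiresias code base.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record used only for illustration."""
    name: str
    num_gpus: int
    executed_seconds: float = 0.0  # time spent running, not waiting

    @property
    def attained_service(self) -> float:
        # Two dimensions: space (GPUs) multiplied by time (seconds run).
        return self.num_gpus * self.executed_seconds


small_long = Job("small_long", num_gpus=1, executed_seconds=3600)
large_short = Job("large_short", num_gpus=8, executed_seconds=600)

for job in (small_long, large_short):
    print(job.name, job.attained_service, "GPU-seconds")
```

Under this measure the 8-GPU job has attained more service after ten minutes than the 1-GPU job has after an hour, which is why looking at running time alone would misjudge large jobs.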

When no job duration information is provided, the priority function applies the Least-Attained-Service (LAS) algorithm, in which a job's priority is inversely related to its attained service: the less service a job has received, the higher its priority. If the distribution of job durations is provided, a job's priority equals its Gittins index value.
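The two priority functions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are made up, the duration distribution is assumed to be a list of total-service samples from past jobs, and the Gittins index is evaluated for a single fixed service quantum `delta` rather than maximized over all quanta.

```python
def las_order(jobs):
    """LAS: jobs with the least attained service run first.

    `jobs` is a list of (name, attained_service) pairs.
    """
    return [name for name, _ in sorted(jobs, key=lambda j: j[1])]


def gittins_index(attained, delta, samples):
    """Gittins index for a job with `attained` service and quantum `delta`.

    Ratio of (probability the job finishes within the next `delta` units of
    service) to (expected service it consumes in that quantum), both
    conditioned on the job needing more than `attained` in total.
    A higher value means a higher priority.
    """
    remaining = [s for s in samples if s > attained]
    if not remaining:
        return 0.0  # no history of jobs this long; assumed lowest priority
    limit = attained + delta
    finished = sum(1 for s in remaining if s <= limit)
    consumed = sum(min(s, limit) - attained for s in remaining)
    # The shared 1/len(remaining) factor cancels in the ratio.
    return finished / consumed


history = [10, 10, 100]  # illustrative total service of past jobs
print(las_order([("a", 50.0), ("b", 5.0), ("c", 20.0)]))
print(gittins_index(0, 10, history))
print(gittins_index(10, 10, history))
print(gittins_index(90, 10, history))
```

With this history, a job that has just passed 10 units is unlikely to finish soon, so its index drops, and it rises again as the job approaches 100 units. LAS cannot express that non-monotonic behaviour because it only looks at attained service.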

Using continuous priorities can lead to a sequence of preemptions and subsequent resumptions for all jobs, which is costly because preemption is both time-consuming and expensive on GPUs. Tiresias addresses this challenge by discretizing priorities with the classic **Multi-Level Feedback Queue (MLFQ) algorithm**.

![](/files/-M3iKOwZcvyPe2LtHbi6)
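The effect of discretizing priorities with a multi-level feedback queue can be sketched as below. The queue thresholds, the tuple layout, and the FIFO tie-break inside a queue are assumptions made for illustration.

```python
import bisect

# Hypothetical attained-service limits (GPU-seconds) separating the queues.
QUEUE_THRESHOLDS = [3600.0, 36000.0]


def queue_index(attained_service, thresholds=QUEUE_THRESHOLDS):
    """Return the queue a job belongs to; queue 0 has the highest priority.

    A job is demoted once its attained service reaches a threshold.
    """
    return bisect.bisect_right(thresholds, attained_service)


def schedule_order(jobs):
    """Order jobs by queue, then by arrival time within a queue.

    `jobs` is a list of (name, attained_service, arrival_time) tuples.
    """
    ranked = sorted(jobs, key=lambda j: (queue_index(j[1]), j[2]))
    return [name for name, _, _ in ranked]


jobs = [("a", 100.0, 0), ("b", 50.0, 1), ("c", 5000.0, 2)]
print(schedule_order(jobs))
```

Jobs `a` and `b` sit in the same queue, so `b` does not displace `a` merely because it has attained slightly less service. With continuous priorities the two would keep preempting each other as their attained service leapfrogged.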

Another insight of Tiresias is that **the skew of the model structure** can be a good predictor of whether a job is sensitive to consolidation, because the message size distribution in DDL depends on the tensor size distribution of the model. Based on this insight, the Tiresias profiler measures the amount of skew in the tensor distribution across parameter servers. If the skew is larger than a predefined threshold, Tiresias attempts to consolidate the job on as few machines as possible.
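The placement decision described above might look like the following sketch. The skew metric (the largest tensor's share of the total model size), the threshold value, and the tensor sizes are all assumptions chosen for illustration; the summary above does not specify how Tiresias quantifies skew.

```python
SKEW_THRESHOLD = 0.5  # hypothetical cut-off, not a value from the paper


def tensor_skew(tensor_sizes_mb):
    """Illustrative skew metric: share of the model held by its largest tensor."""
    total = sum(tensor_sizes_mb)
    if total == 0:
        return 0.0
    return max(tensor_sizes_mb) / total


def needs_consolidation(tensor_sizes_mb, threshold=SKEW_THRESHOLD):
    """Consolidate only when the tensor size distribution is highly skewed."""
    return tensor_skew(tensor_sizes_mb) > threshold


skewed_model = [400.0, 64.0, 16.0, 2.0, 2.0]  # one tensor dominates
uniform_model = [10.0] * 10                   # evenly sized tensors

print(needs_consolidation(skewed_model))
print(needs_consolidation(uniform_model))
```

A job that does not need consolidation can be placed on any free GPUs in the cluster, which avoids the waiting caused by over-aggressive consolidation.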
