I like the simple, intuitive ideas the paper uses to address the problem. In my view, however, two limitations are not sufficiently discussed:

1. Weight stashing stores multiple versions of the parameters, which will greatly increase the memory footprint. The authors claim that PipeDream's peak per-worker memory usage is on par with data parallelism. However, a known limitation of data parallelism is that it does not work when the model is too large to fit into a worker's memory, so being on par with it does not address that case. Reducing the memory footprint would be interesting future work.

2. The paper does not discuss how to handle stragglers. I suspect this problem is more challenging in pipeline parallelism than in data parallelism because of intra-stage dependency.
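
To make the first point concrete, here is a back-of-the-envelope sketch of the weight memory involved. It rests on my own simplifying assumptions, not on numbers from the paper: parameters are split evenly across stages, stage i (0-indexed) keeps one stashed weight version per in-flight minibatch (taken here to be num_stages - i), and activations and optimizer state are ignored. The function names and the 4 GB model size are hypothetical.

```python
def stashed_weight_bytes(total_param_bytes, num_stages, stage_index):
    """Rough weight memory held by one pipeline stage with weight stashing."""
    # Assumption: parameters are split evenly across the stages.
    params_per_stage = total_param_bytes / num_stages
    # Assumption: stage i keeps one weight version per in-flight minibatch,
    # modeled here as num_stages - stage_index versions.
    versions = num_stages - stage_index
    return params_per_stage * versions


def data_parallel_weight_bytes(total_param_bytes):
    """Each data-parallel worker holds one full copy of the parameters."""
    return total_param_bytes


if __name__ == "__main__":
    total = 4 * 10**9  # hypothetical model with 4 GB of parameters
    stages = 4
    for stage in range(stages):
        print("stage", stage, stashed_weight_bytes(total, stages, stage))
    print("data parallel", data_parallel_weight_bytes(total))
```

Under these assumptions the first stage holds as much weight data as a full data-parallel replica. That would be consistent with the "on par" claim, but it also suggests that the first stage inherits the same limit when the model does not fit on one worker.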