The Case for Task Sampling based Learning for Cluster Job Scheduling
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- Accurately estimating job runtime properties is crucial for effective cluster job scheduling.
- Traditional online cluster job schedulers use history-based learning to estimate runtime properties, but this can lead to inaccurate predictions due to changing technology and user inputs.
- The proposed approach is task-sampling-based, involving proactive sampling and scheduling of a small fraction of tasks from each job.
- This approach exploits the similarity among task runtime properties within the same job, making it immune to changing job behavior.
- The study focuses on two key questions: (1) Can learning in space (task sampling) be more accurate than learning in time (history)? (2) If so, can the improved accuracy more than compensate for delaying a job's remaining tasks until its sampled tasks complete, and so improve job performance?
- Analytical and experimental analyses of three production traces with different skew and job distribution show that learning in space can be substantially more accurate than history-based learning.
- Simulation and testbed evaluation on Azure, using three production cluster job traces, show that despite its online overhead, learning in space reduces average Job Completion Time (JCT) by 1.28x, 1.56x, and 1.32x compared to the prior-art history-based predictor.
- This research highlights the potential and limitations of real-time learning of job runtime properties through task-sampling-based approaches.
- It provides valuable insights into improving cluster job scheduling by leveraging similarities among task runtime properties within a job while adapting to changing environments.
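The contrast between the two estimators in the key points above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the 5% sampling fraction, and the use of a simple mean are all assumptions made here for clarity.

```python
import random
import statistics


def estimate_from_history(past_task_runtimes):
    """Learning in time: predict from runtimes observed in earlier runs.

    Hypothetical helper; using the plain mean is an assumption.
    """
    return statistics.mean(past_task_runtimes)


def estimate_from_samples(task_runtimes, sample_fraction=0.05, rng=None):
    """Learning in space: run a small fraction of this job's own tasks
    and predict from their observed runtimes.

    Hypothetical helper; the 5% default fraction is an assumption.
    """
    rng = rng or random.Random(0)
    k = max(1, int(len(task_runtimes) * sample_fraction))
    sampled = rng.sample(task_runtimes, k)
    return statistics.mean(sampled)


# A job whose tasks now take about 20 s each, while earlier runs took
# about 10 s (for example, because the user's input grew).
current_tasks = [19.0 + (i % 5) * 0.5 for i in range(200)]
history = [10.0] * 50

true_mean = statistics.mean(current_tasks)
history_error = abs(estimate_from_history(history) - true_mean)
sample_error = abs(estimate_from_samples(current_tasks) - true_mean)
```

In this toy setup the history-based estimate stays anchored to the old behavior, while the sampled estimate only depends on how similar the job's own tasks are to one another.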
Authors: Akshay Jajoo, Y. Charlie Hu, Xiaojun Lin, Nan Deng
Abstract: The ability to accurately estimate job runtime properties allows a scheduler to effectively schedule jobs. State-of-the-art online cluster job schedulers use history-based learning, which uses past job execution information to estimate the runtime properties of newly arrived jobs. However, with fast-paced development in cluster technology (in both hardware and software) and changing user inputs, job runtime properties can change over time, which leads to inaccurate predictions. In this paper, we explore the potential and limitation of real-time learning of job runtime properties, by proactively sampling and scheduling a small fraction of the tasks of each job. Such a task-sampling-based approach exploits the similarity among runtime properties of the tasks of the same job and is inherently immune to changing job behavior. Our study focuses on two key questions in comparing task-sampling-based learning (learning in space) and history-based learning (learning in time): (1) Can learning in space be more accurate than learning in time? (2) If so, can delaying scheduling the remaining tasks of a job till the completion of sampled tasks be more than compensated by the improved accuracy and result in improved job performance? Our analytical and experimental analysis of 3 production traces with different skew and job distribution shows that learning in space can be substantially more accurate. Our simulation and testbed evaluation on Azure of the two learning approaches anchored in a generic job scheduler using 3 production cluster job traces shows that despite its online overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28x, 1.56x, and 1.32x compared to the prior-art history-based predictor.
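The trade-off behind question (2) can be made concrete with a toy model. The sketch below is only an illustration under strong assumptions that do not come from the paper: all jobs arrive at time 0, a single slot runs them one at a time in ascending order of estimated size, and sampling is modeled as a fixed per-job delay. The function name `average_jct` and all numbers are hypothetical.

```python
def average_jct(true_sizes, estimates, overhead=0.0):
    """Average job completion time on one slot.

    Jobs run in ascending order of their *estimated* size; each job
    pays `overhead` (a stand-in for sampling delay) before it runs.
    """
    order = sorted(range(len(true_sizes)), key=lambda i: estimates[i])
    clock = 0.0
    total = 0.0
    for i in order:
        clock += overhead + true_sizes[i]
        total += clock  # completion time of job i
    return total / len(true_sizes)


true_sizes = [10.0, 1.0, 5.0]

# Stale history ranks the longest job as the shortest.
history_estimates = [1.0, 10.0, 5.0]
jct_history = average_jct(true_sizes, history_estimates)

# Sampling recovers the right order but costs 0.5 time units per job.
jct_sampling = average_jct(true_sizes, true_sizes, overhead=0.5)
```

Here the accurate ordering outweighs the added delay, so `jct_sampling` is lower than `jct_history`. With a large enough overhead or a history that is already accurate, the comparison can go the other way, which is the limitation the abstract alludes to.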