Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle

AI-generated keywords: Divide-and-Conquer Principle, OnnxRuntime, Scalability, Inference, Batching

AI-generated Key Points

The paper addresses the issue of poor scalability of machine learning models when deployed on CPUs.
The author proposes a novel approach based on the Divide-and-Conquer Principle to tackle this problem.
Instead of allocating all available computing resources to the entire problem, they suggest breaking it into smaller chunks and letting the framework decide how computing resources should be allocated among those chunks.
The proposed allocation mechanism is implemented in OnnxRuntime, a popular framework for training and inferencing ML models.
The effectiveness of this approach is demonstrated with several use cases, including highly popular models for image processing (PaddleOCR) and NLP tasks (BERT).
Section 2 elaborates on various reasons why inference commonly does not scale well on CPUs.
Section 3 describes in detail the concept and implementation of the proposed Divide-and-Conquer Principle as it applies to inference.
Section 4 presents several use cases where this principle can be applied along with performance evaluation results demonstrating its benefits.
The approach allows efficient batching of inference requests of various sizes, eliminating the need for padding and letting the framework allocate computing resources in proportion to the length of each sequence.
Related work is discussed in Section 5 before concluding in Section 6.


Authors: Alex Kogan

arXiv: 2301.05099v1 (cs.LG)

License: CC BY-SA 4.0

Abstract: Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).
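
The allocation idea in the abstract (each part gets a number of cores according to its expected cost) can be illustrated with a minimal sketch. The function name `allocate_cores`, the one-core-per-part minimum, and the largest-remainder rounding are assumptions made for illustration; the summary does not say how the paper's OnnxRuntime implementation rounds or bounds the allocation.

```python
def allocate_cores(costs, total_cores):
    """Split total_cores among parts in proportion to their expected cost.

    Illustrative assumptions: every part gets at least one core, and cores
    left over after rounding down go to the parts with the largest
    fractional remainders.
    """
    if not costs:
        return []
    if total_cores < len(costs):
        raise ValueError("need at least one core per part")
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive")

    spare = total_cores - len(costs)           # cores left after one per part
    total_cost = sum(costs)
    shares = [spare * c / total_cost for c in costs]
    alloc = [1 + int(s) for s in shares]       # round each share down

    # Hand out the cores lost to rounding, largest remainder first.
    leftover = total_cores - sum(alloc)
    by_remainder = sorted(
        range(len(costs)),
        key=lambda i: shares[i] - int(shares[i]),
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


# Three parts, the last one twice as expensive as the others, on 8 cores.
print(allocate_cores([1, 1, 2], 8))
```

Here the cost could be any proxy the caller trusts, such as sequence length for a language model or image area for a vision model.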

Submitted to arXiv on 12 Jan. 2023

In this paper, the author addresses the poor scalability of machine learning models deployed on CPUs and proposes an approach based on the Divide-and-Conquer Principle to tackle it. Instead of allocating all available computing resources to the entire problem, the idea is to break the problem into smaller chunks and let the framework decide how computing resources should be allocated among those chunks. The author argues that in many use cases such a division is natural and requires only trivial changes in user code.

The proposed allocation mechanism is implemented in OnnxRuntime, a popular framework for training and inferencing ML models. The inference API is extended to allow user code to invoke parallel inference on multiple inputs. The effectiveness of this approach is demonstrated with several use cases, including highly popular models for image processing (PaddleOCR) and NLP tasks (BERT).

Section 2 elaborates on the reasons why inference commonly does not scale well on CPUs. One reason is that the amount of computation a model requires during inference may not be "enough" for efficient parallelization. Section 3 describes the concept and implementation of the Divide-and-Conquer Principle as it applies to inference. Section 4 presents several use cases where the principle can be applied, along with performance evaluation results demonstrating its benefits. For instance, the approach allows efficient batching of inference requests of various sizes, eliminating the need for padding and letting the framework allocate computing resources in proportion to the length of each sequence. Section 5 discusses related work, and Section 6 concludes.

Overall, the paper offers a simple yet effective approach, based on the Divide-and-Conquer Principle, to the poor scalability of machine learning models deployed on CPUs.
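
To make the padding point concrete, the short calculation below estimates how much of a conventionally padded batch is filler. The helper `padding_overhead` and the example sequence lengths are hypothetical and are not taken from the paper's evaluation.

```python
def padding_overhead(lengths):
    """Return (padded_tokens, useful_tokens, wasted_fraction) for one batch.

    Assumes the conventional scheme in which every sequence in the batch is
    padded to the length of the longest one.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    padded = max(lengths) * len(lengths)   # tokens actually processed
    useful = sum(lengths)                  # tokens that carry real input
    return padded, useful, (padded - useful) / padded


# Hypothetical batch of four requests with very different lengths.
padded, useful, wasted = padding_overhead([16, 32, 128, 256])
print(f"{padded} tokens processed, {useful} useful, {wasted:.1%} padding")
```

Running each sequence as its own part avoids processing the filler tokens, which is the saving the summary refers to.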

The paper talks about a problem: many machine learning models do not run much faster on a CPU even when more of its cores are put to work. The author suggests breaking the job into smaller parts and letting the software framework decide how to share the computer's resources among the parts. The idea was tested on popular image and language models and it worked well. The idea is explained in detail in Section 3, and Section 4 shows examples of how it can be used. This approach allows for efficient use of computing resources without needing extra padding.

Definitions:

- Machine learning models: computer programs that can learn from data and make predictions or decisions based on that data
- Scalability: the ability to handle larger amounts of work or data without losing performance
- CPUs: central processing units, the main component of a computer that performs most of its processing tasks
- Divide-and-Conquer Principle: a strategy where a big problem is broken down into smaller, more manageable problems
- OnnxRuntime: a software framework used for training and running machine learning models

Scalability of Machine Learning Models on CPUs: A Divide-and-Conquer Approach

Many machine learning models scale poorly when deployed on CPUs. In this paper, the author proposes an approach based on the Divide-and-Conquer Principle to tackle this problem. The approach is implemented in OnnxRuntime, a popular framework for training and inferencing ML models. Its effectiveness is demonstrated with several use cases, including highly popular models for image processing (PaddleOCR) and NLP tasks (BERT).

Background

When deploying machine learning models on CPUs, there are several issues that can lead to poor scalability. One reason is that the amount of computation required by a model during inference may not be "enough" for efficient parallelization. Furthermore, existing approaches often require manual tuning or padding to achieve optimal performance, which can be difficult and time-consuming.
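
One way to picture why a job may not have "enough" computation to parallelize is a toy cost model. This is an illustrative assumption, not the paper's analysis: the running time on p threads is taken to be the work divided by p plus a coordination overhead that grows with p, and both constants below are made up.

```python
def predicted_time(work, overhead, threads):
    """Toy model: perfectly divisible work plus a per-thread coordination cost."""
    return work / threads + overhead * threads


def best_thread_count(work, overhead, max_threads):
    """Thread count in 1..max_threads that minimizes the toy model's time."""
    return min(
        range(1, max_threads + 1),
        key=lambda p: predicted_time(work, overhead, p),
    )


# A large job benefits from many threads; a small one stops improving early.
print(best_thread_count(work=100, overhead=1, max_threads=16))
print(best_thread_count(work=4, overhead=1, max_threads=16))
```

Under this model, cores beyond the best count make a small job slower, so they are better spent on another part of the input.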

Proposed Solution

The author proposes an alternative based on the Divide-and-Conquer Principle: users break their problem into smaller chunks and let the framework decide how computing resources should be allocated among those chunks. This requires only trivial changes in user code, because in many use cases the division is natural, for example when batching requests of various sizes. To implement the approach in OnnxRuntime, the author extended its inference API to allow user code to invoke parallel inference on multiple inputs using the proposed allocation mechanism. The performance evaluation shows benefits over existing methods: better resource utilization improves throughput for large batches, and allocating resources dynamically across sequences of different lengths eliminates padding overhead for small batches.
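
The calling pattern described above, one inference call per input with all calls running in parallel, might look roughly like the sketch below. It uses only Python's standard `concurrent.futures` module. `run_parallel` and `fake_infer` are hypothetical stand-ins and are not the extended OnnxRuntime API, whose exact signature this summary does not give. The sketch also leaves out the per-input core allocation that the framework is said to perform.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(infer, inputs, max_workers=None):
    """Call infer(x) on every input concurrently and keep the input order."""
    if not inputs:
        return []
    workers = max_workers or len(inputs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(infer, inputs))


def fake_infer(tokens):
    """Hypothetical stand-in for a model call; returns the sequence length."""
    return len(tokens)


# Three unpadded sequences of different lengths, submitted as one job.
sequences = [[1, 2, 3], [4], [5, 6]]
results = run_parallel(fake_infer, sequences)
print(results)
```

From the user's side the change is small: a loop over inputs, or one padded batch call, is replaced by one call that receives the list of inputs.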

Related Work

Section 5 discusses related work. Section 6 concludes by summarizing the findings and discussing future directions for research, such as applying the proposed solution to tasks beyond image processing and NLP.

Conclusion

Overall, the paper offers a simple yet effective approach to the poor scalability of machine learning models deployed on CPUs. The approach is based on the Divide-and-Conquer Principle and implemented in the OnnxRuntime framework: users break their problems into smaller chunks, and the framework decides how computing resources are allocated among those chunks. Compared with existing methods, this improves throughput for large batches through better resource utilization and eliminates padding overhead for small batches through dynamic resource allocation across sequences of different lengths.

Created on 15 Apr. 2023
