Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

AI-generated keywords: Sparse Mixture-of-Experts, Instruction Tuning, Large Language Models, Task-Agnostic Learning, FLAN-MOE

AI-generated Key Points

  • Sheng Shen et al. study Sparse Mixture-of-Experts (MoE), a neural architecture design that adds learnable parameters to Large Language Models (LLMs) without increasing inference cost
  • Instruction tuning is a technique for training LLMs to follow instructions
  • The authors advocate combining the two approaches, finding that MoE models benefit more from instruction tuning than dense models
  • Empirical studies are conducted across three experimental setups:
      • Direct fine-tuning on individual downstream tasks without instruction tuning, where MoE models overall underperform dense models of identical computational capacity
      • Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks
      • Instruction tuning supplemented by further fine-tuning on individual downstream tasks
  • With instruction tuning (second and third setups), used alone or together with task-specific fine-tuning, the comparison shifts dramatically in favor of MoE models
  • The FLAN-MOE-32B model surpasses FLAN-PALM-62B on four benchmark tasks while using only a third of the FLOPs
  • The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale language models within the framework of task-agnostic learning
  • Combining Sparse Mixture-of-Experts with instruction tuning enhances the performance of Large Language Models while maintaining computational efficiency
  • The summary suggests that instruction-tuned MoE models, used in a task-agnostic learning framework, have significant potential for improving language understanding and generation in various applications

Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

Preprint
License: CC BY 4.0

Abstract: Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
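
The claim that a sparse MoE adds parameters without increasing inference cost can be illustrated with some rough arithmetic. The sketch below is an illustration only: the layer sizes, the expert count, and the helper names `ffn_params` and `moe_layer_stats` are hypothetical and are not taken from the paper. It assumes each expert is a standard two-matrix feed-forward block and that a linear router sends every token to `top_k` experts.

```python
def ffn_params(d_model, d_ff):
    """Weights in one feed-forward block: an up-projection and a down-projection (biases ignored)."""
    return 2 * d_model * d_ff


def moe_layer_stats(d_model, d_ff, num_experts, top_k):
    """Count stored weights versus weights actually used per token in one MoE layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts          # one linear map from token to expert logits
    total = num_experts * expert + router   # everything that must be stored and trained
    active = top_k * expert + router        # what a single token passes through
    return {"total_params": total, "active_params_per_token": active}


# Hypothetical sizes, chosen only to make the numbers easy to read.
dense = ffn_params(1024, 4096)
moe = moe_layer_stats(1024, 4096, num_experts=64, top_k=2)

print("dense FFN weights:        ", dense)
print("MoE stored weights:       ", moe["total_params"])
print("MoE weights used per token:", moe["active_params_per_token"])
```

Under these assumptions, the stored weights grow roughly linearly with the number of experts, while the weights touched per token depend on `top_k` and only weakly (through the small router) on how many experts exist.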

Submitted to arXiv on 24 May 2023


Results of the summarizing process for the arXiv paper: 2305.14705v2

In their paper titled "Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models," Sheng Shen et al. study Sparse Mixture-of-Experts (MoE), a neural architecture design that allows learnable parameters to be added to Large Language Models (LLMs) without increasing inference cost. They pair it with instruction tuning, a technique for training LLMs to follow instructions. The authors advocate combining these two approaches and find that MoE models benefit more from instruction tuning than dense models.

To evaluate this combination, the authors conduct empirical studies across three experimental setups. First, they perform direct fine-tuning on individual downstream tasks without instruction tuning. In this scenario, MoE models overall underperform dense models of identical computational capacity. In the second scenario, instruction tuning is followed by in-context few-shot or zero-shot generalization on downstream tasks. In the third scenario, instruction tuning is supplemented by further fine-tuning on individual downstream tasks. Once instruction tuning is introduced, whether used on its own or together with task-specific fine-tuning, the comparison changes dramatically in favor of MoE models.

The authors highlight their most powerful model, FLAN-MOE-32B, which surpasses the performance of FLAN-PALM-62B on four benchmark tasks while using only a third of the FLOPs (floating-point operations). The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale language models within the framework of task-agnostic learning.

Overall, this study indicates that combining Sparse Mixture-of-Experts with instruction tuning can enhance the performance of Large Language Models while maintaining computational efficiency. The findings suggest that instruction-tuned MoE models, used in a task-agnostic learning framework, have significant potential for improving language understanding and generation capabilities in various applications.
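
To make the idea of sparse expert selection concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It is an assumption-laden toy, not the FLAN-MOE implementation: the function names (`route_token`, `moe_forward`, `make_expert`), the toy experts that simply scale their input, and the choice to renormalise the selected gate weights are all hypothetical illustrations of one common routing scheme.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts and renormalise their gate weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, experts, router_weights, top_k=2):
    """Run one token through only its selected experts and mix their outputs."""
    # Router: one dot product per expert gives that expert's logit.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = route_token(logits, top_k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        y = experts[idx](token)  # experts that were not selected are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, [idx for idx, _ in routes]


def make_expert(scale):
    """Hypothetical toy expert: multiplies every feature by a constant."""
    return lambda token: [scale * x for x in token]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

output, used = moe_forward([1.0, 2.0], experts, router_weights, top_k=2)
print("experts used:", used)
print("output:", output)
```

Only two of the four experts run for this token, which is the sense in which a sparse layer can hold many more parameters than it uses on any single input.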
Created on 19 Dec. 2023


