Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models

AI-generated keywords: EEVE-Korean-v1.0, Efficient and Effective Vocabulary Expansion, language models, Korean adaptation, training efficiency

AI-generated Key Points

  • Introduces EEVE-Korean-v1.0, a Korean adaptation of large language models
  • Uses the Efficient and Effective Vocabulary Expansion (EEVE) method, which combines parameter freezing and subword initialization
  • EEVE-Korean-10.8B-v1.0 excels at Korean language tasks while maintaining English proficiency, after training on just 2 billion tokens
  • Future work will apply the vocabulary expansion methodology to additional languages to test its generalizability and effectiveness
  • Plans to evaluate reasoning and generative capabilities through tasks like GSM8K and human evaluations in interactive settings
  • Plans to improve pre-training data quality and analyze performance in code-switching scenarios to strengthen the model's robustness and versatility
  • Models are released to the research community to support more inclusive and efficient language processing technologies

Authors: Seungduk Kim, Seungtaek Choi, Myeongho Jeong

License: CC BY 4.0

Abstract: This report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models that exhibit remarkable capabilities across English and Korean text understanding. Building on recent highly capable but English-centric LLMs, such as SOLAR-10.7B and Phi-2, where non-English texts are inefficiently processed with English-centric tokenizers, we present an efficient and effective vocabulary expansion (EEVE) method, which encompasses parameter freezing and subword initialization. In contrast to previous efforts that believe new embeddings require trillions of training tokens, we show that our method can significantly boost non-English proficiency within just 2 billion tokens. Surpassing most instruction-tuned LLMs on the Open Ko-LLM Leaderboard, as of January 2024, our model EEVE-Korean-10.8B-v1.0 ranks as the leading Korean pre-trained model in the open-source community, according to Hugging Face's leaderboard. We open-source our models on Hugging Face to empower the open research community in various languages.
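
As a rough illustration of why English-centric tokenizers process non-English text inefficiently, the sketch below counts tokens under a hypothetical byte-fallback scheme, in which any character missing from the vocabulary is split into its UTF-8 bytes. The function `byte_fallback_token_count` and the toy vocabularies are assumptions made for this example; they are not the SOLAR-10.7B or Phi-2 tokenizers, which operate on learned subword units.

```python
# Illustration only: a hypothetical byte-fallback tokenizer,
# not the tokenizer of SOLAR-10.7B or Phi-2.

def byte_fallback_token_count(text, vocab):
    """Count tokens when each in-vocabulary character costs one token
    and every other character falls back to its UTF-8 bytes."""
    count = 0
    for ch in text:
        if ch in vocab:
            count += 1
        else:
            count += len(ch.encode("utf-8"))
    return count

# Toy English-centric vocabulary (assumption): lowercase letters and space.
english_vocab = set("abcdefghijklmnopqrstuvwxyz ")

korean = "안녕하세요"

# Before expansion: every Hangul syllable falls back to bytes.
before = byte_fallback_token_count(korean, english_vocab)

# After expansion: the Korean syllables are added to the vocabulary.
after = byte_fallback_token_count(korean, english_vocab | set(korean))

print(before, after)
```

Under these assumptions the five-syllable greeting costs 15 tokens before expansion, since each Hangul syllable occupies three UTF-8 bytes, and 5 tokens once its syllables are in the vocabulary.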

Submitted to arXiv on 22 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.14714v1

The report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models that understands both English and Korean text. Using the Efficient and Effective Vocabulary Expansion (EEVE) method, which combines parameter freezing with subword initialization, the EEVE-Korean-10.8B-v1.0 model performs strongly on Korean language tasks while retaining its proficiency in English. The adaptation required just 2 billion training tokens, in contrast to earlier efforts that assumed new embeddings need trillions of tokens.

Moving forward, the project aims to apply the vocabulary expansion methodology to additional languages to assess its generalizability and effectiveness. The goal is not only to broaden the linguistic range of the EEVE-Korean model but also to evaluate its reasoning and generative capabilities more deeply, through tasks such as the GSM8K mathematical reasoning benchmark and human evaluations in interactive settings like chatbots. Future efforts will also focus on improving pre-training data quality and analyzing performance in code-switching scenarios to refine the model's robustness and versatility.

By making these models available to the research community, the project aims to contribute to more inclusive and efficient language processing technologies that benefit users across different languages.
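
The two ingredients named in the summary, subword initialization and parameter freezing, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each new token's embedding is initialized here as the mean of the embeddings of the old-tokenizer subwords it replaces, and a gradient step is applied only to the newly added rows. The helper names (`expand_embeddings`, `sgd_step_new_rows_only`) and the toy values are hypothetical, and the summary does not describe the paper's actual training schedule.

```python
# Minimal sketch (assumptions labeled): subword-based initialization of
# new embedding rows plus freezing of the original rows during an update.

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def expand_embeddings(embeddings, new_token_subword_ids):
    """Append one row per new token, initialized as the mean of the rows
    of the old subwords that previously spelled that token (assumption)."""
    expanded = [row[:] for row in embeddings]
    for subword_ids in new_token_subword_ids:
        expanded.append(mean_vector([embeddings[i] for i in subword_ids]))
    return expanded

def sgd_step_new_rows_only(embeddings, grads, num_old, lr):
    """Apply a plain SGD step to rows at index >= num_old only;
    the original rows are treated as frozen and copied unchanged."""
    updated = []
    for idx, (row, grad) in enumerate(zip(embeddings, grads)):
        if idx < num_old:
            updated.append(row[:])
        else:
            updated.append([w - lr * g for w, g in zip(row, grad)])
    return updated

# Toy 2-dimensional embedding table with three original tokens.
old = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Two hypothetical new tokens: one formerly split into subwords 0 and 1,
# the other formerly mapped to subword 2 alone.
expanded = expand_embeddings(old, [[0, 1], [2]])

# Dummy gradients of 1.0 everywhere; only the new rows should move.
grads = [[1.0, 1.0] for _ in expanded]
stepped = sgd_step_new_rows_only(expanded, grads, num_old=len(old), lr=0.5)

print(expanded[3:], stepped[3:])
```

Starting the new rows from existing subword embeddings gives them a meaningful position in the embedding space from the first step, and keeping the original rows frozen is one way to protect the model's existing English behavior while the new rows are trained.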
Created on 29 Feb. 2024

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.