Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models

AI-generated keywords: EEVE-Korean-v1.0, Efficient and Effective Vocabulary Expansion, language models, Korean adaptation, training efficiency

AI-generated Key Points

- Introduces EEVE-Korean-v1.0, a Korean adaptation of large language models
- Uses the Efficient and Effective Vocabulary Expansion (EEVE) method, which combines parameter freezing and subword initialization
- The EEVE-Korean-10.8B-v1.0 model handles Korean language tasks well while retaining English proficiency, after training on just 2 billion tokens
- The project aims to extend the vocabulary expansion methodology to additional languages to test its generalizability and effectiveness
- Planned evaluation of reasoning and generative capabilities through tasks like GSM8K and human evaluations in interactive settings
- Planned improvements to pre-training data quality and analysis of code-switching performance to improve the model's robustness and versatility
- The models are released to the research community to support more inclusive and efficient language processing technologies


Authors: Seungduk Kim, Seungtaek Choi, Myeongho Jeong

arXiv: 2402.14714v1 (cs.CL)

License: CC BY 4.0

Abstract: This report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models that exhibit remarkable capabilities across English and Korean text understanding. Building on recent highly capable but English-centric LLMs, such as SOLAR-10.7B and Phi-2, where non-English texts are inefficiently processed with English-centric tokenizers, we present an efficient and effective vocabulary expansion (EEVE) method, which encompasses parameter freezing and subword initialization. In contrast to previous efforts that believe new embeddings require trillions of training tokens, we show that our method can significantly boost non-English proficiency within just 2 billion tokens. Surpassing most instruction-tuned LLMs on the Open Ko-LLM Leaderboard, as of January 2024, our model EEVE-Korean-10.8B-v1.0 ranks as the leading Korean pre-trained model in the open-source community, according to Hugging Face's leaderboard. We open-source our models on Hugging Face to empower the open research community in various languages.
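To make the tokenizer inefficiency mentioned in the abstract concrete, here is a minimal sketch. It is not the tokenizer used by SOLAR-10.7B, Phi-2, or EEVE: the `tokenize` helper, the two toy vocabularies, and the byte-fallback token format are all assumptions made for illustration. It shows how text outside the vocabulary falls apart into many byte-level tokens, and how adding a few Korean pieces shrinks the sequence.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with a UTF-8 byte fallback (toy)."""
    max_len = max((len(piece) for piece in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies, chosen only to illustrate the effect.
english_centric = {"hello", " ", "world"}
expanded = english_centric | {"안녕", "하세요"}

text = "안녕하세요"  # five Hangul syllables, three UTF-8 bytes each
before = tokenize(text, english_centric)
after = tokenize(text, expanded)
print(len(before), len(after))
```

In this toy setting the same greeting costs far fewer tokens once the vocabulary contains Korean pieces, which is the kind of saving vocabulary expansion is after. Real subword tokenizers are trained rather than hand-written, so actual ratios will differ.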

Submitted to arXiv on 22 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.14714v1

Comprehensive Summary

The report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models that performs strongly on both English and Korean text understanding. It relies on the Efficient and Effective Vocabulary Expansion (EEVE) method, which combines parameter freezing and subword initialization. With this method, the EEVE-Korean-10.8B-v1.0 model handles Korean language tasks well while retaining strong English proficiency, and it gets there with just 2 billion training tokens, a marked gain in training efficiency.

Going forward, the project plans to apply the vocabulary expansion methodology to additional languages to assess how well it generalizes. It also plans to evaluate the model's reasoning and generative capabilities in more depth, through tasks such as the GSM8K mathematical reasoning benchmark and human evaluations in interactive settings like chatbots. Further work will focus on improving pre-training data quality and analyzing performance in code-switching scenarios to make the model more robust and versatile. By releasing the models to the research community, the project aims to support more inclusive and efficient language processing technologies for users across different languages.
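The subword initialization idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: averaging the embeddings of the old subword pieces is one common heuristic, and the summary above does not spell out the exact scheme EEVE uses. The function names, the toy embedding table, and the stand-in `old_tokenize` are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def init_new_embedding(new_token, old_tokenize, old_embeddings):
    """Start a new token's embedding from the subword pieces that the
    original tokenizer would have produced for it (assumed heuristic)."""
    pieces = old_tokenize(new_token)
    return mean_vector([old_embeddings[piece] for piece in pieces])


# Hypothetical toy embedding table for the original vocabulary.
old_embeddings = {
    "ko": [1.0, 0.0],
    "rea": [0.0, 2.0],
    "n": [2.0, 4.0],
}


def old_tokenize(token):
    # Stand-in for the original tokenizer: a fixed, made-up segmentation.
    return {"korean": ["ko", "rea", "n"]}[token]


new_vector = init_new_embedding("korean", old_tokenize, old_embeddings)
print(new_vector)  # [1.0, 2.0]
```

Starting from such an average, instead of random values, places the new token near the region of embedding space the model already associates with its pieces, which is the intuition behind needing far fewer training tokens.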


Layman's Summary

Summary
1. A new model called EEVE-Korean-v1.0 was created for the Korean language using a special method.
2. This model is good at understanding Korean and English, and it needed only 2 billion tokens of training text to get there.
3. The project wants to use this method for other languages too.
4. The project plans to test the model's abilities with tasks like GSM8K and human evaluations.
5. The goal is to make better language processing technology available to researchers.

Definitions
- Vocabulary: Words that a person knows or uses in a language.
- Expansion: Making something bigger or adding more to it.
- Methodology: A way of doing things or solving problems.
- Generalizability: Being able to apply something to different situations or cases.
- Robustness: Strength and ability to work well even in difficult situations.
- Versatility: Ability to be used in many different ways or for different purposes.

Blog article

Language models have made significant strides in recent years, with large-scale pre-trained models showing impressive capabilities in understanding and generating text. Most of this progress, however, has centered on English, leaving other languages behind. The report "Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models" addresses that gap. It introduces EEVE-Korean-v1.0, a Korean adaptation of large language models that builds on capable but English-centric LLMs such as SOLAR-10.7B and Phi-2 and uses the Efficient and Effective Vocabulary Expansion (EEVE) method to handle both English and Korean text. The adaptation takes just 2 billion training tokens, a significant improvement in training efficiency over earlier assumptions that new embeddings require trillions of tokens.

So what exactly is EEVE? According to the abstract, it encompasses two ingredients: parameter freezing and subword initialization. Parameter freezing means keeping selected parameters fixed during training so that they are not updated, which helps the model retain its English proficiency while it learns the newly added Korean tokens. Subword initialization, as the name suggests, means starting the embeddings of newly added tokens from existing subword embeddings rather than from random values, so that training begins from what the model already knows.

The abstract reports that EEVE-Korean-10.8B-v1.0 surpasses most instruction-tuned LLMs on the Open Ko-LLM Leaderboard and, as of January 2024, ranks as the leading Korean pre-trained model in the open-source community.

But why focus on Korean specifically? The abstract's motivation is that English-centric tokenizers process non-English text inefficiently. An efficient yet effective approach for Korean can therefore pave the way for similar advances in other languages.

Moving forward, the project aims to apply the vocabulary expansion methodology to additional languages. It also plans to evaluate reasoning and generative capabilities through tasks such as the GSM8K mathematical reasoning benchmark and human evaluations in interactive settings like chatbots. Further efforts will focus on improving pre-training data quality and analyzing performance in code-switching scenarios to refine the model's robustness and versatility. Code-switching refers to switching between two or more languages while speaking or writing, which is common in multilingual societies. Better performance in such scenarios would make the model more inclusive and more applicable to real-world use.

The ultimate goal of the project is to contribute to more efficient and inclusive language processing technologies that benefit users across different languages. By making the models available to the research community, the authors hope to encourage further advances in this field.

In conclusion, EEVE-Korean-v1.0 is a significant step towards bridging the gap between English-centric language models and other languages like Korean, and it opens the door to language processing technologies that serve more diverse linguistic needs.
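The parameter freezing idea described in the article can be sketched as a masked update step. This is a minimal illustration, not the EEVE training procedure: which parameters EEVE freezes, and at which point in training, is not described on this page. Here it is simply assumed that the embedding rows of the original vocabulary stay fixed and only newly added rows receive updates. The `sgd_step` helper and all numbers are hypothetical.

```python
def sgd_step(embeddings, grads, trainable_rows, lr=0.1):
    """Apply one SGD update, leaving every row outside trainable_rows frozen."""
    updated = []
    for row_id, (row, grad) in enumerate(zip(embeddings, grads)):
        if row_id in trainable_rows:
            # Trainable row: standard gradient descent update.
            updated.append([w - lr * g for w, g in zip(row, grad)])
        else:
            # Frozen row: copied through unchanged, whatever its gradient.
            updated.append(list(row))
    return updated


# Rows 0 and 1 stand for the original vocabulary; row 2 is a newly added token.
embeddings = [[1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
new_rows = {2}

result = sgd_step(embeddings, grads, new_rows, lr=0.5)
print(result)  # [[1.0, 1.0], [2.0, 2.0], [0.0, 1.0]]
```

Because the original rows never move, whatever the model learned for its existing vocabulary is preserved during this step, while the new row is free to adapt. Deep learning frameworks typically offer an equivalent switch per parameter tensor rather than a hand-written loop.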

Created on 29 Feb. 2024
