Phi-4 Technical Report

AI-generated keywords: Data Quality

AI-generated Key Points

Development of phi-4, a 14-billion parameter language model:
Strong emphasis on data quality throughout training process
Strategic incorporation of synthetic data at various stages
Inclusion of reasoning-dense documents from sources like arXiv, PubMed Central, and GitHub, as well as licensed books
Filtering approach for capturing information-rich web sources:
Selection of high-quality documents using classifiers trained on LLM-generated annotations
Over-indexing on STEM-related keywords with specialized pipeline to amplify non-STEM content
Integration of multilingual datasets:
Custom extraction and cleaning pipelines implemented for uniformity across heterogeneous data sources
Post-training phases:
Supervised fine-tuning using curated user prompts and direct preference optimization based on rejection sampling and LLM evaluation
Model architecture and performance:
Based on decoder-only transformer architecture with 14 billion parameters
Emphasis on accuracy (code that executes correctly, proofs that are valid) while fostering systematic reasoning
Creation of synthetic data spanning 50 types of datasets to enhance pretraining and midtraining phases
Surpassed GPT-4 in STEM-focused QA capabilities due to improved data quality, training curriculum enhancements, and innovations in post-training techniques

Authors: Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang

arXiv: 2412.08905v1 (cs.CL)

License: CC BY 4.0

Abstract: We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.

Submitted to arXiv on 12 Dec. 2024

Results of the summarizing process for the arXiv paper: 2412.08905v1

Comprehensive Summary

The development of phi-4, a 14-billion parameter language model, placed a strong emphasis on data quality throughout training. Most language models are pre-trained primarily on organic data sources such as web content or code; phi-4 instead incorporates synthetic data strategically at several stages of training.

The organic data was curated to be more comprehensive and cleaner than standard corpora. It includes major repositories of reasoning-dense documents such as arXiv, PubMed Central, and GitHub, as well as licensed books. To capture information-rich web sources beyond these, high-quality documents were selected from web dumps using classifiers trained on LLM-generated annotations. Because this filtering over-indexes on STEM-related keywords, a specialized pipeline was developed to amplify non-STEM content such as arts, history, travel, culture, and entertainment. Multilingual datasets were integrated so that the model can handle a range of languages. Custom extraction and cleaning pipelines ingest different file formats and keep heterogeneous sources uniform, preserving fragile content such as equations and code blocks in web data.

Synthetic data spanning 50 types of datasets, created using different seeds and prompting procedures, was used to enhance the pretraining and midtraining phases. Emphasis was placed on accuracy (code that executes correctly, proofs that are valid) and on fostering systematic reasoning through chain-of-thought approaches.

phi-4 is based on a decoder-only transformer architecture with 14 billion parameters. Post-training consisted of supervised fine-tuning using curated user prompts, followed by direct preference optimization (DPO) based on rejection sampling and LLM evaluation.

Overall, phi-4 surpasses its teacher model, GPT-4, on STEM-focused QA, a result attributed to improved data, an improved training curriculum, and innovations in post-training. It also performs strongly on reasoning-focused benchmarks despite minimal architectural changes from phi-3.
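The web-filtering step in the summary (a small classifier trained on LLM-generated quality annotations, then used to keep only high-scoring documents) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the naive Bayes model, the `TinyQualityClassifier` and `filter_documents` names, the hand-written labels standing in for LLM annotations, and the threshold. The report summarized here does not specify the classifier at this level of detail.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real pipeline would use something richer."""
    return text.lower().split()


class TinyQualityClassifier:
    """Bag-of-words naive Bayes, a hypothetical stand-in for a small non-LLM classifier."""

    def fit(self, docs, labels):
        # labels: 1 = high quality, 0 = low quality.
        # Assumes both classes appear at least once in the training set.
        self.counts = {0: Counter(), 1: Counter()}
        self.class_totals = Counter(labels)
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def score(self, doc):
        """Return the log-odds that doc is high quality (positive means keep)."""
        log_odds = math.log(self.class_totals[1] / self.class_totals[0])
        v = len(self.vocab)
        n1 = sum(self.counts[1].values())
        n0 = sum(self.counts[0].values())
        for tok in tokenize(doc):
            if tok not in self.vocab:
                continue  # ignore words never seen during training
            p1 = (self.counts[1][tok] + 1) / (n1 + v)  # Laplace smoothing
            p0 = (self.counts[0][tok] + 1) / (n0 + v)
            log_odds += math.log(p1 / p0)
        return log_odds


def filter_documents(clf, docs, threshold=0.0):
    """Keep only documents whose quality score exceeds the threshold."""
    return [d for d in docs if clf.score(d) > threshold]


# Toy annotations; in the described pipeline these labels would come from an LLM.
annotated = [
    ("the proof follows by induction on n", 1),
    ("we derive the gradient of the loss function", 1),
    ("click here to win a free prize now", 0),
    ("buy cheap followers click now", 0),
]
clf = TinyQualityClassifier().fit(
    [text for text, _ in annotated], [label for _, label in annotated]
)

candidates = [
    "a short proof of the theorem by induction",
    "click now for a free prize",
]
kept = filter_documents(clf, candidates)
print(kept)
```

With these four training examples the vocabulary is dominated by math-flavored words on the positive side, which also shows in miniature why such a filter can over-index on STEM keywords and need a separate path for high-quality non-STEM text.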

Layman's Summary

- A large language model called phi-4 was built with 14 billion parameters.
- Its developers focused heavily on the quality of the data used to train it.
- They added artificially generated (synthetic) data at several stages of training.
- They also used reasoning-rich documents from sources like arXiv, PubMed Central, and GitHub.
- Web pages were filtered so that only high-quality ones were kept, and the model became better at answering science questions.

Definitions

- Language model: A computer program that understands and generates human language.
- Parameters: The adjustable numerical values a model learns during training.
- Synthetic data: Artificially created data used for training models.
- STEM: Science, Technology, Engineering, and Mathematics.

Blog article

In recent years, language models have become increasingly powerful, thanks to advances in artificial intelligence and machine learning. One model that has drawn attention is phi-4, a 14-billion parameter language model developed by Microsoft. This article looks at the technical report that describes its development and its focus on data quality.

Traditional language models rely heavily on organic data sources such as web content or code for pre-training. phi-4 takes a different approach by strategically incorporating synthetic data at various stages of training.

The organic data was also chosen with care, aiming for comprehensiveness and cleanliness above standard corpora. Major repositories of reasoning-dense documents were included: arXiv, PubMed Central, and GitHub, as well as licensed books. These information-rich sources were intended to improve the model's understanding and reasoning capabilities.

How was the quality of web data ensured? High-quality documents were selected from web dumps using classifiers trained on LLM-generated annotations. Because this method over-indexed on STEM-related keywords, a specialized pipeline was developed to amplify non-STEM content such as arts, history, travel, culture, and entertainment. Multilingual datasets were integrated so that phi-4 could handle various languages. Custom extraction and cleaning pipelines kept heterogeneous data sources uniform by ingesting different file formats while preserving fragile content like equations and code blocks in web data.

During post-training, supervised fine-tuning using curated user prompts was followed by direct preference optimization based on rejection sampling and LLM evaluation. These steps further improved phi-4's performance.

phi-4 is based on a decoder-only transformer architecture with 14 billion parameters. Emphasis was placed on accuracy (code that executes correctly, proofs that are valid) and on fostering systematic reasoning through chain-of-thought approaches. The synthetic data spans 50 types of datasets, created using different seeds and prompting procedures, and was used to enhance the pretraining and midtraining phases. The resulting model surpasses its teacher, GPT-4, on STEM-focused QA.

Despite minimal changes to the architecture from phi-3, phi-4 performs strongly on reasoning-focused benchmarks, which points to how much data quality can affect a language model's capabilities.

In conclusion, phi-4's development highlights the role of data quality in building capable language models. Diverse, information-rich sources, specialized pipelines for extraction and cleaning, and new post-training techniques together give the model strong performance for its size. Future language models may benefit from a similar emphasis on data quality.
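The post-training step described above, direct preference optimization on pairs built by rejection sampling and LLM evaluation, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the report's procedure: `sample_responses` and `judge` are hypothetical stubs standing in for a policy model and an LLM evaluator, the task (adding two integers) is invented for the example, and `dpo_loss` is the standard per-pair DPO objective computed from log-probabilities that a real model would supply.

```python
import math
import random


def sample_responses(prompt, n, rng):
    """Hypothetical stub for sampling n responses from a policy model.

    The 'prompt' is a pair of integers and each response is a guessed sum.
    """
    a, b = prompt
    return [a + b + rng.choice([-1, 0, 0, 1]) for _ in range(n)]


def judge(prompt, response):
    """Hypothetical stub for an LLM evaluator: 1.0 if the answer is correct, else 0.0."""
    a, b = prompt
    return 1.0 if response == a + b else 0.0


def build_preference_pair(prompt, n=8, seed=0):
    """Rejection sampling: sample n responses, then pair an accepted one with a rejected one."""
    rng = random.Random(seed)
    scored = [(judge(prompt, r), r) for r in sample_responses(prompt, n, rng)]
    accepted = [r for s, r in scored if s == 1.0]
    rejected = [r for s, r in scored if s == 0.0]
    if not accepted or not rejected:
        return None  # no contrast between responses, so skip this prompt
    return {"prompt": prompt, "chosen": accepted[0], "rejected": rejected[0]}


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = build_preference_pair((2, 3))
print(pair)

# Illustrative log-probabilities only; a real setup reads these from the models.
print(dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
               ref_logp_chosen=-2.0, ref_logp_rejected=-2.0))
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, which is the signal DPO optimizes for.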

Created on 12 Jan. 2025

Similar papers summarized with our AI tools

- 67.0%  Textbooks Are All You Need II: phi-1.5 technical report (cs.CL)
- 66.3%  Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (cs.CL)
- 64.6%  Small Language Models: Survey, Measurements, and Insights (cs.CL)
- 63.2%  Yi: Open Foundation Models by 01.AI (cs.CL)
- 62.6%  Scaling Synthetic Data Creation with 1,000,000,000 Personas (cs.CL)
- 62.5%  Training a Helpful and Harmless Assistant with Reinforcement Learning from Hu… (cs.CL)
- 62.2%  Sparks of Artificial General Intelligence: Early experiments with GPT-4 (cs.CL)

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.