Phi-4 Technical Report

AI-generated keywords: Data Quality

AI-generated Key Points

  • Development of phi-4, a 14-billion parameter language model:
      • Strong emphasis on data quality throughout the training process
      • Strategic incorporation of synthetic data at various stages
      • Inclusion of reasoning-dense documents from sources like arXiv, PubMed Central, and GitHub, as well as licensed books
  • Filtering approach for capturing information-rich web sources:
      • Selection of high-quality documents using classifiers trained on LLM-generated annotations
      • The filter over-indexed on STEM-related keywords, so a specialized pipeline was built to amplify non-STEM content
  • Integration of multilingual datasets:
      • Custom extraction and cleaning pipelines implemented for uniformity across heterogeneous data sources
  • Post-training phases:
      • Supervised fine-tuning using curated user prompts
      • Direct preference optimization based on rejection sampling and LLM evaluation
  • Model architecture and performance:
      • Based on a decoder-only transformer architecture with 14 billion parameters
      • Emphasis on the accuracy of code execution and the validity of proofs, while fostering systematic reasoning
      • Creation of 50 types of synthetic datasets to enhance the pretraining and midtraining phases
      • Surpassed GPT-4 in STEM-focused QA capabilities due to improved data quality, training curriculum enhancements, and innovations in post-training techniques

Authors: Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang

License: CC BY 4.0

Abstract: We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.

Submitted to arXiv on 12 Dec. 2024

Results of the summarizing process for the arXiv paper: 2412.08905v1

The development of phi-4, a 14-billion parameter language model, placed a strong emphasis on data quality throughout the training process. Unlike traditional language models, which rely primarily on organic data sources such as web content or code for pre-training, phi-4 strategically incorporated synthetic data at various stages of training.

The data curation aimed to be more comprehensive and cleaner than standard corpora by including major repositories of reasoning-dense documents from sources like arXiv, PubMed Central, and GitHub, as well as licensed books. To capture information-rich web sources beyond traditional datasets, high-quality documents were selected from web dumps using classifiers trained on LLM-generated annotations. This method over-indexed on STEM-related keywords, which prompted the development of a specialized pipeline to amplify non-STEM content such as arts, history, travel, culture, and entertainment. Multilingual datasets were also integrated to ensure the model's ability to handle various languages. Custom extraction and cleaning pipelines maintained uniformity across heterogeneous data sources by ingesting different file formats and preserving fragile content like equations and code blocks in web data.

Post-training combined supervised fine-tuning on curated user prompts with direct preference optimization (DPO) based on rejection sampling and LLM evaluation. The phi-4 model is based on a decoder-only transformer architecture with 14 billion parameters. Emphasis was placed on the accuracy of code execution and the validity of proofs, while fostering systematic reasoning through chain-of-thought learning approaches. Fifty types of synthetic datasets were created using different seeds and prompting procedures to enhance the pretraining and midtraining phases.

Overall, phi-4 surpassed its teacher model GPT-4 in STEM-focused QA capabilities due to improved data quality, training curriculum enhancements, and innovations in post-training techniques. The model also demonstrated strong performance on reasoning-focused benchmarks despite minimal changes to its architecture from phi-3.
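
To make the preference-optimization step mentioned in the summary more concrete, the following is a minimal sketch in Python. It is not the authors' implementation. The function names, the (response, score) candidate format, and the default `beta` are assumptions chosen for illustration; only the loss itself follows the standard published DPO formulation, which compares log-probabilities under the trained policy and a frozen reference model.

```python
import math


def build_preference_pair(candidates):
    """Pick the best- and worst-scored responses as (chosen, rejected).

    `candidates` is a list of (response_text, judge_score) tuples. The
    scoring scheme is hypothetical; the summary only says that pairs come
    from rejection sampling and LLM evaluation.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: c[1])
    return ranked[-1][0], ranked[0][0]


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the chosen response relative to the rejected one, measured against the reference.
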
Created on 12 Jan. 2025
