Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

AI-generated keywords: Culinary dimensions, Multilingual recipe data, Ingredient embeddings, Metapath2Vec variants, Food insights

AI-generated Key Points

FlavorGraph's 300-D embeddings revealed fifteen interpretable culinary dimensions
Epicure project introduces three sibling skip-gram ingredient embeddings retrained from scratch
Training data includes 4.14 million recipes from 11 platforms in multiple languages
Raw ingredients are normalized to a curated vocabulary of 1,790 canonical entries
Epicure controls balance between chemistry-related information and recipe-context signals during training
Three sibling models (Cooc, Chem, Core) share architecture and hyperparameters but differ in random-walk schema
Embeddings reveal recoverable supervised directions representing cuisine styles, food groups, processing classes, macronutrient categories, and sensory attributes
FastICA decomposition uncovers 20 interpretable axes per model; Gaussian-mixture-model partitioning identifies 150-200 culinary modes per model
Pipeline involves aggregating multilingual recipe data, normalizing NER terms, constructing graphs, training Metapath2Vec variants, and analyzing resulting embeddings


Authors: Jakub Radzikowski, Josef Chen

arXiv: 2605.22391v1 - DOI (cs.AI)

License: CC BY 4.0

Abstract: We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
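The abstract mentions an ingredient-ingredient NPMI graph but does not spell out how the edge weights are computed. The following is a minimal sketch of the standard normalized pointwise mutual information formula applied to recipe co-occurrence, not the authors' code. The function name `npmi_edges`, the `min_npmi` threshold, and the toy recipes are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes, min_npmi=0.0):
    """Return {(a, b): npmi} for ingredient pairs that co-occur in recipes.

    NPMI(a, b) = log(p(a, b) / (p(a) * p(b))) / -log(p(a, b)),
    with probabilities estimated as fractions of recipes.
    The min_npmi cut-off is a hypothetical choice, not taken from the paper.
    """
    n = len(recipes)
    ingredient_count = Counter()
    pair_count = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # count each ingredient once per recipe
        ingredient_count.update(items)
        pair_count.update(combinations(items, 2))

    edges = {}
    for (a, b), c_ab in pair_count.items():
        p_ab = c_ab / n
        p_a = ingredient_count[a] / n
        p_b = ingredient_count[b] / n
        if p_ab == 1.0:
            score = 1.0  # pair appears in every recipe; NPMI tends to 1
        else:
            score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        if score > min_npmi:
            edges[(a, b)] = score
    return edges


# Toy corpus (invented for illustration only).
toy_recipes = [
    ["tomato", "basil", "mozzarella"],
    ["tomato", "basil", "garlic"],
    ["soy sauce", "ginger", "garlic"],
    ["soy sauce", "ginger", "scallion"],
]
edges = npmi_edges(toy_recipes)
for pair, score in sorted(edges.items()):
    print(pair, round(score, 3))
```

Pairs that co-occur no more often than chance (NPMI of zero or below) are dropped here, which is one plausible way to keep a graph sparse; the paper's actual thresholding is not stated in this summary.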

Submitted to arXiv on 21 May 2026


Results of the summarizing process for the arXiv paper: 2605.22391v1


In earlier work, Radzikowski and Chen (2026) analyzed FlavorGraph's 300-D embeddings and found fifteen interpretable culinary dimensions spanning taste, texture, nutrition, geography, culture, and processing. That analysis was limited by FlavorGraph's fixed pretraining on an English-centric corpus and by an ingredient vocabulary that mixed in non-food items and preparation details. To address these limitations, the Epicure project introduces a family of three sibling skip-gram ingredient embeddings retrained from scratch. The training corpus comprises 4.14 million recipes from 11 platforms, in English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English. Raw ingredient strings are normalized to a curated vocabulary of 1,790 canonical entries through an LLM-augmented pipeline. The key innovation of Epicure is control over the balance between chemistry-related information and recipe-context signal during training. The three sibling models share the same architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the recipe co-occurrence graph only, Chem walks typed FlavorDB ingredient-compound metapaths only, and Core blends both kinds of walk. After training, the embeddings contain linearly recoverable supervised directions for cuisine styles, food groups, NOVA processing classes, USDA macronutrient categories, and sensory attributes. An unsupervised FastICA decomposition uncovers 20 interpretable axes per model, and Gaussian-mixture-model partitioning identifies 150-200 named culinary modes per model.
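To make the difference between the three walk schemas concrete, here is a minimal sketch, assuming that blending can be modelled as a per-walk coin flip between a co-occurrence walk and an ingredient-compound-ingredient metapath walk. The parameter `cooc_fraction`, the helper names, and the toy graphs with placeholder compound names are all hypothetical; the paper's actual injection mechanism and mixing ratio are not given in this summary.

```python
import random


def cooc_walk(cooc, start, length, rng):
    """Random walk over ingredient-ingredient co-occurrence edges."""
    walk = [start]
    while len(walk) < length:
        neighbours = cooc.get(walk[-1], [])
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def chem_walk(ing_to_comp, comp_to_ing, start, length, rng):
    """Metapath walk alternating ingredient -> compound -> ingredient."""
    walk = [start]
    while len(walk) < length:
        # Odd walk length means the last node is an ingredient.
        table = ing_to_comp if len(walk) % 2 == 1 else comp_to_ing
        neighbours = table.get(walk[-1], [])
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def build_walks(ingredients, cooc, ing_to_comp, comp_to_ing,
                cooc_fraction, walks_per_node=10, length=5, seed=0):
    """cooc_fraction=1.0 mimics Cooc, 0.0 mimics Chem, in between mimics Core."""
    rng = random.Random(seed)
    walks = []
    for start in ingredients:
        for _ in range(walks_per_node):
            if rng.random() < cooc_fraction:
                walks.append(cooc_walk(cooc, start, length, rng))
            else:
                walks.append(chem_walk(ing_to_comp, comp_to_ing, start, length, rng))
    return walks


# Toy graphs with placeholder compound names (not FlavorDB data).
cooc = {"tomato": ["basil", "garlic"], "basil": ["tomato"], "garlic": ["tomato"]}
ing_to_comp = {"tomato": ["cmp_A"], "basil": ["cmp_A", "cmp_B"], "garlic": ["cmp_C"]}
comp_to_ing = {"cmp_A": ["tomato", "basil"], "cmp_B": ["basil"], "cmp_C": ["garlic"]}
ingredients = list(cooc)

cooc_walks = build_walks(ingredients, cooc, ing_to_comp, comp_to_ing, cooc_fraction=1.0)
chem_walks = build_walks(ingredients, cooc, ing_to_comp, comp_to_ing, cooc_fraction=0.0)
core_walks = build_walks(ingredients, cooc, ing_to_comp, comp_to_ing, cooc_fraction=0.5)
print(cooc_walks[0], chem_walks[0], core_walks[0], sep="\n")
```

In a full pipeline the resulting walks would be fed to a skip-gram trainer as sentences; that step is omitted here so the sketch stays dependency-free.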
The pipeline has five stages: aggregating multilingual recipe data; normalizing raw NER terms into a standardized ingredient vocabulary; constructing co-occurrence and typed-compound graphs; training Metapath2Vec variants (Cooc, Core, Chem) on these graphs; and analyzing the resulting embeddings with supervised direction probes and unsupervised factor and mode discovery. Overall, Epicure maps the emergent geometry of food ingredient embeddings by combining multilingual recipe data with graph-based embedding training.
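The summary refers to supervised direction probes without saying which probe is used. One simple way to recover a linear direction is the difference of class means, sketched below. This is an illustrative stand-in rather than the authors' method, and the 3-D vectors, the ingredient labels, and the function names are invented for the example.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def direction_probe(embeddings, positives, negatives):
    """Unit vector pointing from the negative class mean to the positive one."""
    mu_pos = mean_vector([embeddings[name] for name in positives])
    mu_neg = mean_vector([embeddings[name] for name in negatives])
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(vector, direction):
    """Scalar projection of an embedding onto the probe direction."""
    return sum(v * d for v, d in zip(vector, direction))


# Hypothetical 3-D embeddings; real Epicure vectors are not reproduced here.
toy = {
    "butter": [0.9, 0.1, 0.2],
    "cream": [0.8, 0.2, 0.1],
    "lemon": [0.1, 0.9, 0.3],
    "vinegar": [0.2, 0.8, 0.2],
    "cheese": [0.7, 0.3, 0.4],  # held out: not used to fit the direction
}
direction = direction_probe(toy, ["butter", "cream"], ["lemon", "vinegar"])
scores = {name: project(vec, direction) for name, vec in toy.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {score:+.3f}")
```

A direction counts as recoverable in this sense when held-out items, such as `cheese` above, land on the expected side when projected onto it.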

Summary: FlavorGraph's 300-D embeddings showed fifteen interpretable culinary dimensions. The Epicure project trained three new ingredient embeddings from scratch. Training used 4.14 million recipes in several languages. Raw ingredient names were mapped to a list of 1,790 standard entries. Epicure controls the balance between chemistry and recipe information during training.

Definitions:
- Embeddings: Representations of words or items as points in a mathematical space.
- Sibling: Here, models that share the same architecture and hyperparameters and differ only in how they are trained.
- Retrained: Trained again from the beginning.
- Normalized: Made consistent or standardized.
- Vocabulary: The set of terms used in a particular field; here, the list of canonical ingredients.

Introduction: In 2026, Radzikowski and Chen analyzed FlavorGraph's 300-D embeddings and found fifteen interpretable culinary dimensions. That analysis was limited by the fixed pretraining on an English-centric corpus and by an ingredient vocabulary that mixed in non-food items and preparation details. To address these limitations, the Epicure project introduces a family of three sibling skip-gram ingredient embeddings retrained from scratch.

Overview of Epicure Project: The Epicure project explores the emergent geometry of food ingredient embeddings using multilingual recipe data and graph-based embedding training. It has five stages: aggregating multilingual recipe data; normalizing raw NER terms into a standardized ingredient vocabulary; constructing co-occurrence and typed-compound graphs; training Metapath2Vec variants (Cooc, Core, Chem) on these graphs; and analyzing the resulting embeddings with supervised direction probes and unsupervised factor and mode discovery.

Data Collection: The training corpus comprises 4.14 million recipes from 11 platforms, in English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English. Raw ingredient strings are normalized to a curated vocabulary of 1,790 canonical entries through an LLM-augmented pipeline.

Training Process: The key innovation of Epicure is control over the balance between chemistry-related information and recipe-context signal during training. The three sibling models share the same architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the recipe co-occurrence graph only, Chem walks typed FlavorDB ingredient-compound metapaths only, and Core blends both kinds of walk.

Results: After training, the embeddings contain linearly recoverable supervised directions for cuisine styles, food groups, NOVA processing classes, USDA macronutrient categories, and sensory attributes. An unsupervised FastICA decomposition uncovers 20 interpretable axes per model, and Gaussian-mixture-model partitioning identifies 150-200 named culinary modes per model.

Interpretation of Results: Taken together, the supervised directions, ICA axes, and mixture modes indicate that the embedding geometry encodes culinary structure such as cuisine, food group, processing level, macronutrient profile, and sensory character. Because the three models sit at different points on the chemistry-vs-recipe-context spectrum, the same analyses can be compared across them.

Conclusion: The Epicure project addresses the limitations of the earlier FlavorGraph analysis by retraining three sibling skip-gram ingredient embeddings from scratch on a multilingual corpus with a curated vocabulary. The resulting embeddings offer a view of how ingredients relate to one another and to their cultural and geographical contexts. Further research using these embeddings could support work in food science, nutrition, and cultural studies.

Created on 02 Jun. 2026


