Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

AI-generated keywords: Culinary dimensions Multilingual recipe data Ingredient embeddings Metapath2Vec variants Food insights

AI-generated Key Points

  • FlavorGraph's 300-D embeddings revealed fifteen interpretable culinary dimensions
  • Epicure project introduces three sibling skip-gram ingredient embeddings retrained from scratch
  • Training data includes 4.14 million recipes from 11 platforms in multiple languages
  • Raw ingredients are normalized to a curated vocabulary of 1,790 canonical entries
  • Epicure controls balance between chemistry-related information and recipe-context signals during training
  • Three sibling models (Cooc, Chem, Core) share architecture and hyperparameters but differ in random-walk schema
  • Embeddings reveal recoverable supervised directions representing cuisine styles, food groups, processing classes, macronutrient categories, and sensory attributes
  • FastICA decomposition uncovers 20 interpretable axes per model; Gaussian-mixture-model partitioning identifies 150-200 culinary modes per model
  • Pipeline involves aggregating multilingual recipe data, normalizing NER terms, constructing graphs, training Metapath2Vec variants, and analyzing resulting embeddings

Authors: Jakub Radzikowski, Josef Chen

License: CC BY 4.0

Abstract: We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
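The ingredient-ingredient graph mentioned in the abstract is weighted by NPMI (normalised pointwise mutual information). As a minimal sketch of how such edges could be computed from recipe-level co-occurrence counts, the snippet below uses only the Python standard library. The function name `npmi_edges`, the `min_count` parameter, and the choice to keep only pairs with positive NPMI are assumptions made for illustration; the abstract does not state which threshold or filtering Epicure applies.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes, min_count=1):
    """Return {(a, b): npmi} for ingredient pairs with positive NPMI.

    `recipes` is an iterable of ingredient lists. Keeping only positive
    scores is an illustrative assumption, not the paper's stated rule.
    """
    recipes = list(recipes)
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))            # de-duplicate within a recipe
        single.update(items)
        pair.update(combinations(items, 2))    # keys are sorted (a, b) tuples

    edges = {}
    for (a, b), c_ab in pair.items():
        if c_ab < min_count:
            continue
        p_ab = c_ab / n
        p_a = single[a] / n
        p_b = single[b] / n
        if p_ab == 1.0:
            score = 1.0                        # limit value when both always co-occur
        else:
            pmi = math.log(p_ab / (p_a * p_b))
            score = pmi / -math.log(p_ab)      # normalise PMI into [-1, 1]
        if score > 0:
            edges[(a, b)] = score
    return edges


# Toy recipes, for illustration only.
toy = [
    ["tomato", "basil"],
    ["tomato", "basil"],
    ["tomato", "onion"],
    ["garlic", "onion"],
]
edges = npmi_edges(toy)
```

In this toy corpus, tomato and onion co-occur less often than their individual frequencies would predict, so that pair receives a negative score and is dropped under the assumed filter.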

Submitted to arXiv on 21 May 2026


Results of the summarizing process for the arXiv paper: 2605.22391v1

In earlier work (2026), Radzikowski and Chen analyzed FlavorGraph's 300-D embeddings and found fifteen interpretable culinary dimensions covering taste, texture, nutrition, geography, culture, and processing. That analysis was limited by FlavorGraph's fixed pretraining on an English-centric corpus and by a varied ingredient vocabulary that included non-food items and preparation details. To address these limitations, the Epicure project introduces a family of three sibling skip-gram ingredient embeddings retrained from scratch. The training data is a multilingual corpus of 4.14 million recipes sourced from 11 platforms, in languages including English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English. The raw ingredient strings are normalized to a curated vocabulary of 1,790 canonical entries through an LLM-augmented pipeline. The key innovation of Epicure is control over the balance between chemistry-related information and recipe-context signals during training. The three sibling models share the same architecture and hyperparameters and differ only in their random-walk schema: Cooc walks the recipe co-occurrence graph only, Chem walks typed FlavorDB ingredient-compound metapaths only, and Core blends both types of walks. After training, the embeddings contain linearly recoverable supervised directions for cuisine styles, food groups, NOVA processing classes, USDA macronutrient categories, and sensory attributes. An unsupervised FastICA decomposition uncovers 20 interpretable axes per model, while Gaussian-mixture-model partitioning identifies 150-200 named culinary modes per model.
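To make "linearly recoverable supervised direction" concrete, here is a minimal sketch of one simple way to obtain such a direction: take the difference between the mean embedding of a labelled positive group and that of a negative group, then score any ingredient by projecting onto it. This difference-of-means probe is a stand-in chosen for brevity; the summary does not say which probing method the authors use. The helper names and the 2-D toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to recover")
    return [x / norm for x in diff]


def project(vector, unit_dir):
    """Scalar score of `vector` along `unit_dir` (dot product)."""
    return sum(v * d for v, d in zip(vector, unit_dir))


# Toy 2-D "embeddings" for illustration only (not real Epicure vectors).
pos_group = [[1.0, 0.0], [3.0, 0.0]]     # e.g. ingredients carrying a label
neg_group = [[-1.0, 0.0], [-3.0, 0.0]]   # e.g. ingredients without it
d = direction(pos_group, neg_group)
score = project([2.0, 5.0], d)
```

A higher projection score means the ingredient lies further along the labelled direction. A regularised linear classifier would be a more robust probe on real 300-D embeddings.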
The pipeline has five stages: (1) aggregating multilingual recipe data; (2) normalizing raw NER terms into a standardized ingredient vocabulary; (3) constructing the co-occurrence and typed-compound graphs; (4) training the Metapath2Vec variants (Cooc, Chem, Core) on these graphs; and (5) analyzing the resulting embeddings with supervised direction probes and unsupervised factor and mode discovery. Overall, Epicure maps the emergent geometry of food ingredient embeddings by combining multilingual recipe data with graph-based embedding training.
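The training stage relies on random walks, and the three variants differ in which graph those walks traverse. The sketch below illustrates the idea with a single hypothetical `mix` parameter: at each hop the walker takes a co-occurrence edge with probability `mix`, and otherwise takes an ingredient-compound-ingredient step. Setting `mix=1.0` gives Cooc-style walks, `mix=0.0` gives Chem-style walks, and values in between give a Core-style blend. The per-hop mixing rule, the function name, and the untyped compound step are all assumptions; the paper's actual schema uses typed metapaths and may inject co-occurrence walks differently.

```python
import random


def mixed_walk(start, cooc, ing_to_comp, comp_to_ing, hops=5, mix=0.5, rng=None):
    """Generate one walk starting at ingredient `start`.

    cooc:        {ingredient: [co-occurring ingredients]}
    ing_to_comp: {ingredient: [compounds]}
    comp_to_ing: {compound: [ingredients]}  (assumed consistent with ing_to_comp)
    mix:         probability of taking a co-occurrence hop (hypothetical knob)
    """
    rng = rng or random.Random()
    walk = [start]
    node = start
    for _ in range(hops):
        cooc_nbrs = cooc.get(node, [])
        compounds = ing_to_comp.get(node, [])
        # Prefer the sampled graph, but fall back to the other one if the
        # current node has no neighbours there.
        take_cooc = bool(cooc_nbrs) and (not compounds or rng.random() < mix)
        if take_cooc:
            node = rng.choice(cooc_nbrs)
            walk.append(node)
        elif compounds:
            compound = rng.choice(compounds)
            node = rng.choice(comp_to_ing[compound])
            walk.extend([compound, node])   # ingredient -> compound -> ingredient
        else:
            break                            # isolated node: stop early
    return walk


# Toy graphs for illustration only.
cooc = {"tomato": ["basil"], "basil": ["tomato"]}
ing_to_comp = {"tomato": ["linalool"], "basil": ["linalool"]}
comp_to_ing = {"linalool": ["tomato", "basil"]}

cooc_only = mixed_walk("tomato", cooc, ing_to_comp, comp_to_ing,
                       hops=4, mix=1.0, rng=random.Random(0))
chem_only = mixed_walk("tomato", cooc, ing_to_comp, comp_to_ing,
                       hops=4, mix=0.0, rng=random.Random(0))
```

Walks produced this way would then be fed to a skip-gram trainer as sentences, which is the usual Metapath2Vec setup.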
Created on 02 Jun. 2026
