Radzikowski and Chen (2026) analyzed FlavorGraph's 300-D embeddings and found fifteen interpretable culinary dimensions spanning taste, texture, nutrition, geography, culture, and processing. That analysis was limited by FlavorGraph's fixed pretraining on an English-centric corpus and by a heterogeneous ingredient vocabulary that included non-food items and preparation details. To address these limitations, the Epicure project introduces a family of three sibling skip-gram ingredient embeddings retrained from scratch. The training corpus comprises 4.14 million recipes from 11 platforms, in languages including English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English. Raw ingredients are normalized to a curated vocabulary of 1,790 canonical entries through an LLM-curated pipeline. The key innovation of Epicure is control over the balance between chemistry-related information and recipe-context signals during training. The three sibling models share the same architecture and hyperparameters and differ only in their input walks: Cooc uses recipe co-occurrence walks, Chem uses typed FlavorDB compound-ingredient metapath walks, and Core blends both. After training, the embeddings contain linearly recoverable supervised directions for cuisine styles, food groups, NOVA processing classes, USDA macronutrient categories, and sensory attributes. An unsupervised FastICA decomposition uncovers 20 interpretable axes per model, and Gaussian-mixture-model partitioning identifies 150-200 named culinary modes per model. The pipeline has five stages: (1) aggregating multilingual recipe data; (2) normalizing raw NER terms into a standardized ingredient vocabulary; (3) constructing co-occurrence and typed-compound graphs; (4) training Metapath2Vec variants (Cooc, Core, Chem) on these graphs; and (5) analyzing the resulting embeddings with supervised direction probes and unsupervised factor and mode discovery. Overall, Epicure characterizes the emergent geometry of food ingredient embeddings trained on multilingual recipe data.
- FlavorGraph's 300-D embeddings revealed fifteen interpretable culinary dimensions
- The Epicure project introduces three sibling skip-gram ingredient embeddings retrained from scratch
- Training data includes 4.14 million recipes from 11 platforms in multiple languages
- Raw ingredients are normalized to a curated vocabulary of 1,790 canonical entries
- Epicure controls the balance between chemistry-related information and recipe-context signals during training
- The three sibling models (Cooc, Chem, Core) share architecture and hyperparameters but differ in input data sources
- Embeddings contain linearly recoverable supervised directions for cuisine styles, food groups, NOVA processing classes, USDA macronutrient categories, and sensory attributes
- FastICA decomposition uncovers 20 interpretable axes per model; Gaussian-mixture-model partitioning identifies 150-200 named culinary modes per model
- The pipeline involves aggregating multilingual recipe data, normalizing NER terms, constructing graphs, training Metapath2Vec variants, and analyzing the resulting embeddings
Summary:
FlavorGraph's 300-D embeddings showed fifteen interpretable culinary dimensions. The Epicure project trains three new ingredient embeddings from scratch on 4.14 million recipes in multiple languages. Raw ingredients are mapped to a vocabulary of 1,790 canonical entries. Epicure controls the balance between chemistry and recipe-context information during training.
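Definitions:
- Embeddings: Vector representations of words or items (here, ingredients) in a mathematical space.
- Sibling: Here, models that share the same architecture and hyperparameters and differ only in their training input.
- Retrained: Here, trained from scratch on new data rather than reusing FlavorGraph's fixed pretrained vectors.
- Normalized: Made consistent or standardized; here, raw ingredient terms mapped to canonical entries.
- Vocabulary: Here, the curated set of 1,790 canonical ingredient entries that the models embed.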
Introduction:
Radzikowski and Chen (2026) analyzed FlavorGraph's 300-D embeddings and found fifteen interpretable culinary dimensions. That analysis was limited by FlavorGraph's fixed pretraining on an English-centric corpus and by a heterogeneous ingredient vocabulary that included non-food items and preparation details. To address these limitations, the Epicure project introduces a family of three sibling skip-gram ingredient embeddings retrained from scratch.
Overview of Epicure Project:
The Epicure project explores the emergent geometry of food ingredient embeddings trained on multilingual recipe data. The pipeline has five stages: (1) aggregating multilingual recipe data; (2) normalizing raw NER terms into a standardized ingredient vocabulary; (3) constructing co-occurrence and typed-compound graphs; (4) training Metapath2Vec variants (Cooc, Core, Chem) on these graphs; and (5) analyzing the resulting embeddings with supervised direction probes and unsupervised factor and mode discovery.
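As an illustration of the graph-construction stage, the following minimal sketch counts how often two ingredients appear in the same recipe and treats each count as an edge weight. The function name `cooccurrence_edges` and the toy recipes are assumptions made for this example; the source does not describe how Epicure weights or filters its co-occurrence graph.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(recipes):
    """Count unordered ingredient pairs that share a recipe.

    `recipes` is an iterable of ingredient lists. Each pair is counted
    once per recipe, and keys are (a, b) tuples with a < b.
    """
    counts = Counter()
    for ingredients in recipes:
        # De-duplicate within the recipe, then sort for stable pair keys.
        unique = sorted(set(ingredients))
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


# Toy recipes, invented for illustration only.
toy_recipes = [
    ["tomato", "basil", "garlic"],
    ["tomato", "basil"],
    ["garlic", "garlic", "tomato"],
]
edges = cooccurrence_edges(toy_recipes)
```

A weighted edge list of this kind could then feed a random-walk sampler; whether Epicure uses raw counts or some normalized weight is not stated in the source.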
Data Collection:
The training corpus comprises 4.14 million recipes from 11 platforms, in languages including English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English. Raw ingredients are normalized to a curated vocabulary of 1,790 canonical entries through an LLM-curated pipeline.
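To make the normalization step concrete, here is a minimal sketch of applying an alias table to raw ingredient strings. It covers only the lookup step: the alias entries, the cleaning rules, and the function names are hypothetical, and the LLM curation that the source describes for building the vocabulary is not reproduced here.

```python
import re
from typing import List, Optional

# Hypothetical alias table. The entries are invented for illustration;
# the real vocabulary has 1,790 canonical entries.
ALIASES = {
    "scallion": "green onion",
    "spring onion": "green onion",
    "green onion": "green onion",
    "cilantro": "coriander leaf",
    "coriander leaves": "coriander leaf",
}


def clean(raw: str) -> str:
    """Lowercase, drop parenthesised notes and punctuation, collapse spaces."""
    text = raw.lower()
    text = re.sub(r"\([^)]*\)", " ", text)   # remove "(chopped)" style notes
    text = re.sub(r"[^\w\s-]", " ", text)    # remove punctuation, keep hyphens
    return re.sub(r"\s+", " ", text).strip()


def normalize(raw: str) -> Optional[str]:
    """Return the canonical entry, or None for out-of-vocabulary terms."""
    return ALIASES.get(clean(raw))


def normalize_recipe(raw_terms: List[str]) -> List[str]:
    """Map raw terms to canonical ingredients, de-duplicated in order."""
    result: List[str] = []
    for term in raw_terms:
        canon = normalize(term)
        if canon is not None and canon not in result:
            result.append(canon)
    return result
```

Returning `None` for unmatched terms is one way to drop non-food items and preparation details, which the source names as a weakness of the FlavorGraph vocabulary.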
Training Process:
The key innovation of Epicure is control over the balance between chemistry-related information and recipe-context signals during training. The three sibling models share the same architecture and hyperparameters and differ only in their input walks: Cooc uses recipe co-occurrence walks, Chem uses typed FlavorDB compound-ingredient metapath walks, and Core blends both.
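The sketch below shows one way such a balance could be controlled, by mixing two kinds of random walks into a single training corpus. It is an assumption, not the project's documented method: the `chem_fraction` parameter, the toy graphs, and the placeholder compound names are all invented, and the source does not say how Core blends its walks.

```python
import random

# Toy graphs, invented for illustration. Compound names are placeholders.
COOC = {  # ingredient -> ingredients it co-occurs with
    "tomato": ["basil", "garlic"],
    "basil": ["tomato", "garlic"],
    "garlic": ["tomato", "basil"],
}
COMPOUNDS_OF = {  # ingredient -> compounds
    "tomato": ["compound_a", "compound_b"],
    "basil": ["compound_b"],
    "garlic": ["compound_a"],
}
# Inverse map: compound -> ingredients containing it.
INGREDIENTS_WITH = {}
for ing, comps in COMPOUNDS_OF.items():
    for comp in comps:
        INGREDIENTS_WITH.setdefault(comp, []).append(ing)


def cooc_walk(start, length, rng):
    """Random walk over ingredient-ingredient co-occurrence edges."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(COOC[walk[-1]]))
    return walk


def chem_walk(start, length, rng):
    """Metapath walk alternating ingredient -> compound -> ingredient."""
    walk = [start]
    while len(walk) < length:
        node = walk[-1]
        table = COMPOUNDS_OF if node in COMPOUNDS_OF else INGREDIENTS_WITH
        walk.append(rng.choice(table[node]))
    return walk


def build_corpus(n_walks, length, chem_fraction, seed=0):
    """Sample walks; chem_fraction is the probability of a chemistry walk.

    0.0 gives a Cooc-style corpus, 1.0 a Chem-style corpus, and values
    in between give a blended, Core-style corpus (assumed scheme).
    """
    rng = random.Random(seed)
    starts = sorted(COOC)
    corpus = []
    for _ in range(n_walks):
        start = rng.choice(starts)
        if rng.random() < chem_fraction:
            corpus.append(chem_walk(start, length, rng))
        else:
            corpus.append(cooc_walk(start, length, rng))
    return corpus
```

Under this scheme, the walk corpus is the only thing that changes between variants, which matches the source's statement that the siblings share architecture and hyperparameters; each corpus would then be passed to the same skip-gram trainer.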
Results:
After training, the embeddings contain linearly recoverable supervised directions for cuisine styles, food groups, NOVA processing classes, USDA macronutrient categories, and sensory attributes. An unsupervised FastICA decomposition uncovers 20 interpretable axes per model, and Gaussian-mixture-model partitioning identifies 150-200 named culinary modes per model.
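To show what "linearly recoverable" can mean in practice, the sketch below fits a difference-of-class-means direction and classifies by projecting onto it. This is a simple stand-in chosen for illustration; the source does not specify which probe Epicure uses, and the 3-D toy vectors and the binary label are invented.

```python
from statistics import mean


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_direction(pos, neg):
    """Return (direction, threshold) separating two labelled groups.

    The direction is the difference of class means; the threshold is the
    projection of the midpoint between the two means.
    """
    mu_p, mu_n = mean_vector(pos), mean_vector(neg)
    direction = [p - n for p, n in zip(mu_p, mu_n)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_p, mu_n)]
    return direction, dot(direction, midpoint)


def predict(vec, direction, threshold):
    """True if the vector projects onto the positive side."""
    return dot(vec, direction) > threshold


# Toy embeddings, invented for illustration (e.g. "in food group X" vs not).
pos = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.3]]
neg = [[-0.9, 0.1, 0.2], [-1.1, 0.0, -0.2]]
direction, threshold = fit_direction(pos, neg)
```

If a single direction of this kind separates held-out ingredients well, the attribute is linearly recoverable in the sense used above. A logistic-regression probe would be a common alternative to the class-mean difference.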
Interpretation of Results:
Taken together, these results characterize the emergent geometry of the Epicure embeddings at three levels: supervised directions that are linearly recoverable, 20 unsupervised FastICA axes per model, and 150-200 named Gaussian-mixture modes per model. They can be read against the fifteen FlavorGraph dimensions reported earlier, which spanned taste, texture, nutrition, geography, culture, and processing.
Conclusion:
In conclusion, the Epicure project addresses the limitations of the earlier FlavorGraph analysis by introducing a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual corpus with a curated vocabulary. The resulting embeddings encode relationships between ingredients and their cultural and geographical contexts. Further research using these embeddings could inform work in food science, nutrition, and cultural studies.