One challenging aspect of the recipe parsing problem is the task of predicting ingredient components from the ingredient phrases. Recipes display ingredients like “1 tablespoon fresh lemon juice,” but the database stores ingredients broken down by name (“lemon juice”), quantity (“1”), unit (“tablespoon”) and comment (“fresh”). There is no regular expression clever enough to identify these labels from the ingredient phrases.
Example: Ingredient Labels

| Ingredient Phrase | 1 | tablespoon | fresh | lemon | juice |
|---|---|---|---|---|---|
| Label | QUANTITY | UNIT | COMMENT | NAME | NAME |
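For concreteness, the parsed record for that phrase might be stored as something like the following sketch (the field names are illustrative, not the actual database schema):

```python
# Hypothetical parsed record for "1 tablespoon fresh lemon juice".
# Field names are illustrative; the real database schema may differ.
ingredient = {
    "input": "1 tablespoon fresh lemon juice",
    "qty": 1.0,             # QUANTITY
    "unit": "tablespoon",   # UNIT
    "comment": "fresh",     # COMMENT
    "name": "lemon juice",  # NAME (the two NAME-tagged words joined)
}
```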
This type of problem is referred to as a structured prediction problem because we are trying to predict a structure — in this case a sequence of labels — rather than a single label. Structured prediction tasks are difficult because the choice of a particular label for any one word in the phrase can change the model’s prediction of labels for the other words. The added model complexity allows us to learn rich patterns about how words in a sentence interact with the words and labels around them.
We chose to use a discriminative structured prediction model called a linear-chain conditional random field (CRF), which has been successful on similar tasks such as part‐of‐speech tagging and named entity recognition.
The basic setup of the problem is as follows:

Let $x = (x_1, \ldots, x_n)$ be an observed sequence of words (an ingredient phrase) and $y = (y_1, \ldots, y_n)$ be the corresponding sequence of tags.

For example, if $x$ = (1, tablespoon, fresh, lemon, juice), then the correct tag sequence is $y$ = (QUANTITY, UNIT, COMMENT, NAME, NAME).
The goal is to use data to learn a model that can predict the tag sequence for any ingredient phrase we throw at it, even if the model has never seen that ingredient phrase before. We approach this task by modeling the conditional probability of a sequence of tags given the input, denoted $p(y_1, \ldots, y_n \mid x_1, \ldots, x_n)$, or $p(y \mid x)$ for short.
The process of learning that probability model is described in detail below, but first imagine that someone handed us the perfect probability model $p(y \mid x)$.
Armed with this model, we could predict the best sequence of labels for an ingredient phrase by simply searching over all tag sequences and returning the one that has the highest probability.
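In symbols, the prediction is the single highest-probability tag sequence:

$$\hat{y} = \arg\max_{y} \, p(y \mid x)$$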
For example, suppose our ingredient phrase is "pinch of salt." Then we need to score all the possible sequences of three tags: $p(\text{NAME}, \text{NAME}, \text{NAME} \mid \text{pinch, of, salt})$, $p(\text{QUANTITY}, \text{NAME}, \text{NAME} \mid \text{pinch, of, salt})$, and so on.
While this seems like a simple problem, it can quickly become computationally unpleasant to score all of the $K^n$ possible sequences, where $K$ is the number of distinct tags and $n$ is the number of words in the phrase.
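To see the blowup concretely, here is a minimal sketch that enumerates every candidate tag sequence for a three-word phrase; the five-tag set and the `score` function are stand-ins, not the actual model:

```python
from itertools import product

TAGS = ["NAME", "QUANTITY", "UNIT", "COMMENT", "OTHER"]  # illustrative set, K = 5
words = ["pinch", "of", "salt"]                          # n = 3

def score(tags, words):
    """Stand-in for p(y | x); a trained model would supply this."""
    return 0.0  # placeholder

# Brute force: K**n = 5**3 = 125 candidate sequences for 3 words,
# but a 10-word phrase would already require 5**10 (about 9.8 million) scores.
candidates = list(product(TAGS, repeat=len(words)))
print(len(candidates))  # 125
best = max(candidates, key=lambda tags: score(tags, words))
```

In practice the chain structure of the model lets decoders avoid this enumeration entirely, using dynamic programming (the Viterbi algorithm) to find the best sequence efficiently.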
So given a model $p(y \mid x)$, how do we compute the probability of a particular tag sequence?
A linear-chain CRF models this probability in the following way:

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n} \exp\big(\psi(y_i, y_{i-1}, x, i)\big)$$

where $Z(x)$ is a normalization constant that makes the probabilities of all possible tag sequences sum to one.
Let’s break this equation down in English.
The above equation introduces a "potential" function, $\psi(y_i, y_{i-1}, x, i)$, which scores how compatible the tag $y_i$ is with the previous tag $y_{i-1}$ and the observed words.
The potential function is a weighted sum of simple feature functions, each of which captures a single attribute of the labels and words:

$$\psi(y_i, y_{i-1}, x, i) = \sum_{k} w_k f_k(y_i, y_{i-1}, x, i)$$

We often define feature functions to return either 0 or 1. Each feature function, $f_k$, is paired with a weight, $w_k$, that determines how strongly that feature influences the potential.
There is a feature function for every word/label pair and for every consecutive label pair, plus some hand-crafted functions. By modeling the conditional probability of labels given words in this way, we have reduced the task of learning $p(y \mid x)$ to the task of learning the feature weights $w_k$.
For example, one feature function might be $f_k(y_i, y_{i-1}, x, i) = 1$ if $x_i$ = "tablespoon" and $y_i$ = UNIT, and $0$ otherwise; a large positive weight on that feature pushes the model to tag the word "tablespoon" as a unit.
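To make the machinery concrete, here is a minimal sketch of scoring under this model, assuming two made-up feature functions and hand-set weights (a real model has many thousands of features, with weights learned from data):

```python
import math
from itertools import product

TAGS = ["QUANTITY", "UNIT", "COMMENT", "NAME"]  # illustrative tag set

# Two hypothetical binary feature functions.
def f_tablespoon_unit(y_cur, y_prev, words, i):
    return 1.0 if words[i] == "tablespoon" and y_cur == "UNIT" else 0.0

def f_unit_after_quantity(y_cur, y_prev, words, i):
    return 1.0 if y_prev == "QUANTITY" and y_cur == "UNIT" else 0.0

FEATURES = [f_tablespoon_unit, f_unit_after_quantity]
WEIGHTS = [2.0, 1.5]  # made-up; in practice these are learned

def potential(y_cur, y_prev, words, i):
    """psi: weighted sum of feature functions at position i."""
    return sum(w * f(y_cur, y_prev, words, i) for w, f in zip(WEIGHTS, FEATURES))

def unnormalized_score(tags, words):
    """exp of the summed potentials along the chain."""
    total, prev = 0.0, None  # None marks the start-of-sequence "previous tag"
    for i, tag in enumerate(tags):
        total += potential(tag, prev, words, i)
        prev = tag
    return math.exp(total)

def probability(tags, words):
    """p(y | x): normalize by Z(x), summed over all tag sequences.
    Brute-force normalization for clarity; real implementations
    compute Z(x) with dynamic programming (the forward algorithm)."""
    z = sum(unnormalized_score(t, words)
            for t in product(TAGS, repeat=len(words)))
    return unnormalized_score(tags, words) / z

words = ["1", "tablespoon", "fresh", "lemon", "juice"]
tags = ["QUANTITY", "UNIT", "COMMENT", "NAME", "NAME"]
print(probability(tags, words))
```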
Due to properties of the model (chiefly, that the training objective is convex with respect to the weights), there is a single best set of weights, and we can find it using an iterative optimization algorithm. We used the CRF++ implementation to do the optimization and inference.
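CRF++ is trained from a plain-text file in which each word occupies one line, the gold tag is the last column, and a blank line separates sequences; a template file then declares which column combinations become feature functions. A minimal sketch of producing such a file, with a single word column and illustrative tags:

```python
# Write labeled phrases in CRF++'s column format: one token per line,
# tag in the last column, blank line between sequences.
examples = [
    [("1", "QUANTITY"), ("tablespoon", "UNIT"), ("fresh", "COMMENT"),
     ("lemon", "NAME"), ("juice", "NAME")],
    [("pinch", "UNIT"), ("of", "OTHER"), ("salt", "NAME")],  # illustrative tags
]

with open("train.crf", "w") as out:
    for phrase in examples:
        for word, tag in phrase:
            out.write(f"{word}\t{tag}\n")
        out.write("\n")  # blank line ends the sequence

# Training and tagging then use the CRF++ command-line tools:
#   crf_learn template_file train.crf model
#   crf_test -m model test.crf
```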
---
Written by @jaygray0919
with contributions from @gigster99