
One challenging aspect of the recipe parsing problem is the task of predicting ingredient components from the ingredient phrases. Recipes display ingredients like “1 tablespoon fresh lemon juice,” but the database stores ingredients broken down by name (“lemon juice”), quantity (“1”), unit (“tablespoon”) and comment (“fresh”). There is no regular expression clever enough to identify these labels from the ingredient phrases.

Example: Ingredient Labels

    Ingredient phrase:  1          tablespoon   fresh     lemon   juice
    Labels:             QUANTITY   UNIT         COMMENT   NAME    NAME
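
To make the target structure concrete, here is a minimal Python sketch (with illustrative field names, not our production schema) of how a tagged phrase collapses into the structured record the database stores:

```python
# Hypothetical sketch: collapsing a tagged ingredient phrase into a
# structured record. Field names are illustrative.
def collapse(words, tags):
    record = {"name": [], "qty": [], "unit": [], "comment": []}
    field_for_tag = {"NAME": "name", "QUANTITY": "qty",
                     "UNIT": "unit", "COMMENT": "comment"}
    for word, tag in zip(words, tags):
        if tag in field_for_tag:          # OTHER tokens are dropped
            record[field_for_tag[tag]].append(word)
    return {field: " ".join(tokens) for field, tokens in record.items()}

words = ["1", "tablespoon", "fresh", "lemon", "juice"]
tags = ["QUANTITY", "UNIT", "COMMENT", "NAME", "NAME"]
print(collapse(words, tags))
# {'name': 'lemon juice', 'qty': '1', 'unit': 'tablespoon', 'comment': 'fresh'}
```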

This type of problem is referred to as a structured prediction problem because we are trying to predict a structure — in this case a sequence of labels — rather than a single label. Structured prediction tasks are difficult because the choice of a particular label for any one word in the phrase can change the model’s prediction of labels for the other words. The added model complexity allows us to learn rich patterns about how words in a sentence interact with the words and labels around them.

We chose to use a discriminative structured prediction model called a linear-chain conditional random field (CRF), which has been successful on similar tasks such as part‐of‐speech tagging and named entity recognition.

The basic setup of the problem is as follows:

Let $W$ be the set of ingredient phrases, e.g. {“½ cups whole wheat flour”, “pinch of salt”, …}, where each $w = (w_1, w_2, \dots, w_n) \in W$ is an ordered list of words. Associated with each $w$ is a list of tags, $t = (t_1, t_2, \dots, t_n)$.

For example, if $w =$ (“1”, “tablespoon”, “fresh”, “lemon”, “juice”) then $t =$ (QUANTITY, UNIT, COMMENT, NAME, NAME). A tag is either a NAME, UNIT, QUANTITY, COMMENT or OTHER (i.e., none of the above).
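
In code, a training example under this notation is just a pair of parallel sequences; a minimal sketch, assuming a simple list-of-pairs representation:

```python
# Sketch of the training data: each example pairs an ordered list of
# words w with an equally long list of tags t.
TAGS = {"NAME", "UNIT", "QUANTITY", "COMMENT", "OTHER"}

training_examples = [
    (["1", "tablespoon", "fresh", "lemon", "juice"],
     ["QUANTITY", "UNIT", "COMMENT", "NAME", "NAME"]),
    # ... one (w, t) pair per ingredient phrase in the data set
]

for w, t in training_examples:
    assert len(w) == len(t) and set(t) <= TAGS
```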

The goal is to use data to learn a model that can predict the tag sequence for any ingredient phrase we throw at it, even if the model has never seen that ingredient phrase before. We approach this task by modeling the conditional probability of a sequence of tags given the input, denoted $p(\text{tags} \mid \text{words})$ or, using the above notation, $p(t \mid w)$.

The process of learning that probability model is described in detail below, but first imagine that someone handed us the perfect probability model, $p(t \mid w)$, that returns the “true” probability of a sequence of labels given an ingredient phrase. We want to use $p(t \mid w)$ to discover (or infer) the most probable label sequence.

Armed with this model, we could predict the best sequence of labels for an ingredient phrase by simply searching over all tag sequences and returning the one that has the highest probability.

For example, suppose our ingredient phrase is “pinch of salt.” Then we need to score all the possible sequences of 3 tags: with 5 possible tags per word, that is $5^3 = 125$ candidate sequences.
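
As a sketch of that brute-force search, assuming a hypothetical `score` function standing in for the learned model, we can enumerate every candidate sequence and keep the best one:

```python
# Brute-force inference sketch: enumerate every possible tag sequence for a
# 3-word phrase and keep the one the model scores highest. The `score`
# function is a placeholder for the learned model p(t | w).
from itertools import product

TAGS = ["NAME", "UNIT", "QUANTITY", "COMMENT", "OTHER"]

def best_tag_sequence(words, score):
    candidates = product(TAGS, repeat=len(words))   # 5**3 = 125 sequences here
    return max(candidates, key=lambda tags: score(words, tags))

words = ["pinch", "of", "salt"]
# best = best_tag_sequence(words, score)   # requires a trained scoring model
```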

While this seems like a simple problem, it can quickly become computationally unpleasant to score all of the sequences, since the number of candidates grows exponentially with the length of the phrase. The beauty of the linear-chain CRF model is that it makes some conditional independence assumptions that allow us to use dynamic programming to efficiently search the space of all possible label sequences. In the end, we are able to find the best tag sequence in time that is quadratic in the number of tags and linear in the number of words, i.e. $O(n \cdot |T|^2)$ for a phrase of $n$ words and a tag set $T$.
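
Below is a minimal sketch of that dynamic program (essentially the Viterbi algorithm), assuming a hypothetical `log_potential` function that returns the log of the model’s potential for a pair of adjacent tags at a given position; working in log space turns the product of potentials into a sum:

```python
# Viterbi-style dynamic programming sketch: finds the highest-scoring tag
# sequence in O(n * |TAGS|^2) time instead of scoring all |TAGS|**n sequences.
# `log_potential(prev_tag, tag, words, i)` is assumed to come from the model.
import math

TAGS = ["NAME", "UNIT", "QUANTITY", "COMMENT", "OTHER"]

def viterbi(words, log_potential):
    n = len(words)
    # best[i][tag] = best score of any tag sequence for words[:i+1] ending in tag
    best = [{t: -math.inf for t in TAGS} for _ in range(n)]
    back = [{t: None for t in TAGS} for _ in range(n)]

    for t in TAGS:                      # first word: no previous tag
        best[0][t] = log_potential(None, t, words, 0)

    for i in range(1, n):
        for t in TAGS:
            for prev in TAGS:
                score = best[i - 1][prev] + log_potential(prev, t, words, i)
                if score > best[i][t]:
                    best[i][t] = score
                    back[i][t] = prev

    # Walk the backpointers from the best final tag to recover the sequence.
    tag = max(TAGS, key=lambda t: best[n - 1][t])
    tags = [tag]
    for i in range(n - 1, 0, -1):
        tag = back[i][tag]
        tags.append(tag)
    return list(reversed(tags))
```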

So given a model that encodes whether a particular tag sequence is a good fit for an ingredient phrase, we can return the best tag sequence. But how do we learn that model?

A linear-chain CRF models this probability in the following way:

$$p(t \mid w) = \frac{1}{Z(w)} \prod_{i=1}^{n} \phi(t_{i-1}, t_i, w, i)$$

where $n$ is the number of words in the ingredient phrase, $t_0$ is a special start tag and $Z(w)$ is a normalizing constant that makes the probabilities sum to one.

Let’s break this equation down in English.

The above equation introduces a “potential” function $\phi$ that takes two consecutive labels, $t_{i-1}$ and $t_i$, the ingredient phrase, $w$, and the position $i$. We construct $\phi$ so that it returns a large, non-negative number if the labels $t_{i-1}$ and $t_i$ are a good match for the $(i-1)$th and $i$th words in the sentence respectively, and a small, non-negative number if not. (Remember that probabilities must be non-negative.)

The potential function $\phi$ is an exponentiated weighted sum of simple feature functions, each of which captures a single attribute of the labels and words:

$$\phi(t_{i-1}, t_i, w, i) = \exp\left(\sum_{k} \lambda_k f_k(t_{i-1}, t_i, w, i)\right)$$

Exponentiating guarantees the result is non-negative, and the weights $\lambda_k$ control how strongly each feature pushes the potential up or down.

We often define feature functions to return either 0 or 1. Each feature function, $f_k$, is chosen by the person who creates the model, based on what information might be useful to determine the relationship between words and labels. Some feature functions we used for this problem were indicators such as “the current word is ‘tablespoon’ and the current label is UNIT” or “the previous label is QUANTITY and the current label is UNIT.”
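
For illustration only (these are not the exact features from our model), here is a small Python sketch of indicator feature functions and the exponentiated weighted sum that forms the potential:

```python
# Illustrative sketch of indicator feature functions and the potential phi.
# These particular features and weights are examples, not the production set.
import math

def f_word_is_numeric(prev_tag, tag, words, i):
    # Fires when the current word starts with a digit and is tagged QUANTITY.
    return 1 if words[i][:1].isdigit() and tag == "QUANTITY" else 0

def f_quantity_then_unit(prev_tag, tag, words, i):
    # Fires on the consecutive label pair QUANTITY -> UNIT (e.g. "1 tablespoon").
    return 1 if prev_tag == "QUANTITY" and tag == "UNIT" else 0

feature_functions = [f_word_is_numeric, f_quantity_then_unit]
weights = [2.0, 1.5]   # learned from data in practice; hard-coded here

def phi(prev_tag, tag, words, i):
    # Exponentiated weighted sum of feature values: always non-negative.
    return math.exp(sum(lam * f(prev_tag, tag, words, i)
                        for lam, f in zip(weights, feature_functions)))
```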

There is a feature function for every word/label pair and for every consecutive label pair, plus some hand-crafted functions. By modeling the conditional probability of labels given words in this way, we have reduced the task of learning $p(t \mid w)$ to the problem of learning “good” weights $\lambda_k$ on each of the feature functions. By good, we mean that we want to learn large positive weights on features that capture highly likely patterns in the data, large negative weights on features that capture highly unlikely patterns, and small weights on features that don’t capture any patterns at all.

For example, a feature $f_1$ that fires when the current word is “½” and its label is QUANTITY describes a likely pattern in the data (“½” is likely a quantity), a feature $f_2$ that fires when the current word is “flour” and its label is QUANTITY describes an unlikely pattern (the word “flour” is almost never a quantity), and a feature $f_3$ that fires when the current word is lowercased doesn’t capture a common pattern (the ingredient phrases are almost always lowercased). In this case, we want $\lambda_1$ to be a large positive number, $\lambda_2$ to be a large negative number and $\lambda_3$ to be close to 0.

Due to properties of the model (chiefly, that the negative log-likelihood is convex with respect to the weights), there is one best set of weights and we can find it using an iterative optimization algorithm. We used the CRF++ implementation to do the optimization and inference.
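
As a rough sketch of that workflow (file names and the feature template are illustrative, not our exact pipeline), CRF++ reads training data with one token per line, the label in the last column and a blank line between sequences, and is driven by its crf_learn and crf_test command-line tools:

```python
# Sketch of preparing CRF++-formatted training data and invoking its CLI tools.
# CRF++ expects one token per line, whitespace-separated columns with the label
# last, and a blank line between sequences; a feature template file is required.
import subprocess

examples = [
    [("1", "QUANTITY"), ("tablespoon", "UNIT"), ("fresh", "COMMENT"),
     ("lemon", "NAME"), ("juice", "NAME")],
]

with open("train.data", "w") as f:
    for phrase in examples:
        for word, tag in phrase:
            f.write(f"{word}\t{tag}\n")
        f.write("\n")                     # blank line ends the sequence

# Train a model from the template and training data, then tag held-out data.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)
subprocess.run(["crf_test", "-m", "model", "test.data"], check=True)
```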