Finding The Correlation Between Simplicity And Rating Of Recipes

Author: Minh Quang Pham

This data science project, conducted at UCSD, aims to discover the relationship between the amount of ingredients in a recipe on food.com and its average rating.

Introduction

First launched in 1999, food.com has become one of the biggest websites for sharing recipes for any kind of dish you can think of, featuring over 500,000 user-generated recipes and millions of reviews. However, many people now find it harder than ever to prepare food, often because of time constraints, which may make simpler recipes more popular. This raises a question: do simple recipes receive higher ratings on average than other recipes? To find out, I chose to analyze a subset of the data in this report, containing recipes and reviews posted since 2008.

The first dataset, recipes, contains 83782 rows and 10 columns:

| Column | Description |
| --- | --- |
| 'name' | Recipe name |
| 'id' | Recipe ID |
| 'minutes' | Minutes to prepare recipe |
| 'contributor_id' | User ID who submitted this recipe |
| 'submitted' | Date recipe was submitted |
| 'tags' | Food.com tags for recipe |
| 'nutrition' | Nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for "percentage of daily value" |
| 'n_steps' | Number of steps in recipe |
| 'steps' | Text for recipe steps, in order |
| 'description' | User-provided description |

The second dataset, interactions, contains 731927 rows and 5 columns:

| Column | Description |
| --- | --- |
| 'user_id' | User ID |
| 'recipe_id' | Recipe ID |
| 'date' | Date of interaction |
| 'rating' | Rating given |
| 'review' | Review text |

From these datasets, I will analyze whether people rate recipes with fewer ingredients higher than other recipes. To answer that, the most relevant columns in my research are 'n_ingredients', 'rating' (the rating a user gave in their review), and 'average_rating' (the average rating a recipe has received).

From this research, I aim to inform readers about a potential trend in people's preference for simple recipes, which do not require preparing many ingredients and can save both time and money. I hope the research can help food.com improve its reporting on recipes and see how people react to recipes of different complexity.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

In order to analyze this topic effectively, I cleaned the data in the following steps:

  1. Left merge the recipes dataset with the interactions dataset on 'id' and 'recipe_id', respectively.

  2. In the merged dataframe, fill all ratings of 0 with np.nan. Because rating is on a scale of 1 to 5, where 1 is the lowest rating and 5 the highest, a 0 most likely indicates that the review did not include a rating. Replacing 0 with np.nan therefore helps avoid biasing the averages downward.

  3. Use the groupby method to compute the average rating of each recipe as a Series.

  4. Assign a new column 'average_rating' to the dataframe.

  5. Add a column 'simple' to the dataframe. During data cleaning, I found that the average number of ingredients per recipe is 9, so I assume that a simple recipe has fewer ingredients than this amount. 'simple' is a boolean column that is True for recipes with strictly fewer than 9 ingredients and False for all other recipes.

  6. Add a column 'Year' to the dataframe. I do this by converting the 'submitted' column, which is stored as a string, to a datetime with pd.to_datetime, and then extracting the year.
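The cleaning steps above can be sketched in pandas as follows. The tiny dataframes here are hypothetical stand-ins for the real recipes and interactions files, just to make the transformations concrete:

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for the real datasets.
recipes = pd.DataFrame({
    'id': [1, 2],
    'n_ingredients': [5, 12],
    'submitted': ['2008-03-01', '2011-06-15'],
})
interactions = pd.DataFrame({
    'recipe_id': [1, 1, 2],
    'rating': [5, 0, 4],  # a 0 here means the review carried no rating
})

# 1. Left merge recipes with interactions on 'id' / 'recipe_id'.
merged = recipes.merge(interactions, left_on='id',
                       right_on='recipe_id', how='left')

# 2. Replace 0 ratings with NaN so they do not drag the averages down.
merged['rating'] = merged['rating'].replace(0, np.nan)

# 3-4. Per-recipe average rating, mapped back onto every row.
merged['average_rating'] = merged['id'].map(
    merged.groupby('id')['rating'].mean())

# 5. A recipe is 'simple' if it has strictly fewer than 9 ingredients.
merged['simple'] = merged['n_ingredients'] < 9

# 6. Parse the submission date and pull out the year.
merged['Year'] = pd.to_datetime(merged['submitted']).dt.year
```

Because the 0 rating becomes NaN before averaging, recipe 1's average stays at 5.0 rather than being pulled down to 2.5.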

The resulting dataframe has 234429 rows and 20 columns. Because of the large number of columns, I decided to show only the columns most relevant to my research. Below are the first 5 unique recipes of the cleaned dataframe:

| name | id | minutes | n_steps | n_ingredients | rating | average_rating | simple | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 brownies in the world best ever | 333281 | 40 | 10 | 9 | 4 | 4 | False | 2008 |
| 1 in canada chocolate chip cookies | 453467 | 45 | 12 | 11 | 5 | 5 | False | 2011 |
| 412 broccoli casserole | 306168 | 40 | 6 | 9 | 5 | 5 | False | 2008 |
| millionaire pound cake | 286009 | 120 | 7 | 7 | 5 | 5 | True | 2008 |
| 2000 meatloaf | 475785 | 90 | 17 | 13 | 5 | 5 | False | 2012 |

Univariate Analysis

I analyzed the 'n_ingredients' column to see the distribution of the number of ingredients per recipe.

From this histogram, we can observe that the distribution is right-skewed, showing that recipes on food.com tend to have fewer ingredients, with 8 being the most common count. The number of recipes decreases steadily as the ingredient count grows.

Bivariate Analysis

I used the 'Year' column and the number of entries for simple and non-simple recipes to create a bivariate plot. In the graph above, there is a gradual decrease in both simple and 'complex' recipes over the years. We can also see that, as time goes on, the numbers of simple and complex recipes submitted each year became increasingly similar.

Interesting Aggregate

I aggregated the average rating of each type of recipe in each year from 2008 to 2018. The pivot table shows an unexpected result: the average ratings of both recipe types are fairly high and not very different from each other over the years.

| Year | simple = False | simple = True |
| --- | --- | --- |
| 2008 | 4.6524 | 4.67036 |
| 2009 | 4.67023 | 4.69261 |
| 2010 | 4.69371 | 4.70456 |
| 2011 | 4.68996 | 4.72652 |
| 2012 | 4.74331 | 4.69166 |
| 2013 | 4.73265 | 4.67093 |
| 2014 | 4.71671 | 4.71799 |
| 2015 | 4.80851 | 4.41176 |
| 2016 | 4.57778 | 4.43307 |
| 2017 | 4.44889 | 4.49153 |
| 2018 | 4.55422 | 4.3 |

Assessment of Missingness

MNAR Analysis

From my research, I found several columns in the dataset with missing values, but the column most likely to be MNAR is 'review'. The missingness may come from the fact that many people simply have nothing to say about a recipe; they just submit the rating or do not interact with the recipe at all. A text review is usually optional, so users can leave it out when making a rating.

Missingness Dependency

For this section, I am going to test what the missingness of the 'description' column depends on. The two candidate columns are 'minutes' and 'n_ingredients'. My test statistic is the absolute difference in means of each column between rows where 'description' is missing and rows where it is not. The significance level for both tests is 0.05.

Minutes and Description

Null Hypothesis: The distribution of 'minutes' when 'description' is missing is the same as the distribution of 'minutes' when 'description' is not missing.

Alternative Hypothesis: The distribution of 'minutes' when 'description' is missing is not the same as the distribution of 'minutes' when 'description' is not missing.

After performing the permutation test 1000 times to collect 1000 samples, my p-value (0.23) > 0.05. We fail to reject the null hypothesis, so the missingness of 'description' does not appear to depend on 'minutes'.
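The permutation procedure can be sketched as follows. The dataframe here is hypothetical toy data standing in for the real one; the statistic and the label-shuffling loop mirror the test described above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in: a 'minutes' column plus a flag marking rows
# where 'description' is missing (about 10% of rows).
df = pd.DataFrame({
    'minutes': rng.integers(5, 120, size=200),
    'desc_missing': rng.random(200) < 0.1,
})

def abs_diff_in_means(frame):
    """Absolute difference in mean minutes between the two groups."""
    groups = frame.groupby('desc_missing')['minutes'].mean()
    return abs(groups[True] - groups[False])

observed = abs_diff_in_means(df)

# Shuffle the missingness labels 1000 times and recompute the statistic.
diffs = []
for _ in range(1000):
    shuffled = df.assign(
        desc_missing=rng.permutation(df['desc_missing'].values))
    diffs.append(abs_diff_in_means(shuffled))

# p-value: fraction of shuffled statistics at least as extreme as observed.
p_value = np.mean(np.array(diffs) >= observed)
```

Shuffling only the missingness labels keeps the 'minutes' distribution intact, which is exactly the null world where missingness does not depend on that column.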

Ingredients and Description

The other column to be tested is 'n_ingredients'.

Null Hypothesis: The distribution of 'n_ingredients' when 'description' is missing is the same as the distribution of 'n_ingredients' when 'description' is not missing.

Alternative Hypothesis: The distribution of 'n_ingredients' when 'description' is missing is not the same as the distribution of 'n_ingredients' when 'description' is not missing.

I performed the permutation test 1000 times to collect 1000 samples.

The p-value I found is 0.02. Since it is smaller than 0.05, we reject the null hypothesis; the missingness of 'description' therefore appears to depend on 'n_ingredients'.

Hypothesis Testing

I am going to examine, via a permutation test, whether simple recipes have higher average ratings than other recipes. The relevant columns for this test are 'simple' and 'average_rating'.

Null Hypothesis: Simple recipes have the same average ratings as more complex recipes.

Alternative Hypothesis: Simple recipes have higher average ratings than more complex recipes.

Test Statistic: Difference in means between average rating of simple recipes and complex recipes.

Significance Level: 0.05

Below is the empirical distribution of difference of means between simple recipes and complex recipes, with the observed difference being indicated by a red line.

I performed a permutation test by shuffling the 'simple' column over 1000 simulations, and the resulting p-value was 0.00. Because the p-value is smaller than 0.05, the result is statistically significant and we reject the null hypothesis. This suggests that recipes requiring fewer ingredients tend to receive higher ratings on average.
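A sketch of this one-sided test, on hypothetical toy data where simple recipes are generated with slightly higher ratings (the real test runs on the cleaned dataframe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-recipe table: simple recipes get a small rating boost.
n = 300
simple = rng.random(n) < 0.5
ratings = np.clip(4.3 + 0.3 * simple + rng.normal(0, 0.4, n), 1, 5)
df = pd.DataFrame({'simple': simple, 'average_rating': ratings})

def diff_in_means(frame):
    """Mean rating of simple recipes minus mean rating of complex ones."""
    means = frame.groupby('simple')['average_rating'].mean()
    return means[True] - means[False]

observed = diff_in_means(df)

# Shuffle the 'simple' labels; under H0 the labels carry no information.
shuffled_diffs = np.array([
    diff_in_means(df.assign(simple=rng.permutation(df['simple'].values)))
    for _ in range(1000)
])

# One-sided p-value: how often a shuffled difference reaches the observed one.
p_value = np.mean(shuffled_diffs >= observed)
```

Note the statistic is signed, not absolute, because the alternative hypothesis is directional (simple recipes rated *higher*).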

Framing a Prediction Problem

Building on the hypothesis testing section, I decided to frame a prediction problem for average rating. This is a regression problem, since average ratings are continuous quantitative values ranging from 1 to 5.

The evaluation metric will be RMSE. I chose it because it is sensitive to outliers, such as 1-or-2-star reviews on a recipe with a 4.5 average rating, and because it penalizes large mistakes more heavily than metrics like MAE. For instance, if my model predicts a 4 while the real rating is 5, that is a large error that RMSE punishes accordingly.

Baseline Model

My baseline model used linear regression with two features: 'n_steps' and 'n_ingredients'. As both features are quantitative, I standardized them with StandardScaler so that recipes with unusually large numbers of steps or ingredients, which can be considered outliers, do not dominate the fit. This model is intended to help users balance their recipes so that the average rating is as good as possible.

After testing the model, the RMSE of this baseline is 0.4899, meaning my predicted average rating typically differs from the actual average rating by about 0.4899. The model is fairly accurate, but it may also overfit, and 0.4899 is still a sizable error on a 1-to-5 scale. Therefore, I believe there is room for improvement with more data points and features.
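A minimal sketch of such a baseline pipeline, using synthetic stand-in features (random step and ingredient counts) in place of the real cleaned dataframe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical stand-in data: columns are n_steps and n_ingredients.
n = 500
X = np.column_stack([
    rng.integers(1, 40, n),   # n_steps
    rng.integers(1, 20, n),   # n_ingredients
])
y = np.clip(4.7 - 0.01 * X[:, 1] + rng.normal(0, 0.4, n), 1, 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Standardize both quantitative features, then fit a linear regression.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

# Evaluate with RMSE on the held-out split.
rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
```

Evaluating on a held-out test split, as here, is what guards against reporting an overly optimistic training-set RMSE.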

Final Model

For my final model, I decided to add two more features, 'steps_per_ingredient' and 'time_per_step', using FunctionTransformer. 'steps_per_ingredient' is calculated by dividing 'n_steps' by 'n_ingredients'. I chose this feature because it captures the complexity of the process behind a recipe, reflecting general trends in the recipes better. For instance, a recipe with 20 steps for 2 ingredients can be considered more complex than a recipe with 20 steps for 20 ingredients. We get 'time_per_step' by dividing 'minutes' by 'n_steps' to approximate the average minutes spent per step. A short time may suggest a quick and simple process, while a longer time could indicate more intensive prep.

I also changed my modelling algorithm to RandomForestRegressor, because it is better at capturing non-linear relationships between my features and is less affected by outliers in the dataset. The hyperparameters I tuned were max_depth and n_estimators; the best combination was None for max_depth and 100 for n_estimators.

My RMSE is now 0.465, a 0.0249 decrease from the baseline. Based on that decrease, the final model is an improvement over the baseline.
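The final model can be sketched as below, again on hypothetical stand-in data. The text does not state how the hyperparameters were searched, so the GridSearchCV usage here is an assumption; the engineered ratio features and the RandomForestRegressor match the description above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(2)

# Hypothetical stand-in data with the three raw columns the features need.
n = 300
df = pd.DataFrame({
    'minutes': rng.integers(5, 180, n),
    'n_steps': rng.integers(1, 40, n),
    'n_ingredients': rng.integers(1, 20, n),
})
y = np.clip(4.7 - 0.01 * df['n_ingredients'] + rng.normal(0, 0.4, n), 1, 5)

def add_ratios(X):
    """Derive the two engineered ratio features from the raw columns."""
    X = X.copy()
    X['steps_per_ingredient'] = X['n_steps'] / X['n_ingredients']
    X['time_per_step'] = X['minutes'] / X['n_steps']
    return X

# FunctionTransformer adds the ratios; GridSearchCV (an assumption)
# tunes max_depth and n_estimators by cross-validated RMSE.
model = make_pipeline(
    FunctionTransformer(add_ratios),
    GridSearchCV(
        RandomForestRegressor(random_state=2),
        param_grid={'max_depth': [None, 5], 'n_estimators': [50, 100]},
        scoring='neg_root_mean_squared_error',
        cv=3,
    ),
)
model.fit(df, y)
```

Because the ratios are computed inside the pipeline, the same feature engineering is applied consistently at both fit and predict time.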

Fairness Analysis

I decided to use simple and complex recipes, defined by the same condition as in the data cleaning section, as the groups for my fairness analysis. I performed a permutation test by shuffling the simple labels 1000 times to collect 1000 samples.

Null Hypothesis: The model is fair. Its RMSE for simple recipes and complex recipes is roughly the same.

Alternative Hypothesis: The model is unfair. Its RMSE for complex recipes is higher than its RMSE for simple recipes.

Test Statistic: Difference in RMSE between simple recipes and complex recipes.

Significance Level: 0.05

Below is the empirical distribution after performing the permutation test:

After performing the permutation test, the p-value I received was 0.02. Since it is smaller than 0.05, we reject the null hypothesis, meaning the model is not entirely fair: its RMSE for complex recipes tends to be higher than its RMSE for simple recipes.
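The fairness test can be sketched like this, using hypothetical prediction errors in place of the real model residuals; the statistic is the RMSE gap between the two groups, and the simple/complex labels are shuffled as described above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-recipe prediction errors; complex recipes are
# generated with slightly larger errors to illustrate the procedure.
n = 400
simple = rng.random(n) < 0.5
errors = rng.normal(0, 0.45, n) + 0.3 * (~simple)

def rmse_gap(labels):
    """RMSE on complex recipes minus RMSE on simple recipes."""
    rmse_simple = np.sqrt(np.mean(errors[labels] ** 2))
    rmse_complex = np.sqrt(np.mean(errors[~labels] ** 2))
    return rmse_complex - rmse_simple

observed = rmse_gap(simple)

# Shuffle which recipes count as 'simple' 1000 times.
shuffled = np.array([rmse_gap(rng.permutation(simple))
                     for _ in range(1000)])

# One-sided p-value for the alternative "complex RMSE is higher".
p_value = np.mean(shuffled >= observed)
```

Only the group labels are permuted; the errors themselves are fixed, which is the null world where the model's accuracy does not depend on recipe complexity.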