All the code and data are available here
Project brief
You are working for a company that buys coffee from farmers and sells it to cafes. Every time they receive beans, they have to pay Q graders to grade the coffee so they can set better prices.
Your manager has asked you to develop a model that predicts the grade points without the need to hire graders, and to report any other insights you can find in the data.
Country of origin
The majority of coffee beans come from the Americas and are mostly processed using a washed method.
The natural (dry) method is often used in regions with limited access to water, and Brazil is one of them. However, this method has some advantages in terms of taste: dried coffee tends to have a rich, heavy body, which many coffee drinkers prefer. It also allows producers to experiment with different fermentation techniques.
Processing methods
As mentioned above, the majority of bags are processed using the washed method, followed by the natural method. The remaining bags fall into the semi-washed category and its variations (pulped, honey, etc.).
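This breakdown is easy to check with a quick count; a sketch, assuming the cleaned data frame `coffee` has a `processing.method` column (the exact column name may differ in your data):

```r
library(dplyr)

# Share of bags per processing method (column name is an assumption)
coffee %>%
  count(processing.method, sort = TRUE) %>%
  mutate(share = round(100 * n / sum(n), 1))
```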
Quality and Altitude
It’s a commonly held belief that the higher the altitude, the better the quality. There is some truth to that: low-altitude coffees tend to taste earthy and dull and are best avoided.
However, we don’t see a strong correlation, which means many other factors affect the taste. That said, all the bags with 85 or more grade points were grown at 1,000 metres or higher.
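The weak relationship can be quantified directly; here is a sketch, assuming the data has a mean-altitude column (the name `altitude_mean_meters` is an assumption):

```r
library(dplyr)

# Correlation between altitude and grade points (column names assumed)
cor(coffee$altitude_mean_meters, coffee$total.cup.points,
    use = "complete.obs")

# Minimum growing altitude among the bags graded 85+
coffee %>%
  filter(total.cup.points >= 85) %>%
  summarise(min_altitude = min(altitude_mean_meters, na.rm = TRUE))
```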
Grade categories
The highest grade in this dataset is 90.58, which means we don’t really have any bags of ‘Outstanding’ coffee. There are some ‘Excellent’ (85+) coffees, but most bags fall into the ‘Very Good’ (80+) category. Only 13.5% of our coffee is graded below 80 and therefore cannot be labelled ‘Specialty Coffee’. So if you randomly picked a bag from this dataset, you would most likely enjoy your cup of coffee (assuming you know how to brew it).
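The category shares can be reproduced with `cut()`; a sketch, where the cut-offs (80, 85, 90) follow the grading bands used above and the column name is assumed:

```r
library(dplyr)

# Bucket grade points into the categories used in this post
coffee %>%
  mutate(grade_category = cut(
    total.cup.points,
    breaks = c(-Inf, 80, 85, 90, Inf),
    labels = c("Below Specialty", "Very Good", "Excellent", "Outstanding"),
    right = FALSE  # a grade of exactly 80 counts as 'Very Good'
  )) %>%
  count(grade_category) %>%
  mutate(share = round(100 * n / sum(n), 1))
```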
Grade distribution per country
Modelling
set.seed(150)
# Randomly order the data
rows <- sample(nrow(coffee))
shuffled_coffee <- coffee[rows, ]
# Determine the row to split on
split <- round(nrow(coffee) * 0.80)
# Create train and test sets
train <- shuffled_coffee[1:split, ]
test <- shuffled_coffee[(split + 1):nrow(shuffled_coffee), ]
Cross validation model
Let’s try a linear model (lm) with cross-validation. It’s a good baseline.
library(caret)  # provides train() and trainControl()

set.seed(66)
# Fit an lm model using 10-fold CV, on the training set only
model_cv <- train(
  total.cup.points ~ .,
  data = train,
  method = "lm",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)
p <- predict(model_cv, test)
To evaluate a regression model I like to use RMSE (root mean squared error): the smaller, the better.
error <- p - test$total.cup.points
rmse_cv <- sqrt(mean(error^2))
rmse_cv
## [1] 1.94814
A good start, but we want an RMSE smaller than the standard deviation of the grade points in the dataset, so let’s try a random forest.
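Why the SD is the bar to beat: a model that always predicts the training mean achieves a test RMSE of roughly the target’s standard deviation, so anything above that adds no value. A sketch, assuming the `train`/`test` split above:

```r
# Naive baseline: always predict the mean grade of the training set.
# Any useful model must beat this RMSE, which is close to the target's SD.
baseline <- mean(train$total.cup.points)
rmse_baseline <- sqrt(mean((baseline - test$total.cup.points)^2))
rmse_baseline
sd(coffee$total.cup.points)
```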
Random forest
set.seed(60)
# Fit a random forest using 5-fold CV (requires the ranger package)
model_rf <- train(
  total.cup.points ~ .,
  data = train,
  tuneLength = 1,
  method = "ranger",
  trControl = trainControl(
    method = "cv",
    number = 5,
    verboseIter = TRUE
  )
)
p <- predict(model_rf, test)
And calculate the RMSE:
error <- p - test$total.cup.points
# Calculate RMSE
rmse_rf <- sqrt(mean(error^2))
rmse_rf
## [1] 1.357035
Conclusion
We got an RMSE of 1.357035, which is a pretty good result given that the standard deviation of the grade points is 2.601557. We also don’t know what the human error is; I doubt that Q graders can be more consistent than an ML model.
Since hiring Q graders is expensive, this model can be used to save money and time on grading the beans.