🌳 Random Forest Models
In the following section we will learn a thing or two about random forest models, which are built from decision trees. While these still fall under the category of supervised models, they can one-up regression models in certain ways.

Have you noticed that we asked you to pay attention to the number of observations in the linear regression models? This is because OLS estimation relies on a projection matrix - a mathematical object that requires every element of the matrix to have a value - which does not allow for missing values like the ones we have in our data. Do not worry too much about what is going on mathematically; if you are curious, this document has fantastic explanations, which can always benefit your data science knowledge.
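To make this concrete, here is a minimal sketch (with made-up toy data, not our dataset) showing that `sklearn`'s OLS implementation simply refuses input containing missing values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with one missing value (np.nan) in the feature matrix
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# OLS cannot form its projection matrix with a missing element,
# so scikit-learn raises a ValueError during input validation
try:
    LinearRegression().fit(X, y)
except ValueError as err:
    print("OLS refuses missing values:", err)
```

In practice this means you must drop or impute missing rows before fitting a regression, which is exactly why the number of observations mattered earlier.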
Getting back to the point: missing values are a problem here. Random forest models can, in theory, handle missing values via imputation or related techniques, but you need to pay special attention to this when implementing your code.

Furthermore, as you may have noticed with some of your regression models, predictions on the training subset can look good while predictions on the testing subset are poor - a classic sign of overfitting. Our data may be especially vulnerable to this, as we are dealing with macro-data and a limited number of observations (there are only so many countries on this planet - but they still need to be analyzed!).

Random forest models also allow for more variables to be included: while our regression model only included whether one of the top 3 export or import partners of each country is an advanced economy, you can now reformat this part of the data from long to wide and use these variables to build random forest models - or use the data in long form as is. Try it out and see what works best for you! In a regression, you would never want to include too many variables, to avoid overfitting, or too many irrelevant variables that do not actually yield any information. Random forest models are considerably more robust to overfitting, though not immune to it.

However, be warned that random forest models are, in a sense, a black box: unlike in regressions, we cannot easily tell which variable contributes how much, and in what way. But this is a problem you will look at when evaluating your models. For now, follow the steps below:
Do not forget to comment your code extensively, explain the parameters to your audience, and give the reasoning behind your choice of variables. Briefly explain the output of your model; you will fully evaluate it in the next step.
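The long-to-wide reshaping mentioned above can be sketched with `pandas`. The column names here are purely illustrative stand-ins for the trade-partner variables, not the actual names in our dataset:

```python
import pandas as pd

# Hypothetical long-format data: one row per country per partner rank.
# "partner_advanced" flags whether that partner is an advanced economy.
long_df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "partner_rank": [1, 2, 3, 1, 2, 3],
    "partner_advanced": [1, 0, 1, 0, 0, 1],
})

# Pivot so each partner rank becomes its own column (wide format)
wide_df = long_df.pivot(index="country", columns="partner_rank",
                        values="partner_advanced")
wide_df.columns = [f"partner{r}_advanced" for r in wide_df.columns]
print(wide_df)
```

In wide form, each country is a single row with one column per partner rank, which is the shape a random forest expects when these flags are used as separate features.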
The `sklearn` package already offers random forest functionality. For our task, you want to look into the `RandomForestRegressor` class of the package.
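A minimal sketch of how `RandomForestRegressor` is fitted and scored, using synthetic stand-in data (your own features and target will differ, and you should tune the parameters yourself):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 observations, 4 features, a noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators: number of trees in the forest
# max_depth: limits tree depth, one lever against overfitting
# random_state: fixes the randomness so results are reproducible
model = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Score on the held-out test set (R^2) to check for overfitting
print("Test R^2:", model.score(X_test, y_test))
```

Comparing the training and testing scores, as discussed above, is your first check on whether the model is overfitting.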
