Set Up

To build our models and evaluate whether they have predictive power, we first need to set up our data.

Typically, we randomly split the entire data set, meaning all observations, into two parts: one part is used to train the models, and the other part is used to test them. While there is no fixed rule for what share of the data should be reserved for training and for testing, we will reserve 2/3 for training and 1/3 for testing. Most importantly, the split must be random, so do not forget to set a seed. If you do not know what a seed is: you can think of it as the starting value used to generate a sequence of random numbers. Using the same seed always reproduces the same sequence of random numbers, which is useful when testing programs, as we will do here, and also in simulations. Summing up what to do:

Remember that if you compute a new variable, such as a dummy variable you would like to use in your regression, you need to do this for both the training and the testing data frames!
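One way to keep the two data frames consistent is to put the computation into a small function and apply it to both. The sketch below is only an illustration: the column `age`, the dummy `senior`, the cutoff of 40, and the helper `add_senior_dummy` are hypothetical names and values, not part of the original data.

```python
import pandas as pd

# Hypothetical training and testing data frames; in practice these
# would be the two parts obtained from the random split.
train = pd.DataFrame({"age": [23, 41, 35, 58]})
test = pd.DataFrame({"age": [30, 62]})


def add_senior_dummy(df, cutoff=40):
    """Return a copy of df with a 0/1 dummy for age above the cutoff."""
    out = df.copy()
    out["senior"] = (out["age"] > cutoff).astype(int)
    return out


# Apply the same transformation to both data frames.
train = add_senior_dummy(train)
test = add_senior_dummy(test)

print(train)
print(test)
```

Because both frames go through the same function, they end up with the same columns, which a model fitted on the training data expects when it predicts on the testing data.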

Using the train_test_split() function from the sklearn.model_selection module allows you to split the dataset into two parts. A random seed can be set with the random_state parameter of the function.
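A minimal sketch of such a split, assuming a small made-up data frame: the column names `x` and `y` and the seed value 42 are arbitrary choices for illustration. Setting `test_size=1/3` reserves one third of the observations for testing, and calling the function twice with the same `random_state` shows that the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data set with 12 observations.
data = pd.DataFrame({
    "x": range(12),
    "y": [2 * v + 1 for v in range(12)],
})

# Reserve 1/3 of the observations for testing; random_state sets the seed.
train, test = train_test_split(data, test_size=1/3, random_state=42)

# Calling the function again with the same seed reproduces the same split.
train_again, test_again = train_test_split(data, test_size=1/3, random_state=42)

print(len(train), len(test))
print(train.index.equals(train_again.index))
```

With a different `random_state`, or with none at all, the rows assigned to each part will generally differ from run to run.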
