The Clustering Model

In this exercise, we will use the KMeans clustering algorithm to identify patterns in crime locations based on their latitude and longitude. Clustering can help reveal spatial patterns that are not immediately obvious.

Extract and Sample

Start by extracting only the geographical coordinates of a sample of 5000 crime incidents.
Create a scatter plot of the geographical data to visualize the initial distribution of crime locations.
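The two steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the reference solution: the `crimes` DataFrame is filled with synthetic stand-in data, and the column names `Latitude` and `Longitude` are assumptions that may differ in your dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption): replace with your own crime DataFrame.
rng = np.random.default_rng(42)
crimes = pd.DataFrame({
    "Latitude": rng.normal(41.85, 0.08, size=20_000),
    "Longitude": rng.normal(-87.65, 0.06, size=20_000),
    "Type": rng.choice(["THEFT", "BATTERY"], size=20_000),
})

# Keep only the coordinates, drop rows with missing values,
# then draw a reproducible random sample of 5000 incidents.
coords = (
    crimes[["Latitude", "Longitude"]]
    .dropna()
    .sample(n=5000, random_state=42)
)

# Scatter plot of the sampled locations (longitude on x, latitude on y).
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(coords["Longitude"], coords["Latitude"], s=2, alpha=0.5)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Sample of 5000 crime locations")
```

In a script, call `plt.show()` afterwards to display the figure. Fixing `random_state` keeps the sample the same between runs, which makes the later clustering results easier to compare.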

The result should look something like this:

Train the Model

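The KMeans algorithm requires us to specify the number of clusters (k) in advance. A good way to choose k is to compare metrics such as inertia (the within-cluster sum of squared distances, where lower means tighter clusters) and the silhouette score (which ranges from -1 to 1, where higher means better-separated clusters).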

Iterate over a range of k values (e.g. from 3 to 15) and compute the silhouette score for each.
Plot the silhouette score against k to find the value of k that maximizes the score.
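One way to run this loop in Python is sketched below. It is an illustration under assumptions: `X` is synthetic data from `make_blobs` standing in for the sampled coordinates, and the names `k_values`, `scores`, and `best_k` are chosen here for the example only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the sampled coordinates (assumption).
X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.5, random_state=0)

k_values = range(3, 16)  # k = 3, 4, ..., 15
inertias, scores = [], []

for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    inertias.append(model.inertia_)             # within-cluster sum of squares
    scores.append(silhouette_score(X, labels))  # mean silhouette, in [-1, 1]

# The k with the highest silhouette score.
best_k = k_values[int(np.argmax(scores))]

# Silhouette score against k.
fig, ax = plt.subplots()
ax.plot(list(k_values), scores, marker="o")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("Silhouette score")
```

The inertia values are collected as well, so the same loop can feed an elbow plot if you want a second opinion on k.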

The result should look something like this:

How does your plot look? What is the optimal k? The silhouette score helps identify the number of clusters that best separate the data points.

Cluster the Data

Once you've chosen the optimal number of clusters:

Apply the KMeans algorithm to the geographical data.
Add the cluster labels to the dataset.
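A minimal Python sketch of these two steps follows. The data is synthetic, the column names `Latitude`, `Longitude`, and `cluster` are assumptions, and `best_k = 5` is only a placeholder for whatever value your silhouette plot suggests.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the sampled coordinates (assumption).
rng = np.random.default_rng(0)
coords = pd.DataFrame({
    "Latitude": rng.normal(41.85, 0.08, size=5000),
    "Longitude": rng.normal(-87.65, 0.06, size=5000),
})

best_k = 5  # placeholder: use the k chosen from your own silhouette plot

# Fit KMeans on the two coordinate columns and get one label per row.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0)
labels = kmeans.fit_predict(coords[["Latitude", "Longitude"]])

# Add the cluster labels to the dataset as a new column.
coords["cluster"] = labels

# Cluster centers, in the same column order used for fitting.
centers = pd.DataFrame(kmeans.cluster_centers_, columns=["Latitude", "Longitude"])
```

`fit_predict()` is used here as a shortcut for calling `fit()` and then `predict()` on the same data.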

🏴‍☠️: The sample() function helps you extract a random sample. To generate clusters with the KMeans method, use the function kmeans(), which requires you to specify how many clusters you want. To find the optimal number of clusters by iterating over the silhouette score, use map_dbl() from the purrr package together with pam() from the cluster package. Note that pam() fits a k-medoids model rather than KMeans and reports the average silhouette width for each k. To plot your results, you can work with ggplot().

🐍: To generate a random sample from your dataset, use the DataFrame sample() method. For clustering, use KMeans from sklearn.cluster to group the data into clusters. Start by calculating metrics such as inertia and the silhouette score to evaluate the clustering performance for a given k. The silhouette score is available in sklearn.metrics as silhouette_score. Use fit() to train the model and predict() to assign cluster labels, or use fit_predict() to do both in one call. Check the sklearn documentation for more.

