Data Science - Wintersemester 24/25

The Clustering Model


Last updated 3 months ago

In this exercise, we will use the KMeans clustering algorithm to identify patterns in crime locations based on their latitude and longitude. Clustering can help reveal spatial patterns that are not immediately obvious.

Extract and Sample
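One possible way to approach this step in Python is sketched below. It is only an illustration: the DataFrame `crime_df`, its column names `Latitude` and `Longitude`, the synthetic coordinates, and the helper `extract_and_sample` are all assumptions made for the example, so adapt them to your actual crime dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crime dataset (synthetic values,
# assumed column names; your real data will differ).
rng = np.random.default_rng(42)
crime_df = pd.DataFrame({
    "Latitude": rng.normal(40.0, 0.05, size=1000),
    "Longitude": rng.normal(-80.0, 0.05, size=1000),
    "Primary Type": rng.choice(["THEFT", "BATTERY"], size=1000),
})
# Simulate a few records without coordinates.
crime_df.loc[crime_df.index[:10], "Latitude"] = np.nan


def extract_and_sample(df, n, lat_col="Latitude", lon_col="Longitude", seed=42):
    """Keep only the coordinate columns, drop incomplete rows, draw n random rows."""
    coords = df[[lat_col, lon_col]].dropna()
    # Never ask for more rows than are available.
    return coords.sample(n=min(n, len(coords)), random_state=seed)


sample_df = extract_and_sample(crime_df, n=200)
print(sample_df.head())
```

Fixing `random_state` makes the sample reproducible, which helps when you compare clustering results between runs.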

The result should look something like this:

Train the model

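The KMeans algorithm requires us to specify the number of clusters (k) in advance. A good way to determine the optimal k is to compare metrics such as inertia (the within-cluster sum of squared distances, where lower means tighter clusters) and the silhouette score (between -1 and 1, where higher means better separated clusters) across several values of k.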
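A minimal sketch of such a comparison in Python is shown here. It uses synthetic points in place of the sampled coordinates, and the helper name `evaluate_k` and the range of k values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the sampled coordinates: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(100, 2)) for c in centers])


def evaluate_k(X, k_values, seed=42):
    """Fit KMeans for each k and collect the inertia and silhouette score."""
    inertias, silhouettes = [], []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        inertias.append(model.inertia_)                  # within-cluster sum of squares
        silhouettes.append(silhouette_score(X, labels))  # needs at least 2 clusters
    return inertias, silhouettes


k_values = list(range(2, 9))
inertias, silhouettes = evaluate_k(X, k_values)

# Pick the k with the highest silhouette score.
best_k = k_values[int(np.argmax(silhouettes))]
print("k with the highest silhouette score:", best_k)
```

Plotting `inertias` and `silhouettes` against `k_values` gives the two curves to inspect: look for the "elbow" in the inertia curve and the peak in the silhouette curve.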

The result should look something like this:

How does your plot look? What is the optimal k? The silhouette score helps identify the number of clusters that best separate the data points.

Cluster the Data

Once you've chosen the optimal number of clusters:

๐Ÿดโ€โ˜ ๏ธ: The sample() function will help you to extract a random sample. To generate clusters with the KMeans method use the function kmeans(). This function requires you to know how many clusters you want to generate. To determine the optimal number with the silhouette score through iteration use map_dbl() from the package purrr and pam() from the package cluster. To plot your results you can work with ggplot().

🐍 Python: To draw a random sample from your dataset, use the DataFrame's sample() method. For clustering, use KMeans from sklearn.cluster to group the data into clusters. Start by calculating the inertia and the silhouette score to evaluate the clustering performance for a given k. The inertia is available as the inertia_ attribute of a fitted KMeans model, and the silhouette score is provided by silhouette_score in sklearn.metrics. Use fit() to train the model and predict() to assign cluster labels, or do both in one step with fit_predict(). Check the scikit-learn documentation for more.
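Once k is chosen, the final clustering step could be sketched in Python as follows. The synthetic `sample_df`, the column names, the value `optimal_k = 2`, and the new `cluster` column are assumptions for this example, not values taken from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the sampled coordinates: two clearly separated groups.
rng = np.random.default_rng(1)
sample_df = pd.DataFrame({
    "Latitude": np.concatenate([rng.normal(0.0, 0.3, 100), rng.normal(5.0, 0.3, 100)]),
    "Longitude": np.concatenate([rng.normal(0.0, 0.3, 100), rng.normal(5.0, 0.3, 100)]),
})

optimal_k = 2  # assumed here; use the k you found with your own metrics
features = ["Latitude", "Longitude"]

model = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
# fit_predict() trains the model and returns one cluster label per row.
sample_df["cluster"] = model.fit_predict(sample_df[features])

# Cluster centers, in the same order as the feature columns.
cluster_centers = model.cluster_centers_

# predict() assigns an already fitted cluster to new, unseen points.
new_point = pd.DataFrame({"Latitude": [0.1], "Longitude": [0.1]})
new_label = model.predict(new_point)

print(sample_df["cluster"].value_counts())
```

Storing the labels as a column keeps each point and its cluster together, which is convenient for the visualization step that follows.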