# The Clustering Model

In this exercise, we will use the KMeans clustering algorithm to identify patterns in crime locations based on their latitude and longitude. Clustering can help reveal spatial patterns that are not immediately obvious.

### Extract and Sample

* [ ] Start by extracting only the geographical coordinates of a sample of 5000 crime incidents.
* [ ] Create a scatter plot of the geographical data to visualize the initial distribution of crime locations.

The result should look something like this:

<figure><img src="/files/CSgEkraGzpeA9GDZjAAI" alt="" width="375"><figcaption></figcaption></figure>
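
If you are working in Python, the two steps above could be sketched roughly as follows. This is only an illustration on synthetic stand-in data: the DataFrame name `crimes`, the column names `Latitude` and `Longitude`, and the extra `incident_id` column are assumptions, so adapt them to the actual dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the crime dataset (values are made up).
# The column names below are assumptions; use the names from your data.
rng = np.random.default_rng(42)
crimes = pd.DataFrame({
    "incident_id": np.arange(20_000),
    "Latitude": rng.normal(40.0, 0.08, size=20_000),
    "Longitude": rng.normal(-80.0, 0.06, size=20_000),
})

# Keep only the coordinate columns, drop rows with missing values,
# and draw a reproducible random sample of 5000 incidents.
coords = (
    crimes[["Latitude", "Longitude"]]
    .dropna()
    .sample(n=5000, random_state=42)
)

# Scatter plot: longitude on the x-axis, latitude on the y-axis.
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(coords["Longitude"], coords["Latitude"], s=2, alpha=0.5)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Sampled crime locations")
```

In a notebook the figure renders automatically; in a plain script you would add `plt.show()` at the end. Fixing `random_state` keeps the sample reproducible between runs.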

### Train the Model

The KMeans algorithm requires us to specify the number of clusters (`k`). A good way to determine the optimal `k` is by using metrics like inertia and the silhouette score.

* [ ] Iterate over a range of `k` values (e.g. from 3 to 15) and compute the silhouette score for each one.
* [ ] Plot the silhouette score against `k` to find the value of `k` that maximizes the score.

The result should look something like this:

<figure><img src="/files/yFh2n2nDwGQC7XDNpDG5" alt="" width="375"><figcaption></figcaption></figure>

How does your plot look? What is the optimal `k`? The silhouette score helps identify the number of clusters that best separate the data points.
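
One possible way to run this search in Python is sketched below. It uses synthetic data with four well-separated blobs instead of the crime coordinates, and the variable names (`X`, `k_values`, `best_k`) are purely illustrative. Treat it as a starting point under those assumptions, not as the reference solution.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four blobs around assumed centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(150, 2)) for c in centers])

k_values = range(3, 16)  # k = 3, ..., 15
inertias = []
silhouettes = []

for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    # Inertia: sum of squared distances to the nearest cluster center.
    inertias.append(model.inertia_)
    # Mean silhouette score over all points (between -1 and 1).
    silhouettes.append(silhouette_score(X, labels))

# The k with the highest mean silhouette score.
best_k = k_values[int(np.argmax(silhouettes))]
print("best k:", best_k)

# Plot the silhouette score against k.
fig, ax = plt.subplots()
ax.plot(list(k_values), silhouettes, marker="o")
ax.set_xlabel("k")
ax.set_ylabel("Silhouette score")
```

On this toy data the score should peak at the true number of blobs. Real coordinates rarely give such a clean peak, so it is worth looking at the whole curve rather than only the maximum.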

### Cluster the Data

Once you've chosen the optimal number of clusters:

* [ ] Apply the KMeans algorithm to the geographical data.
* [ ] Add the cluster labels to the dataset.
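
In Python, these two steps might look like the sketch below. The data is again synthetic, the column names `Latitude`, `Longitude` and `cluster` are assumptions, and `best_k = 3` simply stands in for whatever value your own silhouette analysis suggested.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the sampled coordinates: three separated groups.
rng = np.random.default_rng(1)
coords = pd.DataFrame({
    "Latitude": np.concatenate(
        [rng.normal(m, 0.1, size=200) for m in (0.0, 1.0, 2.0)]
    ),
    "Longitude": np.concatenate(
        [rng.normal(m, 0.1, size=200) for m in (0.0, 1.0, 2.0)]
    ),
})

best_k = 3  # assumed here; use the k you selected earlier

# Fit KMeans and assign a cluster label to every row in one step.
model = KMeans(n_clusters=best_k, n_init=10, random_state=42)
coords["cluster"] = model.fit_predict(coords[["Latitude", "Longitude"]])

# Inspect cluster sizes and the fitted cluster centers.
print(coords["cluster"].value_counts().sort_index())
print(model.cluster_centers_)
```

`fit_predict()` is equivalent to calling `fit()` followed by `predict()` on the same data. The label numbers themselves are arbitrary, so only the grouping carries meaning.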

{% hint style="info" %}
🏴‍☠️: The `sample()` function helps you extract a random sample. To generate clusters with the KMeans method, use the function `kmeans()`, which requires you to specify how many clusters you want. To determine the optimal number of clusters with the **silhouette score**, iterate over candidate values using `map_dbl()` from the package `purrr` and `pam()` from the package `cluster`. To plot your results, you can work with `ggplot()`.
{% endhint %}

{% hint style="info" %}
🐍: To draw a random sample from your dataset, use the `sample()` method. For clustering, use `KMeans` from `sklearn.cluster` to group the data into clusters. Evaluate the clustering for a given `k` with metrics like **inertia** and the **silhouette score**: inertia is available as the `inertia_` attribute of a fitted model, and the silhouette score as `silhouette_score` in `sklearn.metrics`. Use `fit()` to train the model and `predict()` to assign cluster labels, or do both in one step with `fit_predict()`. Check the `sklearn` documentation for more.
{% endhint %}

