# Cleaning the Crime Dataset 👮🏼

Before analyzing or building models from the crime dataset, it's essential to clean the data to ensure accuracy and consistency. Data cleaning helps handle missing values, standardizes formats, and prepares the data for efficient analysis. By doing this, we minimize errors, improve data quality, and make the dataset easier to work with for further steps like visualization or modeling.

Here are the tasks for cleaning the crime dataset:

* [ ] Review the dataset and drop any columns that are mostly empty, irrelevant or unlikely to be used in the analysis.
* [ ] Make the column names lowercase and replace any spaces with underscores (`_`) to follow naming conventions. You can also choose entirely new names that are more meaningful.
* [ ] Replace any missing values in the dataset with `NA` (if not already done during data upload).
* [ ] Convert the date column into a proper date format.
* [ ] Filter out any entries with invalid geocoordinates (`LAT` and `LON`).
* [ ] Filter out entries with invalid age (age = 0).

{% hint style="info" %}
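🏴‍☠️: If you’re unsure how to complete these tasks in R, you can use the `as.Date()` function to convert the date column into the correct format; pass the `format` argument if the dates are not already written as `YYYY-MM-DD`. To remove spaces from variable names, you might try `names(df) <- gsub(" ", "_", tolower(names(df)))` or take advantage of the `janitor::clean_names()` function for automatic renaming. To handle missing values, note that `is.na()` only detects values that are already `NA`; to turn placeholders such as empty strings into `NA`, you can assign directly (for example `df[df == ""] <- NA`) or use `dplyr::mutate()` combined with `ifelse()` or `dplyr::na_if()`.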

The **dplyr** and **janitor** packages will be helpful in simplifying many of these steps.
{% endhint %}

{% hint style="info" %}
🐍:

* Start by dropping irrelevant or sparse columns using `df.drop(columns=[...])`, focusing on those with little data or no analytical value.
* Rename columns with `df.rename(columns={...})` and a dictionary to make names consistent, meaningful, and easier to work with; replacing spaces with underscores is a common convention. Alternatively, apply `df.columns.str.lower()` and `df.columns.str.replace(" ", "_")` to rename all columns at once.
* For string-to-date conversions, try `pd.to_datetime()`. It can often infer the format on its own, but passing `format` explicitly is safer and avoids mixing up day and month. For example, `3-20-2024` matches `%m-%d-%Y`. More on this [here](https://www.datacamp.com/tutorial/converting-strings-datetime-objects).
* To filter out certain entries, use boolean indexing: `df[condition]` keeps only the rows where the condition is true. You can also pass the index of the unwanted rows to `df.drop()`. Good luck!

If you need help, make use of the [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)!
{% endhint %}
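
If you get stuck, the sketch below shows one possible way to chain the tasks above in pandas. It is not the reference solution: the toy data, the column names other than `LAT` and `LON`, the 50% sparsity threshold, the date format, and the definition of "invalid" coordinates (out of range or exactly `0, 0`) are all assumptions made for illustration. Check them against your actual dataset before reusing anything.

```python
import pandas as pd

# Toy data standing in for the crime dataset. Apart from LAT and LON,
# the column names and all values are invented for illustration.
raw = pd.DataFrame({
    "Date Occ": ["03-20-2024", "04-01-2024", "04-15-2024", "05-02-2024"],
    "Vict Age": [34, 0, 27, 51],
    "LAT": [34.05, 34.10, 0.0, 33.99],
    "LON": [-118.25, -118.30, 0.0, -118.40],
    "Mostly Empty": [None, "", None, "x"],
})


def clean_crime_data(df: pd.DataFrame) -> pd.DataFrame:
    # Treat empty strings as missing values.
    df = df.replace("", pd.NA)

    # Drop columns where more than half of the values are missing
    # (the 0.5 threshold is an assumption, adjust as needed).
    sparse_columns = df.columns[df.isna().mean() > 0.5]
    df = df.drop(columns=sparse_columns)

    # Lowercase the column names and replace spaces with underscores.
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )

    # Parse the date strings; values that do not match become NaT.
    df["date_occ"] = pd.to_datetime(
        df["date_occ"], format="%m-%d-%Y", errors="coerce"
    )

    # Assumed definition of valid coordinates: within range and not (0, 0).
    in_range = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
    is_placeholder = (df["lat"] == 0) & (df["lon"] == 0)
    valid_coords = in_range & ~is_placeholder

    # Age 0 is treated as invalid, as in the task list.
    valid_age = df["vict_age"] != 0

    # Boolean indexing keeps only the rows that pass both checks.
    return df[valid_coords & valid_age].reset_index(drop=True)


cleaned = clean_crime_data(raw)
print(cleaned)
```

The function returns a new DataFrame and leaves `raw` untouched, so you can compare the two to see how many rows and columns each step removed.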

