Cleaning the Crime Dataset👮🏼

Before analyzing or building models from the crime dataset, it's essential to clean the data to ensure accuracy and consistency. Data cleaning helps handle missing values, standardizes formats, and prepares the data for efficient analysis. By doing this, we minimize errors, improve data quality, and make the dataset easier to work with for further steps like visualization or modeling.

Here are the tasks for cleaning the crime dataset:

Review the dataset and drop any columns that are mostly empty, irrelevant or unlikely to be used in the analysis.
Make the column names lowercase and replace any spaces with underscores (_) to follow naming conventions. You can also choose entirely new, more meaningful names.
Replace any missing values in the dataset with NA (if not already done during data upload).
Convert the date column into a proper date format.
Filter out any entries with invalid geocoordinates (LAT and LON).
Filter out entries with invalid age (age = 0).
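
As a starting point for the first and third tasks, here is a minimal pandas sketch that turns empty strings into NA and then measures how sparse each column is. The toy DataFrame, its column names, and the 50% threshold are all assumptions made for illustration; the real dataset may encode missing values differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime data; column names and values are hypothetical.
df = pd.DataFrame({
    "Vict Age": [34, 0, 27, 51],
    "Weapon Desc": ["", "", "KNIFE", ""],
    "Vict Sex": ["F", "X", "", "M"],
})

# Treat empty or whitespace-only strings as missing (NA).
df = df.replace(r"^\s*$", np.nan, regex=True)

# Share of missing values per column, most sparse first.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen threshold (here 50%, an arbitrary choice)
# are candidates for dropping.
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
print(sparse_cols)
```

Whether a sparse column should really be dropped still depends on the analysis you have in mind; the threshold only flags candidates.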

🏴‍☠️: If you're unsure how to complete these tasks in R, you can use the as.Date() function with a format argument to convert the date column into the correct format. To clean up variable names, you might try names(df) <- tolower(gsub(" ", "_", names(df))) or take advantage of the janitor::clean_names() function for automatic renaming. To handle missing values, note that is.na() only detects values that are already NA; to turn placeholders such as empty strings into NA, use dplyr::na_if() or dplyr::mutate() combined with ifelse().

The dplyr and janitor packages will be helpful in simplifying many of these steps.

🐍:

Start by dropping irrelevant or sparse columns using drop(), focusing on those with little data or no analytical value.
Rename columns with rename() and a dictionary to make names consistent, meaningful, and easier to work with; replacing spaces with underscores is a common convention. Alternatively, apply str.lower() and str.replace() to df.columns.
For string-to-date conversions, try pd.to_datetime() and use its format argument to specify the format the dates come in. For example, 3-20-2024 corresponds to %m-%d-%Y. The format codes are the same ones used by Python's strftime.
Also, to filter out certain entries, keep only the rows that satisfy a boolean condition, for example df[df["age"] > 0], or pass the index of the unwanted rows to drop(). Good luck!
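
The hints above could be combined roughly as follows. This is a sketch on a toy DataFrame, not the solution for the real dataset: every column name except LAT and LON is hypothetical, and treating the point (0, 0) as an invalid coordinate is an assumption you should check against your data.

```python
import pandas as pd

# Toy stand-in for the crime data; names and values are hypothetical.
df = pd.DataFrame({
    "DATE OCC": ["3-20-2024", "3-21-2024", "3-22-2024", "3-23-2024"],
    "Vict Age": [34, 0, 27, 51],
    "LAT": [34.05, 34.10, 0.0, 33.99],
    "LON": [-118.24, -118.30, 0.0, -118.40],
    "Mocodes": [None, None, None, "0416"],
})

# 1. Drop a sparse column by name.
df = df.drop(columns=["Mocodes"])

# 2. Lowercase the names and replace spaces with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")

# 3. Parse the date strings with an explicit format.
df["date_occ"] = pd.to_datetime(df["date_occ"], format="%m-%d-%Y")

# 4. Keep rows with plausible coordinates and a non-zero age.
#    Treating (0, 0) as invalid is an assumption about this dataset.
valid_coords = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
not_origin = (df["lat"] != 0) | (df["lon"] != 0)
df = df[valid_coords & not_origin & (df["vict_age"] > 0)].reset_index(drop=True)
print(df)
```

Filtering with a boolean mask keeps the rows you want, which is usually easier to read than dropping the rows you don't.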

If you need help, make use of the pandas cheat sheets!

