Cleaning the Crime Dataset👮🏼

Before analyzing or building models from the crime dataset, it's essential to clean the data to ensure accuracy and consistency. Data cleaning helps handle missing values, standardizes formats, and prepares the data for efficient analysis. By doing this, we minimize errors, improve data quality, and make the dataset easier to work with for further steps like visualization or modeling.

Here are the tasks for cleaning the crime dataset:

🏴‍☠️: If you’re unsure how to complete these tasks in R, you can use the as.Date() function with its format argument to convert the date column into the correct format. To remove spaces from variable names, you might try names(df) <- gsub(" ", "_", names(df)), or take advantage of the janitor::clean_names() function for automatic renaming. To handle missing values, you can detect them with is.na() and recode placeholder values as NA using dplyr::mutate() combined with ifelse().

The dplyr and janitor packages will be helpful in simplifying many of these steps.

🐍:

  • Start by dropping irrelevant or sparse columns using drop(), focusing on those with little data or no analytical value.

  • Rename columns with rename() and a dictionary to make names consistent, meaningful, and easier to work with; replacing spaces with underscores is a common convention. Alternatively, apply str.replace() and str.lower() to df.columns to normalize every name at once.

  • For string-to-date conversions, try pd.to_datetime(). Use the format argument to specify the format the dates come in: for example, 3-20-2024 matches %m-%d-%Y. More on this here.

  • Also, to filter out certain entries, you can try df.drop() with the index of the rows that match a condition, or keep only the rows you want with a boolean mask. Good luck!
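
As a rough illustration of the pandas hints above, here is a minimal sketch on a small made-up table. The column names ("Report Number", "Date Reported", "Crime Type", "Notes"), the sample values, and the helper clean_crime_data() are all assumptions for demonstration; the real crime dataset will have different columns, so treat this as a pattern rather than a solution.

```python
import pandas as pd

# Hypothetical toy data: column names and values are assumptions,
# not taken from the actual crime dataset.
raw = pd.DataFrame({
    "Report Number": [101, 102, 103, 104],
    "Date Reported": ["3-20-2024", "3-21-2024", "not recorded", "3-23-2024"],
    "Crime Type": ["Theft", "Assault", "Theft", "Unknown"],
    "Notes": [None, None, "follow-up", None],
})


def clean_crime_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps to a copy of the input frame."""
    # 1. Drop a sparse column with little analytical value.
    out = df.drop(columns=["Notes"])

    # 2. Normalize names: lowercase, spaces replaced by underscores,
    #    then rename one column with a dictionary.
    out.columns = out.columns.str.lower().str.replace(" ", "_")
    out = out.rename(columns={"report_number": "report_id"})

    # 3. Convert strings to dates with an explicit format.
    #    errors="coerce" turns unparseable entries into NaT (missing).
    out["date_reported"] = pd.to_datetime(
        out["date_reported"], format="%m-%d-%Y", errors="coerce"
    )

    # 4. Filter out unwanted entries by dropping the matching row labels.
    unwanted = out[out["crime_type"] == "Unknown"].index
    out = out.drop(unwanted)

    return out.reset_index(drop=True)


cleaned = clean_crime_data(raw)
print(cleaned)
```

Step 4 could equally be written as a boolean mask, out[out["crime_type"] != "Unknown"], which keeps the rows you want instead of dropping the ones you don't. Note that the row with the unparseable date is kept with a missing date; whether to drop or keep such rows depends on the analysis.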

If you need help, make use of the pandas cheat sheets!
