Data Science - Wintersemester 24/25

Cleaning the Crime Dataset👮🏼


Last updated 4 months ago

Before analyzing or building models from the crime dataset, it's essential to clean the data to ensure accuracy and consistency. Data cleaning helps handle missing values, standardizes formats, and prepares the data for efficient analysis. By doing this, we minimize errors, improve data quality, and make the dataset easier to work with for further steps like visualization or modeling.

Here are the tasks for cleaning the crime dataset:

🏴‍☠️: If you’re unsure how to complete these tasks in R, you can use the as.Date() function with its format argument to convert the date column into the correct format. To remove spaces from variable names, you might try names(df) <- gsub(" ", "_", names(df)) or take advantage of the janitor::clean_names() function for automatic renaming. To handle missing values, you can detect them with is.na(), and convert entries that should count as missing to NA using dplyr::mutate() combined with ifelse().

The dplyr and janitor packages will be helpful in simplifying many of these steps.

🐍: If you’re unsure how to complete these tasks in Python, here are some hints:

  • Start by dropping irrelevant or sparse columns using df.drop(columns=[...]), focusing on those with little data or no analytical value.

  • Rename columns with rename() and a dictionary to make names consistent, meaningful, and easier to work with; replacing spaces with underscores is a common convention. Alternatively, apply str.replace() and str.lower() to df.columns to rename all columns at once.

  • For string-to-date conversions, try pd.to_datetime(). Use the format argument to specify the format the dates come in: for example, 3-20-2024 corresponds to %m-%d-%Y. More on this here.

  • To filter out certain entries, you can use boolean indexing (df[condition]) or pass the index of the matching rows to df.drop(). Good luck!
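The Python hints above can be sketched end to end on a tiny made-up table. This is only an illustration: the column names, the values, and the rule for which rows to filter out are assumptions, not the actual crime dataset, so adapt them to what you find in your own data.

```python
import pandas as pd

# Hypothetical sample data; the real crime dataset's columns will differ.
raw = pd.DataFrame({
    "Case Number": ["A1", "A2", "A3", "A4"],
    "Date Reported": ["3-20-2024", "3-21-2024", "not recorded", "3-23-2024"],
    "Crime Type": ["THEFT", "ASSAULT", "THEFT", "UNKNOWN"],
    "Notes": [None, None, "see file", None],
})

# 1. Drop a sparse column with little analytical value.
df = raw.drop(columns=["Notes"])

# 2. Rename one column with a dictionary, then normalize every name:
#    trim whitespace, lowercase, and replace spaces with underscores.
df = df.rename(columns={"Date Reported": "Report Date"})
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# 3. Convert strings to dates. errors="coerce" turns unparseable
#    entries into NaT instead of raising an error.
df["report_date"] = pd.to_datetime(
    df["report_date"], format="%m-%d-%Y", errors="coerce"
)

# 4. Filter out entries with boolean indexing: keep rows that have a
#    valid date and a known crime type (an assumed rule for this sketch).
keep = df["report_date"].notna() & (df["crime_type"] != "UNKNOWN")
df = df[keep].reset_index(drop=True)

print(df)
```

Boolean indexing is used for the last step because it states the rows to keep directly; df.drop(df[~keep].index) would give the same rows.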

If you need help, make use of the pandas cheat sheets!