Top Data Cleaning Methods You Should Know!!
Data Analytics

Variable Type Problems:

  • Manipulating and analyzing data with incorrect data types can compromise your analysis further down the data science workflow.
  • Before analyzing and extracting insights from the data, make sure every variable has the correct data type; otherwise we risk misleading results.
  • You can check the data types of your columns with the .dtypes attribute or the .info() method, as in the quick sketch below.
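
A quick sketch of that check, using a hypothetical DataFrame for illustration:

    • import pandas as pd

      # Hypothetical DataFrame: duration is numeric information stored as text
      df = pd.DataFrame({'user_id': [1, 2, 3],
                         'duration': ['10 minutes', '25 minutes', '5 minutes']})

      # Inspect the data type of every column
      print(df.dtypes)
      df.info()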

Example one: String to integer 

Hint: The assert statement takes a condition as input and returns nothing if that condition is met, and raises an error if it is not.
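
A minimal sketch of such a conversion, assuming a hypothetical ride_sharing DataFrame whose duration column is stored as text like "10 minutes":

    • import pandas as pd

      # Hypothetical data: duration stored as strings
      ride_sharing = pd.DataFrame({'duration': ['10 minutes', '25 minutes', '5 minutes']})

      # Strip the "minutes" suffix, then convert the result to an integer type
      ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip(' minutes')
      ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

      # Assert the conversion worked: nothing happens if the condition holds
      assert ride_sharing['duration_time'].dtype == 'int'

      # Numeric summaries are now meaningful
      print(ride_sharing['duration_time'].mean())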

Example two: Numeric to categorical

  • A marital status column might be coded as 0 for never married, 1 for married, 2 for separated, and 3 for divorced. However, it will be imported as an integer type, which could lead to misleading results (such as a meaningless mean) when extracting statistical summaries.
  • Solution: convert the column to the category data type, as sketched below.
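
A minimal sketch of that conversion, assuming a hypothetical demographics DataFrame with an integer marital_status column:

    • import pandas as pd

      # Hypothetical data: marital status coded as integers
      demographics = pd.DataFrame({'marital_status': [0, 1, 2, 3, 1, 0]})

      # Convert the integer codes to the category data type
      demographics['marital_status'] = demographics['marital_status'].astype('category')

      # describe() now returns counts and frequencies instead of a misleading mean
      print(demographics['marital_status'].describe())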

Data Range Constraints Problem:

  • Some values have natural limits, such as a rating scale with a maximum or a date that cannot lie in the future, so we need to pay attention to the range of our data.

Problems:

  1. A movie rating of 6 on a rating scale that only goes up to 5.
  2. A subscription or ride date that lies in the future.

The reasons for these out-of-range values usually lie in data collection or parsing errors.

Solutions:

  1. Dropping incorrect values: only drop data when a small proportion of your dataset is affected by out-of-range values; you really need to understand your dataset before deciding to drop anything.
  2. Setting custom minimums or maximums for your columns (capping values at the allowed limit).
  3. Treating the out-of-range values as missing and imputing them.
  4. Depending on the business assumptions behind our data, assigning a custom value to any values that go beyond a certain range.

Solution For Problem one:
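
A minimal sketch of dropping or capping the out-of-range ratings, assuming a hypothetical movies DataFrame whose avg_rating column should stay between 1 and 5:

    • import pandas as pd

      # Hypothetical data: one rating is above the allowed maximum of 5
      movies = pd.DataFrame({'title': ['A', 'B', 'C'], 'avg_rating': [4, 6, 3]})

      # Option 1: drop the rows whose rating is out of range
      movies_dropped = movies[movies['avg_rating'] <= 5]

      # Option 2: cap anything above 5 at the maximum allowed value
      movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

      # Assert no rating exceeds the maximum any more
      assert movies['avg_rating'].max() <= 5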

Solution For Problem two:

    • # Imports assumed for this snippet
      import pandas as pd
      import datetime as dt

      # Convert ride_date to date
      ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
      # Save today's date
      today = dt.date.today()
      # Set all dates in the future to today's date
      ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
      # Print maximum of ride_dt column
      print(ride_sharing['ride_dt'].max())

Duplicate Values:

Why:

  • Data entry and human error.
  • Duplicate data can also arise from bugs and design errors, whether in business processes or data pipelines.
  • Most often, though, duplicates arise from the necessary act of joining and consolidating data from various sources, which can retain duplicate values.

There are two types of duplicate values:

  • Complete duplicates: the same values across all columns.
  • Partial duplicates: the same values in all columns except a few (for example, the same ride_id with different durations).

How to find duplicate values and treat the problem:

    • # Find duplicates
      duplicates = ride_sharing.duplicated(subset='ride_id',keep=False)
      # Sort your duplicated rides
      duplicated_rides = ride_sharing[duplicates].sort_values(by='ride_id')
      # Print relevant columns of duplicated_rides
      print(duplicated_rides[['ride_id','duration','user_birth_year']])

    • # Drop complete duplicates from ride_sharing
      ride_dup = ride_sharing.drop_duplicates()
      # Create statistics dictionary for aggregation function
      statistics = {'user_birth_year': 'min', 'duration': 'mean'}
      # Group by ride_id and compute new statistics
      ride_unique = ride_dup.groupby(['ride_id']).agg(statistics).reset_index()
      # Find duplicated values again
      duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
      duplicated_rides = ride_unique[duplicates == True]
      # Assert duplicates are processed
      assert duplicated_rides.shape[0] == 0

Membership Constraints Problem:

Why:

  • Data entry errors (free text or dropdowns).
  • Parsing errors.

Solutions:

  • We can drop the rows with incorrect categories.
  • We can remap incorrect categories to correct ones, and more.

Example one:
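
A minimal sketch of the remapping approach, assuming a hypothetical study_data DataFrame with a blood_type column (the invalid 'Z+' value and its mapping to 'O+' are made up for illustration); dropping inconsistent rows is shown in Example Two below:

    • import pandas as pd

      # Hypothetical data: 'Z+' is not a valid blood type
      study_data = pd.DataFrame({'name': ['Asma', 'Lena', 'Omar'],
                                 'blood_type': ['O+', 'Z+', 'AB-']})

      # Remap the incorrect category to a correct one (assumed mapping)
      study_data['blood_type'] = study_data['blood_type'].replace({'Z+': 'O+'})

      # Assert that only valid categories remain
      valid = {'O+', 'O-', 'A+', 'A-', 'B+', 'B-', 'AB+', 'AB-'}
      assert set(study_data['blood_type']).issubset(valid)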


Example Two:

    • # Print categories DataFrame
      print(categories)

      # Print unique values of survey columns in airlines
      print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
      print('Safety: ', airlines['safety'].unique(), "\n")
      print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")

      # Find the cleanliness category in airlines not in categories
      cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
      # Find rows with that category
      cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
      # Print rows with inconsistent category
      print(airlines[cat_clean_rows])
      # Print rows with consistent categories only
      print(airlines[~cat_clean_rows])
Kerolous Ashraf
Mar 27, 2022
