Data Analytics
Top Data Cleaning Methods You Should Know!!
Variable Types problems:
- Manipulating and analyzing data with incorrect data types could lead to compromised analysis as you go along the data science workflow.
- Before preparing to analyze and extract insights from our data, we need to make sure our variables have the correct data types, otherwise we risk compromising our analysis.
- You should check the data types of your columns using the .dtypes attribute or the .info() method.
Example one: String to integer
Hint : The assert statement, which takes in a condition as input, as returns nothing if that condition is met, and an error if it is not.
Example two: Numeric to categorical
- Marital status column (0 for never married, 1 for married, 2 for separated, and 3 for divorced) However it will be imported of type integer, which could lead to misleading results when trying to extract some statistical summaries.
- Solution:
Data Range Constraints Problem:
- We need to pay attention to the range of our data.
Problems:
- Movie rate 6 problem.
- Subscription service in the future.
reasons for this problem may be in data collection or parsing
Solutions:
- Dropping incorrect values: only drop data when a small proportion of your dataset is affected by out-of-range values, you really need to understand your dataset before deciding to drop values.
- Setting custom minimums or maximums to your columns:
- Treat it as missing and impute it.
- We could also, dependent on the business assumptions behind our data, assign a custom value for any values of our data that go beyond a certain range.
Solution For Problem one:
Solution For Problem two:
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
# Save today's date
today = dt.date.today()
# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
Duplicate Values:
Why:
- It is because data entry & human error
- Duplicate data can also arise because of bugs and design errors whether in business processes or data pipelines.
- They often most arise from the necessary act of joining and consolidating data from various resources, which could retain duplicate values.
There are two types of duplicate values:
- Duplicate values for all columns.
- Duplicate values for all columns except some columns
How to find duplicate values and treat the problem:
# Find duplicates
duplicates = ride_sharing.duplicated(subset='ride_id',keep=False)
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values(by='ride_id')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby(['ride_id']).agg(statistics).reset_index()
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
Membership Constraints Problem:
Why:
- Data Entry Errors(Free text or Dropdowns)
- Parsing Errors
Solutions:
- we can drop the rows with incorrect categories.
- We can attempt remapping incorrect categories to correct ones, and more.
Example one:
Example Two:
# Print categories DataFrame
print(categories)
# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique() , "\n")
print('Safety: ', airlines['safety'].unique() , "\n")
print('Satisfaction: ', airlines['satisfaction'].unique() , "\n")# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
- Kerolous Ashraf
- Mar, 27 2022
Add New Comments
Please login in order to make a comment.