Top Data Cleaning Methods You Should Know!!
Data Analytics

Variable Type Problems:

  • Manipulating and analyzing data with incorrect data types can compromise your analysis further down the data science workflow.
  • Before analyzing and extracting insights from the data, make sure every variable has the correct data type; otherwise we risk misleading results.
  • You can check the data types of your columns with the .dtypes attribute or the .info() method, as in the quick sketch below.
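
A quick sketch of that check, using a hypothetical DataFrame for illustration:

    • import pandas as pd

      # Hypothetical DataFrame: duration is numeric information stored as text
      df = pd.DataFrame({'user_id': [1, 2, 3],
                         'duration': ['10 minutes', '25 minutes', '5 minutes']})

      # Inspect the data type of every column
      print(df.dtypes)
      df.info()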

Example one: String to integer 

Hint: The assert statement takes a condition as input and returns nothing if that condition is met, and raises an error if it is not.
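
A minimal sketch of such a conversion, assuming a hypothetical ride_sharing DataFrame whose duration column is stored as text like "10 minutes":

    • import pandas as pd

      # Hypothetical data: duration stored as strings
      ride_sharing = pd.DataFrame({'duration': ['10 minutes', '25 minutes', '5 minutes']})

      # Strip the "minutes" suffix, then convert the result to an integer type
      ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip(' minutes')
      ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

      # Assert the conversion worked: nothing happens if the condition holds
      assert ride_sharing['duration_time'].dtype == 'int'

      # Numeric summaries are now meaningful
      print(ride_sharing['duration_time'].mean())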

Example two: Numeric to categorical

  • A marital status column might be coded as 0 for never married, 1 for married, 2 for separated, and 3 for divorced. However, it will be imported as an integer type, which could lead to misleading results (such as a meaningless mean) when extracting statistical summaries.
  • Solution: convert the column to the category data type, as sketched below.
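
A minimal sketch of that conversion, assuming a hypothetical demographics DataFrame with an integer marital_status column:

    • import pandas as pd

      # Hypothetical data: marital status coded as integers
      demographics = pd.DataFrame({'marital_status': [0, 1, 2, 3, 1, 0]})

      # Convert the integer codes to the category data type
      demographics['marital_status'] = demographics['marital_status'].astype('category')

      # describe() now returns counts and frequencies instead of a misleading mean
      print(demographics['marital_status'].describe())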

Data Range Constraints Problem:

  • Some values have natural limits, such as a rating scale with a maximum or a date that cannot lie in the future, so we need to pay attention to the range of our data.

Problems:

  1. A movie rating of 6 on a rating scale that only goes up to 5.
  2. A subscription or ride date that lies in the future.

The reasons for these out-of-range values usually lie in data collection or parsing errors.

Solutions:

  1. Dropping incorrect values: only drop data when a small proportion of your dataset is affected by out-of-range values; you really need to understand your dataset before deciding to drop anything.
  2. Setting custom minimums or maximums for your columns (capping values at the allowed limit).
  3. Treating the out-of-range values as missing and imputing them.
  4. Depending on the business assumptions behind our data, assigning a custom value to any values that go beyond a certain range.

Solution For Problem one:
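
A minimal sketch of dropping or capping the out-of-range ratings, assuming a hypothetical movies DataFrame whose avg_rating column should stay between 1 and 5:

    • import pandas as pd

      # Hypothetical data: one rating is above the allowed maximum of 5
      movies = pd.DataFrame({'title': ['A', 'B', 'C'], 'avg_rating': [4, 6, 3]})

      # Option 1: drop the rows whose rating is out of range
      movies_dropped = movies[movies['avg_rating'] <= 5]

      # Option 2: cap anything above 5 at the maximum allowed value
      movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

      # Assert no rating exceeds the maximum any more
      assert movies['avg_rating'].max() <= 5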

Solution For Problem two:

    • # Imports assumed for this snippet
      import pandas as pd
      import datetime as dt

      # Convert ride_date to date
      ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
      # Save today's date
      today = dt.date.today()
      # Set all dates in the future to today's date
      ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
      # Print maximum of ride_dt column
      print(ride_sharing['ride_dt'].max())

Duplicate Values:

Why:

  • Data entry and human error.
  • Duplicate data can also arise from bugs and design errors, whether in business processes or data pipelines.
  • Most often, though, duplicates arise from the necessary act of joining and consolidating data from various sources, which can retain duplicate values.

There are two types of duplicate values:

  • Complete duplicates: the same values across all columns.
  • Partial duplicates: the same values in all columns except a few (for example, the same ride_id with different durations).

How to find duplicate values and treat the problem:

    • # Find duplicates
      duplicates = ride_sharing.duplicated(subset='ride_id',keep=False)
      # Sort your duplicated rides
      duplicated_rides = ride_sharing[duplicates].sort_values(by='ride_id')
      # Print relevant columns of duplicated_rides
      print(duplicated_rides[['ride_id','duration','user_birth_year']])

    • # Drop complete duplicates from ride_sharing
      ride_dup = ride_sharing.drop_duplicates()
      # Create statistics dictionary for aggregation function
      statistics = {'user_birth_year': 'min', 'duration': 'mean'}
      # Group by ride_id and compute new statistics
      ride_unique = ride_dup.groupby(['ride_id']).agg(statistics).reset_index()
      # Find duplicated values again
      duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
      duplicated_rides = ride_unique[duplicates == True]
      # Assert duplicates are processed
      assert duplicated_rides.shape[0] == 0

Membership Constraints Problem:

Why:

  • Data entry errors (free text or dropdowns).
  • Parsing errors.

Solutions:

  • We can drop the rows with incorrect categories.
  • We can remap incorrect categories to correct ones, and more.

Example one:
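
A minimal sketch of the remapping approach, assuming a hypothetical study_data DataFrame with a blood_type column (the invalid 'Z+' value and its mapping to 'O+' are made up for illustration); dropping inconsistent rows is shown in Example Two below:

    • import pandas as pd

      # Hypothetical data: 'Z+' is not a valid blood type
      study_data = pd.DataFrame({'name': ['Asma', 'Lena', 'Omar'],
                                 'blood_type': ['O+', 'Z+', 'AB-']})

      # Remap the incorrect category to a correct one (assumed mapping)
      study_data['blood_type'] = study_data['blood_type'].replace({'Z+': 'O+'})

      # Assert that only valid categories remain
      valid = {'O+', 'O-', 'A+', 'A-', 'B+', 'B-', 'AB+', 'AB-'}
      assert set(study_data['blood_type']).issubset(valid)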


Example Two:

    • # Print categories DataFrame
      print(categories)

      # Print unique values of survey columns in airlines
      print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
      print('Safety: ', airlines['safety'].unique(), "\n")
      print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")

      # Find the cleanliness category in airlines not in categories
      cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
      # Find rows with that category
      cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
      # Print rows with inconsistent category
      print(airlines[cat_clean_rows])
      # Print rows with consistent categories only
      print(airlines[~cat_clean_rows])
Kerolous Ashraf
Mar 27, 2022
