Effective Machine Learning Through Data Preparation



Introduction

Data preparation is the process of cleaning and transforming raw data, whether structured or unstructured, before analyzing it and building predictive machine learning models. It is an important step because most learning algorithms have expectations about their input data (e.g., that variables follow a Gaussian distribution, or that all values are numeric), and when those expectations are not met, the algorithm performs below expectations. It often involves transforming data and engineering variables to meet the requirements of the chosen learning algorithm. These are among the reasons why data preparation deserves separate attention when developing a predictive model. Data preparation also involves following due process to avoid leaking information from the test data to the learning algorithm.


"...Practitioners agree that the vast majority of time in building a machine learning pipeline is spent on feature engineering and data cleaning. Yet, despite its importance, the topic is rarely discussed on its own..."

— Page vii, Feature Engineering for Machine Learning, 2018.


This article highlights the importance of data preparation in building machine learning models, elaborates on the core steps involved, and closes with a hands-on implementation that puts the discussion into practice.

By the end of this article you should be conversant with:

  • The necessity of data preparation.
  • The core steps involved in data preparation.
  • The statistical methods used for data preparation.
  • The effect of data preparation on model accuracy.

Data preparation tends to be one of the most laborious tasks in a data science or machine learning project. One reason is that it is iterative rather than sequential: you go over it again and again until the desired result is achieved. Another is that every dataset is likely to be inconsistent and highly specific to its project. Nevertheless, there are enough similar tasks and steps across machine learning projects that we can define a loose sequence of steps and subtasks that are likely to be performed repeatedly. This procedure provides a context in which we can consider the data preparation required for the project, informed both by the problem definition performed before data preparation and by the evaluation of machine learning algorithms performed after.

"...the right features can only be defined in the context of both the model and the data; since data and models are so diverse, it’s difficult to generalize the practice of feature engineering across projects"

— Page vii, Feature Engineering for Machine Learning, 2018.

Although your project might be unique, the steps on the path to a good outcome are generally the same across projects. This is sometimes referred to as the applied machine learning process: a sequence of steps that is likely repeated across numerous projects.

Note: The steps are the same, but the names of the steps and the tasks performed may differ based on interpretation. Furthermore, although the steps are written sequentially, you are likely to move forward and backward between them when working on a project.

I like to define the process using four high-level steps:

  • Step 1: Define Problem.
  • Step 2: Data Preparation.
  • Step 3: Evaluate Models.
  • Step 4: Finalize Model.

As interesting as all the steps are, I will focus mainly on step two, data preparation, since the sole aim of this article is to draw your attention to its importance for effective machine learning.


DEFINING DATA PREPARATION IN THE CONTEXT OF MACHINE LEARNING

Raw data from the source cannot be used directly when working on a machine learning project, for reasons such as:

  1. Complex nonlinear relationships may need to be teased out of the data.
  2. Some machine learning algorithms impose requirements on the data.
  3. Statistical noise and errors in the data may need to be reviewed.
  4. Machine learning algorithms require data to be numbers.
  5. Most data comes in an unstructured format.
  6. Some input variables may turn out to be of no use to the model.

As such, raw data must be pre-processed prior to being used to fit and evaluate a learning model. This step in a predictive modeling project is referred to as data preparation, although it goes by many other names that all point to the same agenda; some of those names may better fit as sub-tasks of the broader data preparation process. Hence, data preparation is the cleaning, analyzing and, most importantly, transforming of raw data into a form that is more suitable for modeling and for predicting patterns in the data.

"...Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, janitor work, etc., can be a painstakingly laborious process."

— Page v, Data Wrangling with R, 2016.

Your data preparation will be highly specific to your data, the goals of your project, and the algorithms that will be used to model your data. Nevertheless, there are common or standard tasks that you may use or explore during the data preparation step of a machine learning project.

These tasks include:

  • Data Cleaning
  • Feature Selection
  • Data Transforms
  • Feature Engineering
  • Dimensionality Reduction

DATA CLEANING 

Data cleaning involves finding and fixing logical problems or errors in messy data. The most useful data cleaning draws on deep domain expertise and may involve identifying and addressing specific observations that are incorrect. There are many reasons data may have incorrect values, such as typos, corruption, and duplication. Once messy, noisy, corrupt, and erroneous observations are identified, they can be addressed; this might involve removing a row or a column, or replacing observations with new values. As such, there are general data cleaning operations that can be performed, such as the following (a brief Python sketch appears after the list):

  • Using statistics to define normal data and identify outliers.
  • Identifying columns that have the same value or no variance and removing them.
  • Identifying duplicate rows of data and removing them.
  • Marking empty values as missing.
  • Imputing missing values using statistics or a learned model.
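
To make this concrete, here is a minimal sketch of the last three operations using pandas and scikit-learn. The DataFrame and its column names are hypothetical, invented purely for illustration.

# a minimal data cleaning sketch on a small, hypothetical DataFrame
import pandas as pd
from sklearn.impute import SimpleImputer

# messy data: column 'b' has no variance, the first two rows are duplicates,
# and column 'a' contains a missing value
df = pd.DataFrame({
    'a': [1.0, 1.0, 2.0, None, 2.0],
    'b': [5, 5, 5, 5, 5],
    'c': [10, 10, 30, 40, 30],
})
# remove columns with a single unique value (no variance)
df = df.loc[:, df.nunique() > 1]
# remove duplicate rows
df = df.drop_duplicates()
# impute the remaining missing values with the column mean
cleaned = SimpleImputer(strategy='mean').fit_transform(df)
print(cleaned)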


Overview of Data Cleaning Techniques (image source: https://machinelearningmastery.com)


FEATURE SELECTION

Feature selection refers to techniques for selecting a subset of input features that are most relevant to the target variable being predicted. This is important because irrelevant and redundant input variables can distract or mislead learning algorithms, possibly resulting in lower predictive performance. Statistical methods, such as correlation, are popular for scoring input features; the features can then be ranked by their scores and a subset with the largest scores used as input to the model. Additionally, there are several common feature selection use cases you may encounter in a predictive modeling project, such as the following (a scikit-learn sketch appears after the list):

  • Categorical inputs for a classification target variable.
  • Numerical inputs for a classification target variable.
  • Numerical inputs for a regression target variable.
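
As an illustration, here is a minimal sketch of the numerical-inputs/classification-target case, scoring features with the ANOVA F-statistic via scikit-learn's SelectKBest. The dataset is synthetic and the choice of k=5 is arbitrary.

# select the 5 best numerical input features for a classification target
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=7)
# score each feature against the target and keep the top 5
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X.shape, '->', X_selected.shape)  # (1000, 20) -> (1000, 5)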


Overview of Feature Selection Techniques (image source: https://machinelearningmastery.com)


DATA TRANSFORMS

Data transforms are used to change the type or distribution of data variables. This is a large umbrella of different techniques and they may be just as easily applied to input and output variables. Recall that data may have one of a few types, such as numeric or categorical, with sub-types for each, such as integer and real-valued floating point values for numeric, and nominal, ordinal, and boolean for categorical.
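
For example, a numeric variable can be rescaled to the range 0-1 and a categorical variable can be one-hot encoded into boolean columns. A minimal sketch with scikit-learn, using small made-up arrays:

# rescale a numeric variable and one-hot encode a categorical variable
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric = np.array([[1.0], [5.0], [10.0]])
print(MinMaxScaler().fit_transform(numeric).ravel())  # [0. 0.444 1.]

categorical = np.array([['red'], ['green'], ['red']])
print(OneHotEncoder().fit_transform(categorical).toarray())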


Overview of Data Transform Techniques (image source: https://machinelearningmastery.com)


FEATURE ENGINEERING

Feature engineering refers to the process of creating new input variables from the available data. Engineering new features is highly specific to your data and data types; as such, it often requires the collaboration of a subject matter expert to identify new features that could be constructed from the data. This specialization makes it a challenging topic to generalize into standard methods. Nevertheless, there are some techniques that can be reused, such as the following (a brief pandas sketch appears after the list):

  • Adding a boolean flag variable for some state.
  • Adding a group or global summary statistic, such as a mean.
  • Adding new variables for each component of a compound variable, such as a date-time.
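
A minimal pandas sketch of these three techniques on a small, hypothetical sales table (the column names and values are invented for illustration):

# engineer a flag, a group summary, and date-time components
import pandas as pd

df = pd.DataFrame({
    'store': ['A', 'A', 'B'],
    'sales': [120.0, 80.0, 200.0],
    'timestamp': pd.to_datetime(['2022-03-01 09:30', '2022-03-02 14:00', '2022-03-05 18:45']),
})
# boolean flag variable for some state
df['high_sale'] = df['sales'] > 100
# group summary statistic: mean sales per store
df['store_mean_sales'] = df.groupby('store')['sales'].transform('mean')
# new variables for each component of a compound date-time variable
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['hour'] = df['timestamp'].dt.hour
print(df)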



DIMENSIONALITY REDUCTION  

The number of input features for a dataset may be considered the dimensionality of the data. For example, two input variables together define a two-dimensional area where each row of data is a point in that space. This idea scales to any number of input variables, creating large multi-dimensional hyper-volumes. The problem is that the more dimensions this space has (i.e., the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

This motivates feature selection, although an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data, as sketched below.
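
Here is a minimal sketch using principal component analysis (PCA), one common projection method, to reduce 20 synthetic input features to 5 components; the number of components is an arbitrary choice for the example.

# project a 20-dimensional dataset down to 5 principal components
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X.shape, '->', X_reduced.shape)  # (1000, 20) -> (1000, 5)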


Overview of Dimensionality Reduction Techniques (image source: https://machinelearningmastery.com)


Lastly, here is a hands-on implementation showing the effect of applying data preparation appropriately:

  • Naive application of data preparation methods to the whole dataset results in data leakage, which causes incorrect estimates of model performance.
  • Data preparation must be fit on the training set only and then used to transform both the train and test sets, so as to avoid data leakage.
  • Let's see how to implement data preparation without data leakage for train-test splits in Python.


# naive approach: normalizing the data before splitting it and evaluating the model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the whole dataset (this leaks test-set information into the scaler)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))


Output of the above code snippet


# correct approach: normalizing the data after it is split, before the model is evaluated
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset only
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset using the training-set statistics
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))


Output of the above code snippet


From the accuracy results it is clear that the two approaches yield different estimates of model performance. The estimate from the naive approach is contaminated by data leakage and cannot be trusted, while the correct approach gives an honest estimate of how the model will perform on truly unseen data.
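
As a further safeguard, scikit-learn's Pipeline can bundle the data preparation and the model into a single estimator, so that the scaler is refit on the training data only within every split, including during cross-validation. A minimal sketch of this idea:

# bundle scaling and modeling so preparation is fit only on training folds
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# the pipeline scales and then fits the model as one estimator
pipeline = Pipeline([('scaler', MinMaxScaler()), ('model', LogisticRegression())])
# cross_val_score refits the whole pipeline on each training fold, avoiding leakage
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10)
print('Accuracy: %.3f' % (mean(scores) * 100))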

References:

Feature Engineering and Selection, 2019. https://amzn.to/3aydNGf

Feature Engineering for Machine Learning, 2018. https://amzn.to/2XZJNR2

Data Wrangling with R, 2016.

Single-precision floating-point format, Wikipedia. https://en.wikipedia.org/wiki/Single-precision_floating-point_format

Data preparation, Wikipedia. https://en.wikipedia.org/wiki/Data_preparation

Data cleansing, Wikipedia. https://en.wikipedia.org/wiki/Data_cleansing

Data pre-processing, Wikipedia. https://en.wikipedia.org/wiki/Data_pre-processing

Machine Learning Mastery, Jason Brownlee. https://machinelearningmastery.com/





  • Princewill Inyang
  • Mar 31, 2022
