Comprehensive Guide on Feature Engineering and Transformation
A feature is a numeric representation of an aspect of raw data. Feature engineering is the act of extracting features from raw data and transforming them into formats suitable for a machine learning model. Features sit between data and models in the machine learning pipeline. The right features can only be defined in the context of both the model and the data; since data and models are so diverse, it is difficult to generalize the practice of feature engineering across projects.
One of the most important feature engineering techniques is scaling and transformation.
Why do we need scaling?
A machine learning algorithm works on numbers and does not know what those numbers represent.
A weight of 10 grams and a price of 10 dollars represent two completely different things. That is a no-brainer for humans, but as features, a model treats them the same. Suppose we have two features, weight and price, as in the table below.
"Weight" cannot be meaningfully compared with "Price." So the algorithm assumes that since "Weight" > "Price," "Weight" is more important than "Price."
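As a minimal sketch of the problem (the weight and price values below are made-up numbers), the Euclidean distance between two items is dominated by whichever feature has the larger scale:

```python
import numpy as np

# Hypothetical data: weight in grams, price in dollars
a = np.array([300.0, 3.0])   # item A: 300 g, $3
b = np.array([250.0, 9.0])   # item B: 250 g, $9

# Unscaled Euclidean distance: the weight difference (50) dwarfs
# the price difference (6), so "Weight" dominates the comparison.
d_raw = np.linalg.norm(a - b)

# After dividing each feature by its largest value, both features
# contribute on a comparable footing.
scale = np.array([300.0, 9.0])
d_scaled = np.linalg.norm(a / scale - b / scale)

print(d_raw, d_scaled)
```

With raw values the distance is essentially just the weight gap; after scaling, the price difference actually matters.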
When do we scale?
Feature scaling is essential:
- For machine learning algorithms that calculate distances between data points.
- When we want faster convergence; scaling is a must for algorithms like neural networks.
Convergence means: an iterative algorithm is said to converge when, as the iterations proceed, its output gets closer and closer to a minimum error. Equivalently, a machine learning model reaches convergence when it achieves a state during training in which the loss settles to within an error range around its final value.
In other words, a model converges when additional training will not improve it.
- A rule of thumb: scale your features for any algorithm that computes distances or assumes normality, e.g. KNN, K-Means, PCA, and gradient descent based algorithms.
Algorithms That Require Scaling:
1) Gradient-descent-based algorithms: machine learning algorithms like linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique require data to be scaled.
2) Distance-based algorithms: algorithms like KNN, K-Means, and SVM are strongly affected by the range of features, because behind the scenes they use distances between data points to determine their similarity.
Algorithms That Do Not Require Scaling:
Algorithms that do not require normalization/scaling are the ones that rely on rules.
They would not be affected by any monotonic transformations of the variables. Scaling is a monotonic transformation.
Examples of algorithms in this category are all the tree-based algorithms — CART, Random Forests, Gradient Boosted Decision Trees.
These algorithms utilize rules (series of inequalities) and do not require normalization.
Algorithms like Linear Discriminant Analysis (LDA) and Naive Bayes are by design equipped to handle differing feature scales and weight the features accordingly, so performing feature scaling for these algorithms may not have much effect.
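This invariance can be demonstrated with a small sketch (the synthetic data and labels below are assumptions for illustration): a decision tree fit on raw features and one fit on min-max-scaled features should make the same predictions, because its splits are just inequalities on one feature at a time.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.rand(200, 2) * [1000, 5]          # two features on very different scales
y = (X[:, 0] / 1000 + X[:, 1] / 5 > 1).astype(int)

# Fit one tree on raw data and one on min-max-scaled data.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
X_scaled = MinMaxScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Scaling is monotonic, so every split threshold has an equivalent
# threshold in the scaled space and the predictions coincide.
same = (tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all()
print(same)
```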
Major Methods used for transforming & scaling
1) Min Max Scaler
2) Standard Scaler
Other Scalers :
3) Max Abs Scaler
4) Robust Scaler
5) Quantile Transformer Scaler
6) Power Transformer Scaler
7) Unit Vector Scaler
The most common techniques of feature scaling are Normalization and Standardization.
1) Min-Max Scaler (Normalization)
Transforms the data so that features fall within a specific range, e.g. [0, 1]. The transformation is x' = (x - min) / (max - min), which maps the minimum of each feature to 0 and the maximum to 1.
This scaler shrinks the data into the range [-1, 1] if there are negative values. We can set the range to [0, 1], [0, 5], [-1, 1], etc.
Scaling matters in algorithms such as support vector machines (SVM) and k-nearest neighbors (KNN), where distances between the data points are important.
This scaler responds well when the standard deviation is small and the distribution is not Gaussian, but it is sensitive to outliers.
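A minimal sketch using scikit-learn's MinMaxScaler (the sample matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 1000.0],
              [20.0, 3000.0],
              [30.0, 5000.0]])

# Rescale each feature independently to [0, 1]: (x - min) / (max - min)
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

Each column now spans exactly [0, 1] regardless of its original units.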
2) Standard Scaler (Z-score Normalization / Standardization)
The point of standardization is to rescale your observations so that they can be described by a standard normal distribution.
The normal distribution (Gaussian distribution), also known as the bell curve, is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and most observations lie close to the mean.
The z-score comes from statistics and is defined as z = (x - μ) / σ, where μ is the mean and σ is the standard deviation. By subtracting the mean, we shift the distribution left or right by an amount equal to the mean: if a distribution has mean 100 and we subtract 100 from every value, we shift it left by 100 without changing its shape, so the new mean is 0. Dividing by the standard deviation σ then changes the spread of the distribution, so the standardized distribution has standard deviation 1.
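As a sketch, scikit-learn's StandardScaler applies exactly this z-score transformation (the sample values, with mean 100 as in the text, are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[90.0], [100.0], [110.0]])  # mean 100, as in the text

# z = (x - mu) / sigma, computed per feature
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# The standardized column has mean 0 and standard deviation 1.
print(X_std.mean(), X_std.std())
```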
The Big Question – Normalize or Standardize?
- Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful for algorithms that do not assume any particular distribution of the data, such as k-nearest neighbors and neural networks.
- Standardization, on the other hand, can be helpful when the data follows a Gaussian distribution, although this does not have to be strictly true. Also, unlike normalization, standardization has no bounding range, so even if you have outliers in your data they will not be squashed into a fixed interval (they do, however, still influence the estimated mean and standard deviation).
At the end of the day, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule that tells you when to normalize or standardize; you can always fit your model to raw, normalized, and standardized data and compare the performance.
Max Abs Scaler
Scale each feature by its maximum absolute value. This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set is 1.0. It does not shift/center the data and thus does not destroy any sparsity.
On positive-only data, this Scaler behaves similarly to Min Max Scaler and, therefore, also suffers from the presence of significant outliers.
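A short sketch with scikit-learn's MaxAbsScaler (the sample data is made up, with both signs present):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[ 2.0, -100.0],
              [ 4.0,   50.0],
              [-1.0,   25.0]])

# Divide each feature by its maximum absolute value; there is no
# centering step, so zeros (sparsity) are preserved.
X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled)
```

Every value now lies in [-1, 1], with each column's largest-magnitude entry mapped to ±1.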
Robust Scaler
The Robust Scaler is robust to outliers. It scales features using the median and quantiles: subtract the median from all observations, then divide by the interquartile range (IQR), i.e. the difference between the 75th and 25th percentiles:
IQR = 75th percentile - 25th percentile
X_scaled = (X - median(X)) / IQR
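A minimal sketch with scikit-learn's RobustScaler (the data, including the artificial outlier 1000, is made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One extreme outlier (1000) in otherwise small data
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Subtract the median (3) and divide by the IQR (4 - 2 = 2), so the
# outlier barely influences how the other points are scaled.
X_scaled = RobustScaler().fit_transform(X)
print(X_scaled.ravel())
```

The median maps to 0 and the inliers stay in a small range, while a mean/std-based scaler would have been pulled far off by the 1000.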
Power Transformer Scaler
The power transform is a family of parametric, monotonic transformations applied to make data more Gaussian-like. This is useful for modeling issues related to the variability of a variable being unequal across its range (heteroscedasticity), or for situations where normality is desired. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood. Currently, the sklearn implementation of PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
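A short sketch with scikit-learn's PowerTransformer on skewed synthetic data (the exponential sample is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.exponential(scale=2.0, size=(500, 1))  # right-skewed, strictly positive

# Box-Cox requires strictly positive input; use method='yeo-johnson'
# for data with zeros or negatives. Lambda is fit by maximum likelihood.
pt = PowerTransformer(method='box-cox')
X_t = pt.fit_transform(X)

# With the default standardize=True, the output also has zero mean
# and unit variance.
print(X_t.mean(), X_t.std())
```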
GitHub link :
- Hamza Metawea
- Mar 31, 2022