Challenges You Face During Machine Learning
The Main Challenges of Machine Learning
A data scientist tackling any machine learning problem has two main tasks: selecting a learning algorithm and training it on data. Mistakes can happen in both, i.e. we can pick a "bad algorithm" or feed it "bad data". Since we deal with the data before choosing an algorithm, let us start with examples of bad data.
1. Insufficient Training Data:
Consider how a child learns the difference between cars and motorcycles: we just point at them and say "this is a car" and "this is a motorcycle" (probably repeating this a few times). The child can then tell cars and motorcycles apart based on size, color, and shape.
In machine learning it is not so easy: it takes a lot of data to train a model, and a typical machine learning algorithm needs thousands of example records before it works properly. Therefore, we need sufficient data to train our model.
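As a rough illustration of why, here is a minimal sketch (using scikit-learn, which the article itself does not reference, so treat the dataset and model as stand-ins) that trains the same classifier on progressively larger slices of one training set; accuracy typically climbs as the slice grows:

```python
# Minimal sketch: same model, more data -> usually better test accuracy.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for n in (50, 200, 800, len(X_train)):   # progressively larger training slices
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:n], y_train[:n])
    print(f"trained on {n:4d} examples -> test accuracy {model.score(X_test, y_test):.2f}")
```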
2. Non-representative Training Data:
To evaluate a model we test it on new data. If the model gives bad results on the test data, it often means the training data was not representative of the cases we want to generalize to. If the training set is simply too small, the resulting error is called sampling noise; if the sampling method itself is flawed, even a very large sample can be non-representative, which is called sampling bias.
To understand sampling bias, take an example. A famous case occurred during the 1936 US presidential election (Landon against Roosevelt): the Literary Digest conducted a very large poll by mailing around ten million people, of whom 2.4 million answered, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes.
The problem was the sampling method: to get addresses for the poll, the Literary Digest used magazine subscriber lists, club membership lists, and the like, which skewed toward wealthier individuals who were more likely to vote Republican (hence Landon). On top of that, non-response bias came into the picture, since only about 25% of the people contacted answered the poll.
To build a model that generalizes well, the training dataset must be representative of the population we care about.
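One standard safeguard, sketched below with scikit-learn (my assumption, not something the article prescribes), is stratified sampling: splitting the data so that each subset preserves the class proportions of the whole.

```python
# Minimal sketch of stratified splitting to avoid sampling bias in a split.
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90% class 0, 10% class 1.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)   # dummy feature, just for the demo

# stratify=y preserves the 90/10 class ratio in both train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("class-1 share in train:", y_train.mean())   # ~0.10
print("class-1 share in test: ", y_test.mean())    # ~0.10
```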
3. Poor-Quality Data:
If the data is full of outliers and noise, or measurements were not scaled properly, we should not jump straight into training a model, because it will not perform well. First we clean the data: handle missing values, deal with outliers, check the dispersion of the data, and so on; only then do we move on to training. This cleaning step is usually the most time-consuming part of a data scientist's job.
Therefore, the quality of the data is very important for getting accurate results.
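Below is a minimal cleaning sketch in pandas; the column names, values, and thresholds are made up for illustration, not taken from the article. It fills missing values with the median and drops rows outside the common 1.5×IQR outlier fence:

```python
# Minimal sketch: fill missing values, then drop outliers with the IQR rule.
import pandas as pd

df = pd.DataFrame({
    "salary": [30000, 32000, None, 31000, 9000000],  # None = missing, 9M = outlier
    "age":    [25, 30, 28, None, 27],
})

# Fill missing values with each column's median.
df["salary"] = df["salary"].fillna(df["salary"].median())
df["age"] = df["age"].fillna(df["age"].median())

# Keep only salaries inside the 1.5 * IQR fence (a simple outlier rule).
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)   # the 9,000,000 row is gone, missing values are filled
```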
4. Irrelevant Features:
This is the classic case of "garbage in, garbage out". Even if the training data contains enough relevant features, too many irrelevant features will keep the model from performing well. A data scientist should select good features based on analysis of the data, combined with domain expert knowledge; coming up with a good set of features is often called feature engineering, and it involves the following:
· Feature selection: selecting the most useful features among the existing features.
· Feature extraction: combining existing features into more useful ones, for example with a dimensionality reduction algorithm.
· Feature creation: gathering new data or deriving new features from existing ones. For example, if we have the features "years of experience" and "age when the person started their job", we can derive the employee's age as:
Age = years of experience + age when the person started the job
Therefore, informative features are very helpful for our model.
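As a concrete version of the derivation above, here is a minimal pandas sketch (the column names are illustrative, not from the article):

```python
# Minimal sketch of the article's feature-creation example.
import pandas as pd

df = pd.DataFrame({
    "years_of_experience": [5, 12, 3],
    "age_at_first_job":    [22, 25, 21],   # age when the person started working
})

# Age = years of experience + age when the person started the job.
df["age"] = df["years_of_experience"] + df["age_at_first_job"]
print(df)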
That covers "bad data"; now it is time to deal with "bad algorithms", again through examples.
5. Over-fitting the Training Data:
Say you visit a restaurant in a new city, look at the menu to order some food, and find that the bill is far too high. You might conclude that "all the restaurants in this city are too costly and unaffordable". Over-generalizing like this is something we humans do all too often, and machines can fall into the same trap; in machine learning we call it over-fitting.
Over-fitting happens when the model is too complex relative to the amount and noisiness of the training data (and, as mentioned above, real data has noise). Some solutions to this problem, with a regularization sketch after the list, are:
· Reduce noise in the training data (remove outliers, apply proper scaling, etc.)
· Gather more training data
· Simplify the model by selecting fewer parameters (e.g., avoid a high-degree polynomial in a linear model), by keeping only the most relevant features, or by regularizing the model
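Here is a minimal sketch of that last point (scikit-learn and the synthetic data are my assumptions, not the article's): the same degree-15 polynomial features are fit once without regularization and once with Ridge, which penalizes large coefficients and usually tames the over-fit:

```python
# Minimal sketch: regularization (Ridge) vs. an unconstrained polynomial fit.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(0, 1, 30)      # quadratic trend + noise

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_test = 0.5 * X_test.ravel() ** 2                    # noise-free ground truth

# Degree-15 polynomial, no regularization: free to chase the noise.
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
# Same features, but Ridge shrinks large coefficients (regularization).
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(X, y)

print("unregularized test MSE:", np.mean((overfit.predict(X_test) - y_test) ** 2))
print("ridge test MSE:        ", np.mean((ridge.predict(X_test) - y_test) ** 2))
```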
6. Under-fitting the Training Data:
It is the opposite of over-fitting: the model is too simple to learn the underlying structure of the data. For instance, if we fit a linear model to data that is not linearly distributed, the model will under-fit.
To fix this problem we can:
· Feed better features to the learning algorithm
· Select a more powerful model, with more parameters
· Reduce the constraints on the model, for example by lowering the regularization strength
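A minimal sketch of the second fix (scikit-learn and the synthetic parabola are illustrative assumptions): a plain line under-fits curved data, while adding a squared feature gives the model enough capacity:

```python
# Minimal sketch: fixing an under-fit linear model by adding model capacity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.3, 100)   # a parabola, not a line

line = LinearRegression().fit(X, y)                                     # too simple
curve = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

print("R^2 of straight line:     ", line.score(X, y))    # near 0: under-fit
print("R^2 with squared feature: ", curve.score(X, y))   # close to 1
```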
- Sana Khan
- Mar 28, 2022