Attacking any ML problem

Rajat Jain
2 min read · Nov 28, 2016

A supervised ML problem involves learning to classify data points by finding well-defined decision boundaries.

The first step in any problem (whether on Kaggle or Analytics Vidhya) is to prepare the data.

The data we get might not be clean or organized (e.g., in a CSV or TSV format), so first we need to import it properly.
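A minimal sketch of the import step with pandas (the file names here are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical file names; adjust to your competition's data.
train = pd.read_csv("train.csv")           # comma-separated
test = pd.read_csv("test.tsv", sep="\t")   # tab-separated

print(train.shape)
print(train.dtypes)   # quick look at column types
print(train.head())
```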

Next, it might contain columns with text like "8'o clock", dates like "28th Nov, 2016", or an ID column like "ID 20160712 O", which can be difficult for ML algorithms to comprehend. So we need to do cleaning and formatting. (Python regex and other string tools come in handy.)
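For instance, the messy values above could be cleaned with pandas string methods and a couple of regexes (a sketch on toy data mirroring those examples):

```python
import pandas as pd

# Toy data mirroring the messy values mentioned above.
df = pd.DataFrame({
    "time": ["8'o clock", "10'o clock"],
    "date": ["28th Nov, 2016", "12th Jul, 2016"],
    "id":   ["ID 20160712 O", "ID 20161128 A"],
})

# Pull the hour out of strings like "8'o clock".
df["hour"] = df["time"].str.extract(r"(\d+)", expand=False).astype(int)

# Strip ordinal suffixes ("28th" -> "28") so the date parses cleanly.
df["date"] = pd.to_datetime(
    df["date"].str.replace(r"(\d+)(st|nd|rd|th)", r"\1", regex=True),
    format="%d %b, %Y",
)

# Keep only the numeric part of the ID column.
df["id_num"] = df["id"].str.extract(r"(\d+)", expand=False).astype(int)
```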

Next, we might see some missing values, so imputing them is our concern. For imputation, we can analyse the data or use domain knowledge to infer which variables the missing value might depend on.
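A minimal imputation sketch on toy data; the median and mode here are just common defaults, not the only choices:

```python
import pandas as pd

# Toy frame with missing values; in practice, inspect df.isna().sum() first.
df = pd.DataFrame({
    "age":  [34, None, 52, 41],
    "city": ["Delhi", "Mumbai", None, "Delhi"],
})

# Numeric column: fill with the median (robust to outliers).
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```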

After imputing missing values, it's time to search for outliers. Outliers come in various forms; the most common is extreme values. (In Python, boxplots and scatter plots using seaborn are useful for spotting them.) The most likely treatments are to take the log of the column, or to simply remove the offending rows if enough data remains.
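A short sketch of spotting and treating an extreme value (toy data; the 1.5 × IQR fence is one conventional rule of thumb):

```python
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"income": [30000, 42000, 38000, 55000, 1200000]})

# Visual check: the extreme value stands out immediately.
sns.boxplot(x=df["income"])

# Option 1: compress the scale with a log transform.
df["log_income"] = np.log1p(df["income"])

# Option 2: drop rows beyond the 1.5 * IQR fences.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```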

Now, feature engineering is the most important part of any data science problem you might encounter. It involves creating new features from existing ones and extracting new information from them. A well-performing model almost always relies on features derived from the raw columns. For example, a column of dates might not mean much to the model on its own, but if we extract the weekday, month, or week number from it (depending on the problem), the data becomes more informative and yields a better classifier. Feature engineering sometimes requires domain knowledge to create new features; often, even a simple multiplication or division of two columns makes a useful feature.
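A sketch of both ideas on toy data, decomposing a date column and building simple arithmetic features (the column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "date":  pd.to_datetime(["2016-11-28", "2016-07-12"]),
    "price": [200.0, 150.0],
    "qty":   [4, 3],
})

# Decompose the raw date into features a model can actually use.
df["weekday"] = df["date"].dt.dayofweek            # 0 = Monday
df["month"] = df["date"].dt.month
df["week_no"] = df["date"].dt.isocalendar().week

# Simple arithmetic combinations often make surprisingly good features.
df["unit_price"] = df["price"] / df["qty"]
df["price_x_qty"] = df["price"] * df["qty"]
```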

Now comes the main step: training a machine learning classifier. The easiest way is to train at least 10 classifiers in one go; this shows which classes of classifiers perform well, and by how much, compared to the others. Also check the training accuracy of each one, to be sure we are not overfitting the training data. Generally, we should start with a limited number of predictor variables, keep adding more, and stop before the model begins to overfit. We should also cross-validate to get realistic estimates of accuracy and of the error function.
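A minimal sketch of this "train many classifiers" loop, using scikit-learn's cross_val_score on synthetic data (in practice you would plug in your own feature matrix and a longer model list):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A small zoo of classifiers; extend to ten or more in practice.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree":   DecisionTreeClassifier(),
    "knn":    KNeighborsClassifier(),
    "rf":     RandomForestClassifier(n_estimators=100),
    "gbm":    GradientBoostingClassifier(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```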

The best-performing models are chosen and ensembled, usually with voting classifiers or similar. GridSearchCV is also an important tool for tuning the model and finding the best-performing hyperparameters.
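A sketch of ensembling plus hyperparameter search with scikit-learn's VotingClassifier and GridSearchCV (synthetic data; the grid values are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Ensemble two of the better performers with soft (probability) voting.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier()),
    ],
    voting="soft",
)

# Tune sub-model hyperparameters through the ensemble via name prefixes.
param_grid = {
    "lr__C": [0.1, 1.0, 10.0],
    "rf__n_estimators": [100, 300],
}
search = GridSearchCV(ensemble, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```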

Finally, submit!


Rajat Jain

Tech Blogger. Addicted to OSS, PHP & Performance. Born & brought up in India. Travelled to 5 countries. A table-tennis player.