An End-to-End Machine Learning Project
The main steps to go through when building a machine learning project are:
- Look at the big picture
- Get the data
- Discover and visualize the data
- Prepare the data for the ML algorithm
- Select a model and train it
- Fine-tune the model
- Present the solution
- Launch, monitor, and maintain the system
Frame the problem
The end goal is probably not to build a model. You should be clear about what the project is supposed to achieve: understand how the model will be used and make sure you know the benefits expected from it. This is important because it determines which algorithms will be used in the project. Often the output of the model is fed into another ML system along with other signals.
A signal is a piece of information fed into an ML system.
Pipelines
A sequence of data processing components is called a pipeline. Pipelines are very common in ML systems because large amounts of data need to be manipulated. The components typically run asynchronously: each component pulls in a large amount of data, processes it, and stores the result, and another component in the pipeline then uses that data. This division makes it possible to assign teams to specific components. More importantly, if a component breaks down, the other components still work. However, this is also a drawback: without proper monitoring, the broken component can go unnoticed.
Taking a look at the data
-> Pandas has a describe() function that gives us the mean, standard deviation, and percentiles of the data. It is important to note that this function ignores null values.
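A minimal sketch, assuming the data has been loaded into a pandas DataFrame df from a hypothetical CSV file:

```python
import pandas as pd

# Hypothetical CSV path; substitute your own dataset.
df = pd.read_csv("data.csv")

# count, mean, std, min, 25%/50%/75% percentiles, and max for each numeric column.
# Null values are ignored in these computations.
print(df.describe())
```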
-> When you split the data into a training set and a test set, make sure you never look at the test set until the whole process is over.
-> Stratified sampling ensures that the classes in the training set are representative of the overall population. For instance, if the data contains 50 percent boys, then a stratified training set of a thousand samples will have 500 boys in it.
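A minimal sketch of a stratified split with scikit-learn, assuming the DataFrame df has a hypothetical categorical column gender used as the stratification key:

```python
from sklearn.model_selection import train_test_split

# Stratify on the 'gender' column so the train and test sets
# preserve its class proportions (e.g., 50% boys -> 50% in each split).
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["gender"], random_state=42
)
```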
Scikit-Learn Design
Estimators
Any object that can estimate parameters based on a dataset is called an estimator. The estimation itself is performed by the fit() method, which takes only a dataset as a parameter (or two for supervised learning). Any other parameter needed to guide the estimation process is considered a hyperparameter, and it must be set as an instance variable.
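For example, scikit-learn's SimpleImputer is an estimator; a minimal sketch, assuming a numeric DataFrame df_num with missing values:

```python
from sklearn.impute import SimpleImputer

# 'strategy' is a hyperparameter, passed to the constructor and stored as an instance variable.
imputer = SimpleImputer(strategy="median")

# fit() estimates the parameters (here, each column's median) from the dataset.
imputer.fit(df_num)
```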
Transformers
Some estimators can also transform a dataset; these are called transformers. The transformation is done by the transform() method, which takes the dataset as a parameter. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform().
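Continuing the hypothetical imputer sketch above, the same object also acts as a transformer:

```python
# transform() applies the learned medians to fill in missing values.
X = imputer.transform(df_num)

# fit_transform() is equivalent to calling fit() and then transform() in one step.
X = imputer.fit_transform(df_num)
```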
Predictors
Some estimators are also capable of making predictions based on a dataset; these are called predictors. Predictors have a predict() method that takes a dataset of new instances and returns the corresponding predictions. They also have a score() method that measures the quality of the predictions on a given test set.
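A minimal sketch with LinearRegression, assuming training arrays X_train, y_train and test arrays X_test, y_test already exist:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)           # estimate the model parameters

predictions = model.predict(X_test)   # predictions for new instances
r2 = model.score(X_test, y_test)      # R^2 score measuring prediction quality
```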
Inspection
It is also possible to take a deeper look into an estimator, since its hyperparameters are directly accessible via public instance variables.
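For instance, with the imputer sketched earlier, both the hyperparameter and the learned parameters can be inspected directly:

```python
print(imputer.strategy)      # hyperparameter, accessible as a public instance variable
print(imputer.statistics_)   # learned parameters (trailing underscore by convention)
```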
Nonproliferation of classes
Datasets are represented as NumPy arrays or SciPy sparse matrices, whereas hyperparameters are just Python strings or numbers.
Composition
Existing building blocks are reused as much as possible. For instance, a Pipeline estimator is built from a sequence of transformers followed by a final estimator.
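A brief sketch of such composition, again assuming a numeric DataFrame df_num:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline chains transformers; the final step can be any estimator.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
df_num_prepared = num_pipeline.fit_transform(df_num)
```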
Sensible defaults
The default values for most parameters are sensible values. This makes it easy to create a baseline working system.
Feature scaling
There are two main types:
- min-max scaling, used for normalisation
- standardisation
Keep in mind that standardisation is less affected by outliers.
Standardization using scikit-learn (class name: StandardScaler)
Variables measured on different scales do not contribute equally to model fitting and the learned function, and can end up creating a bias. To deal with this potential problem, feature-wise standardization (μ=0 and σ=1) is usually applied prior to model fitting.
Min-Max scaling using scikit-learn (class name: MinMaxScaler)
Another way to normalize the input features/variables (apart from standardization, which scales the features so that they have μ=0 and σ=1) is the Min-Max scaler. All features are transformed into the range [0, 1], meaning that the minimum and maximum value of each feature/variable will be 0 and 1, respectively.
Fine-tune the model
A. Grid search CV
Grid search is the process of performing hyperparameter tuning to determine the optimal values for a given model. The performance of a model depends significantly on the values of its hyperparameters. There is no way to know the best values in advance, so ideally we would try all possible values to find the optimal ones. Doing this manually could take a considerable amount of time and resources, so we use GridSearchCV to automate the tuning of hyperparameters.
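A hedged sketch using a RandomForestRegressor (chosen purely for illustration), assuming prepared training data X_train, y_train with at least six features:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_features": [2, 4, 6],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```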
B. Randomised Search
This is used when the hyperparameter search space is large. Instead of trying out all combinations, it evaluates a fixed number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two benefits (see the sketch after this list):
- If you run it for a thousand iterations, it will explore a thousand different values for each hyperparameter.
- You can control the computing budget simply by setting the number of iterations.
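A hedged sketch mirroring the grid-search example, again assuming X_train and y_train:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Each iteration samples a random value for every hyperparameter from these distributions.
param_distributions = {
    "n_estimators": randint(10, 200),
    "max_features": randint(2, 8),
}
rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=20,                         # computing budget: number of combinations to try
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X_train, y_train)
```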
C. Ensemble methods
Another way to fine-tune the system is to combine the models that perform best.
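One simple way to combine models is a voting ensemble; a minimal sketch for regression, assuming X_train and y_train:

```python
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Averages the predictions of the individual regressors.
ensemble = VotingRegressor([
    ("lin", LinearRegression()),
    ("rf", RandomForestRegressor(random_state=42)),
])
ensemble.fit(X_train, y_train)
```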