Introduction to Machine Learning
Machine learning is the science of programming computers so that they can learn from data.
Why Use Machine Learning?
This is best shown with an example. Let us try building a spam filter, like the ones used in almost all email services these days. Building one with traditional programming techniques involves the following steps:
- Look at what spam typically looks like and note the words and phrases that are common in spam mails, which gives us a set of patterns
- Write a detection rule for each of the patterns noticed
- Test the program, and repeat steps 1 and 2 until it is good enough
This is a very basic approach to building such a system. Since the problem is not trivial, the rules would end up as a long, complicated list, which makes the program hard to maintain. In contrast, machine learning techniques automatically learn which words and phrases are good predictors of spam by detecting their unusual frequency in the spam examples. The resulting program is much shorter and easier to maintain, hence better than our initial idea.
The rule-based approach has another problem: if spammers notice that certain words get their mails flagged, they might replace those words with others of similar meaning. For instance, if they notice the word “four” is flagged, they might start writing the number “4” instead. They can also change the formatting to bypass the filter. A spam filter based on machine learning techniques, however, automatically notices such unusual new patterns and starts flagging them without any manual intervention. A minimal sketch of such a learned filter follows.
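Here is a hedged sketch of this idea, assuming scikit-learn is available; the four tiny emails and their labels are invented purely for illustration:

```python
# A minimal sketch of a learned spam filter: the model learns which words
# predict spam from examples, instead of us hand-coding detection rules.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win money now",         # spam
    "limited offer for you", # spam
    "meeting at noon",       # ham
    "project status update", # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Turn each email into word-frequency counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["win a limited offer"])))  # likely [1]
```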
Machine learning can also be used for problems that have no known algorithmic solution, for instance speech recognition. It can also help humans learn, by inspecting what the model has learnt. Finally, applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent; this is called data mining.
Types of Machine Learning Systems
There are many types of machine learning systems, which can be classified according to the following criteria:
- Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and reinforcement learning)
- Whether or not they can learn incrementally, on the fly (online vs batch learning)
- Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model (instance-based vs model-based learning)
Supervised and Unsupervised Learning
Machine learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories:
a. Supervised Learning
The training data we feed to the algorithm includes the desired solutions, called labels. A typical supervised learning task is classification (the spam filter is a good example); predicting target numeric values (regression) also falls in this category. Some regression algorithms can be used for classification as well: for instance, logistic regression outputs a value that corresponds to the probability of belonging to a given class (sketched after the list below). Some important supervised learning algorithms are:
- K-Nearest Neighbours
- Linear and Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks
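A hedged sketch of supervised learning, assuming scikit-learn and its bundled iris dataset; note how predict_proba exposes the class-probability output mentioned above:

```python
# Supervised learning: the labels y are the "desired solutions" fed to the model
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Probability of belonging to each class, for one flower measurement
print(model.predict_proba(X[:1]))
```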
b. Unsupervised Learning
The training data is unlabelled. Some important unsupervised learning algorithms:
- Clustering:
  - K-Means
  - DBSCAN
  - Hierarchical Cluster Analysis (HCA)
- Anomaly Detection and Novelty Detection:
  - One-class SVM
  - Isolation Forest
- Visualization and Dimensionality Reduction:
  - Principal Component Analysis (PCA)
  - Kernel PCA
  - Locally-Linear Embedding (LLE)
  - t-Distributed Stochastic Neighbour Embedding (t-SNE)
- Association Rule Learning:
  - Apriori
  - Eclat
Dimensionality reduction is used to simplify the data without losing too much information. One way to do this is to merge several correlated features into one; this is called feature extraction.
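A minimal sketch of feature extraction with PCA, assuming scikit-learn and NumPy; the two correlated features (a size in square metres and the same size in square feet) are invented for illustration:

```python
# PCA merges strongly correlated features into fewer components
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
size_m2 = rng.uniform(30, 200, size=100)             # e.g. apartment size
size_ft2 = size_m2 * 10.76 + rng.normal(0, 1, 100)   # nearly the same information
X = np.column_stack([size_m2, size_ft2])

pca = PCA(n_components=1)       # merge the correlated pair into one feature
X_reduced = pca.fit_transform(X)

print(pca.explained_variance_ratio_)  # close to 1.0: little information lost
```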
c. Semi-Supervised Learning
Some algorithms can deal with a lot of unlabelled data and a little labelled data; the photo-hosting service Google Photos is a good example. Most semi-supervised algorithms are combinations of unsupervised and supervised algorithms, such as deep belief networks (DBNs), which are based on restricted Boltzmann machines (RBMs).
d. Reinforcement Learning
The learning system, called an agent, observes the environment, performs actions, and gets rewards in return. It must learn by itself the best strategy, called a policy, to get the most reward over time. The policy determines what action the agent should choose in a given situation. This is used by robots, for example. DeepMind's AlphaGo used this kind of learning, analyzing millions of Go games, to beat the world champion (learning was switched off during the actual game; AlphaGo was just applying the policy it had learned).
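AlphaGo itself is far beyond these notes, but the agent/reward/policy loop can be illustrated with a toy tabular Q-learning sketch; the five-state corridor environment and all constants below are illustrative assumptions, not anything from the source:

```python
# Toy reinforcement learning: the agent learns a policy ("always go right")
# purely from rewards, via the standard tabular Q-learning update.
import random

N_STATES, GOAL = 5, 4                       # states 0..4, reward only at state 4
q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-values per action: 0=left, 1=right

alpha, gamma, epsilon = 0.1, 0.9, 0.1
for _ in range(2000):
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly follow the current policy, sometimes explore
        a = random.randrange(2) if random.random() < epsilon else q[s].index(max(q[s]))
        s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
        r = 1.0 if s2 == GOAL else 0.0      # reward from the environment
        # Q-learning update rule
        q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
        s = s2

# The learned policy: best action per state (should be "right" everywhere)
print([["left", "right"][row.index(max(row))] for row in q[:GOAL]])
```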
Batch and Online Learning
This criterion defines whether a system can learn incrementally from a stream of incoming data.
a. Batch Learning
The system is incapable of learning incrementally: it must be trained using all the available data. This generally takes a lot of time and resources, so it is typically done offline; hence it is also called offline learning. If the system has to know about new data, it has to be retrained from scratch on the full dataset (new data as well as old), and the old system then has to be replaced with the new one. Training on a full set of data can take many hours, so typically the system is retrained only every 24 hours or even just weekly. It also requires a lot of computing resources: if you have a lot of data and you automate daily retraining, it will cost a lot of money.
b. Online Learning
The system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn on the fly. It is a good option for systems that receive data as a continuous flow and need to adapt rapidly, for example stock price prediction. It is also a good option if computing resources are limited: once the system has learned from new instances, it can discard them, which saves a lot of resources. Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning, which despite the name is usually done offline).
An important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If the learning rate is high, the system will rapidly learn from new data but will also forget the old data equally fast. Conversely, with a low learning rate the system learns more slowly but is also less sensitive to noise in the new data, such as outliers. A big challenge with online learning is feeding the system bad data: examples include a malfunctioning sensor on a robot, or someone spamming a search engine to rank highly in search results. To tackle this, you may need to switch learning off when a performance drop is detected, and you should monitor the incoming data and react to abnormal data. A sketch of incremental training follows.
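A hedged sketch of online learning, assuming scikit-learn's SGDRegressor; partial_fit updates the model one mini-batch at a time, and the eta0 parameter plays the role of the learning rate discussed above (the streaming data here is simulated):

```python
# Online learning: train incrementally on mini-batches, then discard them
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01)  # eta0 = learning rate

rng = np.random.default_rng(0)
for _ in range(100):                       # pretend each loop is newly arrived data
    X_batch = rng.uniform(0, 10, size=(32, 1))
    y_batch = 3.0 * X_batch.ravel() + rng.normal(0, 0.5, size=32)
    model.partial_fit(X_batch, y_batch)    # one fast, cheap learning step

print(model.coef_, model.intercept_)       # coef_ should approach [3.0]
```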
Instance-Based vs Model-Based Learning
a. Instance-based learning
The system learns the examples by heart, then generalises to new cases by comparing them to the learned examples using a similarity measure.
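A minimal sketch of instance-based learning, assuming scikit-learn; k-nearest neighbours stores the training examples and classifies a new point by its distance (the similarity measure) to the stored examples:

```python
# Instance-based learning: fit() essentially memorizes the training examples
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=3)  # compare against the 3 closest examples
knn.fit(X, y)                              # "learning by heart"

print(knn.predict(X[:1]))                  # predicted by looking up similar flowers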
b. Model-based learning
Another way to generalise from a set of examples is to build a model of these examples and then use that model to make predictions. This is called model-based learning.
Utility or fitness function: measures how good the model is.
Cost function: measures how bad the model is.
For linear regression problems, people typically use a cost function that measures the distance between the linear model's predictions and the training examples: the Mean Squared Error (MSE). The objective is to minimize that distance.
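A hedged sketch of model-based learning, assuming scikit-learn and NumPy: fit a linear model, then compute the MSE cost on the (invented) training data:

```python
# Model-based learning: fitting chooses parameters that minimize the MSE cost
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, size=50)  # noisy linear toy data

model = LinearRegression()
model.fit(X, y)

# MSE = mean of squared distances between predictions and true values
print(mean_squared_error(y, model.predict(X)))
```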
Challenges of Machine Learning
The two main things that can go wrong are bad data and bad algorithms.
Bad Data
a. Insufficient quantity of data:
Machine learning is not yet advanced enough to learn from only a few examples: it takes a lot of data for most algorithms to work properly.
b. Poor quality data:
Obviously, if your data is full of errors, outliers, and noise, it is harder for the system to detect the underlying patterns, and this hurts the system's performance.
c. Irrelevant features:
Your system will only be capable of learning if the training data contains enough relevant features. A critical step that determines the success of a model is coming up with a good set of features to train on. This process, called feature engineering, involves:
- Feature selection: choosing the most useful features to train on (sketched after this list)
- Feature extraction: combining existing features to produce a more useful one (dimensionality reduction can help here)
- Creating new features by gathering new data
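A hedged sketch of feature selection, assuming scikit-learn: keep only the features with the strongest statistical relationship to the labels:

```python
# Feature selection: keep the k features most useful for predicting the labels
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 most useful features
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the features that were kept
```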
Bad Algorithms
a. Overfitting the data
This means the model performs well on the training data but does not generalise well. It happens when the model is too complex relative to the amount and noisiness of the data. Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
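A hedged sketch of regularization, assuming scikit-learn: Ridge regression constrains the model's weights, making it simpler than plain linear regression on the same (invented) noisy data:

```python
# Regularization: Ridge shrinks the weights, reducing the risk of overfitting
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(10, 5))          # few samples, several features
y = X[:, 0] + rng.normal(0, 0.1, size=10)    # only one feature truly matters

plain = LinearRegression().fit(X, y)
regularized = Ridge(alpha=1.0).fit(X, y)     # alpha controls the constraint strength

print(np.abs(plain.coef_).sum())             # typically larger, freer weights
print(np.abs(regularized.coef_).sum())       # typically smaller: a simpler model
```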
b. Underfitting the data
This happens when the model is too simple to learn the underlying structure of the data. The main solutions are: selecting a more powerful model, feeding better features to the learning algorithm, or reducing the constraints on the model (e.g., lowering the regularization strength).
Testing and Validating
The only way to know how well a model will perform is to try it out on new cases. To do this, the data is split into a training set and a test set. The error rate on the test set is called the generalisation error. If the training error is low but the generalisation error is high, the model is overfitting the training data.
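A hedged sketch of this split, assuming scikit-learn: hold out a test set and compare training accuracy against test accuracy to estimate the generalisation error:

```python
# Estimating generalisation error with a held-out test set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)  # 1 - test_acc estimates generalisation error
print(train_acc, test_acc)              # a large gap would suggest overfitting
```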