Introduction to Machine Learning

Machine learning is the science of programming computers so that they can learn from data.

Why Use Machine Learning?

This is best shown with an example. Let us try building a spam filter, like the ones used in almost all email services these days. The steps to do so involve:

  1. We look for words that are common in spam emails, which helps us gather a pattern
  2. We write a detection algorithm for each of the patterns noticed
  3. We then test the program, and repeat steps 1 and 2 until it becomes good enough.
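The manual, rule-based approach above can be sketched as follows. The keyword list and scoring threshold here are illustrative assumptions, not a real filter:

```python
# A hand-written, rule-based spam detector: every pattern we notice
# becomes another hard-coded rule, so the rule list keeps growing.
SPAM_KEYWORDS = {"free", "winner", "credit", "offer", "4u"}  # assumed examples

def looks_like_spam(message: str, threshold: int = 2) -> bool:
    """Flag a message as spam if it contains enough known spam keywords."""
    words = message.lower().split()
    hits = sum(1 for w in words if w in SPAM_KEYWORDS)
    return hits >= threshold

print(looks_like_spam("free credit offer just for you"))   # spammy wording
print(looks_like_spam("meeting notes attached, see you tomorrow"))
```

Every new spamming trick would force us to add more keywords and rules by hand, which is exactly why the rule list grows so long.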

This is a very basic approach to building such a system. Since the problem is not trivial, the rules would become a long list, which makes the program hard to maintain. Machine learning techniques, however, automatically learn which words and phrases are good predictors of whether a mail is spam, by detecting words that appear with unusual frequency in the spam examples. The resulting program is much shorter and easier to maintain, and hence better than our initial idea.
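The learning approach can be sketched with a tiny word-frequency learner. The corpus below is an illustrative assumption, not real data:

```python
from collections import Counter

# Instead of hand-writing rules, count how often each word appears in
# spam vs. ham examples and treat words that are more frequent in spam
# as learnt predictors of spam.
spam = ["win a free prize now", "free credit offer now"]
ham = ["lunch at noon tomorrow", "the report is attached"]

spam_counts = Counter(w for msg in spam for w in msg.split())
ham_counts = Counter(w for msg in ham for w in msg.split())

# Words that appear more often in spam than in ham are good predictors.
predictors = {w for w, c in spam_counts.items() if c > ham_counts[w]}

def score(message: str) -> int:
    """Count how many learnt spam predictors the message contains."""
    return sum(1 for w in message.split() if w in predictors)

print(score("claim your free prize"))     # contains learnt spam words
print(score("see the attached report"))
```

When spammers change their vocabulary, we only need to retrain on fresh examples; no rules have to be rewritten by hand.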

       However, this approach also has a problem: if spammers notice that certain words get their mail flagged, they might replace those words with others of similar meaning. For instance, if they notice the word “four” is flagged, they might use the number “4”. They can also change the format of the mail in order to bypass the spam filter. Spam filters based on machine learning techniques, however, automatically notice such unusual occurrences and mark the mails as spam.

       Machine learning can also be used for problems that have no known algorithm, for instance speech recognition. It can also help humans learn, by inspecting what the model has learnt. Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately evident. This is called Data Mining.

Types of Machine Learning Systems

There are many types of machine learning systems, which can be classified as follows:

Supervised and Unsupervised Learning

Machine learning systems can be classified according to the amount and type of supervision they get during training. There are 4 major categories: supervised, unsupervised, semi-supervised and reinforcement learning.

In supervised learning, the training data is labelled. Some important supervised learning algorithms:

K-nearest neighbours
Linear & Logistic regression
SVMs (Support Vector Machines)
Decision trees & Random forests
Neural networks

In unsupervised learning, the training data is unlabelled. Some important unsupervised learning algorithms:

Clustering:

K-means
DBSCAN
Hierarchical Cluster Analysis (HCA)

Anomaly detection and novelty detection:

One-class SVM
Isolation forest

Visualisation and dimensionality reduction:

Principal component analysis (PCA)
Kernel PCA
Locally-Linear Embedding
t-distributed stochastic neighbour embedding (t-SNE)

Association rule learning:

Apriori
Eclat

Dimensionality reduction is used to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. This is called Feature Extraction.
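Feature extraction by PCA can be sketched with NumPy. The toy data below (two strongly correlated features) is an illustrative assumption:

```python
import numpy as np

# Toy data: two highly correlated features, e.g. a height measured
# in centimetres and the same height in inches plus a little noise.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=100)
data = np.column_stack([height_cm,
                        height_cm / 2.54 + rng.normal(0, 0.1, 100)])

# PCA by hand: centre the data, then project it onto the top
# eigenvector of the covariance matrix, merging the two correlated
# features into a single extracted feature.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top_component = eigvecs[:, np.argmax(eigvals)]
reduced = centred @ top_component   # shape (100,): one feature, not two

print(data.shape, "->", reduced.shape)
```

Because the two features are almost perfectly correlated, nearly all the variance survives in the single extracted feature, which is the point of feature extraction.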

Semi-supervised learning algorithms can deal with a lot of unlabelled data and a little labelled data. The photo-hosting service Google Photos is a good example. Most semi-supervised algorithms are combinations of unsupervised and supervised algorithms, such as deep belief networks (DBNs), which are based on restricted Boltzmann machines (RBMs).

In reinforcement learning, the learning system, called an agent, observes the environment, performs actions and gets rewards in return. It must learn by itself the best strategy, called a policy, to get the most reward over time; the policy determines what action the agent should choose in a given situation. Example: this is used by robots. For instance, DeepMind's AlphaGo used this kind of learning, analysing millions of Go games, to beat the world champion (learning was turned off during the game itself).
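A minimal Q-learning sketch of the agent/policy/reward loop, assuming a made-up 5-cell corridor environment and illustrative hyperparameters:

```python
import random

# The agent lives on a 5-cell corridor and earns a reward of +1 only
# when it reaches the rightmost cell. It learns a policy (which action
# to take in each state) purely from the rewards it observes.
N_STATES, ACTIONS = 5, [-1, +1]          # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration

random.seed(0)
for _ in range(500):                     # episodes
    state = 0
    while state != N_STATES - 1:
        if random.random() < epsilon:    # explore occasionally
            action = random.choice(ACTIONS)
        else:                            # otherwise exploit (random tie-break)
            action = max(ACTIONS, key=lambda a: (Q[(state, a)], random.random()))
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learnt policy should prefer moving right in every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The agent is never told "go right"; the preference emerges because rightward actions accumulate higher estimated reward over time.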

Batch and Online Learning

This criterion defines whether a model can learn incrementally from a stream of incoming data.
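A minimal sketch of online (incremental) learning, assuming a made-up data stream and learning rate: the model is updated one example at a time, and each example can be discarded after use.

```python
# Online learning: a single model parameter is updated with one
# stochastic-gradient step per incoming example, so the model keeps
# learning from a stream without retraining on the whole dataset.
def sgd_stream(stream, lr=0.1):
    w = 0.0                           # model: y_hat = w * x
    for x, y in stream:
        error = w * x - y             # prediction error on this example
        w -= lr * error * x           # one gradient step, then move on
    return w

# Stream generated by the true relation y = 3x (noise-free, for clarity).
stream = [(x, 3 * x) for x in [1, 2, 0.5, 1.5, 1]] * 20
print(round(sgd_stream(stream), 2))   # → 3.0
```

A batch learner would instead refit on the full accumulated dataset each time, which is costlier but can be more stable.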

Instance-based vs Model-based Learning

a. Instance-based learning

       The system learns the examples by heart, then generalises to new cases by comparing them to the learnt examples using a similarity measure.
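Instance-based learning can be sketched with k-nearest neighbours; the 2-D training points and labels below are illustrative assumptions:

```python
from collections import Counter

# The "learning" is just memorising the training examples. A new point
# is classified by majority vote among its k nearest memorised examples.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((4.0, 4.2), "large"), ((4.5, 3.8), "large"), ((4.2, 4.0), "large")]

def knn_predict(point, k=3):
    # Squared Euclidean distance serves as the similarity measure.
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda ex: dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))   # near the "small" examples
print(knn_predict((4.1, 4.0)))   # near the "large" examples
```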

b. Model-based learning

       Another way to generalise from a set of examples is to build a model of those examples and then use that model to make predictions. This is called model-based learning.

Utility or fitness function: measures how good the model is
Cost function: measures how bad the model is

For linear regression problems, people typically use a cost function that measures the distance between the linear model's predictions and the training examples; this is the mean squared error (MSE). The objective is to minimise it.
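Model-based learning with a linear model and the MSE cost can be sketched as follows; the tiny dataset is an illustrative assumption:

```python
# Fit the model y = w*x + b by ordinary least squares, then measure
# how bad the model is with the MSE cost on the training set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]      # roughly y = 2x

n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
w = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
     / sum((x - x_mean) ** 2 for x in xs))
b = y_mean - w * x_mean

# MSE: average squared distance between predictions and training targets.
mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
print(f"w={w:.2f}, b={b:.2f}, MSE={mse:.4f}")
```

Once fitted, the model (just the two numbers w and b) is all we keep; predictions for new cases come from the model, not from the stored examples.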

Challenges of Machine Learning

The two main things that can go wrong are bad data and bad algorithms.

Bad Data

a. Insufficient quantity of data:

Machine learning is not yet so advanced that it can learn from a handful of examples; it takes a lot of data for most algorithms to work properly.

b. Poor quality data:

Obviously, if your data is full of errors, outliers and noise, it is harder for the system to detect patterns, and this affects the system's performance.

c. Irrelevant features

Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical step that determines the success of a model is coming up with a good set of features to train on. This process, called feature engineering, involves feature selection (choosing the most useful features), feature extraction (combining existing features into more useful ones) and creating new features by gathering new data.

Bad Algorithms

a. Overfitting the data

This means the model performs well on the training data but does not generalise well to new cases. It happens when the model is too complex relative to the amount and noisiness of the data. Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.
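Regularization can be sketched with one-dimensional ridge regression; the data and penalty values below are illustrative assumptions:

```python
# Ridge regression in one dimension: the penalty term lam shrinks the
# weight towards zero, constraining the model to be simpler.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]            # roughly y = 2x (no intercept)

def ridge_weight(lam):
    # Closed-form minimiser of sum((w*x - y)^2) + lam * w^2.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(ridge_weight(0.0))     # unregularised least-squares fit
print(ridge_weight(10.0))    # heavier penalty -> smaller, "simpler" weight
```

The penalty trades a little training-set accuracy for a more constrained model, which is exactly the overfitting/regularization trade-off described above.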

b. Underfitting the data

This happens when the model is too simple to learn the underlying structure of the data. The main solutions are: selecting a more powerful model, feeding better features to the learning algorithm, and reducing the constraints on the model.

Testing and Validating

The only way to know how well a model will perform on new cases is to try it out on new cases. To monitor performance, the data is split into a training set and a test set. The error rate on the test set is called the generalisation error. If the training error is low but the generalisation error is high, the model is overfitting the training data.
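The train/test workflow can be sketched as follows; the synthetic dataset and 80/20 split ratio are illustrative assumptions:

```python
import random

# Split the data, fit on the training set only, then compare the
# training error with the generalisation (test-set) error.
random.seed(42)
data = [(x, 2 * x + random.gauss(0, 0.5)) for x in [i / 10 for i in range(100)]]
random.shuffle(data)
train, test = data[:80], data[80:]   # 80/20 train/test split

# Fit y = w*x by least squares on the training set only.
w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mse(dataset):
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)

print(f"training error={mse(train):.3f}, generalisation error={mse(test):.3f}")
```

For a simple, well-specified model like this one, the two errors stay close; a large gap between them is the symptom of overfitting described above.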