Data science dictionary

Data science is a relatively new field and it comes with its own jargon. Here is a short glossary of the terms that you are likely to encounter during your data science journey. A simple definition is given for each term, and the reader is invited to investigate the terms of interest in more detail. Most definitions are inspired by other posts found on the internet.

1. Types of machine learning fields

Machine learning
Field of study that gives computers the ability to learn without being explicitly programmed. (This term was coined by Arthur Samuel in 1959.) There are four main types of machine learning:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

Supervised learning
It’s a type of machine learning where the computer makes predictions or decisions based on a labelled training set of observations. It can be used for classification or regression problems.
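As an illustrative sketch (using scikit-learn on made-up toy data), a classifier is fit on labelled observations and then used to predict labels for new inputs:

```python
# Supervised learning sketch: fit a classifier on a labelled toy data set.
from sklearn.linear_model import LogisticRegression

X_train = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]  # input variable
y_train = [0, 0, 0, 1, 1, 1]                             # labels

model = LogisticRegression()
model.fit(X_train, y_train)            # learn from labelled examples

print(model.predict([[1.5], [11.5]]))  # predict labels for unseen inputs
```

Replacing the classifier with a regressor (and the labels with numbers) turns the same pattern into a regression problem.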

Unsupervised learning
It’s a type of machine learning used to draw inferences from data sets consisting of input data without labelled outputs. It is typically used to group similar observations together (cluster analysis).
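A minimal sketch of cluster analysis, assuming scikit-learn and two made-up groups of points:

```python
# Unsupervised learning sketch: cluster unlabelled observations with k-means.
from sklearn.cluster import KMeans

X = [[0, 0], [0, 1], [1, 0],        # one group of nearby points
     [10, 10], [10, 11], [11, 10]]  # another group far away

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # each observation is assigned to one of 2 clusters
```

No labels were provided: the algorithm discovers the two groups from the geometry of the data alone.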

Semi-supervised learning
It’s a type of machine learning algorithm that makes use of unlabelled data to augment labelled data in a supervised learning context. It allows the model to train on a larger data set, so it can be more accurate. It is useful when generating labels for a training data set is difficult or expensive.
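As a sketch of the idea, scikit-learn's self-training wrapper pseudo-labels confident unlabelled points (marked with -1) and retrains; the toy data below is made up for illustration:

```python
# Semi-supervised learning sketch: unlabelled observations carry the label -1.
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = [[0.0], [1.0], [10.0], [11.0], [2.0], [12.0]]
y = [0, 0, 1, 1, -1, -1]  # the last two observations are unlabelled

clf = SelfTrainingClassifier(LogisticRegression())
clf.fit(X, y)  # pseudo-labels confident unlabelled points, then retrains
print(clf.predict([[1.5], [11.5]]))
```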

Reinforcement learning
It’s an area of machine learning concerned with how software agents learn to take actions by interacting with an environment and receiving positive or negative rewards for performing actions. The goal is to choose the appropriate actions so as to maximize the cumulative reward. Please refer to my articles on reinforcement learning for more details.

Active learning = Optimal experimental design
It’s a type of semi-supervised machine learning where the learning algorithm can choose the data it wants to learn from. By carefully selecting the most informative observations to be labelled, active learning can achieve similar or better performance than supervised learning methods while using substantially less training data. It thus reduces the number of labelled observations required to train the model, which can otherwise be a very time-consuming and costly task.
Eg. human-in-the-loop labelling
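One common selection strategy is uncertainty sampling: query the unlabelled point the current model is least sure about. A minimal sketch, with made-up data and pool:

```python
# Active learning sketch (uncertainty sampling): among a pool of unlabelled
# points, query the one whose predicted probability is closest to 0.5.
from sklearn.linear_model import LogisticRegression

X_labelled = [[0.0], [1.0], [10.0], [11.0]]
y_labelled = [0, 0, 1, 1]
pool = [[0.5], [5.5], [10.5]]  # unlabelled candidates

model = LogisticRegression().fit(X_labelled, y_labelled)
probs = model.predict_proba(pool)[:, 1]
# The most informative point is the one closest to the decision boundary.
query_idx = min(range(len(pool)), key=lambda i: abs(probs[i] - 0.5))
print(pool[query_idx])  # → [5.5], the point whose label would help most
```

The selected point would then be sent to a human annotator, added to the labelled set, and the loop repeated.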

Passive learning
In opposition to active learning, passive learning involves gathering a large amount of data randomly sampled from the underlying distribution and using this data set to train a predictive model.

Bayesian optimization
It is an approach to optimizing objective functions that take a long time to evaluate. Combined with reinforcement learning, it can learn parametrised policies in only a few iterations.

Artificial general intelligence = strong AI = full AI = broad AI
It refers to the intelligence of a machine that could successfully perform any intellectual task that a human being can. This does not exist (yet) and is still part of science-fiction culture.

Weak AI = narrow AI
AI only focused on one narrow task.

Big data
It refers to a field of data science where the data is too large to fit on one node, so specialised infrastructure is required in order to analyse and manipulate the data.

Business intelligence (BI)
It is a set of techniques and tools used for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes.

Natural Language Processing (NLP)
It is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages.

Data mining
It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Extreme Learning Machine (ELM)
It is an easy-to-use and effective learning algorithm for single-hidden-layer feed-forward neural networks. Classical learning algorithms for neural networks, e.g. backpropagation, require setting several user-defined parameters and may get stuck in a local minimum. However, ELM only requires setting the number of hidden neurons and the activation function. It does not require adjusting the input weights and hidden-layer biases during the implementation of the algorithm, and it produces only one optimal solution. Therefore, ELM has the advantages of fast learning speed and good generalization performance.

Lazy vs eager learning
A lazy learning algorithm stores the training data without learning from it and only starts fitting a model when it receives the test data. It takes less time in training but more time in predicting.
Eg. K-Nearest Neighbors
Given a training set, an eager learning algorithm constructs a predictive model before receiving any test data. It tries to generalize from the training data before receiving queries.
Eg. decision trees, neural networks, Naive Bayes
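The contrast can be sketched with scikit-learn (toy data made up for illustration): k-NN defers its work to prediction time, while a decision tree builds its model up front during `fit()`:

```python
# Lazy vs eager sketch: k-NN's fit() essentially just stores the data;
# a decision tree's fit() builds the full model before any prediction.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

lazy = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # fit ≈ store X, y
eager = DecisionTreeClassifier().fit(X, y)            # fit builds the tree

print(lazy.predict([[1.5]]), eager.predict([[1.5]]))  # both predict class 0
```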

Instance-based learning = memory-based learning
It is a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.
Eg. K-Nearest Neighbors

Transfer Learning
It consists in applying the knowledge of an already-trained machine learning model to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses. It is currently very popular in the field of Deep Learning because it enables you to train Deep Neural Networks with comparatively little data.

Computational learning theory
It is a sub-field of AI devoted to studying the design and analysis of machine learning algorithms.

Rule-based machine learning (RBML)
It is a machine learning method that identifies, learns, or evolves ‘rules’ to store, manipulate or apply knowledge. Rules typically take the form of an {IF:THEN} expression, (e.g. {IF ‘condition’ THEN ‘result’}, or as a more specific example, {IF ‘red’ AND ‘octagon’ THEN ‘stop-sign’}).
Eg. learning classifier systems (LCS), association rule mining (ARM), artificial immune systems (AIS)

Computer vision (CV)
It is a field that deals with how computers can be made to gain high-level understanding from digital images or videos.

Anomaly detection = outlier detection
It consists in the identification of items, events or observations which do not conform to an expected pattern or to other items in a data set.
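A very simple statistical sketch (a z-score rule on made-up sensor readings; the threshold of 2 standard deviations is an illustrative choice, not a universal one):

```python
# Anomaly detection sketch: flag observations more than 2 standard
# deviations from the mean of the data set.
from statistics import mean, stdev

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]  # made-up readings
mu, sigma = mean(data), stdev(data)
outliers = [x for x in data if abs(x - mu) / sigma > 2]
print(outliers)  # → [25.0]
```

Real systems typically use more robust methods (e.g. isolation forests or density-based approaches), since extreme outliers inflate the mean and standard deviation themselves.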

Data analyst vs data scientist
A data analyst analyses trends in the existing data and looks for meaning in the data. A data scientist makes data-driven predictions using machine learning algorithms.

2. Types of variables and ML problems

Independent vs dependent variables
In supervised learning, there are 2 types of variables:
Independent variables = predictors = regressors = features = input variables
Eg. age, number of rooms in a house
Dependent variable = response = output variable
Eg. price of a house

Quantitative (= numeric) vs categorical variables (= qualitative)
Quantitative variables take values that describe a measurable quantity as a number.
Categorical variables take values that describe a quality or characteristic of the data.

Continuous vs discrete variables
Continuous variables are quantitative variables that can take any value within a certain range of real numbers.
Eg. temperature, time, distance, etc…
Discrete variables are quantitative variables that can only take values from a finite set of distinct whole values.
Eg. number of children, number of cars, etc…

Ordinal variables
They are categorical variables that can take a value that can be logically ordered or ranked.
Eg. academic grades (A, B, C), clothing size (small, medium, large), etc…

Nominal variables
They are categorical variables that can take a value that is not able to be organised in a logical sequence.
Eg. gender, eye color, religion, etc…

Dummy variable
They are binary variables created by encoding a qualitative variable as 0s and 1s (one-hot encoding or dummy encoding).
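A quick sketch with pandas, on a made-up colour column:

```python
# Dummy-variable sketch: one-hot encode a categorical column with pandas.
import pandas as pd

colors = pd.Series(["red", "green", "red", "blue"], name="color")
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)  # one 0/1 column per category: blue, green, red
```

Passing `drop_first=True` to `pd.get_dummies` drops one level (dummy encoding proper), which avoids perfect collinearity in linear models.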

Classification vs regression problems
In classification problems, the output variable is categorical.
Eg. identify spams and malware, classify images, speech recognition, identify fraudulent credit card transactions, targeted advertising
In regression problems, the output variable is quantitative continuous.
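A regression sketch using scikit-learn, with made-up house prices that happen to be perfectly linear in surface area:

```python
# Regression sketch: the output is a continuous quantity (e.g. a price).
from sklearn.linear_model import LinearRegression

X = [[50], [70], [90], [110]]  # e.g. house surface in m^2
y = [150, 210, 270, 330]       # e.g. price in k$ (exactly 3 * surface here)

reg = LinearRegression().fit(X, y)
print(round(reg.predict([[100]])[0]))  # → 300
```

Swapping the continuous target for class labels (and the regressor for a classifier) turns this into the classification setting described above.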

Clustering problem = cluster analysis
A type of unsupervised learning problem where the goal is to assign the observations to distinct groups.
Eg. automatically grouping similar genes
