Machine Learning: The Basics

As machine learning (closely related to, and sometimes loosely called, data mining) becomes a more integral part of technology everywhere, it will become increasingly important for lawyers and businesspeople to understand how it works.

Machine learning is a subfield of AI that encompasses the creation, study, and use of techniques that allow computers to process new information, learn from it, and then use that learning to perform some task.  The most important machine learning concept to understand, and the focus of this post, is the distinction between supervised and unsupervised learning, the two main methods by which systems are trained.  There is an enormous diversity of specific algorithms and techniques used in machine learning, and most fit into one of these two methods.

The goal of supervised learning is to create a system that can successfully predict or classify input data; how the system uses those predictions or classifications is irrelevant.  Developing a supervised learning scheme requires a sample dataset that is representative of the universe, i.e. the data that the system will have to read after development is finished.  Each record in the sample must also carry the known correct answer (the class or value the system is meant to predict), since that is what the system learns from.  This dataset is divided into a training set and a testing set.  The division can be made in a number of different ways, and the two sets do not have to be of equal size, as long as each is representative of the full sample dataset and the universe.  There are a number of supervised learning algorithms; many of the most popular use Bayesian statistics, neural nets, or decision trees.
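The division into a training set and a testing set can be sketched in a few lines of Python.  This is only one possible approach, a seeded random shuffle followed by a cut; the function name `split_dataset`, the 25% test fraction, and the toy sample are assumptions made for illustration.  A random shuffle tends to keep both sets representative of the sample, though it does not guarantee it for small or unbalanced datasets.

```python
import random

def split_dataset(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records, then cut it into (training, testing) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical sample dataset: (feature, known label) pairs.
sample = [(x, "high" if x >= 50 else "low") for x in range(100)]

training_set, testing_set = split_dataset(sample)
print(len(training_set), len(testing_set))  # 75 25
```

The two sets are deliberately unequal in size here, which the description above allows.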

When conducting supervised learning, the testing set is held in reserve while the training set is run through the algorithm to build a model that instructs the system in how to classify or predict the universe.  After the model has been completed, the testing set is run through the system and the efficacy of the model is evaluated.  The algorithm is then tweaked, or the dataset purged of fields, in order to decrease noise and increase accuracy, and the supervised learning is run again.  This analysis and tweaking can be done manually or by a wrapper, an algorithm that makes the changes and reruns the learning automatically until a result meeting specified end parameters is found or the maximum number of attempts is made.  A number of different wrappers exist; some of the most popular involve genetic techniques or numerical/statistical analysis of the dataset.  Only when the system is ready is it used, with the optimized model, to analyze data directly from the universe.
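The train, evaluate, and tweak cycle can be illustrated with a small sketch.  Everything in it is an assumption chosen for brevity: the model is a simple nearest-centroid classifier (not one of the algorithms named above), the wrapper searches field subsets exhaustively from largest to smallest (not a genetic technique), and the tiny dataset is made up so that field 1 is pure noise.

```python
from itertools import combinations

def train(rows, fields):
    """Build the model: one centroid (mean point) per label, using only `fields`."""
    grouped = {}
    for features, label in rows:
        grouped.setdefault(label, []).append([features[i] for i in fields])
    return {
        label: [sum(column) / len(points) for column in zip(*points)]
        for label, points in grouped.items()
    }

def classify(model, features, fields):
    """Return the label whose centroid is closest to the record."""
    picked = [features[i] for i in fields]
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(picked, model[label]))
    return min(model, key=squared_distance)

def accuracy(model, rows, fields):
    """Fraction of rows whose predicted label matches the known label."""
    correct = sum(classify(model, features, fields) == label for features, label in rows)
    return correct / len(rows)

def wrapper(training_set, testing_set, n_fields, target=0.9, max_attempts=10):
    """Try field subsets, largest first, until the target accuracy or the attempt cap."""
    candidates = [
        fields
        for size in range(n_fields, 0, -1)
        for fields in combinations(range(n_fields), size)
    ]
    best_fields, best_score, attempts = None, -1.0, 0
    for fields in candidates[:max_attempts]:
        attempts += 1
        score = accuracy(train(training_set, fields), testing_set, fields)
        if score > best_score:
            best_fields, best_score = fields, score
        if score >= target:  # end parameter met, stop early
            break
    return best_fields, best_score, attempts

# Made-up data: field 0 carries the signal, field 1 is noise on a larger scale.
training_set = [
    ((1.0, 90.0), "low"), ((2.0, 10.0), "low"), ((3.0, 80.0), "low"),
    ((7.0, 20.0), "high"), ((8.0, 90.0), "high"), ((9.0, 10.0), "high"),
]
testing_set = [
    ((2.5, 5.0), "low"), ((1.5, 95.0), "low"),
    ((7.5, 85.0), "high"), ((8.5, 15.0), "high"),
]

best_fields, best_score, attempts = wrapper(training_set, testing_set, n_fields=2)
print(best_fields, best_score, attempts)  # (0,) 1.0 2
```

With both fields the model scores 0.5 on the testing set; once the wrapper purges the noisy field, the score rises to 1.0 and the search stops on its second attempt.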

By contrast, unsupervised learning is carried out on the universe immediately, with no training or testing phase, and is intended to suggest conclusions rather than predict or classify.  Many of the in-depth graphics we see in news reports, such as relationship graphs, are products of unsupervised learning.  Unsupervised learning covers a number of popular analytical techniques, including basic statistics, relationship analysis, clustering, and outlier detection.  A system using unsupervised learning may then pass its conclusions on to a user or use them to make internal adjustments.
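Two of the techniques listed above, outlier detection and clustering, can be sketched on unlabeled numbers.  Both functions are simplified illustrations under stated assumptions: `find_outliers` uses a z-score cutoff of 2.0, `two_means` is a one-dimensional two-cluster variant of k-means seeded at the minimum and maximum, and the data values are invented.

```python
from statistics import mean, pstdev

def find_outliers(values, z_threshold=2.0):
    """Flag values lying more than `z_threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

def two_means(values, max_rounds=20):
    """Split 1-D values into two clusters around repeatedly updated centers."""
    centers = [min(values), max(values)]  # assumed starting centers
    groups = [[], []]
    for _ in range(max_rounds):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return groups

# No labels anywhere: the system only suggests structure in the data.
outliers = find_outliers([10, 12, 11, 13, 12, 11, 95])
clusters = two_means([1.0, 1.5, 2.0, 8.0, 8.5, 9.0])
print(outliers)  # [95]
print(clusters)  # [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
```

Neither result is a prediction; each is a suggestion that a user, or the system itself, can then act on.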

