Even with major advances over the past decade in computing power and storage costs, it still makes sense to keep your data sets as small and efficient as possible. That means running algorithms only on the data you need and not training on the excess. Unsupervised learning can help with that through a process called dimensionality reduction.
Dimensionality reduction (dimensions = how many columns are in your dataset) relies on many of the same concepts as Information Theory: it assumes that a lot of data is redundant, and that you can represent most of the information in a data set with only a fraction of the actual content. In practice, this means combining parts of your data in unique ways to convey meaning. There are a couple of popular algorithms commonly used to reduce dimensionality:
- Principal Component Analysis (PCA) – finds the linear combinations that communicate most of the variance in your data.
- Singular Value Decomposition (SVD) – factorizes your data into the product of three other, smaller matrices.
These methods, as well as some of their more complex cousins, all rely on concepts from linear algebra to break a matrix down into more digestible and informative pieces.
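To make that concrete, here is a minimal NumPy sketch, not taken from the article itself, that factorizes a small synthetic matrix with SVD and then uses the factors to perform PCA. The data shape, variable names, and component count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 samples, 5 columns (dimensions); the columns are deliberately built from
# only two underlying signals, so they are redundant with each other.
base = rng.normal(size=(200, 2))
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

# --- SVD: factorize the (centered) data into three smaller matrices ---
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# --- PCA: the rows of Vt are the principal components; projecting onto the
# first k of them keeps the linear combinations with the most variance ---
k = 2
X_reduced = X_centered @ Vt[:k].T          # 200 x 2 instead of 200 x 5

explained = (S ** 2) / np.sum(S ** 2)
print(f"Variance captured by {k} components: {explained[:k].sum():.1%}")
```

Because the five columns above are generated from just two underlying signals, two components capture nearly all of the variance, which is exactly the kind of redundancy dimensionality reduction exploits.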
Reducing the dimensionality of your data can be an important part of a good machine learning pipeline. Take the example of an image, the centerpiece of the burgeoning field of computer vision. We outlined here how big a dataset of images can be and why. If you could reduce the size of your training set by an order of magnitude, that would significantly lower your compute and storage costs and make your models run that much faster. That’s why PCA or SVD is often run on images as a preprocessing step in mature machine learning pipelines.
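As a rough illustration of that preprocessing step (a sketch under assumed choices, not the pipeline described above), scikit-learn's PCA can shrink its built-in 8x8 digits images from 64 pixel columns down to however many components are needed to retain about 95% of the variance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()            # 1,797 small images, each flattened to 64 pixels
X = digits.data                   # shape: (1797, 64)

# Ask for enough components to explain ~95% of the variance instead of a fixed
# number of columns; scikit-learn picks the component count for us.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions:  {X_reduced.shape[1]}")
print(f"Variance retained:   {pca.explained_variance_ratio_.sum():.1%}")
```

The reduced matrix, rather than the raw pixels, is what you would then feed into training, which is where the compute and storage savings come from.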