

# Exploring Data Compression: A Beginner's Guide

## Making Big Stuff Smaller

Carl Amko
·Jan 29, 2023·


In today's digital world, a staggering quantity of data makes its way across the internet every day. From 4K videos to 3D models, the amount of information racing around the globe has never been higher. As a result, data compression has become a crucial consideration for businesses and individual consumers alike.

So, what is compression and how does it work?

Data compression is the process of transforming data into a smaller representation of itself, typically by using specific compression algorithms. At a high level, these algorithms work by identifying patterns and redundancies in the data and removing or encoding them to reduce the overall file size. While delving into the details of specific compression algorithms may be intriguing, it is not within the scope of this post. Instead, the focus will be on broader concepts and benefits of data compression.

### Compression Ratio

The effectiveness of data compression can vary greatly depending on both the algorithm employed and the nature of the data being compressed. Factors such as the type of data, the complexity of the data, and the amount of redundancy in the data can all impact the level of compression that can be achieved.
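To make the effect of redundancy concrete, here is a minimal sketch using Python's built-in `zlib` module. The two inputs are made up for illustration: one is a short phrase repeated many times, the other is random bytes of the same length.

```python
import os
import zlib

# Highly redundant input: the same short phrase repeated many times.
redundant = b"hello world " * 1000
# Input with almost no redundancy: random bytes of the same length.
random_bytes = os.urandom(len(redundant))

redundant_compressed = zlib.compress(redundant)
random_compressed = zlib.compress(random_bytes)

# Print "original size -> compressed size" for each input.
print(len(redundant), "->", len(redundant_compressed))
print(len(random_bytes), "->", len(random_compressed))
```

The repeated phrase should shrink to a tiny fraction of its size, while the random bytes barely shrink at all (and may even grow slightly because of format overhead).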

The standard metric for how much a given algorithm shrinks its input is the compression ratio: the original size divided by the compressed size. A higher ratio means a smaller output relative to the input. The ratio is usually written as a decimal number, and it says nothing about how fast the algorithm runs.

$$ratio = \frac{size_{original}}{size_{compressed}}$$

Taking a file of size 100KB as an example, the compression ratio can be calculated for two different algorithms, A and B. After compressing with algorithm A, the resulting file is 40KB. When the file is instead compressed with algorithm B, the resulting file is 30KB.

$$ratio_A = \frac{100\,\text{KB}}{40\,\text{KB}} = 2.5$$

$$ratio_B = \frac{100\,\text{KB}}{30\,\text{KB}} \approx 3.33$$

By comparing these ratios, it's clear that algorithm B performed better on this particular file as it achieved a higher compression ratio.
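The same arithmetic can be sketched in a few lines of Python. The helper name `compression_ratio` is made up for this example; the sizes are the ones from the example above, in KB.

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Return original size divided by compressed size."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


ratio_a = compression_ratio(100, 40)  # algorithm A: 100 KB -> 40 KB
ratio_b = compression_ratio(100, 30)  # algorithm B: 100 KB -> 30 KB

print(f"A: {ratio_a:.2f}, B: {ratio_b:.2f}")  # A: 2.50, B: 3.33
```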

## Lossless vs Lossy

Data compression algorithms can either be lossless or lossy. This classification distinguishes whether or not the original data can be completely restored without any information loss.

As an example, if you've ever encountered a .zip file, it was most likely compressed with a lossless algorithm called DEFLATE under the hood. When you decompress a .zip file, you get back every file exactly as it was, byte for byte, every time.
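A quick way to see the lossless guarantee is a round trip through Python's `zlib` module, which wraps DEFLATE in a small header and checksum. This is a minimal sketch with a made-up input string, not a demonstration of the .zip container itself.

```python
import zlib

# Made-up input with plenty of repetition.
original = b"The quick brown fox jumps over the lazy dog. " * 50

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# The restored bytes are identical to the input, not merely similar.
print(len(original), len(compressed), restored == original)
```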

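An example of a lossy algorithm is JPEG (commonly seen as .jpg or .jpeg) image compression. When an image is saved as a JPEG, the encoder permanently discards some fine detail, so the original pixels cannot be recovered exactly. Scaling an image down in resolution (e.g. from 1024x1024 to 512x512) is lossy in the same sense: once the detail is gone, it is impossible to restore the image to its original quality.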

This naturally raises the question: why would you ever use a lossy algorithm in the first place? The answer is that lossy algorithms can often achieve much higher compression ratios, at the cost of being irreversible. For photographs, for example, .jpg files are usually considerably smaller than their .png counterparts.
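The trade-off can be illustrated with a toy example that needs only the standard library. This is not how JPEG works internally; it is a simplified sketch in which a hypothetical one-byte-per-sample signal is "quantized" (rounded down to a coarser grid) before being handed to a lossless compressor. The `quantize` helper and the signal are made up for illustration.

```python
import math
import zlib

# Hypothetical signal: a smooth wave plus a small repeating wiggle, one byte per sample.
samples = bytes(
    int(127 + 100 * math.sin(i / 20) + (i * 7) % 5) for i in range(10_000)
)


def quantize(data: bytes, step: int) -> bytes:
    """Round every sample down to a multiple of `step`. This cannot be undone."""
    return bytes((b // step) * step for b in data)


lossless = zlib.compress(samples)               # keeps every detail
lossy = zlib.compress(quantize(samples, 16))    # throws away the fine detail first

print("original:", len(samples))
print("lossless:", len(lossless))
print("lossy:   ", len(lossy))
```

Discarding the low-order detail leaves the compressor with far more repetition to exploit, which is why the lossy variant should come out smaller. The price is that `quantize(samples, 16)` no longer equals `samples`, and nothing can bring the discarded detail back.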

## Compression In The Real World

Working with smaller data sizes resulting from compression yields a number of advantages. The most notable benefits are the reduction in storage costs and the improvement in transmission speeds.

### Less Storage

Whether it's on a physical disk or with a cloud provider, storing data incurs a cost. While it may be hard to appreciate the impact of compression on a few files in isolation, the effect is far more considerable when storing terabytes or even petabytes of data. How much space is saved depends on the algorithm and the data: highly redundant data such as logs or plain text can shrink by an order of magnitude or more, while data that is already compressed barely shrinks at all.

Case Study: Apache Cassandra + LZ4 - Fast, capable compression for stored data at scale

The popular NoSQL database Cassandra uses a compression algorithm called LZ4 to reduce the footprint of data at rest. LZ4 is characterized by very fast compression and decompression, at the cost of a lower compression ratio than slower algorithms achieve. This design choice allows Cassandra to maintain high write throughput while still benefiting from compression.
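The speed-versus-ratio trade-off is easy to observe even without LZ4. The sketch below uses `zlib` compression levels as a stand-in (level 1 favors speed, level 9 favors ratio); it does not use LZ4 or Cassandra, and the log-like payload is made up for illustration. Timings will vary from machine to machine.

```python
import time
import zlib

# Hypothetical log-like payload with plenty of repeated structure.
data = b"".join(
    b"user_id=%d,event=click,ts=%d\n" % (i % 500, 1_700_000_000 + i)
    for i in range(50_000)
)

for level in (1, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Both levels are lossless; only speed and output size differ.
    assert zlib.decompress(compressed) == data
    ratio = len(data) / len(compressed)
    print(f"level {level}: ratio {ratio:.2f}, {elapsed_ms:.1f} ms")
```

A write-heavy system may reasonably prefer the faster setting and accept larger files, which is the same kind of choice described above.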

### Faster Transmission

Any network connection, whether a home or a cellular broadband network, is characterized by a maximum throughput rate. The amount of data being transmitted over a network can have a significant impact on its performance. Compression reduces the amount of data that needs to be transferred, which can lead to faster transfer speeds and improved network efficiency.
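As a rough back-of-the-envelope sketch, transfer time can be estimated as payload size divided by throughput. The link speed, payload size, and compression ratio below are hypothetical numbers chosen for illustration, and the estimate ignores latency, protocol overhead, and the time spent compressing and decompressing.

```python
def transfer_seconds(size_bytes: float, throughput_bits_per_s: float) -> float:
    """Idealized transfer time: payload size in bits divided by link throughput."""
    return size_bytes * 8 / throughput_bits_per_s


link = 50_000_000        # hypothetical 50 Mbit/s connection
original = 100_000_000   # hypothetical 100 MB payload
ratio = 2.5              # hypothetical compression ratio

uncompressed_time = transfer_seconds(original, link)
compressed_time = transfer_seconds(original / ratio, link)

print(uncompressed_time)  # 16.0
print(compressed_time)    # 6.4
```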

Case Study: WebSockets + DEFLATE - Reducing bandwidth requirements for arbitrary data transmission

WebSocket, a bidirectional communication protocol layered on top of TCP, gained a standardized compression extension in 2015 called permessage-deflate (RFC 7692), which is based on DEFLATE. When both endpoints agree to use it, the payload of each WebSocket message is compressed before it is sent over the network. This allows for significant bandwidth savings, especially when sending large or repetitive messages.
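The core idea can be sketched with Python's `zlib` module. This is a simplified illustration, not a WebSocket implementation: it assumes the scheme described in RFC 7692, where each message is compressed as raw DEFLATE, flushed with a sync flush, and the trailing four bytes `00 00 ff ff` are dropped before sending. The function names and the sample message are made up, and real implementations also handle framing, negotiation, and shared compression context between messages.

```python
import zlib

# The four bytes a sync flush always ends with; the sender strips them.
SYNC_TAIL = b"\x00\x00\xff\xff"


def compress_message(payload: bytes) -> bytes:
    # wbits=-15 requests raw DEFLATE (no zlib header or checksum).
    compressor = zlib.compressobj(wbits=-15)
    data = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    return data[: -len(SYNC_TAIL)]


def decompress_message(data: bytes) -> bytes:
    # The receiver restores the stripped tail before inflating.
    decompressor = zlib.decompressobj(wbits=-15)
    return decompressor.decompress(data + SYNC_TAIL)


message = b'{"type":"tick","symbol":"ABC","price":101.25}' * 20
wire_bytes = compress_message(message)

print(len(message), "->", len(wire_bytes))
print(decompress_message(wire_bytes) == message)
```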

## Try It Yourself

Almost all programming languages provide a library API for at least one general-purpose compression algorithm. Below you will find some examples written in Python using the built-in gzip and zipfile modules and the Pillow imaging library. They can be adapted to other languages with the appropriate syntax changes.

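```python
import gzip
import shutil

# Open an uncompressed text file in binary mode and write its contents
# to a .gz compressed file. gzip works on bytes, so 'rb' is required here.
with open('/home/basicbytes/file.txt', 'rb') as f_in:
    with gzip.open('/home/basicbytes/compressed.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
```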

```python
from zipfile import ZipFile

# Open a .zip file and extract all files inside it into the current directory.
with ZipFile('/home/basicbytes/archive.zip', 'r') as f_in:
    f_in.extractall()
```

```python
# pip install pillow
from PIL import Image

# Open a JPG image
image = Image.open("image.jpg")
# Capture the resolution of the original image
org_resolution = image.size
# Set a new resolution as half of the original resolution.
new_resolution = (org_resolution[0] // 2, org_resolution[1] // 2)
# Resize the image with the new resolution.
image = image.resize(new_resolution)
# Attempt to upscale it back to the original resolution
image = image.resize(org_resolution)
# Now save the image after downscale + upscale.
# Observe the quality loss!
# (The output filename is a placeholder; the original post does not name one.)
image.save("image_roundtrip.jpg")
```