Dask for Big Data

“Will this work if I give you 50gb of data?”

Use Dask. :zap:

What is Dask?

Dask is a Python library for parallel computing that helps you process large datasets in less time.

Libraries like Pandas and NumPy either take too long to load data as large as 50 GB, or the process crashes outright because the data doesn't fit in memory. :bulb:

Dask has its own set of data collections that extend the familiar ones: Dask DataFrames mirror Pandas DataFrames, and Dask Arrays mirror NumPy arrays.
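
A minimal sketch of that API parity (the file name and the "category"/"value" columns here are hypothetical):

```python
import dask.dataframe as dd

# Reads lazily in chunks; nothing is pulled into memory yet
df = dd.read_csv("large_data.csv")

# Familiar Pandas-style operations work the same way
mean_per_group = df.groupby("category")["value"].mean()
```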

Dask follows lazy evaluation: it builds a "graph" of the operations you request, but doesn't execute anything until you explicitly ask for the result.

You can view this complete parallel graph by calling the .visualize() method on any Dask collection.
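
For example, a sketch of deferring work and then peeking at the graph (same hypothetical file as above; .visualize() needs the graphviz package installed):

```python
import dask.dataframe as dd

df = dd.read_csv("large_data.csv")
total = df["value"].sum()  # still lazy: only the task graph exists so far

# Render the task graph to an image file
total.visualize(filename="task_graph.png")
```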

Dask splits a 50 GB dataset into many smaller Pandas DataFrames (partitions) by row index and then computes on them in parallel.

The operations you chain together become the "task graph" that Dask executes on those partitions in parallel when you call the .compute() method.
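
Roughly, with the same hypothetical file:

```python
import dask.dataframe as dd

# blocksize controls how the file is split into partitions
df = dd.read_csv("large_data.csv", blocksize="64MB")
print(df.npartitions)  # number of underlying Pandas DataFrames

# .compute() runs the task graph and returns a regular Pandas object
total = df["value"].sum().compute()
```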

On a single machine, Dask uses threads and processes to parallelise computations, letting you work with data larger than your memory.
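
A quick sketch of picking the local scheduler (same hypothetical file):

```python
import dask
import dask.dataframe as dd

df = dd.read_csv("large_data.csv")

# Pick the local scheduler per call: "threads" (the default for
# DataFrames) or "processes" to sidestep the GIL for Python-heavy work
result = df["value"].mean().compute(scheduler="processes")

# Or set it globally for every subsequent .compute()
dask.config.set(scheduler="threads")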

It also scales out to multiple machines: with a distributed scheduler, the same code can run across hundreds or even thousands of nodes in a cluster. :rocket:
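
A sketch of connecting to a cluster with dask.distributed; the scheduler address below is a placeholder:

```python
from dask.distributed import Client
import dask.dataframe as dd

# With no arguments, Client() spins up a local cluster for testing;
# pointing it at a real scheduler is the same one line:
# client = Client("tcp://scheduler-address:8786")  # placeholder address
client = Client()

df = dd.read_csv("large_data.csv")
print(df["value"].mean().compute())  # now runs on the cluster's workers
```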

Have you used Dask?

#python #machinelearning #datascience