10 Amazing Open Source Projects for Machine Learning Enthusiasts


Open source refers to software whose source code is accessible to everyone, so people can modify and share it. You can use the work in new ways, integrate it into a larger project, or derive a new work based on the original. Open source promotes the free exchange of ideas within a community to drive creative and technological innovation. Programmers should consider contributing to open source projects for the following reasons:

  1. It helps you write cleaner code.

  2. You gain a deeper understanding of the technology.

  3. Contributing to open source projects earns you visibility and can advance your career.

  4. Adding open-source contributions to your resume gives it more weight.

  5. It improves your coding skills.

  6. It improves software at both the user and business level.


To start contributing to open source projects there are some prerequisites:

  1. Learn a programming language: Open source contribution means writing code to get involved in development, so you need to learn a programming language. It can be any language of your choice; it is easy to pick up another one later, depending on the needs of the project.

  2. Get familiar with version control systems: These are software tools that keep track of every modification you make to the source code over time, so any change can be recalled at a later stage if needed. Popular version control systems include Git, Mercurial, and CVS; of these, Git is the most widely used in industry.

Now we will look at some of the amazing Open Source Projects you can contribute to.

So, let’s get started!

1. Caliban


This is a machine learning project from Google. It is used for developing machine learning research workflows and notebooks in an isolated, reproducible computing environment. It solves a common problem: when building data science projects, it is often difficult to create a test environment that mirrors real-world conditions, and you cannot predict every edge case. Caliban makes it easy to develop ML models locally, run the code on your machine, and then execute that exact same code in a cloud environment on bigger machines. Dockerized research workflows become easy, locally as well as in the cloud.

Github Link: GitHub - google/caliban: Research workflows made easy, locally and in the Cloud.

2. Kornia


Kornia is a computer vision library for PyTorch, used to solve generic computer vision problems. Kornia is built on top of PyTorch and relies on its efficiency and automatic differentiation to compute complex functions. It is a set of libraries used to train neural network models and perform image transformation, image filtering, edge detection, epipolar geometry, depth estimation, and more.
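As a rough illustration of what an edge-detection operator does, here is a NumPy-only Sobel sketch. Kornia ships differentiable, batched versions of such filters operating on PyTorch tensors; this stand-alone code only illustrates the underlying idea and is not Kornia's API.

```python
import numpy as np

def sobel_edges(img):
    """Approximate edge magnitude with 3x3 Sobel kernels (NumPy-only sketch)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)  # horizontal gradient kernel
    ky = kx.T                                  # vertical gradient kernel
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx = (patch * kx).sum()
            gy = (patch * ky).sum()
            out[i, j] = np.hypot(gx, gy)       # gradient magnitude
    return out

# A vertical step edge: left half dark, right half bright.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)  # strong response only around the step
```

The filter responds only at the brightness step and stays zero in the flat regions, which is exactly what edge detectors in libraries like Kornia compute, just vectorized and differentiably.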

Github Link: GitHub - kornia/kornia: Open Source Differentiable Computer Vision Library for PyTorch

3. Analytics Zoo


Analytics Zoo is a unified data analytics and AI platform that unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can efficiently scale from a laptop to a large cluster to process production big data. The project is maintained by intel-analytics.

Analytics Zoo helps you build an AI solution in the following ways:

  • Helps you easily prototype AI models.
  • Scaling is efficiently managed.
  • Helps to add automation processes to your ML pipeline like feature engineering, model selection, etc.

Github link: GitHub - intel-analytics/analytics-zoo: Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

4. MLJAR Automated Machine Learning for Humans


Mljar is a platform for prototyping models and deploying them as services. To find the best model, Mljar searches over different algorithms and tunes their hyper-parameters. It delivers quick results by running all computation in the cloud and finally building ensemble models, and then generates a report for you from the AutoML training. Isn't this cool?

Mljar efficiently trains models for binary classification, multi-class classification, and regression.

It provides two kinds of interfaces:

  • An interface to run ML models from your web browser
  • A Python wrapper over the Mljar API

The report you receive from Mljar contains a table with each model's score and the time needed to train it. Performance is shown as scatter and box plots, so it is easy to check visually which algorithm performs best.
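The leaderboard in that report is essentially a table of candidate models sorted by validation score. A plain-Python sketch of the idea (the model names and numbers below are invented for illustration, not real Mljar output):

```python
# Toy sketch of an AutoML leaderboard: each candidate model with its
# validation score and training time. All numbers here are made up.
results = [
    {"model": "Xgboost",        "logloss": 0.356, "train_time_s": 12.4},
    {"model": "Random Forest",  "logloss": 0.401, "train_time_s": 8.1},
    {"model": "Neural Network", "logloss": 0.389, "train_time_s": 25.7},
    {"model": "Ensemble",       "logloss": 0.341, "train_time_s": 3.2},
]

# Lower logloss is better, so the leaderboard sorts ascending.
leaderboard = sorted(results, key=lambda r: r["logloss"])
best = leaderboard[0]["model"]  # the ensemble wins in this toy example
```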

Documentation: https://supervised.mljar.com/

Source Code: GitHub - mljar/mljar-supervised: Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning


5. DeepDetect

DeepDetect is a machine learning API and server written in C++. If you want to work with state-of-the-art machine learning algorithms and integrate them into existing applications, DeepDetect is for you. It supports a wide variety of tasks such as classification, segmentation, regression, object detection, and autoencoders, and it handles both supervised and unsupervised deep learning on images, time series, text, and some other types of data. However, DeepDetect depends on external machine learning libraries:
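Because DeepDetect is driven over HTTP with JSON bodies, a client interaction boils down to composing a JSON payload and sending it to the server. The sketch below builds an illustrative service-creation payload; the exact field names are defined by DeepDetect's API documentation, so treat them as assumptions about the general shape rather than a verbatim schema.

```python
import json

# Hedged sketch of a DeepDetect service-creation body. Field names are
# illustrative assumptions, not a verbatim copy of the API schema.
payload = {
    "description": "image classifier",
    "mllib": "caffe",                 # which backend library to use
    "type": "supervised",
    "parameters": {
        "input": {"connector": "image", "width": 224, "height": 224},
        "mllib": {"nclasses": 10},    # backend-specific settings
    },
}

# This string is what a client would PUT to the server's /services endpoint.
body = json.dumps(payload)
decoded = json.loads(body)  # round-trips cleanly, as any JSON body must
```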

  • Deep Learning libraries: Tensorflow, Caffe2, Torch.
  • Gradient Boosting Library: XGBoost.
  • Clustering with T-SNE.

Github link: GitHub - jolibrain/deepdetect: Deep Learning API and Server in C++14 support for Caffe, Caffe2, PyTorch,TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE

6. Dopamine


Dopamine is an open-source project from Google, written in Python. It is a research framework for fast prototyping of reinforcement learning algorithms.

Dopamine’s design principles are:

  • Easy Experiment: Dopamine makes it easy for new users to run experiments.
  • It is compact and reliable.
  • It also facilitates reproducibility in results.
  • It is flexible, making it easy for new users to try out new research ideas.

Note: Check these Colaboratory Notebooks to learn how to use Dopamine.

Github link: GitHub - google/dopamine: Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.

7. TensorFlow


TensorFlow is the most famous and popular machine learning open source project on GitHub. It is an open-source software library for numerical computation using data flow graphs. It has a very easy-to-use Python interface for building and executing computational graphs, and it provides stable Python and C++ APIs. TensorFlow has some amazing use cases, such as:

  • Voice/sound recognition
  • Text-based applications
  • Image recognition
  • Video detection

…and many more!
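The "data flow graph" idea behind those use cases can be shown with a toy sketch: values flow along edges, and each node applies an operation to its inputs. This is plain Python for teaching purposes, not TensorFlow's actual API.

```python
# Toy illustration of a data flow graph (not TensorFlow's real API):
# nodes hold an operation and references to their inputs, and evaluating
# the output node walks the graph recursively.

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # upstream nodes this one consumes
        self.value = value    # only used by constants

    def eval(self):
        if self.op == "const":
            return self.value
        args = [n.eval() for n in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError(f"unknown op: {self.op}")

# Build the graph for (a * b) + a with a=3, b=4, then run it.
a = Node("const", value=3)
b = Node("const", value=4)
c = Node("add", inputs=(Node("mul", inputs=(a, b)), a))
result = c.eval()  # 3*4 + 3 = 15
```

Real frameworks add a lot on top of this skeleton (tensors, gradients, device placement), but the graph-of-operations structure is the same.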

GitHub Link: GitHub - tensorflow/tensorflow: An Open Source Machine Learning Framework for Everyone

8. PredictionIO


PredictionIO is built on top of a state-of-the-art open-source stack. This machine learning server is designed for data scientists to create predictive engines for any machine learning task. Some of its amazing features are:

  • It helps you quickly build and deploy an engine as a web service in production, with customizable templates.
  • Once deployed as a web service, it responds to dynamic queries in real time.
  • It supports machine learning and data processing libraries such as OpenNLP and Spark MLlib.
  • It also simplifies data infrastructure management.

GitHub link: GitHub - apache/predictionio: PredictionIO, a machine learning server for developers and ML engineers.


9. Scikit-learn

Scikit-learn is a free, Python-based machine learning library. It provides various algorithms for classification, regression, and clustering, including random forests, gradient boosting, and DBSCAN. It is built upon SciPy, which must be pre-installed before you can use scikit-learn. It also provides models for:

  • Ensemble methods
  • Feature extraction
  • Parameter tuning
  • Manifold learning
  • Feature selection
  • Dimensionality reduction
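A minimal example of the classification workflow, assuming scikit-learn is installed: generate a synthetic dataset, train a random forest, and check held-out accuracy.

```python
# Minimal scikit-learn sketch: random forest classification on synthetic
# data, evaluated on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 400 samples, 10 features; random_state fixes the data for reproducibility.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # mean accuracy on the test split
```

Swapping in another estimator (gradient boosting, a linear model, etc.) changes only the `clf = ...` line, which is much of scikit-learn's appeal.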

Note: To learn scikit-learn, follow the documentation: scikit-learn: machine learning in Python — scikit-learn 0.24.2 documentation

GitHub Link: scikit-learn · GitHub

10. Pylearn2

Pylearn2 is one of the best-known machine learning libraries among Python developers. It is built on Theano: you can write its plugins using mathematical expressions, while Theano takes care of their optimization and stabilization. It has some awesome features like:

  • A “default training algorithm” to train the model itself

  • Model Estimation Criteria

    • Score Matching
    • Cross-entropy
    • Log-likelihood
  • Dataset pre-processing

    • Contrast normalization
    • ZCA whitening
    • Patch extraction (for implementing convolution-like algorithms)

GitHub Link: GitHub - lisa-lab/pylearn2: Warning: This project does not have any current developer. See below.

To expand a little more on scikit-learn: for minor tasks such as discretization or binning of features, it provides options like KBinsDiscretizer, and for missing-data imputation, IterativeImputer. For various transformations there are PowerTransformer and FunctionTransformer. The point is: keep looking out for open source updates and new features once in a while. Each upgrade can mean cleaner code across your project, and since these libraries are well tested, they are reliable and require less unit testing on the developer's end.
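A short sketch of two of those preprocessing helpers, assuming scikit-learn is installed (the toy data is invented):

```python
# Sketch of two scikit-learn preprocessing helpers: KBinsDiscretizer for
# binning a continuous feature, FunctionTransformer for wrapping an
# arbitrary function as a transformer step.
import numpy as np
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer

X = np.array([[0.0], [1.0], [4.0], [9.0], [10.0]])

# Two equal-width bins over [0, 10]: values below 5 land in bin 0,
# values of 5 and above in bin 1 ("ordinal" encodes bins as integers).
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(X)

# Apply log1p as a drop-in transformer (usable inside a Pipeline too).
logger = FunctionTransformer(np.log1p)
logged = logger.fit_transform(X)
```

Both objects follow the same `fit`/`transform` interface as estimators, so they slot directly into scikit-learn pipelines.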