The Way Towards Becoming a Data Scientist

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.

Big Data Analytics, or Data Science, is a very common term in the IT industry because everyone knows it is something that is going to help us deal with the huge amount of data we are generating these days.

Let’s find out what skills are required:

  1. Math Skills:
  • Multivariable Calculus & Linear Algebra : These two subjects are very important because they help us understand the various machine learning algorithms that play an important role in data science.
  • Probability & Statistics : An understanding of statistics is very important because it is the foundation of data analysis. Probability theory underpins statistics and is commonly listed as a prerequisite for learning machine learning.
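As a small illustration of the statistics mentioned above, the sketch below computes a mean, a sample standard deviation and a Pearson correlation using only Python's standard library. The data values and the function name pearson_correlation are made up for this example; real projects would more likely rely on a library such as NumPy or Pandas.

```python
import math
import statistics

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Numerator: sum of products of deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator factors: square roots of the sums of squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)

# Made-up example data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

print("mean score:", statistics.mean(scores))
print("sample std dev:", round(statistics.stdev(scores), 2))
print("correlation:", round(pearson_correlation(hours, scores), 3))
```

A correlation close to 1 means the two samples rise together, close to -1 means one falls as the other rises, and close to 0 means there is no linear relationship.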
  2. Programming Skills:
  • Programming Knowledge : You need to have a good grasp of programming concepts such as data structures and algorithms. Commonly used languages are Python, R, Java and Scala. C++ is also used in some places where performance is extremely important.
  • Relational Databases : You need to know SQL and a relational database system such as Oracle so that you can fetch the required data whenever it is needed.
  • Non Relational Databases : There are many types, but the most commonly used are :
    i) Column: Cassandra, HBase
    ii) Document : MongoDB, CouchDB
    iii) Key value: Redis, Dynamo
  • Distributed Computing : This is one of the most important skills for handling large amounts of data, because we cannot process that much data on a single system. The main tools used are Apache Hadoop and Spark. Hadoop has two main parts: HDFS, i.e. the Hadoop Distributed File System, which stores data across a cluster of machines, and MapReduce, which is used to process that data. MapReduce programs can be written in Java or Python. There are many other tools as well, such as Pig and Hive.
  • Machine Learning : This is one of the most important parts of data science and one of the hottest research topics, so new developments appear every year. At a minimum, you need to know the common supervised and unsupervised learning algorithms. There are many libraries available in Python and R. List of Python libraries :
    i) Basic Libraries: NumPy, SciPy, Pandas, IPython, Matplotlib
    ii) Libraries for Machine Learning: scikit-learn, Theano, TensorFlow
    iii) Libraries for Data Mining & Natural Language Processing: Scrapy, NLTK, Pattern
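The map-reduce idea from the Distributed Computing point above can be sketched in plain Python. This is only a single-machine toy that imitates the map, shuffle and reduce phases on a word count; it does not use Hadoop or any of its APIs, and the function names and sample sentences are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Made-up input; in a real cluster each line could live on a different machine.
lines = ["big data needs big tools", "data science uses data"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the framework handles the shuffle, which is why the work has to be expressed as independent per-line and per-key functions like the ones here.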
  3. Domain Knowledge
    Most people ignore this, thinking it’s not important, but it is very important. The whole purpose of data science is to extract useful insights from data so that they can benefit the company’s business. If you don’t understand the business side of your company, that is, how its business model works and how you can make it better, then you are of no use to the company. You need to know how to ask the right questions of the right people so that you can get the information you need. There are visualization tools used on the business end, such as Tableau, which help you display your results in a nontechnical format, such as graphs or pie charts, that business people can understand.