Some common Machine Learning, Statistics and Data Science terms that start with H



|Word|Description|
|Hadoop|Hadoop is an open-source distributed processing framework used to handle very large datasets. It spreads the work across a cluster of machines so that big data can be processed in parallel. Here are some significant benefits of Hadoop:
  • Hadoop clusters keep multiple copies of the data to ensure its reliability. Clusters of up to about 4500 machines have been reported
  • The whole process is broken down into pieces and executed in parallel, which saves time. Datasets on the order of 25 petabytes (1 PB = 1000 TB) have been processed using Hadoop
  • For a long-running query, Hadoop builds backup datasets at every level. It also executes the query on duplicate datasets to avoid losing work if an individual machine fails. These steps make Hadoop processing more reliable
  • Queries in Hadoop are as simple as coding in any language. You just need to change the way you think about building a query so that it can be processed in parallel|
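
The idea of breaking a job into pieces, processing them in parallel and combining the results can be illustrated without Hadoop itself. The sketch below is plain Python, not Hadoop's API: the function names `map_chunk` and `word_count` and the sample documents are made up for illustration, and a thread pool stands in for a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, n_chunks=3):
    """Split the input into chunks, map them in parallel, then reduce."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:  # reduce step: merge the partial counts
        total.update(partial)
    return total


# Hypothetical input: three tiny "documents".
docs = ["big data big cluster", "data node", "cluster node node"]
print(word_count(docs))
```

Each chunk is counted independently, which is what makes the work easy to spread over many machines; only the final merge needs to see every partial result.
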
|Hidden Markov Model|A hidden Markov process is a Markov process in which the states are invisible or hidden, and the model developed to estimate these hidden states is known as the Hidden Markov Model (HMM). However, the output (data) that depends on the hidden states is visible. This output data gives some cue about the sequence of states.

HMMs are widely used for pattern recognition in areas such as speech recognition, part-of-speech tagging, handwriting recognition, and reinforcement learning.|
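
One common way to estimate the hidden state sequence from the visible output is the Viterbi algorithm. The following is a minimal sketch, assuming the start, transition and emission probabilities are already known; the states, observations and all numbers are made-up illustrative values.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(row)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: the hidden state is a patient's health, the output is a symptom.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
print(viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p))
# ['Healthy', 'Healthy', 'Fever']
```
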
|Hierarchical Clustering|Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into a single cluster, and this step is repeated. The algorithm terminates when there is only a single cluster left.

The results of hierarchical clustering can be shown using a dendrogram, a tree diagram in which the height of each join reflects the distance between the two clusters being merged.|
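
The merge procedure described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and single linkage (the distance between two clusters is the distance between their closest points); other linkage rules are equally valid, and the points are made up.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Merge the two nearest clusters until one is left; return the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical 2-D points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for left, right, d in agglomerate(points):
    print(f"{left} + {right} at distance {d:.2f}")
```

The recorded merge distances are exactly the heights at which a dendrogram would draw each join.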

|Histogram|A histogram is one of the methods for visualizing the distribution of a continuous variable. For example, a histogram of passenger ages places age along the x-axis and the frequency of the variable (count of passengers) along the y-axis.

Histograms are widely used to judge the skewness of the data. By looking at the tails of the plot, you can see whether the distribution is left-skewed, roughly symmetric (as a normal distribution is), or right-skewed.|
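
As a small illustration, the sketch below bins a list of ages, prints a text histogram and computes a skewness coefficient. The ages are made up, the bin labels assume whole-number ages, and the helper names `histogram` and `skewness` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def histogram(values, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w), keyed by bin start."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def skewness(values):
    """Population skewness: negative means left-skewed, positive means right-skewed."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


ages = [22, 24, 25, 27, 28, 29, 31, 33, 35, 38, 42, 47, 55, 63, 71]  # made-up ages
for start, count in histogram(ages, 10).items():
    print(f"{start:>2}-{start + 9:<2} {'#' * count}")
print("skewness:", round(skewness(ages), 2))
```

Here the bars shrink towards the older ages, giving a long tail on the right, and the skewness coefficient comes out positive, which matches a right-skewed distribution.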

|Hive|Hive is a data warehouse software project for processing structured data in Hadoop. It is built on top of Apache Hadoop to provide data summarization, query, and analysis. Hive gives an SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. Some of the key features of Hive are:

  • Indexing to provide acceleration
  • Different storage types such as plain text, RCFile, HBase, ORC, and others
  • Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution
  • Operating on compressed data stored in the Hadoop ecosystem|

|Holdout Sample|When working with a dataset, a small part of it is not used for training the model; instead, it is used to check the performance of the model. This part of the dataset is called the holdout sample.

For instance, if I divide my data into two parts in a 7:3 ratio, use the 70% to train the model and the other 30% to check its performance, the 30% is called the holdout sample.|
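
The 7:3 split can be sketched as follows. This is a minimal illustration in plain Python; the function name `holdout_split`, the seed and the stand-in data are assumptions, and libraries such as scikit-learn offer ready-made equivalents.

```python
import random


def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle the rows and set aside a fraction of them as the holdout sample."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


data = list(range(100))  # stand-in for 100 labelled examples
train, holdout = holdout_split(data)
print(len(train), len(holdout))  # 70 30
```

Shuffling before splitting matters when the rows are stored in some meaningful order; otherwise the holdout sample may not resemble the training data.
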
|Holt-Winters Forecasting|Holt-Winters is one of the most popular forecasting techniques for time series. The model predicts future values by computing the combined effects of both trend and seasonality. The idea behind Holt-Winters forecasting is to apply exponential smoothing to the seasonal component in addition to the level and trend.|
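
To make the three smoothed components concrete, here is a minimal sketch of the additive form of the method. The initialisation from the first two seasons is only one of several conventions, the smoothing weights `alpha`, `beta` and `gamma` are arbitrary illustrative values, and the quarterly series is made up.

```python
def holt_winters_additive(y, m, alpha=0.5, beta=0.1, gamma=0.1, horizon=4):
    """Additive Holt-Winters: smooth a level, a trend and m seasonal offsets."""
    # Simple initialisation; assumes at least two full seasons (len(y) >= 2 * m).
    level = sum(y[:m]) / m
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / m ** 2
    season = [v - level for v in y[:m]]
    for t in range(m, len(y)):
        prev_level = level
        s = season[t % m]  # seasonal offset from one season ago
        level = alpha * (y[t] - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s
    n = len(y)
    # Forecast h steps ahead: last level, plus h trend steps, plus the matching offset.
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]


# Hypothetical quarterly sales with an upward trend and a yearly pattern.
sales = [30, 21, 29, 41, 34, 25, 33, 45, 38, 29, 37, 49]
print([round(v, 1) for v in holt_winters_additive(sales, m=4)])
```

In practice the three weights are usually chosen by minimising the forecast error on past data rather than set by hand.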

|Hyperparameter|A hyperparameter is a parameter whose value is set before training a machine learning or deep learning model. Different models require different hyperparameters and some require none. Hyperparameters should not be confused with the parameters of the model because the parameters are estimated or learned from the data.

Some key points about hyperparameters are:

  • They are often used in processes to help estimate model parameters.
  • They are often manually set.
  • They are often tuned to tweak a model’s performance.

The number of trees in a Random Forest, eta in XGBoost, and k in k-nearest neighbours are some examples of hyperparameters.|
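
As an illustration of tuning, the sketch below treats k in k-nearest neighbours as the hyperparameter and compares a few values on a small validation set. It is a toy implementation on made-up one-dimensional data, with hypothetical helper names, not a substitute for a library implementation.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


# Made-up data as (feature, label); the point (2.5, "b") is noise among the "a"s.
train = [(1.0, "a"), (2.0, "a"), (2.5, "b"), (3.0, "a"), (6.0, "b"), (7.0, "b"), (8.0, "b")]
validation = [(2.4, "a"), (2.7, "a"), (6.5, "b"), (7.5, "b")]

# k is set by us before prediction, not learned from the data.
for k in (1, 3, 5):
    print(k, accuracy(train, validation, k))
```

With k = 1 the noisy point misleads the classifier, while larger values of k outvote it, which is the kind of effect hyperparameter tuning is meant to reveal.
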
|Hyperplane|A hyperplane is a subspace with one fewer dimension than the space that surrounds it. If a space is 3-dimensional, then its hyperplane is just an ordinary 2D plane. In a 5-dimensional space, a hyperplane is 4-dimensional, and so on.

Most of the time it is simply an ordinary plane, but in some cases, such as Support Vector Machines, where classification is performed with a hyperplane in an n-dimensional feature space, n can be quite large.|
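
A hyperplane can be written as the set of points x satisfying w·x + b = 0, and the sign of w·x + b tells which side of it a point lies on. The sketch below assumes that representation; the function name and the numbers are illustrative only.

```python
def side_of_hyperplane(w, b, x):
    """Return +1, -1 or 0 depending on which side of w.x + b = 0 the point x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (score > 0) - (score < 0)


# A hyperplane in 3-D space is an ordinary 2-D plane; here, x + y + z = 3.
w, b = [1.0, 1.0, 1.0], -3.0
print(side_of_hyperplane(w, b, [2.0, 2.0, 2.0]))  # 1
print(side_of_hyperplane(w, b, [0.0, 0.0, 0.0]))  # -1
print(side_of_hyperplane(w, b, [1.0, 1.0, 1.0]))  # 0

# The same function works unchanged in 5-D, where the hyperplane is 4-dimensional.
print(side_of_hyperplane([1, 0, 0, 0, -1], 0, [3, 9, 9, 9, 1]))  # 1
```

A linear classifier such as a Support Vector Machine labels a point by exactly this kind of sign test once w and b have been learned.
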
|Hypothesis|Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working on. It may or may not be true.|