Some common Machine Learning, Statistics and Data Science terms starting with D

D

### Data Mining

Data mining is the process of extracting useful information from structured and unstructured data taken from various sources. It is usually done for:
  1. Mining for frequent patterns
  2. Mining for associations
  3. Mining for correlations
  4. Mining for clusters
  5. Mining for predictive analysis

Data mining is done for purposes like market analysis, determining customer purchase patterns, financial planning, fraud detection, etc.

### Data Science

Data science is a combination of data analysis, algorithmic development and technology used to solve analytical problems. Its main goal is to use data to generate business value.

### Data Transformation

Data transformation is the process of converting data from one form to another. This is usually done as a preprocessing step.

For instance, replacing a variable x by the square root of x:

| x | sqrt(x) |
|---|---------|
| 1 | 1       |
| 4 | 2       |
| 9 | 3       |
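
A minimal sketch of this transformation in Python with pandas (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Example data with an illustrative column "x"
df = pd.DataFrame({"x": [1, 4, 9]})

# Replace x by its square root as a preprocessing step
df["sqrt_x"] = np.sqrt(df["x"])
print(df)
```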

### Database

A database (abbreviated as DB) is a structured collection of data. The collected information is organised in a way that makes it easily accessible by the computer. Databases are built and managed using database programming languages, the most common of which is SQL.
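
As a minimal sketch, the following Python snippet creates and queries a small SQLite database (the table and column names are illustrative):

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 50000.0), ("Bob", 45000.0)],
)

# Query the structured collection with SQL
for row in conn.execute("SELECT name, salary FROM employees ORDER BY salary DESC"):
    print(row)
conn.close()
```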

### Dataframe

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. A DataFrame accepts many different kinds of input (a construction sketch follows the list):

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame
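
A minimal sketch constructing a pandas DataFrame from a dict of Series (the column names are illustrative):

```python
import pandas as pd

# Build a DataFrame from a dict of Series; columns may have different types
data = {
    "name": pd.Series(["Alice", "Bob", "Carol"]),
    "age": pd.Series([25, 32, 41]),
}
df = pd.DataFrame(data)
print(df.dtypes)  # each column keeps its own type
print(df)
```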

### Dataset

A dataset (or data set) is a collection of data organized into some type of data structure. In a database, for example, a dataset might contain a collection of business data (names, salaries, contact information, sales figures, and so forth). Several characteristics define a dataset’s structure and properties, including the number and types of its attributes or variables and the statistical measures applicable to them, such as standard deviation and kurtosis.

### Dashboard

A dashboard is an information management tool used to visually track, analyze and display key performance indicators, metrics and key data points. Dashboards can be customised to fulfil the requirements of a project and can connect to files, attachments, services and APIs, displaying the results as tables, line charts, bar charts and gauges. Popular tools for building dashboards include Excel and Tableau.

### DBScan

DBSCAN is the acronym for Density-Based Spatial Clustering of Applications with Noise. It is a clustering algorithm that isolates regions of different density by forming clusters: for a given set of points, it groups the points which are closely packed.

The algorithm has two important parameters:

  • the neighborhood distance (eps)
  • the minimum number of points required to form a dense region (minPts)

The steps involved in this algorithm are:

  • Beginning with an arbitrary starting point, extract the neighborhood of this point using the distance parameter
  • If there are sufficient neighboring points around this point, a cluster is formed
  • This point is then marked as visited
  • A new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise
  • This process continues until all points are marked as visited

The image below is an example of DBSCAN on a set of normalized data points:

[Image: DBSCAN clusters on a set of normalized data points]
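
A minimal sketch using scikit-learn's DBSCAN implementation (the data and the eps/min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense regions plus one far-away noise point
points = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # dense region A
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],   # dense region B
    [9.0, 0.0],                           # isolated noise point
])

# eps = neighborhood distance, min_samples = points needed for a dense region
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)  # noise points are labeled -1
```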

### Decision Boundary

In a statistical classification problem with two or more classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two or more sets, one for each class. How well the classifier works depends upon how closely the input patterns to be classified resemble the decision boundary. In the example sketched below, the correspondence is very close, and one can anticipate excellent performance.

[Image: classes separated by lines; the lines separating each class are the decision boundaries]
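
As a sketch of the idea, a logistic regression classifier on 2-D data has a linear decision boundary w1*x1 + w2*x2 + b = 0, which can be read off the fitted coefficients (the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two illustrative classes in 2-D
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The decision boundary is the line w1*x1 + w2*x2 + b = 0
(w1, w2), b = clf.coef_[0], clf.intercept_[0]
print(f"boundary: {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f} = 0")
```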

### Decision Tree

A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population (or sample) into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
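
A minimal sketch of a decision tree classifier in scikit-learn (the features, labels and depth are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: [age, income] -> bought product (0/1)
X = [[22, 20000], [25, 32000], [47, 55000], [52, 76000], [46, 48000]]
y = [0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Inspect the learned splits on the input variables
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 40000]]))
```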

### Deep Learning

Deep learning is associated with machine learning algorithms (artificial neural networks, ANNs) that take inspiration from the human brain to model arbitrary functions. ANNs require vast amounts of data, and the approach is highly flexible when it comes to modeling multiple outputs simultaneously.
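
Deep learning frameworks such as TensorFlow or PyTorch are typical here; as a minimal self-contained sketch, scikit-learn's small neural network can model a function a linear model cannot (the XOR data and architecture are illustrative choices):

```python
from sklearn.neural_network import MLPClassifier

# XOR-like data: not linearly separable, so the hidden layer does real work
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# A tiny network; architecture and solver are illustrative choices
net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X))  # should recover the XOR pattern [0 1 1 0]
```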

### Descriptive Statistics

Descriptive statistics comprises the values that describe the spread and central tendency of data. For example, the mean is a way to represent the central tendency of the data, whereas the IQR is a way to represent its spread.
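
A minimal sketch computing a central-tendency measure and a spread measure with NumPy (the data is illustrative):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()                      # central tendency
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                           # spread
print(f"mean={mean:.2f}, IQR={iqr:.2f}")
```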

### Dependent Variable

A dependent variable is the variable you measure, and it is affected by the independent/input variable(s). It is called dependent because it “depends” on the independent variable. For example, say we want to predict the smoking habits of people: whether the person smokes (“yes” or “no”) is the dependent variable.

### Decile

Deciles divide a series into 10 equal parts. For any series, there are nine deciles, denoted D1, D2, D3, …, D9 and known as the first decile, second decile, and so on.

For example, the diagram below shows the health score of patients on a scale from 0 to 60; nine deciles split the patients into 10 groups.

[Image: nine deciles splitting patient health scores into 10 groups]
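
A minimal sketch computing the nine deciles with NumPy (the scores are illustrative):

```python
import numpy as np

# Illustrative health scores in the range 0 to 60
scores = np.random.default_rng(0).uniform(0, 60, size=100)

# The nine deciles D1..D9 are the 10th, 20th, ..., 90th percentiles
deciles = np.percentile(scores, np.arange(10, 100, 10))
print(np.round(deciles, 1))
```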

### Degree of Freedom

The degrees of freedom are the number of values in a calculation that are free to take arbitrary values.

For example, in a sample of size 10 with mean 10, 9 values can be arbitrary but the 10th value is forced by the sample mean: we can choose any numbers for 9 of the values, but the 10th value must be such that the mean is 10. So the degrees of freedom in this case are 9.
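
A minimal sketch of the worked example above, with 9 arbitrary values and the 10th forced by the sample mean:

```python
import numpy as np

target_mean, n = 10.0, 10
free_values = np.array([8, 12, 9, 11, 10, 7, 13, 10, 9], dtype=float)  # 9 arbitrary choices

# The 10th value is fully determined by the required mean
forced = n * target_mean - free_values.sum()
sample = np.append(free_values, forced)
print(forced, sample.mean())  # the mean is exactly 10
```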

### Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables: data with a vast number of dimensions is converted into data with fewer dimensions while ensuring that it conveys similar information concisely. Some of the benefits of dimensionality reduction (a PCA sketch follows this list):

  • It helps compress the data and reduces the storage space required
  • It reduces the time required to perform the same computations
  • It takes care of multicollinearity, which improves model performance, and removes redundant features
  • Reducing the dimensions of data to 2D or 3D allows us to plot and visualize it
  • It also helps with noise removal, which can improve model performance
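
A minimal sketch using principal component analysis (PCA), one common way to obtain a set of principal variables (the data is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 3-D data whose third column is nearly redundant
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=100)])

# Project onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (100, 2): fewer dimensions, similar information
```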

### Dplyr

dplyr is a popular data manipulation package in R. It makes data manipulation, cleaning and summarizing very user friendly. dplyr works not only with local datasets but also with remote database tables, using exactly the same R code.

It can be easily installed using the following code from the R console:

install.packages("dplyr")

### Dummy Variable

A dummy variable is another name for a Boolean variable. It takes the value 0 or 1: 1 means the condition is true (e.g. age < 25) and 0 means it is false (e.g. age >= 25).
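
A minimal sketch creating a dummy variable with pandas (the column name and threshold are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 23, 31, 40]})

# Dummy variable: 1 if age < 25 (condition true), 0 otherwise
df["under_25"] = (df["age"] < 25).astype(int)
print(df)
```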
