Data Science Different Topics Explanation -- Part-11 -- Poisson Distribution

This is a conversation series explaining different data science topics.
If you have not been following along, the links to the earlier parts are listed below.

Part-1:

Part-2:

Part-3:

Part-4:

Part-5:

Part-6:

Part-7:

Part-8:

Part-9:

Part-10:

Let's start with a quick simulation: draw a sample from a Poisson distribution and plot its histogram together with a kernel density estimate.

from scipy.stats import poisson
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt


def poisson_dist() -> None:
    plt.figure(figsize=(15, 15))
    # Draw 10,000 samples from a Poisson distribution with rate lambda = 4
    data_poisson = poisson.rvs(mu=4, size=10000)

    # Histogram with a KDE overlay (note: sb.distplot is deprecated in recent
    # seaborn releases; sb.histplot is the modern replacement)
    ax = sb.distplot(data_poisson, kde=True, color='b',
                     bins=np.arange(data_poisson.min(), data_poisson.max() + 1),
                     kde_kws={'color': 'r', 'lw': 3, 'label': 'KDE'})
    ax.set(xlabel='Poisson', ylabel='Frequency')
    plt.show()


poisson_dist()
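As a quick sanity check on the sample drawn above (a minimal sketch reusing the same rate mu = 4), the sample mean and sample variance of a Poisson draw should both land close to the rate parameter:

from scipy.stats import poisson

# Draw the same kind of sample as in poisson_dist() above (mu = 4)
sample = poisson.rvs(mu=4, size=10000)

# For a Poisson distribution the mean and the variance are both equal to
# the rate parameter, so both estimates should be close to 4
print("sample mean:    ", sample.mean())
print("sample variance:", sample.var())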

The Poisson distribution is obtained as a limiting case of the binomial distribution by letting p tend to zero and n tend to infinity while keeping their product constant: n · p = λ (a small numerical check of this limit is sketched after the list below). Formally, this passage to the limit leads to the formula

P(X = x) = (λ^x · e^(−λ)) / x!

where:

  • x is a random variable (the number of occurrences of event A);
  • λ is the event rate (the average number of events in an interval), also called the rate parameter; it is equal to both the mean and the variance of the distribution.
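To make the limit concrete, here is a minimal numerical sketch (the values of n and k below are chosen purely for illustration): with n · p held fixed at λ = 4, the binomial probabilities approach the Poisson probability as n grows.

from scipy.stats import binom, poisson

lam = 4   # fixed rate lambda = n * p
k = 3     # number of occurrences to evaluate

# Binomial probabilities P(X = k) for increasing n, with p = lam / n
for n in (10, 100, 1000, 10000):
    p = lam / n
    print(f"n = {n:>6}: binomial pmf = {binom.pmf(k, n, p):.6f}")

# Poisson probability P(X = k) with the same rate
print(f"Poisson pmf (lambda = {lam}): {poisson.pmf(k, lam):.6f}")

The binomial values settle onto the Poisson value as n increases, which is exactly the limiting behaviour described above.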

Continuing with the content that was left out earlier.

Many random variables that arise in science and in everyday practice follow a Poisson distribution: equipment breakdowns, the number of maintenance jobs handled by service staff in a given period, misprints in a text, the number of goals scored by a football team.
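As a small illustration of the last example (the figure of 1.5 goals per match is a hypothetical average, not real data), the Poisson probability mass function gives the chance of a team scoring exactly k goals in a match:

from scipy.stats import poisson

avg_goals = 1.5   # hypothetical average number of goals per match

# Probability of scoring exactly 0, 1, 2 or 3 goals in a match
for k in range(4):
    print(f"P(goals = {k}) = {poisson.pmf(k, avg_goals):.4f}")

# Probability of scoring at least 2 goals: 1 - P(X <= 1)
print(f"P(goals >= 2) = {1 - poisson.cdf(1, avg_goals):.4f}")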

In closing

There are many theoretical distributions: normal, Poisson, Student's t, Fisher's F, binomial and others. Each of them is intended for analysing data of a particular origin and has its own characteristics. In practice these distributions are used as templates for analysing real data of a similar type. In other words, the analyst tries to impose the structure of the chosen theoretical distribution on real data and, in doing so, computes the characteristics of interest.

More precisely, theoretical distributions are probabilistic statistical models whose properties are used to analyse empirical data. The workflow goes roughly like this: data are collected and compared with known theoretical distributions; if there is a good match, the properties of the theoretical model are transferred to the empirical data and the corresponding conclusions are drawn. This approach underlies the classical methods of hypothesis testing (calculation of confidence intervals, comparison of means, testing the significance of parameters, and so on).
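A rough sketch of that comparison step, using simulated data and a chi-square goodness-of-fit test from SciPy (the simulated sample, the fitted rate and the interpretation are illustrative assumptions, not a prescription):

from scipy.stats import poisson, chisquare
import numpy as np

# Simulated "empirical" counts; in practice these would be the observed data
rng = np.random.default_rng(0)
data = rng.poisson(lam=4, size=1000)

# Fit the single Poisson parameter by the sample mean
lam_hat = data.mean()

# Observed frequencies for the counts 0..max_k
max_k = data.max()
observed = np.bincount(data, minlength=max_k + 1)

# Expected frequencies under the fitted Poisson; the last cell absorbs the
# upper tail so that observed and expected totals agree (in a real analysis,
# cells with very small expected counts would also be merged)
expected = poisson.pmf(np.arange(max_k + 1), lam_hat) * data.size
expected[-1] += (1 - poisson.cdf(max_k, lam_hat)) * data.size

# One extra degree of freedom is lost for the estimated parameter (ddof=1)
stat, p_value = chisquare(observed, expected, ddof=1)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3f}")

A large p-value gives no evidence against the Poisson template; a small one suggests the template does not fit, which is when the non-parametric approaches mentioned below become relevant.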

If the available data do not correspond to any known theoretical distribution (which is what usually happens in practice, though for some reason few people care), it is not advisable to force the chosen template (probabilistic-statistical model) onto them. Improper use of parametric distributions puts the analyst in the position of someone searching for lost keys not where they were dropped, but under the street lamp where the light is better. To deal with this problem there are other approaches, based on non-parametric statistics.