Maximum Likelihood Estimation

Maximum Likelihood Estimation, simply known as MLE, is a traditional probabilistic approach that can be applied to data belonging to any distribution, i.e., Normal, Poisson, Bernoulli, etc. With prior assumption or knowledge about the data distribution, Maximum Likelihood Estimation helps find the most likely-to-occur distribution parameters. For instance, let us say we have data that is assumed to be normally distributed, but we do not know its mean and standard deviation parameters. Maximum Likelihood Estimation iteratively searches the most likely mean and standard deviation that could have generated the distribution. Moreover, Maximum Likelihood Estimation can be applied to both regression and classification problems.
Therefore, Maximum Likelihood Estimation is simply an optimization algorithm that searches for the most suitable parameters. Since we know the data distribution a priori, the algorithm attempts iteratively to find its pattern. The approach is much generalized, so that it is important to devise a user-defined Python function that solves the particular machine learning problem.

How does Maximum Likelihood Estimation work?

The term likelihood can be defined as the possibility that the parameters under consideration may generate the data. A likelihood function is simply the joint probability function of the data distribution. A maximum likelihood function is the optimized likelihood function employed with most-likely parameters. Function maximization is performed by differentiating the likelihood function with respect to the distribution parameters and set individually to zero.
If we look back into the basics of probability, we can understand that the joint probability function is simply the product of the probability functions of individual data points. With a large dataset, it is practically difficult to formulate a joint probability function and differentiate it with respect to the parameters. Hence MLE introduces logarithmic likelihood functions. Maximizing a strictly increasing function is the same as maximizing its logarithmic form. The parameters obtained via either likelihood function or log-likelihood function are the same. The logarithmic form enables the large product function to be converted into a summation function. It is quite easy to sum the individual likelihood functions and differentiate it. Because of this simplicity in math works, Maximum Likelihood Estimation solves huge datasets with data points in the order of millions!