In binomial logistic regression the outcome is either 0 or 1, and the goal is to estimate the probability of each. If the probability of the outcome being 1 is $p$, then the probability of it being 0 is $1-p$. This is the Bernoulli distribution, a special case of the binomial distribution with a single trial.

The idea of logistic regression is to cast the problem in the form of a generalized linear model, building on the linear regression equation:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

where $\hat{y}$ is the predicted value, the $x_i$ are the independent variables, and the $\beta_i$ are the coefficients to be learned.

This can be compactly expressed in vector form:

$$w^T = [\beta_0, \beta_1, \dots, \beta_n], \quad x^T = [1, x_1, \dots, x_n], \quad \text{so that} \quad \hat{y} = w^T x$$
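As a small illustration of the vector form, the sketch below evaluates $w^T x$ in plain Python. The coefficient and feature values, and the helper name `linear_predictor`, are made up for the example and are not part of the original derivation.

```python
# Hypothetical coefficients and feature values, chosen only for illustration.
beta = [0.5, -1.2, 2.0]   # w = [beta_0, beta_1, beta_2]
features = [3.0, 1.5]     # x_1, x_2

def linear_predictor(w, xs):
    """Return w^T x, where x is xs with a leading 1 for the intercept term."""
    x = [1.0] + list(xs)
    return sum(wi * xi for wi, xi in zip(w, x))

y_hat = linear_predictor(beta, features)
print(y_hat)  # 0.5 - 1.2 * 3.0 + 2.0 * 1.5, i.e. approximately -0.1
```

Prepending the constant 1 to the features is what lets the intercept $\beta_0$ be treated as just another weight.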

But this is linear regression, whose output ranges over all real numbers, whereas a probability lies between 0 and 1. We therefore need to cast the logistic regression problem so that the expression above can still be used. A first step is to compute the odds of the outcome:

$$\operatorname{odds}(p) = \frac{p}{1-p}$$

This moves a step closer to an unbounded continuous quantity, but the odds only take non-negative values, in $[0, +\infty)$, whereas we need the range $(-\infty, +\infty)$.

That can be done by taking the natural logarithm of the odds:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$$

This now varies continuously over the whole real line, so we can set it equal to the linear predictor:

$$\operatorname{logit}(p) = \hat{y} = w^T x$$

Thus the logit function links the probability modelled by logistic regression to the linear predictor of linear regression, which is why it is called a link function.
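To see why the logarithm is needed, the following sketch tabulates the odds and the logit for a few probabilities. The function names `odds` and `logit` simply mirror the formulas above, and the sample probabilities are arbitrary.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Natural logarithm of the odds."""
    return math.log(odds(p))

# Odds stay positive, while the logit is symmetric around p = 0.5
# and takes both negative and positive values.
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p={p:.2f}  odds={odds(p):8.4f}  logit={logit(p):+8.4f}")
```

Probabilities below 0.5 give odds below 1 and therefore a negative logit, while probabilities above 0.5 give a positive one.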

After we estimate the weights w we can take the inverse of logit to get the probability p as given below:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)$$

Taking the natural exponential on both sides gives:

$$e^{\operatorname{logit}(p)} = e^{\hat{y}} = \frac{p}{1-p}$$

Then adding 1 to both sides gives:

$$e^{\hat{y}} + 1 = \frac{p}{1-p} + 1$$

$$e^{\hat{y}} + 1 = \frac{1}{1-p}$$

Rearranging for $1-p$ gives:

$$1 - p = \frac{1}{e^{\hat{y}} + 1}$$

Then subtracting 1 from both sides gives:

$$-p = \frac{1}{e^{\hat{y}} + 1} - 1$$

Multiplying both sides by $-1$ gives:

$$p = 1 - \frac{1}{e^{\hat{y}} + 1}$$

$$p = \frac{e^{\hat{y}}}{e^{\hat{y}} + 1}$$

Then divide the numerator and denominator by $e^{\hat{y}}$:

$$p = \frac{1}{1 + e^{-\hat{y}}}$$

This is the logistic (sigmoid) function.
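The derivation can be checked numerically. The sketch below, using only the standard library, confirms that the sigmoid undoes the logit and that the two forms of $p$ obtained above agree. The function names and test values are illustrative choices.

```python
import math

def logit(p):
    """log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(y_hat):
    """p = 1 / (1 + e^(-y_hat)).

    Note: very large negative y_hat can overflow math.exp in this
    simple form; it is kept plain here to match the derivation.
    """
    return 1.0 / (1.0 + math.exp(-y_hat))

# Applying the sigmoid to logit(p) should recover p.
for p in (0.1, 0.5, 0.9):
    print(p, sigmoid(logit(p)))

# The intermediate form e^y / (e^y + 1) should match the final form.
y_hat = 1.3
intermediate = math.exp(y_hat) / (math.exp(y_hat) + 1.0)
print(intermediate, sigmoid(y_hat))
```

A linear predictor of 0 maps to a probability of exactly 0.5, which is why 0 is the natural decision boundary on the $\hat{y}$ scale.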

From this point, we can cast logistic regression as a single-layer neural network (NN) with a sigmoid activation function. This means we can estimate the weight parameters using the usual stochastic gradient descent (SGD) by minimizing the cross-entropy loss function, which for a single example is given by:

$$l(p, y) = -y \log(p) - (1 - y) \log(1 - p)$$


where $y \in \{0, 1\}$ is the true label and $p$ is the predicted probability that $y = 1$.

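A minimal sketch of this training procedure is shown below, assuming plain per-example SGD with a fixed learning rate. The toy dataset, the hyperparameters, and the helper names `fit_sgd` and `predict` are all hypothetical. It relies on the standard result that, with a sigmoid output, the gradient of the cross-entropy loss with respect to the weights is $(p - y)\,x$.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, y, eps=1e-12):
    """l(p, y) = -y log(p) - (1 - y) log(1 - p), clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)

def fit_sgd(data, lr=0.1, epochs=200, seed=0):
    """Fit weights [w0, w1, ...] by per-example gradient steps."""
    rng = random.Random(seed)
    data = list(data)                    # copy so the caller's list is not shuffled
    w = [0.0] * (len(data[0][0]) + 1)    # w[0] is the intercept
    for _ in range(epochs):
        rng.shuffle(data)
        for xs, y in data:
            x = [1.0] + list(xs)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # Gradient of the loss with respect to w is (p - y) * x.
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

# Made-up one-feature dataset: larger feature values have label 1.
toy = [([0.0], 0), ([1.0], 0), ([1.5], 0), ([2.5], 1), ([3.0], 1), ([4.0], 1)]
w = fit_sgd(toy)

def predict(xs):
    """Predicted probability that y = 1 for feature list xs."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], xs)))

print(predict([0.0]), predict([4.0]))
```

Starting from zero weights every prediction is 0.5, so the initial loss per example is $\log 2$. Training should push the average loss below that value.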

Multinomial logistic regression then amounts to using one output neuron per class in the layer. Instead of the logistic function, we use softmax as the activation function so that the outputs form a probability distribution over the classes. The loss in the multinomial case with a softmax becomes the negative log-likelihood (categorical cross-entropy) loss.
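The sketch below illustrates the softmax activation and the corresponding negative log-likelihood for a single example. The class scores are hypothetical stand-ins for the per-class linear predictors, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(probs, true_class):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probs[true_class])

scores = [2.0, 1.0, 0.1]                     # hypothetical scores for three classes
probs = softmax(scores)
loss = negative_log_likelihood(probs, true_class=0)
print(probs, loss)

# With two classes and scores [z, 0], softmax reduces to the sigmoid of z.
z = 0.7
two_class = softmax([z, 0.0])[0]
print(two_class, 1.0 / (1.0 + math.exp(-z)))
```

The last two lines show that the binary case is recovered as a special case, which ties the softmax back to the logistic function derived earlier.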