# Why do we generally use the softmax non-linearity as the last operation in a network?

The softmax function is a generalisation of the logistic function to more than two classes: it squashes a vector of real values into the range (0, 1) in such a way that the results sum to 1.
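To make the "generalisation" concrete, the relationship can be written out. The notation here ($z_i$ for the logits, $K$ for the number of classes, $\sigma$ for the logistic function) is chosen for illustration and is not part of the original answer.

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

With only two classes ($K = 2$), the first output reduces to the logistic function applied to the difference of the two logits:

$$
\mathrm{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)
$$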

At the final layer of a neural network, the model produces its final activations (a.k.a. logits), which we would like to be able to interpret as probabilities, as that would allow us to, e.g., produce a classification result.

The reason for using the softmax is to ensure that the outputs each lie between 0 and 1 and all sum to 1, thereby fulfilling the constraints of a probability distribution. If we are trying to predict whether a medical image contains cancer or not (simply `yes` or `no`), then the probability of a positive result (`yes`) and a negative result (`no`) must sum to one. So the model produces a probability vector with one entry per outcome, in pseudo-code: `p = [yes, no]`.

If the final logits in this binary classification example were `p = [0.03, 1.92]`, we can see that they sum to `1.95`. This is not interpretable as a probability, although we can see the value is much higher for `no`. In other examples, where there might be thousands of categories rather than just two, we can no longer see so easily which logit is the largest or how it compares with the rest. The softmax gives us some perspective and (quasi-)interpretable probabilities for the categories.
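As a rough illustration, here is a minimal sketch of applying the softmax to the example logits `[0.03, 1.92]`. The helper name `softmax` and the max-subtraction trick for numerical stability are choices made for this sketch, not something prescribed by the answer above.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to probabilities that sum to 1."""
    # Subtracting the maximum leaves the result unchanged
    # but avoids overflow in exp() for large logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.03, 1.92]  # [yes, no] from the example above
probs = softmax(logits)
print(probs)       # roughly [0.13, 0.87]
print(sum(probs))  # 1.0 up to floating-point rounding
```

The ordering of the logits is preserved (`no` is still the larger entry), but the values can now be read as roughly a 13% vs. 87% split.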