Why do we generally use the Softmax non-linearity function as the last operation in a network?
The softmax function is a generalisation of the logistic function, which squashes a value into the range (0, 1).
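As a quick sanity check of that claim, here is a minimal NumPy sketch (the value x = 1.7 is just an arbitrary example logit): the two-element softmax over [x, 0] recovers the logistic function.

import numpy as np

x = 1.7                                  # an arbitrary example logit
logistic = 1.0 / (1.0 + np.exp(-x))      # the logistic (sigmoid) function

# Softmax over the two "logits" [x, 0]: its first component equals the logistic of x.
exp_z = np.exp([x, 0.0])
two_class_softmax = exp_z / exp_z.sum()

print(logistic)                # ~0.8455
print(two_class_softmax[0])    # ~0.8455, the same value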
At the final layer of a neural network, the model produces its final activations (a.k.a. logits), which we would like to be able to interpret as probabilities, as that would allow us to e.g. produce a classification result.
The reason for using the softmax is to ensure these values all sum up to 1, thereby fulfilling the constraints of a probability distribution. If we are trying to predict whether a medical image contains cancer or not (simply yes or no), then the probability of a positive result (yes) and a negative result (no) must sum up to one. So the model produces a probability vector over the outcomes, in pseudo-code: p = [yes, no].
If the final logits in this binary classification example were p = [0.03, 1.92], we can see that they sum to 1.95. This is not interpretable as a probability distribution, although we can see that the value is much higher for no. In other examples, where there might be thousands of categories rather than just two, we can no longer assess so easily which is the largest logit. The softmax gives us some perspective and (quasi-)interpretable probabilities for the categories.
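A rough sketch of that step in NumPy (using the example logits from above, and a small softmax helper written here for illustration):

import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())    # subtract the max logit for numerical stability
    return e / e.sum()

logits = np.array([0.03, 1.92])   # raw final-layer outputs: not a probability distribution
probs = softmax(logits)           # ~[0.13, 0.87]: non-negative and summing to 1

print(probs, probs.sum())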
Softmax is often used as the final layer in a network for a classification task. It receives the final representation of the data sample as input and outputs a classification prediction, giving a probability per class (all summing to one).
As a metaphor, you can think of the layers before the Softmax as investigating the input object and producing a detailed description of its features: 60 cm tall, fluffy hair, long floppy ears, long hairy wagging tail, big wet nose, large tongue. The Softmax function takes all these features (encoded as a vector) and responds with an N-long vector, where each value corresponds to a potential class:
0 probability car
0.1 probability cat
0 probability auto
0 probability human
0.76 probability dog
0.14 probability bear
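A minimal sketch of that last step, assuming hypothetical final-layer logits chosen only to roughly reproduce the numbers above (the class names and values are illustrative, not from a real model):

import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())    # shift by the max for numerical stability
    return e / e.sum()

classes = ["car", "cat", "auto", "human", "dog", "bear"]
logits = np.array([-3.0, 1.0, -3.0, -3.0, 3.0, 1.3])   # hypothetical final-layer outputs

probs = softmax(logits)
for name, p in zip(classes, probs):
    print(f"{p:.2f} probability {name}")

print("predicted class:", classes[int(np.argmax(probs))])   # "dog"
print("sum of probabilities:", probs.sum())                  # 1.0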