Nathan Wiegand

Softmax observations


The usual interpretation of softmax is that it converts a vector of reals into a probability distribution, where term i reflects its relative likelihood versus the rest. But, let's play.

\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}
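
As a quick sanity check, here's a minimal sketch of that definition (assuming numpy; not hardened against overflow for large inputs):

```python
import numpy as np

def softmax(z):
    # Direct transcription of the definition: exponentiate each term,
    # then normalize by the sum of the exponentials.
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([0.0, 1.0])))  # [0.26894142 0.73105858]
```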

Consider the 2-term case:

\begin{aligned}
\text{softmax}(z)_0 &= \frac{e^{z_0}}{e^{z_0} + e^{z_1}} \\
&= \frac{e^{-z_0}}{e^{-z_0}} \cdot \frac{e^{z_0}}{e^{z_0} + e^{z_1}} \\
&= \frac{1}{1 + e^{z_1 - z_0}}
\end{aligned}

I find this surprising. It means the softmax value isn't a function of the individual terms; it's a function of the difference between them.

\begin{aligned}
\text{softmax}([0, 1])_0 &= \text{softmax}([1000, 1001])_0 \\
&= 0.2689414214
\end{aligned}

Intuitively, 1000 and 1001 feel “closer” to me than 0 and 1.
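
Here's a quick numerical check of that equality, using the reduced two-term form from the derivation above (the naive definition would overflow on e^1000 in float64, so the difference form is the practical way to compute it):

```python
import numpy as np

def softmax0_two_term(z0, z1):
    # Reduced 2-term form: softmax(z)_0 = 1 / (1 + e^(z1 - z0)).
    # Only the difference z1 - z0 matters, not the raw magnitudes.
    return 1.0 / (1.0 + np.exp(z1 - z0))

print(softmax0_two_term(0.0, 1.0))        # 0.2689414213699951
print(softmax0_two_term(1000.0, 1001.0))  # 0.2689414213699951 -- identical
```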

What if we norm by the min? (Let’s ignore the 0 case for the moment.)

\text{Let } m = \min_k(z_k)

\text{softmax}_{\text{min.\,norm}}(z)_i = \frac{e^{z_i/m}}{\sum_{j=1}^{N} e^{z_j/m}}

Let’s consider just the 1000 and 1001 case.

\begin{aligned}
\text{softmax}_{\text{min.\,norm}}(1000, 1001)_0 &= \frac{e^{1000/1000}}{e^{1000/1000} + e^{1001/1000}} \\
&= \frac{e}{e + e^{1.001}} \\
&= 0.49975
\end{aligned}

Depending on the problem domain, this feels more intuitive than 0.269. I suppose we can think of this as the prior on the domain.
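
And a minimal sketch of that min-normalized variant (assuming numpy, and assuming all entries are strictly positive so dividing by the min is well-defined):

```python
import numpy as np

def softmax_min_norm(z):
    # Divide every term by the minimum before exponentiating,
    # then normalize as in ordinary softmax.
    # Assumes all entries are strictly positive.
    m = np.min(z)
    exp_z = np.exp(z / m)
    return exp_z / exp_z.sum()

print(softmax_min_norm(np.array([1000.0, 1001.0])))  # ~[0.49975 0.50025]
```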