The usual interpretation of softmax is that it converts a vector of reals into a probability distribution in which term $i$ reflects its relative likelihood versus the rest. But, let's play.

\[\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}\]
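That translates directly into code. Here's a minimal sketch in plain Python (the function name `softmax` is mine, not from any library):

```python
import math

def softmax(z):
    """Plain softmax: exponentiate each term, normalize by the sum."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0.0, 1.0]))  # [0.2689..., 0.7310...]
```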

Consider the two-term case: \[\begin{aligned} \text{softmax}(z)_0 &= \frac{e^{z_0}}{e^{z_0} + e^{z_1}} \\ &= \frac{e^{-z_0}}{e^{-z_0}} \cdot \frac{e^{z_0}}{e^{z_0} + e^{z_1}} \\ &= \frac{1}{1 + e^{z_1-z_0}} \end{aligned}\]
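A quick numeric check of that identity, with `z0` and `z1` chosen arbitrarily:

```python
import math

z0, z1 = 2.0, 5.0
lhs = math.exp(z0) / (math.exp(z0) + math.exp(z1))
rhs = 1.0 / (1.0 + math.exp(z1 - z0))
print(lhs, rhs)  # both 0.04742..., the two forms agree
```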

I find this surprising. It means the softmax value isn't a function of the two terms individually; it's a function only of the difference between them.

\[\begin{aligned} \text{softmax}([0, 1])_0 &= \text{softmax}([1000, 1001])_0 \\ &= 0.2689414214 \end{aligned}\]
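This also falls out in code. Note that the naive implementation above can't even evaluate the second case, since `math.exp(1000)` overflows; the difference form sidesteps that (and this shift invariance is, incidentally, why standard implementations subtract $\max_k z_k$ before exponentiating). The function name here is mine:

```python
import math

def softmax2_first(z0, z1):
    """First component of the two-term softmax, via the difference form.
    Safe for inputs like 1000 and 1001, where math.exp(1000) would
    raise OverflowError."""
    return 1.0 / (1.0 + math.exp(z1 - z0))

print(softmax2_first(0, 1))        # 0.2689414213699951
print(softmax2_first(1000, 1001))  # 0.2689414213699951 -- identical
```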

Intuitively, 1000 and 1001 feel “closer” to me than 0 and 1.

What if we normalize by the minimum? (Let's ignore the case where the minimum is 0 for the moment.)

\[\text{Let } m = \min_k(z_k)\] \[\text{softmax}_{\text{min-norm}}(z)_i = \frac{e^{z_i/m}}{\sum_{j=1}^{N} e^{z_j/m}}\]

Let’s consider just the 1000 and 1001 case.

\[\begin{aligned} \text{softmax}_{\text{min-norm}}([1000, 1001])_0 &= \frac{e^{1000/1000}}{e^{1000/1000} + e^{1001/1000}} \\ &= \frac{e}{e+e^{1.001}} \\ &\approx 0.49975 \end{aligned}\]
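A sketch of this variant, assuming all entries are strictly positive so the division by $m$ is well-behaved (the function name `softmax_min_norm` is mine):

```python
import math

def softmax_min_norm(z):
    """Divide every term by the minimum before exponentiating.
    Assumes min(z) > 0; the zero case is ignored, as in the text."""
    m = min(z)
    exps = [math.exp(v / m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_min_norm([1000.0, 1001.0])[0])  # 0.49975...
```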

Depending on the problem domain, this feels more intuitive than 0.269. I suppose we can think of the choice of normalization as encoding a prior about the domain.