The usual interpretation of softmax is that it converts a vector of reals into a probability distribution, where term $i$ reflects its relative likelihood versus the rest. But, let’s play.
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$
Consider the 2-term case:
$$\mathrm{softmax}(z)_0 = \frac{e^{z_0}}{e^{z_0}+e^{z_1}} = \frac{e^{-z_0}\,e^{z_0}}{e^{-z_0}\left(e^{z_0}+e^{z_1}\right)} = \frac{1}{1+e^{z_1-z_0}}$$
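We can sanity-check the closed form numerically. This is a small sketch: `softmax2_direct` and `softmax2_closed` are names I've made up for the two sides of the derivation.

```python
import math

def softmax2_direct(z0, z1):
    # Direct two-term softmax for the first entry.
    return math.exp(z0) / (math.exp(z0) + math.exp(z1))

def softmax2_closed(z0, z1):
    # Closed form from the derivation above: 1 / (1 + e^(z1 - z0)).
    return 1.0 / (1.0 + math.exp(z1 - z0))

print(softmax2_direct(0.0, 1.0))  # ≈ 0.2689414214
print(softmax2_closed(0.0, 1.0))  # same value
```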
I find this surprising. It means the softmax value isn’t a function of the two terms themselves; it’s a function of the difference between them.
$$\mathrm{softmax}([0,1])_0 = \mathrm{softmax}([1000,1001])_0 = 0.2689414214$$
Intuitively, 1000 and 1001 feel “closer” to me than 0 and 1.
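A quick demonstration of this shift invariance. Subtracting the max before exponentiating is the standard numerical-stability trick, and it works precisely because of this property; it also keeps `e^1000` from overflowing here.

```python
import math

def softmax(z):
    # Subtract the max before exponentiating. By shift invariance this
    # changes nothing mathematically, but it avoids overflow for inputs
    # like 1000.
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax([0.0, 1.0])[0])        # ≈ 0.2689414214
print(softmax([1000.0, 1001.0])[0])  # identical value
```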
What if we normalize by the min? (Let’s ignore the case where the min is 0 for the moment.)
Let $m = \min_k(z_k)$
$$\mathrm{softmax_{min\text{-}norm}}(z)_i = \frac{e^{z_i/m}}{\sum_{j=1}^{N} e^{z_j/m}}$$
Let’s consider just the 1000 and 1001 case.
$$\mathrm{softmax_{min\text{-}norm}}([1000,1001])_0 = \frac{e^{1000/1000}}{e^{1000/1000}+e^{1001/1000}} = \frac{e}{e+e^{1.001}} \approx 0.49975$$
Depending on what our problem domain is, this feels more intuitive than 0.269. I suppose we can think of this as the prior on the domain.
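A minimal sketch of the min-normalized variant described above; `softmax_min_norm` is a name I've made up, and it assumes all inputs are positive so that dividing by the min is safe.

```python
import math

def softmax_min_norm(z):
    # Min-normalized softmax from the text: divide every input by
    # m = min(z) before exponentiating. Assumes min(z) > 0.
    m = min(z)
    exps = [math.exp(x / m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

print(softmax_min_norm([1000.0, 1001.0])[0])  # ≈ 0.49975
```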