The usual interpretation of softmax is that it converts a vector of reals into a probability distribution, where term $i$ reflects its relative likelihood versus the rest. But, let's play.

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$
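To make the definition concrete, here's a minimal NumPy sketch of it. This is the naive form with no overflow handling, and the function name is my own:

```python
import numpy as np

def softmax(z):
    """Naive softmax, straight from the definition."""
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([0.0, 1.0])))  # [0.26894142 0.73105858]
```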

Consider the 2-term case:

$$\operatorname{softmax}(z)_0 = \frac{e^{z_0}}{e^{z_0} + e^{z_1}} = \frac{e^{z_0}/e^{z_0}}{e^{z_0}/e^{z_0} + e^{z_1}/e^{z_0}} = \frac{1}{1 + e^{z_1 - z_0}}$$

I find this surprising. It means the softmax value isn't a function of the two terms individually; it's a function only of the difference between them.

$$\operatorname{softmax}([0, 1])_0 = \operatorname{softmax}([1000, 1001])_0 = 0.2689414214$$
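Checking this numerically takes one extra step: `np.exp(1000)` overflows a float64, so the naive sketch above returns NaN on `[1000, 1001]`. The standard workaround, subtracting the max from every term, is legal precisely because of the shift invariance just derived:

```python
import numpy as np

def softmax(z):
    # Subtracting a constant from every term doesn't change the result,
    # since softmax depends only on the differences z_j - z_i.
    # Subtracting the max keeps exp() from overflowing.
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([0.0, 1.0]))[0])        # 0.2689414213699951
print(softmax(np.array([1000.0, 1001.0]))[0])  # 0.2689414213699951
```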

Intuitively, 1000 and 1001 feel “closer” to me than 0 and 1.

What if we normalize by the min? (Let's ignore the case where the min is 0 for the moment.)

Let $m = \min_k(z_k)$.

$$\operatorname{softmax}_{\text{min-norm}}(z)_i = \frac{e^{z_i/m}}{\sum_{j=1}^{N} e^{z_j/m}}$$

Let’s consider just the 1000 and 1001 case.

$$\operatorname{softmax}_{\text{min-norm}}([1000, 1001])_0 = \frac{e^{1000/1000}}{e^{1000/1000} + e^{1001/1000}} = \frac{e}{e + e^{1.001}} \approx 0.49975$$
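A sketch of this variant (the name `softmax_min_norm` is mine, and it assumes every entry is strictly positive so the division by the min is safe):

```python
import numpy as np

def softmax_min_norm(z):
    """Softmax of z/m where m = min(z); assumes min(z) > 0."""
    e = np.exp(z / z.min())
    return e / e.sum()

print(softmax_min_norm(np.array([1000.0, 1001.0]))[0])  # ≈ 0.49975
```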

Depending on the problem domain, this feels more intuitive than 0.269. I suppose we can think of the choice of normalization as encoding a prior on the domain.