The usual interpretation of softmax is that, given a vector of reals $x$, we can convert it to a probability distribution such that term $i$ is $e^{x_i} / \sum_j e^{x_j}$.
Consider the 2-term case:
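A sketch of the two-term case, writing the two inputs as $x_1$ and $x_2$ (these names are an assumption for illustration). Dividing the numerator and denominator by $e^{x_1}$ gives:

$$
\mathrm{softmax}(x_1, x_2)_1 = \frac{e^{x_1}}{e^{x_1} + e^{x_2}} = \frac{1}{1 + e^{x_2 - x_1}}
$$

For $(x_1, x_2) = (0, 1)$ this is $1 / (1 + e) \approx 0.269$.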
I find this surprising. It means that the softmax value isn’t a function of the two terms individually; it’s a function only of the difference between them.
Intuitively, 1000 and 1001 feel “closer” to me than 0 and 1.
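To see this numerically, here is a minimal sketch in plain Python. The `softmax` helper is my own illustrative function, not from any library; it subtracts the max before exponentiating, which is the standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating so large inputs do not overflow.
    # The result is unchanged because softmax only depends on differences.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

low = softmax([0.0, 1.0])
high = softmax([1000.0, 1001.0])

print([round(p, 3) for p in low])   # [0.269, 0.731]
print([round(p, 3) for p in high])  # [0.269, 0.731]
```

Both pairs give the same distribution, even though 1000 and 1001 differ by only 0.1% while 0 and 1 are about as relatively far apart as two numbers can be.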
What if we normalize by the min first? (Let’s ignore the case where the min is 0 for the moment.)
Let’s consider just the 1000 and 1001 case.
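A rough sketch of that calculation, under the assumption that "norm by the min" means dividing every term by the smallest one before applying softmax. Both `softmax` and `min_normalized_softmax` are hypothetical helper names, and the sketch only makes sense for positive inputs.

```python
import math

def softmax(xs):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def min_normalized_softmax(xs):
    # Assumption: "norm by the min" = divide each term by the smallest term.
    # Undefined when the minimum is 0, the case the text sets aside.
    smallest = min(xs)
    if smallest == 0:
        raise ValueError("minimum is 0; normalization by the min is undefined")
    return softmax([x / smallest for x in xs])

probs = min_normalized_softmax([1000.0, 1001.0])
print([round(p, 5) for p in probs])  # [0.49975, 0.50025]
```

After dividing by the min the inputs become 1 and 1.001, so the softmax sees a difference of 0.001 rather than 1, and the two probabilities come out almost equal.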
Depending on what our problem domain is, this feels more intuitive than 0.269. I suppose we can think of this as the prior on the domain.