Today I was thinking about softmax again. This time, I was wondering what would happen if you scaled up one of its terms. How would the other terms go down? Specifically, I was wondering what would happen to the other terms if you averaged the max term with 1.
Let’s start here.
$$
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}
$$
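If you'd rather poke at this numerically than on paper, here is a minimal sketch of the definition above in plain Python. The example logits are made up for illustration, and subtracting the max before exponentiating is a common stability trick that I am assuming here; it does not change the result.

```python
import math

def softmax(z):
    """Return the softmax of a list of logits as a list of probabilities."""
    # Subtracting the max avoids overflow in exp(); the ratios are unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.5]   # illustrative logits, not from the post
probs = softmax(z)
print(probs)          # three probabilities, largest first
print(sum(probs))     # should be 1 up to rounding
```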
Now, let’s assume we scale up the $z_1$ term by adding a constant $c$ to it.
$$
\text{softmax}(z)_1 = \frac{e^{z_1 + c}}{e^{z_1 + c} + \sum_{j=2}^{N} e^{z_j}}
$$
That denominator is just pulling out the $e^{z_1}$ term and scaling it.
That’s just:
$$
\text{softmax}(z)_1 = \frac{e^{c} e^{z_1}}{e^{c} e^{z_1} + \sum_{j=2}^{N} e^{z_j}}
$$
Ok, now let’s do a little bit of rewriting.
Let,
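$$
\begin{aligned}
A &= e^{z_1} \\
B &= \sum_{j=2}^{N} e^{z_j} \\
C &= e^{c} \\
p &= \text{softmax}(\cdot) \text{ before scaling} \\
q &= \text{softmax}(\cdot) \text{ after scaling}
\end{aligned}
$$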
Then we rewrite the above as
$$
\text{softmax}(z)_1 = \frac{e^{c} e^{z_1}}{e^{c} e^{z_1} + \sum_{j=2}^{N} e^{z_j}} = \frac{CA}{CA + B} = q
$$
Now, let’s solve for C.
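$$
\begin{aligned}
q(CA + B) &= CA \\
qCA + qB &= CA \\
qB &= CA - qCA \\
qB &= CA(1 - q) \\
C &= \frac{qB}{A(1 - q)} \\
e^{c} &= \frac{qB}{A(1 - q)} = \frac{B}{A} \cdot \frac{q}{1 - q} = \frac{\sum_{j=2}^{N} e^{z_j}}{e^{z_1}} \cdot \frac{q}{1 - q}
\end{aligned}
$$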
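Note, the summation is missing the first term. Let’s add and subtract $e^{z_1}$ so that the sum starts at $j = 1$ again:
$$
\begin{aligned}
e^{c} &= \frac{-e^{z_1} + \sum_{j=1}^{N} e^{z_j}}{e^{z_1}} \cdot \frac{q}{1 - q} \\
&= \left( -\frac{e^{z_1}}{e^{z_1}} + \frac{\sum_{j=1}^{N} e^{z_j}}{e^{z_1}} \right) \frac{q}{1 - q} \\
&= \left( -1 + \frac{1}{p} \right) \frac{q}{1 - q} \\
&= \frac{1 - p}{p} \cdot \frac{q}{1 - q}
\end{aligned}
$$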
Solving for c,
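$$
c = \ln\left( \frac{1 - p}{p} \cdot \frac{q}{1 - q} \right) = \ln\left( \frac{1 - p}{p} \right) + \ln\left( \frac{q}{1 - q} \right) = \ln\left( \frac{q}{1 - q} \right) - \ln\left( \frac{p}{1 - p} \right)
$$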
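As a quick sanity check on that formula, here is a small Python sketch that computes $c$ for a chosen target $q$ and confirms that adding it to $z_1$ lands the first softmax term on $q$. The helper name `shift_for_target`, the logits, and the target of 0.9 are all made up for illustration.

```python
import math

def softmax(z):
    """Softmax of a list of logits (max-subtracted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def shift_for_target(p, q):
    """Shift c that moves a softmax term from probability p to q.

    Implements c = ln(q / (1 - q)) - ln(p / (1 - p)); needs 0 < p, q < 1.
    """
    return math.log(q / (1 - q)) - math.log(p / (1 - p))

z = [2.0, 1.0, 0.5]            # illustrative logits
p = softmax(z)[0]              # first term before shifting
q = 0.9                        # arbitrary target for the first term
c = shift_for_target(p, q)
shifted = [z[0] + c] + z[1:]   # add c to z_1 only
print(softmax(shifted)[0])     # should be very close to 0.9
```

Written this way, $c$ is just the difference between the log-odds of $q$ and the log-odds of $p$.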
Ok, interesting. We have a way of rescaling the whole thing by pegging one term. Nice.
But, back to my original goal, I want to find the midpoint between the max term and 1.
So,
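$$
q = \frac{1 + p}{2}
$$
$$
\begin{aligned}
c &= \ln\left( \frac{\frac{1 + p}{2}}{1 - \frac{1 + p}{2}} \right) - \ln\left( \frac{p}{1 - p} \right) \\
&= \ln\left( \frac{\frac{1 + p}{2}}{\frac{1 - p}{2}} \right) - \ln\left( \frac{p}{1 - p} \right) \\
&= \ln\left( \frac{1 + p}{1 - p} \right) - \ln\left( \frac{p}{1 - p} \right) \\
&= \ln\left( \frac{1 + p}{p} \right)
\end{aligned}
$$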
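The midpoint result is easy to check numerically. The sketch below (hypothetical helper name `midpoint_shift`, made-up logits) adds $c = \ln\left(\frac{1 + p}{p}\right)$ to $z_1$ and compares the new first term against $\frac{1 + p}{2}$.

```python
import math

def softmax(z):
    """Softmax of a list of logits (max-subtracted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def midpoint_shift(p):
    """Shift c = ln((1 + p) / p) that moves p halfway to 1; needs p > 0."""
    return math.log((1 + p) / p)

z = [2.0, 1.0, 0.5]                     # illustrative logits
p = softmax(z)[0]
c = midpoint_shift(p)
new_p = softmax([z[0] + c] + z[1:])[0]  # first term after shifting z_1
print(new_p, (1 + p) / 2)               # the two values should agree
```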
Ok, at this point I’m feeling pretty damn clever. That was some good algebra I just scratched out.
But then I thought, “I wonder what happens if you just average each term of the softmax output with the vector $[1, 0, \ldots, 0]$?” (assuming that $z_1$ is the maximum value).
Well, I put it all in a spreadsheet, and know what? IT’S EXACTLY THE SAME!
All that algebra, deriving an exact formula for $c$… and I could have just averaged each term!
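For anyone without a spreadsheet handy, the same comparison can be sketched in a few lines of Python. The logits are made up, with the first entry as the largest; the averaging is done on the softmax output, not on the logits themselves.

```python
import math

def softmax(z):
    """Softmax of a list of logits (max-subtracted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.5, -1.0]   # illustrative logits; z[0] is the max
probs = softmax(z)
p = probs[0]

# Route 1: shift z_1 by c = ln((1 + p) / p) and recompute softmax.
c = math.log((1 + p) / p)
via_shift = softmax([z[0] + c] + z[1:])

# Route 2: average the original softmax output with [1, 0, ..., 0].
one_hot = [1.0] + [0.0] * (len(z) - 1)
via_average = [(a + b) / 2 for a, b in zip(probs, one_hot)]

# Print the two results side by side; each row should match.
for s, a in zip(via_shift, via_average):
    print(f"{s:.12f}  {a:.12f}")
```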
Oh well, I’m still happy with the result. And I’m happy that I’ve demonstrated that the thing I thought was a heuristic (averaging terms) is actually exact.
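Here is a short sketch of why the match is exact rather than a coincidence of the numbers tried. It reuses $A$, $B$, $p$ and $c$ from the derivation, and adds two bits of shorthand that are mine: $S = A + B$ for the original denominator (so $A = pS$ and $B = (1 - p)S$), and $p_j$ for the original softmax value of term $j$. With the midpoint shift $e^{c} = \frac{1 + p}{p}$:

```latex
\begin{aligned}
e^{c} A + B &= \frac{1 + p}{p} \, pS + (1 - p)S = (1 + p)S + (1 - p)S = 2S \\
\text{new term } 1 &= \frac{e^{c} A}{2S} = \frac{(1 + p)S}{2S} = \frac{1 + p}{2} \\
\text{new term } j \ge 2 &= \frac{e^{z_j}}{2S} = \frac{p_j}{2}
\end{aligned}
```

The shift doubles the denominator, so every other term is halved, which is exactly what averaging with a zero does. The first term becomes $\frac{1 + p}{2}$, which is exactly what averaging with a one does.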
Yes, my hobbies are awesome and you’re jealous.