Nathan Wiegand

Softmax explorations cont'd

(In which I get excited, then feel dumb about math)

Today I was thinking about softmax again. This time, I was wondering what would happen if you scaled up one of its terms. How would the other terms go down? Specifically, I was wondering what would happen to the other terms if you averaged the max term with 1.

Let’s start here.

$$
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}
$$
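For anyone who wants to follow along numerically, here is a minimal sketch of that definition in plain Python. The function name `softmax` and the example logits are my own choices for illustration, and subtracting the max before exponentiating is a standard stability trick that is not part of the formula above (it cancels out in the ratio).

```python
import math

def softmax(z):
    """Softmax of a list of logits.

    Subtracting max(z) first does not change the result; it only
    keeps math.exp from overflowing on large inputs.
    """
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits (arbitrary values chosen for illustration).
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest logit gets the largest probability
print(sum(probs))  # sums to 1 up to floating-point error
```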

Now, let’s assume we scale up the $z_1$ case by adding a constant $c$ to it.

$$
\text{softmax}(z)_1 = \frac{e^{z_1 + c}}{e^{z_1+c}+\sum_{j=2}^{N} e^{z_j}}
$$

That denominator is just pulling out the $e^{z_1}$ term and scaling it.

That’s just:

$$
\text{softmax}(z)_1 = \frac{e^c e^{z_1}}{e^c e^{z_1}+\sum_{j=2}^{N} e^{z_j}}
$$

Ok, now let’s do a little bit of rewriting.

Let,

$$
\begin{aligned}
A &= e^{z_1} \\
B &= \sum_{j=2}^{N} e^{z_j} \\
C &= e^c \\
p &= \text{softmax(.) before scaling} \\
q &= \text{softmax(.) after scaling}
\end{aligned}
$$

Then we rewrite the above as

$$
\begin{aligned}
\text{softmax}(z)_1 &= \frac{e^c e^{z_1}}{e^c e^{z_1}+\sum_{j=2}^{N} e^{z_j}} \\
&= \frac{CA}{CA + B} \\
&= q
\end{aligned}
$$

Now, let’s solve for $C$.

$$
\begin{aligned}
q(CA+B) &= CA \\
qCA + qB &= CA \\
qB &= CA - qCA \\
qB &= CA(1-q) \\
\\
C &= \frac{qB}{A(1-q)} \\
&= \frac{B}{A} \frac{q}{1-q} \\
\\
e^c &= \frac{\sum_{j=2}^{N} e^{z_j}}{e^{z_1}} \frac{q}{1-q}
\end{aligned}
$$

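Note, the summation is missing the first term. Let’s rewrite so that it’s back, remembering that $p$ is $e^{z_1}$ over the full sum, so the full sum over $e^{z_1}$ is $\frac{1}{p}$.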

$$
\begin{aligned}
e^c &= \frac{-e^{z_1} + \sum_{j=1}^{N} e^{z_j}}{e^{z_1}} \frac{q}{1-q} \\
&= \left(\frac{-e^{z_1}}{e^{z_1}} + \frac{\sum_{j=1}^{N} e^{z_j}}{e^{z_1}}\right) \frac{q}{1-q} \\
&= \left(-1 + \frac{1}{p}\right)\frac{q}{1-q} \\
&= \frac{1-p}{p}\frac{q}{1-q}
\end{aligned}
$$

Solving for $c$,

$$
\begin{aligned}
c &= \ln\left(\frac{1-p}{p}\frac{q}{1-q}\right) \\
&= \ln\left(\frac{1-p}{p}\right)+\ln\left(\frac{q}{1-q}\right) \\
&= \ln\left(\frac{q}{1-q}\right) - \ln\left(\frac{p}{1-p}\right)
\end{aligned}
$$
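A quick numerical check of that formula, as a sketch: pick a target $q$, compute $c$ as the difference of the two log-odds, add it to $z_1$, and confirm the first softmax term lands on the target. The helper names `logit` and `shift_for_target`, the example logits, and the target of 0.9 are all made up for illustration.

```python
import math

def softmax(z):
    m = max(z)  # stability shift, cancels in the ratio
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logit(x):
    """Log-odds ln(x / (1 - x)) for 0 < x < 1."""
    return math.log(x / (1 - x))

def shift_for_target(p, q):
    """The c from the derivation: ln(q/(1-q)) - ln(p/(1-p))."""
    return logit(q) - logit(p)

z = [2.0, 1.0, 0.1]   # arbitrary example logits
p = softmax(z)[0]     # first term before shifting
q_target = 0.9        # arbitrary target for the first term

c = shift_for_target(p, q_target)
q = softmax([z[0] + c] + z[1:])[0]

print(c)
print(q)  # should match q_target up to floating-point error
```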

Ok, interesting. We have a way of rescaling the whole thing by pegging one term. Nice.

But, back to my original goal, I want to find the midpoint between the max term and 1.

So,

$$
q = \frac{1+p}{2}
$$

$$
\begin{aligned}
c &= \ln\left(\frac{\frac{1+p}{2}}{1-\frac{1+p}{2}}\right) - \ln\left(\frac{p}{1-p}\right) \\
&= \ln\left(\frac{\frac{1+p}{2}}{\frac{1-p}{2}}\right) - \ln\left(\frac{p}{1-p}\right) \\
&= \ln\left(\frac{1+p}{1-p}\right) - \ln\left(\frac{p}{1-p}\right) \\
&= \ln\left(\frac{1+p}{p}\right)
\end{aligned}
$$
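As a sanity check on the simplification, this small sketch compares the closed form $\ln\left(\frac{1+p}{p}\right)$ against the general log-odds difference with $q = \frac{1+p}{2}$ plugged in. The function names and the sample values of $p$ are assumptions for illustration only.

```python
import math

def logit(x):
    """Log-odds ln(x / (1 - x)) for 0 < x < 1."""
    return math.log(x / (1 - x))

def midpoint_shift(p):
    """Closed form for c when the target is halfway between p and 1."""
    return math.log((1 + p) / p)

# Compare the closed form with the general formula at a few sample values.
for p in (0.1, 0.5, 0.9):
    general = logit((1 + p) / 2) - logit(p)
    print(p, midpoint_shift(p), general)
```

The smaller $p$ is, the larger the shift needed to reach the midpoint, which matches the $\frac{1+p}{p}$ inside the log.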

Ok, at this point I’m feeling pretty damn clever. That was some good algebra I just scratched out.

But then I thought, “wonder what happens if you just average each term of the $\text{softmax}(z)$ vector with the vector $[1, 0, \ldots, 0]$” (assuming that $z_1$ is the maximum value).

Well, I put it all in a spreadsheet, and know what? IT’S EXACTLY THE SAME!

All that algebra, derived an exact formula for $c$… and I could have just averaged each term!

Oh well, I’m still happy with the result. And I’m happy that I’ve demonstrated that the thing I thought was a heuristic (averaging terms) is actually exact.
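Here is roughly what that spreadsheet check looks like as a Python sketch, with arbitrary example logits of my choosing. It shifts $z_1$ by $c = \ln\left(\frac{1+p}{p}\right)$, recomputes softmax, and compares every term against the plain average of the original softmax output with $[1, 0, \ldots, 0]$. The reason it works out: with that $c$, the new denominator $CA + B$ equals $2(A + B)$, so every other term is exactly halved.

```python
import math

def softmax(z):
    m = max(z)  # stability shift, cancels in the ratio
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]  # arbitrary example logits; z[0] is the max
p = softmax(z)

# Route 1: shift the first logit by the derived c and redo softmax.
c = math.log((1 + p[0]) / p[0])
shifted = softmax([z[0] + c] + z[1:])

# Route 2: average the original softmax output with a one-hot vector.
one_hot = [1.0] + [0.0] * (len(z) - 1)
averaged = [(pi + hi) / 2 for pi, hi in zip(p, one_hot)]

print(shifted)
print(averaged)  # agrees with `shifted` up to floating-point error
```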

Yes, my hobbies are awesome and you’re jealous.