(In which I get excited, then feel dumb about math)

Today I was thinking about softmax again. This time, I was wondering what would happen if you scaled-up one of its terms. How would the other terms go down? Specifically, I was wondering what would happen to the other terms if you averaged the max term with $1$.

Let’s start here.

\[\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}\]

Now, let’s assume we scale up the $z_1$ case.

\[\text{softmax}(z)_1 = \frac{e^{z_1 + c}}{e^{z_1+c}+\sum_{j=2}^{N} e^{z_j}}\]

That denominator is just pulling out the $e^{z_1}$ term and scaling it.

That’s just:

\[\text{softmax}(z)_1 = \frac{e^ce^{z_1}}{e^ce^{z_1}+\sum_{j=2}^{N} e^{z_j}}\]

Ok, now let’s do a little bit of rewriting.


\[\begin{aligned} A &= e^{z_1} \\ B &= \sum_{j=2}^{N} e^{z_j} \\ C &= e^c \\ p &= \text{softmax(.) before scaling} \\ q &= \text{softmax(.) after scaling} \\ \end{aligned}\]

Then we rewrite the above as

\[\begin{aligned} \text{softmax}(z)_1 &= \frac{e^ce^{z_1}}{e^ce^{z_1}+\sum_{j=2}^{N} e^{z_j}} \\ &= \frac{CA}{CA + B} \\ &= q \end{aligned}\]

Now, let’s solve for $C$.

\[\begin{aligned} q(CA+B) &= CA \\ qCA + qB &= CA \\ qB &= CA - qCA \\ qB &= CA(1-q) \\ \\ C &= \frac{qB}{A(1-q)} \\ &= \frac{qB}{A(1-q)} \\ &= \frac{B}{A} \frac{q}{1-q} \\ \\ e^c &= \frac{\sum_{j=2}^{N} e^{z_j} }{ e^{z_1}} \frac{q}{1-q} \\ \end{aligned}\]

Note, the summation is missing the first term. Let’s rewrite so that it’s back

\[\begin{aligned} e^c &= \frac{-e^{z_1} + \sum_{j=1}^{N} e^{z_j} }{ e^{z_1}} \frac{q}{1-q} \\ &= (\frac{-e^{z_1}}{e^{z_1}}+ \frac{\sum_{j=1}^{N} e^{z_j}}{e^{z_1}}) \frac{q}{1-q} \\ &= (-1 + \frac{1}{p})\frac{q}{1-q} \\ &= \frac{1-p}{p}\frac{q}{1-q} \end{aligned}\]

Solving for $c$,

\[\begin{align} c &= \ln({\frac{1-p}{p}\frac{q}{1-q}}) \\ &= \ln{(\frac{1-p}{p})}+\ln{(\frac{q}{1-q})} \\ &= \ln{(\frac{q}{1-q})} -\ln{(\frac{p}{1-p})} \\ \end{align}\]

Ok, interesting. We have a way of scaling everything the whole thing by pegging one term. Nice.

But, back to my original goal, I want to find the midpoint between the max term and $1$.


\[q = \frac{1+p}{2} \\\] \[\begin{aligned} c &=\ln{(\frac{\frac{1+p}{2}}{1-\frac{1+p}{2}})} -\ln{(\frac{p}{1-p})} \\ &= \ln{(\frac{\frac{1+p}{2}}{\frac{1-p}{2}})} -\ln{(\frac{p}{1-p})} \\ &= \ln{({\frac{1+p}{1-p}})} -\ln{(\frac{p}{1-p})} \\ &= \ln{({\frac{1+p}{p}})} \end{aligned}\]

Ok, at this point I’m feeling pretty damn clever. That was some good algebra I just scratched out.

But then I thought, “wonder what happens if you just average each term of the $z$ vector with the vector $[1, 0, … 0]$” (assuming that $z_1$ is the maximum value).

Well, I put it all in a spreadsheet, and know what? IT’S EXACTLY THE SAME!

All that algebra, derived an exactly formula for $c$… and I could have just averaged each term!

Oh well, I’m still happy with the result. And I’m happy that I’ve demonstrated that the thing I thought was a heuristic (averaging terms) is actually exact.

Yes, my hobbies are awesome and you’re jealous.