The cross-entropy loss for softmax is given by:
$$l(y, \hat{y}) = -\sum_i y_i \log(\text{softmax}(o_i)),$$
where $\text{softmax}(o_i) = \frac{e^{o_i}}{\sum_j e^{o_j}}$.
Let's denote the loss as $L = -\sum_i y_i \log(\text{softmax}(o_i))$.
First Derivative: The first derivative of $L$ with respect to $o_j$ is:
$$\frac{\partial L}{\partial o_j} = \text{softmax}(o_j) - y_j$$
Second Derivative: The second derivative of $L$ with respect to $o_j$ is:
$$\frac{\partial^2 L}{\partial o_j^2} = \frac{\partial}{\partial o_j} (\text{softmax}(o_j) - y_j) = \frac{\partial\, \text{softmax}(o_j)}{\partial o_j} = \text{softmax}(o_j)(1 - \text{softmax}(o_j))$$
For each class $j$, the indicator of drawing class $j$ from the distribution given by $\text{softmax}(o)$ is a Bernoulli random variable with success probability $p_j = \text{softmax}(o_j)$, so its variance is: $$\mathrm{Var}[X_j] = E[X_j^2] - E[X_j]^2 = \text{softmax}(o_j)(1 - \text{softmax}(o_j)),$$ which matches the second derivative above.
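As a quick sanity check, here is a minimal NumPy sketch (the logits `o` and the one-hot label `y` are made up for illustration) comparing these analytic derivatives against finite differences:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())            # shift by the max for numerical stability
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o)))

o = np.array([1.0, -0.5, 2.0])         # arbitrary logits
y = np.array([0.0, 0.0, 1.0])          # one-hot label
eps, I = 1e-4, np.eye(3)

# First derivative: central differences vs. softmax(o) - y
grad_fd = np.array([(loss(o + eps * I[j], y) - loss(o - eps * I[j], y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(grad_fd, softmax(o) - y, atol=1e-6))

# Diagonal of the Hessian vs. softmax(o) * (1 - softmax(o))
hess_fd = np.array([(loss(o + eps * I[j], y) - 2 * loss(o, y) + loss(o - eps * I[j], y)) / eps**2
                    for j in range(3)])
print(np.allclose(hess_fd, softmax(o) * (1 - softmax(o)), atol=1e-4))
```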
If we design a binary code directly for the three equally likely classes, we might assign the codes 00, 01, and 10. Every symbol then costs 2 bits, while the entropy of the distribution is only $\log_2 3 \approx 1.585$ bits, so this code wastes roughly 0.415 bits per observation.
A better code compresses observations jointly: a block of $n$ independent observations has $3^n$ equally likely outcomes and can be encoded with $\lceil n \log_2 3 \rceil$ bits, so the cost per observation approaches $\log_2 3$ bits as $n$ grows, as the sketch below shows.
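A small sketch of that block-coding argument, assuming equiprobable classes; the printed per-observation cost falls from 2 bits toward the entropy $\log_2 3 \approx 1.585$ bits:

```python
import math

for n in (1, 2, 3, 5, 10, 100):
    bits = math.ceil(n * math.log2(3))   # bits needed for one block of n observations
    print(n, bits / n)                   # cost per observation approaches log2(3) ~ 1.585
```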
The number of ternary digits needed to represent $N$ distinct values is $\lceil \log_3 N \rceil$.
So in this case: $\lceil \log_3 8 \rceil = 2$, since $3^2 = 9 \geq 8$.
This means that just two ternary digits are enough to represent integers in the range $\{0, \ldots, 7\}$ using the $(-1, 0, 1)$ encoding.
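One hypothetical codebook that realizes this: write each integer in base 3 and shift each digit from $\{0, 1, 2\}$ down to $\{-1, 0, 1\}$. This uses 8 of the $3^2 = 9$ available digit pairs (it is a lookup code, not a positional balanced-ternary value):

```python
def to_ternary_pair(n):
    """Map n in {0, ..., 7} to a pair of digits drawn from {-1, 0, 1}."""
    hi, lo = divmod(n, 3)
    return (hi - 1, lo - 1)

for n in range(8):
    print(n, to_ternary_pair(n))   # 8 distinct pairs out of the 9 possible
```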
Using ternary encoding can be a better idea in terms of electronics for a few reasons:
Increased Information Density: Ternary encoding represents more information in a single symbol than binary encoding ($\log_2 3 \approx 1.585$ bits per symbol instead of 1), so at the same symbol rate you can transmit more data in the same amount of time.
Reduced Transmission Errors: Ternary encoding with three signal levels $(-1, 0, 1)$ can provide better noise immunity compared to binary encoding $(0, 1)$. The presence of a zero level allows for better differentiation between signal states, reducing the likelihood of errors due to noise.
Simpler Hardware: Ternary encoding can sometimes simplify hardware design. For instance, in differential signaling systems, where the difference between signal levels matters more than the absolute values, ternary encoding can offer benefits.
Easy to Realize: A physical wire naturally offers three distinguishable states (positive voltage, zero voltage, and negative voltage), which map directly onto the three ternary symbols.
In the Bradley-Terry model, the probability of item $i$ being chosen over item $j$ is given by: $$P(i > j) = \frac{p_i}{p_i+p_j}$$ With only two items and $p_i=\text{softmax}(o_i) = \frac{e^{o_i}}{\sum_k e^{o_k}}$, we have $p_i+p_j=1$, so this simplifies to: $$P(i>j)=p_i=\text{softmax}(o_i)= \frac{e^{o_i}}{e^{o_i}+e^{o_j}}=\frac{1}{1+e^{-(o_i-o_j)}},$$ which is monotonically increasing in $o_i$. This shows that a larger score leads to a higher likelihood of being chosen.
No matter how many choices we have, the probability of choosing item $i$ is: $$p_i=\text{softmax}(o_i)= \frac{e^{o_i}}{\sum_k e^{o_k}}\propto e^{o_i},$$ which is monotonically increasing in $o_i$. The item with the largest score is therefore the most likely one to be chosen.
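A quick illustration with made-up scores, showing that the item with the largest score receives the largest choice probability:

```python
import numpy as np

scores = np.array([0.5, 2.0, 1.0, -1.0])       # arbitrary scores o_i
p = np.exp(scores - scores.max())
p /= p.sum()                                   # softmax over the scores
print(np.round(p, 3))                          # the item with score 2.0 dominates
print(p.argmax() == scores.argmax())           # True: softmax preserves the argmax
```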
Since $$\exp(a) + \exp(b) > \exp(\max(a,b)) > 0$$ and $\log$ is a monotonically increasing function, we obtain: $$RealSoftMax(a,b)=\log(\exp(a) + \exp(b))>\max(a,b)$$
If we set $b = 0$ and assume $a \geq b$ (i.e. $a \geq 0$), the functions become: $$RealSoftMax(a, 0) = \log(\exp(a) + 1)$$ $$\max(a, 0) = a$$ So the difference between the two functions is: $$\text{diff}(a)=\log(\exp(a)+1)-a=\log(1+\exp(-a))$$ As $a$ increases, $\text{diff}(a)$ gets smaller, but it is never exactly zero because $\exp(-a)$ is always strictly greater than $0$.
As $\lambda\gt0$: $$\exp(\lambda{a}) + \exp(\lambda{b}) > \exp(\lambda\max(a,b)) > 0$$ and $\log$ is a monotonically increasing function, we obtain: $$RealSoftMax(\lambda{a},\lambda{b})=\log(\exp(\lambda{a}) + \exp(\lambda{b}))>\lambda\max(a,b),$$ so: $$\lambda^{-1}RealSoftMax(\lambda{a},\lambda{b})>\max(a,b)$$
If we set $b = 0$ and assume $a \geq b$ (i.e. $a \geq 0$), the functions become: $$\lambda^{-1}RealSoftMax(\lambda{a}, 0) = \lambda^{-1}\log(\exp(\lambda{a}) + 1)$$ $$\max(a, 0) = a$$ So the difference between the two functions is: $$\text{diff}(\lambda)=\lambda^{-1}\log(1+\exp(-\lambda{a}))$$ Since $\exp(-\lambda a) \leq 1$ for $a \geq 0$, we have $0 < \log(1+\exp(-\lambda a)) \leq \log 2$, and therefore: $$\lim_{\lambda\to\infty}\text{diff}(\lambda) \leq \lim_{\lambda\to\infty}\frac{\log 2}{\lambda}=0$$ So we have $\lambda^{-1}RealSoftMax(\lambda{a},\lambda{b})\to\max(a,b)$.
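A numerical check of this limit, with arbitrary values $a = 1.3$ and $b = 0.7$, using NumPy's `logaddexp` for $\log(\exp(\cdot)+\exp(\cdot))$:

```python
import numpy as np

a, b = 1.3, 0.7
for lam in (1, 10, 100, 1000):
    val = np.logaddexp(lam * a, lam * b) / lam   # lambda^{-1} RealSoftMax(lambda*a, lambda*b)
    print(lam, val, val - max(a, b))             # the gap to max(a, b) shrinks toward 0
```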
An analogous function to the $RealSoftMax$ function can be defined as follows: $$RealSoftMin(a, b) = -\log(\exp(-a) + \exp(-b)) $$
This function captures the "soft" version of the minimum operation, where the logarithmic function is used to create a smooth transition between the two values.
The concept can be extended to more than two numbers using a similar approach. Given numbers $a_1, a_2, \ldots, a_n$, you can define the $RealSoftMax$ as: $$ RealSoftMax(a_1, a_2, \ldots, a_n) = \log(\sum_{i=1}^{n} \exp(a_i))$$
This function smooths the maximum operation over multiple numbers. An analogous function, $RealSoftMin$, can also be defined for the "soft" version of the minimum operation over more than two numbers: $RealSoftMin(a_1, \ldots, a_n) = -\log(\sum_{i=1}^{n} \exp(-a_i))$.
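A short sketch of these $n$-argument versions, assuming SciPy is available: $RealSoftMax$ is `logsumexp(x)` and $RealSoftMin$ is `-logsumexp(-x)`:

```python
import numpy as np
from scipy.special import logsumexp

x = np.array([0.2, 1.5, -0.3, 1.4])
print(logsumexp(x), x.max())      # the soft max lies slightly above the true max
print(-logsumexp(-x), x.min())    # the soft min lies slightly below the true min
```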
To prove that the function $g(x) = \log\sum_i{\exp(x_i)}$ is convex, we can analyze its second derivatives. First, let's find the first and second derivatives of $g(x)$:
The first derivative: $$ \frac{\partial g(x)}{\partial x_i} = \frac{\exp(x_i)}{\sum_j{\exp(x_j)}}=\text{softmax}(x_i) $$
The second derivative: $$ \frac{\partial^2 g(x)}{\partial x_i^2} = \text{softmax}(x_i)(1-\text{softmax}(x_i)) \geq 0$$
More generally, the full Hessian is $\text{diag}(p) - pp^\top$ with $p = \text{softmax}(x)$, and for any vector $v$ we have $v^\top(\text{diag}(p) - pp^\top)v = \sum_i p_i v_i^2 - \left(\sum_i p_i v_i\right)^2 \geq 0$ by Jensen's inequality. The Hessian is therefore positive semidefinite, which means that the function is convex.
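A numerical sanity check of this positive semidefiniteness, using random logits (the seed and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
p = np.exp(x - x.max())
p /= p.sum()                                  # p = softmax(x)
H = np.diag(p) - np.outer(p, p)               # Hessian of log-sum-exp at x
print(np.linalg.eigvalsh(H).min() >= -1e-12)  # True: no negative eigenvalues (up to round-off)
```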
If we choose $b = \max_i{x_i}$, we can rewrite $g(x)$ as: $$ g(x) = \log\sum_i{\exp(x_i)} = \log\left(\exp(b) \cdot \sum_i{\exp(x_i - b)}\right) = b + \log\sum_i{\exp(x_i - b)} $$ In this form the largest value $b$ is subtracted from every $x_i$, so each exponent is at most $0$ and no $\exp$ can overflow, which removes the numerical instability caused by large exponentials. This trick is widely used in practice to compute $g(x)$ stably.
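A minimal sketch of the max-subtraction trick with deliberately large inputs: the naive evaluation overflows to `inf`, while the shifted version stays finite:

```python
import numpy as np

x = np.array([1000.0, 1001.0, 999.0])
naive = np.log(np.sum(np.exp(x)))             # exp overflows, result is inf (with a warning)
b = x.max()
stable = b + np.log(np.sum(np.exp(x - b)))    # all exponents are <= 0, so no overflow
print(naive, stable)                          # inf vs. roughly 1001.41
```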
When we talk about adjusting the "temperature" of a probability distribution, we are referring to a concept commonly used in the context of the softmax function, especially in machine learning and optimization. The softmax function is often used to transform a vector of real values into a probability distribution. The parameter $T$, also referred to as "temperature," controls the shape of the resulting distribution.
The softmax function with temperature is defined as $Q(i) = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}}$, where the $x_i$ are the input values and $T$ is the temperature parameter. If $P(i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ is the distribution at $T = 1$ and $Q(i)\propto P(i)^\alpha$ for some $\alpha\gt0$, then: $$\frac{Q(i)}{Q(j)}=\left(\frac{P(i)}{P(j)}\right)^\alpha=e^{\alpha(x_i-x_j)}=e^{\frac{x_i-x_j}{T}}$$ This implies that the exponent $\alpha$ is the inverse of the temperature: $$\alpha = \frac{1}{T}$$
As the temperature $T$ approaches $0$, the softmax function approaches a step (argmax) function. In the limit as $T$ goes to $0$, the softmax function assigns all probability mass to the maximum element and zero probability to the other elements. This is because as $T$ becomes very small, the exponential term with the largest value dominates the denominator, and all other terms become negligible.
As the temperature $T$ approaches $\infty$, the softmax output becomes more uniform: the exponents $x_i/T$ all approach $0$, so the probabilities of all elements converge towards equal values. In this case, the output distribution approaches a uniform distribution, where all elements have roughly the same probability.
In summary, adjusting the temperature parameter $T$ in the softmax function affects the shape and concentration of the resulting probability distribution. Higher values of $T$ "soften" the distribution, making it more uniform, while lower values "sharpen" the distribution, emphasizing the maximum value. As $T$ approaches 0, the distribution becomes more focused on the maximum value, and as $T$ approaches $\infty$, the distribution becomes more uniform.
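An illustration with made-up logits of how the temperature reshapes the output: small $T$ concentrates the mass on the argmax, large $T$ flattens the distribution toward uniform:

```python
import numpy as np

def softmax_T(x, T):
    z = x / T
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
for T in (0.1, 1.0, 10.0, 100.0):
    print(T, np.round(softmax_T(x, T), 3))   # from nearly one-hot to nearly uniform
```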