Introduction to Softmax

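Softmax regression generalizes the logistic regression model to multi-class problems, in which the class label $y$ can take more than two values. For a long time I wondered where the name comes from: why is it called softmax regression and not something else?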

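Softmax regression is sometimes translated into Chinese as 柔性回归 ("soft regression"). Consider this scenario: when classifying the MNIST dataset, three of the 10 output neurons produce the values 0, 1, and 2. To turn them into a probability distribution, we might normalize linearly to 0/3, 1/3, and 2/3, but this is unsatisfactory. The class whose output is 0 receives probability 0 and can never be chosen, although in practice it should keep some chance of being chosen. Likewise, claiming that the class with output 2 is chosen with probability exactly 2/3 simply because 2 is twice 1 is also not right. Softmax exponentiates the outputs before normalizing, so even an output of 0 maps to a small nonzero probability: small, but still possible. Because the exponential grows faster than linearly, the output 2 also pulls further ahead of the output 1: looking only at these three values, their probabilities stand in the ratio $e \approx 2.72$ rather than 2.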
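The comparison above can be checked numerically. The following is a minimal sketch that only looks at the three scores 0, 1, and 2 from the example; the function names `linear_normalize` and `softmax` are illustrative and not taken from any library.

```python
import math


def linear_normalize(scores):
    """Divide each score by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]


def softmax(scores):
    """Exponentiate each score, then normalize so the results sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0, 1.0, 2.0]
linear = linear_normalize(scores)  # the score 0 gets probability exactly 0
soft = softmax(scores)             # roughly [0.090, 0.245, 0.665]

print("linear :", linear)
print("softmax:", soft)
```

With softmax, the class that scored 0 keeps a probability of about 0.09, and the ratio between the last two probabilities is $e$ instead of 2.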

Derivation of the Formula

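Suppose there are $m$ training samples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$, with input features $x^{(i)}\in \mathcal{R}^{n+1}$ and class labels $y^{(i)} \in \{1,2,\dots,k\}$. The hypothesis function estimates, for each sample, the probability $p(y=j|x)$ that it belongs to each class $j$, that is,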

$$
h_{\theta}(x^{(i)})=
\begin{bmatrix}
p(y^{(i)}=1|x^{(i)};\theta) \\
p(y^{(i)}=2|x^{(i)};\theta) \\
\vdots \\
p(y^{(i)}=k|x^{(i)};\theta)
\end{bmatrix}
=\frac{1}{\sum_{j=1}^{k} e^{\theta^{T}_{j}x^{(i)}}}
\begin{bmatrix}
e^{\theta^{T}_{1}x^{(i)}} \\
e^{\theta^{T}_{2}x^{(i)}} \\
\vdots \\
e^{\theta^{T}_{k}x^{(i)}}
\end{bmatrix}
$$

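Here $\theta$ denotes the collection of parameter vectors $\theta_1,\dots,\theta_k$, with each $\theta_{j} \in \mathcal{R}^{n+1}$. For each sample, the estimated probability of belonging to class $j$ is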
$$
p(y^{(i)}=j|x^{(i)};\theta)=\frac{e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}}
$$
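As a small illustration of this formula, the sketch below evaluates the class probabilities for one sample. The parameter values and the input are made up, the function name `class_probabilities` is hypothetical, and the first entry of `x` is assumed to be a constant 1 acting as the intercept (one common reading of $x \in \mathcal{R}^{n+1}$).

```python
import math


def class_probabilities(theta, x):
    """Return [p(y=1|x), ..., p(y=k|x)] for the parameter vectors in theta."""
    # One score theta_j^T x per class.
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    # Subtracting the maximum score avoids overflow without changing the ratios.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up example with k = 3 classes and n + 1 = 3 features.
theta = [[0.5, -1.0, 0.2],
         [0.1, 0.3, -0.4],
         [-0.6, 0.8, 0.0]]
x = [1.0, 2.0, -1.0]  # assumed intercept term first

probs = class_probabilities(theta, x)
print(probs, "sum =", sum(probs))
```

The probabilities are all positive and sum to 1, and the class with the largest score $\theta_j^{T}x$ receives the largest probability.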

Cost Function

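Introduce the indicator function $I(\cdot)$, which equals 1 when sample $i$ belongs to class $j$ and 0 otherwise. The cost function of softmax regression is then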

$$
J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}I(y^{(i)}=j)\log\frac{e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}}\right]
$$
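The cost can be computed directly from this definition. The sketch below is one possible implementation under the assumption that labels are integers in $1,\dots,k$; the helper names and the tiny data set are invented for illustration. Because the indicator selects a single class per sample, the inner sum reduces to the log-probability of the true class.

```python
import math


def softmax_probs(theta, x):
    """Class probabilities p(y=j|x; theta) for j = 1..k."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cost(theta, xs, ys):
    """Average negative log-likelihood J(theta); ys holds labels in 1..k."""
    total = 0.0
    for x, y in zip(xs, ys):
        probs = softmax_probs(theta, x)
        # I(y == j) is 1 only for j == y, so only that term survives.
        total += math.log(probs[y - 1])
    return -total / len(xs)


# Invented data: 4 samples, 3 features (intercept first), 3 classes.
xs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
ys = [1, 2, 3, 2]
zero_theta = [[0.0, 0.0, 0.0] for _ in range(3)]

initial_cost = cost(zero_theta, xs, ys)
print(initial_cost)
```

With all parameters at zero every class has probability $1/k$, so the cost equals $\log k$, which is a convenient sanity check.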

Solving

To minimize this cost function we use gradient descent. First, take its gradient with respect to $\theta_j$:

$$
\frac{\partial J(\theta)}{\partial \theta_j}=\frac{\partial}{\partial \theta_j}\left\{-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{c=1}^{k}I(y^{(i)}=c)\log\frac{e^{\theta_{c}^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}}\right]\right\}
$$

Each sample $i$ belongs to exactly one class, so only one term of the inner sum over classes is nonzero for it. Consider the log term contributed by a single sample $i$ in two cases.

If $y^{(i)}=j$, then $I(y^{(i)}=j)=1$, and
$$
\begin{split}
\frac{\partial}{\partial \theta_j} \log\frac{e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}} &= \frac{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}}{e^{\theta_{j}^{T}x^{(i)}}} \cdot \frac{e^{\theta^{T}_{j}x^{(i)}} \cdot x^{(i)} \cdot \sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}} - e^{\theta^{T}_{j}x^{(i)}} \cdot x^{(i)} \cdot e^{\theta^{T}_{j}x^{(i)}}}{\left(\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}\right)^2} \\
&= x^{(i)} \cdot \left(1-\frac{e^{\theta^{T}_{j}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}}\right)
\end{split}
$$
If $y^{(i)} = j' \neq j$, then $I(y^{(i)}=j)=0$ and $I(y^{(i)}=j')=1$, and

$$
\begin{split}
\frac{\partial}{\partial \theta_j} \log\frac{e^{\theta_{j'}^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}} &= \frac{\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}}{e^{\theta_{j'}^{T}x^{(i)}}} \cdot \frac{-e^{\theta^{T}_{j'}x^{(i)}} \cdot x^{(i)} \cdot e^{\theta^{T}_{j}x^{(i)}}}{\left(\sum_{l=1}^{k} e^{\theta^{T}_{l}x^{(i)}}\right)^2} \\
&= -\frac{e^{\theta^{T}_{j}x^{(i)}}}{\sum_{l=1}^{k}e^{\theta^{T}_{l}x^{(i)}}} \cdot x^{(i)}
\end{split}
$$

Combining the two cases and summing over all samples gives

$$
\frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m}\sum_{i=1}^{m}\left[x^{(i)} \cdot \left(I(y^{(i)}=j) - p(y^{(i)}=j|x^{(i)};\theta)\right)\right]
$$

Each $\theta_j$ can then be updated by gradient descent using this gradient.
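To make the gradient formula and the update concrete, here is a minimal pure-Python sketch. The data set, the learning rate, the number of steps, and all function names are assumptions chosen for illustration; the update rule is plain batch gradient descent, $\theta_j \leftarrow \theta_j - \alpha \, \partial J / \partial \theta_j$, with $\alpha$ standing for the learning rate.

```python
import math


def softmax_probs(theta, x):
    """Class probabilities p(y=j|x; theta) for j = 1..k."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cost(theta, xs, ys):
    """Average negative log-likelihood; ys holds labels in 1..k."""
    return -sum(math.log(softmax_probs(theta, x)[y - 1])
                for x, y in zip(xs, ys)) / len(xs)


def gradient(theta, xs, ys):
    """dJ/dtheta_j = -(1/m) * sum_i x_i * (I(y_i == j) - p(y_i = j | x_i))."""
    k, dim, m = len(theta), len(theta[0]), len(xs)
    grad = [[0.0] * dim for _ in range(k)]
    for x, y in zip(xs, ys):
        probs = softmax_probs(theta, x)
        for j in range(k):
            indicator = 1.0 if y == j + 1 else 0.0
            coeff = -(indicator - probs[j]) / m
            for d in range(dim):
                grad[j][d] += coeff * x[d]
    return grad


def gradient_descent(theta, xs, ys, learning_rate=0.1, steps=100):
    """Repeat theta_j <- theta_j - learning_rate * dJ/dtheta_j."""
    for _ in range(steps):
        grad = gradient(theta, xs, ys)
        theta = [[t - learning_rate * g for t, g in zip(theta_j, grad_j)]
                 for theta_j, grad_j in zip(theta, grad)]
    return theta


# Invented data: 4 samples, 3 features (intercept first), 3 classes.
xs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
ys = [1, 2, 3, 2]
theta0 = [[0.0, 0.0, 0.0] for _ in range(3)]

trained = gradient_descent(theta0, xs, ys)
print("cost before:", cost(theta0, xs, ys))
print("cost after :", cost(trained, xs, ys))
```

A quick way to gain confidence in a hand-derived gradient like this one is to compare a few of its entries against a finite-difference estimate of the cost.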

Relationship to Logistic Regression

Parameter Properties of Softmax Regression

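Softmax regression has redundant parameters: put simply, some of the parameters are superfluous, because different parameter values produce exactly the same hypothesis. To show this, subtract a fixed vector $\psi$ from every parameter vector $\theta_j$. The hypothesis becomes
$$
\begin{split}
p(y^{(i)}=j|x^{(i)};\theta)&=\frac{e^{(\theta_{j}-\psi)^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l-\psi)^{T}x^{(i)}}} \\
&= \frac{e^{\theta_j^T \cdot x^{(i)}} \cdot e^{-\psi^T \cdot x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} \cdot x^{(i)}}\cdot e^{-\psi^{T} \cdot x^{(i)}}}\\
&=\frac{e^{\theta_{j}^{T}x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T}x^{(i)}}}
\end{split}
$$
The factor $e^{-\psi^{T}x^{(i)}}$ cancels between numerator and denominator, so subtracting $\psi$ from every $\theta_j$ leaves the predicted probabilities unchanged.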
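This invariance is easy to confirm numerically. In the sketch below the parameter vectors, the shift `psi`, and the input are arbitrary made-up values, and `softmax_probs` is an illustrative helper; the only claim being checked is that shifting every $\theta_j$ by the same vector gives the same probabilities up to floating-point error.

```python
import math


def softmax_probs(theta, x):
    """Class probabilities p(y=j|x; theta) for j = 1..k."""
    scores = [sum(t * xi for t, xi in zip(theta_j, x)) for theta_j in theta]
    # Subtracting the maximum score is itself an instance of the same invariance.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


theta = [[0.5, -1.0, 0.2],
         [0.1, 0.3, -0.4],
         [-0.6, 0.8, 0.0]]
psi = [0.7, -0.3, 1.5]  # arbitrary vector subtracted from every theta_j
shifted = [[t - p for t, p in zip(theta_j, psi)] for theta_j in theta]
x = [1.0, 2.0, -1.0]

original = softmax_probs(theta, x)
after_shift = softmax_probs(shifted, x)
max_difference = max(abs(a - b) for a, b in zip(original, after_shift))
print(max_difference)
```

The printed difference should be at the level of rounding error, whatever `psi` is chosen.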

Deriving Logistic Regression from Softmax

Logistic regression is the special case of softmax regression with $k=2$. In that case the softmax hypothesis is

$$
h_{\theta}(x) = \frac{1}{e^{\theta_{1}^{T} \cdot x} + e^{\theta_{2}^{T} \cdot x}}
\begin{bmatrix}
e^{\theta_{1}^{T} \cdot x} \\
e^{\theta_{2}^{T} \cdot x}
\end{bmatrix}
$$

Using the parameter redundancy of softmax regression, let $\psi=\theta_1$ and subtract this vector from both parameter vectors, which gives

$$
\begin{split}
h_{\theta}(x) &= \frac{1}{e^{(\theta_{1}-\psi)^{T} \cdot x} + e^{(\theta_{2}-\psi)^{T} \cdot x}}
\begin{bmatrix}
e^{(\theta_{1}-\psi)^{T} \cdot x} \\
e^{(\theta_{2}-\psi)^{T} \cdot x}
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{1+e^{(\theta_2-\theta_1)^T \cdot x}} \\
\frac{e^{(\theta_2-\theta_1)^T \cdot x}}{1+e^{(\theta_2-\theta_1)^T \cdot x}}
\end{bmatrix}
\end{split}
$$
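This expression agrees with logistic regression: the second component equals $\frac{1}{1+e^{-(\theta_2-\theta_1)^T \cdot x}}$, which is the logistic (sigmoid) function applied to $(\theta_2-\theta_1)^T \cdot x$, and the first component is one minus that value.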
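The equivalence can also be checked with numbers. The sketch below uses made-up parameter vectors and a made-up input; `two_class_softmax` and `sigmoid` are illustrative helper names. It compares the two-class softmax output with the sigmoid of $(\theta_2-\theta_1)^{T}x$.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def two_class_softmax(theta1, theta2, x):
    """Softmax hypothesis for k = 2: returns (p(y=1|x), p(y=2|x))."""
    e1 = math.exp(dot(theta1, x))
    e2 = math.exp(dot(theta2, x))
    return e1 / (e1 + e2), e2 / (e1 + e2)


theta1 = [0.4, -0.2, 0.1]   # made-up values
theta2 = [-0.3, 0.5, 0.9]
x = [1.0, 0.5, -2.0]

p1, p2 = two_class_softmax(theta1, theta2, x)
difference = [t2 - t1 for t1, t2 in zip(theta1, theta2)]
logistic_p2 = sigmoid(dot(difference, x))

print(p2, logistic_p2)
```

Only the difference $\theta_2-\theta_1$ matters, which is why logistic regression needs a single parameter vector for two classes.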

TensorFlow Implementation

# coding=utf-8
# Requires TensorFlow 1.x (uses tf.placeholder, tf.Session and the tutorials MNIST loader).
from __future__ import print_function

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data

# Dataset
mnist = input_data.read_data_sets('/tmp/data/', one_hot=True)
# Hyperparameters
learning_rate = 0.01
training_epochs = 10
batch_size = 100
display_step = 1
# Input data: flattened 28x28 images and one-hot labels
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
# Parameters
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
# Softmax hypothesis; the bias b has to be part of the scores for it to have any effect
pred = tf.nn.softmax(tf.matmul(x, W) + b)
# Loss: because y is one-hot encoded, it can stand in directly for the indicator function I
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), axis=1))
# Gradient with respect to W: -x^T (y - pred). It is summed over the batch rather than
# averaged, so each step is batch_size times larger than with the 1/m formula.
W_grad = -tf.matmul(tf.transpose(x), y - pred)
# Gradient with respect to b: -(y - pred), summed over the batch in the same way
b_grad = -tf.reduce_sum(y - pred, axis=0)
# Update rule for W
new_W = W.assign(W - learning_rate * W_grad)
# Update rule for b
new_b = b.assign(b - learning_rate * b_grad)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = mnist.train.num_examples // batch_size
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Running new_W and new_b applies one manual gradient-descent step
            _, _, c = sess.run([new_W, new_b, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
        if (epoch + 1) % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(avg_cost))
    # Accuracy on the first 3000 test images
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print("Accuracy:", accuracy.eval({x: mnist.test.images[:3000], y: mnist.test.labels[:3000]}))