

Why Do Neural Networks Need Non-Linear Activation Functions?

AI算法之道

2025-01-17

Overview: why neural networks use non-linear activation functions


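Introduction

In neural networks, layers such as linear layers (Linear/Dense) or convolutional layers (Conv2D) are usually followed by a non-linear activation function such as Sigmoid, Tanh, or ReLU. Consider a neural network with two hidden layers, as shown below.

In the figure above, the input first passes through a linear layer, then the ReLU activation function is applied, and the result is passed on to the second hidden layer, Linear2.

But why do we do this?

Neural networks are used to learn from data in which the relationship between input and output is non-linear. The sections below make this concrete: we will train neural networks with and without a non-linear activation function in PyTorch and visualize the difference between them. Hopefully this gives you some intuition for why non-linearity is necessary in neural networks.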

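Preparing the dataset

To make things concrete, consider the problem of classifying data points into one of two classes. We will use scikit-learn to generate a small dataset. Before diving in, let's import a few libraries: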

import torch
import numpy as np
from lets_plot import *
import pandas as pd

Now let's use the make_moons function from scikit-learn to generate a small dataset and plot it.

from sklearn.datasets import make_moons

# 10,000 noisy points arranged in two interleaving half circles
X, y = make_moons(n_samples=10000, noise=0.2, random_state=10)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])
df['y'] = y
ggplot(df, aes('feature1', 'feature2', color='y')) + geom_point(size=0.7) + scale_color_discrete() + labs(title="Toy dataset")

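The resulting plot is shown below:

The dataset has 2 input features, and each data point belongs to one of two classes. I chose this dataset to highlight the importance of non-linearity.

If you draw a straight line and treat the points on its left as "red" and the points on its right as "blue", some points will always be misclassified, no matter where you place the line.

For example, if we draw a vertical line at 0.5 on the x-axis, many blue points are misclassified as red because they lie to the left of the line, and the same happens to red points on the other side.

If we draw a horizontal line at -0.5 on the y-axis, all red points are classified correctly, but many blue points are misclassified as red.

Put simply, a straight line cannot serve as a suitable decision boundary. This tells us that the relationship between input and output is non-linear, so we need to introduce non-linearity into the model.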
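The claim that no straight line works can be checked with a brute-force search. The sketch below is an illustration, not the article's code: two_moons is a hypothetical, noise-free stand-in for make_moons (so it runs with NumPy alone), and line_accuracy is a helper introduced here. It tries many lines of the form w·x + b = 0 and records the best accuracy any of them reaches.

```python
import numpy as np

def two_moons(n_per_class=50):
    """Hypothetical noise-free stand-in for make_moons: two interleaving half circles."""
    t = np.linspace(0.0, np.pi, n_per_class)
    outer = np.column_stack([np.cos(t), np.sin(t)])              # class 0
    inner = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])  # class 1
    X = np.vstack([outer, inner])
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y

def line_accuracy(X, y, w, b):
    """Accuracy of the rule 'predict class 1 where w.x + b > 0'."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())

X, y = two_moons()
best = 0.0
# Sweep the direction of the line's normal vector and its offset.
for angle in np.linspace(0.0, 2.0 * np.pi, 180, endpoint=False):
    w = np.array([np.cos(angle), np.sin(angle)])
    for b in np.linspace(-2.0, 2.0, 81):
        best = max(best, line_accuracy(X, y, w, b))

print(f"best accuracy of any line on the grid: {best:.2f}")
```

Even without noise, the best line stays below 100% accuracy, because the tip of one half circle sits inside the curve of the other. The noise in the real dataset only makes this worse.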

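The model

Now let's build neural networks with and without non-linearity and see how they differ. I will use PyTorch to create a simple network with two linear layers, fc1 and fc2.

Since our input has 2 features, the first layer takes a (batch_size, 2) tensor as input and produces a (batch_size, 10) tensor as output. If non-linearity is enabled, the output of fc1 is passed through ReLU. The output of ReLU has exactly the same shape as its input, (batch_size, 10).

Next, fc2 produces an output of shape (batch_size, 2), where the first column holds the logits (unnormalized scores) for class 0 and the second column holds the logits for class 1. These logits can be passed to a softmax function (during evaluation) to obtain class probabilities, or through argmax to determine the predicted class.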

class DemoModel(torch.nn.Module):
    def __init__(self, use_relu=False):
        super().__init__()
        self.use_relu = use_relu
        self.fc1 = torch.nn.Linear(2, 10)
        self.fc2 = torch.nn.Linear(10, 2)

    def forward(self, x):
        x = self.fc1(x)
        # The only difference between the two models: an optional ReLU after fc1
        if self.use_relu:
            x = torch.relu(x)
        return self.fc2(x)

linear_model = DemoModel(use_relu=False)
non_linear_model = DemoModel(use_relu=True)
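To illustrate how the logits described above become probabilities and class predictions, here is a minimal NumPy sketch. The logit values are made up for illustration, and softmax is a helper written here rather than PyTorch's implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits of shape (batch_size, 2): column 0 is class 0, column 1 is class 1.
logits = np.array([[2.0, 0.0],
                   [-1.0, 1.0],
                   [0.0, 0.0]])

probs = softmax(logits)        # each row sums to 1
preds = logits.argmax(axis=1)  # predicted class per row

print(probs)
print(preds)
```

Softmax preserves the ordering of the logits, so taking argmax of the logits and of the probabilities gives the same prediction. This is why argmax can be applied to the raw model output.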

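Training

Next, I define a function named train, which is a simple training loop. It uses the torch.optim.AdamW optimizer and torch.nn.CrossEntropyLoss as the loss function.

I also create the training and test datasets along with their data loaders.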

from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

def train(model: torch.nn.Module, train_dl, val_dl, epochs=10):
    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    losses = []
    # Log about five times per run; max() avoids a modulo by zero when epochs < 5
    log_steps = max(1, int(0.2 * epochs))
    for epoch in range(epochs):
        train_loss = 0.0
        model.train()
        for batch_X, batch_y in train_dl:
            optim.zero_grad()
            logits = model(batch_X)
            loss = loss_fn(logits, batch_y)
            loss.backward()
            optim.step()
            # Weight the batch mean by batch size to get a per-sample average below
            train_loss += loss.item() * batch_X.size(0)

        train_loss /= len(train_dl.dataset)

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in val_dl:
                logits = model(batch_X)
                loss = loss_fn(logits, batch_y)
                val_loss += loss.item() * batch_X.size(0)

        val_loss /= len(val_dl.dataset)

        losses.append((train_loss, val_loss))
        if epoch % log_steps == 0 or epoch == epochs - 1:
            print(f'Epoch {epoch+1}/{epochs}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

    return losses

X_train, X_test, y_train, y_test = train_test_split(torch.Tensor(X), torch.LongTensor(y))
train_ds = TensorDataset(X_train, y_train)
test_ds = TensorDataset(X_test, y_test)
train_dl = DataLoader(train_ds, shuffle=True, batch_size=32)
test_dl = DataLoader(test_ds, shuffle=False, batch_size=32)
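For readers unfamiliar with the loss used in train, the following NumPy sketch shows what a cross-entropy loss computes from logits: the mean negative log-probability assigned to the correct class. It is a simplified illustration, not PyTorch's implementation, and cross_entropy is a helper named here for the example.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-softmax probability of the target class."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Equal logits: the model is undecided between the two classes.
uncertain = cross_entropy(np.array([[0.0, 0.0]]), np.array([0]))
# A large margin in favour of the correct class.
confident = cross_entropy(np.array([[5.0, -5.0]]), np.array([0]))

print(f"undecided: {uncertain:.4f}, confident and correct: {confident:.6f}")
```

With two classes, a model that outputs equal logits scores ln 2 ≈ 0.693, which is a useful baseline when reading the loss values printed during training.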

Let's train both models for 50 epochs and look at the training and validation loss curves.

linear_losses = train(linear_model, train_dl, test_dl, epochs=50)
non_linear_losses = train(non_linear_model, train_dl, test_dl, epochs=50)

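The results:

The left plot shows the loss curves of the model without the ReLU activation function, and the right plot shows the loss curves of the model with ReLU. The training and validation losses of the two models differ considerably. The loss of the linear model stops decreasing and plateaus at about 0.29, while the loss of the non-linear model keeps decreasing throughout the epochs.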

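Evaluation

Let's check the predictions of the two models. On the test set, the linear model reaches an accuracy of 0.85 and the non-linear model reaches 0.96, a substantial gap.

In the results above, circles denote class 0 and triangles denote class 1.

The linear model has a sharp linear boundary: points above the boundary are classified as 0 and points below it as 1. As a result, many points are misclassified. In the left plot, all circles should ideally be blue and all triangles red, but that is not the case.

The right plot shows a much better result. The model was able to learn a non-linear boundary and classifies 96% of the data points correctly.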
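The reason the model without ReLU can only draw a straight boundary is that two stacked linear layers are equivalent to a single linear layer: W2(W1·x + b1) + b2 = (W2·W1)·x + (W2·b1 + b2). The NumPy sketch below checks this with random weights of the same shapes as fc1 and fc2. The weights are random stand-ins, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the same shapes as fc1 (2 -> 10) and fc2 (10 -> 2).
W1, b1 = rng.normal(size=(10, 2)), rng.normal(size=10)
W2, b2 = rng.normal(size=(2, 10)), rng.normal(size=2)
x = rng.normal(size=(5, 2))  # a batch of 5 inputs

# Two linear layers applied one after the other, with no activation in between.
two_layers = (x @ W1.T + b1) @ W2.T + b2

# The single linear layer they collapse into.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = x @ W.T + b
same_without_relu = np.allclose(two_layers, one_layer)

# Inserting ReLU between the layers breaks the equivalence.
with_relu = np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2
same_with_relu = np.allclose(with_relu, one_layer)

print(same_without_relu, same_with_relu)
```

However many linear layers are stacked, the result is still one linear map, so the decision boundary remains a straight line. The activation function is what lets the extra layer add expressive power.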

The code for the visualization above:

from sklearn.metrics import classification_report
from lets_plot.mapping import as_discrete

def plot_classification(model, model_name: str):
    # Predicted class = index of the larger logit; no gradients needed for evaluation
    with torch.no_grad():
        preds = model(X_test).argmax(dim=1).numpy()
    y_true = y_test.numpy()
    report_dict = classification_report(y_true, preds, output_dict=True)
    plot_df = pd.DataFrame({"feature1": X_test[:, 0].numpy(), "feature2": X_test[:, 1].numpy(), "y": y_true, "pred": preds})
    title = f"{model_name}"
    subtitle = f"Accuracy: {report_dict['accuracy']:.2}, F1-Score {report_dict['weighted avg']['f1-score']:.2}"
    # Colour encodes the predicted class, shape encodes the actual class
    return ggplot(plot_df) + geom_point(aes('feature1', 'feature2', color=as_discrete('pred'), shape=as_discrete('y')), size=2.5, alpha=0.7) + labs(title=title, subtitle=subtitle, color="Predicted Class", shape="Actual Class")

fig_linear = plot_classification(linear_model, model_name="Linear")
fig_non_linear = plot_classification(non_linear_model, model_name="Non Linear")
bunch = GGBunch()
bunch.add_plot(fig_linear, 0, 0)
bunch.add_plot(fig_non_linear, 600, 0)
bunch

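Visualizing activations

Now let's look at the outputs of the individual layers in the models. I took the first 10 rows of the test set and plotted the output of each layer below. The true label of each data point is shown in the last column for reference.

The charts below cover both the linear and the non-linear model. Since the first layer produces a vector of size 10, each sample yields 10 values, one per neuron, plus a label column at the end.

In both models, the output of fc1 contains a mix of positive and negative values. These come from a linear operation (a matrix multiplication between the input and the layer weights, plus a bias).

When we apply ReLU, however, non-linearity is introduced: negative values are set to 0 while positive values are left unchanged. This means only the positive values contribute to the output of the next layer.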
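The effect of ReLU on a layer's output can be reproduced in a few lines. This is a minimal NumPy sketch with made-up values standing in for the output of fc1, not the actual activations from the trained model.

```python
import numpy as np

def relu(x):
    """ReLU: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

# Made-up pre-activation values for 2 samples and 3 neurons.
fc1_out = np.array([[-1.5, 0.3, 2.0],
                    [0.7, -0.2, -3.1]])

activated = relu(fc1_out)
print(activated)
# [[0.  0.3 2. ]
#  [0.7 0.  0. ]]
```

The shape is unchanged and the positive entries pass through untouched. Only the negative entries are replaced by zero.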

The code for the visualization above:

def plot_activations(activations, labels, title: str):
    df_logits = pd.DataFrame(
        activations, columns=[f"Neuron_{i+1}" for i in range(activations.shape[1])]
    )
    df_logits['Label'] = labels
    df_logits["Sample"] = range(1, len(df_logits) + 1)

    # Long format: one row per (sample, neuron) pair, with the label as an extra column
    df_logits = df_logits.melt(
        id_vars="Sample", var_name="Neuron"
    )
    return (
        ggplot(df_logits, aes("Neuron", as_discrete("Sample")))
        + geom_tile(aes(fill="value"))
        + geom_text(aes(label="value"), label_format=".1", color='black')
        + scale_fill_brewer(type='seq', palette=9)
        + labs(title=title)
    )

labels = y_test[:10].numpy()

# Non-linear model: fc1 -> ReLU -> fc2
with torch.no_grad():
    logits_fc1 = non_linear_model.fc1(X_test[:10])
    logits_fc1_relu = torch.relu(logits_fc1)
    logits_fc2 = non_linear_model.fc2(logits_fc1_relu)

bunch = GGBunch()
bunch.add_plot(
    plot_activations(logits_fc1.numpy(), labels, title="Output of fc1 of Non-Linear Model"), 0, 0, 500, 500)
bunch.add_plot(
    plot_activations(logits_fc1_relu.numpy(), labels, title="Output of fc1 of Non-Linear Model after RELU"), 502, -7, 500, 510)
bunch.add_plot(
    plot_activations(logits_fc2.numpy(), labels, title="Output of fc2"), 872, 9, 500, 478)

display(bunch)

# Linear model: fc1 -> fc2, with fc2 fed the linear model's own fc1 output
bunch = GGBunch()
with torch.no_grad():
    linear_logits_fc1 = linear_model.fc1(X_test[:10])
    linear_logits_fc2 = linear_model.fc2(linear_logits_fc1)

bunch.add_plot(
    plot_activations(linear_logits_fc1.numpy(), labels, title="Output of fc1 of Linear Model"), 0, 0, 500, 500)
bunch.add_plot(
    plot_activations(linear_logits_fc2.numpy(), labels, title="Output of fc2"), 375, 15, 500, 470)
bunch

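Conclusion

Although the use of non-linear activation functions in neural networks is well known, I wanted to offer a concrete visualization that explains why they are needed. Besides ReLU, there are many other activation functions to choose from, such as SELU, GELU, Sigmoid, and Tanh, yet ReLU performs quite well despite its very simple logic. Feel free to try other activation functions and see how the results change.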
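As a starting point for such experiments, here is a NumPy sketch of a few of the activation functions mentioned above, evaluated on the same inputs. These are the standard textbook definitions written as standalone helpers for illustration, not the PyTorch implementations.

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

activations = {"Sigmoid": sigmoid, "Tanh": np.tanh, "ReLU": relu, "GELU": gelu}

x = np.array([-2.0, 0.0, 2.0])
outputs = {name: fn(x) for name, fn in activations.items()}

for name, values in outputs.items():
    print(f"{name:8s}", np.round(values, 3))
```

In DemoModel, trying another activation amounts to replacing the torch.relu call in forward with, for example, torch.sigmoid or torch.tanh.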
