
【论文_序列转换模型架构_20230802v7】Attention Is All You Need 【Transformer】

news 来源：原创 2025/8/24 4:04:42

https://arxiv.org/abs/1706.03762
20170612 v1

代码实现_notebook


∗Equal contribution. Listing order is random.
Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea.
提出用 self-attention 替代 RNNs，并开始努力评估这一想法。
Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work.
设计并实现了第一个 Transformer 模型，并参与了这项工作的各个核心方面。
Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail.
提出了缩放的点积注意、多头注意和无参数位置表示，并参与了几乎每一个细节。
Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor.
在我们的原始代码库和 tensor2tensor 中设计、实现、调整和评估了无数的模型变体。
Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations.
还试验了新的模型变体，负责我们的初始代码库，以及高效推理和可视化。
Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
花了无数个漫长的日子来设计和实现 tensor2tensor 的各个部分，更换了我们早期的代码库，极大地改善了结果并大大加快了我们的研究。

文章目录

摘要
1 引言
2 背景
3 模型架构
- 3.1 编码器和解码器堆叠
- 3.2 Attention
  - 3.2.1 Scaled Dot-Product Attention
  - 3.2.2 Multi-Head Attention
  - 3.2.3 Attention 在我们的模型中的应用
- 3.3 Position-wise 前馈网络
- 3.4 Embeddings 和 Softmax
- 3.5 Positional Encoding 位置编码
4 Why Self-Attention
5 训练
- 5.1 训练数据和 Batching
- 5.2 硬件和时间表
- 5.3 优化器
- 5.4 正则化
6 结果
- 6.1 机器翻译
- 6.2 模型变体
- 6.3 English Constituency Parsing 英语成分句法分析
7 结论
致谢
参考文献
注意可视化

摘要

↓ 〔一个新的简单的 sequence transduction序列转换模型架构，Transformer：性能更好，更具并行性，需要更少的训练时间。〕

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
主流的序列 transduction 模型是基于复杂的循环或卷积神经网络，包括一个编码器和一个解码器。
The best performing models also connect the encoder and decoder through an attention mechanism.
表现最好的模型还通过注意机制连接编码器和解码器。
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
我们提出了一个新的简单的网络架构，Transformer，完全基于注意机制，完全摒弃循环和卷积。
Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
在两个机器翻译任务上的实验表明，这些模型在质量上更优越，同时更具并行性，并且需要更少的训练时间。
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
我们的模型在 WMT 2014 英语-德语翻译任务上实现了 28.4 BLEU，比现有的最佳结果（包括 ensembles ）提高了 2 个 BLEU 以上。
On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
在 WMT 2014 英法翻译任务中，我们的模型在 8 个 GPUs 上训练 3.5 天后，建立了一个新的单模型最先进的 BLEU 分数 41.8，这是文献中最佳模型的训练成本的一小部分。
We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
我们将 Transformer 成功地应用于训练数据充足和训练数据有限两种情形下的英语成分句法分析，证明了它可以很好地泛化到其它任务。

1 引言

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5].
循环神经网络，特别是长短时记忆[13] 和门控循环[7]神经网络，已经被牢固地确立为序列建模和 transduction 问题（如语言建模和机器翻译）的最先进方法[35,2,5]。
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
从那以后，大量的努力继续推动循环语言模型和编码器-解码器架构的边界[38,24,15]。

Recurrent models typically factor computation along the symbol positions of the input and output sequences.
循环模型通常沿输入和输出序列的符号位置对计算进行分解。
Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$ , as a function of the previous hidden state $h_{t-1}$ and the input for position $t$ .
将位置与计算时间中的步对齐，它们生成一个隐藏状态序列 $h_t$ ，是前一个隐藏状态 $h_{t-1}$ 和位置 $t$ 的输入的函数。
This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
这种固有的顺序性排除了训练示例中的并行化，这在较长的序列长度下变得至关重要，因为内存约束限制了跨示例的 batching。
Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter.
最近的研究通过因式分解技巧[21] 和条件计算[32]显著提高了计算效率，同时也提高了后者的模型性能。
The fundamental constraint of sequential computation, however, remains.
然而，顺序计算的基本约束仍然存在。〔顺序计算 ——> 无法并行〕

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19].
注意机制已经成为各种任务中的序列建模和 transduction 模型的必要组成部分，允许对依赖关系进行建模，而不考虑它们在输入或输出序列中的距离[2,19]。
In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
然而，在除少数情况 [27] 外的所有情况下，这种注意机制都与循环网络结合使用。

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
在这项工作中，我们提出了 Transformer，一种避免循环的模型架构，完全依赖于注意机制来得到输入和输出之间的全局依赖关系。
The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Transformer 允许更多的并行化，并且在 8 个 P100 GPUs 上经过 12 小时的训练后，可以达到翻译质量的新的最先进水平。

2 背景

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.
减少顺序计算的目标也构成了 Extended Neural GPU[16]、ByteNet[18] 和 ConvS2S[9] 的基础，它们都使用卷积神经网络作为基本构建块，并行计算所有输入和输出位置的隐藏表示。
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.
在这些模型中，将两个任意输入或输出位置的信号关联起来所需的运算量随着位置之间的距离增加而增长，ConvS2S 为线性增长，ByteNet 为对数增长。
This makes it more difficult to learn dependencies between distant positions [12].
这使得学习远距离位置之间的依赖关系变得更加困难 [12]。
In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
在 Transformer 中，这被减少到一个恒定的操作数量，尽管其代价是由于平均注意加权的位置而降低了有效分辨率，我们用 3.2 节中描述的多头注意抵消了这一影响。

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
自注意，有时被称为内注意，是一种将单个序列的不同位置联系起来以计算该序列的表示的注意机制。
Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
self-attention 已经被成功地应用于阅读理解、摘要总结、文本蕴涵和学习任务无关的句子表征等多种任务中 [4,27,28,22]。

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
端到端记忆网络基于循环注意机制，而不是顺序对齐的循环，并且在简单语言问答和语言建模任务上表现良好 [34]。

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
然而，据我们所知，Transformer 是第一个完全依赖于 self-attention 来计算其输入和输出表示的 transduction 模型，不使用顺序对齐的 RNNs 或卷积。
In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
在后续部分中，我们将描述 Transformer，阐明采用 self-attention 的动机，并讨论它相对于 [17,18] 和 [9] 等模型的优势。

3 模型架构

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].
大多数有竞争力的神经序列 transduction 模型具有编码器-解码器结构 [5,2,35]。
Here, the encoder maps an input sequence of symbol representations $(x_1, \cdots, x_n)$ to a sequence of continuous representations $\bm{z} = (z_1,\cdots,z_n)$ .
这里，编码器将一个符号表示的输入序列 $(x_1, \cdots, x_n)$ 映射到一个连续表示的序列 $\bm{z} = (z_1,\cdots,z_n)$ 。
Given $\bm{z}$ , the decoder then generates an output sequence $(y_1, \cdots,y_m)$ of symbols one element at a time.
给定 $\bm{z}$ ，解码器然后生成一个符号的输出序列 $(y_1, \cdots,y_m)$ ，每次一个元素。
At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
在每一步中，模型都是自回归的，在生成下一个符号时，使用之前生成的符号作为额外输入。

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
Transformer 遵循这个整体架构，编码器和解码器都使用堆叠的 self-attention、逐点的，完全连接的层，分别如图 1 的左半部分和右半部分所示。

[Figure 1: The Transformer model architecture, encoder on the left and decoder on the right]

3.1 编码器和解码器堆叠

Encoder: The encoder is composed of a stack of $N = 6$ identical layers.
Each layer has two sub-layers.
编码器由 6 个相同的层堆叠而成。每一层有两个子层。
The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
第一个子层是多头自注意机制，第二个子层是简单的、逐个位置完全连接的前馈网络。
We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].
我们在两个子层的每一层周围都使用了一个残差连接[11]，然后是层标准化[1]。
That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
也就是说，每个子层的输出是 LayerNorm(x + Sublayer(x)) ，其中 Sublayer(x) 是子层本身实现的函数。
To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_\text{model}=512$ .
为了方便这些残差连接，模型中的所有子层以及 embedding 层产生维度为 $d_\text{model}=512$ 的输出。
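
The wrapper LayerNorm(x + Sublayer(x)) described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single position, not the authors' implementation: the learned gain and bias of layer normalization are left out, `eps` is an assumed numerical-stability constant, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one position's feature vector to zero mean and unit variance.

    The learned gain and bias are omitted; eps is an assumed stability constant.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    """Hypothetical sub-layer used only for this example."""
    return [0.5 * v for v in x]

out = residual_sublayer([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Because the input and the sub-layer output are added element-wise, both must have the same width, which is why every sub-layer produces outputs of dimension $d_\text{model}$ .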

Decoder: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
解码器也由 6 个相同的层堆叠而成。除了每个编码器层中的两个子层之外，解码器插入第三个子层，该子层对编码器 stack 的输出执行多头注意。
Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
与编码器类似，我们在每个子层周围使用残差连接，然后进行层标准化。
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
我们还修改了解码器 stack 中的自注意子层，以防止位置关注后续位置。
This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$ .
这种 masking，加上输出 embeddings 偏移一个位置的事实，确保了位置 $i$ 的预测只能依赖于小于 $i$ 的位置的已知输出。

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
注意函数可以描述为将 查询和一组键值对 映射到输出，其中查询、键、值和输出都是向量。
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
输出被计算为值的加权和，其中分配给每个值的权重是由查询与相应键的兼容性函数计算的。

3.2.1 Scaled Dot-Product Attention

We call our particular attention “Scaled Dot-Product Attention” (Figure 2).
我们称这种特殊的注意为 “Scaled Dot-Product Attention”（图 2）。
The input consists of queries and keys of dimension $d_k$ , and values of dimension $d_v$ .
输入包括 维度都是 $d_k$ 的查询和键，维度为 $d_v$ 的值。
We compute the dot products of the query with all keys, divide each by $\sqrt {d_k}$ , and apply a softmax function to obtain the weights on the values.
我们计算查询与所有键的点积，每个点积除以 $\sqrt {d_k}$ ，并应用 softmax 函数来获得值的权重。

[Figure 2: Scaled Dot-Product Attention (left) and Multi-Head Attention (right)]

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$ .
在实践中，我们同时计算一组查询的注意函数，它们被打包成一个矩阵 Q。
The keys and values are also packed together into matrices $K$ and $V$ .
键和值也打包成矩阵 $K$ 和 $V$ 。
We compute the matrix of outputs as:
我们计算输出矩阵为：

$\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V~~~~~~~~~~(1)$

Q：queries ，查询
K：keys 键
V： values 值
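
Equation (1) can be traced step by step with a small pure-Python sketch. Matrices are represented as lists of rows, the helper names are chosen for this example only, and the toy values of `Q`, `K` and `V` are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v, each a list of rows.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for q in Q:
        # Dot product of this query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return outputs

# Toy inputs: 2 queries, 3 key-value pairs, d_k = 2, d_v = 3.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
attn_out = scaled_dot_product_attention(Q, K, V)
```

Since `V` is the identity matrix here, each output row equals the attention weights for that query, which makes it easy to see that the weights sum to 1 and that the first query favors the keys it overlaps with.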

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention.
两个最常用的注意函数是 加性注意[2] 和点积（乘法）注意。
Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$ .
除了 scaling factor 为 $\frac{1}{\sqrt{d_k}}$ 之外，点积注意与我们的算法相同。
Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
加性注意使用一个具有单个隐藏层的前馈网络来计算 compatibility function 。
While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
虽然两者在理论复杂性上相似，但在实践中，点积注意更快，更节省空间，因为它可以使用高度优化的矩阵乘法代码来实现。

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3].
虽然对于较小的 $d_k$ 值，这两种机制的表现相似，但对于较大的 $d_k$ 值，加性注意优于未经缩放的点积注意 [3]。
We suspect that for large values of $d_k$ , the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients $^4$ .
我们怀疑，对于较大的 $d_k$ 值，点积的大小会变大，从而将 softmax 函数推入具有极小梯度的区域 $^4$ 。

脚注 4 To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1.
为了说明为什么点积变得很大，假设 $q$ 和 $k$ 的分量是均值为 0，方差为 1 的独立随机变量。
Then their dot product, $q \cdot k=\sum\limits_{i=1}^{d_k}q_ik_i$ , has mean 0 and variance $d_k$ .
它们的点积 $q \cdot k=\sum\limits_{i=1}^{d_k}q_ik_i$ 均值为 0，方差为 $d_k$ 。

To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$ .
为了抵消这个影响，我们将点积乘以 $\frac{1}{\sqrt{d_k}}$ 。
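
The variance argument in footnote 4 can be checked numerically. The sketch below draws the components of $q$ and $k$ from a standard normal distribution (one assumed choice that satisfies "mean 0 and variance 1"), estimates the variance of the raw dot product, and then shows the effect of dividing by $\sqrt{d_k}$ . The trial count and seed are arbitrary.

```python
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate Var(q . k) when all components are independent N(0, 1) samples."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

d_k = 64
raw_var = dot_product_variance(d_k)   # expected to land near d_k
scaled_var = raw_var / d_k            # Var(x / sqrt(d_k)) = Var(x) / d_k, near 1
```

Dividing the scores by $\sqrt{d_k}$ divides their variance by $d_k$ , so the softmax inputs keep roughly unit variance regardless of the key dimension.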

3.2.2 Multi-Head Attention

Instead of performing a single attention function with $d_\text{model}$ -dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$ , $d_k$ and $d_v$ dimensions, respectively.
我们发现，与其使用 $d_\text{model}$ 维的键、值和查询来执行单一的注意函数，不如用不同的、习得的线性投影将查询、键和值分别线性投影 $h$ 次，投影到 $d_k$ 、 $d_k$ 和 $d_v$ 维。
On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$ -dimensional output values.
然后，在查询、键和值的每个投影版本上，我们并行地执行注意函数，产生 $d_v$ 维的输出值。
These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
将它们连接起来并再次进行投影，得到最终值，如图 2 所示。

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
多头注意让模型在不同位置共同注意来自 不同表示子空间 的信息。
With a single attention head, averaging inhibits this.
对于单一注意头，平均会抑制这一点。

$\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\cdots,\text{head}_h)W^O$

其中 $\text{head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)$

其中 projections投影是 $W_i^Q\in {\mathbb R}^{d_\text{model}\times d_k},~~W_i^K\in {\mathbb R}^{d_\text{model}\times d_k},~~W_i^V\in {\mathbb R}^{d_\text{model}\times d_v},~~W^O\in {\mathbb R}^{hd_v\times d_\text{model}}$

In this work we employ $h = 8$ parallel attention layers, or heads.
For each of these we use $d_k=d_v=d_\text{model}/h=64$ .
在这项工作中，我们使用 $h = 8$ 平行注意层，或头。
对于每个 head，我们使用 $d_k=d_v=d_\text{model}/h=64$ 。
Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
由于每个 head 的维数降低，因此总计算成本与全维的 single-head attention单头注意相似。
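
As a rough check on the statement that multi-head attention costs about the same as full-dimensional single-head attention, the sketch below counts the entries of the projection matrices for both layouts, using the shapes given above. Biases are not counted, and the "single-head" baseline with $d_k=d_v=d_\text{model}$ plus an output projection is an assumption made for the comparison.

```python
d_model = 512
h = 8
d_k = d_v = d_model // h   # 64 per head

# Multi-head layout: W_i^Q, W_i^K, W_i^V for each head, plus the shared W^O.
per_head_params = d_model * d_k + d_model * d_k + d_model * d_v
output_params = h * d_v * d_model
multi_head_params = h * per_head_params + output_params

# Assumed single-head baseline: three d_model x d_model projections plus an output projection.
single_head_params = 3 * d_model * d_model + d_model * d_model
```

Both counts come out to $4 \cdot d_\text{model}^2$ , because $h \cdot d_k = d_\text{model}$ : the heads partition the representation rather than multiplying it.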

3.2.3 Attention 在我们的模型中的应用

The Transformer uses multi-head attention in three different ways:
Transformer 以三种不同的方式使用多头注意：
———— In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
在 “编码器-解码器注意”层 中，查询来自前一个解码器层，而记忆键和值来自编码器的输出。
This allows every position in the decoder to attend over all positions in the input sequence.
这使得解码器中的每个位置都注意输入序列中的所有位置。
This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
这模仿了序列到序列模型中典型的编码器-解码器注意机制，如 [38,2,9]。
———— The encoder contains self-attention layers.
编码器包含自注意层。
In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder.
在自注意层中，所有的键、值和查询都来自同一个地方，在这种情况下，是编码器中前一层的输出。
Each position in the encoder can attend to all positions in the previous layer of the encoder.
编码器中的每个位置都可以注意编码器前一层中的所有位置。
———— Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.
类似地，解码器中的自注意层使得解码器中的每个位置注意到解码器中的所有位置直至并包括该位置。
We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.
我们需要阻止解码器中的向左信息流以保持自回归特性。〔 ✅ 自回归特性是啥特性，为什么要求信息不能向左流？↓ 〕
We implement this inside of scaled dot-product attention by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
我们在 scaled dot-product attention（缩放点积注意）内部实现这一点：将 softmax 输入中对应于非法连接的所有值 masking out（令其为 -∞）。参见图 2。
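
The masking step for decoder self-attention can be illustrated as follows. This is a minimal sketch on a square matrix of raw scores (row = query position, column = key position); the function names are chosen for this example, and the all-zero toy scores are made up so that the resulting weights are easy to read.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0, so masked entries get weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Set scores for key positions j > i to -inf, then apply softmax row by row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Toy scores: every query-key pair is equally compatible.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
```

With equal scores, position $i$ (0-based) spreads its weight uniformly over positions 0 through $i$ and assigns exactly zero weight to later positions, so no information flows from the future.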

———————— 补充 Start
自回归特性可以简单理解为 “基于历史生成未来”。例如，在语言生成任务中，模型生成一个词时，只会考虑之前已经生成的词，而不会考虑尚未生成的词。这种特性确保了生成过程的顺序性和因果性。


———————— 补充 End

3.3 Position-wise 前馈网络

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
除了注意子层外，编码器和解码器中的每一层都包含一个全连接的前馈网络，该网络分别相同地应用于每个位置。
This consists of two linear transformations with a ReLU activation in between.
这包括两个线性转换，中间有一个 ReLU 激活。

$\text{FFN}(x)=\text{max}(0, xW_1+b_1)W_2+b_2~~~~~~~~~~(2)$

While the linear transformations are the same across different positions, they use different parameters from layer to layer.
虽然线性变换在不同位置上是相同的，但它们在每一层之间使用不同的参数。
Another way of describing this is as two convolutions with kernel size 1.
另一种描述它的方式是两个核大小为 1 的卷积。
The dimensionality of input and output is $d_\text{model} = 512$ , and the inner-layer has dimensionality $d_\text{ff}=2048$ .
输入和输出的维数 $d_\text{model} = 512$ ，内层的维数 $d_\text{ff}=2048$ 。
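
Equation (2) applied "to each position separately and identically" can be sketched like this. The tiny weight matrices stand in for the real $512 \times 2048$ and $2048 \times 512$ matrices and are made-up values; the final line counts the parameters of one feed-forward sub-layer at the stated sizes.

```python
def linear(x, W, b):
    """Compute xW + b for a row vector x, with W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position."""
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

# Toy sizes standing in for d_model = 512 and d_ff = 2048.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]   # 2 x 3
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
b2 = [0.1, -0.1]

# The same parameters are applied independently to every position.
sequence = [[1.0, 1.0], [2.0, -1.0], [1.0, 1.0]]
ffn_out = [ffn(x, W1, b1, W2, b2) for x in sequence]

# Parameter count of one sub-layer at the sizes stated in the text.
ffn_params = 512 * 2048 + 2048 + 2048 * 512 + 512
```

Positions 0 and 2 hold the same vector and therefore get the same output, which is what "position-wise" means here: there is no mixing across positions inside this sub-layer.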

3.4 Embeddings 和 Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_\text{model}$ .
与其它序列 transduction 模型类似，我们使用习得的 embeddings 将输入 tokens 和输出 tokens 转换为维度为 $d_\text{model}$ 的向量。
We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
我们还使用通常习得的线性变换和 softmax 函数将解码器输出转换为预测的 next-token 概率。
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30].
在我们的模型中，我们在两个 embedding 层和 pre-softmax 线性变换之间共享相同的权重矩阵，类似于 [30]。
In the embedding layers, we multiply those weights by $\sqrt{d_\text{model}}$
在 embedding 层中，我们将这些权重乘以 $\sqrt{d_\text{model}}$

3.5 Positional Encoding 位置编码

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
由于我们的模型不包含 recurrence 和卷积，为了使模型利用序列的顺序，我们必须注入一些关于序列中 tokens 的相对或绝对位置的信息。
To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks.
为此，我们在编码器和解码器堆栈底部的输入 embeddings 中添加了“位置编码”。
The positional encodings have the same dimension $d_\text{model}$ as the embeddings, so that the two can be summed.
位置编码与 embeddings 具有相同的维数 $d_\text{model}$ ，因此两者可以相加。
There are many choices of positional encodings, learned and fixed [9].
有许多位置编码的选择，学习和固定[9]。

In this work, we use sine and cosine functions of different frequencies:
在这项工作中，我们使用了不同频率的正弦和余弦函数：

$PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$ ✔ 可以让模型外推到比训练期间遇到的序列长度更长的序列

$PE_{(pos,2i\textcolor{blue}{+1})}=\textcolor{blue}{\cos}\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$

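where $pos$ is the position and $i$ is the dimension.
其中 $pos$ 是位置，$i$ 是维度。
That is, each dimension of the positional encoding corresponds to a sinusoid.
也就是说，位置编码的每一个维度对应于一个正弦波。
The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$ .
波长形成从 $2\pi$ 到 $10000 \cdot 2\pi$ 的几何级数。
We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$ , $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$ .
我们选择这个函数是因为我们假设它可以让模型很容易地学会按相对位置进行注意，因为对于任何固定的偏移量 $k$ ， $PE_{pos+k}$ 可以表示为 $PE_{pos}$ 的线性函数。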
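
The sinusoidal encoding and the "linear function of $PE_{pos}$ " property can both be checked with a short sketch. The function `shift_by_offset` is a helper written for this example: it applies the angle-addition identities to each (sin, cos) pair, which amounts to a rotation whose coefficients depend only on the offset $k$ and not on the position.

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)       # even dimensions use sine
        pe[2 * i + 1] = math.cos(angle)   # odd dimensions use cosine
    return pe

def shift_by_offset(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) with a linear map that depends only on k."""
    shifted = [0.0] * d_model
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))   # angular frequency of this pair
        s, c = pe[2 * i], pe[2 * i + 1]
        shifted[2 * i] = s * math.cos(w * k) + c * math.sin(w * k)
        shifted[2 * i + 1] = c * math.cos(w * k) - s * math.sin(w * k)
    return shifted

d_model = 16
direct = positional_encoding(12, d_model)
via_shift = shift_by_offset(positional_encoding(7, d_model), 5, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, via_shift))
```

The two routes to the encoding of position 12 agree up to floating-point error, which is the relative-position property the authors hypothesized would be easy for the model to exploit.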

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).
我们还尝试使用习得的位置 embeddings [9] 代替，发现这两个版本产生的结果几乎相同（见表 3 行 (E) ）。
We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
我们选择正弦版本是因为它可以让模型外推到比训练期间遇到的序列长度更长的序列。

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1,\cdots, x_n)$ to another sequence of equal length $(z_1,\cdots,z_n)$ , with $x_i, z_i \in {\mathbb R}^d$ , such as a hidden layer in a typical sequence transduction encoder or decoder.
在本节中，我们将自注意层的各个方面与循环层和卷积层进行比较，这些层通常用于将一个可变长度的符号表示序列 $(x_1,\cdots, x_n)$ 映射到另一个相等长度的序列 $(z_1,\cdots,z_n)$ ，其中 $x_i, z_i \in {\mathbb R}^d$ ，例如典型序列 transduction 编码器或解码器中的隐藏层。
Motivating our use of self-attention we consider three desiderata.
为了说明我们使用自注意的动机，我们考虑三个期望满足的特性。

One is the total computational complexity per layer.
一个是每层的总计算复杂度。
Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
另一个是可以并行化的计算量，通过所需的最小顺序运算来衡量。

The third is the path length between long-range dependencies in the network.
第三个是网络中长范围依赖关系之间的路径长度。
Learning long-range dependencies is a key challenge in many sequence transduction tasks.
学习长范围依赖关系 是许多序列 transduction 任务中的关键挑战。
One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.
影响学习这种依赖关系能力的一个关键因素是网络中向前和向后信号必须经过的路径长度。
The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12].
输入和输出序列中任意位置组合之间的路径越短，学习长范围依赖关系就越容易[12]。
Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
因此，我们还比较了由不同层类型组成的网络中任意两个输入和输出位置之间的最大路径长度。

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O (n)$ sequential operations.
如表 1 所示，自注意层用固定数量的顺序执行的操作连接所有位置，而循环层需要 $O (n)$ 次顺序操作。
In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$ , which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations.
就计算复杂度而言，当序列长度 $n$ 小于表示维数 $d$ 时，自注意层比循环层更快，这是机器翻译中最先进的模型使用的句子表示的最常见情况，例如词块[38] 和字节对[31]表示。
To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position.
为了提高涉及非常长的序列的任务的计算性能，可以将自注意限制为只考虑以各自输出位置为中心的输入序列中大小为 $r$ 的邻域。
This would increase the maximum path length to $O (n / r)$ .
这将使最大路径长度增加到 $O (n / r)$ 。
We plan to investigate this approach further in future work.
我们计划在未来的工作中进一步研究这种方法。🌱

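[Table 1: Per-layer complexity, minimum number of sequential operations, and maximum path length for different layer types]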

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions.
一个核宽度为 $k < n$ 的卷积层并不能连接所有的输入和输出位置对。〔网络中任意两个位置之间最长路径的长度增加〕
Doing so requires a stack of $O (n / k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network.
这样做在连续核的情况下需要 $O (n / k)$ 卷积层的堆栈，或者在膨胀卷积 [18]的情况下需要 $O(\log_k(n))$ 卷积层的堆栈，增加网络中任意两个位置之间最长路径的长度。
Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$ .
卷积层通常比循环层昂贵 k 倍 。
Separable convolutions [6], however, decrease the complexity considerably, to $O(k·n·d +n·d^2)$ .
然而，可分离卷积[6]大大降低了复杂度，为 $O(k·n·d +n·d^2)$ 。
Even with $k = n$ , however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
然而，即使令 $k = n$ ，可分离卷积的复杂度等于自注意层和逐点前馈层的组合，这是我们在模型中采用的方法。
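
To make the complexity comparison concrete, the sketch below plugs sample sizes into the per-layer operation counts. The counts $n^2 \cdot d$ for self-attention and $n \cdot d^2$ for a recurrent layer are the ones given in the paper's Table 1, which is not reproduced in this text; the convolutional and separable-convolution counts follow the statements above. Constant factors are ignored, and the values of `n`, `d` and `k` are arbitrary examples.

```python
def self_attention_ops(n, d):
    """Order of operations per self-attention layer: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Order of operations per recurrent layer: n * d^2."""
    return n * d * d

def convolutional_ops(n, d, k):
    """A convolutional layer costs k times a recurrent layer: k * n * d^2."""
    return k * n * d * d

def separable_conv_ops(n, d, k):
    """Separable convolution: k * n * d + n * d^2."""
    return k * n * d + n * d * d

n, d, k = 50, 512, 3   # sentence-length n is smaller than the representation size d
attn = self_attention_ops(n, d)
rnn = recurrent_ops(n, d)
conv = convolutional_ops(n, d, k)

# With k = n, a separable convolution matches self-attention plus a point-wise
# feed-forward layer (n * d^2), as stated in the text.
sep_full = separable_conv_ops(n, d, n)
attn_plus_ffn = self_attention_ops(n, d) + n * d * d
```

For these sizes self-attention needs roughly a tenth of the operations of the recurrent layer, which reflects the ratio $n/d$ ; the advantage disappears once $n$ exceeds $d$ .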

As side benefit, self-attention could yield more interpretable models.
作为附带好处，自注意可以产生更多可解释的模型。
We inspect attention distributions from our models and present and discuss examples in the appendix.
我们从我们的模型中检查注意力分布，在附录中给出并讨论示例。
Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
不仅单独注意力头清楚地学会执行不同的任务，许多注意力头似乎表现出与句子的句法和语义结构相关的行为。

5 训练

This section describes the training regime for our models.
本节描述了我们模型的训练机制。

5.1 训练数据和 Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.
我们在标准的 WMT 2014 英语-德语数据集上进行训练，该数据集由大约 450 万句对组成。
Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens.
句子使用字节对编码[3] 进行编码，具有大约 37000 个 tokens 的共享源-目标词汇表。
For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38].
对于英语-法语，我们使用了更大的 WMT 2014 英语-法语数据集，该数据集由 36M 个句子组成，并将 tokens 拆分为 32000 个单词块的词汇[38]。
Sentence pairs were batched together by approximate sequence length.
句子对按 近似序列长度 进行批处理。
Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
每个训练批包含一组句子对，其中包含大约 25000 个源 tokens 和 25000 个目标 tokens。

5.2 硬件和时间表

We trained our models on one machine with 8 NVIDIA P100 GPUs.
我们在一台带有 8 个 NVIDIA P100 GPUs 的机器上训练我们的模型。
For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds.
对于使用本文中描述的超参数的基础模型，每个训练步骤大约需要 0.4 秒。
We trained the base models for a total of 100,000 steps or 12 hours.
我们对基础模型进行了总共 10 万步或 12 小时的训练。
For our big models (described on the bottom line of Table 3), step time was 1.0 seconds.
对于我们的大型模型（见表 3 的最后一行），每步耗时 1.0 秒。
The big models were trained for 300,000 steps (3.5 days).
大模型训练了 30 万步（3.5 天）。
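
The stated wall-clock times follow from the step counts and per-step times, as this small piece of arithmetic shows. The step times are quoted as "about" values in the text, so the results are approximate.

```python
base_seconds = 100_000 * 0.4   # base model: steps x seconds per step
big_seconds = 300_000 * 1.0    # big model

base_hours = base_seconds / 3600   # about 11.1 hours, reported as 12 hours
big_days = big_seconds / 86_400    # about 3.47 days, reported as 3.5 days
```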

5.3 优化器

We used the Adam optimizer [20] with $β_1 = 0.9, β_2 = 0.98$ and $\epsilon = 10^{-9}$ .
我们使用 Adam 优化器[20]，其中 $β_1 = 0.9, β_2 = 0.98$ 和 $\epsilon = 10^{-9}$ 。
We varied the learning rate over the course of training, according to the formula:
在训练过程中，我们根据以下公式改变了学习率：

$lrate=d^{-0.5}_\text{model}·\min(step\_num^{-0.5},step\_num·warmup\_steps^{-1.5})~~~~~~~~~~(3)$

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
这对应于在第一个 warmup_steps 训练步骤中线性增加学习率，然后按步数的倒数平方根成比例地降低学习率。
We used warmup_steps = 4000.
我们令 warmup_steps = 4000
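
Equation (3) translates directly into a small function. The default arguments use the base-model values given in the text; the function name is chosen for this example, and `step_num` is assumed to start at 1 because the formula is undefined at 0.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate from equation (3); step_num must be at least 1."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

peak = lrate(4000)         # the two branches meet at step_num == warmup_steps
halfway = lrate(2000)      # linear warmup: half the peak
later = lrate(16000)       # inverse square root decay: also half the peak
```

With these settings the rate peaks at roughly $7 \times 10^{-4}$ at step 4000, rises linearly before that point, and falls off as the inverse square root of the step number afterwards.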

5.4 正则化

We employ three types of regularization during training:
我们在训练中使用三种类型的正则化：

Residual Dropout
We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.
我们将 dropout[33] 应用于每个子层的输出，然后将其添加到子层输入并标准化。
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
此外，我们将 dropout 应用于编码器和解码器堆栈中的 embeddings 和位置编码之和。
For the base model, we use a rate of $P_\text{drop} =0.1$ .
对于基础模型，我们使用 $P_\text{drop} =0.1$ 的比率。

Label Smoothing
During training, we employed label smoothing of value $\epsilon_{ls}$ = 0.1 [36].
在训练过程中，我们使用值 $\epsilon_{ls}$ = 0.1[36] 的标签平滑。
This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
这损害了困惑度，因为模型学会了变得更不确定，但提高了准确性和 BLEU 分数。〔评估机器翻译质量的指标，越高越好〕
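
The effect of label smoothing on the training target can be illustrated as follows. The text does not spell out the exact formulation, so this sketch assumes the common variant that mixes the one-hot target with a uniform distribution over the vocabulary; the 4-token vocabulary and the probability vectors are made up for the example.

```python
import math

def smoothed_targets(true_index, vocab_size, eps=0.1):
    """Assumed variant: (1 - eps) * one_hot + eps * uniform over the vocabulary."""
    targets = [eps / vocab_size] * vocab_size
    targets[true_index] += 1.0 - eps
    return targets

def cross_entropy(targets, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)

targets = smoothed_targets(true_index=2, vocab_size=4)

# A very confident prediction versus one that matches the smoothed target.
confident = [0.001, 0.001, 0.997, 0.001]
loss_confident = cross_entropy(targets, confident)
loss_matched = cross_entropy(targets, targets)
```

Under the smoothed target the overconfident prediction incurs a higher loss than the softer one, which is one way to see why the model "learns to be more unsure" and why perplexity gets worse even as accuracy and BLEU improve.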

6 结果

6.1 机器翻译

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.
在 WMT 2014 英语转德语翻译任务中，大的 transformer 模型（表 2 中的Transformer (big)）比之前报道的最佳模型（包括集成）高出 2.0 BLEU 以上，建立了新的最先进的 BLEU 分数 28.4。
The configuration of this model is listed in the bottom line of Table 3.
该模型的配置列在表 3 的底线。
Training took 3.5 days on 8 P100 GPUs.
训练时间为 3.5 天，使用的是 8 个 P100 GPUs。
Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
甚至我们的基础模型也超过了所有以前发表的模型和集成，训练成本只是任何竞争模型的一小部分。

[Table 2: Translation quality (BLEU) and training cost compared with other model architectures]

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model.
在 WMT 2014 英语转法语翻译任务上，我们大的模型获得了 41.0 的 BLEU 分数，优于之前发布的所有单一模型，而训练成本不到之前最先进模型的 1/4。〔集成模型没超过〕
The Transformer (big) model trained for English-to-French used dropout rate $P_\text{drop}=0.1$ , instead of 0.3.
为英语转法语训练的 Transformer(big) 模型使用的 dropout 率 $P_\text{drop}=0.1$ ，而不是 0.3。

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals.
对于基础模型，我们使用通过平均最后 5 个检查点获得的单个模型，这些检查点每隔 10 分钟写入一次。
For the big models, we averaged the last 20 checkpoints.
对于大的模型，我们取最后 20 个检查点的平均值。
We used beam search with a beam size of 4 and length penalty $\alpha = 0.6$ [38].
我们使用束搜索，波束大小为 4，长度惩罚 $\alpha = 0.6$ [38]。
These hyperparameters were chosen after experimentation on the development set.
这些超参数是在开发集上实验后选择的。
We set the maximum output length during inference to input length +50, but terminate early when possible [38].
我们在推理期间将最大输出长度设置为输入长度 +50，但在可能的情况下提前终止 [38]。
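
Checkpoint averaging, as used above to obtain the final single model, amounts to an element-wise mean of the saved parameters. The sketch below assumes a hypothetical checkpoint format (a dict from parameter name to a flat list of floats); real checkpoints hold tensors, but the arithmetic is the same.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise across a list of checkpoints.

    Each checkpoint is assumed to be a dict mapping a parameter name to a flat
    list of floats, with identical names and shapes in every checkpoint.
    """
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / count for col in columns]
    return averaged

# Five toy checkpoints standing in for the last 5 saved base-model checkpoints.
toy_checkpoints = [{"w": [float(i), 2.0 * i], "b": [1.0]} for i in range(1, 6)]
final_model = average_checkpoints(toy_checkpoints)
```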

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature.
表 2 总结了我们的结果，并将我们的翻译质量和训练成本与文献中的其它模型架构进行了比较。
We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU $^5$ .
我们通过将训练时间、使用的 GPUs 数量和每个 GPU 的持续单精度浮点容量的估计值乘积来估计用于训练模型的浮点运算次数。

$^5$ We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
K80、K40、M40 和 P100 的 TFLOPS 分别为 2.8、3.7、6.0 和 9.5。
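
The training-cost estimate described above is a product of three numbers. The sketch below applies it to the two Transformer runs, using the training times and GPU counts stated in this document and the per-GPU TFLOPS values from footnote 5; the function name is chosen for this example.

```python
# Sustained single-precision TFLOPS per GPU, from footnote 5.
TFLOPS = {"K80": 2.8, "K40": 3.7, "M40": 6.0, "P100": 9.5}

def training_flops(hours, num_gpus, gpu):
    """Estimate floating point operations as time x GPU count x sustained FLOPS."""
    return hours * 3600 * num_gpus * TFLOPS[gpu] * 1e12

base_flops = training_flops(12, 8, "P100")        # about 3.3e18
big_flops = training_flops(3.5 * 24, 8, "P100")   # about 2.3e19
```

This is an estimate of hardware capacity consumed rather than a count of operations the model actually performs, so it is best read as an upper-level cost comparison between systems.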

6.2 模型变体

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013.
为了评估 Transformer 不同组件的重要性，我们以不同的方式改变了我们的基础模型，在开发集 newstest2013 上测量了英语到德语翻译的性能变化。
We used beam search as described in the previous section, but no checkpoint averaging.
We present these results in Table 3.
我们使用前一节描述的束搜索，但没有使用检查点平均。
我们在表 3 中展示了这些结果。


Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model.
表 3：Transformer 架构的变体。未列出的值与基础模型的值相同。
All metrics are on the English-to-German translation development set, newstest2013.
所有指标都是基于英语到德语的翻译开发集 newstest2013。
Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
根据我们的字节对编码，列出的困惑是 per-wordpiece，不应该与 per-word 困惑进行比较。

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2.
While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
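
The constant-computation constraint in rows (A) follows from the rule in Section 3.2.2 that each head works in a reduced dimension $d_k = d_v = d_\text{model} / h$. A minimal sketch, assuming the base model's $d_\text{model} = 512$ (the function name is hypothetical):

```python
def per_head_dim(d_model: int, num_heads: int) -> int:
    """Key/value dimension per head when d_k = d_v = d_model / h."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads


D_MODEL = 512  # base model width (assumed from the paper's base configuration)

for h in (1, 4, 8, 16, 32):
    d_k = per_head_dim(D_MODEL, h)
    # h * d_k stays equal to d_model, so the total projection width is fixed.
    print(f"h={h:2d}  d_k=d_v={d_k:3d}  h*d_k={h * d_k}")
```

Since the product $h \cdot d_k$ never changes, adding heads trades width per head for the number of distinct attention patterns, which fits the observation that both very few and very many heads hurt quality.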

In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality.
This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.
We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.
In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
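
For reference, the sinusoidal encoding that row (E) swaps out can be written in a few lines. This sketch follows the definition in Section 3.5, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_\text{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_\text{model}})$, using plain Python lists. The function names are hypothetical, and the learned variant is represented only by its parameter count.

```python
import math


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Fixed positional encoding: sin on even dimensions, cos on odd ones."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even in range(0, d_model, 2):
            # `even` plays the role of 2i in the paper's formula.
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe


def learned_position_parameters(max_len: int, d_model: int) -> int:
    """A learned embedding table needs one trainable vector per position."""
    return max_len * d_model


pe = sinusoidal_positional_encoding(max_len=4, d_model=8)
print(pe[0][:4])  # position 0 alternates sin(0) and cos(0)
print(learned_position_parameters(max_len=512, d_model=512))
```

The sinusoidal version adds no trainable parameters and is defined for any position, whereas a learned table is limited to the positions it was trained with.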

6.3 English Constituency Parsing

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing.
This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input.
Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].

We trained a 4-layer transformer with $d_\text{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences.
We also trained it in a semi-supervised setting, using the larger high-confidence and BerkeleyParser corpora with approximately 17M sentences [37].
We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.

We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model.
During inference, we increased the maximum output length to input length + 300.
We used a beam size of 21 and $\alpha = 0.3$ for both WSJ only and the semi-supervised setting.

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].

[Table 4: results table image not available]

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.

7 Conclusion

↓ 【One-sentence summary of the work: highlights + main idea】

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

↓ 【Advantage (faster training) + benchmark results (achieves SOTA)】

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art.
In the former task our best model outperforms even all previously reported ensembles.

↓ 【Plans for future research】

We are excited about the future of attention-based models and plan to apply them to other tasks.
We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. 〔See, for example, the diffusion transformer (DiT).〕
Making generation less sequential is another research goal of ours. 〔More random generation? Or generation that requires less sequential computation?〕

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.

Acknowledgements

We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.

References

Attention Visualizations

[Figures: attention visualization images not available]
