
[NLP] 23. Summary: 60 Multiple-Choice Questions

news Source: original 2025/8/29 0:09:11

Question 1:
What does the fixed lookup table in traditional NLP represent?
A. A table of one-hot vectors
B. A table of pre-trained dense word embeddings
C. A dictionary of word definitions
D. A table of n-gram counts

Answer: B. In traditional NLP, the "lookup table" is a table of pre-trained, fixed-dimensional word vectors that represent the semantic features of each word.

Question 2:
Which method uses context words to predict a target word in its training procedure?
A. Skip-Gram
B. GloVe
C. Continuous Bag-of-Words (CBOW)
D. FastText

Answer: C. CBOW predicts the center word from the surrounding context words.

Question 3:
What is the main idea behind the Skip-Gram model?
A. Predicting surrounding context words using the center word
B. Averaging word vectors to form a sentence vector
C. Using subword information only
D. Decomposing words into character n-grams

Answer: A. Skip-Gram uses the current center word to predict the context words to its left and right.
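The difference between CBOW (Question 2) and Skip-Gram (Question 3) shows up in how training examples are cut from a sentence. The sketch below is illustrative only; the function names and the `window` parameter are assumptions, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: each (center, context) pair asks the center word to predict one neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: each (context_list, center) example asks the whole context to predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=1)[:3])
print(cbow_examples(tokens, window=1)[1])
```

Skip-Gram yields one example per (center, neighbour) pair, while CBOW yields one example per position with all neighbours grouped together.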

Question 4:
Which method combines both global corpus statistics and local context information for word vector training?
A. Word2vec
B. FastText
C. GloVe
D. CBOW

Answer: C. GloVe builds word vectors from global co-occurrence statistics together with local context information.

Question 5:
What is the main advantage of using FastText over traditional word2vec?
A. It uses a larger lookup table
B. It represents words as a sum of their character n-grams
C. It employs RNNs for context modeling
D. It uses only one-hot encoding

Answer: B. FastText represents a word as the sum of its character n-gram vectors, which helps with out-of-vocabulary words and morphological variation.
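A minimal sketch of the idea, assuming boundary markers `<` and `>` around the word and a toy in-memory table `ngram_vectors` (a hypothetical name; a real implementation hashes n-grams into a fixed number of buckets and also stores a vector for the whole word):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Sum the vectors of all known n-grams; unknown n-grams contribute nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with seen words, it receives a non-trivial vector instead of a generic "unknown" embedding.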

Question 6:
When the training data distribution differs from the pre-training data, which process is generally applied?
A. Data augmentation
B. Pre-computation of embeddings
C. Fine-tuning
D. Tokenization

Answer: C. Fine-tuning continues training the pre-trained model on task-specific data so that it adapts to the difference in data distribution.

Question 7:
How does WordNet assist in handling word senses?
A. It provides pre-trained vectors
B. It supplies prior knowledge of semantic relationships
C. It improves model parallelization
D. It replaces the lookup table entirely

Answer: B. WordNet is a lexical database that records semantic relations such as synonymy and antonymy, which helps with word sense disambiguation.

Question 8:
What is a key limitation of traditional word embeddings like word2vec for polysemous words?
A. They are too high-dimensional
B. They use complex recurrent structures
C. They assign a single vector regardless of context
D. They are computed using SVD

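Answer: C. Traditional embeddings give each word a single fixed vector, so they cannot distinguish the meanings of a polysemous word in different contexts.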

Question 9:
Which model introduced deep contextualized word representations in 2018?
A. GloVe
B. FastText
C. ELMo
D. Word2vec

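Answer: C. ELMo, introduced in 2018, produces context-dependent word representations with a bidirectional RNN (specifically, a bidirectional LSTM language model).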

Question 10:
What is the purpose of using a bidirectional RNN in Encoder architectures?
A. To reduce computation time
B. To capture context from both past and future
C. To implement teacher forcing
D. To initialize word embeddings

Answer: B. A bidirectional RNN captures context from both the left (past) and the right (future), which enriches the input representation.

Question 11:
In a sequence-to-sequence model, what role does the Decoder perform?
A. It compresses the input sequence into a fixed vector
B. It generates a target sequence based on encoder outputs
C. It performs tokenization
D. It computes global word frequencies

Answer: B. The decoder uses the information output by the encoder to generate the target sequence, for example a translation or a summary.

Question 12:
What is “teacher forcing” in the context of training decoders?
A. Using the predicted token as the next input
B. Replacing the entire decoder with a lookup table
C. Feeding ground truth tokens as input during training
D. Forcing the model to converge to a fixed learning rate

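Answer: C. With teacher forcing, the ground-truth target token is fed in as the next input during training, which speeds up convergence and stabilizes training.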
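In practice, teacher forcing amounts to shifting the reference sequence by one position: the decoder input at step t is the true token from step t-1, and the label is the true token at step t. A small sketch, where the `<bos>` and `<eos>` marker strings are assumptions chosen for illustration:

```python
def teacher_forcing_pairs(target_tokens, bos="<bos>", eos="<eos>"):
    """Pair each decoder input (ground truth, shifted right) with the token it should predict."""
    decoder_inputs = [bos] + list(target_tokens)
    labels = list(target_tokens) + [eos]
    return list(zip(decoder_inputs, labels))


for step_input, label in teacher_forcing_pairs(["le", "chat", "dort"]):
    print(f"input={step_input!r:10} -> predict {label!r}")
```

At inference time no ground truth exists, so the model's own previous prediction takes the place of `step_input`.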

Question 13:
Which decoding strategy retains multiple candidate sequences for higher quality output?
A. Greedy decoding
B. Beam Search
C. Token sampling
D. Backpropagation

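Answer: B. Beam Search keeps several candidate sequences at each step, which reduces the risk of getting stuck in a local optimum and improves output quality.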
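The toy example below sketches why keeping more than one candidate can help. The probability table `TABLE` and the function `toy_model` are made up for illustration, and the search omits end-of-sequence handling for brevity; a beam size of 1 is equivalent to greedy decoding.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TABLE = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"z": 0.9, "w": 0.1},
}


def toy_model(prefix):
    """Return log-probabilities of each possible next token."""
    return {tok: math.log(p) for tok, p in TABLE[prefix].items()}


def beam_search(next_log_probs, beam_size, steps):
    """Keep the beam_size best prefixes (by summed log-probability) at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


print(beam_search(toy_model, beam_size=1, steps=2)[0])  # greedy: ('A', 'x'), probability 0.30
print(beam_search(toy_model, beam_size=2, steps=2)[0])  # beam:   ('B', 'z'), probability 0.36
```

Greedy decoding commits to "A" because it is the best first token, and so it never sees the higher-probability sequence that starts with "B".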

Question 14:
What is a major bottleneck in traditional Encoder-Decoder models?
A. The need for excessive training data
B. The fixed-size vector that must carry all information
C. The use of subword tokenization
D. Inaccurate teacher forcing

Answer: B. The bottleneck in traditional models is that all information must be compressed into a single fixed-size vector, which causes information loss.

Question 15:
How does attention mechanism help with the encoder-decoder bottleneck?
A. It increases the size of the fixed context vector
B. It allows dynamic focus on different parts of the input
C. It eliminates the need for training data
D. It converts dense vectors to sparse ones

Answer: B. Attention lets the decoder focus dynamically on different parts of the input for each output it generates, which eases the bottleneck.

Question 16:
In the attention calculation process, what is typically computed first between the decoder state and encoder hidden states?
A. The weighted average
B. The feedforward network output
C. The attention scores
D. The subword embeddings

Answer: C. The attention scores are computed first; they measure how important each part of the input is.

Question 17:
Which attention mechanism uses a dot product between the decoder query and encoder keys?
A. Additive attention
B. Dot-Product Attention
C. Multiplicative Attention
D. Reduced-Rank Attention

Answer: B. Dot-product attention takes the dot product of the decoder's query vector with the encoder's key vectors.

Question 18:
Why is the scaled dot-product attention used instead of the plain dot product in some models?
A. To increase numerical stability for high dimensions
B. To simplify the model architecture
C. To incorporate position encoding
D. To avoid using softmax

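Answer: A. Scaled dot-product attention divides by the square root of the key dimension, which reduces the instability caused by very large values in high dimensions.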
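Questions 16 to 18 describe the usual order of operations: compute scores, scale them, normalize with softmax, then take a weighted sum of the values. A dependency-free sketch for a single query follows; the function names are illustrative, and real implementations operate on batched matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Return (attention weights, output vector) for one query."""
    d_k = len(query)
    # Step 1: attention scores, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Step 2: normalize the scores into a probability distribution.
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return weights, output


weights, output = scaled_dot_product_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, output)
```

The first key matches the query, so it receives the larger weight and the output leans toward the first value vector.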

Question 19:
What does “multiplicative attention” (also known as bilinear attention) allow?
A. Allowing query and key to be in different vector spaces via a learnable matrix
B. Eliminating non-linearities
C. Combining multiple attention heads automatically
D. Enforcing the same dimension for query and key

Answer: A. Multiplicative (bilinear) attention introduces a learnable matrix so that queries and keys from different vector spaces can be matched.
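As a rough illustration, the bilinear score is q^T W k, where W has one row per query dimension and one column per key dimension. The matrix below is a fixed toy value standing in for learned parameters.

```python
def bilinear_score(query, weight, key):
    """Compute q^T W k; weight has shape (len(query), len(key))."""
    # Project the query into the key space, then take an ordinary dot product.
    projected = [
        sum(query[i] * weight[i][j] for i in range(len(query)))
        for j in range(len(key))
    ]
    return sum(p * k for p, k in zip(projected, key))


query = [1.0, 2.0]             # 2-dimensional query
key = [3.0, 4.0, 5.0]          # 3-dimensional key
weight = [[1.0, 0.0, 0.0],     # toy stand-in for a learned 2x3 matrix
          [0.0, 1.0, 0.0]]
print(bilinear_score(query, weight, key))  # 11.0
```

A plain dot product would be undefined here because the two vectors have different lengths; the matrix bridges the two spaces.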

Question 20:
In additive attention, what is the purpose of using a feedforward network after combining the query and key?
A. To improve computational efficiency
B. To add non-linearity and flexibility in score computation
C. To enforce sparsity in representations
D. To perform tokenization

Answer: B. The feedforward layer in additive attention introduces non-linearity, which makes the attention scoring more flexible and accurate.
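One common form of the additive score is v^T tanh(W_q q + W_k k). The sketch below assumes that form; the matrices `w_q`, `w_k` and the vector `v` are toy values standing in for learned parameters.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def additive_score(query, key, w_q, w_k, v):
    """Score = v . tanh(W_q q + W_k k): a one-hidden-layer feedforward network."""
    hidden = [
        math.tanh(a + b)
        for a, b in zip(matvec(w_q, query), matvec(w_k, key))
    ]
    return sum(vi * hi for vi, hi in zip(v, hidden))


identity = [[1.0, 0.0], [0.0, 1.0]]
score = additive_score([0.5, -0.5], [0.5, 0.5], identity, identity, [1.0, 1.0])
print(score)
```

The tanh is what makes the score a non-linear function of the query and key, unlike the dot-product and bilinear variants.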

Question 21:
What do the keys and values in the attention mechanism usually derive from in a translation task?
A. The decoder hidden states
B. Randomly initialized embeddings
C. The encoder’s output hidden states
D. The subword segmentation results

Answer: C. In machine translation, the keys and values in attention usually come from the encoder's hidden states.

Question 22:
What is self-attention?
A. Attention between different sequences
B. An attention mechanism where the query, key, and value come from the same sequence
C. A method to pre-compute embeddings
D. A way to enforce teacher forcing

Answer: B. In self-attention, the queries, keys, and values all come from the same sequence.

Question 23:
Which benefit does self-attention provide compared to recurrent networks?
A. It requires sequential processing
B. It introduces dependency on previous time steps
C. It allows parallel computation over sequence positions
D. It reduces model complexity by avoiding feedforward layers

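Answer: C. Self-attention computes the representations of all positions in parallel, which substantially improves efficiency, most of all during training.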

Question 24:
What is the primary purpose of positional encoding in Transformer models?
A. To represent syntactic dependency
B. To inject sequence order information
C. To generate subword tokens
D. To perform beam search

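Answer: B. Positional encoding supplies the Transformer with position information, compensating for the fact that self-attention by itself has no notion of sequence order.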

Question 25:
Which of the following is a common method to compute fixed positional encodings?
A. One-hot encoding
B. Learnable embedding vectors
C. Sine and cosine functions
D. Word2vec representations

Answer: C. Fixed positional encodings are commonly computed with sine and cosine functions.
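A short sketch of the widely used sinusoidal scheme, in which even dimensions use sine and odd dimensions use cosine, with wavelengths growing geometrically (the base 10000 follows the original Transformer formulation; the function name is illustrative):

```python
import math


def sinusoidal_encoding(position, d_model):
    """Return the fixed positional encoding vector for one position."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_encoding(1, 4))
```

Since the function is defined for any position, it can be evaluated for sequence lengths never seen during training, which is the property Question 26 contrasts with learned embeddings.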

Question 26:
What is one limitation of using learned positional embeddings?
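A. They cannot capture position information
B. Their training procedure is too simple
C. They generalize poorly to sequence lengths not seen during training
D. They can only be used for short texts

Answer: C. Learned positional embeddings generalize poorly to sequences longer than any seen during training.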

Question 27:
What is the primary advantage of using subword segmentation techniques like BPE?
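A. It solves the subsampling problem caused by high-frequency words
B. It significantly reduces the number of model parameters
C. It handles out-of-vocabulary words and morphological variation
D. It improves the model's parallel computation efficiency

Answer: C. Subword segmentation methods such as BPE split words into smaller units, which handles spelling variants, out-of-vocabulary words, and morphological variation.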

Question 28:
In Byte-Pair Encoding (BPE), what is the basic algorithmic step?
A. Replacing whole words with synonyms
B. Merging the most frequent adjacent symbol pairs iteratively
C. Splitting sentences into individual characters
D. Averaging character embeddings

Answer: B. BPE repeatedly counts adjacent symbol pairs in the corpus and merges the most frequent pair, until the vocabulary reaches a preset size.
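The merge loop can be sketched in a few lines. This is a simplified illustration that stops after a fixed number of merges instead of a target vocabulary size, omits end-of-word markers, and breaks ties by first occurrence; `word_freqs` is a hypothetical word-frequency dictionary.

```python
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Start from single characters and apply the most frequent merge repeatedly."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges, vocab


print(learn_bpe({"aab": 3, "ab": 2}, num_merges=2))
```

In the first round the pair ("a", "b") occurs 5 times against 3 for ("a", "a"), so it is merged first; the second round then merges ("a", "ab").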

Question 29:
Which method uses a more complex criterion—such as decreasing perplexity—for merging tokens?
A. Simple BPE
B. WordPiece
C. One-hot encoding
D. CBOW

Answer: B. WordPiece takes language-model perplexity into account when merging tokens, so that it can choose the best merge.

Question 30:
What distinguishes the Unigram model (SentencePiece) from BPE and WordPiece?
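A. It uses only character-level information
B. It starts from a large vocabulary and gradually removes rarely used units
C. It is computed from a statistical co-occurrence matrix
D. It uses fixed merge rules and performs no probability computation

Answer: B. The Unigram model starts from a vocabulary containing both large and small units and uses a probabilistic model to gradually remove the units that contribute least, until the target vocabulary size is reached.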

Question 31:
Which evaluation metric compares word n-grams and applies a brevity penalty?
A. chrF
B. BLEU
C. ROUGE
D. METEOR

Answer: B. BLEU evaluates translation quality by computing n-gram precision at several orders and applying a penalty to outputs that are too short.

Question 32:
What is the main advantage of the chrF metric?
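A. It is more tolerant of morphological variation in words
B. It only considers sentence length
C. It is designed specifically for dialogue generation
D. It ignores punctuation

Answer: A. chrF matches character n-grams, so it is robust to inflectional changes, spelling errors, and similar surface variation.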
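To see why character n-grams are forgiving, consider a deliberately simplified score that uses a single n-gram order. The real chrF averages over several orders, so treat this as a sketch of the principle and not as the metric itself; `beta=2.0` weights recall more heavily than precision.

```python
from collections import Counter


def char_ngram_fscore(hypothesis, reference, n=3, beta=2.0):
    """F-beta score over character n-grams of one fixed order (simplified chrF)."""
    hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    overlap = sum((hyp & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# A word-level match would score 0 here, since "walked" != "walking".
print(char_ngram_fscore("walked", "walking"))
```

The two forms share the trigrams "wal" and "alk", so the hypothesis earns partial credit for getting the stem right.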

Question 33:
Which of the following is NOT a typical problem with simple space-based tokenization?
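A. It can mishandle abbreviations
B. It has trouble separating punctuation
C. It captures semantic relations effectively
D. It is limited on rare and new words

Answer: C. Option C describes a capability, not a problem, and simple space-based tokenization does not have it; mishandled abbreviations, punctuation, and rare words are its actual weaknesses.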

Question 34:
What do subword tokenization methods help with?
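A. Replacing word embeddings entirely
B. Capturing fine-grained information within words
C. Reducing computational cost to the minimum
D. Automatically generating syntax trees

Answer: B. Subword segmentation captures fine-grained features of words, which helps with morphological variation and unknown words.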

Question 35:
In the context of neural machine translation, what does the brevity penalty in BLEU aim to correct?
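A. Overly long output sentences
B. Meaningless generated words
C. Inflated scores caused by overly short outputs
D. Grammatical errors

Answer: C. The brevity penalty penalizes outputs that are too short, so that a short sentence cannot earn an inflated n-gram precision.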
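The effect is easy to check numerically. The sketch below implements the standard brevity penalty, BP = 1 if c > r, else exp(1 - r/c), and a clipped n-gram precision for a single reference; a full BLEU score would combine several n-gram orders with a geometric mean, which is left out here.

```python
import math
from collections import Counter


def brevity_penalty(candidate_len, reference_len):
    """1.0 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "the cat sat down".split()
candidate = "the cat".split()
print(clipped_precision(candidate, reference, 1))          # 1.0: every word matches
print(brevity_penalty(len(candidate), len(reference)))     # exp(-1), roughly 0.37
```

Without the penalty, the two-word candidate would look perfect on unigram precision even though it drops half of the reference.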

Question 36:
What is the main function of the encoder in an Encoder-Decoder architecture?
A. To generate the output sequence
B. To compress the input sequence into a representation
C. To decode the target language
D. To compute the loss function

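Answer: B. The encoder's main function is to compress the input sequence into a representation (a single vector or a set of vectors) for the decoder to use.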

Question 37:
How does beam search improve over greedy decoding in sequence generation?
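A. It generates output faster
B. It avoids local optima by keeping multiple candidate sequences
C. It avoids teacher forcing entirely
D. It enforces a fixed vocabulary

Answer: B. By keeping multiple candidate sequences, Beam Search mitigates the local-optimum problem caused by choosing greedily at every single step.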

Question 38:
In teacher forcing, what is used as the input at each decoder step during training?
A. The model’s previous predicted token
B. The ground truth target token
C. A randomly selected token
D. The encoder’s final state

Answer: B. With teacher forcing, the ground-truth target token is used as the input at each decoder step during training, which guides the model's learning.

Question 39:
What issue does the “bottleneck” in an Encoder-Decoder model refer to?
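A. Insufficient training data
B. All input information must be compressed into one fixed-size vector
C. Decoding is too fast
D. Subword segmentation is impossible

Answer: B. The bottleneck is that compressing all information into one fixed vector can lose information, especially for long sequences.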

Question 40:
Which of the following best describes “Cross-Attention”?
A. Attention where query, key, and value all come from the same sequence
B. Attention between encoder and decoder where query comes from decoder and key/value from encoder
C. A method to generate subword tokens
D. A special case of recurrent neural networks

Answer: B. Cross-attention is the attention computed in an encoder-decoder model between the decoder's query vectors and the encoder's outputs, which serve as keys and values.

Question 41:
What is the function of the feedforward layer after the attention mechanism in Transformer models?
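A. It simplifies the attention computation
B. It applies a non-linear transformation that further processes the attention-weighted representation
C. It converts the output into one-hot vectors
D. It predicts vocabulary items directly

Answer: B. The feedforward layer adds a non-linear activation after the attention layer, which helps capture more complex features and patterns.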

Question 42:
What is Rotary Positional Embedding designed to do?
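A. Fix the vocabulary size of the sequence
B. Encode position information through rotations while preserving the cosine similarity between words
C. Replace the attention mechanism
D. Counter the vanishing gradient problem

Answer: B. Rotary Positional Embeddings inject position information through rotations while preserving similarity between words: a rotation leaves vector norms unchanged, and the dot product of a rotated query and key depends only on their relative offset.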
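The rotation idea can be checked on a single 2-D pair of coordinates. This is only a sketch: real rotary embeddings apply such rotations to many coordinate pairs with different frequencies, and the single frequency `theta` here is an assumption made for illustration.

```python
import math


def rotate(vec, position, theta=1.0):
    """Rotate a 2-D vector by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q, k = (1.0, 2.0), (0.5, -1.0)

# Same relative offset (2) at different absolute positions gives the same score.
print(dot(rotate(q, 3), rotate(k, 1)))
print(dot(rotate(q, 7), rotate(k, 5)))

# Rotation leaves the vector norm unchanged.
print(math.hypot(*q), math.hypot(*rotate(q, 3)))
```

The attention score therefore depends on how far apart two tokens are, not on where they sit in absolute terms.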

Question 43:
Which subword segmentation method is known for starting with all characters as vocabulary and then merging pairs iteratively?
A. WordPiece
B. Unigram
C. Byte-Pair Encoding (BPE)
D. Transformer

Answer: C. BPE starts from the set of all characters and builds subwords by iteratively merging the most frequent pair.

Question 44:
What distinguishes WordPiece’s merging criterion from that of BPE?
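A. WordPiece only considers character frequency
B. WordPiece uses language-model probability to measure the gain from a merge
C. BPE uses a fixed vocabulary
D. WordPiece performs no merges

Answer: B. WordPiece considers how much a merge lowers perplexity, whereas BPE is based mainly on frequency.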

Question 45:
Which of the following is an advantage of the Unigram model (SentencePiece) over BPE?
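A. It always yields a unique segmentation
B. It is more flexible and can consider large and small units at the same time
C. It is entirely greedy
D. It needs no probability computation

Answer: B. The Unigram model considers large and small units together and uses a probabilistic model to pick the best segmentation, which makes it more flexible.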

Question 46:
What are the typical components generated from an input word in the self-attention mechanism of Transformers?
A. Only the key and value vectors
B. Query, key, and value vectors
C. Only the query vector
D. Only positional encoding

Answer: B. In self-attention, each input produces a query, a key, and a value vector, which are used in the attention computation.

Question 47:
In the scaled dot-product attention, what does the denominator (√dk) help with?
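A. Speeding up computation
B. Mitigating the instability caused by large dot products in high dimensions
C. Improving vocabulary coverage
D. Producing sparser vectors

Answer: B. Dividing by the square root of the key dimension shrinks high-dimensional dot products and prevents the instability that very large values would cause.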

Question 48:
What does “length normalization” in beam search address?
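A. Adjusting the length of words in the output sequence
B. Preventing shorter sequences from being over-selected because their total score is higher
C. Normalizing all sequences to the same length
D. Reducing computational complexity

Answer: B. A shorter sequence sums fewer negative log-probabilities and so tends to get a higher total score; length normalization corrects this bias toward short outputs.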
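One simple form of length normalization divides the summed log-probability by the sequence length raised to a power `alpha` (the exact formula varies between systems; this one is an assumption chosen for clarity).

```python
def normalized_score(log_probs, alpha=1.0):
    """Summed log-probability divided by length**alpha."""
    return sum(log_probs) / (len(log_probs) ** alpha)


short = [-1.0, -1.0]            # 2 tokens, total log-probability -2.0
long = [-0.6, -0.6, -0.6, -0.6]  # 4 tokens, total log-probability -2.4

print(sum(short), sum(long))                            # raw score prefers the short one
print(normalized_score(short), normalized_score(long))  # normalized score prefers the long one
```

Each token of the longer hypothesis is individually more probable, but the raw sum hides that because it accumulates more negative terms.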

Question 49:
What aspect of the attention mechanism allows it to be highly parallelizable?
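A. It depends on previous outputs
B. Positions are computed independently of one another, with no sequential processing required
C. Results must be generated step by step
D. It can only process fixed-length vectors

Answer: B. In attention, the computations for all positions can run at the same time, without waiting on earlier or later positions, which makes the mechanism highly parallelizable.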

Question 50:
What is the significance of using a softmax function in attention computation?
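A. Adding non-linearity
B. Converting unnormalized attention scores into a probability distribution
C. Reducing model complexity
D. Preventing overfitting

Answer: B. Softmax turns the attention scores into a probability distribution, so the weights are normalized and can be used in a weighted sum.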

Question 51:
In a self-attention layer, why is masking sometimes applied during training?
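A. To prevent future information from leaking into the current generation step
B. To speed up computation
C. To avoid using positional encoding
D. To force fixed-length output

Answer: A. Masking ensures that, in autoregressive generation, the model can only use the past tokens it has already produced and cannot peek at future information.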
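A causal mask can be sketched as a lower-triangular table of booleans: position i may attend to position j only when j <= i. Masked scores are set to negative infinity before the softmax so that their weights become exactly zero. The helper names below are illustrative.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only (assumes at least one is allowed)."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
scores = [1.0, 2.0, 3.0]
for i, row in enumerate(mask):
    print(i, masked_softmax(scores, row))
```

The first position can only attend to itself, so its weight vector is [1.0, 0.0, 0.0] no matter how large the later scores are.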

Question 52:
What is one key advantage of replacing RNNs with self-attention based models in sequence processing?
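A. It adds more dependencies
B. It removes the recurrent dependency and allows much greater parallelism
C. It needs longer sequences to train
D. It simplifies the vocabulary

Answer: B. Self-attention drops the recurrent dependency of RNNs, so all positions can be computed in parallel, which greatly speeds up training (autoregressive decoding at inference time still generates one token at a time).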

Question 53:
Which of the following is a common challenge with using RNNs in long-sequence processing?
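A. Vanishing or exploding gradients
B. Too much parallelism
C. Over-reliance on subword segmentation
D. The fixed word vector problem

Answer: A. RNNs often suffer from vanishing or exploding gradients on long sequences, so they capture long-range dependencies poorly.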

Question 54:
What role do feedforward layers play in Transformer architectures following self-attention?
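A. Reducing the dimensionality of word vectors
B. Providing a non-linear transformation that increases the model's expressive power
C. Producing the final prediction directly
D. Replacing positional encoding

Answer: B. The feedforward layers provide a non-linear transformation that helps the model capture complex features and increases its expressive power.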

Question 55:
Which metric is especially robust to morphological variations in words?
A. BLEU
B. ROUGE
C. chrF
D. METEOR

Answer: C. chrF is based on character n-grams, so it tolerates differences caused by spelling and morphological variation better.

Question 56:
What is a typical input to an encoder in sequence-to-sequence models?
A. The translated target sentence
B. The original input sequence
C. Random noise vectors
D. The final output token

Answer: B. The encoder's input is usually the original input sequence (for example, an English sentence).

Question 57:
In the context of attention, what is the “query”?
A. The tokenized output sequence
B. The vector from the decoder or current position used to query the encoder outputs
C. The positional encoding vector
D. The softmax output

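Answer: B. The query vector usually comes from the decoder's current state; it is matched against the encoder's keys to decide which information to attend to.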

Question 58:
What problem does the use of subword segmentation (like BPE) directly address in NLP?
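A. Model underfitting
B. Vocabulary explosion
C. Vanishing gradients
D. Insufficient data parallelism

Answer: B. Subword segmentation reduces the vocabulary size and so addresses the vocabulary explosion caused by the variety of word forms.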

Question 59:
How do additive attention mechanisms compute the alignment score?
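A. By taking the dot product of the two vectors directly
B. By passing the query and key through a feedforward network with a non-linear transformation
C. By simple addition
D. By comparing word lengths directly

Answer: B. Additive attention combines the query and key and passes them through a feedforward network to compute the score, which introduces non-linearity.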

Question 60:
Which of the following best summarizes the overall benefit of attention mechanisms in NLP models?
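A. They eliminate the need for word embeddings
B. They let the model focus dynamically on the most relevant parts of the input, improving generation quality
C. They completely replace all traditional NLP methods
D. They reduce the model's need for training data

Answer: B. Attention lets the model focus dynamically on the most relevant information at each output step, which improves the quality and accuracy of the generated text.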
