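Origin

The Transformer was introduced in the paper "Attention Is All You Need". Today's mature large-model systems are built by keeping or dropping parts of the Transformer pipeline, which gives three families:
Encoder-only (BERT, RoBERTa, etc.), Decoder-only (the GPT series, LLaMA, OPT, BLOOM, etc.), Encoder-Decoder (T5, BART, GLM, etc.)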

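Workflow

Encoder: the sentence's embedding matrix is passed through the Encoder, which produces the encoding matrix A.
Decoder: the Decoder receives the encoding matrix A and, given the words up to position i of the sentence, predicts the word at position i+1, repeating until the last position.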
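
The encode-once, decode-step-by-step flow can be sketched with PyTorch's built-in `nn.Transformer`. This is only an illustration under stated assumptions: the weights are random (untrained), the vocabulary size, `BOS`/`EOS` ids and the helper name `greedy_decode` are hypothetical, and positional encoding is left out for brevity, so the generated ids are meaningless. What matters is the control flow.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy settings, chosen only for this sketch
VOCAB, D_MODEL, BOS, EOS = 20, 32, 1, 2

embed = nn.Embedding(VOCAB, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64,
                       dropout=0.0, batch_first=True)
to_vocab = nn.Linear(D_MODEL, VOCAB)  # maps decoder states to vocabulary scores
model.eval()


def greedy_decode(src_ids, max_len=10):
    """Encode the source once, then predict one token at a time."""
    with torch.no_grad():
        memory = model.encoder(embed(src_ids))  # encoding matrix A
        out = torch.tensor([[BOS]])             # decoder input starts with BOS
        for _ in range(max_len):
            length = out.size(1)
            # Causal mask: -inf above the diagonal hides future positions
            mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
            hidden = model.decoder(embed(out), memory, tgt_mask=mask)
            # The state at the last position i predicts the token at i+1
            next_id = to_vocab(hidden[:, -1]).argmax(dim=-1, keepdim=True)
            out = torch.cat([out, next_id], dim=1)
            if next_id.item() == EOS:
                break
    return out


result = greedy_decode(torch.tensor([[3, 4, 5, 6]]))
print(result.shape)
```

The Encoder runs a single time per input sentence, while the Decoder is called again for every new token with the sequence generated so far.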

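Encoder

Input embedding + positional encoding: convert the input sentence into a numeric matrix with one vector per token.
Multi-head attention: process that matrix with several self-attention heads in parallel. Each head projects its input into query (Q), key (K) and value (V) matrices and computes its own attention output; the outputs of all heads are concatenated along the feature dimension and passed through a final linear projection.
Add & Norm: add the attention output to the layer's input through a residual connection, then apply layer normalization.
Feed Forward: apply a position-wise feed-forward network to the result, giving the encoding matrix A.
Add & Norm: apply a second residual connection and layer normalization to the encoding matrix A.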
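
The encoder steps above can be sketched as a single layer in PyTorch. This is a minimal sketch, not a reference implementation: the names `positional_encoding` and `EncoderLayer` and all sizes are illustrative assumptions, and dropout and padding masks are omitted. Attention is written by hand so that the per-head Q/K/V split and the concatenation of the heads are visible.

```python
import math
import torch
import torch.nn as nn


def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding (d_model is assumed to be even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even feature indices
    pe[:, 1::2] = torch.cos(pos * div)  # odd feature indices
    return pe


class EncoderLayer(nn.Module):
    def __init__(self, d_model=32, n_heads=4, d_ff=64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # projection after concatenation
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def self_attention(self, x):
        b, t, d = x.shape

        def split(m):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return m.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = torch.softmax(scores, dim=-1) @ v
        # Concatenate the heads along the feature dimension
        concat = heads.transpose(1, 2).reshape(b, t, d)
        return self.w_o(concat)

    def forward(self, x):
        x = self.norm1(x + self.self_attention(x))  # Add & Norm
        return self.norm2(x + self.ffn(x))          # Feed Forward, Add & Norm


torch.manual_seed(0)
embed = nn.Embedding(100, 32)             # hypothetical vocabulary of 100 tokens
ids = torch.tensor([[5, 17, 42, 8]])      # one sentence of four token ids
x = embed(ids) + positional_encoding(4, 32)
A = EncoderLayer()(x)
print(A.shape)  # torch.Size([1, 4, 32])
```

The output keeps the input shape, one vector per token, which is what allows several such layers to be stacked.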

Decoder

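The first Multi-Head Attention layer is masked, so position i can attend only to positions 0 through i of the sentence; its output is the encoding matrix B.
The second Multi-Head Attention layer computes its K and V matrices from the Encoder output (encoding matrix A) and its Q matrix from the encoding matrix B.
The result then passes through Add & Norm and a Feed Forward layer, as in the Encoder, and a final linear layer with softmax gives the prediction for position i+1.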
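
The two attention layers of the Decoder can be sketched with PyTorch's `nn.MultiheadAttention`. This is a rough sketch under assumptions: the class name `DecoderLayer`, the sizes and the random inputs are made up for illustration, and dropout and the final linear + softmax layer are left out.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, d_model=32, n_heads=4, d_ff=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, memory):
        t = y.size(1)
        # Boolean mask: True marks future positions that must NOT be attended to
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        # First layer: masked self-attention over the target, giving matrix B
        b, _ = self.self_attn(y, y, y, attn_mask=causal)
        b = self.norm1(y + b)
        # Second layer: Q comes from B, K and V come from the encoder output A
        c, _ = self.cross_attn(query=b, key=memory, value=memory)
        c = self.norm2(b + c)
        return self.norm3(c + self.ffn(c))


torch.manual_seed(0)
layer = DecoderLayer()
memory = torch.randn(1, 5, 32)  # stands in for encoding matrix A (5 source tokens)
y = torch.randn(1, 4, 32)       # stands in for 4 embedded target tokens
out = layer(y, memory)
print(out.shape)  # torch.Size([1, 4, 32])
```

Because of the mask, changing a later target token leaves the outputs at earlier positions unchanged. This is what lets the model be trained on whole sentences while still predicting only from the past.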

Code example

# Import required libraries
import torch
# pytorch_transformers is the former name of this package; it is now published as transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)

# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()

# Use the GPU when one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

# Print the predicted word
print(predicted_text)

Output:

 What is the fastest car in the world
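
The example above keeps only the single most likely token via `argmax`. To look at several candidates and their probabilities, the same scores can be passed through softmax and `torch.topk`. The sketch below uses a made-up five-element logits vector in place of `predictions[0, -1, :]`, and the helper name `top_k_tokens` is hypothetical.

```python
import torch


def top_k_tokens(logits, k=3):
    """Return (token_id, probability) pairs for the k most likely next tokens."""
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(int(i), float(p)) for i, p in zip(top.indices, top.values)]


# Hypothetical logits over a 5-token vocabulary, for illustration only
fake_logits = torch.tensor([0.1, 2.0, -1.0, 0.5, 1.5])
candidates = top_k_tokens(fake_logits)
print([token_id for token_id, _ in candidates])  # [1, 4, 3]
```

With a real model, each returned id could be turned back into text with `tokenizer.decode([token_id])`.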
