Huggingface · Transformer 工程实战

HuggingFace 是什么

HuggingFace 不只是一个模型库。它是当前 NLP/AI 工程领域事实上的基础设施供应商，由以下几个核心部分组成：

Hub：模型、数据集、Space（演示应用）的托管平台。截至 2024 年，Hub 上有超过 70 万个公开模型。搜索”bert”能找到数千个变种，搜索”llama”能找到几十个量化版本。

transformers：核心 Python 库，封装了几乎所有主流预训练模型的加载、推理、微调接口。这是使用最频繁的部分。

datasets：数据集管理库，提供统一的加载接口和高效的数据处理（基于 Arrow 格式）。第 8 章微调时会用到。

tokenizers：高性能分词库，底层用 Rust 实现，比纯 Python 快 10-100 倍。transformers 库内部使用它。

accelerate：分布式训练和混合精度训练的抽象层，屏蔽了 DataParallel、DistributedDataParallel、DeepSpeed 等的差异。第 8 章微调时会用到。

PEFT：参数高效微调库，封装了 LoRA、Prefix Tuning、Prompt Tuning 等方法。

工程上最常用的是 transformers + Hub。本章以这两者为主，其他库在后续章节按需引入。

Pipeline：最高级的抽象

pipeline 是 transformers 提供的最高层抽象，一行代码完成任务。


from transformers import pipeline
 
classifier = pipeline("sentiment-analysis")
result = classifier("This movie is fantastic!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

pipeline 背后做了什么

调用 pipeline("sentiment-analysis") 时，库做了以下几件事：

根据任务名查找默认模型（sentiment-analysis 默认用 distilbert-base-uncased-finetuned-sst-2-english）
从 Hub 下载模型权重和 tokenizer 配置，缓存到本地
初始化 tokenizer 对象
初始化 model 对象，加载权重
返回一个封装好的 Pipeline 对象，调用时自动完成 tokenize → 模型前向 → 后处理

也可以手动指定模型：


classifier = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

pipeline 的适用场景

快速验证：想知道某个模型对你的数据表现如何，pipeline 最省事
原型开发：建 demo、写 POC 时不关心底层细节
批量推理：pipeline 支持传入列表，内部会做批处理

pipeline 的局限性

不能控制 tokenization 的细节（截断策略、padding 方式）
不能访问中间层输出（比如需要 embedding 向量时）
自定义后处理逻辑不方便
性能调优空间有限

遇到上面这些情况，需要绕开 pipeline，直接操作 Model + Tokenizer。

Model + Tokenizer：精细控制

当 pipeline 的封装不够用时，直接使用 Model 和 Tokenizer 对象。


from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
 
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
 
text = "This is a great product!"
# tokenize：把文本转成 token id
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
 
# 模型前向传播
with torch.no_grad():
    outputs = model(**inputs)
 
# outputs.logits 形状：[batch_size, num_labels]
logits = outputs.logits
probabilities = torch.softmax(logits, dim=-1)
predicted_class = torch.argmax(probabilities, dim=-1).item()

什么情况下需要绕开 pipeline

需要 embedding 向量：用 BertModel（不带分类头），取 last_hidden_state 或 [CLS] 向量
批量处理大数据集：需要控制 batch_size、padding 策略、DataLoader
多任务/多输入：比如句子对输入（tokenizer(text_a, text_b, ...)）
访问注意力权重：model(..., output_attentions=True) 返回每层的注意力矩阵
自定义解码策略：文本生成时需要调整 temperature、top_p、repetition_penalty 等

tokenizer 返回内容


inputs = tokenizer("Hello world", return_tensors="pt")
# {
#   'input_ids': tensor([[101, 7592, 2088, 102]]),       # token id
#   'token_type_ids': tensor([[0, 0, 0, 0]]),            # 句子段落 id（BERT 用）
#   'attention_mask': tensor([[1, 1, 1, 1]])             # 1 表示真实 token，0 表示 padding
# }

attention_mask 在批量处理时尤为重要——不同长度的序列 padding 到同一长度后，padding 位置的 mask 为 0，告诉模型忽略这些位置。

AutoClass 设计

HuggingFace 的 AutoClass 是一套工厂模式，根据模型配置文件自动选择正确的类。

问题背景：Hub 上有数十万个模型，每种架构（BERT、RoBERTa、GPT-2、T5……）都有自己的 Python 类（BertModel、RobertaModel 等）。如果要写通用代码，不可能 if-else 穷举所有架构。

解决方案：AutoClass 读取模型目录下的 config.json，从 "model_type" 字段知道是哪种架构，然后映射到对应的类：


# config.json 里有 "model_type": "bert"
# AutoTokenizer 自动选择 BertTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
 
# AutoModel 自动选择 BertModel
model = AutoModel.from_pretrained("bert-base-uncased")

常用 AutoClass 列表

AutoClass	说明	对应任务
`AutoTokenizer`	自动选择 tokenizer	所有任务
`AutoModel`	基础模型，无任务头	获取 embedding
`AutoModelForSequenceClassification`	带分类头	文本分类、情感分析
`AutoModelForTokenClassification`	每 token 分类头	NER、词性标注
`AutoModelForQuestionAnswering`	起止位置预测头	抽取式问答
`AutoModelForCausalLM`	因果语言模型头	文本生成（Decoder-Only）
`AutoModelForSeq2SeqLM`	Seq2Seq 语言模型头	翻译、摘要（Encoder-Decoder）

为什么推荐用 AutoClass 而不是具体类

写 AutoTokenizer.from_pretrained(model_name) 而不是 BertTokenizer.from_pretrained(model_name)，代码就和具体架构解耦了。换一个模型（比如从 BERT 换到 RoBERTa），只需改 model_name 字符串，其他代码不用动。

HuggingFace Hub

搜索模型

Hub 的网站（huggingface.co）提供过滤器：

Task：选任务类型（Text Classification、Token Classification、Question Answering 等）
Language：选语言（zh、en、multilingual 等）
Library：transformers、sentence-transformers 等
Downloads（排序）：按下载量排序，通常是最实用的指标

也可以用 huggingface_hub 库在代码中搜索：


from huggingface_hub import list_models
 
# 搜索中文文本分类模型，按下载量排序，取前 5 个
models = list(list_models(
    task="text-classification",
    language="zh",
    sort="downloads",
    limit=5,
))
for m in models:
    print(m.id, m.downloads)

读模型卡片

模型卡片（Model Card）是判断模型是否适合自己任务的关键文档。重点看：

Intended uses & limitations：模型设计用途和已知局限
Training data：在什么数据上训练的，和你的目标领域是否匹配
Evaluation results：在哪些 benchmark 上跑了什么分数
How to use：官方推荐的使用方式（任务前缀、特殊 token 等）

一个常见陷阱：下载量高不等于适合你的任务。bert-base-uncased 下载量极高，但它是未经任务微调的基础模型，直接用于情感分类效果差——需要进一步微调，或换 distilbert-base-uncased-finetuned-sst-2-english（已在 SST-2 数据集上微调过的版本）。

本地缓存机制

from_pretrained 下载的文件缓存在本地，默认路径：


~/.cache/huggingface/hub/

目录结构：


~/.cache/huggingface/hub/
  models--bert-base-uncased/
    snapshots/
      <commit-hash>/       # 模型文件（config.json、pytorch_model.bin 等）
    refs/
      main                 # 指向最新 commit hash 的指针

修改缓存路径：设置环境变量 HF_HOME 或 TRANSFORMERS_CACHE。


export HF_HOME=/data/hf_cache

离线使用：设置 TRANSFORMERS_OFFLINE=1，强制只从本地缓存加载，不发送网络请求。适合生产环境或无网络的服务器。


import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"
 
# 此时 from_pretrained 只从本地缓存加载，找不到则报错
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

常见任务速查

以下给出六种常见任务的 pipeline 代码片段。每种任务选用了已经过微调的模型，可以直接使用。

文本分类（情感分析）


from transformers import pipeline
 
# 情感分析：判断文本是正面还是负面
# 典型应用：用户评价分析、舆情监控
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
result = classifier("The service was incredibly disappointing.")
# [{'label': 'NEGATIVE', 'score': 0.9994}]

命名实体识别（NER）


from transformers import pipeline
 
# NER：识别文本中的人名、地名、组织名等实体
# 典型应用：信息抽取、知识图谱构建
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple"  # 合并同一实体的多个 token
)
result = ner("Hugging Face is a company based in New York City.")
# [
#   {'entity_group': 'ORG', 'word': 'Hugging Face', 'score': 0.99},
#   {'entity_group': 'LOC', 'word': 'New York City', 'score': 0.99}
# ]

问答（抽取式）


from transformers import pipeline
 
# 抽取式问答：从给定段落中找到问题的答案
# 典型应用：文档问答、客服知识库检索
qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2"
)
result = qa(
    question="Where is Hugging Face based?",
    context="Hugging Face is an AI company headquartered in New York City, with offices in Paris."
)
# {'answer': 'New York City', 'score': 0.98, 'start': 45, 'end': 58}

文本摘要


from transformers import pipeline
 
# 文本摘要：将长文本压缩为简短摘要
# 典型应用：新闻摘要、会议纪要、文档压缩
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn"
)
article = """
The Transformer architecture, introduced in the paper "Attention Is All You Need" (2017),
revolutionized natural language processing. Unlike previous recurrent models, Transformers
use self-attention mechanisms to process all tokens in parallel, making training significantly
faster. This architecture became the foundation for BERT, GPT, and virtually all modern
large language models.
"""
result = summarizer(article, max_length=60, min_length=20, do_sample=False)
# [{'summary_text': 'The Transformer architecture ...'}]

文本生成


from transformers import pipeline
 
# 文本生成：给定 prompt，续写文本
# 典型应用：内容创作辅助、代码补全、对话
generator = pipeline(
    "text-generation",
    model="gpt2"
)
result = generator(
    "The future of artificial intelligence",
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,   # 采样解码，结果有随机性
    temperature=0.8,  # 控制随机性：越低越保守，越高越多样
)
# [{'generated_text': 'The future of artificial intelligence ...'}]

Embedding 生成


from transformers import pipeline
 
# 特征提取：获取文本的向量表示
# 典型应用：语义搜索、RAG、文本聚类、相似度计算
feature_extractor = pipeline(
    "feature-extraction",
    model="sentence-transformers/all-MiniLM-L6-v2",
    tokenize_kwargs={"truncation": True}
)
result = feature_extractor("Hello, how are you?")
# result 是一个嵌套列表，形状 [1, seq_len, hidden_size]
# 取平均池化得到句子向量（单句无 padding，直接取均值）
import numpy as np
sentence_embedding = np.mean(result[0], axis=0)  # 形状 [hidden_size]
print(f"句子向量维度：{sentence_embedding.shape}")  # (384,)

对于 embedding 的生产使用，推荐直接用 sentence-transformers 库，它封装了池化和归一化逻辑，接口更简洁。第 10 章语义搜索会详细展开。