# LLM Inference System Techniques: A Guided Book

## Overview
| Layer | Name | Description | Representative techniques / modules |
|---|---|---|---|
| L0 | Hardware Layer | Physical foundation of compute and bandwidth | GPU (H100, MI300, Ascend), HBM, NVLink, PCIe, InfiniBand |
| L1 | CUDA Runtime & Kernel Layer | Programming interfaces and parallel execution model over the hardware | CUDA Core, Tensor Core, CUDA Graph, Stream, Kernel Launch, memory hierarchy (global/shared/register) |
| L2 | Distributed Communication Layer | Coordination across multiple GPUs and nodes | NCCL, RCCL, UCX, RDMA, NVLink/NVSwitch, collective ops (AllReduce/AllGather/Broadcast) |
| L3 | Model Execution Layer | Executes Transformer and other model computation | FlashAttention, fused MLP, RoPE, KV cache, quantization kernels, Triton / TileLang / CUTLASS |
| L4 | System Algorithms Layer | High-level inference and optimization algorithms | Speculative decoding, paged KV cache, continuous batching, prefill/decode pipeline, dynamic batching |
| L5 | Parallelism Strategy Layer | Partitioning and synchronizing large models across GPUs | TP (Tensor Parallelism), PP (Pipeline Parallelism), DP (Data Parallelism), MoE parallelism, ZeRO/FSDP, sequence parallelism |
| L6 | Serving & Scheduler Layer | Orchestration, resource scheduling, and request queue management | vLLM / TensorRT-LLM / TGI / SGLang, token scheduler, memory pool, request batching, graph capture, speculative serving |
| L7 | Application Layer | Interface to end users and downstream systems | API, RESTful service, chat interface, RAG integration, caching system |

## Details
## L0. Hardware Layer

Goal: understand the compute structure, memory-access bottlenecks, and interconnect topology of the underlying hardware, as a foundation for performance analysis and optimization.

### 📚 Core Topics

- GPU compute units: SM, warp, thread, block, Tensor Core
- Memory hierarchy: registers / shared memory / L2 / HBM
- Memory-bandwidth vs. compute-bound analysis
- GPU topology: PCIe / NVLink / NVSwitch / InfiniBand
- Impact of NVLink/NVSwitch topology on AllReduce performance
- GPU performance modeling: roofline, occupancy, throughput, latency
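As a concrete instance of roofline analysis, the sketch below classifies a kernel as memory- or compute-bound from its arithmetic intensity. The peak numbers are illustrative (roughly H100-class), not measured values.

```python
def roofline_bound(flops, bytes_moved, peak_tflops, peak_bw_gbs):
    """Classify a kernel via the roofline model.

    Attainable perf = min(peak_compute, intensity * peak_bandwidth),
    where intensity = FLOPs per byte moved through HBM.
    """
    intensity = flops / bytes_moved                      # FLOP/byte
    ridge = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)   # ridge point (FLOP/byte)
    attainable = min(peak_tflops * 1e12,
                     intensity * peak_bw_gbs * 1e9)      # FLOP/s
    return ("compute-bound" if intensity >= ridge else "memory-bound",
            attainable)

# Illustrative peaks: 989 TFLOPS FP16, 3350 GB/s HBM.
# 4096^3 GEMM: 2*N^3 FLOPs over ~3*N^2 fp16 operands -> high intensity.
bound, _ = roofline_bound(2 * 4096**3, 3 * 2 * 4096**2, 989, 3350)
print(bound)  # compute-bound
# Elementwise fp32 add: 1 FLOP per 12 bytes (read 2 operands, write 1).
bound, _ = roofline_bound(2**20, 12 * 2**20, 989, 3350)
print(bound)  # memory-bound
```

The ridge point (peak FLOP/s divided by peak bandwidth) is the intensity at which a kernel stops being limited by HBM traffic, which is why attention and small-batch decode are memory-bound while large GEMMs are not.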
### 🧩 Tools

- `nvidia-smi`, `nvcc --ptxas-options=-v`
- Nsight Compute / Nsight Systems / CUPTI / perf
- DCGM (NVIDIA GPU telemetry)
- `nvidia-smi topo -m` to query the GPU interconnect topology

### 🔍 Recommended Reading

- NVIDIA GPU Architecture Whitepapers (Volta → Hopper)
- "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Williams et al., 2009)
## L1. CUDA Runtime & Kernel Layer

Goal: be able to write and optimize GPU kernels, and understand the CUDA programming model, parallel execution, and memory management.

### 📚 Core Topics

- CUDA programming model (thread/block/grid, SIMT)
- Warp execution, branch divergence, synchronization (`__syncthreads()`, barriers)
- CUDA streams and CUDA Graph execution (reducing launch overhead)
- Kernel fusion and kernel launch overhead
- Memory management: pinned memory / unified memory / async copy / memory pools
- Tensor Core MMA instructions, CUDA WMMA, CUTLASS basics
- Kernel profiling and warp stall analysis
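How register and shared-memory usage cap occupancy can be sketched as a small calculator. The per-SM limits below are illustrative defaults for a recent NVIDIA SM, not values queried from a device; real numbers come from the occupancy API or Nsight Compute.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              sm_regs=65536, sm_smem=102400, sm_max_threads=2048,
              sm_max_blocks=32):
    """Estimate theoretical occupancy: the fraction of an SM's maximum
    resident threads that can be scheduled, given whichever resource
    (registers, shared memory, thread slots, block slots) runs out first.
    The sm_* defaults are illustrative, not a specific chip's limits."""
    blocks = min(
        sm_regs // (regs_per_thread * threads_per_block),          # register limit
        sm_smem // smem_per_block if smem_per_block else sm_max_blocks,
        sm_max_threads // threads_per_block,                       # thread-slot limit
        sm_max_blocks,                                             # block-slot limit
    )
    return blocks * threads_per_block / sm_max_threads

# 256 threads/block, 32 regs/thread, 16 KiB shared memory per block:
# shared memory is the binding constraint (6 blocks resident).
print(occupancy(256, 32, 16384))  # 0.75
```

This is exactly the analysis `nvcc --ptxas-options=-v` plus an occupancy calculator performs: the printout of registers and shared memory per kernel feeds the same min() over resource limits.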
### 🧩 Tools

- Nsight Compute, Nsight Systems
- CUDA Graph capture API
- CUPTI event counters

### 🔍 Source Code References

### 🔍 Recommended Papers

- FlashAttention: "Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022)
- Kernel fusion: "A Study of Deep Learning Operator Fusion" (Google XLA team)
## L2. Distributed Communication Layer

Goal: understand communication mechanisms across GPUs and nodes, and master efficient AllReduce / Scatter / Gather.

### 📚 Core Topics

- NCCL collective primitives: AllReduce / AllGather / ReduceScatter / Broadcast
- Ring vs. tree topology algorithms
- Overlapping compute and communication
- CUDA-aware communication and RDMA
- Hierarchical communication (node-local + inter-node)
- NCCL groups, stream groups, collective fusion
- InfiniBand, UCX, GPUDirect RDMA
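A plain-Python simulation of ring AllReduce (reduce-scatter followed by all-gather) makes the 2(N−1) communication steps concrete. Real implementations live inside NCCL; this is only a single-process sketch of the algorithm.

```python
def ring_allreduce(ranks):
    """Simulate ring AllReduce over a list of per-rank vectors,
    one chunk per rank.

    Phase 1 (reduce-scatter): for N-1 steps, each rank sends one chunk
    to its right neighbor, which accumulates it; afterwards rank i
    holds the fully reduced chunk (i+1) % N.
    Phase 2 (all-gather): for N-1 more steps, the reduced chunks
    circulate until every rank holds the complete sum.
    """
    n = len(ranks)
    data = [list(r) for r in ranks]
    for step in range(n - 1):                      # reduce-scatter
        msgs = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n])
                for i in range(n)]                 # snapshot simultaneous sends
        for dst, chunk, val in msgs:
            data[dst][chunk] += val
    for step in range(n - 1):                      # all-gather
        msgs = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                for i in range(n)]
        for dst, chunk, val in msgs:
            data[dst][chunk] = val                 # overwrite with reduced chunk
    return data

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])[0])  # [12, 15, 18]
```

Each rank sends 2(N−1)/N of the buffer in total regardless of N, which is why the ring algorithm is bandwidth-optimal and why NCCL falls back to trees mainly to cut latency at large scale.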
### 🧩 Tools

- `nccl-tests` / NCCL debug env vars (`NCCL_DEBUG=INFO`)
- Nsight Systems for inspecting compute/communication overlap
- `ibstat`, `nvidia-smi topo -m`

### 🔍 Source Code References

### 🔍 Recommended Papers

- Baidu Ring AllReduce (Li et al., 2017)
- Megatron-LM (Shoeybi et al., 2019)
## L3. Model Execution Layer

Goal: understand the Transformer inference compute graph and its operator-level optimizations.

### 📚 Core Topics

- Transformer architecture (attention, FFN, LayerNorm, RoPE)
- FlashAttention / fused MLP / QKV projection optimization
- KV cache management (paged, dynamic, hierarchical)
- Quantization and operator fusion: INT8 / FP8 / SmoothQuant / AWQ
- Writing high-performance operators in Triton / TileLang / CUTLASS
- Kernel autotuning (meta-schedulers)
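Per-tensor symmetric INT8 quantization, the basic building block that schemes like SmoothQuant and AWQ refine, can be sketched as a round-trip in plain Python. This is illustrative, not any library's actual kernel.

```python
def quantize_int8(x):
    """Per-tensor symmetric quantization: scale floats so max(|x|)
    maps to 127, then round and clamp into int8 range.
    Returns (int8 values, scale)."""
    scale = max(abs(v) for v in x) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats: x_hat = q * scale."""
    return [v * scale for v in q]

x = [0.1, -2.54, 1.0, 0.0]
q, s = quantize_int8(x)
print(q)  # [5, -127, 50, 0]
x_hat = dequantize_int8(q, s)
# Rounding error is bounded by half a quantization step (scale / 2).
print(max(abs(a - b) for a, b in zip(x, x_hat)) <= s / 2)  # True
```

The pain point this exposes is outlier sensitivity: a single large activation inflates the scale and wastes most of the int8 range, which is precisely what SmoothQuant's activation-to-weight scale migration and AWQ's activation-aware weight scaling address.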
### 🧩 Tools

- Triton profiler (`triton.testing`)
- PyTorch Profiler / TensorRT Profiler
- vLLM memory profiler

### 🔍 Source Code References

### 🔍 Recommended Papers

- FlashAttention-2 (Dao, 2023)
- PagedAttention (Kwon et al., 2023)
- SmoothQuant (Xiao et al., 2022)
- AWQ (Lin et al., 2023)
## L4. System Algorithms Layer

Goal: understand inference-time scheduling algorithms and memory-management logic.

### 📚 Core Topics

- Continuous batching (dynamic batching)
- Speculative decoding / draft verification
- Prefill & decode pipeline
- KV cache eviction / page pools / cache table management
- CUDA graph reuse (graph capture)
- Dynamic shape inference
- Context parallelism / dynamic attention mask construction
- Memory managers / arena allocators
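The page-pool idea behind paged KV caches can be sketched as a tiny block allocator: sequences draw fixed-size blocks from a free list, and a per-sequence block table maps logical token positions to physical blocks, eliminating contiguous-allocation fragmentation. Names and sizes here are illustrative, not vLLM's actual data structures.

```python
class PagedKVAllocator:
    """Minimal page-pool sketch: fixed-size KV blocks handed out from a
    free list, tracked per sequence in a block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token, grabbing a new block
        only when the sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        used = self.lengths.get(seq_id, 0)
        if used % self.block_size == 0:       # current block is full
            if not self.free:
                raise MemoryError("KV pool exhausted: preempt or evict")
            table.append(self.free.pop())
        self.lengths[seq_id] = used + 1

    def free_sequence(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(33):                  # 33 tokens -> ceil(33/16) = 3 blocks
    alloc.append_token("req-0")
print(len(alloc.tables["req-0"]))    # 3
alloc.free_sequence("req-0")
print(len(alloc.free))               # 4
```

Because blocks are released as soon as a sequence retires, the worst-case waste per sequence is one partially filled block, which is what lets continuous batching pack far more concurrent requests into the same HBM budget.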
### 🧩 Tools

- Nsight Systems timeline analysis
- vLLM profiling hooks (`--enable-memory-profile`)
- PyTorch CUDA Graph API

### 🔍 Source Code References

### 🔍 Recommended Papers

- vLLM (Kwon et al., 2023)
- SpecInfer (2023)
- SGLang (2024) – dynamic serving with speculative decoding
## L5. Parallelism Strategy Layer

Goal: master the parallelism strategies, communication patterns, and tensor partitioning schemes used in large-model inference.

### 📚 Core Topics

- Tensor parallelism (intra-layer)
- Pipeline parallelism (inter-layer)
- Data parallelism (batch-level)
- Expert / Mixture-of-Experts parallelism
- Sequence parallelism / context parallelism
- ZeRO Stages 1–3, FSDP, parameter sharding
- Load balancing and activation checkpointing
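Column- and row-parallel linear layers, the core of Megatron-style tensor parallelism, can be illustrated with plain-Python matrices: splitting W by columns needs only a concatenation of shard outputs (AllGather), while splitting by rows produces partial sums that must be added elementwise (AllReduce). This is a single-process sketch, not a distributed implementation.

```python
def matmul(a, b):
    """Naive dense matmul on nested lists: (m x k) @ (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_parallel(x, w, shards):
    """Split W by columns across `shards` ranks; each rank computes
    x @ W_shard, and shard outputs are concatenated (the AllGather)."""
    n = len(w[0]) // shards
    outs = [matmul(x, [row[r * n:(r + 1) * n] for row in w])
            for r in range(shards)]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

def row_parallel(x, w, shards):
    """Split W by rows (and x by columns); each rank computes a partial
    product, and partials are summed elementwise (the AllReduce)."""
    k = len(w) // shards
    partials = [matmul([row[r * k:(r + 1) * k] for row in x],
                       w[r * k:(r + 1) * k]) for r in range(shards)]
    return [[sum(p[i][j] for p in partials) for j in range(len(w[0]))]
            for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[5, 6], [7, 8]]
ref = matmul(x, w)
print(column_parallel(x, w, 2) == ref)  # True
print(row_parallel(x, w, 2) == ref)     # True
```

Pairing a column-parallel projection with a row-parallel one (as in a Transformer MLP or attention block) keeps the intermediate activations sharded and leaves a single AllReduce per block.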
### 🔍 Source Code References

### 🔍 Recommended Papers

- Megatron-LM (Shoeybi et al., 2019)
- DeepSpeed ZeRO (Rajbhandari et al., 2020)
- GSPMD (Xu et al., 2021)
- Alpa (Zheng et al., 2022)
## L6. Serving & Scheduler Layer

Goal: understand how an LLM serving system schedules requests and manages memory and resources.

### 📚 Core Topics

- Token-level scheduling (prefill/decode overlap)
- Batch padding and token streaming
- Async engine and request queues
- CUDA graph and stream reuse
- Memory pool allocation and fragmentation reclamation
- Multi-model multiplexing (co-hosting multiple models)
- RESTful API / gRPC / streaming output
- Profiling and SLA monitoring
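A token-level continuous-batching loop can be sketched as: each iteration, admit waiting requests while the KV-token budget allows, run one decode step for every running request, and retire finished requests immediately so queued work joins mid-flight. The budget model and request fields are illustrative, not a particular engine's scheduler.

```python
from collections import deque

def serve(requests, kv_budget):
    """Continuous batching sketch. `requests`: list of (id, num_tokens).
    Admission reserves a request's full KV footprint up front (assumes
    every request fits the budget); each step decodes one token per
    running request; finished requests free their reservation at once,
    so there are no batch-boundary stalls."""
    waiting = deque(requests)
    remaining = {}                      # id -> tokens left to decode
    reserved = {}                       # id -> KV tokens reserved
    finished, steps = [], 0
    while waiting or remaining:
        # Admission: fill the batch up to the KV budget, FIFO order.
        while waiting and sum(reserved.values()) + waiting[0][1] <= kv_budget:
            rid, need = waiting.popleft()
            remaining[rid], reserved[rid] = need, need
        steps += 1                      # one decode step for the whole batch
        for rid in list(remaining):
            remaining[rid] -= 1
            if remaining[rid] == 0:     # retire and release KV space
                del remaining[rid], reserved[rid]
                finished.append(rid)
    return finished, steps

# Budget of 8 KV tokens: "a"(4) and "b"(4) run together; "c"(4) is
# admitted the moment they retire, without waiting for a new batch.
done, steps = serve([("a", 4), ("b", 4), ("c", 4)], kv_budget=8)
print(done, steps)  # ['a', 'b', 'c'] 8
```

The contrast with static batching is the admission check inside the loop: arrival and departure happen per decode step rather than per batch, which is the scheduling insight behind Orca-style iteration-level scheduling and vLLM.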
### 🔍 Source Code References

### 🔍 Recommended Papers

- vLLM: "Easy, Fast, and Cheap LLM Serving with PagedAttention" (2023)
- SGLang (2024)
- SpecInfer (2023)
## L7. Application Layer

Goal: understand how the LLM inference system interfaces with upper-layer applications.

### 📚 Core Topics

- Prompt cache / embedding cache / RAG pipeline
- Token streaming protocols (WebSocket / HTTP chunked transfer)
- Multi-turn sessions and memory context
- Load balancing / autoscaling / failover
- Monitoring and observability (Prometheus, Grafana)
- Cost optimization and resource scheduling
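Prompt caching cuts prefill cost by reusing KV state for shared prompt prefixes (e.g. a common system prompt). The longest-prefix lookup can be sketched with a dict keyed by prefix digests; the structure and the `kv_handle` placeholder are illustrative, not a specific system's implementation.

```python
import hashlib

class PromptCache:
    """Sketch of a prompt prefix cache: store an opaque KV-state handle
    per cached prefix, and on lookup return the longest cached prefix so
    prefill only needs to run on the uncached tail."""

    def __init__(self):
        self.store = {}                          # prefix digest -> kv handle

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def put(self, tokens, kv_handle):
        self.store[self._key(tokens)] = kv_handle

    def longest_prefix(self, tokens):
        """Return (prefix_len, kv_handle) for the longest cached prefix,
        or (0, None) on a complete miss."""
        for n in range(len(tokens), 0, -1):
            hit = self.store.get(self._key(tokens[:n]))
            if hit is not None:
                return n, hit
        return 0, None

cache = PromptCache()
system = ["<sys>", "You", "are", "helpful"]
cache.put(system, kv_handle="kv@0")              # cache the system prompt
n, kv = cache.longest_prefix(system + ["Hi", "there"])
print(n, kv)  # 4 kv@0  -> prefill only the last 2 tokens
```

Production systems replace the linear scan with a radix tree over token blocks and add eviction, but the contract is the same: a hit converts O(prompt) prefill work into O(tail).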
### 🔍 Example Projects

- OpenAI API serving architecture
- ChatGPT / Claude session management
- vLLM + FastAPI / RAG fusion (LangChain, LlamaIndex)