# CANN/cannbot-skills: Attention Layer MLA + Indexer Sparse Path
cannbot-skills: CANNBot is a family of agents for CANN development, built to improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

## Reference models

- `cann-recipes-infer/models/deepseek-v3.2-exp/`
- `cann-recipes-infer/models/glm-5/`

## Core characteristics

Builds on MLA Absorb by adding an Indexer (Top-K KV block selection) and replacing FA with its sparse version. Prefill and Decode share the `forward_absorb` path.

## Attention pipeline (shared by Prefill and Decode)

```python
# ─── Pre-Norm ───
hidden_states, residual = npu_add_rms_norm(residual, hidden_states, weight, eps)

# ─── Super-fused prolog (both Prefill and Decode use npu_mla_prolog_v3) ───
q_nope, q_pe, _, qr, _ = npu_mla_prolog_v3(
    token_x=hidden_states,
    weight_dq=...,
    weight_uq_qr=...,
    weight_uk=kv_b_proj_w_k,
    weight_dkv_kr=...,
    kv_cache=nope_cache,
    kr_cache=rope_cache,
    cache_mode="PA_BSND",      # note: not PA_NZ
    cache_index=slot_mapping,  # Decode
    # Prefill with cp_size > 1: cache_mode="BSND", then write the cache manually with scatter_update_
    weight_quant_mode=1)       # only the q_b weight is quantized
# C8 path additionally passes kv_cache_quant_mode=3, tile_size=128

# ─── Indexer (Top-K KV block selection) ───
# Internal flow: projection → RoPE (npu_rotary_mul) → quantization (npu_dynamic_quant, C8 only) → Top-K
topk_indices = npu_lightning_indexer(q, k, weights, ...)
# C8 path: npu_lightning_indexer_quant(...)
# Indexer KV cache: Prefill writes with scatter_update_, Decode with npu_scatter_nd_update_

# ─── Sparse Flash Attention (key = value = latent cache, selected via topk_indices) ───
# FP16 path
output = npu_sparse_flash_attention(
    query=q_nope, key=k_latent, value=k_latent,
    query_rope=q_pe, key_rope=k_pe,
    sparse_indices=topk_indices,
    layout_query="TND", layout_kv="PA_BSND",
    sparse_mode=3)
# C8 path
output = npu_sparse_flash_attention_antiquant(
    query=torch.cat([q_nope, q_pe], dim=-1),  # concatenate the no-RoPE and RoPE parts
    key=k_latent, value=k_latent,
    key_quant_mode=2, attention_mode=2, rope_head_dim=64,
    # ... (remaining arguments elided in the source)
)
```
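The prolog's `cache_index=slot_mapping` and the Decode-side `npu_scatter_nd_update_` both write new entries into a paged cache by slot. Below is a minimal NumPy sketch of that scatter. It assumes a cache shaped `[num_blocks, block_size, heads, dim]` and the common convention `slot = block_id * block_size + offset`; both are assumptions, as is the hypothetical helper name `write_paged_cache`. It does not describe the NPU operators' exact semantics or the `PA_BSND` memory layout.

```python
import numpy as np


def write_paged_cache(cache, new_kv, slot_mapping):
    """Hypothetical helper: scatter new_kv[t] into the paged cache at slot_mapping[t].

    cache:        [num_blocks, block_size, heads, dim]  (assumed paged layout)
    new_kv:       [num_tokens, heads, dim]
    slot_mapping: [num_tokens] flat slot ids, slot = block_id * block_size + offset (assumed)
    """
    block_size = cache.shape[1]
    block_ids = slot_mapping // block_size
    offsets = slot_mapping % block_size
    cache[block_ids, offsets] = new_kv  # in place, like the trailing-underscore torch ops
    return cache


# 4 blocks of 2 slots, 1 head, dim 3; write 3 tokens into scattered slots.
cache = np.zeros((4, 2, 1, 3), dtype=np.float32)
new_kv = np.arange(9, dtype=np.float32).reshape(3, 1, 3)
slot_mapping = np.array([5, 0, 6])  # -> (block 2, offset 1), (block 0, offset 0), (block 3, offset 0)
write_paged_cache(cache, new_kv, slot_mapping)
```

Tokens need not land in contiguous memory; only the slot ids decide where each one goes. This is what lets Prefill and Decode share the same block-table-based cache.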
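The Indexer scores every cached position for each query and keeps the `top_k` highest. The sketch below assumes the lightning-indexer scoring rule published for DeepSeek-V3.2: a per-head ReLU of the query-key dot product, weighted by per-head `weights` and summed over heads. It uses a hypothetical helper `indexer_topk`. Whether `npu_lightning_indexer` computes exactly this is not established here. Block granularity, causal masking, RoPE, and quantization are also left out.

```python
import numpy as np


def indexer_topk(q, k, weights, top_k):
    """Score cached positions per query and return top_k indices (hypothetical helper).

    q:       [T, H, D]  indexer queries (T tokens, H indexer heads)
    k:       [S, D]     indexer keys, one per cached position, shared by all heads
    weights: [T, H]     per-token, per-head weights
    Returns  [T, top_k] indices into the S cached positions, highest score first.
    """
    logits = np.einsum("thd,sd->ths", q, k)                              # per-head dot products
    scores = np.einsum("th,ths->ts", weights, np.maximum(logits, 0.0))   # ReLU, weight, sum heads
    top_k = min(top_k, k.shape[0])
    # Descending sort; a stable sort keeps ties in position order.
    return np.argsort(-scores, axis=-1, kind="stable")[:, :top_k]


q = np.array([[[1.0, 0.0]]])                                      # T=1, H=1, D=2
k = np.array([[0.5, 0.0], [2.0, 0.0], [-1.0, 0.0], [1.0, 0.0]])   # S=4
weights = np.ones((1, 1))
print(indexer_topk(q, k, weights, top_k=2))  # → [[1 3]]
```

The sparse attention step then only has to read `top_k` rows per query instead of the full history.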
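On the C8 path the Indexer quantizes with `npu_dynamic_quant` before Top-K. A common form of dynamic quantization is symmetric per-token int8, in which each row gets its own scale `max|x| / 127`. The sketch below illustrates that scheme with a hypothetical `dynamic_quant_int8` helper. Treating it as the operator's exact rounding and scale convention is an assumption.

```python
import numpy as np


def dynamic_quant_int8(x):
    """Symmetric per-token (per-row) dynamic int8 quantization (sketch).

    Returns (x_q, scale) such that x ≈ x_q * scale[..., None].
    """
    x = np.asarray(x, dtype=np.float32)
    amax = np.max(np.abs(x), axis=-1)
    # All-zero rows get scale 1.0 so the division below never divides by zero.
    scale = np.where(amax > 0, amax / 127.0, 1.0).astype(np.float32)
    x_q = np.clip(np.rint(x / scale[..., None]), -127, 127).astype(np.int8)
    return x_q, scale


x = np.array([[0.5, -1.0, 0.25], [0.0, 0.0, 0.0]], dtype=np.float32)
x_q, scale = dynamic_quant_int8(x)
x_hat = x_q.astype(np.float32) * scale[:, None]  # dequantize to inspect the round-trip error
```

Because the scale is computed per token at runtime, a single outlier token does not degrade the precision of the others.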
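`key` and `value` are both the latent cache. The prolog also receives `weight_uk=kv_b_proj_w_k`, so `q_nope` is presumably already in latent space. Under these conditions the sparse FA step amounts to ordinary attention over the `top_k` gathered latent rows with the RoPE logits added, followed by the V absorb matmul with `kv_b_proj_w_v`. The single-token NumPy reference below shows this and checks it against the expand-K/V formulation (as in the MLA Absorb Prefill path). All shapes, the softmax scale, and the names `sparse_mla_reference`, `w_uk`, `w_uv` are illustrative assumptions, not the contract of `npu_sparse_flash_attention`.

```python
import numpy as np


def sparse_mla_reference(q_nope, q_pe, k_latent, k_pe, topk_indices, w_uv, softmax_scale):
    """Single-token sparse MLA attention in absorbed form (hypothetical reference).

    q_nope:       [H, C]    query already multiplied by W_uk (C = latent dim)
    q_pe:         [H, R]    RoPE part of the query
    k_latent:     [S, C]    latent cache, used as both key and value
    k_pe:         [S, R]    RoPE part of the key, shared by all heads
    topk_indices: [K]       cached positions selected by the indexer
    w_uv:         [H, C, V] per-head V up-projection (the V absorb weight)
    Returns       [H, V]    per-head output before o_proj.
    """
    c = k_latent[topk_indices]                               # [K, C] gather selected rows
    r = k_pe[topk_indices]                                   # [K, R]
    logits = (q_nope @ c.T + q_pe @ r.T) * softmax_scale     # [H, K]
    logits -= logits.max(axis=-1, keepdims=True)             # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    out_latent = probs @ c                                   # [H, C]: value = latent cache
    return np.einsum("hc,hcv->hv", out_latent, w_uv)         # V absorb


# Random inputs, checked against the "expand K/V, then attend" formulation.
rng = np.random.default_rng(0)
H, C, R, Dn, V, S = 2, 8, 4, 6, 5, 16
k_latent = rng.standard_normal((S, C))
k_pe = rng.standard_normal((S, R))
w_uk = rng.standard_normal((H, C, Dn))   # hypothetical K up-projection
w_uv = rng.standard_normal((H, C, V))    # hypothetical V up-projection
q_full = rng.standard_normal((H, Dn))    # query before absorbing W_uk
q_pe = rng.standard_normal((H, R))
idx = np.array([3, 7, 0, 12])
scale = 1.0 / np.sqrt(Dn + R)

q_nope = np.einsum("hcd,hd->hc", w_uk, q_full)   # absorb W_uk into the query
out = sparse_mla_reference(q_nope, q_pe, k_latent, k_pe, idx, w_uv, scale)

# Expanded form: per-head K/V materialized for the selected rows only.
k_nope = np.einsum("sc,hcd->hsd", k_latent[idx], w_uk)
v = np.einsum("sc,hcv->hsv", k_latent[idx], w_uv)
logits = (np.einsum("hd,hsd->hs", q_full, k_nope) + q_pe @ k_pe[idx].T) * scale
p = np.exp(logits - logits.max(axis=-1, keepdims=True))
p /= p.sum(axis=-1, keepdims=True)
ref = np.einsum("hs,hsv->hv", p, v)
print(np.allclose(out, ref))  # → True
```

Absorbing `W_uk` into the query lets the sparse kernel read only the compact latent rows. It never has to materialize per-head K/V for every selected position.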
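The pipeline then continues with V absorb and the output projection:

```python
# ─── V absorb ───
output = torch.matmul(output, kv_b_proj_w_v)

# ─── O projection (supports oproj_tp_size as an extra parallel dimension) ───
output = o_proj(output)
```

## Key differences between Prefill and Decode

| Stage | Prefill | Decode |
|---|---|---|
| mla_prolog `cache_mode` | `PA_BSND` (with CP: `BSND`, followed by `scatter_update_`) | `PA_BSND` |
| Indexer KV cache write | `scatter_update_` | `npu_scatter_nd_update_` |
| Offload (optional) | None | `npu_gather_selection_kv_cache` gathers discrete KV blocks |

## Core differences from MLA Absorb (without Indexer)

| Aspect | MLA Absorb (deepseek-r1) | MLA + Indexer (deepseek-v3.2) |
|---|---|---|
| Prefill attention | Expand K/V → standard FA (v1) | absorb → Indexer → sparse FA |
| Decode attention | absorb → FA v2 (full KV) | absorb → Indexer → sparse FA |
| FA operator | `npu_fused_infer_attention_score(_v2)` | `npu_sparse_flash_attention` |
| KV cache layout | `PA_NZ` | `PA_BSND` |
| V absorb operator | `npu_transpose_batchmatmul` | `torch.matmul` |

Creation note: parts of this article were generated with AI assistance (AIGC) and are for reference only.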