CANN Operators: the a2 Cube-to-Vec Pattern


张建站

2026/5/9 16:41:24

10-minute read

a2 Cube-to-Vec Pattern (GM Workspace Bridge)

[Free download link] cannbot-skills: CANNBot is a series of agents for CANN development, built to improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing a cube → vec kernel on a2 (`easyasc.a2`, `device="b3"`). On a2, `l0c_to_ub` is not available. The cube output must transit through GM workspace.

When to use

- Cube computes a matmul tile in L0C
- Vec must postprocess (scale, normalize, exp, cast, etc.) before final writeback
- Target device is a2 (not a5)

Data flow

```
GM(q,k) → L1 → L0A/L0B → mmad → L0C → GM(workspace) → UB → vec ops → GM(output)
                                    ↑               ↑
                              cube FIX pipe   vec MTE2 pipe
                                    └── CvMutex ────┘
```

Buffer declarations

```python
# GM workspace with pingpong (2 slots)
ws = split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N], name="ws")

# Cube buffers (standard)
l1q = DBuff(DT.half, [TILE_M, TILE_K], Position.L1)
l1k = DBuff(DT.half, [TILE_N, TILE_K], Position.L1)
l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)

# Vec buffers (per sub-block, 192KB each)
ub_data = Tensor(DT.float, [HALF_M, TILE_N], Position.UB)
ub_out = Tensor(DT.half, [HALF_M, TILE_N], Position.UB)
```

Synchronization

```python
cvmutex = CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
```

- `src_end_pipe=Pipe.FIX`: cube's last operation is `l0c_to_gm_nz2nd` (FIX pipe)
- `dst_end_pipe=Pipe.MTE2`: vec's first operation is `gm_to_ub_pad` (MTE2 pipe)

This differs from a5's standard `dst_end_pipe=Pipe.V` because a5 uses `l0c_to_ub` → `vf`.

Sub-block split

Each cube core has 2 vec sub-blocks, each with independent 192KB UB.
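To make the split concrete, here is a minimal plain-Python model of how two sub-blocks could partition the rows of one workspace tile. It is not easyasc code: `sub_block_rows` is a hypothetical helper, `TILE_M = 128` is taken from the capacity example, and `HALF_M = TILE_M // 2` is an assumption about how the tile is divided.

```python
# Plain-Python model of the sub-block row split (illustrative, not easyasc code).
TILE_M = 128                        # assumed tile height
NUM_SUB_BLOCKS = 2                  # two vec sub-blocks per cube core
HALF_M = TILE_M // NUM_SUB_BLOCKS   # assumption: equal share per sub-block


def sub_block_rows(sb_idx: int) -> range:
    """Return the workspace rows owned by vec sub-block `sb_idx` (hypothetical helper)."""
    if not 0 <= sb_idx < NUM_SUB_BLOCKS:
        raise ValueError(f"sub-block index out of range: {sb_idx}")
    start = sb_idx * HALF_M
    return range(start, start + HALF_M)


# The two halves are disjoint and together cover the full tile written by the cube.
covered = [row for sb in range(NUM_SUB_BLOCKS) for row in sub_block_rows(sb)]
assert covered == list(range(TILE_M))
```

Because the halves do not overlap, each sub-block can read and write its rows without coordinating with the other; only the cube-to-vec handoff needs the mutex.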
Use `GetSubBlockIdx()` to split the M dimension:

```python
sb = GetSubBlockIdx()
sb_row = Var(sb * HALF_M)

# Cube writes full TILE_M to workspace
ws[cube_idx, slot, 0:TILE_M, 0:TILE_N] <<= l0c[cnt]

# Each sub-block reads its own half
ub_data <<= ws[cube_idx, slot, sb_row:sb_row + HALF_M, 0:TILE_N]

# Each sub-block writes its own half to output
out_row = Var(q_row + sb_row)
output[out_row:out_row + HALF_M, col:col + TILE_N] <<= ub_out
```

Workspace pingpong

Index the workspace slot with `var_mod(counter, 2)`:

```python
ws_cnt = Var(0)
# inside loop:
ws_slot = var_mod(ws_cnt, 2)
ws[cube_idx, ws_slot, ...] <<= l0c[...]   # cube write
ub_data <<= ws[cube_idx, ws_slot, ...]    # vec read
ws_cnt += 1
```

The CvMutex lock/free cycle ensures the cube does not overwrite a slot that the vec is still reading from the previous iteration.

Tail note for workspace slices

When an a2 cube -> vec kernel has `valid_m` / `valid_n` tails, keep the workspace bridge itself on stable tile shapes whenever possible:

- cube side: prefer writing `0:TILE_M, 0:TILE_N` into workspace after tail zero-fill in local buffers
- vec side: prefer reading `row_begin:row_begin + row_count, 0:TILE_N` from workspace, then handle `valid_n` with vec-side masking and final GM write boundaries

Reason:

- `l0c_to_gm_nz2nd` and `gm_to_ub_pad` infer row stride from the parent GM shape, not from a cropped workspace column span
- a workspace slice like `[..., 0:row_count, 0:valid_n]` may therefore be too small for the inferred stride even when the logical tail region is correct

This is a workspace-bridge rule, not a general "never use GM tails" rule. Direct final GM boundaries such as `output[..., 0:valid_n]` still work in the usual way.

Complete iteration skeleton

```python
with auto_sync():
    for tile_idx in range(...):
        ws_slot = var_mod(ws_cnt, 2)

        # Cube
        l1q[l1_cnt] <<= q[...]
        l1k[l1_cnt] <<= k[...]
        matmul(l0c[l0c_cnt], l1q[l1_cnt], l1k[l1_cnt], is_init=True)

        cvmutex.lock()
        ws[cube_idx, ws_slot, 0:TILE_M, 0:TILE_N] <<= l0c[l0c_cnt]
        cvmutex.ready()

        # Vec
        cvmutex.wait()
        ub_data <<= ws[cube_idx, ws_slot, sb_row:sb_row + HALF_M, 0:TILE_N]
        # ... vec postprocess ...
        output[...] <<= ub_out
        cvmutex.free()

        l1_cnt += 1; l0c_cnt += 1; ws_cnt += 1
```

Capacity quick-check (TILE_M=128, TILE_N=128, D=128)

| Buffer | Size | Budget |
| --- | --- | --- |
| L1: l1q + l1k DBuff | 128 KB | 512 KB ✓ |
| L0C: l0c DBuff | 128 KB | 128 KB ✓ |
| L0A (inner) | 64 KB | 64 KB ✓ |
| L0B (inner) | 64 KB | 64 KB ✓ |
| UB per sub-block | ~66 KB | 192 KB ✓ |

Do not copy when

- Target is a5: use `l0c_to_ub` + `vf` instead
- Kernel is cube-only: use direct `l0c_to_gm_nz2nd` to output (no workspace needed)
- Vec preprocess is needed (vec → cube): use the `VcMutex` pattern instead

Files to study

- `agent/example/kernels/a2/flash_attn_score.py`: complete working example
- `agent/example/kernels/a2/qk_matmul_batched.py`: cube-only a2 baseline (no vec)
- `agent/references/constraints/a2-device.md`: a2-specific hardware differences
- `agent/references/constraints/vec-stride.md`: continuous vs sliced vec operations

Author's declaration: parts of this article were generated with AI assistance (AIGC) and are for reference only.
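As a supplement to the capacity quick-check, the following plain-Python sketch recomputes the sizes of the buffers that appear in the declarations. It rests on several assumptions that the article does not state outright: `TILE_K` equals D = 128, a `DBuff` holds two copies of its tensor, and `HALF_M = TILE_M // 2`. The helper `tensor_kb` is hypothetical and is not part of easyasc.

```python
# Back-of-the-envelope buffer sizing (illustrative, not easyasc code).
BYTES_PER_ELEM = {"half": 2, "float": 4}
KB = 1024


def tensor_kb(dtype: str, shape: list, copies: int = 1) -> float:
    """Size in KB of `copies` tensors with the given dtype and shape (hypothetical helper)."""
    elems = 1
    for dim in shape:
        elems *= dim
    return elems * BYTES_PER_ELEM[dtype] * copies / KB


TILE_M = TILE_N = TILE_K = 128   # assumption: TILE_K equals D = 128
HALF_M = TILE_M // 2             # assumption: equal split across 2 sub-blocks
DBUFF_COPIES = 2                 # assumption: DBuff is a double buffer

# L1: l1q + l1k, both double-buffered half tensors
l1_kb = (tensor_kb("half", [TILE_M, TILE_K], DBUFF_COPIES)
         + tensor_kb("half", [TILE_N, TILE_K], DBUFF_COPIES))

# L0C: double-buffered float accumulator tile
l0c_kb = tensor_kb("float", [TILE_M, TILE_N], DBUFF_COPIES)

# UB: only the two tensors shown in the declarations
ub_declared_kb = (tensor_kb("float", [HALF_M, TILE_N])
                  + tensor_kb("half", [HALF_M, TILE_N]))

# GM workspace per cube core: 2 pingpong slots of one float tile each
ws_per_core_kb = tensor_kb("float", [2, TILE_M, TILE_N])

print(l1_kb, l0c_kb, ub_declared_kb, ws_per_core_kb)
```

Under these assumptions the L1 and L0C figures agree with the table (128 KB each). The two declared UB tensors add up to 48 KB, so the table's ~66 KB presumably includes temporaries used during vec postprocessing that the declarations do not show.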
