# a2 Cube-to-Vec Pattern (GM Workspace Bridge)

Project: cannbot-skills. CANNBot is a family of agents for improving CANN development efficiency; this repository provides its reusable Skills modules. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing a cube → vec kernel on a2 (`easyasc.a2`, device `b3`). On a2, `l0c_to_ub` is not available, so the cube output must transit through a GM workspace.

## When to use

- Cube computes a matmul tile in L0C
- Vec must postprocess (scale, normalize, exp, cast, etc.) before final writeback
- Target device is a2 (not a5)

## Data flow

```
GM(q,k) → L1 → L0A/L0B → mmad → L0C → GM(workspace) → UB → vec ops → GM(output)
                                    ↑               ↑
                              cube FIX pipe   vec MTE2 pipe
                                    └── CvMutex ────┘
```

## Buffer declarations

```python
# GM workspace with pingpong (2 slots)
ws = split_workspace(DT.float, [GetCubeNum(), 2, TILE_M, TILE_N], name="ws")

# Cube buffers (standard)
l1q = DBuff(DT.half, [TILE_M, TILE_K], Position.L1)
l1k = DBuff(DT.half, [TILE_N, TILE_K], Position.L1)
l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)

# Vec buffers (per sub-block, 192KB each)
ub_data = Tensor(DT.float, [HALF_M, TILE_N], Position.UB)
ub_out = Tensor(DT.half, [HALF_M, TILE_N], Position.UB)
```

## Synchronization

```python
cvmutex = CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
```

- `src_end_pipe=Pipe.FIX`: the cube's last operation is `l0c_to_gm_nz2nd` (FIX pipe)
- `dst_end_pipe=Pipe.MTE2`: the vec's first operation is `gm_to_ub_pad` (MTE2 pipe)
- This differs from a5's standard `dst_end_pipe=Pipe.V`, because a5 uses `l0c_to_ub` → `vf`.

## Sub-block split

Each cube core has 2 vec sub-blocks, each with an independent 192KB UB. Use `GetSubBlockIdx()` to split the M dimension:

```python
sb = GetSubBlockIdx()
sb_row = Var(sb * HALF_M)

# Cube writes full TILE_M to workspace
ws[cube_idx, slot, 0:TILE_M, 0:TILE_N] <<= l0c[cnt]

# Each sub-block reads its own half
ub_data <<= ws[cube_idx, slot, sb_row:sb_row + HALF_M, 0:TILE_N]

# Each sub-block writes its own half to output
out_row = Var(q_row + sb_row)
output[out_row:out_row + HALF_M, col:col + TILE_N] <<= ub_out
```

## Workspace pingpong

Index the workspace slot with `var_mod(counter, 2)`:

```python
ws_cnt = Var(0)
# inside loop:
ws_slot = var_mod(ws_cnt, 2)
ws[cube_idx, ws_slot, ...] <<= l0c[...]   # cube write
ub_data <<= ws[cube_idx, ws_slot, ...]    # vec read
ws_cnt += 1
```

The CvMutex lock/free cycle ensures that the cube does not overwrite a slot the vec is still reading from the previous iteration.

## Tail note for workspace slices

When an a2 cube -> vec kernel has `valid_m` / `valid_n` tails, keep the workspace bridge itself on stable tile shapes whenever possible:

- cube side: prefer writing `0:TILE_M, 0:TILE_N` into workspace after tail zero-fill in local buffers
- vec side: prefer reading `row_begin:row_begin + row_count, 0:TILE_N` from workspace, then handle `valid_n` with vec-side masking and final GM write boundaries

Reason:

- `l0c_to_gm_nz2nd` and `gm_to_ub_pad` infer row stride from the parent GM shape, not from a cropped workspace column span
- a workspace slice like `[..., 0:row_count, 0:valid_n]` may therefore be too small for the inferred stride even when the logical tail region is correct

This is a workspace-bridge rule, not a general "never use GM tails" rule. Direct final GM boundaries such as `output[..., 0:valid_n]` still work in the usual way.

## Complete iteration skeleton

```python
with auto_sync():
    for tile_idx in range(...):
        ws_slot = var_mod(ws_cnt, 2)

        # Cube: load tiles, run the matmul, publish the result to the workspace slot
        l1q[l1_cnt] <<= q[...]
        l1k[l1_cnt] <<= k[...]
        matmul(l0c[l0c_cnt], l1q[l1_cnt], l1k[l1_cnt], is_init=True)
        cvmutex.lock()
        ws[cube_idx, ws_slot, 0:TILE_M, 0:TILE_N] <<= l0c[l0c_cnt]
        cvmutex.ready()

        # Vec: wait for the slot, read this sub-block's half, postprocess, write back
        cvmutex.wait()
        ub_data <<= ws[cube_idx, ws_slot, sb_row:sb_row + HALF_M, 0:TILE_N]
        # ... vec postprocess ...
        output[...] <<= ub_out
        cvmutex.free()

        l1_cnt += 1
        l0c_cnt += 1
        ws_cnt += 1
```

## Capacity quick-check (TILE_M=128, TILE_N=128, D=128)

| Buffer | Size | Budget |
|---|---|---|
| L1: l1q + l1k DBuff | 128 KB | 512 KB ✓ |
| L0C: l0c DBuff | 128 KB | 128 KB ✓ |
| L0A (inner) | 64 KB | 64 KB ✓ |
| L0B (inner) | 64 KB | 64 KB ✓ |
| UB per sub-block | ~66 KB | 192 KB ✓ |

## Do not copy when

- Target is a5: use `l0c_to_ub` + `vf` instead
- Kernel is cube-only: use direct `l0c_to_gm_nz2nd` to output (no workspace needed)
- Vec preprocess is needed (vec → cube): use the `VcMutex` pattern instead

## Files to study

- `agent/example/kernels/a2/flash_attn_score.py`: complete working example
- `agent/example/kernels/a2/qk_matmul_batched.py`: cube-only a2 baseline (no vec)
- `agent/references/constraints/a2-device.md`: a2-specific hardware differences
- `agent/references/constraints/vec-stride.md`: continuous vs sliced vec operations

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
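The figures in the capacity quick-check table can be cross-checked with plain arithmetic. The following is a minimal sketch in ordinary Python, not easyasc code: `buf_kb` is a hypothetical helper, and it assumes that `TILE_K` equals the `D=128` of the table heading, that `HALF_M` is `TILE_M // 2`, and that a `DBuff` holds two copies of its shape (double buffering). Under those assumptions it reproduces the 128 KB figures for L1 and L0C. The two UB tensors declared in this document account for 48 KB; the remainder of the table's ~66 KB is not itemized here, so the sketch does not try to derive it.

```python
# Hypothetical helper for sanity-checking tile sizes; not part of easyasc.
KB = 1024
BYTES = {"half": 2, "float": 4}


def buf_kb(dtype, shape, copies=1):
    """Return the size in KB of `copies` buffers of the given dtype and shape."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * BYTES[dtype] * copies / KB


TILE_M = TILE_N = 128
TILE_K = 128           # assumption: TILE_K equals D=128 from the table heading
HALF_M = TILE_M // 2   # assumption: each vec sub-block handles half of the M rows
DBUFF_COPIES = 2       # assumption: DBuff keeps two copies (double buffering)

# L1 holds the double-buffered q and k tiles.
l1_kb = (buf_kb("half", [TILE_M, TILE_K], DBUFF_COPIES)
         + buf_kb("half", [TILE_N, TILE_K], DBUFF_COPIES))
# L0C holds the double-buffered float accumulator tile.
l0c_kb = buf_kb("float", [TILE_M, TILE_N], DBUFF_COPIES)
# GM workspace per cube core: 2 pingpong slots of one float tile each.
ws_kb_per_core = buf_kb("float", [2, TILE_M, TILE_N])
# UB per sub-block: only the two tensors declared in this document.
ub_declared_kb = (buf_kb("float", [HALF_M, TILE_N])
                  + buf_kb("half", [HALF_M, TILE_N]))

# Budgets are taken from the capacity table.
checks = {
    "L1": (l1_kb, 512),
    "L0C": (l0c_kb, 128),
    "UB (declared tensors only)": (ub_declared_kb, 192),
}
for name, (used, budget) in checks.items():
    status = "ok" if used <= budget else "OVER BUDGET"
    print(f"{name}: {used:.0f} KB of {budget} KB ({status})")
print(f"GM workspace per cube core: {ws_kb_per_core:.0f} KB")
```

Rerunning the sketch with other tile sizes is a quick way to see which budget is hit first; with the values above, L0C is already exactly at its 128 KB limit, so `TILE_M` or `TILE_N` cannot grow without giving up the L0C double buffer.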