CANN/ops-rand AI Core算子开发指南

张

张建站

2026/5/9 16:38:31

10分钟阅读

AI Core算子开发指南【免费下载链接】ops-randops-rand是CANN Compute Architecture for Neural Networks算子库中提供的随机数生成库。项目地址: https://gitcode.com/cann/ops-rand说明算子开发过程中涉及的基本概念如Tiling、Kernel、Ascend C接口等详细介绍请参考《Ascend C算子开发》。AI Core算子是使用Ascend C语言开发运行在AI Core硬件单元的算子。本指南以ops-rand项目中的随机数算子为例介绍新算子开发流程以及涉及的交付件。工程创建开发算子前需完成环境部署并创建算子目录。算子定义算子功能说明与原型定义。Tiling实现实现Host侧Tiling策略。Kernel实现实现Device侧算子核函数。编译部署通过工程编译脚本完成自定义算子的编译和安装。算子验证通过常见算子调用方式验证自定义算子功能。工程创建1. 环境部署开发算子前请先参考环境部署完成基础环境搭建。2. 目录创建在src目录下创建新的算子目录目录名为算子名的小写下划线形式。# 创建算子目录 mkdir -p src/${op_name} mkdir -p src/${op_name}/arch35 mkdir -p src/${op_name}/tests创建完成后目录结构如下所示${op_name}/ # 算子名的小写下划线形式 ├── CMakeLists.txt # 算子编译配置文件 ├── ${op_name}.cpp # Kernel入口文件 ├── arch35/ # Ascend950特有实现 │ └── ${op_name}.h # Kernel实现头文件 └── tests/ # 测试用例目录 ├── CMakeLists.txt └── test_${op_name}_example.cpp # 算子测试用例算子定义开发算子前需要先确定目标算子的功能和计算逻辑。以stateless_random_uniform_v2算子为例功能生成无状态的均匀分布随机数输入seed随机种子、offset偏移量、shape输出形状输出填充了随机数的张量支持的数据类型FP32、FP16、BF16API接口定义在include/cann_ops_rand.h中定义算子的对外API接口#ifdef __cplusplus extern C { #endif /* 无状态均匀分布随机数生成 (内部API) */ aclError _aclrandStatelessRandomUniformV2( uint64_t seed, uint64_t offset, int32_t alg, void *output, uint64_t n, int32_t dtype ); #ifdef __cplusplus } #endif说明output参数类型为void*实际数据类型由dtype参数决定dtype参数0FP32, 1FP16, 2BF16Tiling实现Tiling简介因NPU中AI Core内部存储空间有限无法一次性将整个张量数据加载到计算单元中处理因此需要将输入张量切分为多个小块Tile逐块进行计算这一过程称为Tiling。TilingData结构体定义在Kernel实现头文件中定义TilingData结构体// stateless_random_uniform_v2.h struct StatelessRandomUniformV2TilingData { uint32_t tilingKey; // Tiling标识用于区分不同场景 uint32_t blockNum; // 核数 uint32_t blockTilingSize; // 每个核处理的数据量 uint32_t tailBlockTilingSize; // 尾核处理的数据量 uint32_t ubTilingSize; // UB切分大小 uint32_t alg; // 算法类型 uint32_t key[2]; // Philox算法密钥来自seed uint32_t counter[4]; // Philox算法计数器来自offset };说明tilingKey字段用于标识不同的 Tiling 场景便于在 Kernel 中根据场景选择不同的处理逻辑。Tiling策略计算在Host侧Kernel入口函数或API调用前计算Tiling参数// 获取平台信息 auto ascendcPlatform platform_ascendc::PlatformAscendCManager::GetInstance(); auto coreNum ascendcPlatform-GetCoreNumAiv(); // 获取可用核数 // 计算Block切分参数 constexpr uint32_t CORE_ALIGN_SIZE 512; // 核对齐大小 constexpr uint32_t MIN_TILING_SIZE 256; // 最小切分大小 constexpr uint32_t BLOCK_SIZE_BYTES 32; // 块大小字节 auto outputDtypeSize sizeof(float); auto outputSize n * outputDtypeSize; auto coreAlignFactor CORE_ALIGN_SIZE / outputDtypeSize; auto blockFactor CeilDiv(outputSize, coreNum); auto blockAlignFactor CeilDiv(blockFactor, coreAlignFactor) * coreAlignFactor; tilingData-blockTilingSize std::max(blockAlignFactor, MIN_TILING_SIZE); tilingData-blockNum CeilDiv(outputSize, tilingData-blockTilingSize); tilingData-tailBlockTilingSize outputSize - tilingData-blockTilingSize * (tilingData-blockNum - 1); // 计算UB切分参数 uint64_t ubSize 0; ascendcPlatform-GetCoreMemSize(platform_ascendc::CoreMemType::UB, ubSize); auto quarterUbSize ubSize / 4; // 使用1/4的UB空间 auto alignFactor BLOCK_SIZE_BYTES / outputDtypeSize; // 块对齐因子 tilingData-ubTilingSize CeilDiv(quarterUbSize / outputDtypeSize, alignFactor) * alignFactor; // 设置随机数种子参数 tilingData-key[0] static_castuint32_t(seed); tilingData-key[1] static_castuint32_t(seed 32); tilingData-counter[0] static_castuint32_t(offset); tilingData-counter[1] static_castuint32_t(offset 32);Kernel实现Kernel代码结构ops-rand项目的Kernel实现主要包含以下文件src/${op_name}/ ├── ${op_name}.cpp # Kernel入口文件 API实现 └── arch35/ └── ${op_name}.h # Kernel类实现Kernel类定义// stateless_random_uniform_v2.h template typename T class StatelessRandomUniformV2 { public: __aicore__ inline StatelessRandomUniformV2() {} __aicore__ inline void Init( GM_ADDR y, const StatelessRandomUniformV2TilingData* __restrict tilingData, TPipe* pipeIn ); __aicore__ inline void Process(); private: __aicore__ inline void ParseTilingData(const StatelessRandomUniformV2TilingData* tilingData); __aicore__ inline void Skip(const uint64_t count); // 跳过计数器 __aicore__ inline void DataTypeHandle(LocalTensorT yOutput, const uint32_t calCount); __aicore__ inline void CopyOut(); private: TPipe* pipe; // 队列和缓冲区 TQueQuePosition::VECOUT, BUFFER_NUM outQueY_; // 输出队列 TBufQuePosition::VECCALC philoxQueBuf_; // Philox结果缓冲区 GlobalTensorT outputGm_; // 全局输出张量 // Tiling参数 uint32_t blockNum_ 0; uint32_t blockTilingSize_ 0; uint32_t ubTilingSize_ 0; // 随机数生成状态 uint32_t key_[2] {0}; uint32_t counter_[4] {0}; };Init函数实现template typename T __aicore__ inline void StatelessRandomUniformV2T::Init( GM_ADDR y, const StatelessRandomUniformV2TilingData* tilingData, TPipe* pipeIn ) { // 1. 解析Tiling数据 ParseTilingData(tilingData); // 2. 计算当前核的偏移量 auto blockIdx GetBlockIdx(); blockOffset_ blockTilingSize_ * blockIdx; // 3. 处理尾核情况 if (blockIdx blockNum_ - 1) { currBlockTilingSize_ tailBlockTilingSize_; } else { currBlockTilingSize_ blockTilingSize_; } // 4. 初始化全局张量和队列 outputGm_.SetGlobalBuffer((__gm__ T*)y); pipe pipeIn; pipe-InitBuffer(outQueY_, BUFFER_NUM, ubTilingSize_ * sizeof(T)); pipe-InitBuffer(philoxQueBuf_, ubTilingSize_ * sizeof(uint32_t)); }Process函数实现template typename T __aicore__ inline void StatelessRandomUniformV2T::Process() { // 1. 计算并跳过偏移量 auto groupCnt (blockOffset_ RESULT_ELEMENT_CNT - 1) / RESULT_ELEMENT_CNT; Skip(groupCnt); // 2. 分UB循环处理 ubLoopCnt_ (currBlockTilingSize_ ubTilingSize_ - 1) / ubTilingSize_; for (auto idx 0; idx ubLoopCnt_; idx) { // 处理尾块 currUbTilingSize_ ubTilingSize_; if ((idx ubLoopCnt_ - 1) (currBlockTilingSize_ % ubTilingSize_ ! 0)) { currUbTilingSize_ currBlockTilingSize_ % ubTilingSize_; } currOffset_ blockOffset_ idx * ubTilingSize_; // 3. 生成随机数 LocalTensoruint32_t philoxRes philoxQueBuf_.Getuint32_t(); LocalTensorT yOutput outQueY_.AllocTensorT(); // 使用Philox算法生成随机数 PhiloxRandom10( philoxRes, {key_[0], key_[1]}, {counter_[0], counter_[1], counter_[2], counter_[3]}, currUbTilingSize_ ); // 4. 数据类型转换 DataTypeHandle(yOutput, currUbTilingSize_); outQueY_.EnQue(yOutput); // 5. 拷贝输出 CopyOut(); // 6. 更新计数器 groupCnt (currUbTilingSize_ RESULT_ELEMENT_CNT - 1) / RESULT_ELEMENT_CNT; Skip(groupCnt); } }数据类型转换template typename T __aicore__ inline void StatelessRandomUniformV2T::DataTypeHandle( LocalTensorT yOutput, const uint32_t calCount ) { if constexpr (AscendC::IsSameTypeT, half::value) { // FP16: 取10位尾数指数设为15 Uint16ToHalf(yOutput, calCount); } else if constexpr (AscendC::IsSameTypeT, float::value) { // FP32: 取23位尾数指数设为127 Uint32ToFloat(yOutput, calCount); } else if constexpr (AscendC::IsSameTypeT, bfloat16_t::value) { // BF16: 取7位尾数指数设为127 Uint16ToBfloat16(yOutput, calCount); } }Kernel入口函数// stateless_random_uniform_v2.cpp extern C __global__ __aicore__ void stateless_random_uniform_v2( GM_ADDR y, StatelessRandomUniformV2TilingData tilingData ) { TPipe pipe; StatelessRandomUniformV2float op; op.Init(y, tilingData, pipe); op.Process(); }Host侧API实现aclError _aclrandStatelessRandomUniformV2( uint64_t seed, uint64_t offset, int32_t alg, float *output, uint64_t n ) { // 1. 参数校验 if (output nullptr || n 0) { return ACL_ERROR; } // 2. 分配设备内存 void *devOut nullptr; aclrtMalloc(devOut, n * sizeof(float), ACL_MEM_MALLOC_HUGE_FIRST); // 3. 计算Tiling参数 auto ascendcPlatform platform_ascendc::PlatformAscendCManager::GetInstance(); StatelessRandomUniformV2TilingData *tilingData new StatelessRandomUniformV2TilingData(); // ... Tiling参数计算 ... // 4. 启动Kernel stateless_random_uniform_v2tilingData-blockNum, nullptr, 0( (__gm__ uint8_t *)devOut, *tilingData ); // 5. 拷贝结果到Host aclrtMemcpy(output, n * sizeof(float), devOut, n * sizeof(float), ACL_MEMCPY_DEVICE_TO_HOST); // 6. 释放资源 aclrtFree(devOut); return ACL_SUCCESS; }编译部署算子开发完成后需对算子工程进行编译生成ops-rand安装包。1. 准备工作参考工程创建完成基础环境搭建同时检查算子开发交付件是否完备。2. 配置环境变量# 默认路径安装 source /usr/local/Ascend/cann/set_env.sh # 指定路径安装 # source ${install_path}/cann/set_env.sh3. 编译ops-rand包进入项目根目录执行编译命令# 编译所有算子 bash build.sh --pkg --socascend950 # 编译指定算子 bash build.sh --pkg --socascend950 --ops${op_name}若提示如下信息说明编译成功Self-extractable archive cann-950-ops-rand_9.0.0_linux-x86_64.run successfully created.4. 安装ops-rand包./build_out/cann-${soc_name}-ops-rand_${cann_version}_linux-${arch}.run --full --install-path${install_path}${soc_name}表示NPU型号名称如950。${cann_version}表示CANN版本号如9.0.0。${arch}表示CPU架构如aarch64、x86_64。${install_path}表示指定安装路径需要与toolkit包安装在相同路径默认安装在/usr/local/Ascend目录。ops-rand安装在${install_path}/cann路径中。算子验证# 编译并运行测试 bash build.sh --ops${op_name} --run随机数生成注意事项Philox算法ops-rand使用Philox算法生成随机数该算法具有以下特点确定性相同的seed和offset产生相同的随机序列并行友好支持多核并行生成通过counter偏移实现高质量通过多轮加密混淆保证随机性数据类型转换公式数据类型尾数位数指数值转换公式FP3223127(exp 23) | mantissa- 1.0fFP161015(exp 10) | mantissa- 1.0hBF167127(exp 7) | mantissa- 1.0bf【免费下载链接】ops-randops-rand是CANN Compute Architecture for Neural Networks算子库中提供的随机数生成库。项目地址: https://gitcode.com/cann/ops-rand创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考

OpenClaw爆火背后：AI不再只是聊天，而是开始真正替你干活

一、OpenClaw到底是什么？OpenClaw 是一个开源的个人 AI 助理，它不是普通聊天机器人，而是一个可以运行在你自己电脑或服务器上的 AI Agent。它的官方定位很直接：OpenClaw 是运行在你自己设备上的个人 AI 助理，可以通过你…...

2026/5/9 16:30:09 阅读更多 →

HCCL Atlas A3 rank table配置

rank table配置资源信息（Atlas A3 训练系列产品/Atlas A3 推理系列产品） 【免费下载链接】hccl 集合通信库（Huawei Collective Communication Library，简称HCCL）是基于昇腾AI处理器的高性能集合通信库，为计…...

2026/5/9 16:27:22 阅读更多 →

新手也能玩转CTF：用Python脚本爆破PNG图片CRC，修复宽高实战（附完整代码）

从零掌握PNG文件修复：CRC爆破实战与CTF技巧精解在网络安全竞赛的MISC（杂项）领域中，PNG图片分析是经典题型之一。许多CTF新手面对损坏的图片文件时，往往无从下手。本文将带你深入理解PNG文件结构，并通过Pyt…...

2026/5/9 16:26:27 阅读更多 →

LoopViT：结合循环机制的视觉Transformer优化架构

1. 项目概述在计算机视觉领域，Transformer架构近年来展现出惊人的潜力。LoopViT是我最近开发的一种新型视觉推理架构，它通过引入循环机制改进了传统视觉Transformer的计算效率和信息流模式。这个架构特别适合处理视频分析、医学影像分割等需要时序建模的…...

2026/5/8 5:06:09 阅读更多 →

实战指南：深度解锁微信网页版，让浏览器也能畅快聊天

实战指南：深度解锁微信网页版，让浏览器也能畅快聊天【免费下载链接】wechat-need-web 让微信网页版可用 / Allow the use of WeChat via webpage access 项目地址: https://gitcode.com/gh_mirrors/we/wechat-need-web 还在为微信网页版频繁提示…...

2026/5/9 14:14:14 阅读更多 →

智慧树学习效率提升指南：如何用自动化工具节省80%学习时间

智慧树学习效率提升指南：如何用自动化工具节省80%学习时间【免费下载链接】zhihuishu 智慧树刷课插件，自动播放下一集、1.5倍速度、无声项目地址: https://gitcode.com/gh_mirrors/zh/zhihuishu 还在为智慧树平台繁琐的视频学习流程而烦恼吗&am…...

2026/5/9 1:50:48 阅读更多 →