华为昇腾310P废物利用——大模型推理服务
华为昇腾310P废物利用注310P不支持bf16、W4A4带宽200G双芯版的300I duo, 有48g和96g两种目前市面上所有昇腾的卡均不支持FP8最终性能优化结果Qwen3-8B-W8A8TPS 15Tokens/s昇腾的PyTorch图模式使用和vllm-ascend的源码里面有reduce-overhead和max-autotune两种模式reduce-overhead只支持910B和910C而且vllm-ascend里面写死了reduce-overhead模式MindIE Qwen 3-8B-W8A81. Launch the container on thehostdockerrun-it-d--nethost --shm-size16g\--namemindie-qwen3-8b-310p\-w/workspace/MindIE-LLM/examples/atb_models\--device/dev/davinci0:rwm\--device/dev/davinci1:rwm\--device/dev/davinci2:rwm\--device/dev/davinci3:rwm\--device/dev/davinci_manager:rwm\--device/dev/hisi_hdc:rwm\--device/dev/devmm_svm:rwm\-v/usr/local/Ascend/driver:/usr/local/Ascend/driver:ro\-v/usr/local/dcmi:/usr/local/dcmi:ro\-v/usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro\-v/usr/local/sbin:/usr/local/sbin:ro\-v/Users/zhaojiacheng/repos/MindIE-LLM:/workspace/MindIE-LLM\-v/home/s_zhaojiacheng:/home/s_zhaojiacheng\swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:3.0.0b2-300I-Duo-py311-openeuler24.03-lts\bashEnter the container:dockerexec-itmindie-qwen3-8b-310pbash2. Prepare the environment inside the containercd/workspace/MindIE-LLM scripts/qwen3_8b_310p_w8a8sc.sh prepare-env3. Download the model from ModelScope Recommended: download directly into a normal directory, not only into the default cache.mkdir-p/home/s_zhaojiacheng/models/Qwen3-8B-w8a8s modelscope download\--modelEco-Tech/Qwen3-8B-w8a8s-310\--local_dir/home/s_zhaojiacheng/models/Qwen3-8B-w8a8s If you already downloaded it earlier into the default cache with: modelscope download--modelEco-Tech/Qwen3-8B-w8a8s-310thenflatten it into a real directory first:mkdir-p/home/s_zhaojiacheng/models/Qwen3-8B-w8a8scp-aL\/home/s_zhaojiacheng/.cache/modelscope/hub/models/Eco-Tech/Qwen3-8B-w8a8s-310/.\/home/s_zhaojiacheng/models/Qwen3-8B-w8a8s/ Check the files exist:ls/home/s_zhaojiacheng/models/Qwen3-8B-w8a8s4. Compress W8A8S into W8A8SCcd/workspace/MindIE-LLM scripts/qwen3_8b_310p_w8a8sc.sh compress\--w8a8s-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s\--w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc After it finishes, check the output directory exists:ls/home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc5. Start the OpenAI-compatible servercd/workspace/MindIE-LLM scripts/qwen3_8b_310p_w8a8sc.sh serve\--w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc\--model-name qwen3-8b-w8a8sc\--port1025This should start mindie_llm_server and expose the OpenAI-compatible endpoint on127.0.0.1:1025.6. Verify theserviceList models: curlhttp://127.0.0.1:1025/v1/models Expected model id: qwen3-8b-w8a8sc Test one inference request: curlhttp://127.0.0.1:1025/v1/chat/completions\-HContent-Type: application/json\-d{ model: qwen3-8b-w8a8sc, messages: [ {role: user, content: What is deep learning?} ], max_tokens: 128, stream: false }Short version If you want the shortest working sequence inside the container:cd/workspace/MindIE-LLM scripts/qwen3_8b_310p_w8a8sc.sh prepare-env modelscope download\--modelEco-Tech/Qwen3-8B-w8a8s-310\--local_dir/home/s_zhaojiacheng/models/Qwen3-8B-w8a8s scripts/qwen3_8b_310p_w8a8sc.sh compress\--w8a8s-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8s\--w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc scripts/qwen3_8b_310p_w8a8sc.sh serve\--w8a8sc-weight /home/s_zhaojiacheng/models/Qwen3-8B-w8a8sc\--model-name qwen3-8b-w8a8sc\--port1025Then test: curlhttp://127.0.0.1:1025/v1/models One important detail:forthis single-310P flow,donot try to serve Qwen3-8B-w8a8s-310 directly. The supported path is download W8A8S -compress to W8A8SC -serve W8A8SC. If you want, I can also rewrite this into one clean host-sidebashscript that doesdockerrun,dockerexec, download, compress, and serve end to end.