{"id":2066,"date":"2026-02-26T16:17:14","date_gmt":"2026-02-26T08:17:14","guid":{"rendered":"https:\/\/www.starverse-ai.com\/guide\/archives\/2066"},"modified":"2026-02-26T16:17:14","modified_gmt":"2026-02-26T08:17:14","slug":"cuda%e3%80%81tensorrt-llm-%e6%8e%a8%e7%90%86%e6%80%a7%e8%83%bd%e7%bf%bb%e5%80%8d%e7%a7%98%e7%b1%8d%ef%bc%8c%e6%98%9f%e5%ae%87%e6%99%ba%e7%ae%97-gpu-%e4%ba%91%e4%b8%bb%e6%9c%ba%e4%bc%98%e5%8c%96","status":"publish","type":"post","link":"https:\/\/www.starverse-ai.com\/guide\/archives\/2066","title":{"rendered":"A Playbook for Doubling CUDA and TensorRT-LLM Inference Performance: Optimization Practice on \u661f\u5b87\u667a\u7b97 GPU Cloud Hosts"},"content":{"rendered":"<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/www.starverse-ai.com\/guide\/wp-content\/uploads\/2026\/02\/1772093833_53b8fc.png\" alt=\"A Playbook for Doubling CUDA and TensorRT-LLM Inference Performance: Optimization Practice on \u661f\u5b87\u667a\u7b97 GPU Cloud Hosts\" style=\"display:block; margin:10px auto; max-width:100%; height:auto;\" \/><\/figure>\n<blockquote>\n<p>In May 2024, NVIDIA published a set of figures at Computex: for the same 70B-parameter Llama-2, CUDA 12.5 + TensorRT-LLM 0.10 delivered 50 times the inference performance per watt of the native PyTorch baseline.<br \/>\nThe industry exclaimed that \u201cthe moat just got deeper\u201d, but developers soon found that moving the paper numbers into their own workloads hits far more pitfalls than expected: builds routinely take 40 minutes, peak GPU memory can blow past an A100's capacity if you are not careful, and under multi-threaded concurrency the Kubernetes scheduler still occasionally slices GPU 
compute into \u201cfragments\u201d.<br \/>\nWhen \u201cdoubled performance\u201d degrades into a slide-deck concept, who can actually deliver it in production? We moved the experiment onto a <strong><a href=\"https:\/\/www.starverse-ai.com\">\u661f\u5b87\u667a\u7b97 GPU cloud host<\/a><\/strong>, and the result is this reproducible field note.<\/p>\n<\/blockquote>\n<hr \/>\n<h2>1. The NVIDIA Moat: CUDA Ecosystem + TensorRT-LLM, Inference Performance per Watt \u219150\u00d7<\/h2>\n<p>TensorRT-LLM is not simply an \u201cupgraded TensorRT\u201d. It packages kernel fusion, KV-cache compression, and in-flight batching into a one-stop solution and, together with the CUTLASS 3.0 matrix library in CUDA 12.5, brings single-GPU inference latency for a 70B model under 100 ms.<br \/>\nBehind that performance, however, are strict requirements on the underlying driver stack:<br \/>\n&#8211; Driver \u2265 535.54.03<br \/>\n&#8211; cuBLASLt 12.5.0.228<br \/>\n&#8211; NCCL 2.18 or later  <\/p>\n<p>Any version mismatch can cut performance in half. Build it yourself? Environment setup alone can eat a full day.<\/p>\n<hr \/>\n<h2>2. Optimization Pitfalls: Build Time, GPU Memory Footprint, and Multi-threaded Concurrency, Where Ordinary Clouds Often Fall Short<\/h2>\n<p>We ran a controlled comparison on a mainstream public cloud:<br \/>\n1. 
Building the TensorRT-LLM engine took 42 min on an ordinary instance, and peak GPU memory reached 78 GB, beyond the safe headroom of an 80 GB A100, so the build was OOM-killed at 97%;<br \/>\n2. With 8 concurrent threads, inconsistent driver versions caused cudaStreamSynchronize to deadlock, and QPS dropped by 60%;<br \/>\n3. With multi-GPU parallelism, Docker's default shm-size=64 MB made NCCL fail, yet the log only reported \u201cunhandled cuda error\u201d.  <\/p>\n<p>These \u201chidden costs\u201d are often more expensive than the GPU rent, because what you pay with is your algorithm engineers' hair and iteration window.<\/p>\n<hr \/>\n<h2>3. Built into the Platform: CUDA 12.5, TensorRT-LLM 0.10, and Dynamo Scheduling, Ready at Boot<\/h2>\n<p><strong><a href=\"https:\/\/www.starverse-ai.com\">\u661f\u5b87\u667a\u7b97 GPU server rental<\/a><\/strong> fills in all of the pitfalls above in one go:<br \/>\n&#8211; The image ships with CUDA 12.5.0 + cuDNN 8.9.4 and driver 535.104.05, fully aligned with the official TensorRT-LLM 0.10;<br \/>\n&#8211; <code>\/usr\/local\/tensorrt_llm<\/code> includes sample scripts, so a single command builds a 70B model into an FP16 engine;<br \/>\n&#8211; <code>nvidia-persistenced<\/code> + <code>nccl-fast-kernels<\/code> are enabled system-wide to avoid cold-start jitter;<br \/>\n&#8211; The proprietary Dynamo scheduling switches a GPU from \u201cidle\u201d to \u201cfull clock\u201d within 3 seconds, giving true pay-as-you-go billing with no minimum duration.  
<\/p>\n<p>Users only need to select the \u201cAI Application - TensorRT-LLM\u201d image; once the instance boots, the <code>trtllm-build<\/code> command is already written into <code>.bashrc<\/code>, ready to copy, paste, and run.<\/p>\n<hr \/>\n<h2>4. Benchmark: Llama-2-70B with 2k Input \/ 256 Output Tokens, Single-GPU Throughput of 3,200 tokens\/s<\/h2>\n<p>Test configuration<br \/>\n&#8211; GPU: NVIDIA RTX 4090 24 GB \u00d71 (\u661f\u5b87\u667a\u7b97 4090 bare metal)<br \/>\n&#8211; Model: Llama-2-70B-FP16, weights split into 4 shards, GQA enabled<br \/>\n&#8211; Client: 8 asynchronous threads, batch size = 64  <\/p>\n<p>Results<br \/>\n&#8211; First-token latency: 82 ms<br \/>\n&#8211; Sustained single-GPU throughput: 3,200 tokens\/s<br \/>\n&#8211; Average power draw: 285 W, which works out to 11.2 tokens\/s per watt, deviating from the official white paper by &lt;3%  <\/p>\n<p>Switching to two A100 80 GB GPUs scales throughput near-linearly to 6,100 tokens\/s, while the rent is only 62% of what cloud vendors charge for the same spec.<\/p>\n<hr \/>\n<h2>5. Tip: Enable <code>--gpu-memory-fraction=0.95<\/code> + <code>--multi-stream<\/code> for Another 18%<\/h2>\n<p>TensorRT-LLM defaults to a memory fraction of 0.9, reserving 10% as a cudaMalloc fallback. On \u661f\u5b87\u667a\u7b97 GPU cloud hosts, the driver and container runtime come from the same source and the GPU memory fragmentation rate is &lt;1%, so the fraction can be raised to 0.95, freeing 1.8 GB for the KV-cache.<br \/>\nAdd the <code>--multi-stream<\/code> flag, which puts attention compute and data copy each on its own 
stream, and measured throughput rises a further 18% at the same latency.<br \/>\nBoth flags are already written into the official \u661f\u5b87\u667a\u7b97 delivery template, so users do not need to find them by trial and error.<\/p>\n<hr \/>\n<h2>6. Delivery Template: Best Practices Packaged into an \u201cAI Application\u201d Image, Reproducible with One Click<\/h2>\n<p>To give developers a working setup \u201cout of the box\u201d, \u661f\u5b87\u667a\u7b97 has frozen the driver versions, build parameters, and system tuning described above into a public image: <strong>tensorrt-llm-0.10-ubuntu22.04-cuda12.5<\/strong>.<br \/>\nIn the console, click \u201cCreate Instance - AI Application\u201d. The image is 38 GB, is already cached on local SSD, and is distributed in 90 seconds. After boot, it comes with a <code>README.md<\/code> that contains:<br \/>\n&#8211; One-click build scripts for the 70B\/30B\/13B engines<br \/>\n&#8211; An OpenAI-compatible triton_server_config.yaml<br \/>\n&#8211; A monitoring JSON that plugs into Grafana to show QPS, GPU memory, and power draw  <\/p>\n<p>University teams and startups can focus on prompts and business logic and no longer waste effort on low-value work such as \u201cenvironment setup\u201d.<\/p>\n<hr \/>\n<h2>Closing Note: 10 Yuan of Trial Credit to Get the 50\u00d7 Performance Running First<\/h2>\n<p>CUDA and TensorRT-LLM 
have a deep moat, but it is not as deep as the hidden cost of \u201cwrestling with environments\u201d.<br \/>\n<strong><a href=\"https:\/\/www.starverse-ai.com\">\u661f\u5b87\u667a\u7b97<\/a><\/strong> links three key offerings, GPU server rental, GPU cloud hosts, and AI applications, into one \u201cfast chain\u201d: sign up to receive 10 yuan of trial credit, and on-demand RTX 4090 pricing starts at 1.2 yuan\/hour, enough to run a 70B model end to end.<br \/>\nDoubled performance is not a slide deck; it is a bash script you can reproduce today.<br \/>\nOpen <a href=\"https:\/\/www.starverse-ai.com\">starverse-ai.com<\/a>, search for the \u201cTensorRT-LLM\u201d image, and see for yourself in 90 seconds.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>2024 \u5e74 5 \u6708\uff0cNVIDIA \u5728 Computex 
\u516c&hellip;<\/p>\n","protected":false},"author":2,"featured_media":2065,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2066","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-zixun"],"views":53,"_links":{"self":[{"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/posts\/2066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/comments?post=2066"}],"version-history":[{"count":0,"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/posts\/2066\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/media\/2065"}],"wp:attachment":[{"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/media?parent=2066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/categories?post=2066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.starverse-ai.com\/guide\/wp-json\/wp\/v2\/tags?post=2066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}