Optimizing DNN Inference for Edge AI


July 24, 23


Slide overview

2023/7/13, "AI Computing in Autonomous Driving"
Presenter: Dan Umeda


Text of each slide

Optimizing DNN Inference for Edge AI
TIER IV, Inc.
Dan Umeda

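Speaker: Dan Umeda
Background:
● Completed the doctoral program in Computer Science and Engineering, Graduate School of Fundamental Science and Engineering, Waseda University
● ~2016: Research associate, School of Fundamental Science and Engineering, Waseda University. Research on parallelizing compilers for automotive software.
● ~2017: Research laboratory of an electronics manufacturer. R&D on big-data visualization.
● ~2022: Automobile manufacturer. R&D on AI for autonomous driving.
● 2019~: Visiting lecturer and invited researcher, Waseda University. Research on parallelizing compilers.
● 2023~: TIER IV. R&D on DNN acceleration and Edge AI.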

Optimizing DNN Inference for Edge AI
This talk presents case studies of YOLOX optimization at TIER IV:
1) YOLOX preprocessing optimization
2) YOLOX model optimization (GPU/DLA)
3) YOLOX quantization

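The evolution of Artificial Intelligence (AI) and Deep Neural Networks (DNNs): the machine-learning side
[Figure: training compute over time, growing about 10x per year. Source: J. Sevilla et al., arXiv:2202.05924, "Compute Trends Across Three Eras of Machine Learning", https://arxiv.org/abs/2202.05924]
Since the arrival of deep learning (AlexNet), the number of parameters and the amount of computation that AI requires have kept growing explosively.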

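Trends in GPU performance: the computing-hardware side
[Figure: "Predicting GPU Performance", https://epochai.org/blog/predicting-gpu-performance]
● So far, the GPU performance needed for AI training and inference has also kept improving.
● That improvement is gradual compared with the evolution of AI (2x per 2.69 years).
=> The gap between machine learning and computing hardware is large.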

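Peak Power vs Peak Performance
[Figure from A. Reuther et al., arXiv:2210.04055, "AI and ML Accelerator Survey and Trends", https://arxiv.org/abs/2210.04055. Annotations: IoT, autonomous, 10 W, cloud or datacenter, 100 W.]
● The AI processors needed to run DNNs are constrained by power. => Low power consumption is required, especially in edge environments.
● With air cooling, less than about 10 W is desirable.
● Deploying AI on embedded and edge processors with limited performance requires an optimized implementation.
This talk explains techniques for optimizing and accelerating DNN inference.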

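Techniques for optimizing and accelerating DNN inference
● Design-level optimization of the model
  ○ Which macro-architecture to choose
    ■ Ex) Residual (YOLOv3), CSP (YOLOv4), ELAN (YOLOv7) ...
  ○ Automated model search with NAS
    ■ Ex) YOLO-NAS
● Model compression
  ○ Reducing weights by pruning
    ■ Channel pruning
    ■ Sparse pruning
  ○ Reducing bit width by quantization
    ■ Post-Training Quantization (PTQ)
    ■ Quantization-Aware Training (QAT)
● Other
  ○ Optimizing the inference pipeline
    ■ Implementing preprocessing on an accelerator (GPU)
    ■ Using heterogeneous accelerators (GPU + DLA)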

DNN optimization at TIER IV
[Figure: perception pipeline. Sensing layer: cameras, LiDARs and RADARs, each with pre-processing. Perception / Detection layer: object recognition with YOLOX, LiDAR-camera bounding-box fusion, LiDAR clustering, shape estimation, CenterPoint, detection by tracker, merger, RADAR fusion (dynamic object detection). Perception / Tracking layer: multi-object tracker.]
This talk presents optimization techniques using the dynamic object recognition that runs on an NVIDIA SoC (Jetson Xavier) as the example.

10.

The YOLO series over time
● YOLO (15/6): Joseph Redmon
● YOLOv2 (16/12): Joseph Redmon
● YOLOv3 (18/4): Joseph Redmon
● YOLOv4 (20/4): Alexey Bochkovskiy, Chien-Yao Wang
● YOLOv5 (20/6?): Ultralytics
● PP-YOLO (20/8): Baidu
● Scaled YOLOv4 (20/11): Chien-Yao Wang, Alexey Bochkovskiy
● YOLOX (21/3): Megvii
● YOLOR (21/5): Chien-Yao Wang
● YOLOv7 (22/7): Chien-Yao Wang, Alexey Bochkovskiy
● YOLOv8 (23/1): Ultralytics
● YOLO-NAS (23/5): Deci
Architecture labels in the figure: Darknet53 + FPN, CSP + PAN, ELAN (C2F) + PAN, ResNet + FPN, Model Scaling.

11.

YOLOX Object Detection
Z. Ge et al., arXiv:2107.08430, "YOLOX: Exceeding YOLO Series in 2021", https://arxiv.org/abs/2107.08430
● YOLOX
  ○ An object-detection DNN proposed by Megvii Technology in 2021
    ■ A single-shot detector that follows YOLOv3 and YOLOv4
  ○ A decoupled head eases the conflict between the classification and regression tasks
  ○ The anchor-free design improves generalization
  ○ Scalable model family (Nano, Tiny, S, M, L, X)
  ○ Apache License (the point TIER IV values most)
Figure quoted from https://github.com/Megvii-BaseDetection/YOLOX

12.

Target SoCs: Jetson Xavier / Jetson Orin
TOPS: Tera Operations Per Second. Here it refers to INT8 peak performance.

Jetson Xavier: the SoC currently used for camera perception
● AI peak performance
  ○ GPU: 22 TOPS (dense INT8)
  ○ DLA0: 5 TOPS (dense INT8)
  ○ DLA1: 5 TOPS (dense INT8)
● Power consumption: ~30 W
● Evaluation conditions: JetPack 5.0.1, power mode MAXN, Jetson clocks enabled
● Figure quoted from https://developer.nvidia.com/blog/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/

Jetson Orin: the SoC under consideration for next-generation camera perception
● AI peak performance
  ○ GPU: 170 TOPS (sparse INT8), 85 TOPS (dense INT8)
  ○ DLA0: 52.5 TOPS (sparse INT8), 26.25 TOPS (dense INT8)
  ○ DLA1: 52.5 TOPS (sparse INT8), 26.25 TOPS (dense INT8)
● Power consumption: ~60 W
● Evaluation conditions: JetPack 5.0.1, power mode MAXN, Jetson clocks enabled
● Figure quoted from https://developer.nvidia.com/blog/delivering-server-class-performance-at-the-edge-with-nvidia-jetson-orin/

13.

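Deployment flow for YOLOX on Jetson
● Training environment (server): training, then conversion to a common format (ONNX, https://onnx.ai/)
● Inference environment (edge): optimization for the GPU (TensorRT, https://developer.nvidia.com/tensorrt)
● Before DNN optimization: Model: YOLOX-Tiny, Precision: FP16
● After DNN optimization: Model: YOLOX-?, Precision: INT8
DNN optimization starts from YOLOX-Tiny FP16.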

14.

Object-recognition pipeline with YOLOX (three Xaviers)
● Input: cam0, cam1, ...; 1920x1280 resized to 960x608
● Preprocessing on CPU (resize, letterbox, NHWC2NCHW ...): 5 ms @ Xavier CPU
● Inference on GPU (YOLOX-Tiny with FP16): 11 ms @ Xavier GPU
● Postprocessing on CPU/GPU (decode, NMS): ~1 ms @ CPU

15.

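Object-recognition pipeline with YOLOX
● Dynamic Object Detection: three Xaviers. Traffic Light Detection: one Xavier.
● Input: cam0, cam1, ...; 1920x1280 resized to 960x608
● Preprocessing on CPU (resize, letterbox, NHWC2NCHW ...): 5 ms @ Xavier CPU
  => 1. YOLOX preprocessing optimization
● Inference on GPU (YOLOX-Tiny with FP16): 11 ms @ Xavier GPU
  => 2. YOLOX model optimization (GPU/DLA), 3. YOLOX quantization
● Postprocessing on CPU/GPU (decode, NMS): ~1 ms @ CPU
Preprocessing and YOLOX inference dominate the processing time, so these two stages are optimized.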

16.

1. YOLOX preprocessing optimization

17.

YOLOX preprocessing pipeline
[Pipeline: cam0/cam1 → Preprocessing on CPU (resize, letterbox, NHWC2NCHW ...) → Detection on GPU (YOLOX-Tiny with FP16) → Postprocessing on CPU/GPU]
Preprocessing time on the Xavier CPU:
| Step | Time |
| cv::resize | 2.033 ms |
| cv::copyMakeBorder (letterbox) | 0.199 ms |
| cv::dnn::blobFromImages (NHWC8toNCHW32) | 2.996 ms |
| Total | 5.198 ms |
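The letterbox step keeps the aspect ratio of the camera image and pads whatever is left of the network input. As a minimal sketch of that geometry (the struct and function names are hypothetical, and rounding the scaled size to the nearest pixel is an assumption, not something the slides specify):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Geometry of a letterbox resize: scale the source uniformly so that it fits
// inside the destination, then pad the remaining area.
// All names here are illustrative and do not come from the slides.
struct LetterboxGeometry {
  double scale;  // uniform scale factor applied to the source image
  int scaled_w;  // width of the resized image before padding
  int scaled_h;  // height of the resized image before padding
  int pad_w;     // total horizontal padding in pixels
  int pad_h;     // total vertical padding in pixels
};

LetterboxGeometry compute_letterbox(int src_w, int src_h, int dst_w, int dst_h) {
  assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
  // The smaller of the two ratios guarantees that both dimensions fit.
  const double scale = std::min(static_cast<double>(dst_w) / src_w,
                                static_cast<double>(dst_h) / src_h);
  LetterboxGeometry g;
  g.scale = scale;
  // Assumption: round to the nearest pixel.
  g.scaled_w = static_cast<int>(std::lround(src_w * scale));
  g.scaled_h = static_cast<int>(std::lround(src_h * scale));
  g.pad_w = dst_w - g.scaled_w;
  g.pad_h = dst_h - g.scaled_h;
  return g;
}
```

For the 1920x1280 camera image and the 960x608 network input used in this talk, `compute_letterbox(1920, 1280, 960, 608)` gives a 912x608 resized image with 48 pixels of horizontal padding and no vertical padding.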

18.

Porting YOLOX preprocessing to CUDA
CPU implementation (total 5.198 ms):
| Step | Time |
| cv::resize | 2.033 ms |
| cv::copyMakeBorder (letterbox) | 0.199 ms |
| cv::dnn::blobFromImages (NHWC8toNCHW32) | 2.996 ms |
CUDA implementation (total 1.074 ms):
| Kernel | Time |
| resize_bilinear_gpu | 0.737 ms |
| letterbox_gpu | 0.094 ms |
| NHWC2NCHW_gpu | 0.078 ms |
| toFloat_gpu | 0.165 ms |
Porting each step to CUDA yields a 4.84x speed-up.

19.

Porting YOLOX preprocessing to CUDA + kernel fusion
| Implementation | Kernels | Total time | Speed-up vs CPU |
| CPU | cv::resize (2.033 ms), cv::copyMakeBorder (0.199 ms), cv::dnn::blobFromImages (2.996 ms) | 5.198 ms | 1x |
| CUDA, separate kernels | resize_bilinear_gpu (0.737 ms), letterbox_gpu (0.094 ms), NHWC2NCHW_gpu (0.078 ms), toFloat_gpu (0.165 ms) | 1.074 ms | 4.84x |
| CUDA, fused kernel | resize_bilinear_letterbox_NHWC2NCHW32 | 0.471 ms | 11.04x |
Kernel fusion gives a further 2.28x speed-up over the separate CUDA kernels by reducing DRAM accesses.
Reference: J. Filipovič et al., arXiv:1305.1183, "Optimizing CUDA Code by Kernel Fusion: Application on BLAS", https://arxiv.org/abs/1305.1183
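The idea behind the fused kernel can be illustrated on the CPU: every output element is computed straight from the source image in one pass, so no resized, padded or transposed intermediate buffer is ever written to memory. The following is only a plain C++ reference sketch, not the CUDA kernel from the talk. The function name, the top-left placement of the image (padding on the right and bottom), and the absence of any normalization are assumptions made for illustration. In a CUDA version, each iteration of the two outer loops would typically map to one thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fused bilinear resize + letterbox + HWC->CHW + uint8->float in one pass.
// src is an interleaved 3-channel (HWC) image; the result is planar (CHW).
std::vector<float> fused_preprocess(const std::vector<std::uint8_t>& src,
                                    int src_w, int src_h, int dst_w, int dst_h,
                                    float pad_value) {
  const int channels = 3;
  assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
  assert(src.size() == static_cast<std::size_t>(src_w) * src_h * channels);

  const float scale = std::min(static_cast<float>(dst_w) / src_w,
                               static_cast<float>(dst_h) / src_h);
  const int scaled_w = std::min(dst_w, static_cast<int>(std::lround(src_w * scale)));
  const int scaled_h = std::min(dst_h, static_cast<int>(std::lround(src_h * scale)));

  // Start from a buffer filled with the padding value (the letterbox area).
  std::vector<float> dst(static_cast<std::size_t>(channels) * dst_h * dst_w, pad_value);

  for (int y = 0; y < scaled_h; ++y) {
    // Map the centre of the output pixel back into source coordinates.
    const float fy = std::min(std::max((y + 0.5f) / scale - 0.5f, 0.0f),
                              static_cast<float>(src_h - 1));
    const int y0 = static_cast<int>(fy);
    const int y1 = std::min(y0 + 1, src_h - 1);
    const float wy = fy - y0;
    for (int x = 0; x < scaled_w; ++x) {
      const float fx = std::min(std::max((x + 0.5f) / scale - 0.5f, 0.0f),
                                static_cast<float>(src_w - 1));
      const int x0 = static_cast<int>(fx);
      const int x1 = std::min(x0 + 1, src_w - 1);
      const float wx = fx - x0;
      for (int c = 0; c < channels; ++c) {
        // Read an interleaved source pixel and convert it to float.
        const auto at = [&](int yy, int xx) {
          return static_cast<float>(
              src[(static_cast<std::size_t>(yy) * src_w + xx) * channels + c]);
        };
        const float top = at(y0, x0) * (1.0f - wx) + at(y0, x1) * wx;
        const float bottom = at(y1, x0) * (1.0f - wx) + at(y1, x1) * wx;
        // Write directly into the planar (CHW) layout.
        dst[(static_cast<std::size_t>(c) * dst_h + y) * dst_w + x] =
            top * (1.0f - wy) + bottom * wy;
      }
    }
  }
  return dst;
}
```

Compared with running resize, letterbox, layout conversion and type conversion as four separate steps, this version reads each source pixel and writes each destination element once, which is the DRAM-traffic saving that the slide attributes to kernel fusion.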

20.

2. YOLOX model optimization (GPU/DLA)

21.

Benchmark of the YOLOX series on the Xavier/Orin GPU
Input resolution: 960x608
Evaluation conditions: JetPack 5.0.1, power mode MAXN, Jetson clocks enabled
Computation and parameter count of each YOLOX model (960x608):
| Model | Computation (GFLOPs) | Parameters (M) |
| YOLOX-Tiny | 21.66 | 5.04 |
| YOLOX-S | 38.07 | 8.94 |
| YOLOX-M | 104.84 | 25.28 |
| YOLOX-L | 221.37 | 54.14 |
| YOLOX-X | 401.11 | 98.97 |
[Chart annotations: the execution time barely changes between YOLOX-Tiny and YOLOX-S although the computation differs by 1.76x; "Idle?" marked on the GPU figure quoted from https://developer.nvidia.com/blog/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/]
Given the difference in computation, the execution-time gap between YOLOX-Tiny and YOLOX-S is small.
=> YOLOX-Tiny does not fully use the parallel resources of Xavier/Orin (low compute efficiency).

22.

YOLOX-Tiny network architecture
[Figure: backbone (BB, DB, CSP-Res blocks and SPP), neck (CSP blocks with upsampling) and three decoupled heads. Feature-map sizes: 960x608, 480x304, 240x152, 120x76, 60x38, 30x19. Channel counts are annotated for YOLOX-Tiny (24, 48, 96, 192 ch) and for YOLOX-S (32, 64, 128, 256 ch).]
Basic Conv + Activation configuration used in YOLOX-Tiny:
■ Convolution channels: 24xN
■ Activation: SWISH
=> Examine an efficient CNN architecture based on profiles of the Xavier GPU.

23.

Convolution profile on the Xavier GPU
Conditions: Batch: 1, Resolution: 64x64, Activation: ReLU, Precision: INT8
[Figure: 3D plot. X axis: output channels, Z axis: input channels, Y axis: efficiency*. Marked points: cin=32/cout=32, cin=64/cout=64, cin=128/cout=128.]
"13.5. Optimizing for Tensor Cores" in the TensorRT documentation (https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#optimize-tensor-cores) states: "Tensor Core layers tend to achieve better performance if the I/O tensor dimensions are aligned to a certain minimum granularity".
● Efficiency improves in a regular pattern at multiples of 32 channels.
● With input/output channel counts that are multiples of 32, all compute resources can be fully used.
● YOLOX-Tiny, whose channel counts change in multiples of 24, uses an inefficient parameter setting.
=> Switch to YOLOX-S for both compute efficiency and accuracy (chart label: 1.64x).
* Efficiency: (GFLOPs / DNN time) / TOPS * 100
  GFLOPs: 2*H*W*Cin*K*K/S*Cout/1000/1000/1000
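The efficiency metric defined on this slide can be sketched as follows. The function names are hypothetical. The convolution cost is written in terms of the output feature-map size, which sidesteps the stride term of the slide's formula, and the unit handling (GFLOPs per millisecond equals tera-operations per second) is spelled out explicitly because the slide leaves it implicit.

```cpp
#include <cassert>

// Cost of one convolution in GFLOPs, counting a multiply and an add as two
// operations. out_h/out_w are the OUTPUT feature-map dimensions.
double conv_gflops(int out_h, int out_w, int c_in, int c_out, int kernel) {
  assert(out_h > 0 && out_w > 0 && c_in > 0 && c_out > 0 && kernel > 0);
  return 2.0 * out_h * out_w * c_in * kernel * kernel * c_out / 1e9;
}

// Efficiency [%] = achieved throughput / peak throughput * 100.
// gflops / time_ms has the unit GFLOP/ms, which equals 10^12 operations per
// second, so it can be divided by a peak given in TOPS directly.
double efficiency_percent(double gflops, double time_ms, double peak_tops) {
  assert(time_ms > 0.0 && peak_tops > 0.0);
  const double achieved_tops = gflops / time_ms;
  return achieved_tops / peak_tops * 100.0;
}

// True when a channel count sits on the granularity the slide reports as
// efficient on the Xavier GPU (multiples of 32).
bool is_channel_aligned(int channels, int granularity = 32) {
  assert(channels > 0 && granularity > 0);
  return channels % granularity == 0;
}
```

As a check against numbers from the talk, `efficiency_percent(21.66, 11.0, 22.0)` (YOLOX-Tiny at 21.66 GFLOPs, 11 ms, on the 22 TOPS Xavier GPU) evaluates to about 8.95%, which matches the efficiency reported for YOLOX-tiny later in the deck. `is_channel_aligned(24)` is false while `is_channel_aligned(32)` is true, which is the difference between the YOLOX-Tiny and YOLOX-S base widths.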

24.

Conv + Activation profile on Xavier
Conditions: Resolution: 64x64, Activation: ReLU or SWISH
[Figure: 3D plots. X axis: output channels, Z axis: input channels, Y axis: efficiency. Activation plots quoted from https://pytorch.org/docs/]
● ReLU: f(x) = max(0, x)
● SWISH (SiLU): f(x) = x・sigmoid(x)
With SWISH, compute efficiency drops by 30-40% compared with ReLU.
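To make the cost difference between the two activations concrete, here is a minimal scalar sketch of both definitions (function names are illustrative). ReLU is a single comparison, while SWISH needs an exponential, a division and a multiplication per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// ReLU: f(x) = max(0, x). One comparison per element.
float relu(float x) { return std::max(0.0f, x); }

// SWISH (SiLU): f(x) = x * sigmoid(x). Needs exp, a division and a multiply.
float swish(float x) { return x / (1.0f + std::exp(-x)); }
```

For example, `relu(-1.0f)` is 0 and `swish(1.0f)` is about 0.731.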

25.

Visualization with the Nsight Systems profiler (Xavier)
Profile of YOLOX-S on Xavier (one inference):
● generateNativePointwise (SWISH) accounts for 26% of all CUDA kernel time.
● The top three CUDA kernels are convolutions.
● SWISH is the bottleneck.
=> Replace SWISH with ReLU (chart labels: 1.64x, 2.15x).
Reference: list of layer combinations for which layer fusion is supported, https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#fusion-types

26.

YOLOX performance evaluation with ReLU as the activation
[Chart: SWISH vs ReLU. Speed-up labels: 1.22x, 1.22x, 1.28x, 1.38x, 1.31x. With INT8 the accuracy ranking flips (ReLU > SWISH).]
● Switching to ReLU does not cause a sharp drop in accuracy.
● With INT8, ReLU is more accurate.
● In edge environments, a simple activation function such as ReLU can be preferable to a more complex one.
The same point is made in "Creating Performant DL Models for Use in Client Applications" (GTC20), Strategy 2: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21317-creating-performant-dl-models-for-use-in-client-applications.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczovL3d3dy5nb29nbGUuY29tLyJ9

27.

Implementing YOLOX on the Xavier DLA
● Xavier has two DLAs (10 TOPS in total): INT8 5 TOPS x 2. Figure quoted from https://developer.nvidia.com/blog/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/
● On Orin, the DLAs alone provide 105 TOPS (sparse), so making use of them matters even more.
● The DLA is a fixed-function accelerator made up of a convolution core, an activation engine, a pooling engine and so on. Figure quoted from http://nvdla.org/primer.htm

28.

How to use the Xavier/Orin DLA, and its constraints
■ Usage: adding the following four calls is enough to run on the DLA
  config->setFlag(BuilderFlag::kGPU_FALLBACK);
  config->setDefaultDeviceType(DeviceType::kDLA);
  config->setDLACore(dlaCore);
  runtime->setDLACore(dlaCore);
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dla-using-trt-api
■ Constraints
● Supported layers: Convolution, Deconvolution, Pooling, Activation (ReLU, Sigmoid, TanH, Clipped ReLU and Leaky ReLU), ElementWise, Concatenation, Resize ...
● Precision: FP16 or INT8
● Quantization: PTQ only (as of JetPack 5.1.0)
● Calibration: kENTROPY_CALIBRATION_2 only (as of JetPack 5.1.0)
● Batching: static batch only (as of JetPack 5.1.0)

29.

Building YOLOX-S for the DLA
Excerpt from the TensorRT engine build log:
[I] [TRT] ---------- Layers Running on DLA ----------
[I] [TRT] [DlaLayer] {ForeignNode[/backbone/backbone/stem/Concat.../backbone/backbone/dark5/dark5.1/conv1/act/Mul]}
[I] [TRT] [DlaLayer] {ForeignNode[/backbone/backbone/dark5/dark5.1/m.0/MaxPool.../head/Concat_2]}
[I] [TRT] ---------- Layers Running on GPU ----------
[I] [TRT] [GpuLayer] COPY: Reformatting CopyNode for Network Input images
[I] [TRT] [GpuLayer] SLICE: /backbone/backbone/stem/Slice
[I] [TRT] [GpuLayer] SLICE: /backbone/backbone/stem/Slice_1
[I] [TRT] [GpuLayer] SLICE: /backbone/backbone/stem/Slice_2
[I] [TRT] [GpuLayer] SLICE: /backbone/backbone/stem/Slice_3
[I] [TRT] [GpuLayer] SLICE: /backbone/backbone/stem/Slice_4
[I] [TRT] [GpuLayer] SLICE: /backbone/backbone/stem/Slice_5
[I] [TRT] [GpuLayer] POOLING: /backbone/backbone/dark5/dark5.1/m.1/MaxPool
[I] [TRT] [GpuLayer] POOLING: /backbone/backbone/dark5/dark5.1/m.2/MaxPool
[I] [TRT] [GpuLayer] SHUFFLE: /head/Reshape
[I] [TRT] [GpuLayer] COPY: /head/Reshape_copy_output
[I] [TRT] [GpuLayer] SHUFFLE: /head/Reshape_1
[I] [TRT] [GpuLayer] COPY: /head/Reshape_1_copy_output
[I] [TRT] [GpuLayer] SHUFFLE: /head/Reshape_2
[I] [TRT] [GpuLayer] COPY: /head/Reshape_2_copy_output
[I] [TRT] [GpuLayer] SHUFFLE: /head/Transpose
The layers listed under "Layers Running on GPU" are the operators that fell back to the GPU.

30.

Profile of YOLOX-S on the Xavier DLA
■ Before optimization (Original)
[Timeline: GPU kernel for Stem + data reformat for DLA (1 ms) → DLA kernel → GPU kernel for SPP (0.5 ms) → DLA kernel → GPU kernel for data reformat for GPU (0.5 ms)]
Operators that the DLA does not support cause GPU fallback.
=> To raise DLA efficiency, it is important to reduce GPU fallback.

31.

Optimizing YOLOX-S on the Xavier DLA (Remove SPP)
■ Remove SPP
● Pooling with a kernel size > 8 is not supported by the DLA.
[Figure: SPP(n) = BB(1,n) → maxpool 5x5 / maxpool 9x9 / maxpool 13x13 → BB(1,2n). The SPP block is removed.]
Removing SPP eliminates the GPU fallback it caused and cuts execution time by about 2 ms (no exchange of intermediate results with the GPU).

32.

Optimizing YOLOX-S on the Xavier DLA (Simplify Stem)
■ Simplify Stem
● Slice is not supported by the DLA, so it is replaced with Conv.
This reduces the load on the GPU by about 0.7 ms (lower GPU utilization).

33.

Optimizing YOLOX-S on the Xavier DLA (Switch to ReLU)
■ Switch to ReLU
● SWISH (Sigmoid + Mul) requires additional computation.
[Figure: Basic Block (k,n), BB: conv k x k x n + SWISH is changed to conv k x k x n + ReLU]
Replacing SWISH with ReLU, which has hardware support, cuts 20 ms.

34.

Optimizing YOLOX-S on the Xavier DLA (INT8 Quantization)
■ INT8 Quantization
● PTQ with entropy calibration
INT8 quantization cuts about 5 ms.

35.

Evaluation of YOLOX-S on the Xavier DLA
[Chart: speed-up labels 1.64x, 2.15x, 2.43x, 2.68x and 3.51x faster]
Simplifying the model to support DLA deployment improves compute efficiency further.
● GPU efficiency: 33.44%
● DLA efficiency: 48.23%
Reference: "Getting started with the Deep Learning Accelerator on NVIDIA Jetson Orin", https://github.com/NVIDIA-AI-IOT/jetson_dla_tutorial

36.

Optimization using the latest model architectures
[Figure: the YOLO series timeline from YOLO (15/6) to YOLO-NAS (23/5), annotated as follows.]
● 2020-21: CSP is mainstream. CSPNet (Cross Stage Partial Network).
● 2022 onward: ELAN-based designs are on the rise. ELAN (Efficient Layer Aggregation Network); the C2F block of YOLOv8 follows almost the same idea. Reference: "Designing Network Design Strategies Through Gradient Path Analysis", Chien-Yao Wang.
● YOLOv8 model scaling: S/M/L/X.
● YOLOX (21/3, Megvii) is getting somewhat dated, so a modern architecture is preferred. => ELAN (C2F)
In general, the aim is to adopt the newer macro-architecture.

37.

Optimization using the latest model architectures
[Figure: CSP-Res block vs ELAN-Res (C2F) block, each built from conv 1x1xn, conv 3x3xn and conv 1x1x2n layers]
● The macro-architecture is changed to one similar to YOLOv7/YOLOv8.
● Replacing 1x1 convolutions with 3x3 convolutions raises compute density (efficiency).
[Chart: speed-up labels 1.64x, 2.15x, 2.43x, 2.88x, 3.51x, 4.18x]

38.

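Execution time of the whole YOLOX pipeline before and after optimization
● With the preprocessing speed-up (~10x) and quantization (~2x) described so far, an overall speed-up of about 2.75x is expected.
| | YOLOX-tiny | YOLOX-S-Opt |
| Resolution | 960x608 | 960x608 |
| Precision | Floating point (FP16/32) | 8-bit integer (INT8) |
| Preprocessing device | CPU | GPU @ 22 TOPS |
| Preprocessing time | 5 ms @ CPU | 0.5 ms @ GPU |
| GFLOPs | 21.7 | 46.8 |
| Params (M) | 5.03 | 10.1 |
| DNN time [ms] | 11 @ FP16 | 5.33 @ INT8 |
| Efficiency [%] | 8.95 | 39.66 |
[Timeline: YOLOX-tiny: preprocessing 5 ms (CPU) + inference 11 ms (GPU). YOLOX-S-Opt: preprocessing 0.5 ms (GPU) + inference 5.3 ms (GPU). Preprocessing ~10x, inference ~2x, 2.75x faster overall on the GPU.]
YOLOX-S-Opt: YOLOX-S optimized for GPU/DLA (SWISH to ReLU, DLA optimization, ELAN).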

39.

Execution time of the whole YOLOX pipeline before and after optimization (continued)
● Same comparison as the previous slide: YOLOX-tiny takes 5 ms (CPU preprocessing) + 11 ms (FP16 inference); YOLOX-S-Opt takes 0.5 ms (GPU preprocessing) + 5.33 ms (INT8 inference), about 2.75x faster on the GPU.
● The time freed up by the optimization is used to make the model more accurate.
YOLOX-S-Opt: YOLOX-S optimized for GPU/DLA (SWISH to ReLU, DLA optimization, ELAN).
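The 2.75x figure can be reproduced from the stage times on the slide. The sketch below uses hypothetical names and, like the slide, counts only preprocessing and inference; the roughly 1 ms of postprocessing is left out.

```cpp
#include <cassert>

// Per-frame stage times in milliseconds (postprocessing is not included,
// following the comparison on the slide).
struct StageTimes {
  double preprocess_ms;
  double inference_ms;
};

double total_ms(const StageTimes& t) { return t.preprocess_ms + t.inference_ms; }

// End-to-end speed-up of `after` relative to `before`.
double speedup(const StageTimes& before, const StageTimes& after) {
  assert(total_ms(after) > 0.0);
  return total_ms(before) / total_ms(after);
}
```

With `StageTimes{5.0, 11.0}` for YOLOX-tiny and `StageTimes{0.5, 5.33}` for YOLOX-S-Opt, the totals are 16 ms and 5.83 ms, so `speedup` returns roughly 2.74, consistent with the "about 2.75x" stated in the talk.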

40.

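YOLOX model scaling
● Scale the depth of the model starting from S.
● Raise the resolution from the previous 960x608 to 1280x960.
| | YOLOX-tiny | YOLOX-S+-Opt-1280x960 |
| Resolution | 960x608 | 1280x960 |
| Precision | Floating point (FP16/32) | 8-bit integer (INT8) |
| Preprocessing device | CPU | GPU @ 22 TOPS |
| Preprocessing time | 5 ms @ CPU | 0.9 ms @ GPU |
| GFLOPs | 21.7 | 136.7 |
| Params (M) | 5.03 | 14.83 |
| DNN time [ms] | 11 @ FP16 | 13.76 @ INT8 |
| Efficiency [%] | 8.95 | 45.16 |
Scaling factors: model depth 2.0x, model width 1.0x, resolution 2.1x.
[Timeline: YOLOX-tiny: preprocessing 5 ms (CPU) + inference 11 ms. YOLOX-S+-Opt-1280x960: preprocessing 0.9 ms (GPU) + inference 14 ms.]
YOLOX-SPlus-Opt: model optimization, depth scaling, higher resolution.
Expected effects:
● Better accuracy on distant and small objects
● Better accuracy on nearby objects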

41.

3. YOLOX quantization

42.

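YOLOX quantization
[Figures from [Neta, GTC2021] and [Horowitz, ISSCC 2014]]
● The 8-bit integer type (INT8) is efficient in both power and area.
● In theory, INT8 is twice as fast as FP16.
● As a first step, try Post-Training Quantization (PTQ), which can be applied after training.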

43.

TensorRT entropy quantization
□ Evaluation conditions
● TensorRT: 8.5.2
● Calibration: EntropyV2
● Calibration data: 1000 images
● Dataset: BDD100K
● Metric: mAP50 (https://github.com/Cartucho/mAP)
[Figure: FP32 vs INT8 detection results; mAP50 changes by -11.03 (-23%)]
● After quantization, distant objects are no longer recognized.
● With the entropy calibration recommended by TensorRT, INT8 accuracy degrades by 23%.

44.

Accuracy by quantization algorithm
TensorRT (TRT) calibration (TensorRT documentation, "7. Working with INT8", https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#working-with-int8):
● IInt8EntropyCalibrator2 (recommended)
● IInt8MinMaxCalibrator
● IInt8LegacyCalibrator (percentile)
PyTorch calibration (pytorch-quantization documentation, https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html):
● max: simply use the global maximum absolute value
● entropy: TensorRT's entropy calibration
● percentile: remove outliers based on a given percentile
● mse: calibration based on MSE (Mean Squared Error)
Reference: Wu et al., arXiv:2004.09602, "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation"
[Chart: INT8 accuracy change per calibration method. Values in the figure: 0.00%, ▼1.33%, ▼2.25%, ▼1.94%, ▼9.59%, ▼10.80%, ▼2.40%, ▼3.00%, ▼1.31%, ▼1.79%, ▼0.90%, ▼1.75%, ▼23.56%, ▼55.86%, "no support".]
● Calibration images: 1000 images from the train dataset
● Weights all use per-channel max calibration
To keep accuracy when quantizing YOLOX, an appropriate calibration algorithm has to be chosen.
=> MinMax or Percentile 99.9999 is a good fit.
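The calibration methods differ mainly in how they pick the dynamic range that is mapped onto the INT8 grid. The following minimal sketch contrasts max calibration with percentile calibration. The function names are hypothetical, the percentile uses a simple nearest-rank rule (an assumption; real calibrators typically work on histograms), and the step size assumes symmetric quantization with 127 positive levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Max calibration: the range is the largest absolute activation observed.
float amax_max(const std::vector<float>& activations) {
  float amax = 0.0f;
  for (float v : activations) amax = std::max(amax, std::fabs(v));
  return amax;
}

// Percentile calibration: the range is the given percentile of |activation|,
// so the largest outliers are clipped. Nearest-rank rule (assumption).
float amax_percentile(std::vector<float> activations, double percentile) {
  assert(!activations.empty() && percentile > 0.0 && percentile <= 100.0);
  for (float& v : activations) v = std::fabs(v);
  std::sort(activations.begin(), activations.end());
  std::size_t rank = static_cast<std::size_t>(
      std::ceil(percentile / 100.0 * static_cast<double>(activations.size())));
  rank = std::max<std::size_t>(rank, 1);
  return activations[rank - 1];
}

// Step size of a symmetric INT8 grid covering [-amax, amax].
float int8_step(float amax) { return amax / 127.0f; }
```

A wide range keeps large activations intact but makes the step coarser for small values; a narrow range does the opposite and clips the tail. Which side of that trade-off hurts more depends on the model, which is why the talk evaluates each calibrator on YOLOX rather than relying on the default.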

45.

Quantization flow on the GPU
[Diagram quoted from https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#quantization-workflows]
● INT8 Quantization
  ○ Implicit Quantization (PTQ)
    ■ dynamic range API
    ■ calibration: Entropy, MinMax, Legacy
  ○ Explicit Quantization (QAT)
● YOLOX is a poor match for Entropy calibration.
● On the GPU, a variety of methods can be selected.

46.

Quantization flow on the DLA
On the DLA, quantization is constrained and there are only two options (as of JetPack 5.1.0).
[Diagram quoted from https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#quantization-workflows]
● INT8 Quantization
  ○ Implicit Quantization (PTQ)
    ■ dynamic range API
    ■ calibration: Entropy, MinMax, Legacy
  ○ Explicit Quantization (QAT)
● Quantization: PTQ only (as of JetPack 5.1.0)
● Calibration: kENTROPY_CALIBRATION_2 only (as of JetPack 5.1.0)
● YOLOX is a poor match for Entropy calibration.
● On the DLA, both PTQ with MinMax calibration and QAT are unsupported.
=> Use the dynamic range API to quantize without calibration.

47.

[beta]

ReLU6 + Implicit Quantization
Capping the maximum activation value with ReLU6 during training yields a model that is well suited to quantization.

  if (precision_ == "int8") {
    network->getInput(0)->setDynamicRange(0, 255.0);
    for (int i = 0; i < num; i++) {
      nvinfer1::ILayer *layer = network->getLayer(i);
      nvinfer1::ITensor *out = layer->getOutput(0);
      if (m_clip != 0.0) {  // m_clip is 6.0 here
        std::cout << "Set max value for outputs : " << m_clip << " " << name << std::endl;
        out->setDynamicRange(0.0, m_clip);
      }
      …
    }

The dynamic range of every layer's output is set to 6.0.
https://pytorch.org/docs/stable/generated/torch.nn.ReLU6.html
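Why a fixed range of 6.0 works without calibration can be seen from a small numeric sketch. It assumes symmetric INT8 quantization with scale = amax / 127, which is my understanding of how TensorRT interprets a dynamic range, and all function names are illustrative. Because ReLU6 guarantees that no activation exceeds 6.0, nothing is clipped when the range is set to 6.0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// ReLU6: f(x) = min(max(x, 0), 6).
float relu6(float x) { return std::min(std::max(x, 0.0f), 6.0f); }

// Symmetric INT8 quantization for a tensor with dynamic range [-amax, amax].
// Assumption: scale = amax / 127, values outside the range saturate.
std::int8_t quantize_int8(float x, float amax) {
  assert(amax > 0.0f);
  const float scale = amax / 127.0f;
  const float q = std::round(x / scale);
  return static_cast<std::int8_t>(std::min(127.0f, std::max(-127.0f, q)));
}

// Map an INT8 code back to a real value.
float dequantize_int8(std::int8_t q, float amax) {
  return static_cast<float>(q) * (amax / 127.0f);
}
```

With `amax = 6.0f`, the largest possible ReLU6 output maps exactly to code 127, and the round-trip error for any activation stays within half a step, that is about 0.024.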


48.

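Accuracy by quantization method
[Chart: accuracy of each quantization method, with the methods supported on the DLA highlighted]
For deploying an INT8 model on the DLA, ReLU6 combined with the dynamic range API is useful.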

49.

Performance Evaluation on Xavier
mAP50: accuracy on the TIER IV internal dataset
[Chart: mAP50 vs inference time (toward the upper left is better). Plotted models: YOLOX-Tiny 960x608-FP16 (baseline), YOLOX-Tiny 960x608-INT8, YOLOX-S 960x608-INT8, YOLOX-S-Opt 960x608-INT8, YOLOX-M 960x608-INT8, YOLOX-S+-Opt 960x960-INT8, YOLOX-S+-Opt 1280x960-INT8. Labeled value: 57.46.]
| Model | Resolution | GFLOPs | Params (M) | GPU time @ Xavier (INT8) [ms] | GPU efficiency [%] | DLA time @ Xavier (INT8) [ms] | DLA efficiency [%] |
| YOLOX-tiny | 960x608 | 21.5 | 5.02 | 6.74 | 14.49 | - | - |
| YOLOX-M | 960x608 | 104.5 | 25.24 | 15.26 | 31.13 | - | - |
| YOLOX-S-Opt | 960x608 | 45.4 | 14.8 | 5.36 | 39.66 | 10.19 | |
| YOLOX-S+-Opt | 960x960 | 102.9 | 14.8 | 10.05 | 46.53 | 31.17 | 66.01 |
| YOLOX-S+-Opt | 1280x960 | 136.2 | 14.8 | 13.91 | 44.51 | 40.65 | 67.01 |
With an inference time comparable to the baseline YOLOX-Tiny-FP16, accuracy improves by +10 mAP.

50.

Consolidating multiple Xaviers onto one Orin
YOLOX-S+-Opt here is the 960x960 version (102.9 G).
● Before: four Xaviers, each with a GPU (20 TOPS). Cameras such as CAM Front and CAM Front Left each run YOLOX-Tiny (FP16) at 11.00 ms, 22 ms for two cameras.
● Applied optimizations:
  1. Preprocess on GPU
  2. Model Optimization for GPUs/DLAs
  3. INT8 Quantization
● After: one Orin with GPU (85 TOPS), DLA0 (26 TOPS) and DLA1 (26 TOPS), handling CAM Front/Rear, CAM Front Left/Right and CAM Rear Left/Right. The timeline shows four YOLOX-S+-Opt (INT8) inferences of 4.1 ms each and four of 11.69 ms each.
By scaling up the model (tiny to S+) and the resolution (960x608 to 960x960), accuracy improves over the previous setup, and a single Orin is expected to run six to eight YOLOX instances.

51.

Consolidating multiple Xaviers onto one Orin
[Figure labels: DLA0, GPU, DLA1, GPU, GPU, GPU]

52.

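Summary of DNN inference optimization
● YOLOX preprocessing optimization
  ○ Porting preprocessing to CUDA and applying kernel fusion made it 11x faster than the CPU implementation
● YOLOX model optimization
  ○ Changed the model from Tiny to S, taking the minimum granularity of Tensor Cores into account
  ○ Changed SWISH to ReLU, taking compute efficiency on the Xavier GPU into account
  ○ Simplified the model so that it can be deployed to the DLA
  ○ Added an ELAN structure to reflect the latest model architectures
  => The time saved was spent on scaling the model (Depth: 2.0x, Resolution: 2.1x)
● YOLOX INT8 quantization
  ○ To minimize accuracy loss with PTQ, MinMax or Percentile 99.9999 is used for calibration
  ○ For DLA deployment, the quantization-friendly ReLU6 is used and the dynamic range is set explicitly
tensorrt_yolox: https://github.com/autowarefoundation/autoware.universe/tree/main/perception/tensorrt_yolox
YOLOX-SPlus-Opt: https://awf.ml.dev.web.auto/perception/models/yolox-sPlus-opt.onnx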

53.

https://tier4.jp/