Orin NX

Nsight Systems | NVIDIA Developer

Initial setup

NoMachine

Downloads – Download (NoMachine download page; get the arm64 .deb)

sudo apt update
sudo apt upgrade -y
sudo dpkg -i nomachine_*_arm64.deb

Configuration and startup

  1. Configure the server to allow remote connections:

    sudo systemctl start nxserver
  2. Enable NoMachine at boot:

    sudo systemctl enable nxserver
  3. Set EGL Capture to yes. This is a screen-capture feature provided by NoMachine, mainly to improve the remote-desktop experience under certain display servers:

    sudo /etc/NX/nxserver --eglcapture yes

    The setting takes effect after a reboot. It can be double-checked with the command below; if it prints EGL Capture has been enabled, the feature has been written to the configuration files.

    if [ -f "/usr/lib/systemd/user/[email protected]" ] && grep -q "nxpreload.sh" "/usr/lib/systemd/user/[email protected]" && [ -f "/usr/share/applications/org.gnome.Shell.desktop" ] && grep -q "nxpreload.sh" "/usr/share/applications/org.gnome.Shell.desktop" && [ -f "/usr/NX/etc/node.cfg" ] && grep -q "EnableEGLCapture 1" "/usr/NX/etc/node.cfg"; then echo "EGL Capture has been enabled"; else echo "Not enabled"; fi
  4. Restart the NoMachine service:

    sudo systemctl restart nxserver

Reboot

snapd version

After flashing, the preinstalled snapd is version 2.7.0. The Jetson kernel is not compatible with snapd 2.7.0, so Chrome/Firefox installed through it are broken.

Fix: roll back to an older snapd release that is compatible with Jetson

Run the following commands (installs snapd 2.68.5 and holds it so that neither snap nor apt will update it):

snap download snapd --revision=24724
sudo snap ack snapd_24724.assert
sudo snap install snapd_24724.snap
sudo snap refresh --hold snapd

Because the Orin is an ARM device, the x86 build of Chrome cannot be installed; install Chromium instead:

sudo add-apt-repository ppa:a-v-shkop/chromium
sudo apt-get update
sudo apt-get install chromium-browser

GPIO

Using hardware PWM on a Jetson requires modifying the pinmux table to mux the pin. JetPack ships a tool called jetson-io that can create and update a DTB with PWM enabled.

sudo /opt/nvidia/jetson-io/jetson-io.py

Select Configure Jetson 40pin Header > Configure header pins manually, enable pwm7 (pin 32), then Back > Save pin changes > Save and reboot to reconfigure pins. Press any key to reboot; the setting is then in effect.

libgpiod

With a third-party carrier board, the jetson-io tool cannot determine the board model, so the pin muxing must be configured by hand.

nvidia@tegra-ubuntu:/sys/class/pwm/pwmchip4$ sudo cat /sys/kernel/debug/gpio | grep PG.06
gpio-389 (PG.06 |usbhub_power_en ) out lo
nvidia@tegra-ubuntu:/sys/class/pwm/pwmchip4$ ^C
nvidia@tegra-ubuntu:/sys/class/pwm/pwmchip4$ sudo cat /sys/kernel/debug/gpio | grep PH.00
gpio-391 (PH.00 |m2_KeyB_power_en ) out lo
nvidia@tegra-ubuntu:/sys/class/pwm/pwmchip4$
nvidia@tegra-ubuntu:/sys/class/pwm/pwmchip4$ sudo cat /sys/kernel/debug/gpio | grep PN.01
gpio-433 (PN.01 )
nvidia@tegra-ubuntu:/sys/class/pwm/pwmchip4$ sudo cat /sys/kernel/debug/gpio | grep PCC.00
gpio-328 (PCC.00 |user-led ) out lo

This carrier board does not expose a PWM pin, however, so the signal is emulated with a plain GPIO.

sudo apt-get install libgpiod-dev

sudo gpioinfo

Drive the GPIO high or low:

sudo gpioset --mode=wait gpiochip0 106=1

sudo gpioset --mode=wait gpiochip0 106=0
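Software PWM over a GPIO is just toggling the line with sleeps in between. The timing arithmetic can be sketched in Python (illustrative only; `pwm_times` is a hypothetical helper, and the actual toggling would call gpioset or libgpiod):

```python
def pwm_times(freq_hz: float, duty: float):
    """Return (high_s, low_s): how long to hold the line high/low per cycle."""
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty must be in [0, 1]")
    period_s = 1.0 / freq_hz
    return duty * period_s, (1.0 - duty) * period_s

# 50 Hz at 25% duty: 5 ms high, 15 ms low per 20 ms period
high_s, low_s = pwm_times(50.0, 0.25)
print(high_s, low_s)
```

Note that a sleep-based loop like this jitters by the scheduler quantum, which is why hardware PWM is preferred when the pin is available.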

GPIO hogs

Orin NX Super device-tree path: /bus@0/i2c@c250000/gpio@24

# Check the current status
fdtget /boot/dtb/kernel_tegra234-p3768-0000+p3767-0000-nv-super.dtb \
    /bus@0/i2c@c250000/gpio@24/j16-30 status
# Output: okay

# Back up the DTB
sudo cp /boot/dtb/kernel_tegra234-p3768-0000+p3767-0000-nv-super.dtb \
    /boot/dtb/kernel_tegra234-p3768-0000+p3767-0000-nv-super.dtb.bak

# Disable the j16-30 hog (gpiochip2 line 7)
sudo fdtput -t s /boot/dtb/kernel_tegra234-p3768-0000+p3767-0000-nv-super.dtb \
    /bus@0/i2c@c250000/gpio@24/j16-30 status disabled

# Verify the change
fdtget /boot/dtb/kernel_tegra234-p3768-0000+p3767-0000-nv-super.dtb \
    /bus@0/i2c@c250000/gpio@24/j16-30 status
# Output: disabled

# Reboot to apply
sudo reboot

Verify after the reboot:

sudo gpioinfo gpiochip2
# line 7 should show "unused" rather than "[used]"

sudo gpioget gpiochip2 7
# should print 0 (success) rather than "Device or resource busy"

Recovery: sudo cp /boot/dtb/...nv-super.dtb.bak /boot/dtb/...nv-super.dtb && sudo reboot

OpenCV

https://jishuzhan.net/article/2013776823067918337

OpenCV has to be compiled manually to enable CUDA acceleration.

One-shot install script; adjust version (the OpenCV version), ARCH_BIN (the CUDA compute capability), and PYTHON_VERSION_NUM (the Python version).

Jetson model        Architecture  ARCH_BIN value  Notes
Jetson AGX Orin     Ampere        "8.7"           Script default
Jetson Orin NX      Ampere        "8.7"           Same as AGX Orin
Jetson Orin Nano    Ampere        "8.7"           Same as AGX Orin
Jetson AGX Xavier   Volta         "7.2"
Jetson Xavier NX    Volta         "7.2"
Jetson TX2          Pascal        "6.2"
Jetson Nano (B01)   Maxwell       "5.3"           Use this for the older Nano
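The table above can be restated as a lookup dict, which is convenient when scripting builds for several boards (the dict merely restates the table; the key strings are illustrative):

```python
# CUDA compute capability (ARCH_BIN) per Jetson board, from the table above
ARCH_BIN_BY_BOARD = {
    "Jetson AGX Orin": "8.7",
    "Jetson Orin NX": "8.7",
    "Jetson Orin Nano": "8.7",
    "Jetson AGX Xavier": "7.2",
    "Jetson Xavier NX": "7.2",
    "Jetson TX2": "6.2",
    "Jetson Nano (B01)": "5.3",
}

print(ARCH_BIN_BY_BOARD["Jetson Orin NX"])
```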
#!/bin/bash
#
# Copyright (c) 2024.
# Modified for Jetson Orin NX (CUDA Arch 8.7)
# Based on instructions for OpenCV 4.10.0
#

# ================= Config Area =================

# 1. OpenCV version
version="4.10.0"

# 2. CUDA compute capability (important!)
# Orin family (AGX Orin, Orin NX, Orin Nano) -> 8.7
# Xavier family -> 7.2
# Nano/TX1 -> 5.3
ARCH_BIN="8.7"

# 3. Python version (adjust for your system)
# Run 'python3 --version' to check
# JetPack 5 (Ubuntu 20.04) is usually 3.8
# JetPack 6 (Ubuntu 22.04) is usually 3.10
PYTHON_VERSION_NUM="3.10"

# Working directory
folder="workspace"

# =========================================================

set -e

# ---------------------------------------------------------
# 0. Optionally remove the distro-provided OpenCV
# ---------------------------------------------------------
for (( ; ; ))
do
    echo "Do you want to remove the default OpenCV (yes/no)?"
    read rm_old

    if [ "$rm_old" = "yes" ]; then
        echo "** Remove other OpenCV first"
        sudo apt -y purge '*libopencv*'
        break
    elif [ "$rm_old" = "no" ]; then
        break
    fi
done

echo "------------------------------------"
echo "** Install requirement (1/4)"
echo "------------------------------------"
sudo apt-get update
# Basic build tools
sudo apt-get install -y build-essential cmake git pkg-config unzip curl

# Image/video codec libraries
sudo apt-get install -y libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install -y libgstreamer1.0-dev libgstreamer-plugins-base1.0-dev
sudo apt-get install -y libv4l-dev v4l-utils qv4l2

# Image format libraries
sudo apt-get install -y libjpeg-dev libpng-dev libtiff-dev

# TBB parallelism library (compatibility: Ubuntu 22.04 ships libtbb12, older releases libtbb2)
sudo apt-get install -y libtbb-dev
if apt-cache search --names-only '^libtbb2$' | grep -q libtbb2; then
    sudo apt-get install -y libtbb2
elif apt-cache search --names-only '^libtbb12$' | grep -q libtbb12; then
    sudo apt-get install -y libtbb12
fi

# GUI support
sudo apt-get install -y libgtk2.0-dev

echo "------------------------------------"
echo "** Download opencv ${version} (2/4)"
echo "------------------------------------"
mkdir -p $folder
cd ${folder}

# Download the OpenCV sources (skip if already present)
if [ ! -f "opencv-${version}.zip" ]; then
    echo "Downloading OpenCV source..."
    curl -L https://github.com/opencv/opencv/archive/${version}.zip -o opencv-${version}.zip
else
    echo "opencv-${version}.zip already exists."
fi

# Download the contrib sources
if [ ! -f "opencv_contrib-${version}.zip" ]; then
    echo "Downloading OpenCV Contrib source..."
    curl -L https://github.com/opencv/opencv_contrib/archive/${version}.zip -o opencv_contrib-${version}.zip
else
    echo "opencv_contrib-${version}.zip already exists."
fi

# Unzip
echo "Unzipping..."
unzip -o opencv-${version}.zip > /dev/null
unzip -o opencv_contrib-${version}.zip > /dev/null

# The archives are kept so a retry does not re-download; uncomment to free space
# rm opencv-${version}.zip opencv_contrib-${version}.zip

cd opencv-${version}/

echo "------------------------------------"
echo "** Build opencv ${version} (3/4)"
echo "------------------------------------"
mkdir -p release
cd release/

# CMake configuration
# Note: the Orin NX uses compute capability 8.7
cmake -D WITH_CUDA=ON \
    -D WITH_CUDNN=ON \
    -D CUDA_ARCH_BIN="${ARCH_BIN}" \
    -D CUDA_ARCH_PTX="" \
    -D OPENCV_GENERATE_PKGCONFIG=ON \
    -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-${version}/modules \
    -D WITH_GSTREAMER=ON \
    -D WITH_LIBV4L=ON \
    -D BUILD_opencv_python3=ON \
    -D BUILD_opencv_gapi=OFF \
    -D BUILD_TESTS=OFF \
    -D BUILD_PERF_TESTS=OFF \
    -D BUILD_EXAMPLES=OFF \
    -D CMAKE_BUILD_TYPE=RELEASE \
    -D CMAKE_INSTALL_PREFIX=/usr/local ..

echo "Compiling... This may take a while."
make -j$(nproc)

echo "------------------------------------"
echo "** Install opencv ${version} (4/4)"
echo "------------------------------------"
sudo make install

# Append environment variables to .bashrc
# (only if not already present, to avoid duplicates)
if ! grep -q "export LD_LIBRARY_PATH=/usr/local/lib" ~/.bashrc; then
    echo 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
fi

# Set PYTHONPATH according to the Python version
SITE_PACKAGES_PATH="/usr/local/lib/python${PYTHON_VERSION_NUM}/site-packages"

if ! grep -q "export PYTHONPATH=${SITE_PACKAGES_PATH}" ~/.bashrc; then
    echo "export PYTHONPATH=${SITE_PACKAGES_PATH}/:\$PYTHONPATH" >> ~/.bashrc
fi

# Remind the user to source manually
echo "------------------------------------"
echo "** Install opencv ${version} successfully"
echo "** IMPORTANT: Please run the following command to apply changes:"
echo " source ~/.bashrc"
echo "** Bye :)"

MVS

Download from the vendor download center:

HIKROBOT Machine Vision Download Center

Machine Vision Industrial Camera SDK Runtime package (Linux) V4.7.0

MvCamCtrlSDK_Runtime-4.7.0_aarch64_20251113.deb

TensorRT

Add a symlink so the tool is on the PATH:

sudo ln -s /usr/src/tensorrt/bin/trtexec /usr/local/bin/trtexec

cuda-toolkit

Install:

sudo apt-get install cuda-toolkit

Append to the end of ~/.bashrc:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.6/lib64
export PATH=$PATH:/usr/local/cuda-12.6/bin
export CUDA_HOME=/usr/local/cuda-12.6

Usage reference

trtexec usage

Export .pt to ONNX (dynamic dimensions)

yolo export model=best.pt format=onnx dynamic=True opset=12

Convert the model

trtexec \
--onnx=best.onnx \
--saveEngine=yolo11n.engine \
--fp16 \
--minShapes=images:1x3x640x640 \
--optShapes=images:1x3x640x640 \
--maxShapes=images:1x3x640x640 \
--memPoolSize=workspace:4096 \
--verbose

Export a batch-2 model

trtexec \
--onnx=best.onnx \
--saveEngine=yolo11n_batch2.engine \
--fp16 \
--minShapes=images:2x3x640x640 \
--optShapes=images:2x3x640x640 \
--maxShapes=images:2x3x640x640 \
--verbose
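As a sanity check on the shapes above: the input binding grows linearly with batch size. A quick sketch (fp16 assumed because of --fp16; `tensor_bytes` is an illustrative helper, not part of trtexec):

```python
def tensor_bytes(n: int, c: int, h: int, w: int, dtype_bytes: int = 2) -> int:
    """Size in bytes of an NCHW tensor; dtype_bytes=2 assumes fp16."""
    return n * c * h * w * dtype_bytes

# 1x3x640x640 fp16 input vs the batch-2 variant
print(tensor_bytes(1, 3, 640, 640))  # 2457600 bytes (~2.46 MB)
print(tensor_bytes(2, 3, 640, 640))  # exactly double
```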

MVS usage

Getting started

DLA inference

When running the yolo26 model entirely on the DLA, the two attention modules (model.10 and model.22) are extremely slow on the DLA.
nsys profiling shows:

ForeignNode[2] (model.10 attention body): 4.56ms  28.0%  ← bottleneck
ForeignNode[4] (model.22 attention body): 4.59ms  28.2%  ← bottleneck
Combined: 9.15ms = 56% of the 16.26ms total
model.0-model.9   : Backbone (Conv + C3k2 blocks)           [DLA compatible]
model.10          : C2PSA block (with PSA attention)        [DLA compatible but slow]
model.11-model.21 : Neck (SPPF + Upsample + C3k2)           [DLA compatible]
model.22          : C2PSA block (with PSA attention)        [DLA compatible but slow]
model.23          : Detection Head (Conv + Concat + decode) [DLA compatible]
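The 56% share quoted above follows directly from the per-node timings:

```python
# ForeignNode timings (ms) from the nsys profile above
model10_attn = 4.56
model22_attn = 4.59
total = 16.26

attn = model10_attn + model22_attn
print(f"{attn:.2f} ms = {attn / total:.0%} of {total} ms total")
```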

Model structure

{
"Layers": [
"Reformatting CopyNode for Input Tensor 0 to /model.0/conv/Conv + PWN(PWN(/model.0/act/Sigmoid), PWN(/model.0/act/Mul))",
"/model.0/conv/Conv + PWN(PWN(/model.0/act/Sigmoid), PWN(/model.0/act/Mul))",
"/model.1/conv/Conv + PWN(PWN(/model.1/act/Sigmoid), PWN(/model.1/act/Mul))",
"/model.2/cv1/conv/Conv + PWN(PWN(/model.2/cv1/act/Sigmoid), PWN(/model.2/cv1/act/Mul))",
"/model.2/Split_1",
"/model.2/m.0/cv1/conv/Conv + PWN(PWN(/model.2/m.0/cv1/act/Sigmoid), PWN(/model.2/m.0/cv1/act/Mul))",
"Reformatting CopyNode for Input Tensor 0 to /model.2/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.2/m.0/cv2/act/Sigmoid), PWN(/model.2/m.0/cv2/act/Mul)), PWN(/model.2/m.0/Add))",
"Reformatting CopyNode for Input Tensor 1 to /model.2/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.2/m.0/cv2/act/Sigmoid), PWN(/model.2/m.0/cv2/act/Mul)), PWN(/model.2/m.0/Add))",
"/model.2/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.2/m.0/cv2/act/Sigmoid), PWN(/model.2/m.0/cv2/act/Mul)), PWN(/model.2/m.0/Add))",
"/model.2/Split_output_0 copy",
"/model.2/Split_output_1 copy",
"/model.2/m.0/Add_output_0 copy",
"/model.2/cv2/conv/Conv + PWN(PWN(/model.2/cv2/act/Sigmoid), PWN(/model.2/cv2/act/Mul))",
"/model.3/conv/Conv + PWN(PWN(/model.3/act/Sigmoid), PWN(/model.3/act/Mul))",
"/model.4/cv1/conv/Conv + PWN(PWN(/model.4/cv1/act/Sigmoid), PWN(/model.4/cv1/act/Mul))",
"/model.4/Split_4",
"/model.4/m.0/cv1/conv/Conv + PWN(PWN(/model.4/m.0/cv1/act/Sigmoid), PWN(/model.4/m.0/cv1/act/Mul))",
"/model.4/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.4/m.0/cv2/act/Sigmoid), PWN(/model.4/m.0/cv2/act/Mul)), PWN(/model.4/m.0/Add))",
"/model.4/Split_output_0 copy",
"/model.4/Split_output_1 copy",
"/model.4/m.0/Add_output_0 copy",
"/model.4/cv2/conv/Conv + PWN(PWN(/model.4/cv2/act/Sigmoid), PWN(/model.4/cv2/act/Mul))",
"/model.5/conv/Conv + PWN(PWN(/model.5/act/Sigmoid), PWN(/model.5/act/Mul))",
"/model.6/cv1/conv/Conv + PWN(PWN(/model.6/cv1/act/Sigmoid), PWN(/model.6/cv1/act/Mul))",
"/model.6/m.0/cv1/conv/Conv + PWN(PWN(/model.6/m.0/cv1/act/Sigmoid), PWN(/model.6/m.0/cv1/act/Mul))",
"/model.6/m.0/m/m.0/cv1/conv/Conv + PWN(PWN(/model.6/m.0/m/m.0/cv1/act/Sigmoid), PWN(/model.6/m.0/m/m.0/cv1/act/Mul))",
"/model.6/m.0/m/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.6/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.6/m.0/m/m.0/cv2/act/Mul)), PWN(/model.6/m.0/m/m.0/Add))",
"/model.6/m.0/m/m.1/cv1/conv/Conv + PWN(PWN(/model.6/m.0/m/m.1/cv1/act/Sigmoid), PWN(/model.6/m.0/m/m.1/cv1/act/Mul))",
"/model.6/m.0/m/m.1/cv2/conv/Conv + PWN(PWN(PWN(/model.6/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.6/m.0/m/m.1/cv2/act/Mul)), PWN(/model.6/m.0/m/m.1/Add))",
"/model.6/m.0/cv2/conv/Conv + PWN(PWN(/model.6/m.0/cv2/act/Sigmoid), PWN(/model.6/m.0/cv2/act/Mul))",
"/model.6/m.0/m/m.1/Add_output_0 copy",
"/model.6/m.0/cv3/conv/Conv + PWN(PWN(/model.6/m.0/cv3/act/Sigmoid), PWN(/model.6/m.0/cv3/act/Mul))",
"/model.6/Split_output_0 copy",
"/model.6/Split_output_1 copy",
"/model.6/cv2/conv/Conv + PWN(PWN(/model.6/cv2/act/Sigmoid), PWN(/model.6/cv2/act/Mul))",
"/model.7/conv/Conv + PWN(PWN(/model.7/act/Sigmoid), PWN(/model.7/act/Mul))",
"/model.8/cv1/conv/Conv + PWN(PWN(/model.8/cv1/act/Sigmoid), PWN(/model.8/cv1/act/Mul))",
"/model.8/m.0/cv1/conv/Conv + PWN(PWN(/model.8/m.0/cv1/act/Sigmoid), PWN(/model.8/m.0/cv1/act/Mul))",
"/model.8/m.0/m/m.0/cv1/conv/Conv + PWN(PWN(/model.8/m.0/m/m.0/cv1/act/Sigmoid), PWN(/model.8/m.0/m/m.0/cv1/act/Mul))",
"/model.8/m.0/m/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.8/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.8/m.0/m/m.0/cv2/act/Mul)), PWN(/model.8/m.0/m/m.0/Add))",
"/model.8/m.0/m/m.1/cv1/conv/Conv + PWN(PWN(/model.8/m.0/m/m.1/cv1/act/Sigmoid), PWN(/model.8/m.0/m/m.1/cv1/act/Mul))",
"/model.8/m.0/m/m.1/cv2/conv/Conv + PWN(PWN(PWN(/model.8/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.8/m.0/m/m.1/cv2/act/Mul)), PWN(/model.8/m.0/m/m.1/Add))",
"/model.8/m.0/cv2/conv/Conv + PWN(PWN(/model.8/m.0/cv2/act/Sigmoid), PWN(/model.8/m.0/cv2/act/Mul))",
"/model.8/m.0/m/m.1/Add_output_0 copy",
"/model.8/m.0/cv3/conv/Conv + PWN(PWN(/model.8/m.0/cv3/act/Sigmoid), PWN(/model.8/m.0/cv3/act/Mul))",
"/model.8/Split_output_0 copy",
"/model.8/Split_output_1 copy",
"/model.8/cv2/conv/Conv + PWN(PWN(/model.8/cv2/act/Sigmoid), PWN(/model.8/cv2/act/Mul))",
"/model.9/cv1/conv/Conv",
"/model.9/m/MaxPool",
"/model.9/m_1/MaxPool",
"/model.9/m_2/MaxPool",
"/model.9/cv1/conv/Conv_output_0 copy",
"/model.9/m/MaxPool_output_0 copy",
"/model.9/m_1/MaxPool_output_0 copy",
"/model.9/cv2/conv/Conv + PWN(PWN(/model.9/cv2/act/Sigmoid), PWN(/model.9/cv2/act/Mul))",
"/model.10/cv1/conv/Conv + PWN(PWN(/model.10/cv1/act/Sigmoid), PWN(/model.10/cv1/act/Mul))",
"/model.10/Split_13",
"/model.10/m/m.0/attn/qkv/conv/Conv",
"/model.10/m/m.0/attn/Reshape",
"/model.10/m/m.0/attn/Split",
"/model.10/m/m.0/attn/Split_16",
"/model.10/m/m.0/attn/MatMul",
"/model.10/m/m.0/attn/Softmax",
"/model.10/m/m.0/attn/Split_18",
"/model.10/m/m.0/attn/Reshape_2",
"/model.10/m/m.0/attn/MatMul_1",
"/model.10/m/m.0/attn/Reshape_1",
"/model.10/m/m.0/attn/pe/conv/Conv + /model.10/m/m.0/attn/Add",
"Reformatting CopyNode for Input Tensor 0 to /model.10/m/m.0/attn/proj/conv/Conv + /model.10/m/m.0/Add",
"/model.10/m/m.0/attn/proj/conv/Conv + /model.10/m/m.0/Add",
"/model.10/m/m.0/ffn/ffn.0/conv/Conv + PWN(PWN(/model.10/m/m.0/ffn/ffn.0/act/Sigmoid), PWN(/model.10/m/m.0/ffn/ffn.0/act/Mul))",
"/model.10/m/m.0/ffn/ffn.1/conv/Conv + /model.10/m/m.0/Add_1",
"/model.10/Split_output_0 copy",
"/model.10/m/m.0/Add_1_output_0 copy",
"/model.10/cv2/conv/Conv + PWN(PWN(/model.10/cv2/act/Sigmoid), PWN(/model.10/cv2/act/Mul))",
"/model.11/Resize",
"/model.11/Resize_output_0 copy",
"/model.13/cv1/conv/Conv + PWN(PWN(/model.13/cv1/act/Sigmoid), PWN(/model.13/cv1/act/Mul))",
"/model.13/m.0/cv1/conv/Conv + PWN(PWN(/model.13/m.0/cv1/act/Sigmoid), PWN(/model.13/m.0/cv1/act/Mul))",
"/model.13/m.0/m/m.0/cv1/conv/Conv + PWN(PWN(/model.13/m.0/m/m.0/cv1/act/Sigmoid), PWN(/model.13/m.0/m/m.0/cv1/act/Mul))",
"/model.13/m.0/m/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.13/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.13/m.0/m/m.0/cv2/act/Mul)), PWN(/model.13/m.0/m/m.0/Add))",
"/model.13/m.0/m/m.1/cv1/conv/Conv + PWN(PWN(/model.13/m.0/m/m.1/cv1/act/Sigmoid), PWN(/model.13/m.0/m/m.1/cv1/act/Mul))",
"/model.13/m.0/m/m.1/cv2/conv/Conv + PWN(PWN(PWN(/model.13/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.13/m.0/m/m.1/cv2/act/Mul)), PWN(/model.13/m.0/m/m.1/Add))",
"/model.13/m.0/cv2/conv/Conv + PWN(PWN(/model.13/m.0/cv2/act/Sigmoid), PWN(/model.13/m.0/cv2/act/Mul))",
"/model.13/m.0/m/m.1/Add_output_0 copy",
"/model.13/m.0/cv3/conv/Conv + PWN(PWN(/model.13/m.0/cv3/act/Sigmoid), PWN(/model.13/m.0/cv3/act/Mul))",
"/model.13/Split_output_0 copy",
"/model.13/Split_output_1 copy",
"/model.13/cv2/conv/Conv + PWN(PWN(/model.13/cv2/act/Sigmoid), PWN(/model.13/cv2/act/Mul))",
"/model.14/Resize",
"/model.14/Resize_output_0 copy",
"/model.16/cv1/conv/Conv + PWN(PWN(/model.16/cv1/act/Sigmoid), PWN(/model.16/cv1/act/Mul))",
"/model.16/m.0/cv1/conv/Conv + PWN(PWN(/model.16/m.0/cv1/act/Sigmoid), PWN(/model.16/m.0/cv1/act/Mul))",
"Reformatting CopyNode for Output Tensor 0 to /model.16/m.0/cv1/conv/Conv + PWN(PWN(/model.16/m.0/cv1/act/Sigmoid), PWN(/model.16/m.0/cv1/act/Mul))",
"/model.16/m.0/m/m.0/cv1/conv/Conv",
"Reformatting CopyNode for Input Tensor 0 to PWN(PWN(/model.16/m.0/m/m.0/cv1/act/Sigmoid), PWN(/model.16/m.0/m/m.0/cv1/act/Mul))",
"PWN(PWN(/model.16/m.0/m/m.0/cv1/act/Sigmoid), PWN(/model.16/m.0/m/m.0/cv1/act/Mul))",
"Reformatting CopyNode for Input Tensor 0 to /model.16/m.0/m/m.0/cv2/conv/Conv",
"/model.16/m.0/m/m.0/cv2/conv/Conv",
"Reformatting CopyNode for Input Tensor 0 to PWN(PWN(PWN(/model.16/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.16/m.0/m/m.0/cv2/act/Mul)), PWN(/model.16/m.0/m/m.0/Add))",
"Reformatting CopyNode for Input Tensor 1 to PWN(PWN(PWN(/model.16/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.16/m.0/m/m.0/cv2/act/Mul)), PWN(/model.16/m.0/m/m.0/Add))",
"PWN(PWN(PWN(/model.16/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.16/m.0/m/m.0/cv2/act/Mul)), PWN(/model.16/m.0/m/m.0/Add))",
"Reformatting CopyNode for Input Tensor 0 to /model.16/m.0/m/m.1/cv1/conv/Conv",
"/model.16/m.0/m/m.1/cv1/conv/Conv",
"Reformatting CopyNode for Input Tensor 0 to PWN(PWN(/model.16/m.0/m/m.1/cv1/act/Sigmoid), PWN(/model.16/m.0/m/m.1/cv1/act/Mul))",
"PWN(PWN(/model.16/m.0/m/m.1/cv1/act/Sigmoid), PWN(/model.16/m.0/m/m.1/cv1/act/Mul))",
"Reformatting CopyNode for Input Tensor 0 to /model.16/m.0/m/m.1/cv2/conv/Conv",
"/model.16/m.0/m/m.1/cv2/conv/Conv",
"Reformatting CopyNode for Input Tensor 0 to PWN(PWN(PWN(/model.16/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.16/m.0/m/m.1/cv2/act/Mul)), PWN(/model.16/m.0/m/m.1/Add))",
"PWN(PWN(PWN(/model.16/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.16/m.0/m/m.1/cv2/act/Mul)), PWN(/model.16/m.0/m/m.1/Add))",
"Reformatting CopyNode for Output Tensor 0 to PWN(PWN(PWN(/model.16/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.16/m.0/m/m.1/cv2/act/Mul)), PWN(/model.16/m.0/m/m.1/Add))",
"Reformatting CopyNode for Input Tensor 0 to /model.16/m.0/cv2/conv/Conv + PWN(PWN(/model.16/m.0/cv2/act/Sigmoid), PWN(/model.16/m.0/cv2/act/Mul))",
"/model.16/m.0/cv2/conv/Conv + PWN(PWN(/model.16/m.0/cv2/act/Sigmoid), PWN(/model.16/m.0/cv2/act/Mul))",
"Reformatting CopyNode for Output Tensor 0 to /model.16/m.0/cv2/conv/Conv + PWN(PWN(/model.16/m.0/cv2/act/Sigmoid), PWN(/model.16/m.0/cv2/act/Mul))",
"Reformatting CopyNode for Input Tensor 0 to /model.16/m.0/cv3/conv/Conv + PWN(PWN(/model.16/m.0/cv3/act/Sigmoid), PWN(/model.16/m.0/cv3/act/Mul))",
"/model.16/m.0/cv3/conv/Conv + PWN(PWN(/model.16/m.0/cv3/act/Sigmoid), PWN(/model.16/m.0/cv3/act/Mul))",
"/model.16/Split_output_0 copy",
"/model.16/Split_output_1 copy",
"/model.16/cv2/conv/Conv + PWN(PWN(/model.16/cv2/act/Sigmoid), PWN(/model.16/cv2/act/Mul))",
"/model.17/conv/Conv + PWN(PWN(/model.17/act/Sigmoid), PWN(/model.17/act/Mul))",
"/model.13/cv2/act/Mul_output_0 copy",
"/model.19/cv1/conv/Conv + PWN(PWN(/model.19/cv1/act/Sigmoid), PWN(/model.19/cv1/act/Mul))",
"/model.19/m.0/cv1/conv/Conv + PWN(PWN(/model.19/m.0/cv1/act/Sigmoid), PWN(/model.19/m.0/cv1/act/Mul))",
"/model.19/m.0/m/m.0/cv1/conv/Conv + PWN(PWN(/model.19/m.0/m/m.0/cv1/act/Sigmoid), PWN(/model.19/m.0/m/m.0/cv1/act/Mul))",
"/model.19/m.0/m/m.0/cv2/conv/Conv + PWN(PWN(PWN(/model.19/m.0/m/m.0/cv2/act/Sigmoid), PWN(/model.19/m.0/m/m.0/cv2/act/Mul)), PWN(/model.19/m.0/m/m.0/Add))",
"/model.19/m.0/m/m.1/cv1/conv/Conv + PWN(PWN(/model.19/m.0/m/m.1/cv1/act/Sigmoid), PWN(/model.19/m.0/m/m.1/cv1/act/Mul))",
"/model.19/m.0/m/m.1/cv2/conv/Conv + PWN(PWN(PWN(/model.19/m.0/m/m.1/cv2/act/Sigmoid), PWN(/model.19/m.0/m/m.1/cv2/act/Mul)), PWN(/model.19/m.0/m/m.1/Add))",
"/model.19/m.0/cv2/conv/Conv + PWN(PWN(/model.19/m.0/cv2/act/Sigmoid), PWN(/model.19/m.0/cv2/act/Mul))",
"/model.19/m.0/m/m.1/Add_output_0 copy",
"/model.19/m.0/cv3/conv/Conv + PWN(PWN(/model.19/m.0/cv3/act/Sigmoid), PWN(/model.19/m.0/cv3/act/Mul))",
"/model.19/Split_output_0 copy",
"/model.19/Split_output_1 copy",
"/model.19/cv2/conv/Conv + PWN(PWN(/model.19/cv2/act/Sigmoid), PWN(/model.19/cv2/act/Mul))",
"/model.20/conv/Conv + PWN(PWN(/model.20/act/Sigmoid), PWN(/model.20/act/Mul))",
"/model.10/cv2/act/Mul_output_0 copy",
"/model.22/cv1/conv/Conv + PWN(PWN(/model.22/cv1/act/Sigmoid), PWN(/model.22/cv1/act/Mul))",
"/model.22/Split_34",
"/model.22/m.0/m.0.0/cv1/conv/Conv + PWN(PWN(/model.22/m.0/m.0.0/cv1/act/Sigmoid), PWN(/model.22/m.0/m.0.0/cv1/act/Mul))",
"/model.22/m.0/m.0.0/cv2/conv/Conv + PWN(PWN(PWN(/model.22/m.0/m.0.0/cv2/act/Sigmoid), PWN(/model.22/m.0/m.0.0/cv2/act/Mul)), PWN(/model.22/m.0/m.0.0/Add))",
"/model.22/m.0/m.0.1/attn/qkv/conv/Conv",
"/model.22/m.0/m.0.1/attn/Reshape",
"/model.22/m.0/m.0.1/attn/Split",
"/model.22/m.0/m.0.1/attn/Split_38",
"/model.22/m.0/m.0.1/attn/MatMul",
"/model.22/m.0/m.0.1/attn/Softmax",
"/model.22/m.0/m.0.1/attn/Split_40",
"/model.22/m.0/m.0.1/attn/Reshape_2",
"/model.22/m.0/m.0.1/attn/MatMul_1",
"/model.22/m.0/m.0.1/attn/Reshape_1",
"/model.22/m.0/m.0.1/attn/pe/conv/Conv + /model.22/m.0/m.0.1/attn/Add",
"Reformatting CopyNode for Input Tensor 0 to /model.22/m.0/m.0.1/attn/proj/conv/Conv + /model.22/m.0/m.0.1/Add",
"/model.22/m.0/m.0.1/attn/proj/conv/Conv + /model.22/m.0/m.0.1/Add",
"/model.22/m.0/m.0.1/ffn/ffn.0/conv/Conv + PWN(PWN(/model.22/m.0/m.0.1/ffn/ffn.0/act/Sigmoid), PWN(/model.22/m.0/m.0.1/ffn/ffn.0/act/Mul))",
"/model.22/m.0/m.0.1/ffn/ffn.1/conv/Conv + /model.22/m.0/m.0.1/Add_1",
"/model.22/Split_output_0 copy",
"/model.22/Split_output_1 copy",
"/model.22/m.0/m.0.1/Add_1_output_0 copy",
"/model.22/cv2/conv/Conv + PWN(PWN(/model.22/cv2/act/Sigmoid), PWN(/model.22/cv2/act/Mul))",
"/model.23/cv2.2/cv2.2.0/conv/Conv + PWN(PWN(/model.23/cv2.2/cv2.2.0/act/Sigmoid), PWN(/model.23/cv2.2/cv2.2.0/act/Mul))",
"/model.23/cv2.2/cv2.2.1/conv/Conv",
"PWN(PWN(/model.23/cv2.2/cv2.2.1/act/Sigmoid), PWN(/model.23/cv2.2/cv2.2.1/act/Mul))",
"/model.23/cv2.2/cv2.2.2/Conv",
"/model.23/cv3.2/cv3.2.0/cv3.2.0.0/conv/Conv + PWN(PWN(/model.23/cv3.2/cv3.2.0/cv3.2.0.0/act/Sigmoid), PWN(/model.23/cv3.2/cv3.2.0/cv3.2.0.0/act/Mul))",
"/model.23/cv3.2/cv3.2.0/cv3.2.0.1/conv/Conv + PWN(PWN(/model.23/cv3.2/cv3.2.0/cv3.2.0.1/act/Sigmoid), PWN(/model.23/cv3.2/cv3.2.0/cv3.2.0.1/act/Mul))",
"/model.23/cv3.2/cv3.2.1/cv3.2.1.0/conv/Conv + PWN(PWN(/model.23/cv3.2/cv3.2.1/cv3.2.1.0/act/Sigmoid), PWN(/model.23/cv3.2/cv3.2.1/cv3.2.1.0/act/Mul))",
"/model.23/cv3.2/cv3.2.1/cv3.2.1.1/conv/Conv + PWN(PWN(/model.23/cv3.2/cv3.2.1/cv3.2.1.1/act/Sigmoid), PWN(/model.23/cv3.2/cv3.2.1/cv3.2.1.1/act/Mul))",
"/model.23/cv3.2/cv3.2.2/Conv",
"/model.23/Reshape_2",
"/model.23/Reshape_2_copy_output",
"/model.23/cv2.1/cv2.1.0/conv/Conv + PWN(PWN(/model.23/cv2.1/cv2.1.0/act/Sigmoid), PWN(/model.23/cv2.1/cv2.1.0/act/Mul))",
"/model.23/cv2.1/cv2.1.1/conv/Conv",
"PWN(PWN(/model.23/cv2.1/cv2.1.1/act/Sigmoid), PWN(/model.23/cv2.1/cv2.1.1/act/Mul))",
"/model.23/cv2.1/cv2.1.2/Conv",
"/model.23/cv3.1/cv3.1.0/cv3.1.0.0/conv/Conv + PWN(PWN(/model.23/cv3.1/cv3.1.0/cv3.1.0.0/act/Sigmoid), PWN(/model.23/cv3.1/cv3.1.0/cv3.1.0.0/act/Mul))",
"/model.23/cv3.1/cv3.1.0/cv3.1.0.1/conv/Conv + PWN(PWN(/model.23/cv3.1/cv3.1.0/cv3.1.0.1/act/Sigmoid), PWN(/model.23/cv3.1/cv3.1.0/cv3.1.0.1/act/Mul))",
"/model.23/cv3.1/cv3.1.1/cv3.1.1.0/conv/Conv + PWN(PWN(/model.23/cv3.1/cv3.1.1/cv3.1.1.0/act/Sigmoid), PWN(/model.23/cv3.1/cv3.1.1/cv3.1.1.0/act/Mul))",
"/model.23/cv3.1/cv3.1.1/cv3.1.1.1/conv/Conv + PWN(PWN(/model.23/cv3.1/cv3.1.1/cv3.1.1.1/act/Sigmoid), PWN(/model.23/cv3.1/cv3.1.1/cv3.1.1.1/act/Mul))",
"/model.23/cv3.1/cv3.1.2/Conv",
"/model.23/Reshape_1",
"/model.23/Reshape_1_copy_output",
"/model.23/cv2.0/cv2.0.0/conv/Conv + PWN(PWN(/model.23/cv2.0/cv2.0.0/act/Sigmoid), PWN(/model.23/cv2.0/cv2.0.0/act/Mul))",
"Reformatting CopyNode for Input Tensor 0 to /model.23/cv2.0/cv2.0.1/conv/Conv",
"/model.23/cv2.0/cv2.0.1/conv/Conv",
"Reformatting CopyNode for Input Tensor 0 to PWN(PWN(/model.23/cv2.0/cv2.0.1/act/Sigmoid), PWN(/model.23/cv2.0/cv2.0.1/act/Mul))",
"PWN(PWN(/model.23/cv2.0/cv2.0.1/act/Sigmoid), PWN(/model.23/cv2.0/cv2.0.1/act/Mul))",
"Reformatting CopyNode for Input Tensor 0 to /model.23/cv2.0/cv2.0.2/Conv",
"/model.23/cv2.0/cv2.0.2/Conv",
"Reformatting CopyNode for Output Tensor 0 to /model.23/cv2.0/cv2.0.2/Conv",
"/model.23/cv3.0/cv3.0.0/cv3.0.0.0/conv/Conv + PWN(PWN(/model.23/cv3.0/cv3.0.0/cv3.0.0.0/act/Sigmoid), PWN(/model.23/cv3.0/cv3.0.0/cv3.0.0.0/act/Mul))",
"/model.23/cv3.0/cv3.0.0/cv3.0.0.1/conv/Conv + PWN(PWN(/model.23/cv3.0/cv3.0.0/cv3.0.0.1/act/Sigmoid), PWN(/model.23/cv3.0/cv3.0.0/cv3.0.0.1/act/Mul))",
"/model.23/cv3.0/cv3.0.1/cv3.0.1.0/conv/Conv + PWN(PWN(/model.23/cv3.0/cv3.0.1/cv3.0.1.0/act/Sigmoid), PWN(/model.23/cv3.0/cv3.0.1/cv3.0.1.0/act/Mul))",
"/model.23/cv3.0/cv3.0.1/cv3.0.1.1/conv/Conv + PWN(PWN(/model.23/cv3.0/cv3.0.1/cv3.0.1.1/act/Sigmoid), PWN(/model.23/cv3.0/cv3.0.1/cv3.0.1.1/act/Mul))",
"/model.23/cv3.0/cv3.0.2/Conv",
"/model.23/Reshape",
"/model.23/Reshape_copy_output",
"PWN(/model.23/Sigmoid)",
"/model.23/Constant_12_output_0 + ONNXTRT_Broadcast_117",
"/model.23/Constant_10_output_0",
"PWN(/model.23/Add_1)",
"/model.23/Constant_9_output_0",
"PWN(/model.23/Sub)",
"PWN(/model.23/Sub_1)",
"PWN(PWN(/model.23/Add_2), PWN(/model.23/Constant_11_output_0 + ONNXTRT_Broadcast_115, PWN(/model.23/Div_1)))",
"/model.23/Div_1_output_0 copy",
"PWN(/model.23/Mul_2)",
"/model.23/Mul_2_output_0 copy"
],
"Bindings": [
"images",
"output0"
]
}
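Layer dumps in this shape (a top-level "Layers" list plus "Bindings") are easy to mine with a few lines of Python. A sketch that counts reformat copies and attention ops, run here on an inline sample rather than the full dump above:

```python
import json

# Inline sample mimicking the structure above; in practice use
# json.load(open("layers.json")) on a trtexec layer-info dump.
layer_info = json.loads("""{
  "Layers": [
    "/model.0/conv/Conv",
    "Reformatting CopyNode for Input Tensor 0 to /model.0/conv/Conv",
    "/model.10/m/m.0/attn/Softmax",
    "/model.10/m/m.0/attn/MatMul"
  ],
  "Bindings": ["images", "output0"]
}""")

reformats = sum("Reformatting CopyNode" in name for name in layer_info["Layers"])
attn_ops = sum("/attn/" in name for name in layer_info["Layers"])
print(reformats, attn_ops)
```

Counting "Reformatting CopyNode" entries is a quick proxy for layout-conversion overhead between fused regions.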

Q&A

"Errors were encountered while processing: nvidia-l4t-bootloader" error

Jetson upgrade pitfalls: the definitive fix for the nvidia-l4t-bootloader error (with full commands) - CSDN blog

Jetson Orin NX performance tooling guide

Core tools available on the Jetson Orin NX for TensorRT / CUDA / DLA performance analysis:

Tool                   Purpose                                          Availability
nsys (Nsight Systems)  GPU/DLA/CPU timeline profiling                   Preinstalled (2024.5.4)
trtexec                TensorRT engine benchmark & layer profiling      Preinstalled (TRT 10.3)
tegrastats             Live GPU/DLA/CPU/memory/temperature monitoring   Preinstalled
jtop                   Interactive system resource monitor              pip install jetson-stats
nvpmodel               Power/performance mode switching                 Preinstalled
jetson_clocks          Lock clocks to maximum                           Preinstalled

Note: nvvp (Visual Profiler) is deprecated in JetPack 6.x and replaced by nsys.

1. nsys (Nsight Systems)

1.1 Basic usage
# Check the version
nsys --version
# NVIDIA Nsight Systems version 2024.5.4.34-...

# Basic application-level profiling
nsys profile -o my_report ./my_application

# Selected sampling options (comments after a line continuation break the
# command, so they are listed here instead):
#   -t cuda,nvtx,osrt   trace CUDA, NVTX markers, and the OS runtime
#   --duration=10       capture for 10 seconds
#   --stats=true        generate a summary report
nsys profile \
    -t cuda,nvtx,osrt \
    --duration=10 \
    --stats=true \
    -o profile_output \
    ./my_application
1.2 TensorRT + DLA Profiling
# Profile a trtexec DLA engine
nsys profile \
    --stats=true \
    -o dla_profile \
    /usr/src/tensorrt/bin/trtexec \
    --loadEngine=model/yolo26_dla0_int8_640.engine \
    --iterations=200 \
    --warmUp=2000 \
    --dumpProfile
1.3 Interpreting the key output
# Example nsys stats output:
CUDA API Statistics:
Time(%) Total Time Calls Average Name
45.2% 234.5ms 200 1.17ms cudaStreamSynchronize
22.1% 114.8ms 200 0.57ms cudaLaunchKernel
...

CUDA Kernel Statistics:
Time(%) Total Time Instances Average Name
28.0% 45.6ms 200 0.23ms ForeignNode[2] ← DLA subgraph
...
1.4 Viewing reports
# Command-line statistics
nsys stats my_report.nsys-rep

# Export to JSON (handy for scripting)
nsys export --type=json --output=report.json my_report.nsys-rep

# GUI viewing (install the Nsight Systems GUI on a Windows/Linux PC)
# Download: https://developer.nvidia.com/nsight-systems
# Copy the .nsys-rep file from the NX to the PC and open it there
scp <user>@<jetson-ip>:/path/to/report.nsys-rep .

2. trtexec

2.1 Building engines
TRTEXEC=/usr/src/tensorrt/bin/trtexec

# GPU FP16
$TRTEXEC --onnx=model.onnx --saveEngine=model_gpu_fp16.engine \
    --fp16 --memPoolSize=workspace:4096MiB

# GPU INT8 (needs calibration data or PTQ)
$TRTEXEC --onnx=model.onnx --saveEngine=model_gpu_int8.engine \
    --int8 --fp16 --memPoolSize=workspace:4096MiB

# DLA INT8 (DLA core 0, allow GPU fallback)
$TRTEXEC --onnx=model.onnx --saveEngine=model_dla0_int8.engine \
    --useDLACore=0 --allowGPUFallback --int8 --fp16 \
    --memPoolSize=workspace:4096MiB

# DLA-GPU hybrid (pin specific layers to a device)
$TRTEXEC --onnx=model.onnx --saveEngine=model_hybrid.engine \
    --useDLACore=0 --allowGPUFallback --int8 --fp16 \
    --memPoolSize=workspace:4096MiB \
    --layerDeviceTypes="/model.10/m/m.0/attn/MatMul:GPU,/model.10/m/m.0/attn/Softmax:GPU"
2.2 Engine Benchmark
# Basic inference test
$TRTEXEC --loadEngine=model.engine --iterations=500 --warmUp=3000

# Detailed test with layer profiling
$TRTEXEC --loadEngine=model.engine --iterations=200 \
    --dumpProfile --exportProfile=profile.json

# Key output metrics:
# Throughput: xxx qps             ← throughput
# GPU Compute Time: mean=x.xxms   ← per-inference GPU compute time
# Total Host Walltime: x.xxms     ← end-to-end latency (incl. H2D/D2H)
2.3 Exporting layer info
# Export per-layer info (JSON)
$TRTEXEC --onnx=model.onnx --useDLACore=0 --allowGPUFallback --int8 --fp16 \
    --memPoolSize=workspace:4096MiB \
    --dumpLayerInfo --exportLayerInfo=layers.json --skipInference

# Export a per-layer timing profile
$TRTEXEC --loadEngine=model.engine --iterations=100 \
    --dumpProfile --exportProfile=timing.json
2.4 DLA-GPU hybrid layer control
# --layerDeviceTypes syntax: "layerName:GPU" or "layerName:DLA"
# Separate multiple layers with commas

# Example: force the attention layers onto the GPU
$TRTEXEC --onnx=model.onnx --useDLACore=0 --allowGPUFallback --int8 --fp16 \
--layerDeviceTypes="/model.10/m/m.0/attn/qkv/conv/Conv:GPU,\
/model.10/m/m.0/attn/Split:GPU,\
/model.10/m/m.0/attn/Transpose:GPU,\
/model.10/m/m.0/attn/MatMul:GPU,\
/model.10/m/m.0/attn/Softmax:GPU" \
--saveEngine=hybrid.engine
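When many layers need pinning, building the flag programmatically is less error-prone than hand-editing a long quoted string (a sketch; the layer names are the ones used above):

```python
# Build a --layerDeviceTypes argument from a list of layer names
attn_layers = [
    "/model.10/m/m.0/attn/qkv/conv/Conv",
    "/model.10/m/m.0/attn/MatMul",
    "/model.10/m/m.0/attn/Softmax",
]
flag = "--layerDeviceTypes=" + ",".join(f"{name}:GPU" for name in attn_layers)
print(flag)
```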

3. tegrastats

3.1 Basic usage
# Live monitoring (1 s refresh)
tegrastats

# Custom refresh interval (milliseconds)
tegrastats --interval 500

# Log to a file
tegrastats --interval 1000 --logfile /tmp/tegra_log.txt &
3.2 Output fields
RAM 6543/15823MB        ← memory used/total
GR3D_FREQ 76%           ← GPU utilization
NVDLA0_FREQ 100%        ← DLA0 utilization
NVDLA1_FREQ 85%         ← DLA1 utilization
CPU [20%@2201]          ← CPU utilization @ frequency (MHz)
tj: 52C                 ← junction temperature
VDD_CPU_GPU_CV 4500mW   ← CPU/GPU/CV power draw
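Logged tegrastats output is one line per sample, so post-processing is a matter of a few regexes. A minimal parser for some of the fields listed above (field format assumed as shown; not an official API):

```python
import re

def parse_tegrastats(line: str) -> dict:
    """Extract a few fields from one tegrastats log line (format as shown above)."""
    out = {}
    m = re.search(r"RAM (\d+)/(\d+)MB", line)
    if m:
        out["ram_used_mb"] = int(m.group(1))
        out["ram_total_mb"] = int(m.group(2))
    m = re.search(r"GR3D_FREQ (\d+)%", line)
    if m:
        out["gpu_util_pct"] = int(m.group(1))
    return out

sample = "RAM 6543/15823MB GR3D_FREQ 76% NVDLA0_FREQ 100% tj: 52C"
print(parse_tegrastats(sample))
```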

4. nvpmodel & jetson_clocks

4.1 Power modes
# Show the current mode
nvpmodel -q

# Set maximum performance (MAXN_SUPER on the Orin NX)
sudo nvpmodel -m 0

# List the available modes
nvpmodel -p --verbose
4.2 Locking clocks
# Lock all clocks at their maximum (mandatory before benchmarking)
sudo jetson_clocks

# Show the current clock state
sudo jetson_clocks --show

# Restore the defaults (dynamic scaling)
sudo jetson_clocks --restore

5. VPI Profiling

5.1 VPI Python benchmark
import vpi, numpy as np, time

W, H = 1280, 720
src = vpi.asimage(np.random.randint(0, 255, (H, W), dtype=np.uint8))
warp = vpi.WarpMap(vpi.WarpGrid((W, H)))

for backend in [vpi.Backend.CUDA, vpi.Backend.VIC]:
    # warmup
    for _ in range(30):
        with backend:
            out = src.remap(warp)
            out.cpu()
    # benchmark
    times = []
    for _ in range(200):
        t0 = time.perf_counter()
        with backend:
            out = src.remap(warp)
            out.cpu()
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    print(f"{backend}: avg={sum(times)/len(times):.3f}ms min={times[0]:.3f}ms")
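Since the benchmark sorts its samples, percentiles come almost for free. A small helper (the nearest-rank index convention used here is an arbitrary choice):

```python
def latency_stats(times_ms):
    """Return (avg, p50, p99) over a list of latencies in milliseconds."""
    t = sorted(times_ms)
    pick = lambda q: t[min(len(t) - 1, int(q * len(t)))]
    return sum(t) / len(t), pick(0.50), pick(0.99)

avg, p50, p99 = latency_stats([1.0, 2.0, 3.0, 4.0])
print(avg, p50, p99)
```

Reporting p99 alongside the average makes thermal-throttling spikes visible, which the average alone hides.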
5.2 Supported VPI backends
Backend            Hardware                               Notes
VPI_BACKEND_CUDA   GPU CUDA cores                         General-purpose compute, lowest latency
VPI_BACKEND_PVA    PVA (Programmable Vision Accelerator)  Dedicated image processing, low power
VPI_BACKEND_VIC    VIC (Video Image Compositor)           Dedicated video processing, bandwidth-optimized
VPI_BACKEND_NVENC  NVENC (video encoder)                  Encoding only
VPI_BACKEND_CPU    CPU                                    Slowest; for debugging

Remap is supported on: CUDA, VIC, CPU (not on PVA)

5.3 VPI hardware assignment in the pipeline

VPI calls in the current stereo_3d_pipeline:

Operation                          Backend  Hardware  Latency
Remap (stereo rectification) L+R   CUDA     GPU       ~2.8ms (dual)
ConvertImageFormat (NV12→Gray)     CUDA     GPU       ~0.1ms
TemporalNoiseReduction             CUDA     GPU       ~0.5ms (if enabled)

6. Typical benchmark workflow

Full performance test flow
# 1. Set the maximum performance mode
sudo nvpmodel -m 0
sudo jetson_clocks

# 2. Let the system settle (5 seconds idle)
sleep 5

# 3. TRT engine benchmark
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine \
    --iterations=500 --warmUp=3000 --avgRuns=10

# 4. End-to-end pipeline test
cd /home/nvidia/NX_volleyball/stereo_3d_pipeline
timeout 15 ./build/stereo_pipeline -c config/pipeline_triple.yaml

# 5. End-to-end nsys profiling
nsys profile --stats=true -o pipeline_profile \
    timeout 10 ./build/stereo_pipeline -c config/pipeline_triple.yaml

# 6. Monitor system state at the same time
tegrastats --interval 200 --logfile /tmp/bench_tegra.log &
Benchmark caveats
  1. Always lock the clocks: run jetson_clocks before testing, otherwise DVFS makes the results unstable
  2. Warm up thoroughly: the first TRT inferences are slow (JIT optimization); warm up for at least 2-3 seconds
  3. Mind the temperature: long runs cause thermal throttling; watch the tj reading
  4. The DLA is timed separately: DLA latency does not appear in GPU Compute Time; inspect it with nsys or --dumpProfile
  5. Memory-bandwidth contention: the DLA and GPU share LPDDR5 bandwidth and slow each other down when used together

jtop JetPack version detection fix

Problem

On a Jetson Orin NX (JetPack 6.2, L4T R36.4.7), jtop reports the JetPack version as MISSING,
even though nvidia-jetpack 6.2.1+b38 is actually installed.

Root cause

jtop (jetson-stats 4.3.2) maps the L4T version number to a JetPack version through an internal table.
The table lives in:

/usr/local/lib/python3.10/dist-packages/jtop/core/jetson_variables.py


The table contains entries such as "36.4.3": "6.2" but is missing a "36.4.7" entry.
When the L4T version is R36.4.7, the lookup fails and jtop shows MISSING.
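The failure is simply a dictionary miss, which the sketch below reproduces (the table excerpt is illustrative, not the full contents of jetson_variables.py):

```python
# Sketch of jtop's L4T -> JetPack lookup; jetson-stats 4.3.2 ships no "36.4.7" key
NVIDIA_JETPACK = {"36.4.3": "6.2"}  # illustrative excerpt

def jetpack_for(l4t: str) -> str:
    return NVIDIA_JETPACK.get(l4t, "MISSING")

print(jetpack_for("36.4.3"))  # 6.2
print(jetpack_for("36.4.7"))  # MISSING until the entry is added

NVIDIA_JETPACK["36.4.7"] = "6.2"  # equivalent of patching the file
print(jetpack_for("36.4.7"))  # 6.2
```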

Verification steps

# 1. Confirm the L4T version
cat /etc/nv_tegra_release
# Expected: # R36 (release), REVISION: 4.7, ...

# 2. Confirm JetPack is installed
dpkg -l | grep nvidia-jetpack
# Expected: nvidia-jetpack 6.2.1+b38

# 3. Inspect the jtop mapping table
grep "36.4" /usr/local/lib/python3.10/dist-packages/jtop/core/jetson_variables.py
# If only "36.4.3": "6.2" appears and there is no "36.4.7", the problem is confirmed

Fix

Add a "36.4.7": "6.2" entry to the mapping table:

# Back up the original file
sudo cp /usr/local/lib/python3.10/dist-packages/jtop/core/jetson_variables.py \
    /usr/local/lib/python3.10/dist-packages/jtop/core/jetson_variables.py.bak

# Insert "36.4.7": "6.2" before the "36.4.3": "6.2" entry
sudo python3 -c "
path = '/usr/local/lib/python3.10/dist-packages/jtop/core/jetson_variables.py'
with open(path, 'r') as f:
content = f.read()

old = '\"36.4.3\": \"6.2\"'
new = '\"36.4.7\": \"6.2\",\n \"36.4.3\": \"6.2\"'
content = content.replace(old, new)

with open(path, 'w') as f:
f.write(content)
print('Patched successfully')
"

# Restart the jtop service
sudo systemctl restart jtop.service

Verify the fix

# Check that the mapping was added
grep "36.4.7" /usr/local/lib/python3.10/dist-packages/jtop/core/jetson_variables.py
# Should print a line containing "36.4.7": "6.2"

# Run jtop to confirm
jtop
# JetPack should now read 6.2 instead of MISSING