当前位置：首页 > news >正文

jetson orin nano super AI模型部署之路（八）tensorrt C++ api介绍

news 来源：原创 2025/7/12 22:30:08

我们基于tensorrt-cpp-api这个仓库介绍。这个仓库的代码是一个非常不错的tensorrt的cpp api实现，可基于此开发自己的项目。

我们从src/main.cpp开始按顺序说明。

一、首先是声明我们创建tensorrt model的参数。

 // Specify our GPU inference configuration optionsOptions options;// Specify what precision to use for inference// FP16 is approximately twice as fast as FP32.options.precision = Precision::FP16;// If using INT8 precision, must specify path to directory containing// calibration data.options.calibrationDataDirectoryPath = "";// Specify the batch size to optimize for.options.optBatchSize = 1;// Specify the maximum batch size we plan on running.options.maxBatchSize = 1;// Specify the directory where you want the model engine model file saved.options.engineFileDir = ".";

这里Options是在src/engine.h中声明的一个结构体


// 网络的选项配置
struct Options {// 用于 GPU 推理的精度模式// 可选值：FP32（单精度浮点）、FP16（半精度浮点）、INT8（8位量化）Precision precision = Precision::FP16;// 如果选择 INT8 精度，必须提供校准数据集的目录路径std::string calibrationDataDirectoryPath;// 用于计算 INT8 校准数据的批量大小// 应设置为 GPU 能支持的最大批量大小int32_t calibrationBatchSize = 128;// 优化的批量大小// 表示引擎在构建时会针对该批量大小进行优化int32_t optBatchSize = 1;// 最大允许的批量大小// 表示引擎支持的最大批量大小int32_t maxBatchSize = 16;// GPU 设备索引// 用于指定在哪个 GPU 上运行推理int deviceIndex = 0;// 引擎文件保存的目录// 用于存储 TensorRT 引擎文件std::string engineFileDir = ".";// 最大允许的输入宽度// 默认为 -1，表示期望固定的输入大小int32_t maxInputWidth = -1;// 最小允许的输入宽度// 默认为 -1，表示期望固定的输入大小int32_t minInputWidth = -1;// 优化的输入宽度// 默认为 -1，表示期望固定的输入大小int32_t optInputWidth = -1;
};

二、构建tensorrt推理的engine

Engine<float> engine(options);

这里Engine是一个模板类，继承字IEngine类。IEngine 是一个抽象接口类，定义了 TensorRT 推理引擎的核心功能。
它包括构建和加载网络、运行推理、获取输入和输出张量维度等功能。
通过模板参数 T，可以支持不同的数据类型（如 float 或 int）。
具体的实现需要在派生类中完成。

IEngine类的实现在include/interfaces/IEngine.h中。

template <typename T>
class IEngine {
public:// 虚析构函数，确保派生类的资源能够正确释放virtual ~IEngine() = default;// 构建并加载 ONNX 模型到 TensorRT 引擎文件// 参数：// - onnxModelPath: ONNX 模型文件的路径// - subVals: 输入数据的减法均值，用于归一化// - divVals: 输入数据的除法均值，用于归一化// - normalize: 是否对输入数据进行归一化// 返回值：// - 如果成功构建并加载网络，返回 true；否则返回 falsevirtual bool buildLoadNetwork(std::string onnxModelPath, const std::array<float, 3> &subVals = {0.f, 0.f, 0.f},const std::array<float, 3> &divVals = {1.f, 1.f, 1.f}, bool normalize = true) = 0;// 从磁盘加载 TensorRT 引擎文件到内存// 参数：// - trtModelPath: TensorRT 引擎文件的路径// - subVals: 输入数据的减法均值，用于归一化// - divVals: 输入数据的除法均值，用于归一化// - normalize: 是否对输入数据进行归一化// 返回值：// - 如果成功加载网络，返回 true；否则返回 falsevirtual bool loadNetwork(std::string trtModelPath, const std::array<float, 3> &subVals = {0.f, 0.f, 0.f},const std::array<float, 3> &divVals = {1.f, 1.f, 1.f}, bool normalize = true) = 0;// 运行推理// 参数：// - inputs: 输入数据，格式为 [input][batch][cv::cuda::GpuMat]// - featureVectors: 输出数据，格式为 [batch][output][feature_vector]// 返回值：// - 如果推理成功，返回 true；否则返回 falsevirtual bool runInference(const std::vector<std::vector<cv::cuda::GpuMat>> &inputs, std::vector<std::vector<std::vector<T>>> &featureVectors) = 0;// 获取输入张量的维度// 返回值：// - 输入张量的维度列表，类型为 nvinfer1::Dims3virtual const std::vector<nvinfer1::Dims3> &getInputDims() const = 0;// 获取输出张量的维度// 返回值：// - 输出张量的维度列表，类型为 nvinfer1::Dimsvirtual const std::vector<nvinfer1::Dims> &getOutputDims() const = 0;
};

再来看Engine类的实现。

template <typename T>
class Engine : public IEngine<T> {
public:// 构造函数，初始化引擎选项Engine(const Options &options);// 析构函数，释放 GPU 缓冲区~Engine();// 构建并加载 ONNX 模型到 TensorRT 引擎文件// 将模型缓存到磁盘以避免重复构建，并加载到内存中// 默认情况下，输入数据会被归一化到 [0.f, 1.f]// 如果需要 [-1.f, 1.f] 的归一化，可以通过 subVals 和 divVals 参数设置bool buildLoadNetwork(std::string onnxModelPath, const std::array<float, 3> &subVals = {0.f, 0.f, 0.f},const std::array<float, 3> &divVals = {1.f, 1.f, 1.f}, bool normalize = true) override;// 从磁盘加载 TensorRT 引擎文件到内存// 默认情况下，输入数据会被归一化到 [0.f, 1.f]// 如果需要 [-1.f, 1.f] 的归一化，可以通过 subVals 和 divVals 参数设置bool loadNetwork(std::string trtModelPath, const std::array<float, 3> &subVals = {0.f, 0.f, 0.f},const std::array<float, 3> &divVals = {1.f, 1.f, 1.f}, bool normalize = true) override;// 运行推理// 输入格式：[input][batch][cv::cuda::GpuMat]// 输出格式：[batch][output][feature_vector]bool runInference(const std::vector<std::vector<cv::cuda::GpuMat>> &inputs, std::vector<std::vector<std::vector<T>>> &featureVectors) override;// 工具函数：调整图像大小并保持宽高比，通过在右侧或底部填充来实现// 适用于需要将检测坐标映射回原始图像的场景（例如 YOLO 模型）static cv::cuda::GpuMat resizeKeepAspectRatioPadRightBottom(const cv::cuda::GpuMat &input, size_t height, size_t width,const cv::Scalar &bgcolor = cv::Scalar(0, 0, 0));// 获取输入张量的维度[[nodiscard]] const std::vector<nvinfer1::Dims3> &getInputDims() const override { return m_inputDims; };// 获取输出张量的维度[[nodiscard]] const std::vector<nvinfer1::Dims> &getOutputDims() const override { return m_outputDims; };// 工具函数：将三维嵌套的输出数组转换为二维数组// 适用于批量大小为 1 且有多个输出特征向量的情况static void transformOutput(std::vector<std::vector<std::vector<T>>> &input, std::vector<std::vector<T>> &output);// 工具函数：将三维嵌套的输出数组转换为一维数组// 适用于批量大小为 1 且只有一个输出特征向量的情况static void transformOutput(std::vector<std::vector<std::vector<T>>> &input, std::vector<T> &output);// 工具函数：将输入从 NHWC 格式转换为 NCHW 格式，并应用归一化和均值减法static cv::cuda::GpuMat blobFromGpuMats(const std::vector<cv::cuda::GpuMat> &batchInput, const std::array<float, 3> &subVals,const std::array<float, 3> &divVals, bool normalize, bool swapRB = false);private:// 构建网络bool build(std::string onnxModelPath, const std::array<float, 3> &subVals, const std::array<float, 3> &divVals, bool normalize);// 将引擎选项序列化为字符串std::string serializeEngineOptions(const Options &options, const std::string &onnxModelPath);// 获取设备名称列表void getDeviceNames(std::vector<std::string> &deviceNames);// 清除 GPU 缓冲区void clearGpuBuffers();// 输入数据的归一化、缩放和均值减法参数std::array<float, 3> m_subVals{};std::array<float, 3> m_divVals{};bool m_normalize;// 存储输入和输出 GPU 缓冲区的指针std::vector<void *> m_buffers;std::vector<uint32_t> m_outputLengths{};std::vector<nvinfer1::Dims3> m_inputDims;std::vector<nvinfer1::Dims> m_outputDims;std::vector<std::string> m_IOTensorNames;int32_t m_inputBatchSize;// TensorRT 运行时和推理上下文std::unique_ptr<nvinfer1::IRuntime> m_runtime = nullptr;std::unique_ptr<Int8EntropyCalibrator2> m_calibrator = nullptr;std::unique_ptr<nvinfer1::ICudaEngine> m_engine = nullptr;std::unique_ptr<nvinfer1::IExecutionContext> m_context = nullptr;// 引擎选项const Options m_options;// TensorRT 日志器Logger m_logger;
};// 构造函数：初始化引擎选项
template <typename T>
Engine<T>::Engine(const Options &options) : m_options(options) {}// 析构函数：清除 GPU 缓冲区
template <typename T>
Engine<T>::~Engine() { clearGpuBuffers(); }

相较于抽象接口类IEngine，Engine类添加了resizeKeepAspectRatioPadRightBottom、transformOutput、blobFromGpuMats这几个static的public函数。

    // 工具函数：调整图像大小并保持宽高比，通过在右侧或底部填充来实现// 适用于需要将检测坐标映射回原始图像的场景（例如 YOLO 模型）static cv::cuda::GpuMat resizeKeepAspectRatioPadRightBottom(const cv::cuda::GpuMat &input, size_t height, size_t width,const cv::Scalar &bgcolor = cv::Scalar(0, 0, 0));

    // 工具函数：将三维嵌套的输出数组转换为二维数组// 适用于批量大小为 1 且有多个输出特征向量的情况static void transformOutput(std::vector<std::vector<std::vector<T>>> &input, std::vector<std::vector<T>> &output);// 工具函数：将三维嵌套的输出数组转换为一维数组// 适用于批量大小为 1 且只有一个输出特征向量的情况static void transformOutput(std::vector<std::vector<std::vector<T>>> &input, std::vector<T> &output);// 工具函数：将输入从 NHWC 格式转换为 NCHW 格式，并应用归一化和均值减法static cv::cuda::GpuMat blobFromGpuMats(const std::vector<cv::cuda::GpuMat> &batchInput, const std::array<float, 3> &subVals,const std::array<float, 3> &divVals, bool normalize, bool swapRB = false);

在类中，static 修饰的成员函数是静态成员函数，它与普通成员函数有以下区别：

(1)静态成员函数不依赖于类的实例
静态成员函数属于类本身，而不是类的某个实例。
调用静态成员函数时，不需要创建类的对象，可以直接通过类名调用。

(2)静态成员函数不能访问非静态成员变量
静态成员函数没有 this 指针，因此无法访问类的非静态成员变量或非静态成员函数。
静态成员函数只能访问类的静态成员变量或调用其他静态成员函数。

在这里为什么使用 static？

(1)工具函数的设计
transformOutput 是一个工具函数，用于将三维嵌套的输出数组转换为二维数组。
它的功能与类的实例无关，因此设计为静态函数更合理。
这样可以直接通过类名调用，而不需要创建类的对象。

(2)提高代码的可读性和效率
将与类实例无关的函数声明为静态函数，可以明确表示该函数不依赖于类的状态。
静态函数的调用效率通常比非静态函数更高，因为它不需要隐式传递 this 指针。

第二个修改是对于getInputDims和getOutputDims函数的处理。在抽象接口类IEngine中，其实现方式是

virtual const std::vector<nvinfer1::Dims3> &getInputDims() const = 0;

在Engine类中，其声明变成了

    // 获取输入张量的维度[[nodiscard]] const std::vector<nvinfer1::Dims3> &getInputDims() const override { return m_inputDims; };

[[nodiscard]] 是 C++17 引入的一个属性，用于提示调用者不要忽略函数的返回值。
如果调用者忽略了带有 [[nodiscard]] 属性的函数的返回值，编译器会发出警告。

在这段代码中，getInputDims() 返回的是输入张量的维度信息。如果调用者忽略了这个返回值，可能会导致程序逻辑错误。
使用 [[nodiscard]] 可以提醒开发者注意返回值的重要性。

例如engine.getInputDims(); // 如果没有使用返回值，编译器会发出警告。

override 是 C++11 引入的关键字，用于显式声明一个函数是从基类继承并重写的虚函数。
如果函数没有正确地重写基类中的虚函数（例如函数签名不匹配），编译器会报错。

它可以防止由于函数签名错误而导致的意外行为。
在这段代码中，getInputDims() 是从基类 IEngine 中继承并重写的虚函数。

第一个 const: 表示返回的引用是常量，调用者不能修改返回的 std::vector。
第二个 const: 表示该成员函数不会修改类的成员变量，保证函数是只读的。

需要注意的是，在src/engine.h中，使用了模版类的实现，这有点要注意：


// 忽略之前的所有代码// 构造函数：初始化引擎选项
template <typename T>
Engine<T>::Engine(const Options &options) : m_options(options) {}// 析构函数：清除 GPU 缓冲区
template <typename T>
Engine<T>::~Engine() { clearGpuBuffers(); }// Include inline implementations
#include "engine/EngineRunInference.inl"
#include "engine/EngineUtilities.inl"
#include "engine/EngineBuildLoadNetwork.inl"

在文件的最后，才include了这几个头文件。

EngineRunInference.inl: 包含 runInference 函数的实现。

EngineUtilities.inl: 包含工具函数（如 transformOutput）的实现。

EngineBuildLoadNetwork.inl: 包含 buildLoadNetwork 和 loadNetwork 函数的实现。

将 #include 头文件放在文件末尾的原因是为了包含内联实现文件（.inl 文件），这种设计通常用于模板类或需要将实现与声明分离的情况下。

.inl 文件的作用
.inl 文件通常用于存放模板类或函数的实现。
在 C++ 中，模板类或模板函数的实现必须在编译时可见，因此不能像普通类那样将实现放在 .cpp 文件中。
为了保持代码的清晰和模块化，开发者会将模板类的实现从头文件中分离出来，放入 .inl 文件中。
为什么将 .inl 文件放在文件末尾？
2.1 避免重复定义
如果在头文件的开头包含 .inl 文件，可能会导致重复定义的问题，尤其是在头文件被多次包含时。
将 .inl 文件放在头文件末尾，可以确保模板类的声明先被处理，然后再处理实现部分。

2.2 确保依赖的声明已完成
.inl 文件中的实现可能依赖于头文件中声明的类或函数。
将 .inl 文件放在头文件末尾，可以确保所有必要的声明都已完成，避免编译错误。

2.3 提高代码的可读性
将 .inl 文件放在头文件末尾，可以将类的声明和实现逻辑分开，增强代码的可读性和可维护性。
开发者可以在头文件中快速查看类的接口，而不需要直接阅读实现细节。
3. 为什么不直接放在 .cpp 文件中？
模板类或模板函数的实现不能放在 .cpp 文件中，因为模板的具体类型是在编译时实例化的，而 .cpp 文件是在链接阶段处理的。
如果将模板实现放在 .cpp 文件中，编译器在实例化模板时将无法找到实现，导致链接错误。

三、tensorrt推理Engine的具体实现

3.1 buildLoadNetwork实现

buildLoadNetwork在include/engine/EngineBuildLoadNetwork.inl中实现。

template <typename T>
bool Engine<T>::buildLoadNetwork(std::string onnxModelPath, const std::array<float, 3> &subVals, const std::array<float, 3> &divVals,bool normalize) {const auto engineName = serializeEngineOptions(m_options, onnxModelPath);const auto engineDir = std::filesystem::path(m_options.engineFileDir);std::filesystem::path enginePath = engineDir / engineName;spdlog::info("Searching for engine file with name: {}", enginePath.string());if (Util::doesFileExist(enginePath)) {spdlog::info("Engine found, not regenerating...");} else {if (!Util::doesFileExist(onnxModelPath)) {auto msg = "Could not find ONNX model at path: " + onnxModelPath;spdlog::error(msg);throw std::runtime_error(msg);}spdlog::info("Engine not found, generating. This could take a while...");if (!std::filesystem::exists(engineDir)) {std::filesystem::create_directories(engineDir);spdlog::info("Created directory: {}", engineDir.string());}auto ret = build(onnxModelPath, subVals, divVals, normalize);if (!ret) {return false;}}return loadNetwork(enginePath, subVals, divVals, normalize);
}

其中，最重要的是build函数的实现。build的实现比较长，该函数的作用是从一个 ONNX 模型文件构建 TensorRT 引擎。
它会解析 ONNX 模型文件，设置优化配置（如动态批量大小、动态输入宽度、精度模式等），并最终生成一个序列化的 TensorRT 引擎文件。

其完整实现如下代码，后面会对其中重要代码展开解释，一并写在代码注释上。

3.2 build函数实现（最重要的函数）

template <typename T>
bool Engine<T>::build(std::string onnxModelPath, const std::array<float, 3> &subVals, const std::array<float, 3> &divVals, bool normalize) {// Create our engine builder.//创建 TensorRT 构建器和网络// nvinfer1::createInferBuilder:// 创建一个 TensorRT 构建器（IBuilder），用于构建和优化网络。// m_logger 是一个自定义的日志器，用于记录 TensorRT 的日志信息。auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(m_logger));if (!builder) {return false;}// Define an explicit batch size and then create the network (implicit batch// size is deprecated). More info here:// https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#explicit-implicit-batch// builder->createNetworkV2:// 创建一个网络定义（INetworkDefinition），用于描述神经网络的结构。// 使用 kEXPLICIT_BATCH 标志表示显式批量大小（TensorRT 7.0 之后推荐使用）。auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));if (!network) {return false;}// Create a parser for reading the onnx file.//  创建 ONNX 解析器并解析模型// nvonnxparser::createParser:// 创建一个 ONNX 解析器（IParser），用于将 ONNX 模型解析为 TensorRT 的网络定义。auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, m_logger));if (!parser) {return false;}// We are going to first read the onnx file into memory, then pass that buffer// to the parser. Had our onnx model file been encrypted, this approach would// allow us to first decrypt the buffer.// 读取 ONNX 文件:// 将 ONNX 模型文件读入内存缓冲区（buffer），然后传递给解析器。// 这种方式允许对模型文件进行预处理（如解密）。std::ifstream file(onnxModelPath, std::ios::binary | std::ios::ate);std::streamsize size = file.tellg();file.seekg(0, std::ios::beg);std::vector<char> buffer(size);if (!file.read(buffer.data(), size)) {auto msg = "Error, unable to read engine file";spdlog::error(msg);throw std::runtime_error(msg);}// parser->parse:// 解析 ONNX 模型文件，将其转换为 TensorRT 的网络定义。// ONNX 是一种常见的模型格式，TensorRT 提供了专门的解析器来支持 ONNX 模型。// 通过将文件读入内存，可以灵活处理加密或压缩的模型文件。// Parse the buffer we read into memory.auto parsed = parser->parse(buffer.data(), buffer.size());if (!parsed) {return false;}// Ensure that all the inputs have the same batch sizeconst auto numInputs = network->getNbInputs();if (numInputs < 1) {auto msg = "Error, model needs at least 1 input!";spdlog::error(msg);throw std::runtime_error(msg);}// 检查输入的批量大小// TensorRT 要求所有输入的批量大小必须一致。const auto input0Batch = network->getInput(0)->getDimensions().d[0];for (int32_t i = 1; i < numInputs; ++i) {if (network->getInput(i)->getDimensions().d[0] != input0Batch) {auto msg = "Error, the model has multiple inputs, each with differing batch sizes!";spdlog::error(msg);throw std::runtime_error(msg);}}// Check to see if the model supports dynamic batch size or notbool doesSupportDynamicBatch = false;if (input0Batch == -1) {doesSupportDynamicBatch = true;spdlog::info("Model supports dynamic batch size");} else {spdlog::info("Model only supports fixed batch size of {}", input0Batch);// If the model supports a fixed batch size, ensure that the maxBatchSize// and optBatchSize were set correctly.if (m_options.optBatchSize != input0Batch || m_options.maxBatchSize != input0Batch) {auto msg = "Error, model only supports a fixed batch size of " + std::to_string(input0Batch) +". Must set Options.optBatchSize and Options.maxBatchSize to 1";spdlog::error(msg);throw std::runtime_error(msg);}}const auto input3Batch = network->getInput(0)->getDimensions().d[3];bool doesSupportDynamicWidth = false;if (input3Batch == -1) {doesSupportDynamicWidth = true;spdlog::info("Model supports dynamic width. Using Options.maxInputWidth, Options.minInputWidth, and Options.optInputWidth to set the input width.");// Check that the values of maxInputWidth, minInputWidth, and optInputWidth are validif (m_options.maxInputWidth < m_options.minInputWidth || m_options.maxInputWidth < m_options.optInputWidth ||m_options.minInputWidth > m_options.optInputWidth|| m_options.maxInputWidth < 1 || m_options.minInputWidth < 1 || m_options.optInputWidth < 1) {auto msg = "Error, invalid values for Options.maxInputWidth, Options.minInputWidth, and Options.optInputWidth";spdlog::error(msg);throw std::runtime_error(msg);}}//设置优化配置auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());if (!config) {return false;}//设置优化配置，包括动态批量大小和动态输入宽度。// 使用 IOptimizationProfile 定义输入的最小、优化和最大维度。// 为什么要这么写？// TensorRT 允许为动态输入设置优化配置，以便在推理时根据实际输入调整性能。// 通过设置优化配置，可以在性能和灵活性之间取得平衡。// Register a single optimization profilenvinfer1::IOptimizationProfile *optProfile = builder->createOptimizationProfile();for (int32_t i = 0; i < numInputs; ++i) {// Must specify dimensions for all the inputs the model expects.const auto input = network->getInput(i);const auto inputName = input->getName();const auto inputDims = input->getDimensions();int32_t inputC = inputDims.d[1];int32_t inputH = inputDims.d[2];int32_t inputW = inputDims.d[3];int32_t minInputWidth = std::max(m_options.minInputWidth, inputW);// Specify the optimization profile`if (doesSupportDynamicBatch) {optProfile->setDimensions(inputName, nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, inputC, inputH, minInputWidth));} else {optProfile->setDimensions(inputName, nvinfer1::OptProfileSelector::kMIN,nvinfer1::Dims4(m_options.optBatchSize, inputC, inputH, minInputWidth));}if (doesSupportDynamicWidth) {optProfile->setDimensions(inputName, nvinfer1::OptProfileSelector::kOPT,nvinfer1::Dims4(m_options.optBatchSize, inputC, inputH, m_options.optInputWidth));optProfile->setDimensions(inputName, nvinfer1::OptProfileSelector::kMAX,nvinfer1::Dims4(m_options.maxBatchSize, inputC, inputH, m_options.maxInputWidth));} else {optProfile->setDimensions(inputName, nvinfer1::OptProfileSelector::kOPT,nvinfer1::Dims4(m_options.optBatchSize, inputC, inputH, inputW));optProfile->setDimensions(inputName, nvinfer1::OptProfileSelector::kMAX,nvinfer1::Dims4(m_options.maxBatchSize, inputC, inputH, inputW));}}config->addOptimizationProfile(optProfile);// Set the precision levelconst auto engineName = serializeEngineOptions(m_options, onnxModelPath);if (m_options.precision == Precision::FP16) {// Ensure the GPU supports FP16 inferenceif (!builder->platformHasFastFp16()) {auto msg = "Error: GPU does not support FP16 precision";spdlog::error(msg);throw std::runtime_error(msg);}config->setFlag(nvinfer1::BuilderFlag::kFP16);} else if (m_options.precision == Precision::INT8) {if (numInputs > 1) {auto msg = "Error, this implementation currently only supports INT8 ""quantization for single input models";spdlog::error(msg);throw std::runtime_error(msg);}// Ensure the GPU supports INT8 Quantizationif (!builder->platformHasFastInt8()) {auto msg = "Error: GPU does not support INT8 precision";spdlog::error(msg);throw std::runtime_error(msg);}// Ensure the user has provided path to calibration data directoryif (m_options.calibrationDataDirectoryPath.empty()) {auto msg = "Error: If INT8 precision is selected, must provide path to ""calibration data directory to Engine::build method";throw std::runtime_error(msg);}config->setFlag((nvinfer1::BuilderFlag::kINT8));const auto input = network->getInput(0);const auto inputName = input->getName();const auto inputDims = input->getDimensions();const auto calibrationFileName = engineName + ".calibration";m_calibrator = std::make_unique<Int8EntropyCalibrator2>(m_options.calibrationBatchSize, inputDims.d[3], inputDims.d[2],m_options.calibrationDataDirectoryPath, calibrationFileName, inputName,subVals, divVals, normalize);config->setInt8Calibrator(m_calibrator.get());}// CUDA stream used for profiling by the builder.cudaStream_t profileStream;Util::checkCudaErrorCode(cudaStreamCreate(&profileStream));config->setProfileStream(profileStream);// Build the engine// If this call fails, it is suggested to increase the logger verbosity to// kVERBOSE and try rebuilding the engine. Doing so will provide you with more// information on why exactly it is failing.// 构建引擎并保存到磁盘// 使用 builder->buildSerializedNetwork 构建序列化的 TensorRT 引擎。// 将引擎保存到磁盘，以便后续加载和使用。// 为什么要这么写？// 构建引擎是一个耗时操作，将引擎保存到磁盘可以避免重复构建。// 序列化的引擎文件可以直接加载到内存中，节省时间。std::unique_ptr<nvinfer1::IHostMemory> plan{builder->buildSerializedNetwork(*network, *config)};if (!plan) {return false;}// Write the engine to diskconst auto enginePath = std::filesystem::path(m_options.engineFileDir) / engineName;std::ofstream outfile(enginePath, std::ofstream::binary);outfile.write(reinterpret_cast<const char *>(plan->data()), plan->size());spdlog::info("Success, saved engine to {}", enginePath.string());Util::checkCudaErrorCode(cudaStreamDestroy(profileStream));return true;
}

函数的作用

解析 ONNX 模型文件。

设置优化配置（动态批量大小、动态输入宽度等）。

设置精度模式（FP16 或 INT8）。

构建 TensorRT 引擎并保存到磁盘。

涉及的 TensorRT API

nvinfer1::IBuilder: 用于构建和优化网络。

nvinfer1::INetworkDefinition: 描述神经网络的结构。

nvonnxparser::IParser: 解析 ONNX 模型文件。

nvinfer1::IBuilderConfig: 配置优化选项（如精度模式、优化配置）。

nvinfer1::IOptimizationProfile: 定义动态输入的最小、优化和最大维度。

builder->buildSerializedNetwork: 构建序列化的 TensorRT 引擎。

3.3 loadNetwork，load trt engine函数实现

在buildLoadNetwork函数中，是用build函数通过onnx构建完成trt engine后，紧接着就是loadNetwork的实现。

//省略其他代码auto ret = build(onnxModelPath, subVals, divVals, normalize);if (!ret) {return false;}}return loadNetwork(enginePath, subVals, divVals, normalize);
}

该函数的作用是从磁盘加载一个序列化的 TensorRT 引擎文件（.engine 文件），并将其反序列化为内存中的 TensorRT 引擎。
它还会初始化推理所需的上下文（IExecutionContext）和 GPU 缓冲区。

我们看一下其具体实现，然后一同将代码的具体解释写在注释上。

template <typename T>
bool Engine<T>::loadNetwork(std::string trtModelPath, const std::array<float, 3> &subVals, const std::array<float, 3> &divVals,bool normalize) {//保存归一化参数（subVals 和 divVals），用于后续对输入数据进行归一化处理。// normalize 参数决定是否对输入数据进行归一化。m_subVals = subVals;m_divVals = divVals;m_normalize = normalize;// Read the serialized model from diskif (!Util::doesFileExist(trtModelPath)) {auto msg = "Error, unable to read TensorRT model at path: " + trtModelPath;spdlog::error(msg);return false;} else {auto msg = "Loading TensorRT engine file at path: " + trtModelPath;spdlog::info(msg);}//读取引擎文件到内存//将序列化的 TensorRT 引擎文件读取到内存缓冲区（buffer）中。// 为什么要这么写？// TensorRT 的反序列化 API（deserializeCudaEngine）需要一个内存缓冲区作为输入。// 使用 std::ifstream 以二进制模式读取文件，确保文件内容不被修改。std::ifstream file(trtModelPath, std::ios::binary | std::ios::ate);std::streamsize size = file.tellg();file.seekg(0, std::ios::beg);std::vector<char> buffer(size);if (!file.read(buffer.data(), size)) {auto msg = "Error, unable to read engine file";spdlog::error(msg);throw std::runtime_error(msg);}// Create a runtime to deserialize the engine file.// 创建 TensorRT 运行时// 涉及的 TensorRT API// nvinfer1::createInferRuntime:// 创建一个 TensorRT 运行时（IRuntime）对象，用于反序列化引擎文件。// m_logger 是一个自定义日志器，用于记录 TensorRT 的日志信息。// 作用// 创建一个运行时对象，用于管理 TensorRT 引擎的反序列化和执行。// 为什么要这么写？// TensorRT 的引擎文件是序列化的，需要通过运行时对象将其反序列化为内存中的引擎。m_runtime = std::unique_ptr<nvinfer1::IRuntime>{nvinfer1::createInferRuntime(m_logger)};if (!m_runtime) {return false;}// Set the device indexauto ret = cudaSetDevice(m_options.deviceIndex);if (ret != 0) {int numGPUs;cudaGetDeviceCount(&numGPUs);auto errMsg = "Unable to set GPU device index to: " + std::to_string(m_options.deviceIndex) + ". Note, your device has " +std::to_string(numGPUs) + " CUDA-capable GPU(s).";spdlog::error(errMsg);throw std::runtime_error(errMsg);}// Create an engine, a representation of the optimized model.// 反序列化引擎// 涉及的 TensorRT API// IRuntime::deserializeCudaEngine:// 将序列化的引擎文件反序列化为内存中的 TensorRT 引擎（ICudaEngine）。// 作用// 将序列化的引擎文件加载到内存中，生成一个可用于推理的 TensorRT 引擎。// 为什么要这么写？// TensorRT 的引擎文件是序列化的，需要通过反序列化生成内存中的引擎对象。m_engine = std::unique_ptr<nvinfer1::ICudaEngine>(m_runtime->deserializeCudaEngine(buffer.data(), buffer.size()));if (!m_engine) {return false;}// The execution context contains all of the state associated with a// particular invocation// 创建执行上下文//     涉及的 TensorRT API// ICudaEngine::createExecutionContext:// 创建一个执行上下文（IExecutionContext），用于管理推理时的状态。// 作用// 创建一个执行上下文，用于管理推理时的输入输出绑定和其他状态。// 为什么要这么写？// TensorRT 的推理需要一个执行上下文来管理输入输出缓冲区和执行状态。m_context = std::unique_ptr<nvinfer1::IExecutionContext>(m_engine->createExecutionContext());if (!m_context) {return false;}// Storage for holding the input and output buffers// This will be passed to TensorRT for inference// 初始化 GPU 缓冲区//     清除之前的 GPU 缓冲区，并为当前引擎的输入输出张量分配新的缓冲区。// 为什么要这么写？// 每个 TensorRT 引擎的输入输出张量可能不同，因此需要重新分配缓冲区。clearGpuBuffers();m_buffers.resize(m_engine->getNbIOTensors());m_outputLengths.clear();m_inputDims.clear();m_outputDims.clear();m_IOTensorNames.clear();// Create a cuda streamcudaStream_t stream;Util::checkCudaErrorCode(cudaStreamCreate(&stream));// Allocate GPU memory for input and output buffers// 分配输入和输出缓冲区//     涉及的 TensorRT API// ICudaEngine::getNbIOTensors:// 获取引擎的输入输出张量数量。// ICudaEngine::getIOTensorName:// 获取张量的名称。// ICudaEngine::getTensorShape:// 获取张量的形状。// ICudaEngine::getTensorDataType:// 获取张量的数据类型。// 作用// 遍历引擎的所有输入输出张量，为输出张量分配 GPU 缓冲区。// 为什么要这么写？// TensorRT 的推理需要将输入输出张量绑定到 GPU 缓冲区。m_outputLengths.clear();for (int i = 0; i < m_engine->getNbIOTensors(); ++i) {const auto tensorName = m_engine->getIOTensorName(i);m_IOTensorNames.emplace_back(tensorName);const auto tensorType = m_engine->getTensorIOMode(tensorName);const auto tensorShape = m_engine->getTensorShape(tensorName);const auto tensorDataType = m_engine->getTensorDataType(tensorName);if (tensorType == nvinfer1::TensorIOMode::kINPUT) {// The implementation currently only supports inputs of type floatif (m_engine->getTensorDataType(tensorName) != nvinfer1::DataType::kFLOAT) {auto msg = "Error, the implementation currently only supports float inputs";spdlog::error(msg);throw std::runtime_error(msg);}// Don't need to allocate memory for inputs as we will be using the OpenCV// GpuMat buffer directly.// Store the input dims for later usem_inputDims.emplace_back(tensorShape.d[1], tensorShape.d[2], tensorShape.d[3]);m_inputBatchSize = tensorShape.d[0];} else if (tensorType == nvinfer1::TensorIOMode::kOUTPUT) {// Ensure the model output data type matches the template argument// specified by the userif (tensorDataType == nvinfer1::DataType::kFLOAT && !std::is_same<float, T>::value) {auto msg = "Error, the model has expected output of type float. Engine class template parameter must be adjusted.";spdlog::error(msg);throw std::runtime_error(msg);} else if (tensorDataType == nvinfer1::DataType::kHALF && !std::is_same<__half, T>::value) {auto msg = "Error, the model has expected output of type __half. Engine class template parameter must be adjusted.";spdlog::error(msg);throw std::runtime_error(msg);} else if (tensorDataType == nvinfer1::DataType::kINT8 && !std::is_same<int8_t, T>::value) {auto msg = "Error, the model has expected output of type int8_t. Engine class template parameter must be adjusted.";spdlog::error(msg);throw std::runtime_error(msg);} else if (tensorDataType == nvinfer1::DataType::kINT32 && !std::is_same<int32_t, T>::value) {auto msg = "Error, the model has expected output of type int32_t. Engine class template parameter must be adjusted.";spdlog::error(msg);throw std::runtime_error(msg);} else if (tensorDataType == nvinfer1::DataType::kBOOL && !std::is_same<bool, T>::value) {auto msg = "Error, the model has expected output of type bool. Engine class template parameter must be adjusted.";spdlog::error(msg);throw std::runtime_error(msg);} else if (tensorDataType == nvinfer1::DataType::kUINT8 && !std::is_same<uint8_t, T>::value) {auto msg = "Error, the model has expected output of type uint8_t. Engine class template parameter must be adjusted.";spdlog::error(msg);throw std::runtime_error(msg);} else if (tensorDataType == nvinfer1::DataType::kFP8) {auto msg = "Error, the model has expected output of type kFP8. This is not supported by the Engine class.";spdlog::error(msg);throw std::runtime_error(msg);}// The binding is an outputuint32_t outputLength = 1;m_outputDims.push_back(tensorShape);for (int j = 1; j < tensorShape.nbDims; ++j) {// We ignore j = 0 because that is the batch size, and we will take that// into account when sizing the bufferoutputLength *= tensorShape.d[j];}m_outputLengths.push_back(outputLength);// Now size the output buffer appropriately, taking into account the max// possible batch size (although we could actually end up using less// memory)Util::checkCudaErrorCode(cudaMallocAsync(&m_buffers[i], outputLength * m_options.maxBatchSize * sizeof(T), stream));} else {auto msg = "Error, IO Tensor is neither an input or output!";spdlog::error(msg);throw std::runtime_error(msg);}}// Synchronize and destroy the cuda streamUtil::checkCudaErrorCode(cudaStreamSynchronize(stream));Util::checkCudaErrorCode(cudaStreamDestroy(stream));return true;
}

总结
函数的作用
从磁盘加载序列化的 TensorRT 引擎文件。

初始化运行时、引擎、执行上下文和 GPU 缓冲区。

涉及的 TensorRT API

IRuntime::deserializeCudaEngine: 反序列化引擎文件。

ICudaEngine::createExecutionContext: 创建执行上下文。

ICudaEngine::getNbIOTensors: 获取输入输出张量数量。

ICudaEngine::getTensorShape: 获取张量的形状。

3.4 runInference具体实现

基于前面的build和load trt model，现在就到了runInference的环节。runInference的实现在include/engine/EngineRunInference.inl中。该函数的作用是运行 TensorRT 推理。
它接收输入数据（GpuMat 格式），对其进行预处理，运行推理，并将推理结果拷贝回 CPU。

还是看一下runInference的全部实现，然后把实现过程的解释写在注释上面。

#pragma once
#include <filesystem>
#include <spdlog/spdlog.h>
#include "util/Util.h"template <typename T>
bool Engine<T>::runInference(const std::vector<std::vector<cv::cuda::GpuMat>> &inputs,std::vector<std::vector<std::vector<T>>> &featureVectors) {// First we do some error checking//输入检查//     检查输入数据的合法性，包括：// 输入是否为空。// 输入数量是否与模型的输入张量数量匹配。// 批量大小是否超过模型的最大批量大小。// 如果模型有固定批量大小，检查输入的批量大小是否匹配。// 为什么要这么写？// 确保输入数据的合法性是运行推理的前提条件。// 如果输入数据不合法，推理结果将是无效的。if (inputs.empty() || inputs[0].empty()) {spdlog::error("Provided input vector is empty!");return false;}const auto numInputs = m_inputDims.size();if (inputs.size() != numInputs) {spdlog::error("Incorrect number of inputs provided!");return false;}// Ensure the batch size does not exceed the maxif (inputs[0].size() > static_cast<size_t>(m_options.maxBatchSize)) {spdlog::error("===== Error =====");spdlog::error("The batch size is larger than the model expects!");spdlog::error("Model max batch size: {}", m_options.maxBatchSize);spdlog::error("Batch size provided to call to runInference: {}", inputs[0].size());return false;}// Ensure that if the model has a fixed batch size that is greater than 1, the// input has the correct lengthif (m_inputBatchSize != -1 && inputs[0].size() != static_cast<size_t>(m_inputBatchSize)) {spdlog::error("===== Error =====");spdlog::error("The batch size is different from what the model expects!");spdlog::error("Model batch size: {}", m_inputBatchSize);spdlog::error("Batch size provided to call to runInference: {}", inputs[0].size());return false;}const auto batchSize = static_cast<int32_t>(inputs[0].size());// Make sure the same batch size was provided for all inputsfor (size_t i = 1; i < inputs.size(); ++i) {if (inputs[i].size() != static_cast<size_t>(batchSize)) {spdlog::error("===== Error =====");spdlog::error("The batch size is different for each input!");return false;}}// Create the cuda stream that will be used for inference//创建 CUDA 流//     涉及的 CUDA API// cudaStreamCreate:// 创建一个 CUDA 流，用于异步执行 CUDA 操作。// 作用// 创建一个 CUDA 流，用于管理推理过程中的异步操作（如数据拷贝和推理执行）。// 为什么要这么写？// 使用 CUDA 流可以提高性能，因为它允许多个 CUDA 操作同时执行。cudaStream_t inferenceCudaStream;Util::checkCudaErrorCode(cudaStreamCreate(&inferenceCudaStream));//输入预处理std::vector<cv::cuda::GpuMat> preprocessedInputs;// Preprocess all the inputs//     涉及的 TensorRT API// IExecutionContext::setInputShape:// 设置动态输入张量的形状（如批量大小）。// 作用// 检查输入数据的尺寸是否与模型的输入张量匹配。// 如果模型支持动态输入，设置输入张量的形状。// 调用 blobFromGpuMats 函数，将输入数据从 NHWC 格式转换为 NCHW 格式，并进行归一化。// 为什么要这么写？// TensorRT 的输入张量通常是 NCHW 格式，而 OpenCV 的 GpuMat 通常是 NHWC 格式，因此需要进行格式转换。// 如果模型支持动态输入，必须在推理前设置输入张量的形状。for (size_t i = 0; i < numInputs; ++i) {const auto &batchInput = inputs[i];const auto &dims = m_inputDims[i];auto &input = batchInput[0];if (input.channels() != dims.d[0] || input.rows != dims.d[1] || input.cols != dims.d[2]) {spdlog::error("===== Error =====");spdlog::error("Input does not have correct size!");spdlog::error("Expected: ({}, {}, {})", dims.d[0], dims.d[1], dims.d[2]);spdlog::error("Got: ({}, {}, {})", input.channels(), input.rows, input.cols);spdlog::error("Ensure you resize your input image to the correct size");return false;}nvinfer1::Dims4 inputDims = {batchSize, dims.d[0], dims.d[1], dims.d[2]};m_context->setInputShape(m_IOTensorNames[i].c_str(),inputDims); // Define the batch size// OpenCV reads images into memory in NHWC format, while TensorRT expects// images in NCHW format. The following method converts NHWC to NCHW. Even// though TensorRT expects NCHW at IO, during optimization, it can// internally use NHWC to optimize cuda kernels See:// https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#data-layout// Copy over the input data and perform the preprocessingauto mfloat = blobFromGpuMats(batchInput, m_subVals, m_divVals, m_normalize);preprocessedInputs.push_back(mfloat);m_buffers[i] = mfloat.ptr<void>();}// Ensure all dynamic bindings have been defined.if (!m_context->allInputDimensionsSpecified()) {auto msg = "Error, not all required dimensions specified.";spdlog::error(msg);throw std::runtime_error(msg);}// Set the address of the input and output buffers//设置输入和输出缓冲区地址//     涉及的 TensorRT API// IExecutionContext::setTensorAddress:// 设置输入和输出张量的内存地址。// 作用// 将输入和输出缓冲区的地址绑定到 TensorRT 的执行上下文。// 为什么要这么写？// TensorRT 的推理需要知道输入和输出张量的内存地址。for (size_t i = 0; i < m_buffers.size(); ++i) {bool status = m_context->setTensorAddress(m_IOTensorNames[i].c_str(), m_buffers[i]);if (!status) {return false;}}// Run inference.// 运行推理//     涉及的 TensorRT API// IExecutionContext::enqueueV3:// 在指定的 CUDA 流中异步运行推理。// 作用// 在 GPU 上运行推理。// 为什么要这么写？// 使用异步推理可以提高性能，因为它允许推理和其他操作（如数据拷贝）同时执行。bool status = m_context->enqueueV3(inferenceCudaStream);if (!status) {return false;}// Copy the outputs back to CPUfeatureVectors.clear();//拷贝输出数据到 CPU//     涉及的 CUDA API// cudaMemcpyAsync:// 异步拷贝数据，从 GPU 内存到 CPU 内存。// 作用// 将推理结果从 GPU 内存拷贝到 CPU 内存。// 为什么要这么写？// 推理结果通常需要在 CPU 上进行后处理，因此需要将数据从 GPU 内存拷贝到 CPU 内存。for (int batch = 0; batch < batchSize; ++batch) {// Batchstd::vector<std::vector<T>> batchOutputs{};for (int32_t outputBinding = numInputs; outputBinding < m_engine->getNbIOTensors(); ++outputBinding) {// We start at index m_inputDims.size() to account for the inputs in our// m_buffersstd::vector<T> output;auto outputLength = m_outputLengths[outputBinding - numInputs];output.resize(outputLength);// Copy the outputUtil::checkCudaErrorCode(cudaMemcpyAsync(output.data(),static_cast<char *>(m_buffers[outputBinding]) + (batch * sizeof(T) * outputLength),outputLength * sizeof(T), cudaMemcpyDeviceToHost, inferenceCudaStream));batchOutputs.emplace_back(std::move(output));}featureVectors.emplace_back(std::move(batchOutputs));}// Synchronize the cuda stream//同步和销毁 CUDA 流//     涉及的 CUDA API// cudaStreamSynchronize:// 等待 CUDA 流中的所有操作完成。// cudaStreamDestroy:// 销毁 CUDA 流，释放资源。// 作用// 确保所有异步操作完成，并释放 CUDA 流资源。// 为什么要这么写？// 在销毁 CUDA 流之前，必须确保所有操作完成。// 销毁 CUDA 流可以释放 GPU 资源。Util::checkCudaErrorCode(cudaStreamSynchronize(inferenceCudaStream));Util::checkCudaErrorCode(cudaStreamDestroy(inferenceCudaStream));return true;
}

总结

函数的作用

检查输入数据的合法性。

对输入数据进行预处理。

设置输入和输出缓冲区地址。

在 GPU 上运行推理。

将推理结果拷贝到 CPU。

涉及的 TensorRT API

IExecutionContext::setInputShape: 设置动态输入张量的形状。

IExecutionContext::setTensorAddress: 设置输入和输出张量的内存地址。

IExecutionContext::enqueueV3: 异步运行推理。

cudaMemcpyAsync: 异步拷贝数据。

为什么要这么写？

遵循 TensorRT 的最佳实践，确保推理过程的高效性和正确性。

使用 CUDA 流和异步操作提高性能。

对输入数据进行严格检查，确保推理结果的可靠性。

在runInference的实现中，还有两个点值得看一下：

第一，blobFromGpuMats的实现。此函数的实现在include/engine/EngineUtilities.inl中。该函数的作用是将一批输入图像（GpuMat 格式）转换为 TensorRT 所需的输入张量格式。
它完成以下任务：
将输入图像从 HWC 格式（Height-Width-Channel）转换为 CHW 格式（Channel-Height-Width）。
对图像进行归一化（如果需要）。
应用减均值（subVals）和除以标准差（divVals）的操作。
返回预处理后的 GPU 张量。

template <typename T>
cv::cuda::GpuMat Engine<T>::blobFromGpuMats(const std::vector<cv::cuda::GpuMat> &batchInput, const std::array<float, 3> &subVals,const std::array<float, 3> &divVals, bool normalize, bool swapRB) {CHECK(!batchInput.empty())CHECK(batchInput[0].channels() == 3)//创建目标 GPU 缓冲区//     创建一个目标 GPU 缓冲区（gpu_dst），用于存储批量图像的预处理结果。// 缓冲区大小为：高度 × 宽度 × 批量大小 × 通道数// 其中，CV_8UC3 表示每个像素有 3 个通道，每个通道占用 8 位（无符号整数）。// 为什么要这么写？// TensorRT 的输入张量通常是批量数据，因此需要将所有图像的预处理结果存储在一个连续的缓冲区中。cv::cuda::GpuMat gpu_dst(1, batchInput[0].rows * batchInput[0].cols * batchInput.size(), CV_8UC3);//图像通道分离（HWC -> CHW）//     将每张图像从 HWC 格式转换为 CHW 格式。// 如果 swapRB 为 true，交换 R 和 B 通道（从 BGR 转换为 RGB）。// 关键点// cv::cuda::split:// 将输入图像的通道分离，并存储到 input_channels 中。// 例如，HWC 格式的图像 [H, W, C] 会被分离为 3 个单通道图像 [H, W]。// 通道存储位置:// 每个通道的数据存储在 gpu_dst 的不同位置：// 第 1 个通道：gpu_dst.ptr()[0 + width * 3 * img]// 第 2 个通道：gpu_dst.ptr()[width + width * 3 * img]// 第 3 个通道：gpu_dst.ptr()[width * 2 + width * 3 * img]// 为什么要这么写？// TensorRT 的输入张量通常是 CHW 格式，而 OpenCV 的 GpuMat 默认是 HWC 格式，因此需要进行格式转换。// swapRB 参数允许用户灵活地处理 RGB 和 BGR 格式的图像。size_t width = batchInput[0].cols * batchInput[0].rows;if (swapRB) {for (size_t img = 0; img < batchInput.size(); ++img) {std::vector<cv::cuda::GpuMat> input_channels{cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[width * 2 + width * 3 * img])),cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[width + width * 3 * img])),cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[0 + width * 3 * img]))};cv::cuda::split(batchInput[img], input_channels); // HWC -> CHW}} else {for (size_t img = 0; img < batchInput.size(); ++img) {std::vector<cv::cuda::GpuMat> input_channels{cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[0 + width * 3 * img])),cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[width + width * 3 * img])),cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[width * 2 + width * 3 * img]))};cv::cuda::split(batchInput[img], input_channels); // HWC -> CHW}}cv::cuda::GpuMat mfloat;//数据归一化//     作用// 将图像数据从 uint8 转换为 float32。// 如果 normalize 为 true，将像素值归一化到 [0, 1] 范围。// 关键点// cv::cuda::GpuMat::convertTo:// 将图像数据从一种数据类型转换为另一种数据类型。// 例如，将 CV_8UC3 转换为 CV_32FC3。// 为什么要这么写？// TensorRT 的输入张量通常是 float32 类型，因此需要将图像数据从 uint8 转换为 float32。// 归一化可以提高模型的数值稳定性。if (normalize) {// [0.f, 1.f]gpu_dst.convertTo(mfloat, CV_32FC3, 1.f / 255.f);} else {// [0.f, 255.f]gpu_dst.convertTo(mfloat, CV_32FC3);}// Apply scaling and mean subtraction//减均值和除以标准差//     作用// 对图像数据进行减均值和除以标准差的操作。// 公式：mfloat = (mfloat - subVals) / divVals// 关键点// cv::cuda::subtract:// 对每个像素值减去对应通道的均值（subVals）。// cv::cuda::divide:// 对每个像素值除以对应通道的标准差（divVals）。// 为什么要这么写？// 减均值和除以标准差是常见的图像预处理步骤，可以使输入数据的分布更适合模型的训练分布。// 例如，subVals 和 divVals 通常是根据训练数据集计算的均值和标准差。cv::cuda::subtract(mfloat, cv::Scalar(subVals[0], subVals[1], subVals[2]), mfloat, cv::noArray(), -1);cv::cuda::divide(mfloat, cv::Scalar(divVals[0], divVals[1], divVals[2]), mfloat, 1, -1);return mfloat;
}

函数的作用

将一批输入图像从 GpuMat 格式转换为 TensorRT 所需的输入张量格式。

包括以下步骤：

检查输入合法性。

将图像从 HWC 格式转换为 CHW 格式。

数据归一化。

减均值和除以标准差。

涉及的 OpenCV CUDA API

cv::cuda::GpuMat:
表示存储在 GPU 上的矩阵。

cv::cuda::split:
将多通道图像分离为单通道图像。

cv::cuda::GpuMat::convertTo:
转换图像数据类型。

cv::cuda::subtract:
对每个像素值减去标量。

cv::cuda::divide:
对每个像素值除以标量。

为什么要这么写？
符合 TensorRT 的输入要求（CHW 格式、float32 类型）。

避免数据拷贝，直接在 GPU 上完成所有预处理操作，提高性能。

提供灵活性（支持 RGB/BGR 通道交换、归一化、减均值和除以标准差）。

这里再重点解释下HWC->CHW的代码转换逻辑。

for (size_t img = 0; img < batchInput.size(); ++img) {std::vector<cv::cuda::GpuMat> input_channels{cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[0 + width * 3 * img])),cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[width + width * 3 * img])),cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[width * 2 + width * 3 * img]))};cv::cuda::split(batchInput[img], input_channels); // HWC -> CHW
}

这段代码的核心功能是将输入的图像数据从 HWC 格式（Height, Width, Channels）转换为 CHW 格式（Channels, Height, Width）。这种转换通常用于深度学习模型的输入预处理，因为许多模型（例如 TensorRT）要求输入数据以 CHW 格式排列。

输入数据格式：
batchInput 是一个包含多张图像的 std::vectorcv::cuda::GpuMat，每张图像的格式为 HWC（即高度、宽度和通道）。
每张图像有 3 个通道（通常是 RGB 或 BGR），每个像素由 3 个值表示。
目标数据格式：
gpu_dst 是一个 GPU 上的 cv::cuda::GpuMat，用于存储转换后的数据，格式为 CHW。
它的大小为 (1, rows * cols * batch_size)，其中 rows 和 cols 是单张图像的高度和宽度，batch_size 是图像的数量。
计算偏移量：

width 表示单张图像的像素总数（cols * rows）。
width * 3 * img 是当前图像在 gpu_dst 中的起始位置。
0, width, 和 width * 2 分别对应通道 0（R）、通道 1（G）、通道 2（B）的偏移量。
创建通道视图：

cv::cuda::GpuMat 对象的构造函数调用

cv::cuda::GpuMat(batchInput[0].rows, batchInput[0].cols, CV_8U, &(gpu_dst.ptr()[0 + width * 3 * img])),

这行代码是一个 cv::cuda::GpuMat 对象的构造函数调用，用于在 GPU 上创建一个矩阵（GpuMat 是 OpenCV 的 CUDA 模块中的类，用于处理 GPU 上的矩阵数据）。

batchInput[0].rows:
表示矩阵的行数（高度），这里取的是 batchInput[0] 的行数。
假设 batchInput 是一个包含多张图像的向量，每张图像是一个 cv::cuda::GpuMat。

batchInput[0].cols:
表示矩阵的列数（宽度），这里取的是 batchInput[0] 的列数。

CV_8U：
表示矩阵的类型，CV_8U 是 OpenCV 中的一个常量，表示每个像素是 8 位无符号整数（单通道）。
在这里，虽然图像是 3 通道（RGB 或 BGR），但每个通道的像素值仍然是 8 位无符号整数。

&(gpu_dst.ptr()[0 + width * 3 * img]):
表示矩阵的起始地址，指向 GPU 内存中的某个位置。
gpu_dst.ptr() 返回 gpu_dst 矩阵的首地址（指针）。
偏移量 0 + width * 3 * img 用于定位到 gpu_dst 中当前图像的起始位置。
width 是单张图像的像素总数（cols * rows）。
3 是通道数（RGB）。
img 是当前图像在批次中的索引。

这段代码的作用是：
在 GPU 上创建一个矩阵（GpuMat）。
该矩阵的大小为 batchInput[0].rows x batchInput[0].cols。
数据类型为 CV_8U（8 位无符号整数）。
数据存储在 gpu_dst 的某个偏移位置。

cv::cuda::GpuMat 的构造函数允许通过指针访问 GPU 内存。
&(gpu_dst.ptr()[…]) 指定了每个通道在 gpu_dst 中的起始位置。
input_channels 是一个包含 3 个 GpuMat 的向量，每个 GpuMat 对应一个通道。
通道分离：

cv::cuda::split(batchInput[img], input_channels) 将输入图像的通道分离，并将每个通道的数据写入 input_channels 中。
由于 input_channels 指向 gpu_dst 的不同位置，分离后的数据直接存储在 gpu_dst 中，完成了 HWC -> CHW 的转换。
关键点
HWC 格式
数据在内存中的排列方式是按行优先的，每行包含所有通道的数据。例如：[R1, G1, B1, R2, G2, B2, …, Rn, Gn, Bn]

其中 R1, G1, B1 是第一个像素的 RGB 值。
CHW 格式：

数据在内存中的排列方式是按通道优先的，每个通道的数据连续存储。例如：[R1, R2, …, Rn, G1, G2, …, Gn, B1, B2, …, Bn]

GPU 内存布局：

gpu_dst 是一个线性内存块，分为多个部分，每部分存储一个通道的数据。
通过计算偏移量，将每个通道的数据写入正确的位置。

3.5 resizeKeepAspectRatioPadRightBottom实现

这段代码实现了一个函数 resizeKeepAspectRatioPadRightBottom，用于对输入的 GPU 图像（cv::cuda::GpuMat）进行缩放和填充操作，以保持图像的宽高比，同时将图像调整到指定的目标尺寸。填充部分使用指定的背景颜色（bgcolor）。

template <typename T>
cv::cuda::GpuMat Engine<T>::resizeKeepAspectRatioPadRightBottom(const cv::cuda::GpuMat &input, size_t height, size_t width,const cv::Scalar &bgcolor) {float r = std::min(width / (input.cols * 1.0), height / (input.rows * 1.0));int unpad_w = r * input.cols;int unpad_h = r * input.rows;cv::cuda::GpuMat re(unpad_h, unpad_w, CV_8UC3);cv::cuda::resize(input, re, re.size());//创建目标图像：// 创建一个新的 cv::cuda::GpuMat 对象 out，大小为目标尺寸 height x width，类型为 CV_8UC3。// 使用 bgcolor 初始化图像的所有像素值，作为填充部分的背景颜色。cv::cuda::GpuMat out(height, width, CV_8UC3, bgcolor);//     将缩放后的图像复制到目标图像：// 使用 cv::Rect(0, 0, re.cols, re.rows) 定义一个矩形区域，表示目标图像的左上角区域，其大小与缩放后的图像一致。// 使用 copyTo 将缩放后的图像 re 复制到目标图像 out 的该区域。re.copyTo(out(cv::Rect(0, 0, re.cols, re.rows)));return out;
}

四、main.cpp串联各个部分

我们来看main.cpp中是怎样将tensorrt api串联在一起的。

#include "cmd_line_parser.h"
#include "logger.h"
#include "engine.h"
#include <chrono>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/opencv.hpp>int main(int argc, char *argv[]) {CommandLineArguments arguments;// 从环境变量中获取日志级别std::string logLevelStr = getLogLevelFromEnvironment();spdlog::level::level_enum logLevel = toSpdlogLevel(logLevelStr);spdlog::set_level(logLevel);// 解析命令行参数if (!parseArguments(argc, argv, arguments)) {return -1;}// 一、配置好构建tensorrt部署model和运行时的Options// 指定 GPU 推理的配置选项Options options;// 指定推理使用的精度// FP16 的速度大约是 FP32 的两倍options.precision = Precision::FP16;// 如果使用 INT8 精度，必须指定校准数据目录路径options.calibrationDataDirectoryPath = "";// 指定优化的批量大小options.optBatchSize = 1;// 指定计划运行的最大批量大小options.maxBatchSize = 1;// 指定保存模型引擎文件的目录options.engineFileDir = ".";// 创建推理引擎// 二、初始化EngineEngine<float> engine(options);// 定义预处理参数// 默认的 Engine::build 方法会将值归一化到 [0.f, 1.f]// 如果将 normalize 标志设置为 false，则值保持在 [0.f, 255.f]std::array<float, 3> subVals{0.f, 0.f, 0.f}; // 减去的均值std::array<float, 3> divVals{1.f, 1.f, 1.f}; // 除以的标准差bool normalize = true; // 是否归一化// 如果模型需要将值归一化到 [-1.f, 1.f]，可以使用以下参数：// subVals = {0.5f, 0.5f, 0.5f};// divVals = {0.5f, 0.5f, 0.5f};// normalize = true;if (!arguments.onnxModelPath.empty()) {// 构建 ONNX 模型为 TensorRT 引擎文件，并加载到内存中//三、构建tensorrt model// 回忆一下，这里面涉及了从onnx->trt,以及加载trt模型、优化trt model、创建上下文、创建cuda stream、给输入输出绑定地址bool succ = engine.buildLoadNetwork(arguments.onnxModelPath, subVals, divVals, normalize);if (!succ) {throw std::runtime_error("无法构建或加载 TensorRT 引擎。");}} else {// 直接加载 TensorRT 引擎文件bool succ = engine.loadNetwork(arguments.trtModelPath, subVals, divVals, normalize);if (!succ) {const std::string msg = "无法加载 TensorRT 引擎。";spdlog::error(msg);throw std::runtime_error(msg);}}// 读取输入图像const std::string inputImage = "../inputs/team.jpg";auto cpuImg = cv::imread(inputImage);if (cpuImg.empty()) {const std::string msg = "无法读取图像路径：" + inputImage;spdlog::error(msg);throw std::runtime_error(msg);}// 将图像上传到 GPU 内存cv::cuda::GpuMat img;img.upload(cpuImg);// 模型需要 RGB 格式的输入cv::cuda::cvtColor(img, img, cv::COLOR_BGR2RGB);// 填充输入向量以供推理使用const auto &inputDims = engine.getInputDims();std::vector<std::vector<cv::cuda::GpuMat>> inputs;// 使用与 Options.optBatchSize 匹配的批量大小size_t batchSize = options.optBatchSize;// 为演示目的，将同一张图像填充到所有输入中for (const auto &inputDim : inputDims) { // 遍历模型的每个输入std::vector<cv::cuda::GpuMat> input;for (size_t j = 0; j < batchSize; ++j) { // 为每个批量元素填充数据// 使用保持宽高比的方式调整图像大小//四、调整图像大小auto resized = Engine<float>::resizeKeepAspectRatioPadRightBottom(img, inputDim.d[1], inputDim.d[2]);input.emplace_back(std::move(resized));}inputs.emplace_back(std::move(input));}// 在正式推理前对网络进行预热spdlog::info("正在预热网络...");std::vector<std::vector<std::vector<float>>> featureVectors;for (int i = 0; i < 100; ++i) {//五、运行推理//回忆一下，这个函数涉及了对输入做预处理、HWC->CHW、调用tensorrt的推理API、把输出转移到RAMbool succ = engine.runInference(inputs, featureVectors);if (!succ) {const std::string msg = "无法运行推理。";spdlog::error(msg);throw std::runtime_error(msg);}}// 基准测试推理时间size_t numIterations = 1000;spdlog::info("正在运行基准测试（{} 次迭代）...", numIterations);preciseStopwatch stopwatch;for (size_t i = 0; i < numIterations; ++i) {featureVectors.clear();engine.runInference(inputs, featureVectors);}auto totalElapsedTimeMs = stopwatch.elapsedTime<float, std::chrono::milliseconds>();auto avgElapsedTimeMs = totalElapsedTimeMs / numIterations / static_cast<float>(inputs[0].size());spdlog::info("基准测试完成！");spdlog::info("======================");spdlog::info("每个样本的平均时间：{} ms", avgElapsedTimeMs);spdlog::info("批量大小：{}", inputs[0].size());spdlog::info("平均 FPS：{} fps", static_cast<int>(1000 / avgElapsedTimeMs));spdlog::info("======================\n");// 打印特征向量for (size_t batch = 0; batch < featureVectors.size(); ++batch) {for (size_t outputNum = 0; outputNum < featureVectors[batch].size(); ++outputNum) {spdlog::info("批次 {}, 输出 {}", batch, outputNum);std::string output;int i = 0;for (const auto &e : featureVectors[batch][outputNum]) {output += std::to_string(e) + " ";if (++i == 10) {output += "...";break;}}spdlog::info("{}", output);}}// 如果模型需要后处理（例如将特征向量转换为边界框），可以在这里实现return 0;
}

五、总结：

使用 TensorRT 的 C++ API 在 Jetson Orin Nano（或其他平台）上部署模型，大体可分为以下五个主要阶段：

1. 配置推理选项

在开始之前，先构造一个 Options 结构体，指定：

精度模式：FP32、FP16 或 INT8
校准数据路径（仅 INT8 时需要）
优化批量大小 (optBatchSize) 和最大批量大小 (maxBatchSize)
设备索引（多 GPU 时指定）
Engine 文件保存目录 等

Options options;
options.precision = Precision::FP16;
options.optBatchSize = 1;
options.maxBatchSize = 1;
options.engineFileDir = ".";

2. 构造并（或）加载 TensorRT 引擎

构建并缓存引擎
```
Engine<float> engine(options);
engine.buildLoadNetwork("model.onnx");
```
- 解析 ONNX：nvonnxparser::IParser 读入到 INetworkDefinition
- 设置动态输入（Batch/Width）优化配置：IOptimizationProfile
- 选择精度：kFP16 或 kINT8 + 校准器
- 调用 builder->buildSerializedNetwork(...) 生成 .engine 文件
加载已有引擎
```
engine.loadNetwork("model.engine");
```
- 反序列化：IRuntime::deserializeCudaEngine
- 创建执行上下文：ICudaEngine::createExecutionContext
- 分配 GPU 缓冲区（输入/输出）

3. 输入预处理（Blob 构造）

在 runInference 调用前，将 OpenCV 的 cv::cuda::GpuMat 图像：

HWC → CHW：利用 cv::cuda::split 并按通道布局到连续 GPU 内存
归一化 & 减均值/除以标准差：convertTo + subtract + divide

直接将预处理后的 GPU 指针绑定到 TensorRT 输入缓冲：

blob = Engine<float>::blobFromGpuMats(batchMats, subVals, divVals, normalize);
context->setTensorAddress(inputName, blob.ptr<void>());

4. 执行推理

// 设置动态形状（如果支持）
context->setInputShape(inputName, Dims4(batchSize, C, H, W));
// 绑定所有 IO 缓冲区
for each tensorName:context->setTensorAddress(tensorName, bufferPtr);
// 异步执行
context->enqueueV3(cudaStream);

使用 CUDA 流并行化 DMA 与计算
通过 enqueueV3 在 GPU 上异步运行

5. 输出后处理与结果回传

拷贝输出：cudaMemcpyAsync 将每个输出缓冲区中的 int/float 数据拷贝回主机内存
格式调整：如果批量大小为 1，可使用 Engine::transformOutput 辅助将三维嵌套向量平铺为二维或一维输出
同步销毁：cudaStreamSynchronize + cudaStreamDestroy

以上流程中，核心类是模板 Engine<T>（继承自 IEngine<T>），向下封装了：

buildLoadNetwork：构建/加载 .engine
loadNetwork：反序列化+缓冲区初始化
runInference：输入预处理 → 推理 → 输出拷贝

这样就完成了一个完整的 C++ + TensorRT 的端到端部署流程。

一、首先是声明我们创建tensorrt model的参数。

二、构建tensorrt推理的engine

三、tensorrt推理Engine的具体实现

3.1 buildLoadNetwork实现

3.2 build函数实现（最重要的函数）

3.3 loadNetwork，load trt engine函数实现

3.4 runInference具体实现

3.5 resizeKeepAspectRatioPadRightBottom实现

四、main.cpp串联各个部分

五、总结：

1. 配置推理选项

2. 构造并（或）加载 TensorRT 引擎

3. 输入预处理（Blob 构造）

4. 执行推理

5. 输出后处理与结果回传

相关文章：