
TensorRT inference of an ONNX model (Part 2)

This article was originally written by Jin Tian. Re-posting is welcome; it first appeared at https://jinfagang.github.io, but please keep this copyright info. Thanks. Any questions can be sent via WeChat: jintianiloveu

Continuing the exploration from the previous post: the API used there is already outdated, and the latest TensorRT 5.1 API has changed completely.

We again use mobilenet as the example. What we actually do is convert mobilenet to trt, i.e. serialize it into an engine ahead of time, then simply load that engine and run inference. It is quite simple and direct:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import cv2
import sys
import os
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(engine_f):
    # Deserialize a previously serialized engine file (.trt) from disk.
    with open(engine_f, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

    
def init(engine_f):
    engine = build_engine(engine_f)
    # Binding 0 is the model input and binding 1 the output; print their shapes as a sanity check.
    print(engine.get_binding_shape(0))
    print(engine.get_binding_shape(1))
    # 1. Allocate some host and device buffers for inputs and outputs:
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    context = engine.create_execution_context()
    return context, h_input, h_output, stream, d_input, d_output


def load_input_data(img_f, pagelocked_buffer, target_size=(224, 224)):
    img = cv2.imread(img_f)
    img = cv2.resize(img, target_size, interpolation=cv2.INTER_CUBIC)
    # OpenCV loads HWC/BGR uint8, while the onnx mobilenet expects CHW/RGB float32,
    # scaled to [0, 1] and normalized (standard ImageNet statistics are assumed here;
    # adjust the preprocessing to whatever your model was actually trained with).
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    img = (img - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
    img = img.transpose((2, 0, 1))
    # Flatten into a 1D array and copy into pagelocked host memory.
    np.copyto(pagelocked_buffer, img.ravel())


def predict2():
    global context, h_input, h_output, stream, d_input, d_output
    # Preprocess the image given on the command line into the pagelocked input buffer.
    load_input_data(sys.argv[1], h_input, target_size=(224, 224))
    # Copy the input to the device, run inference, and copy the output back, all on one stream.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()
    # The argmax of the output vector is the predicted class index.
    print(np.argmax(h_output))

if __name__ == "__main__":
    context, h_input, h_output, stream, d_input, d_output = init('mobilenetv2-1.0/mobilenetv2-1.0.trt')
    predict2()
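If the script above is saved as, say, trt_infer.py (the name is just a placeholder), it is run with the path of a test image as its first argument, e.g. python trt_infer.py cat.jpg.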

The code above is a complete inference script, and many of its pieces can be reused elsewhere.

This is only a single simple pass, and one run shows no obvious speed advantage, so timing measurements are skipped here. To summarize, accelerating inference with TensorRT breaks down into the following steps:

  • First build the engine. There are two ways: convert the onnx model to trt ahead of time and simply load and deserialize the trt file to get the engine, or build the engine at runtime with the onnx parser. The second is essentially less efficient at startup, because it repeats a step that could have been done once in advance (a build-and-serialize sketch follows this list);
  • Allocate CUDA memory according to the engine's input and output shapes; in the code, device (GPU) buffers are prefixed d_ and host buffers are prefixed h_;
  • Once you have the execution context returned by the engine, running inference is simple: just pass in the bindings.
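For the first point, here is a minimal sketch of the offline build-and-serialize path using the onnx parser of the TensorRT 5.x Python API. The onnx file name and the builder settings (max_batch_size, max_workspace_size) are assumptions for illustration, not something fixed by this post:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_and_save_engine(onnx_f, engine_f):
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network() as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_batch_size = 1
        builder.max_workspace_size = 1 << 30  # scratch space TensorRT may use while optimizing
        # Parse the onnx model into a TensorRT network definition.
        with open(onnx_f, 'rb') as f:
            if not parser.parse(f.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                raise RuntimeError('failed to parse %s' % onnx_f)
        # Build the optimized engine and serialize it so later runs only need to deserialize it.
        engine = builder.build_cuda_engine(network)
        with open(engine_f, 'wb') as f:
            f.write(engine.serialize())

build_and_save_engine('mobilenetv2-1.0/mobilenetv2-1.0.onnx', 'mobilenetv2-1.0/mobilenetv2-1.0.trt')

After this has been run once, the inference script only needs build_engine() to deserialize the .trt file, which is why the first path starts up faster.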

That is roughly what TensorRT boils down to, but there are still some finer details to understand, for example:

  • What if the model has one input and three outputs? (see the sketch after this list)
  • How should the network's raw output be post-processed?
  • How is the same thing done in C++?
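For the first of these questions, a common pattern (borrowed from the style of NVIDIA's Python samples rather than from this post) is to stop hard-coding one input and one output buffer and instead enumerate every binding the engine exposes; allocate_buffers below is a rough sketch of that idea:

def allocate_buffers(engine):
    # Works for any number of inputs and outputs: every binding gets a host/device buffer pair.
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for i in range(engine.num_bindings):
        size = trt.volume(engine.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(i):
            inputs.append((host_mem, device_mem))
        else:
            outputs.append((host_mem, device_mem))
    return inputs, outputs, bindings, stream

Inference then copies every input host buffer to the device, calls execute_async with the full bindings list, and copies every output buffer back before synchronizing, exactly as predict2() does for the single-output case.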