Following in Kaiming He's footsteps: a full walkthrough of reproducing the SlowFast action recognition model

Authors: 付辉辉, 周钰臣
Editor: 极市平台

Preface

In recent years there has been more and more research on deep-learning-based human action recognition. The SlowFast model, with its slow and fast two-pathway network, performs extremely well on action recognition datasets. This article covers data preparation for SlowFast, how to train it, and how to run SlowFast inference with ONNX. It then focuses on SlowFast inference with TensorRT, person tracking with YOLOv5 and DeepSORT, and deployment in C++.

1. Data preparation

1.1 Clipping the videos
Prepare several groups of video data. IN_DATA_DIR is the directory holding the original videos and OUT_DATA_DIR is the directory for the clipped videos. This step ensures that all videos end up with the same length.

IN_DATA_DIR="/project/train/src_repo/data/video"
OUT_DATA_DIR="/project/train/src_repo/data/splitvideo"
str="_"

if [[ ! -d "${OUT_DATA_DIR}" ]]; then
  echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
  mkdir -p ${OUT_DATA_DIR}
fi

for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
do
  for i in {0..10}
  do
    index=$(expr $i \* 10)
    out_name="${OUT_DATA_DIR}/${i}${str}${video##*/}"
    if [ ! -f "${out_name}" ]; then
      ffmpeg -ss ${index} -t 80 -i "${video}" "${out_name}"
    fi
  done
done

1.2 Extracting keyframes
Keyframes are extracted at one frame per second of video. IN_DATA_DIR is the directory of videos produced in step 1.1, and OUT_DATA_DIR is where the extracted keyframes are stored.

# Cut into images, 1 frame per second
IN_DATA_DIR="/project/train/src_repo/data/splitvideo/"
OUT_DATA_DIR="/project/train/src_repo/data/splitimages/"

if [[ ! -d "${OUT_DATA_DIR}" ]]; then
  echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
  mkdir -p ${OUT_DATA_DIR}
fi

for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
do
  video_name=${video##*/}

  if [[ $video_name = *".webm" ]]; then
    video_name=${video_name::-5}
  else
    video_name=${video_name::-4}
  fi

  out_video_dir=${OUT_DATA_DIR}/${video_name}/
  mkdir -p "${out_video_dir}"

  out_name="${out_video_dir}/${video_name}_%06d.jpg"

  ffmpeg -i "${video}" -r 1 -q:v 1 "${out_name}"
done

1.3 Splitting the videos
The videos produced in step 1 are split into frames with ffmpeg at 30 frames per second. IN_DATA_DIR is the directory containing the videos, OUT_DATA_DIR the directory for the results.

IN_DATA_DIR="/project/train/src_repo/video"
OUT_DATA_DIR="/project/train/src_repo/spiltvideo"

if [[ ! -d "${OUT_DATA_DIR}" ]]; then
  echo "${OUT_DATA_DIR} doesn't exist. Creating it.";
  mkdir -p ${OUT_DATA_DIR}
fi

for video in $(ls -A1 -U ${IN_DATA_DIR}/*)
do
  out_name="${OUT_DATA_DIR}/${video##*/}"
  if [ ! -f "${out_name}" ]; then
    ffmpeg -ss 0 -t 100 -i "${video}" "${out_name}"
  fi
done

1.4 Directory layout

ava  # top-level folder holding the video annotation information
—person_box_67091280_iou90  # second-level folder holding the detection files
——ava_detection_train_boxes_and_labels_include_negative_v2.2.csv  # detection boxes and labels used for training
——ava_detection_val_boxes_and_labels.csv  # detection boxes and labels used for validation
—ava_action_list_v2.2_for_activitynet_2019.pbtxt  # label definitions
—ava_val_excluded_timestamps_v2.2.csv  # frames without any person; these frames are discarded during training
—ava_train_v2.2.csv  # keyframe annotations for the training data
—ava_val_v2.2.csv  # keyframe annotations for the validation data

frame_lists  # top-level folder holding the paths of the images generated in 1.3
—train.csv
—val.csv

frames  # top-level folder holding the images generated in 1.3
—A
——A_000001.jpg
——A_0000012.jpg
…
——A_000090.jpg
—B
——B_000001.jpg
——B_0000012.jpg
…
——B_000090.jpg

2. Environment setup

2.1 Dependencies

pip install iopath
pip install fvcore
pip install simplejson
pip install pytorchvideo

2.2 Installing detectron2

!python -m pip install pyyaml==5.1
import sys, os, distutils.core
# Note: This is a faster way to install detectron2 in Colab, but it does not include all functionalities.
# See https://detectron2.readthedocs.io/tutorials/install.html for full installation instructions
!git clone 'https://github.com/facebookresearch/detectron2'
dist = distutils.core.run_setup("./detectron2/setup.py")
!python -m pip install {' '.join([f"'{x}'" for x in dist.install_requires])}
sys.path.insert(0, os.path.abspath('./detectron2'))

3. SlowFast training

3.1 Training

python tools/run_net.py --cfg configs/AVA/SLOWFAST_32x2_R50_SHORT.yaml
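Before launching the training command above, it helps to sanity-check the annotation files laid out in section 1.4. A minimal illustration of the expected row formats, assuming standard AVA-style annotations (the values below are made up):

```
# ava/ava_train_v2.2.csv — one row per person box on a keyframe:
# video_id, keyframe timestamp (s), x1, y1, x2, y2 (normalized to [0, 1]), action_id, person_id
A,2,0.226,0.312,0.472,0.893,1,0

# frame_lists/train.csv — space-separated, one row per frame extracted in 1.3:
# original_video_id video_id frame_id path labels
A 0 0 A/A_000001.jpg ""
```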
SLOWFAST_32x2_R50_SHORT.yaml:

TRAIN:
  ENABLE: True
  DATASET: ava
  BATCH_SIZE: 8 #64
  EVAL_PERIOD: 5
  CHECKPOINT_PERIOD: 1
  AUTO_RESUME: True
  CHECKPOINT_FILE_PATH: "/content/SLOWFAST_32x2_R101_50_50.pkl"  # path to the pretrained model
  CHECKPOINT_TYPE: pytorch
DATA:
  NUM_FRAMES: 32
  SAMPLING_RATE: 2
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3, 3]
  PATH_TO_DATA_DIR: "/content/ava"
DETECTION:
  ENABLE: True
  ALIGNED: True
AVA:
  FRAME_DIR: "/content/ava/frames"   # directory generated during data preparation
  FRAME_LIST_DIR: "/content/ava/frame_lists"
  ANNOTATION_DIR: "/content/ava/annotations"
  DETECTION_SCORE_THRESH: 0.5
  FULL_TEST_ON_VAL: True
  TRAIN_PREDICT_BOX_LISTS: [
    "ava_train_v2.2.csv",
    "person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv",
  ]
  TEST_PREDICT_BOX_LISTS: [
    "person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv"]
SLOWFAST:
  ALPHA: 4
  BETA_INV: 8
  FUSION_CONV_CHANNEL_RATIO: 2
  FUSION_KERNEL_SZ: 7
RESNET:
  ZERO_INIT_FINAL_BN: True
  WIDTH_PER_GROUP: 64
  NUM_GROUPS: 1
  DEPTH: 50
  TRANS_FUNC: bottleneck_transform
  STRIDE_1X1: False
  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]
  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]
  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]
NONLOCAL:
  LOCATION: [[[], []], [[], []], [[], []], [[], []]]
  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]
  INSTANTIATION: dot_product
  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 20
SOLVER:
  BASE_LR: 0.1
  LR_POLICY: steps_with_relative_lrs
  STEPS: [0, 10, 15, 20]
  LRS: [1, 0.1, 0.01, 0.001]
  MAX_EPOCH: 20
  MOMENTUM: 0.9
  WEIGHT_DECAY: 1e-7
  WARMUP_EPOCHS: 5.0
  WARMUP_START_LR: 0.000125
  OPTIMIZING_METHOD: sgd
MODEL:
  NUM_CLASSES: 1
  ARCH: slowfast
  MODEL_NAME: SlowFast
  LOSS_FUNC: bce
  DROPOUT_RATE: 0.5
  HEAD_ACT: sigmoid
TEST:
  ENABLE: False
  DATASET: ava
  BATCH_SIZE: 8
DATA_LOADER:
  NUM_WORKERS: 0
  PIN_MEMORY: True
NUM_GPUS: 1
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .

3.2 Common errors during training
1. In slowfast/datasets/ava_helper.py, change AVA_VALID_FRAMES to match the length of your videos (see the sketch after this list).

2. pytorchvideo.layers.distributed error:

from pytorchvideo.layers.distributed import ( # noqa
ImportError: cannot import name 'cat_all_gather' from 'pytorchvideo.layers.distributed'
(/site-packages/pytorchvideo/layers/distributed.py)

3. pytorchvideo.losses error:

File "SlowFast/slowfast/models/losses.py", line 11, in
from pytorchvideo.losses.soft_target_cross_entropy import (
ModuleNotFoundError: No module named 'pytorchvideo.losses'
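For error 1, the constant lives in slowfast/datasets/ava_helper.py. A minimal sketch of the change, assuming 80-second clips whose keyframes are annotated starting at second 1 (adjust the numbers to your own data; the upstream value is shown for comparison):

```python
# slowfast/datasets/ava_helper.py
# Upstream default: AVA movies are annotated from second 902 to 1798.
# AVA_VALID_FRAMES = range(902, 1799)

# Hypothetical values for custom 80-second clips annotated from second 1:
AVA_VALID_FRAMES = range(1, 80)
```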
Errors 2 and 3 can be resolved by following reference link 1 at the end of this article.

4. SlowFast prediction
Option 1: run inference with the official script:

python tools/run_net.py --cfg demo/AVA/SLOWFAST_32x2_R101_50_50.yaml

Option 2: because of the detectron2 installation issues, and a series of deployment problems later on, you can instead run inference with YOLOv5 plus SlowFast.

First, let's walk through the SlowFast inference procedure.
Step 1: keep reading clips of 64 consecutive frames, and check that a full 64-frame clip was collected.

while was_read:
    frames = []
    seq_length = 64
    while was_read and len(frames) < seq_length:
        was_read, frame = cap.read()
        frames.append(frame)
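Note that on the last clip of a video, cap.read() starts failing partway through, so frames can end up shorter than 64 entries and contain a trailing None. A small guard for that case (my own sketch, not part of the original code):

```python
def read_clip(cap, seq_length=64):
    """Read one clip of seq_length frames; return None when the video runs out."""
    frames = []
    was_read = True
    while was_read and len(frames) < seq_length:
        was_read, frame = cap.read()
        if was_read:                # only keep frames that were actually decoded
            frames.append(frame)
    return frames if len(frames) == seq_length else None
```

Incomplete tail clips can then simply be skipped (or, if preferred, padded by repeating the last frame).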
Step 2: run person detection with YOLOv5.
1. YOLOv5 inference code; change the sys.path.insert path and the weights path to your own.

import argparse
import os
import platform
import shutil
import time
from pathlib import Path
import sys
import json
sys.path.insert(1, "/content/drive/MyDrive/yolov5/")
import cv2
import torch
import torch.backends.cudnn as cudnn
import numpy as np
from numpy import random
from models.common import DetectMultiBackend
from utils.augmentations import letterbox
from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging
from utils.torch_utils import select_device

# ####### Parameters
conf_thres = 0.6
iou_thres = 0.5
#######
imgsz = 640
weights = "/content/yolov5l.pt"
device = "0"
stride = 32
names = ["person"]

def init():
    # Initialize
    global imgsz, device, stride
    set_logging()
    device = select_device("0")
    half = device.type != "cpu"  # half precision only supported on CUDA
    model = DetectMultiBackend(weights, device=device, dnn=False)
    stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine
    imgsz = check_img_size(imgsz, s=stride)  # check img_size
    model.half()  # to FP16
    model.eval()
    return model

def process_image(model, input_image=None, args=None, **kwargs):
    img0 = input_image
    img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]
    img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
    img = np.ascontiguousarray(img)

    img = torch.from_numpy(img).to(device)
    img = img.half()
    img /= 255.0  # 0 - 255 to 0.0 - 1.0
    if len(img.shape) == 3:
        img = img[None]
    pred = model(img, augment=False, val=True)[0]
    pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)
    result = []
    for i, det in enumerate(pred):  # detections per image
        gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh
        if det is not None and len(det):
            # Rescale boxes from img_size to im0 size
            det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()
            for *xyxy, conf, cls in det:
                if cls == 0:
                    result.append([float(xyxy[0]), float(xyxy[1]), float(xyxy[2]), float(xyxy[3])])
    if len(result) == 0:
        return None
    return torch.from_numpy(np.array(result))
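A minimal usage sketch for the two functions above (the image path is hypothetical):

```python
import cv2

model = init()                              # load the yolov5 weights once
frame = cv2.imread("/content/sample.jpg")   # hypothetical test image (BGR)
boxes = process_image(model, frame)         # person boxes as an Nx4 tensor [x1, y1, x2, y2], or None
if boxes is not None:
    print(boxes.shape, boxes[0])
```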
2. bbox preprocessing

import math

def scale_boxes(size, boxes, height, width):
    """
    Scale the short side of the box to size.
    Args:
        size (int): size to scale the image.
        boxes (ndarray): bounding boxes to perform scale. The dimension is
        `num boxes` x 4.
        height (int): the height of the image.
        width (int): the width of the image.
    Returns:
        boxes (ndarray): scaled bounding boxes.
    """
    if (width <= height and width == size) or (
        height <= width and height == size
    ):
        return boxes

    new_width = size
    new_height = size
    if width < height:
        new_height = int(math.floor((float(height) / width) * size))
        boxes *= float(new_height) / height
    else:
        new_width = int(math.floor((float(width) / height) * size))
        boxes *= float(new_width) / width
    return boxes
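As a quick worked example (my own, not from the original post): for a 1920x1080 frame whose short side is scaled to 256, the new width is floor(1920/1080*256) = 455, so every box coordinate is multiplied by 455/1920 (≈ 256/1080). Note that scale_boxes modifies the ndarray in place:

```python
import numpy as np

boxes = np.array([[960.0, 540.0, 1160.0, 900.0]])          # hypothetical box on a 1920x1080 frame
scaled = scale_boxes(256, boxes.copy(), height=1080, width=1920)
print(scaled)   # ≈ [[227.5, 128.0, 274.9, 213.3]]
```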
Step 3: image preprocessing
1. Resize the image.

def scale(size, image):
    """
    Scale the short side of the image to size.
    Args:
        size (int): size to scale the image.
        image (array): image to perform short side scale. Dimension is
            `height` x `width` x `channel`.
    Returns:
        (ndarray): the scaled image with dimension of
            `height` x `width` x `channel`.
    """
    height = image.shape[0]
    width = image.shape[1]
    # print(height, width)
    if (width <= height and width == size) or (
        height <= width and height == size
    ):
        return image
    new_width = size
    new_height = size
    if width < height:
        new_height = int(math.floor((float(height) / width) * size))
    else:
        new_width = int(math.floor((float(width) / height) * size))
    img = cv2.resize(
        image, (new_width, new_height), interpolation=cv2.INTER_LINEAR
    )
    # print(new_width, new_height)
    return img.astype(np.float32)
2. Normalization.

def tensor_normalize(tensor, mean, std, func=None):
    """
    Normalize a given tensor by subtracting the mean and dividing by the std.
    Args:
        tensor (tensor): tensor to normalize.
        mean (tensor or list): mean value to subtract.
        std (tensor or list): std to divide by.
    """
    if tensor.dtype == torch.uint8:
        tensor = tensor.float()
        tensor = tensor / 255.0
    if type(mean) == list:
        mean = torch.tensor(mean)
    if type(std) == list:
        std = torch.tensor(std)
    if func is not None:
        tensor = func(tensor)
    tensor = tensor - mean
    tensor = tensor / std
    return tensor
3. Build the slow and fast pathway inputs.
The main idea is to sample 32 of the 64 buffered frames as the fast pathway input, then sample 8 of those 32 frames as the slow pathway input, and permute T H W C -> C T H W. The final fast_pathway tensor therefore has shape (b, 3, 32, h, w) and slow_pathway has shape (b, 3, 8, h, w).

def process_cv2_inputs(frames):
    """
    Normalize and prepare inputs as a list of tensors. Each tensor
    correspond to a unique pathway.
    Args:
        frames (list of array): list of input images (correspond to one clip) in range [0, 255].
        cfg (CfgNode): configs. Details can be found in
            slowfast/config/defaults.py
    """
    inputs = torch.from_numpy(np.array(frames)).float() / 255
    inputs = tensor_normalize(inputs, [0.45, 0.45, 0.45], [0.225, 0.225, 0.225])
    # T H W C -> C T H W.
    inputs = inputs.permute(3, 0, 1, 2)
    # Sample frames for num_frames specified.
    index = torch.linspace(0, inputs.shape[1] - 1, 32).long()
    print(index)
    inputs = torch.index_select(inputs, 1, index)
    fast_pathway = inputs
    slow_pathway = torch.index_select(
        inputs,
        1,
        torch.linspace(
            0, inputs.shape[1] - 1, inputs.shape[1] // 4
        ).long(),
    )
    frame_list = [slow_pathway, fast_pathway]
    print(np.shape(frame_list[0]))
    inputs = [inp.unsqueeze(0) for inp in frame_list]
    return inputs

5. SlowFast ONNX inference

5.1 Exporting the ONNX file

import os
import sys
from collections import OrderedDict
import torch
import argparse

work_root = os.path.split(os.path.realpath(__file__))[0]
from slowfast.config.defaults import get_cfg
import slowfast.utils.checkpoint as cu
from slowfast.models import build_model


def parser_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cfg",
        dest="cfg_file",
        type=str,
        default=os.path.join(
            work_root, "/content/drive/MyDrive/SlowFast/demo/AVA/SLOWFAST_32x2_R101_50_50.yaml"),
        help="Path to the config file",
    )
    parser.add_argument(
        "--half",
        type=bool,
        default=False,
        help="use half mode",
    )
    parser.add_argument(
        "--checkpoint",
        type=str,
        default=os.path.join(work_root,
                             "/content/SLOWFAST_32x2_R101_50_50.pkl"),
        help="test model file path",
    )
    parser.add_argument(
        "--save",
        type=str,
        default=os.path.join(work_root, "/content/SLOWFAST_head.onnx"),
        help="save model file path",
    )
    return parser.parse_args()


def main():
    args = parser_args()
    print(args)
    cfg_file = args.cfg_file
    checkpoint_file = args.checkpoint
    save_checkpoint_file = args.save
    half_flag = args.half
    cfg = get_cfg()
    cfg.merge_from_file(cfg_file)
    cfg.TEST.CHECKPOINT_FILE_PATH = checkpoint_file
    print(cfg.DATA)
    print("export pytorch model to onnx! 
")     device = "cuda:0"     with torch.no_grad():         model = build_model(cfg)         model = model.to(device)         model.eval()         cu.load_test_checkpoint(cfg, model)         if half_flag:             model.half()         fast_pathway= torch.randn(1, 3, 32, 256, 455)         slow_pathway= torch.randn(1, 3, 8, 256, 455)         bbox=torch.randn(32,5).to(device)         fast_pathway = fast_pathway.to(device)         slow_pathway = slow_pathway.to(device)         inputs = [slow_pathway, fast_pathway]         for p in model.parameters():          p.requires_grad = False         torch.onnx.export(model, (inputs,bbox), save_checkpoint_file, input_names=["slow_pathway","fast_pathway","bbox"],output_names=["output"], opset_version=12)         onnx_check()   def onnx_check():     import onnx     args = parser_args()     print(args)     onnx_model_path = args.save     model = onnx.load(onnx_model_path)     onnx.checker.check_model(model)   if __name__ == "__main__":     main() 5.2onnx推理import torch import math import onnxruntime from torchvision.ops import roi_align import argparse import os import platform import shutil import time from pathlib import Path import sys import json sys.path.insert(1, "/content/drive/MyDrive/yolov5/") import cv2 import torch import torch.backends.cudnn as cudnn import numpy as np import argparse import time import cv2 import torch import torch.backends.cudnn as cudnn from numpy import random from models.common import DetectMultiBackend from utils.augmentations import letterbox from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging from utils.torch_utils import select_device # ####### 参数设置 conf_thres = 0.6 iou_thres = 0.5 ####### imgsz = 640 weights = "/content/yolov5l.pt" device = "0" stride = 32 names = ["person"] import os def init():     # Initialize     global imgsz, device, stride     set_logging()     device = select_device("0")     half = device.type != "cpu"  # half precision only supported on CUDA     model = DetectMultiBackend(weights, device=device, dnn=False)     stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine     imgsz = check_img_size(imgsz, s=stride)  # check img_size     model.half()  # to FP16     model.eval()     return model  def process_image(model, input_image=None, args=None, **kwargs):     img0 = input_image     img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]     img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB     img = np.ascontiguousarray(img)      img = torch.from_numpy(img).to(device)     img = img.half()     img /= 255.0  # 0 - 255 to 0.0 - 1.0     if len(img.shape) == 3:         img = img[None]     pred = model(img, augment=False, val=True)[0]     pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)     result=[]     for i, det in enumerate(pred):  # detections per image         gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh         if det is not None and len(det):             # Rescale boxes from img_size to im0 size             det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()             for *xyxy, conf, cls in det:                 if cls==0:                     result.append([float(xyxy[0]),float(xyxy[1]),float(xyxy[2]),float(xyxy[3])])     if len(result)==0:       return None     for i in range(32-len(result)):       result.append([float(0),float(0),float(0),float(0)])     return torch.from_numpy(np.array(result)) def scale(size, image):     """     Scale the 
short side of the image to size.     Args:         size (int): size to scale the image.         image (array): image to perform short side scale. Dimension is             `height` x `width` x `channel`.     Returns:         (ndarray): the scaled image with dimension of             `height` x `width` x `channel`.     """     height = image.shape[0]     width = image.shape[1]     # print(height,width)     if (width <= height and width == size) or (         height <= width and height == size     ):         return image     new_width = size     new_height = size     if width < height:         new_height = int(math.floor((float(height) / width) * size))     else:         new_width = int(math.floor((float(width) / height) * size))     img = cv2.resize(         image, (new_width, new_height), interpolation=cv2.INTER_LINEAR     )     # print(new_width, new_height)     return img.astype(np.float32) def tensor_normalize(tensor, mean, std, func=None):     """     Normalize a given tensor by subtracting the mean and piding the std.     Args:         tensor (tensor): tensor to normalize.         mean (tensor or list): mean value to subtract.         std (tensor or list): std to pide.     """     if tensor.dtype == torch.uint8:         tensor = tensor.float()         tensor = tensor / 255.0     if type(mean) == list:         mean = torch.tensor(mean)     if type(std) == list:         std = torch.tensor(std)     if func is not None:         tensor = func(tensor)     tensor = tensor - mean     tensor = tensor / std     return tensor def scale_boxes(size, boxes, height, width):     """     Scale the short side of the box to size.     Args:         size (int): size to scale the image.         boxes (ndarray): bounding boxes to peform scale. The dimension is         `num boxes` x 4.         height (int): the height of the image.         width (int): the width of the image.     Returns:         boxes (ndarray): scaled bounding boxes.     """     if (width <= height and width == size) or (         height <= width and height == size     ):         return boxes      new_width = size     new_height = size     if width < height:         new_height = int(math.floor((float(height) / width) * size))         boxes *= float(new_height) / height     else:         new_width = int(math.floor((float(width) / height) * size))         boxes *= float(new_width) / width     return boxes def process_cv2_inputs(frames):     """     Normalize and prepare inputs as a list of tensors. Each tensor     correspond to a unique pathway.     Args:         frames (list of array): list of input images (correspond to one clip) in range [0, 255].         cfg (CfgNode): configs. Details can be found in             slowfast/config/defaults.py     """     inputs = torch.from_numpy(np.array(frames)).float() / 255     inputs = tensor_normalize(inputs, [0.45,0.45,0.45], [0.225,0.225,0.225])     # T H W C -> C T H W.     inputs = inputs.permute(3, 0, 1, 2)     # Sample frames for num_frames specified.     
index = torch.linspace(0, inputs.shape[1] - 1, 32).long()     print(index)     inputs = torch.index_select(inputs, 1, index)     fast_pathway = inputs     slow_pathway = torch.index_select(             inputs,             1,             torch.linspace(                 0, inputs.shape[1] - 1, inputs.shape[1] // 4             ).long(),         )     frame_list = [slow_pathway, fast_pathway]     print(np.shape(frame_list[0]))     inputs = [inp.unsqueeze(0) for inp in frame_list]     return inputs #加载模型 yolov5=init() slowfast = onnxruntime.InferenceSession("/content/SLOWFAST_32x2_R101_50_50.onnx") #加载数据开始推理 cap = cv2.VideoCapture("/content/atm_125.mp4") was_read=True while was_read:     frames=[]     seq_length=64     while was_read and len(frames) < seq_length:         was_read, frame =cap.read()         frames.append(frame)          bboxes = process_image(yolov5,frames[64//2])     if bboxes is not None:       frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]       frames = [scale(256, frame) for frame in frames]       inputs = process_cv2_inputs(frames)       if bboxes is not None:           bboxes = scale_boxes(256,bboxes,1080,1920)           index_pad = torch.full(               size=(bboxes.shape[0], 1),               fill_value=float(0),               device=bboxes.device,           )           # Pad frame index for each box.           bboxes = torch.cat([index_pad, bboxes], axis=1)       for i in range(len(inputs)):         inputs[i] = inputs[i].numpy()       if bboxes is not None:           outputs = slowfast.run(None, {"slow_pathway": inputs[0],"fast_pathway":inputs[1],"bbox":bboxes})           for i in range(80):             if outputs[0][0][i]>0.3:               print(i)           print(np.shape(prd))     else:         print("没有检测到任何人物") 6slowfastpythonTensorrt推理6.1 导出Tensorrt
What follows is the novel part of this article.

At first we tried to export the ONNX model directly to TensorRT, but the export failed. The reason is that roi_align is not yet implemented in TensorRT (roi_align is scheduled for the next TensorRT release).

Looking at the exported ONNX graph, you can see that roi_align is only used in the head.

So we take the following approach, illustrated in the figure below: the roi_align module is split off on its own and is not accelerated by TensorRT. SlowFast is divided into two networks: the main (body) network extracts features, while the head network performs the action classification.
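The original post does not show the engine-build step itself. Below is a minimal sketch of turning the two exported ONNX files into serialized engines with the TensorRT 8.x Python API (equivalent to running trtexec --onnx=... --saveEngine=...); the file names mirror the ones used in the inference code below, and the 1 GB workspace limit is an assumption:

```python
import tensorrt as trt

def build_engine(onnx_path, engine_path):
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):          # report parse errors, e.g. unsupported ops
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse " + onnx_path)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB workspace (assumption)
    plan = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(plan)

# hypothetical paths matching the engines loaded in section 6.2
build_engine("/content/SLOWFAST_32x2_R101_50_50.onnx", "/content/SLOWFAST_32x2_R101_50_50.engine")
build_engine("/content/SLOWFAST_head.onnx", "/content/SLOWFAST_head.engine")
```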
  6.2Tensorrt推理代码import ctypes import os import numpy as np import cv2 import random import tensorrt as trt import pycuda.autoinit import pycuda.driver as cuda import threading import time   class TrtInference():     _batch_size = 1     def __init__(self, model_path=None, cuda_ctx=None):         self._model_path = model_path         if self._model_path is None:             print("please set trt model path!")             exit()         self.cuda_ctx = cuda_ctx         if self.cuda_ctx is None:             self.cuda_ctx = cuda.Device(0).make_context()         if self.cuda_ctx:             self.cuda_ctx.push()         self.trt_logger = trt.Logger(trt.Logger.INFO)         self._load_plugins()         self.engine = self._load_engine()         try:             self.context = self.engine.create_execution_context()             self.stream = cuda.Stream()             for index, binding in enumerate(self.engine):                 if self.engine.binding_is_input(binding):                     batch_shape = list(self.engine.get_binding_shape(binding)).copy()                     batch_shape[0] = self._batch_size                     self.context.set_binding_shape(index, batch_shape)             self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings = self._allocate_buffers()         except Exception as e:             raise RuntimeError("fail to allocate CUDA resources") from e         finally:             if self.cuda_ctx:                 self.cuda_ctx.pop()      def _load_plugins(self):         pass      def _load_engine(self):         with open(self._model_path, "rb") as f, trt.Runtime(self.trt_logger) as runtime:             return runtime.deserialize_cuda_engine(f.read())      def _allocate_buffers(self):         host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings =              [], [], [], [], []         for index, binding in enumerate(self.engine):             size = trt.volume(self.context.get_binding_shape(index)) *                     self.engine.max_batch_size             host_mem = cuda.pagelocked_empty(size, np.float32)             cuda_mem = cuda.mem_alloc(host_mem.nbytes)             bindings.append(int(cuda_mem))             if self.engine.binding_is_input(binding):                 host_inputs.append(host_mem)                 cuda_inputs.append(cuda_mem)             else:                 host_outputs.append(host_mem)                 cuda_outputs.append(cuda_mem)         return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings      def destroy(self):         """Free CUDA memories and context."""         del self.cuda_outputs         del self.cuda_inputs         del self.stream         if self.cuda_ctx:             self.cuda_ctx.pop()             del self.cuda_ctx      def inference(self, inputs):         np.copyto(self.host_inputs[0], inputs[0].ravel())         np.copyto(self.host_inputs[1], inputs[1].ravel())         if self.cuda_ctx:             self.cuda_ctx.push()         cuda.memcpy_htod_async(             self.cuda_inputs[0], self.host_inputs[0], self.stream)         cuda.memcpy_htod_async(             self.cuda_inputs[1], self.host_inputs[1], self.stream)         self.context.execute_async(             batch_size=1,             bindings=self.bindings,             stream_handle=self.stream.handle)         cuda.memcpy_dtoh_async(             self.host_outputs[0], self.cuda_outputs[0], self.stream)         cuda.memcpy_dtoh_async(             self.host_outputs[1], self.cuda_outputs[1], self.stream)         self.stream.synchronize()  
       if self.cuda_ctx:             self.cuda_ctx.pop()         output = [self.host_outputs[0],self.host_outputs[1]]         return output   class TrtInference_head():     _batch_size = 1     def __init__(self, model_path=None, cuda_ctx=None):         self._model_path = model_path         if self._model_path is None:             print("please set trt model path!")             exit()         self.cuda_ctx = cuda_ctx         if self.cuda_ctx is None:             self.cuda_ctx = cuda.Device(0).make_context()         if self.cuda_ctx:             self.cuda_ctx.push()         self.trt_logger = trt.Logger(trt.Logger.INFO)         self._load_plugins()         self.engine = self._load_engine()         try:             self.context = self.engine.create_execution_context()             self.stream = cuda.Stream()             for index, binding in enumerate(self.engine):                 if self.engine.binding_is_input(binding):                     batch_shape = list(self.engine.get_binding_shape(binding)).copy()                     batch_shape[0] = self._batch_size                     self.context.set_binding_shape(index, batch_shape)             self.host_inputs, self.host_outputs, self.cuda_inputs, self.cuda_outputs, self.bindings = self._allocate_buffers()         except Exception as e:             raise RuntimeError("fail to allocate CUDA resources") from e         finally:             if self.cuda_ctx:                 self.cuda_ctx.pop()      def _load_plugins(self):         pass      def _load_engine(self):         with open(self._model_path, "rb") as f, trt.Runtime(self.trt_logger) as runtime:             return runtime.deserialize_cuda_engine(f.read())      def _allocate_buffers(self):         host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings =              [], [], [], [], []         for index, binding in enumerate(self.engine):             size = trt.volume(self.context.get_binding_shape(index)) *                     self.engine.max_batch_size             host_mem = cuda.pagelocked_empty(size, np.float32)             cuda_mem = cuda.mem_alloc(host_mem.nbytes)             bindings.append(int(cuda_mem))             if self.engine.binding_is_input(binding):                 host_inputs.append(host_mem)                 cuda_inputs.append(cuda_mem)             else:                 host_outputs.append(host_mem)                 cuda_outputs.append(cuda_mem)         return host_inputs, host_outputs, cuda_inputs, cuda_outputs, bindings      def destroy(self):         """Free CUDA memories and context."""         del self.cuda_outputs         del self.cuda_inputs         del self.stream         if self.cuda_ctx:             self.cuda_ctx.pop()             del self.cuda_ctx      def inference(self, inputs):         np.copyto(self.host_inputs[0], inputs[0].ravel())         np.copyto(self.host_inputs[1], inputs[1].ravel())         if self.cuda_ctx:             self.cuda_ctx.push()         cuda.memcpy_htod_async(             self.cuda_inputs[0], self.host_inputs[0], self.stream)         cuda.memcpy_htod_async(             self.cuda_inputs[1], self.host_inputs[1], self.stream)         self.context.execute_async(             batch_size=1,             bindings=self.bindings,             stream_handle=self.stream.handle)         cuda.memcpy_dtoh_async(             self.host_outputs[0], self.cuda_outputs[0], self.stream)         self.stream.synchronize()         if self.cuda_ctx:             self.cuda_ctx.pop()         output = self.host_outputs[0]         return output  import torch import math 
from torchvision.ops import roi_align import argparse import os import platform import shutil import time from pathlib import Path import sys import json sys.path.insert(1, "/content/drive/MyDrive/yolov5/") import cv2 import torch import torch.backends.cudnn as cudnn import numpy as np import argparse import time import cv2 import torch import torch.backends.cudnn as cudnn from numpy import random from models.common import DetectMultiBackend from utils.augmentations import letterbox from utils.general import check_img_size, non_max_suppression, scale_coords, set_logging from utils.torch_utils import select_device # ####### 参数设置 conf_thres = 0.89 iou_thres = 0.5 ####### imgsz = 640 weights = "/content/yolov5l.pt" device = "0" stride = 32 names = ["person"] import os def init():     # Initialize     global imgsz, device, stride     set_logging()     device = select_device("0")     half = device.type != "cpu"  # half precision only supported on CUDA     model = DetectMultiBackend(weights, device=device, dnn=False)     stride, pt, jit, engine = model.stride, model.pt, model.jit, model.engine     imgsz = check_img_size(imgsz, s=stride)  # check img_size     model.half()  # to FP16     model.eval()     return model  def process_image(model, input_image=None, args=None, **kwargs):     img0 = input_image     img = letterbox(img0, new_shape=imgsz, stride=stride, auto=True)[0]     img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB     img = np.ascontiguousarray(img)      img = torch.from_numpy(img).to(device)     img = img.half()     img /= 255.0  # 0 - 255 to 0.0 - 1.0     if len(img.shape) == 3:         img = img[None]     pred = model(img, augment=False, val=True)[0]     pred = non_max_suppression(pred, conf_thres, iou_thres, agnostic=False)     result=[]     for i, det in enumerate(pred):  # detections per image         gn = torch.tensor(img0.shape)[[1, 0, 1, 0]]  # normalization gain whwh         if det is not None and len(det):             # Rescale boxes from img_size to im0 size             det[:, :4] = scale_coords(img.shape[2:], det[:, :4], img0.shape).round()             for *xyxy, conf, cls in det:                 if cls==0:                     result.append([float(xyxy[0]),float(xyxy[1]),float(xyxy[2]),float(xyxy[3])])     if len(result)==0:       return None     for i in range(32-len(result)):       result.append([float(0),float(0),float(0),float(0)])     return torch.from_numpy(np.array(result)) def scale(size, image):     """     Scale the short side of the image to size.     Args:         size (int): size to scale the image.         image (array): image to perform short side scale. Dimension is             `height` x `width` x `channel`.     Returns:         (ndarray): the scaled image with dimension of             `height` x `width` x `channel`.     """     height = image.shape[0]     width = image.shape[1]     # print(height,width)     if (width <= height and width == size) or (         height <= width and height == size     ):         return image     new_width = size     new_height = size     if width < height:         new_height = int(math.floor((float(height) / width) * size))     else:         new_width = int(math.floor((float(width) / height) * size))     img = cv2.resize(         image, (new_width, new_height), interpolation=cv2.INTER_LINEAR     )     # print(new_width, new_height)     return img.astype(np.float32) def tensor_normalize(tensor, mean, std, func=None):     """     Normalize a given tensor by subtracting the mean and piding the std.     
Args:         tensor (tensor): tensor to normalize.         mean (tensor or list): mean value to subtract.         std (tensor or list): std to pide.     """     if tensor.dtype == torch.uint8:         tensor = tensor.float()         tensor = tensor / 255.0     if type(mean) == list:         mean = torch.tensor(mean)     if type(std) == list:         std = torch.tensor(std)     if func is not None:         tensor = func(tensor)     tensor = tensor - mean     tensor = tensor / std     return tensor def scale_boxes(size, boxes, height, width):     """     Scale the short side of the box to size.     Args:         size (int): size to scale the image.         boxes (ndarray): bounding boxes to peform scale. The dimension is         `num boxes` x 4.         height (int): the height of the image.         width (int): the width of the image.     Returns:         boxes (ndarray): scaled bounding boxes.     """     if (width <= height and width == size) or (         height <= width and height == size     ):         return boxes      new_width = size     new_height = size     if width < height:         new_height = int(math.floor((float(height) / width) * size))         boxes *= float(new_height) / height     else:         new_width = int(math.floor((float(width) / height) * size))         boxes *= float(new_width) / width     return boxes def process_cv2_inputs(frames):     """     Normalize and prepare inputs as a list of tensors. Each tensor     correspond to a unique pathway.     Args:         frames (list of array): list of input images (correspond to one clip) in range [0, 255].         cfg (CfgNode): configs. Details can be found in             slowfast/config/defaults.py     """     inputs = torch.from_numpy(np.array(frames)).float() / 255     inputs = tensor_normalize(inputs, [0.45,0.45,0.45], [0.225,0.225,0.225])     # T H W C -> C T H W.     inputs = inputs.permute(3, 0, 1, 2)     # Sample frames for num_frames specified.     index = torch.linspace(0, inputs.shape[1] - 1, 32).long()     print(index)     inputs = torch.index_select(inputs, 1, index)     fast_pathway = inputs     slow_pathway = torch.index_select(             inputs,             1,             torch.linspace(                 0, inputs.shape[1] - 1, inputs.shape[1] // 4             ).long(),         )     frame_list = [slow_pathway, fast_pathway]     print(np.shape(frame_list[0]))     inputs = [inp.unsqueeze(0) for inp in frame_list]     return inputs #加载模型 yolov5=init() slowfast = TrtInference("/content/SLOWFAST_32x2_R101_50_50.engine",None) head = TrtInference_head("/content/SLOWFAST_head.engine",None)  #加载数据开始推理 cap = cv2.VideoCapture("/content/atm_125.mp4") was_read=True while was_read:     frames=[]     seq_length=64     while was_read and len(frames) < seq_length:         was_read, frame =cap.read()         frames.append(frame)          bboxes = process_image(yolov5,frames[64//2])     if bboxes is not None:       frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]       frames = [scale(256, frame) for frame in frames]       inputs = process_cv2_inputs(frames)       print(bboxes)       if bboxes is not None:           bboxes = scale_boxes(256,bboxes,1080,1920)           index_pad = torch.full(               size=(bboxes.shape[0], 1),               fill_value=float(0),               device=bboxes.device,           )           # Pad frame index for each box.           
bboxes = torch.cat([index_pad, bboxes], axis=1)       for i in range(len(inputs)):         inputs[i] = inputs[i].numpy()       if bboxes is not None:           outputs=slowfast.inference(inputs)           outputs[0]=outputs[0].reshape(1,2048,16,29)           outputs[1]=outputs[1].reshape(1,256,16,29)           outputs[0]=torch.from_numpy(outputs[0])           outputs[1]=torch.from_numpy(outputs[1])           outputs[0]=roi_align(outputs[0],bboxes.to(dtype=outputs[0].dtype),7,1.0/16,0,True)           outputs[1]=roi_align(outputs[1],bboxes.to(dtype=outputs[1].dtype),7,1.0/16,0,True)           outputs[0] = outputs[0].numpy()           outputs[1] = outputs[1].numpy()           prd=head.inference(outputs)           prd=prd.reshape(32,80)           for i in range(80):             if prd[0][i]>0.3:               print(i)     else:         print("没有检测到任何人物")
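As a standalone sanity check of the roi_align step used above (my own sketch; the feature-map shapes (1, 2048, 16, 29) and (1, 256, 16, 29) and the 32 padded boxes follow the reshapes in the code):

```python
import torch
from torchvision.ops import roi_align

slow_feat = torch.randn(1, 2048, 16, 29)        # body output, slow pathway
fast_feat = torch.randn(1, 256, 16, 29)         # body output, fast pathway

boxes = torch.zeros(32, 5)                      # [batch_index, x1, y1, x2, y2] on the 256x455 input
boxes[:, 1:] = torch.tensor([10.0, 20.0, 120.0, 200.0])   # hypothetical box coordinates

slow_roi = roi_align(slow_feat, boxes, output_size=7, spatial_scale=1.0 / 16, sampling_ratio=0, aligned=True)
fast_roi = roi_align(fast_feat, boxes, output_size=7, spatial_scale=1.0 / 16, sampling_ratio=0, aligned=True)
print(slow_roi.shape, fast_roi.shape)           # torch.Size([32, 2048, 7, 7]) torch.Size([32, 256, 7, 7])
```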
Walking through the code above: slow_pathway and fast_pathway are run through the SlowFast body model, the body outputs are reshaped into the dimensions roi_align expects, and the reshaped features, the bboxes and the corresponding parameters are fed into roi_align to produce the inputs required by the head model.

7. SlowFast C++ TensorRT deployment

7.1 YOLOv5 C++ object detection
YOLOv5 itself is not covered in this article; I directly use the platform's built-in YOLOv5 TensorRT code: https://github.com/ExtremeMart/ev_sdk_demo4.0_pedestrian_intrusion_yolov5

7.2 DeepSORT C++ object tracking

This article uses the following DeepSORT code as a reference: https://github.com/RichardoMrMu/deepsort-tensorrt

Since this part is not the focus of the article, you only need to know how to use the code: write the CMakeLists file, and then call DeepSORT as follows.

#include "deepsort.h"

/**
 DeepSortBox holds the detections produced by yolov5.
 DeepSortBox structure:
 {
   x1,
   y1,
   x2,
   y2,
   score,
   label,
   trackID
 }
 img is the original image.
 The final results are written back into DeepSortBox.
*/
DS->sort(img, DeepSortBox);

7.3 SlowFast C++ action recognition
Runtime environment:

TensorRT 8.4
OpenCV 4.1.1
cuDNN 8.0
CUDA 11.1

Files required:

body.onnx
head.onnx

SlowFast inference flow chart
We implement the TensorRT inference code following the prediction flow chart above.

Inputs and outputs of body.onnx, viewed with an ONNX visualizer.

Inputs and outputs of head.onnx.
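To inspect the same input/output information without a visualizer, a small sketch using the onnx package (paths as used earlier):

```python
import onnx

for path in ["body.onnx", "head.onnx"]:
    model = onnx.load(path)
    print(path)
    for t in model.graph.input:
        dims = [d.dim_value for d in t.type.tensor_type.shape.dim]
        print("  input ", t.name, dims)
    for t in model.graph.output:
        dims = [d.dim_value for d in t.type.tensor_type.shape.dim]
        print("  output", t.name, dims)
```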
Step 1: model loading
Load body.onnx and head.onnx with TensorRT and allocate the TensorRT inference workspace. The code is as follows:

void loadheadOnnx(const std::string strModelName)
{
    Logger gLogger;
    // Build the network following the TensorRT pipeline
    IBuilder* builder = createInferBuilder(gLogger);
    builder->setMaxBatchSize(1);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile(strModelName.c_str(), static_cast<int>(ILogger::Severity::kWARNING));
    IBuilderConfig* config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 30);
    m_CudaheadEngine = builder->buildEngineWithConfig(*network, *config);

    std::string strTrtName = strModelName;
    size_t sep_pos = strTrtName.find_last_of(".");
    strTrtName = strTrtName.substr(0, sep_pos) + ".trt";
    IHostMemory *gieModelStream = m_CudaheadEngine->serialize();
    std::string serialize_str;
    std::ofstream serialize_output_stream;
    serialize_str.resize(gieModelStream->size());
    memcpy((void*)serialize_str.data(), gieModelStream->data(), gieModelStream->size());
    serialize_output_stream.open(strTrtName.c_str());
    serialize_output_stream << serialize_str;   // write the serialized engine to the .trt file
    serialize_output_stream.close();
    m_CudaheadContext = m_CudaheadEngine->createExecutionContext();
    parser->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();
}
Step 2: allocate buffers for the input and output data
body.onnx takes slow_pathway and fast_pathway as inputs with shape (B, C, T, H, W): slow_pathway has T = 8 and its output is (B, 2048, 16, 29), while fast_pathway has T = 32 and its output is (B, 256, 16, 29). head takes (32, 2048, 7, 7) and (32, 256, 7, 7) as inputs and outputs (32, 80). The code is as follows:

    slow_pathway_InputIndex = m_CudaslowfastEngine->getBindingIndex(slow_pathway_NAME);
    fast_pathway_InputIndex = m_CudaslowfastEngine->getBindingIndex(fast_pathway_NAME);
    slow_pathway_OutputIndex = m_CudaslowfastEngine->getBindingIndex(slow_pathway_OUTPUT);
    fast_pathway_OutputIndex = m_CudaslowfastEngine->getBindingIndex(fast_pathway_OUTPUT);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(slow_pathway_InputIndex);
    SDKLOG(INFO) << "slow_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3] << " " << dims_i.d[4];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3] * dims_i.d[4];
    cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_InputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[slow_pathway_InputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[slow_pathway_InputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(fast_pathway_InputIndex);
    SDKLOG(INFO) << "fast_pathway dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3] << " " << dims_i.d[4];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3] * dims_i.d[4];
    cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_InputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[fast_pathway_InputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[fast_pathway_InputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(slow_pathway_OutputIndex);
    SDKLOG(INFO) << "slow_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
    cudaMalloc(&slowfast_ArrayDevMemory[slow_pathway_OutputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[slow_pathway_OutputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[slow_pathway_OutputIndex] = size * sizeof(float);

    dims_i = m_CudaslowfastEngine->getBindingDimensions(fast_pathway_OutputIndex);
    SDKLOG(INFO) << "fast_out dims " << dims_i.d[0] << " " << dims_i.d[1] << " " << dims_i.d[2] << " " << dims_i.d[3];
    size = dims_i.d[0] * dims_i.d[1] * dims_i.d[2] * dims_i.d[3];
    cudaMalloc(&slowfast_ArrayDevMemory[fast_pathway_OutputIndex], size * sizeof(float));
    slowfast_ArrayHostMemory[fast_pathway_OutputIndex] = malloc(size * sizeof(float));
    slowfast_ArraySize[fast_pathway_OutputIndex] = size * sizeof(float);

    size = 32 * 2048 * 7 * 7;
    cudaMalloc(&ROIAlign_ArrayDevMemory[0], size * sizeof(float));
    ROIAlign_ArrayHostMemory[0] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[0] = size * sizeof(float);

    size = 32 * 256 * 7 * 7;
    cudaMalloc(&ROIAlign_ArrayDevMemory[1], size * sizeof(float));
    ROIAlign_ArrayHostMemory[1] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[1] = size * sizeof(float);

    size = 32 * 80;
    cudaMalloc(&ROIAlign_ArrayDevMemory[2], size * sizeof(float));
    ROIAlign_ArrayHostMemory[2] = malloc(size * sizeof(float));
    ROIAlign_ArraySize[2] = size * sizeof(float);

    size = 32 * 5;
    boxes_data = malloc(size * sizeof(float));
    dims_i = m_CudaheadEngine->getBindingDimensions(0);
Step 3: input data preprocessing
First, because I exported the ONNX files without dynamic shapes, the input image size is fixed at 256x455 (the result of scaling 1080x1920 while keeping the aspect ratio). The SlowFast model expects RGB, so the image must be converted from BGR to RGB and then resized to 256x455. The code is as follows:

        cv::Mat framesimg = img.clone();
        cv::cvtColor(framesimg, framesimg, cv::COLOR_BGR2RGB);
        int height = framesimg.rows;
        int width = framesimg.cols;
        // Preprocess the image (cv2.COLOR_BGR2RGB, then short-side resize to 256)
        int size = 256;
        int new_width = width;
        int new_height = height;
        if ((width <= height && width == size) || (height <= width and height == size)) {
        }
        else {
            new_width = size;
            new_height = size;
            if (width < height) {
                new_height = int(floor((float(height) / width) * size));
            } else {
                new_width = int(floor((float(width) / height) * size));
            }
            cv::resize(framesimg, framesimg, cv::Size(new_width, new_height));
        }
        // Fast pathway: every frame of the clip, normalized with mean 0.45 and std 0.225
        float *data = (float *)slowfast_ArrayHostMemory[fast_pathway_InputIndex];
        for (size_t c = 0; c < 3; c++)
        {
            for (size_t h = 0; h < new_height; h++)
            {
                for (size_t w = 0; w < new_width; w++)
                {
                    float v = ((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                    v -= 0.45;
                    v /= 0.225;
                    data[c*32*256*455 + fast_index * new_width * new_height + h * new_width + w] = v;
                }
            }
        }
        fast_index++;
        // Slow pathway: only 8 of the 64 frames are written
        if (frames==0 || frames==8 || frames==16 || frames==26 || frames==34 || frames==44 || frames==52 || frames==63) {
            data = (float *)slowfast_ArrayHostMemory[slow_pathway_InputIndex];
            for (size_t c = 0; c < 3; c++)
            {
                for (size_t h = 0; h < new_height; h++)
                {
                    for (size_t w = 0; w < new_width; w++)
                    {
                        float v = ((float)framesimg.at<cv::Vec3b>(h, w)[c]) / 255.0f;
                        v -= 0.45;
                        v /= 0.225;
                        data[c*8*256*455 + slow_index * new_width * new_height + h * new_width + w] = v;
                    }
                }
            }
            slow_index++;
        }
Step 4: roi_align implementation
As described in the previous section, roi_align is not implemented in the current TensorRT version. torchvision.ops provides roi_align, so the Python inference code can call it directly, but the C++ code has to implement roi_align itself. The underlying theory is not covered here; you can simply think of roi_align as a crop-and-resize operation: it extracts the features corresponding to each bbox from the feature map and resizes them to 7x7. The code is as follows:

void ROIAlignForwardCpu(const float* bottom_data, const float spatial_scale, const int num_rois,
                        const int height, const int width, const int channels,
                        const int aligned_height, const int aligned_width, const float* bottom_rois,
                        float* top_data)
{
    const int output_size = num_rois * aligned_height * aligned_width * channels;

    int idx = 0;
    for (idx = 0; idx < output_size; ++idx)
    {
        int pw = idx % aligned_width;
        int ph = (idx / aligned_width) % aligned_height;
        int c = (idx / aligned_width / aligned_height) % channels;
        int n = idx / aligned_width / aligned_height / channels;

        float roi_batch_ind = 0;
        float roi_start_w = bottom_rois[n * 5 + 1] * spatial_scale;
        float roi_start_h = bottom_rois[n * 5 + 2] * spatial_scale;
        float roi_end_w = bottom_rois[n * 5 + 3] * spatial_scale;
        float roi_end_h = bottom_rois[n * 5 + 4] * spatial_scale;

        float roi_width = fmaxf(roi_end_w - roi_start_w + 1., 0.);
        float roi_height = fmaxf(roi_end_h - roi_start_h + 1., 0.);
        float bin_size_h = roi_height / (aligned_height - 1.);
        float bin_size_w = roi_width / (aligned_width - 1.);

        float h = (float)(ph) * bin_size_h + roi_start_h;
        float w = (float)(pw) * bin_size_w + roi_start_w;

        int hstart = fminf(floor(h), height - 2);
        int wstart = fminf(floor(w), width - 2);

        int img_start = roi_batch_ind * channels * height * width;

        if (h < 0 || h >= height || w < 0 || w >= width)
        {
            top_data[idx] = 0.;
        }
        else
        {
            float h_ratio = h - (float)(hstart);
            float w_ratio = w - (float)(wstart);
            int upleft = img_start + (c * height + hstart) * width + wstart;
            int upright = upleft + 1;
            int downleft = upleft + width;
            int downright = downleft + 1;

            top_data[idx] = bottom_data[upleft] * (1. - h_ratio) * (1. - w_ratio)
                + bottom_data[upright] * (1. - h_ratio) * w_ratio
                + bottom_data[downleft] * h_ratio * (1. - w_ratio)
                + bottom_data[downright] * h_ratio * w_ratio;
        }
    }
}
Step 5: inference

First, run the data prepared in Step 3 through the body model. Then use the roi_align function from Step 4 to extract the bbox-aligned features from the body outputs. Finally, run the extracted features through the head model to obtain the output. The code is as follows:

    cudaMemcpyAsync(slowfast_ArrayDevMemory[slow_pathway_InputIndex], slowfast_ArrayHostMemory[slow_pathway_InputIndex], slowfast_ArraySize[slow_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
    cudaMemcpyAsync(slowfast_ArrayDevMemory[fast_pathway_InputIndex], slowfast_ArrayHostMemory[fast_pathway_InputIndex], slowfast_ArraySize[fast_pathway_InputIndex], cudaMemcpyHostToDevice, m_CudaStream);
    m_CudaslowfastContext->enqueueV2(slowfast_ArrayDevMemory, m_CudaStream, nullptr);
    cudaMemcpyAsync(slowfast_ArrayHostMemory[slow_pathway_OutputIndex], slowfast_ArrayDevMemory[slow_pathway_OutputIndex], slowfast_ArraySize[slow_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaMemcpyAsync(slowfast_ArrayHostMemory[fast_pathway_OutputIndex], slowfast_ArrayDevMemory[fast_pathway_OutputIndex], slowfast_ArraySize[fast_pathway_OutputIndex], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaStreamSynchronize(m_CudaStream);

    data = (float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex];
    ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[slow_pathway_OutputIndex], 0.0625, 32, 16, 29, 2048, 7, 7, (float*)boxes_data, (float*)ROIAlign_ArrayHostMemory[0]);
    ROIAlignForwardCpu((float*)slowfast_ArrayHostMemory[fast_pathway_OutputIndex], 0.0625, 32, 16, 29, 256, 7, 7, (float*)boxes_data, (float*)ROIAlign_ArrayHostMemory[1]);
    data = (float*)ROIAlign_ArrayHostMemory[0];
    cudaMemcpyAsync(ROIAlign_ArrayDevMemory[0], ROIAlign_ArrayHostMemory[0], ROIAlign_ArraySize[0], cudaMemcpyHostToDevice, m_CudaStream);
    cudaMemcpyAsync(ROIAlign_ArrayDevMemory[1], ROIAlign_ArrayHostMemory[1], ROIAlign_ArraySize[1], cudaMemcpyHostToDevice, m_CudaStream);
    m_CudaheadContext->enqueueV2(ROIAlign_ArrayDevMemory, m_CudaStream, nullptr);
    cudaMemcpyAsync(ROIAlign_ArrayHostMemory[2], ROIAlign_ArrayDevMemory[2], ROIAlign_ArraySize[2], cudaMemcpyDeviceToHost, m_CudaStream);
    cudaStreamSynchronize(m_CudaStream);

Reference links
1. https://blog.csdn.net/y459541195/article/details/126278476
2. https://blog.csdn.net/WhiffeYF/article/details/115581800
3. https://github.com/facebookresearch/SlowFast
