Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics

1Beijing Institute of Technology 2Northwestern Polytechnical University 3The Hong Kong Polytechnic University

Arena: our patch-of-interest ViT inference acceleration system for edge-assisted video analytics. Because of the camera's limited computing power, the extracted Patches-of-Interest are offloaded to an edge server with more powerful GPUs for processing. MTPs stand for Memory Token Pools.


Abstract

The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, built on traditional model architectures (e.g., CNNs, RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption, but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments thanks to their remarkable generalization capability. However, they require a large amount of computing power, which limits their application to real-time intelligent video analytics. In this paper, we find that visual foundation models like the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage the fact that ViT can be accelerated through token pruning by offloading and feeding only Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, a simple yet efficient mechanism for determining PoIs from the probable locations of objects in subsequent frames. Through extensive evaluations on public datasets, our findings reveal that Arena can boost inference speeds by up to \(1.58\times\) and \(1.82\times\) on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
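The token pruning idea above can be illustrated with a minimal sketch: split a frame into a non-overlapping patch grid and keep only the patches marked as PoIs, so that only those tokens are transmitted and fed to the ViT. The function name `extract_poi_tokens` and the 224-pixel frame with 16-pixel patches are illustrative assumptions, not Arena's actual implementation.

```python
import numpy as np

def extract_poi_tokens(frame, patch_size, poi_mask):
    """Split a frame into non-overlapping patches and keep only the
    Patches-of-Interest (PoIs). Hypothetical sketch: in Arena, only the
    kept patches would be offloaded to the edge-server ViT."""
    h, w, c = frame.shape
    ph, pw = h // patch_size, w // patch_size
    # Rearrange into (num_patches, patch_size * patch_size * c) tokens.
    patches = (frame[:ph * patch_size, :pw * patch_size]
               .reshape(ph, patch_size, pw, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(ph * pw, -1))
    # Keep only the tokens flagged by the PoI mask.
    return patches[poi_mask.reshape(-1)]

frame = np.zeros((224, 224, 3), dtype=np.uint8)   # one 224x224 RGB frame
mask = np.zeros((14, 14), dtype=bool)
mask[5:8, 5:8] = True                              # 9 of 196 patches are PoIs
tokens = extract_poi_tokens(frame, 16, mask)
print(tokens.shape)                                # (9, 768)
```

Only 9 of the 196 patch tokens survive, which is where the bandwidth and compute savings come from: the downstream transformer's cost scales with the number of tokens it processes.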

Method

Overview of the proposed Arena. Given \(K\) continuous frames \(\{\hat{\mathbf{x}}^1, \mathbf{x}^2, \ldots, \mathbf{x}^K\}\) in an interval, Arena periodically operates in two distinct phases: keyframe inference (Left) for the first frame \(\hat{\mathbf{x}}^1\) and non-keyframe inference (Right) for the remaining frames. Both phases use the same network architecture with shared weights. Notably, we split the frame into nine patches only for demonstration.
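The non-keyframe phase can be sketched as probability-based patch sampling over the patch grid: expand the previous frame's PoI mask to cover nearby patches (objects may move between frames), then keep each candidate patch with probability \(p\). This is a hypothetical interpretation using the parameter names \(m\) and \(p\) from the visualization section; the exact Arena procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_poi_mask(prev_mask, m, p):
    """Hypothetical sketch of probability-based patch sampling for a
    non-keyframe: dilate the previous PoI mask by m patches, then keep
    each candidate patch with probability p."""
    dilated = np.zeros_like(prev_mask)
    for y, x in zip(*np.nonzero(prev_mask)):
        # Mark the (2m+1) x (2m+1) neighborhood of each previous PoI.
        dilated[max(0, y - m):y + m + 1, max(0, x - m):x + m + 1] = True
    # Sample each candidate patch independently with probability p.
    return dilated & (rng.random(dilated.shape) < p)

prev = np.zeros((14, 14), dtype=bool)   # 14x14 patch grid
prev[6, 6] = True                       # one PoI in the previous frame
mask = sample_poi_mask(prev, m=1, p=0.9)
print(mask.sum())                       # at most 9 patches offloaded
```

Only the sampled patches are transmitted and inferred for the non-keyframe; the keyframe phase would instead process the full frame to re-localize objects.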

Accuracy

The accuracy of different methods on two datasets. Arena keeps the accuracy loss within 1% and 4% on the two datasets, respectively.

Bandwidth Usage

The normalized bandwidth usage of different methods on two datasets.

End-to-end Latency

The average end-to-end latency per frame of different methods on two datasets. End-to-end latency includes a breakdown of preprocessing, transmission, and inference time.


Visualization

Visualization of Arena on two videos. In these two scenes, with a frame interval of 5, \(m\) is set to 1 and 3 for MOT17 and AIC22, respectively, \(p=0.9\), and \(F=200\). Only the red patches in non-keyframes are transmitted and used for inference.

Heatmaps of patches identified as PoIs, where darker areas indicate a higher frequency of offloading to the edge server.

BibTeX

@misc{peng2024arena,
      title={Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics}, 
      author={Haosong Peng and Wei Feng and Hao Li and Yufeng Zhan and Qihua Zhou and Yuanqing Xia},
      year={2024},
      eprint={2404.09245},
      archivePrefix={arXiv},
      primaryClass={cs.MM}
}