Chapter 5

Software

There are different software solutions for building computer-vision systems. The software for this work encompasses several interdependent parts: image processing and transmission, data processing, and neural networks.

For image processing, several modules are available (OpenCV [36 ], Scikit-Image [37 ], SciPy [38 ], Pillow [39 ]), of which OpenCV and Scikit-Image were built specifically for machine learning. To solve this task, the OpenCV module was used instead of Scikit-Image, because the former also supports reading video via Gstreamer [40 ], which allows use of the Nvidia video decoder [41 ] integrated into the Jetson. This saves the CPU when reading the image stream and does so significantly more efficiently, leaving more headroom on the system’s CPU for image processing and the computer-vision model’s work.

For data processing the Pandas [42 ] module is widely used, providing simple and intuitive commands to analyse and manipulate data. This module performs operations on the CPU, and for extreme data volumes data processing can become time-consuming. In addition to Pandas, several alternative modules have been built that are similarly or marginally better optimised, the closest in operating principle and writing style being cuDF [44 ]. cuDF is essentially most similar to Pandas, but uses the GPU instead of the CPU for data processing, which – especially when handling large data volumes – gives a significant time advantage over Pandas, because processing takes place with much greater parallelism than on the CPU. At the time of writing, the Pandas module is in use, because the thesis author has broader experience working with this module. There are later plans to switch to development with cuDF for the purpose of software optimisation.

For neural networks there are several platforms and frameworks: TensorFlow [45 ], Keras [46 ], PyTorch [47 ], Scikit-Learn [48 ], of which the most popular and widely used are TensorFlow and PyTorch. This work used PyTorch, because compared with TensorFlow, PyTorch’s syntax is more Pythonic, which makes it easier to integrate with Python objects and makes error messages easier to understand. The thesis author also has prior experience developing with PyTorch.

Computer-vision neural network model

There are several models available for object detection, differing from one another in accuracy, input resolution, size, and speed. Choosing a model means accounting for the system’s throughput: more capable systems such as servers can run massive, accurate models in parallel with minimal impact on input-processing speed. Smaller systems such as laptops and embedded systems, by contrast, are better suited to running smaller, faster models.

ModelMax input (px)Speed (ms)Accuracy (mAP)
FasterRCNN [49 ]1000x60019873.2 (PASCAL VOC 2007)
YOLOv5 [50 ]1280x128026.2 (V100 GPU)72.7 (MSCOCO)
MobileNet SSD V2 [51 ]320x320200 (Google Pixel 1)22.2 (MSCOCO)
EfficientDet [52 ]1536x1536153 (V100 GPU)55.1 (MSCOCO)

Overview of the parameters of the most popular models

For this system, the model had to be chosen with the best accuracy-to-speed ratio in mind. The chosen model also had to be capable of processing information from several video inputs at once (see Table 4).

The YOLOv5 architecture network was selected to solve the task; it was built by Ultralytics [53 ] as an evolution of YOLOv4 [54 ] and is faster and more accurate than its predecessors. It is also a compact model that is easy to run on embedded systems. The model operates on the single-pass frame-processing philosophy (You Only Look Once) – the image passes through the network only once (unlike other models), and within that single pass object detection and classification happen together, which is why the model’s detection operation is also faster (see Figure 10).

Figure 10
Figure 10.

Architecture of the YOLOv5 models [55 ]

The model consists of three parts: backbone, neck, and head.