Stage 1 - deadline November 30, 2017

1.1. State-of-the-art on distributed and embedded processing for security video data.

We surveyed the established technologies for parallel and distributed processing, the techniques that increase the number of operations a computing system can perform, such as the CPU (Central Processing Unit), GPU (Graphics Processing Unit), GPGPU (General-Purpose GPU) or DSP (Digital Signal Processor), and the advantages of adapting and implementing such technologies in embedded systems. With the introduction of these technologies, many APIs (Application Programming Interfaces) and libraries that give access to specialized processing resources have appeared across hardware platforms and programming languages. In this category we mention CUDA, OpenCL, OpenCV, C++ AMP, web services, etc. In embedded systems, ARM (Advanced RISC Machine) is a family of RISC processor architectures embedded in many Systems on Chip (SoC) and Systems on Module (SoM). ARM processors have been integrated over time into various products such as feature phones and smartphones, specialized embedded systems, various electronic devices and security processing systems such as IP video cameras, representing a very efficient, low-power solution with a small form factor. The latest technological developments demonstrate and consolidate the fact that embedded systems, although less powerful than classical systems (e.g. PCs, Personal Computers), are a viable solution for processing video data acquired from complex surveillance systems (such as CCTV, Closed-Circuit Television). Such progress is possible thanks to improvements in embedded hardware architectures, which achieve high performance in terms of computing speed and precision while making use of modern high-level development techniques specialized in parallel and distributed processing.

1.2. State-of-the-art on IP video camera platforms.

We surveyed the current embedded technologies, such as IP video cameras, and pinpointed the main worldwide industrial suppliers of video surveillance systems, such as HikVision and Axis. These manufacturers support the development and integration of third-party algorithms into their IP video camera hardware through an API / SDK, namely HikVision through the HEOP program and Axis through the VAPIX technology. In this regard, we identified several IP cameras, considering criteria such as image quality, resolution, frame rate and hardware processing capabilities. However, access to the camera resources is difficult or even impossible under certain conditions. To cope with that, we explored and adopted for the current project a viable alternative based on off-the-shelf embedded processing systems such as the Raspberry PI platform. It provides clear advantages over common IP video camera platforms: (i) integration with video sources can be done very easily, without requiring access to the camera through a proprietary protocol; (ii) once integrated, the Raspberry PI platform can access any video source of the same manufacturer (any other IP video camera on the network); (iii) the platform offers similar or equivalent hardware capabilities and resources, in terms of CPU power and available RAM, compared to the existing integrated camera hardware; and (iv) once the algorithm has been developed and validated, it can easily be deployed to any ARM-based platform (such as IP video cameras).

1.3. State-of-the-art on DROP (distinct areas of interest or patterns) based techniques.

We surveyed the current scientific and technological progress in designing real-time surveillance systems for the detection of regions of interest (distinctive regions of interest or patterns, DROP), specialized for running on dedicated embedded platforms. We identified state-of-the-art methods and algorithms for detecting objects using descriptors based on color, texture and shape information, as well as more advanced techniques using invariant features such as keypoints. Finally, we identified state-of-the-art techniques emerging from nature-inspired algorithms, such as deep learning neural networks. The latter can be employed for tasks such as generic object detection, classification (decision making), or recognition of a particular instance of an object, with very high precision.
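As an illustration of the simplest family mentioned above, color-based descriptors, the following minimal sketch builds a normalised joint RGB histogram and compares two regions by histogram intersection. It is a generic textbook technique, not the project's actual descriptor; the function names, the choice of 4 bins per channel and the toy pixel lists are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return an L1-normalised joint RGB histogram of the given pixels."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..n-1.
        rb, gb, bb = r * n // 256, g * n // 256, b * n // 256
        hist[(rb * n + gb) * n + bb] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalised histograms."""
    return sum(min(x, y) for x, y in zip(a, b))


# Toy regions (hypothetical data): a red query patch and two candidates.
query = [(250, 10, 10)] * 4
mostly_red = [(240, 20, 5)] * 3 + [(10, 10, 240)]
blue = [(5, 5, 250)] * 4

q = color_histogram(query)
print(histogram_intersection(q, color_histogram(mostly_red)))
print(histogram_intersection(q, color_histogram(blue)))
```

A descriptor of this kind is cheap to compute on an embedded CPU, but it ignores texture and shape, which is one reason the text also considers keypoints and deep networks.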

1.4. Development of DROP based techniques for embedded processing.

We identified and implemented the following solutions. First, we adapted existing validated solutions previously developed by the project team, namely: (i) the SCOUTER system, specialized in searching for people within offline video recordings; and (ii) the SCOUTER-DROP system, a customized system for extracting DROP instances (logos or other distinct areas of interest). Further, to provide even higher performance, we implemented techniques based on deep learning neural architectures. We identified and implemented several state-of-the-art DNNs (Deep Neural Networks), such as YOLO and SqueezeNET (AlexNet based), together with a new network architecture, denoted LiviuNET, designed and implemented within the current project. All these networks were pre-validated and tested on the SCOUTER dataset and on the Raspberry PI hardware platform. The operating system installed on the Raspberry PI 3 is Raspbian. Ubuntu Mate was tried initially, but it proved rather unstable and slow to respond to commands. With the standard model (the trained model has 260 MB), YOLO takes about 38 seconds to return the results; with the tiny model (the trained model has 40 MB), it takes about 4 seconds. By distributing one image per available CPU core, a rate of about 1.5 processed images per second is obtained. The entire processing flow is focused on searching for objects (people) as well as for distinct instances of objects, such as DROPs, and the best results so far are obtained with the YOLO framework. These networks are implemented in C/C++ and do not require the third-party processing libraries based on complex frameworks (such as TORCH, OpenCV, or Tensorflow) that the LiviuNET or SqueezeNET networks employ.
The partial results obtained with YOLO on the processing system composed of an IP video camera and the Raspberry PI platform are encouraging and motivate adapting and optimizing YOLO as the final solution for the DROP search and retrieval system. Another important result of the experimental part is the creation of the annotated SPOTTER dataset.
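The idea of distributing one image per available CPU core can be sketched as follows. This is only an illustration in Python using the standard `multiprocessing` module; the networks described above are implemented in C/C++, and `detect_people` is a hypothetical placeholder for a single-threaded detector call, not the project's code.

```python
import multiprocessing as mp
from typing import Dict, List


def detect_people(image_path: str) -> Dict[str, object]:
    """Hypothetical stand-in for one single-threaded detector run on one image."""
    # A real implementation would load the image and run the network here.
    return {"image": image_path, "detections": []}


def process_images(image_paths: List[str], workers: int = 4) -> List[Dict[str, object]]:
    """Run the detector on every image, one image at a time per worker process."""
    if workers <= 1:
        # Serial fallback: same results, no parallelism.
        return [detect_people(p) for p in image_paths]
    with mp.Pool(processes=workers) as pool:
        # chunksize=1 hands out one image per task, so each core works on
        # its own image; pool.map returns results in input order.
        return pool.map(detect_people, image_paths, chunksize=1)
```

From a script, this could be called as `process_images(paths, workers=4)` inside an `if __name__ == "__main__":` guard, with `workers` matching the four cores of the Raspberry PI 3. Per-image latency is unchanged by this scheme; only the throughput improves, because several images are in flight at once.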

We collected and manually annotated a large set of data from a real-world video surveillance system, namely the system used by the CAMPUS research center. It comprises 137,000 annotated frames and 12 different scenarios involving people. The image resolution is 800x600, 1280x720, 1280x800, or 1280x960 pixels (depending on the IP video camera model), at 15 fps.

Stage 2 - deadline June 29, 2018

2.1. Benchmarking and optimization of the proposed approach

The main objective of this activity is the benchmarking and optimization of the proposed methods for resource-limited deployments. In this regard, we studied, implemented and optimized complex algorithms for DROP retrieval, namely methods based on hand-crafted features and methods based on deep learning. The entire system was developed to run in near real time on embedded platforms that impose hardware constraints, such as the Raspberry PI 3, with a low memory footprint and computational cost. The optimizations include: (i) designing the systems for fast processing and fine-grained parallelism, with tasks delegated to LAPACK (Linear Algebra Package) and OpenBLAS (an optimized implementation of the Basic Linear Algebra Subprograms); (ii) building the algorithms against neural network acceleration packages such as NNPACK; (iii) testing a series of good practices for training deep neural networks; (iv) exploiting the cache capacity of a single-chip multi-core processor; and (v) using additional CPU instructions to reduce inference time.
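To make optimization (i) more concrete, one common way of handing convolution work to a BLAS-style matrix multiplication is the "im2col" lowering, sketched below in pure Python. Whether the project used exactly this lowering is an assumption; the function names are illustrative, and the naive `matmul` here only marks the place where an optimized GEMM routine (for example from OpenBLAS) would be called.

```python
from typing import List

Matrix = List[List[float]]


def im2col(image: Matrix, k: int) -> Matrix:
    """Unroll every k x k patch of a 2-D image into one row (stride 1, no padding)."""
    rows, cols = len(image), len(image[0])
    patches = []
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            patches.append([image[i + di][j + dj] for di in range(k) for dj in range(k)])
    return patches


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive matrix product; this is the call a BLAS GEMM routine would replace."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_as_gemm(image: Matrix, kernels: List[Matrix]) -> Matrix:
    """Apply several k x k filters at once: (patches x k*k) @ (k*k x filters).

    As in most deep learning code, this computes cross-correlation
    (the kernel is not flipped).
    """
    k = len(kernels[0])
    # One column per filter, rows ordered like the unrolled patches.
    weights = [[kern[di][dj] for kern in kernels] for di in range(k) for dj in range(k)]
    return matmul(im2col(image, k), weights)


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
box = [[1, 1], [1, 1]]        # sums each 2x2 patch
top_left = [[1, 0], [0, 0]]   # picks the top-left pixel of each patch
print(conv2d_as_gemm(image, [box, top_left]))
```

The benefit of this formulation is that all filters of a layer are evaluated in a single large matrix product, which is the operation that tuned libraries parallelise and vectorise best.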

2.2. System architecture and the experimental demonstrator specifications

In this activity, we present the system architecture and the software and hardware specifications of the demonstrator. First, the hardware architecture is composed of an IP surveillance camera, an embedded system, namely the Raspberry PI 3 Model B, and a switch for two-way communication between the two. The IP camera has an adjustable angle of IR illumination, a varifocal P-iris lens, and HDTV 1080p resolution, and offers multiple H.264 and Motion JPEG streams. The streams can be individually optimized for bandwidth and storage efficiency, with support for edge storage that allows recording video directly to a storage medium such as a microSD/SD/SDHC card, and can be used from a wide range of programming languages in deployments. The embedded processing platform has the advantage of high processing power, helped by its four available cores, which makes it well suited to deep learning based applications, and a stable operating system compatible with multiple software packages. Second, the software specification of the demonstrator includes the Python language, an interpreted, object-oriented, high-level programming language with automatic memory management that is easy to deploy and scale. From the beneficiary's point of view, the demonstrator is a solution that implements the basic functions, namely live visualization of the video stream and live selection of DROPs for query and retrieval.

2.3. Development and implementation of the proposed methods

We identified and studied various deep learning algorithms for person re-identification and DROP retrieval in video streams, deployed on an embedded device. In this regard, we benchmarked a set of deep neural network architectures, namely LiviuNET, GoogleNet, DarkNET, Tiny DarkNET, and Network in Network, which showcase different types of parameterization for fast processing and low memory footprint. We report the accuracy, the processing time and the implementation challenges on embedded devices, as well as the limitations of each approach.
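A benchmark of this kind typically records, for each network, the top-k accuracy and the mean processing time per frame. The sketch below shows one way to compute both with the Python standard library; the `model` callable, which is assumed to map a frame to a list of per-class scores, and the helper names are hypothetical and do not reproduce the project's evaluation code.

```python
import time
from typing import Callable, Sequence, Tuple


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: sample_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def benchmark(model: Callable[[object], Sequence[float]], frames: Sequence[object],
              labels: Sequence[int], k: int = 5) -> Tuple[float, float, float]:
    """Return (top-1 accuracy, top-k accuracy, mean seconds per frame) for one model."""
    scores = []
    start = time.perf_counter()
    for frame in frames:
        scores.append(model(frame))
    elapsed = time.perf_counter() - start
    return (top_k_accuracy(scores, labels, 1),
            top_k_accuracy(scores, labels, k),
            elapsed / len(frames))


# Toy check with three samples and three classes (hypothetical scores).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1, top2, seconds = benchmark(lambda frame: frame, toy_scores, toy_labels, k=2)
print(top1, top2)
```

Running the same function over every candidate network, on the same frames and on the target device, keeps the accuracy and timing figures directly comparable.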

2.4. Algorithms optimization for real-time processing

In this activity, we defined, implemented and optimized the final algorithms for the demonstrator's architecture. Given the processing time, memory footprint and deployment limitations of the benchmarked algorithms, we consolidated the previous achievements and lessons learned into a new deep neural network architecture, named SPOTTER, which builds in particular on GoogleNet optimizations. The SPOTTER deep neural network obtained a top-1 accuracy of 80% and a top-5 accuracy of 95%, comparable to GoogleNet, which obtained a top-1 accuracy of 81% and a top-5 accuracy of 97% under the same experimental protocol. In terms of inference, the SPOTTER network runs at 0.28 seconds/frame, compared with 3.5 seconds/frame for GoogleNet, i.e. about 12.5 times faster. We were therefore able to maintain high accuracy while significantly accelerating the processing, making it suitable for embedded implementations.
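The trade-off reported in this activity can be summarised with a few lines of arithmetic. The sketch below uses only the figures quoted in the paragraph; the dictionary layout and the `compare` helper are illustrative assumptions.

```python
# Figures quoted in the text: (top-1 %, top-5 %, seconds per frame).
REPORTED = {
    "SPOTTER": (80.0, 95.0, 0.28),
    "GoogleNet": (81.0, 97.0, 3.5),
}


def compare(fast: str, baseline: str) -> dict:
    """Derive speed-up, frame rates and accuracy gaps from the reported figures."""
    f_top1, f_top5, f_time = REPORTED[fast]
    b_top1, b_top5, b_time = REPORTED[baseline]
    return {
        "speedup": b_time / f_time,         # how many times faster `fast` is
        "fast_fps": 1.0 / f_time,           # frames per second of `fast`
        "baseline_fps": 1.0 / b_time,       # frames per second of `baseline`
        "top1_gap": b_top1 - f_top1,        # percentage points lost in top-1
        "top5_gap": b_top5 - f_top5,        # percentage points lost in top-5
    }


summary = compare("SPOTTER", "GoogleNet")
print(f"speed-up: {summary['speedup']:.1f}x")
print(f"frame rate: {summary['fast_fps']:.2f} vs {summary['baseline_fps']:.2f} fps")
print(f"accuracy gap: {summary['top1_gap']:.0f} (top-1), {summary['top5_gap']:.0f} (top-5) points")
```

Read this way, SPOTTER trades one to two percentage points of accuracy for a frame rate of several frames per second instead of roughly one frame every few seconds.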


The SPOTTER project is funded under research grant PN-III-P2-2.1-PED-2016-1065, agreement 30PED/2017, and spans 18 months (January 2017 to June 2018).

The proposed technology will be capable of automatically finding the occurrence of a DROP instance by running specialized algorithms embedded directly on the IP video cameras.


  • Advanced DROP retrieval algorithms running on low-resource embedded hardware platforms

  • Techniques for real-time identification and tracking across multiple video sources

  • Algorithms that dynamically adjust their running parameters according to task variables

Contact Us

Address: Splaiul Independentei nr. 313, sector 6, Bucuresti, Romania

Telephone: + 4021-402 48 72
FAX: + 4021-402 48 21

E-mail: bionescu at alpha dot imag dot pub dot ro