Real-time Smart Situation Awareness based on Camera Array and Machine Learning

This paper proposes a real-time situation awareness system based on a surveillance camera array and machine learning techniques. The surveillance camera array consists of multiple PTZ cameras which are divided into two classes: large area cameras and local area cameras. Large area cameras are used to detect moving targets as a whole (e.g., persons, cars), and local area cameras are used to detect specific regions of the moving targets (e.g., faces, car license plates). The behaviors of the detected moving targets are classified as normal or abnormal by combining short-term and long-term video sequence analyses. Furthermore, the detected faces are fed to a face recognition subsystem to determine whether the person is known. The experimental results show that the proposed situation awareness system is effective and useful.


Introduction
Situational awareness or situation awareness (SA) is the perception of environmental elements and events with respect to time or space, the comprehension of their meaning, and the projection of their future status. Situation awareness has been recognized as a critical, yet often elusive, foundation for successful decision-making across a broad range of situations, many of which involve the protection of human life and property, including aviation, air traffic control, ship navigation, health care, and emergency response. To surveil a wide area, such as tracking a vehicle traveling through a network of roads in a city or analyzing the overall activities happening in an airport, multiple sensors such as electro-optical (EO) or infrared (IR) cameras are used because the view of a single camera is finite and limited by scene structures. Many situation awareness systems based on multiple EO/IR sensors have been developed (1)(2)(3)(4)(5)(6) . Situation awareness systems are based on a multidisciplinary field related to computer vision, pattern recognition, signal processing, communication, embedded computing, and image processing. The key technologies relevant to situation awareness systems from the perspective of computer vision are: (i) multi-camera calibration, (ii) topology of a camera network, (iii) object re-identification, (iv) multi-camera tracking, and (v) multi-camera activity analysis.
Camera calibration is a fundamental problem in computer vision and is indispensable in many video surveillance applications. There is a vast amount of literature on calibrating camera views with respect to a 3D world coordinate system (7)(8)(9) . The topology of a camera network identifies whether camera views are overlapped or spatially adjacent and describes the transition time of objects between camera views. Knowledge of the topology is important in tracking objects across camera views for object re-identification (10)(11) . Re-identification is the process of matching two image regions observed in different camera views and recognizing whether they belong to the same object, purely based on appearance information without spatiotemporal reasoning (12)(13) . Multi-camera tracking is used to track the behavior of objects across camera views (14)(15) . Multi-camera activity analysis can automatically recognize different types of activities and detect abnormal activities in a large area by fusing information from multiple camera views (16)(17) . Activity analysis and object recognition are key components in any situation awareness system. This paper proposes a real-time situation awareness system based on a camera array and machine learning techniques that performs both behavior analysis and target recognition. The remainder of this paper is organized as follows. Section 2 introduces the architecture of the proposed system. Section 3 discusses the behavior recognition and object recognition capabilities. Section 4 describes the experimental results, and lastly, Section 5 presents the conclusions and future work.

System Architecture
The overall architecture of the system is shown in Fig. 1. The system employs a camera array with multiple PTZ (pan-tilt-zoom) cameras which are divided into two classes: large area cameras (zoomed out) and local area cameras (zoomed in). The videos from the camera array are fed to the change detection sub-system, which detects the targets (e.g., persons, cars). If the detected target is a person, the local area camera is controlled to zoom in and detect the person's face. The image of the person's face is then sent to the face recognition sub-system. If the detected face is unknown, it is added to the face database. Simultaneously, the moving target is tracked using template matching, and its trajectory is extracted as motion information. If the motion pattern is not recognized in the behavior database, it is recorded as a new behavior. The situation semantics generation sub-system generates the scene semantics based on the objects and behaviors the system recognizes.
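The template matching step of the tracking pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes grayscale images as NumPy arrays and uses an exhaustive sum-of-squared-differences (SSD) search, whereas a real tracker would restrict the search window to the target's previous location.

```python
import numpy as np

def match_template(frame, template):
    """Exhaustive template matching by sum of squared differences (SSD).

    Returns the (row, col) of the top-left corner of the best match.
    A minimal sketch of the template-matching tracking step; a practical
    system would search only a neighborhood of the last known position.
    """
    fh, fw = frame.shape
    th, tw = template.shape
    best_ssd, best_pos = np.inf, (0, 0)
    for r in range(fh - th + 1):
        for c in range(fw - tw + 1):
            # SSD between the template and the candidate window
            ssd = np.sum((frame[r:r + th, c:c + tw] - template) ** 2)
            if ssd < best_ssd:
                best_ssd, best_pos = ssd, (r, c)
    return best_pos
```

In practice the per-frame match position, accumulated over time, yields the trajectory that is compared against the behavior database.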

Implementation
The implementation of the proposed system contains (i) behavior recognition, (ii) face recognition, and (iii) situation semantics generation.

Motion Behavior Recognition
The behavior recognition process flow is shown in Fig.  2 and is summarized in the following.
(1) The target detection is based on YOLO (18) . YOLO is a pretrained CNN object detector which uses the COCO (19) database. COCO is a large-scale object detection, segmentation, and captioning dataset, containing 80 object categories. Fig. 3 shows some examples of detection results at an airport. The detected objects are marked by rectangles.
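A detector such as YOLO typically emits overlapping candidate boxes that must be filtered before tracking. The following sketch shows the standard non-maximum suppression (NMS) post-processing step; the boxes, scores, and threshold here are illustrative, not values from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box and drop boxes overlapping it heavily."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```

The surviving boxes are the detected targets that are passed on to tracking and, for persons, to the face detection stage.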

Face Recognition
The face recognition sub-system contains (i) face detection, (ii) face feature detection, and (iii) face recognition.
Once a human target is detected, the system uses the local area camera to detect the face. The face detection is based on MTCNN (20) , a deep cascaded multi-task framework that exploits the inherent correlation between detection and alignment to boost their performance.
The face feature detection employs VGGFace2 to generate face embeddings. VGGFace2, which has been trained on 3.31 million images of 9131 subjects, accepts 224x224 input images and outputs a 2048-dimensional face embedding (21) .
The embeddings are then normalized. The normalized face embeddings for the same person are averaged to produce a standard face embedding for that person. The face recognition step calculates the cosine distance between the input face embedding and each standard face embedding. If the distance is below a threshold, the input is considered a match.
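The matching step described above can be sketched as follows. This is an illustration, not the paper's code: the 3-dimensional vectors and the 0.4 threshold are placeholders (the paper does not state its threshold), and a real gallery would hold 2048-dimensional VGGFace2 embeddings.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 0 for identical directions, up to 2 for opposite."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def recognize(embedding, gallery, threshold=0.4):
    """Match an input embedding against a gallery of standard embeddings.

    gallery maps a person's name to their averaged, normalized standard
    embedding. Returns the best-matching name, or None if no distance
    falls below the threshold (i.e., the face is unknown).
    """
    best_name, best_dist = None, threshold
    for name, standard in gallery.items():
        d = cosine_distance(embedding, standard)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name
```

Returning None corresponds to the "unknown face" case, which triggers adding the face to the database.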

Situation Semantics Generation
Situation semantics generation attempts to provide a solid theoretical basis for reasoning about common-sense and real-world situations. This project generates the semantics of a video stream from the camera array. For example, the video stream could show that a person is breaking a car window. The system analyzes the video and generates the situation semantics: "A person is breaking a car window". Situation semantics generation employs SimpleNLG (22) , the most widely used open-source natural language generation (NLG) library. SimpleNLG is used to generate grammatically correct English sentences. For example, if the system detects a person in the input video, it generates the word "person", and if the system further detects a change in the person's location via motion pattern analysis, it generates another word, "walk". When SimpleNLG receives the words "person" and "walk", it can generate a sentence like "a person is walking". Furthermore, based on the face recognition result, the system can identify who is breaking the car window.
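SimpleNLG itself is a Java library, so as a stand-in the word-to-sentence idea can be illustrated with a minimal template-based generator in Python. The verb table and phrasing here are assumptions for illustration only; SimpleNLG performs this realization (tense, agreement, articles) far more generally.

```python
# Hypothetical stand-in for SimpleNLG's sentence realization: combine the
# detected subject word and verb word into a present-progressive sentence.
PRESENT_PARTICIPLE = {"walk": "walking", "break": "breaking", "run": "running"}

def generate_sentence(subject, verb, obj=None):
    """Turn detected words (e.g. "person", "walk") into an English sentence."""
    parts = ["a", subject, "is", PRESENT_PARTICIPLE.get(verb, verb + "ing")]
    if obj:
        parts += ["a", obj]
    return " ".join(parts).capitalize() + "."
```

For example, the detected words "person" and "walk" yield "A person is walking.", matching the behavior described above.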

Experiment Results
This project's system is built on a Windows 10 platform using the Python programming language. Fig. 5 shows the camera array, which contains two Wanscam metal PTZ IP dome outdoor cameras. These wireless PTZ cameras have full HD 1080p (1920x1080) resolution and night vision up to 80 meters.

Behavior recognition experiment:
To develop the behavior recognition algorithm, 12 scenarios involving a person and a car in a parking lot were developed. They are summarized in Table 1 and illustrated in Fig. 6. Each scenario was recorded 10 times. Fig. 8 shows part of the tracking result. The first column is the frame number. The second, third, and fourth columns display information about the objects detected in each frame; each entry is a tuple of the object's label, center, width, and height.
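Since Fig. 8 is not reproduced here, the exact log layout is an assumption, but a per-frame record of the kind described (frame number followed by label/center/width/height tuples) could be parsed with a small helper like this:

```python
import re

# Assumed line format (hypothetical, modeled on the description of Fig. 8):
#   "12 (person,(320,240),60,120) (car,(500,300),180,90)"
TUPLE_RE = re.compile(r"\((\w+),\((\d+),(\d+)\),(\d+),(\d+)\)")

def parse_tracking_line(line):
    """Parse one tracking-log line into (frame_number, list of objects)."""
    frame_no = int(line.split()[0])
    objects = [
        {"label": m[0],
         "center": (int(m[1]), int(m[2])),
         "width": int(m[3]),
         "height": int(m[4])}
        for m in TUPLE_RE.findall(line)
    ]
    return frame_no, objects
```

Parsed records of this form are what the trajectory extraction and behavior matching stages would consume.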
Face detection and semantics generation experiment: Fig. 9 (a)-(c) shows the face image database for Person 1, Person 2, and unknown persons. Fig. 9 (d) shows the face recognition rate for test image sets 1, 2, and 3. Test image set 1 contains 10 images of Person 1, test image set 2 contains 10 images of Person 2, and test image set 3 contains 10 images of unknown persons. Fig. 10 shows the experiment that uses the camera array to detect targets. Fig. 10 (a) shows a person and three cars detected by the large area camera; (b) shows a person's face detected by the local area camera; (c) shows the semantics generated by the system based on the detection result.

Conclusions
This paper describes a real-time situation awareness system based on a camera array and deep learning techniques. The average processing time for target detection and recognition as well as face detection is 833 milliseconds per frame. The scene understanding step, i.e., the generation of situation semantics such as "The system detected a person and three cars", takes 306 milliseconds on average. In total, the system takes 1139 milliseconds to process and understand one frame. This can potentially be improved by using a computer cluster and GPUs. Moreover, the system currently employs a small face database and a small behavior database, both of which can be expanded in future work.

Copyright
This paper, submitted to ICIAE, is original and unpublished research. It has not been submitted to any other conference or journal.