Detection of Different Behavior from the Majority in a Public Space

A measurement method was proposed for observing the movements of persons in a public space such as a railway station. The method was based on the fact that a person occupies a 3D region, so that his or her sequential images captured by a security camera form a wormhole in the spatiotemporal image space. The direction of the majority of those wormholes was estimated by a projection algorithm and taken as the movement direction of the majority. After the majority wormholes were filled, the remaining ones were searched for and recognized as persons behaving differently. A high-accuracy person extraction algorithm was also developed to prevent the enlargement of wormholes caused mainly by persons' shadows. An inter-frame difference technique was adopted to update the background quickly. The method was successfully applied to a security camera video to detect behavior different from that of the majority.


Introduction
The importance of securing safety in crowded public spaces has been increasing in recent years. A large number of security cameras have also been introduced into those places. The explosion of video data to be processed has motivated the development of methods that recognize human behavior and automatically detect persons who behave suspiciously, and many methods have been developed (1-4). Most of them are based on person tracking (5)(6)(7), where one-by-one detection and/or recognition of persons is required before the tracking. They often adopt a background subtraction algorithm for person detection in video scenes (7)(8)(9).
The background subtraction, however, has some problems to be solved, especially in producing and updating the background image. In order to extract a person exactly, the background image should be produced faithfully and updated dynamically to prevent false area extraction due to, for example, daily changes in the lighting condition. Some of the methods use the mode of the pixel density in a time window as an estimate of the pixel density in the background image (here, we call this the MODE algorithm). A wider window gives a more stable extraction but slower updating. The slow updating causes poor followability for changes in the lighting condition (10)(11)(12). Unfortunately, the MODE algorithm performs poorly in crowded scenes, where the mode does not necessarily represent the background.
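The weakness of a mode-based background in a crowd can be illustrated with a minimal sketch. The function name `mode_background`, the grayscale toy frames and their density values are assumptions made for illustration only; the cited methods may differ in detail.

```python
from collections import Counter

def mode_background(frames):
    """Estimate the background as the per-pixel mode over a window of frames.

    Each frame is a list of rows of grayscale densities.
    """
    height, width = len(frames[0]), len(frames[0][0])
    background = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            # most_common(1) yields the most frequent density in the window.
            row.append(Counter(values).most_common(1)[0][0])
        background.append(row)
    return background

# Toy 1x2 scene whose true background density is 200 at both pixels.
# Pixel 0 is occluded in 1 of 5 frames, pixel 1 in 3 of 5 frames (crowded).
window = [
    [[200, 90]],
    [[200, 90]],
    [[50, 200]],
    [[200, 90]],
    [[200, 200]],
]
bg = mode_background(window)
print(bg)
```

In this toy window the rarely occluded pixel recovers the true density 200, whereas the frequently occluded pixel is assigned the density 90 of the occluding object, which is the failure mode described for crowded scenes.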
Thus, the background subtraction algorithm itself extracts persons with low precision in dense situations, and the conventional person detection methods are unable to process overlapping persons separately. A new person detection method that is robust in crowded scenes is therefore required.
Here, we propose a new method, Detecting Different Behavior from the Majority (DDBM), which is robust in crowded scenes such as a railway station. The DDBM adopts a wormhole algorithm in a spatiotemporal space (13)(14)(15) and represents the movements of persons as a bunch of wormholes. A projection technique enables us to obtain the average direction, which corresponds to the movement direction of the majority of people. A person behaving differently is detected as a wormhole whose direction differs from the average. The separability among the wormholes controls the performance of the DDBM in detecting persons one by one. Therefore, an improved algorithm for the background subtraction is also introduced into the DDBM. The algorithm, Background Quick Updating (BQU), is characterized by its high-speed updating of the background. The BQU effectively prevents the enlargement of the wormholes due to persons' shadows.
In this paper, we describe the principles and procedures of the DDBM and the BQU. Results of applying the DDBM to an actual crowded scene are also shown.

Methodology
The proposed method DDBM is based on the fact that a person occupies a 3D region, so that his or her sequential images captured by a camera form a wormhole in the spatiotemporal image space. This means that those wormholes may touch but never merge. They are separated from each other even in crowded scenes. Figure 1 shows the schematic diagram of the proposed method DDBM. Video data captured by a camera at a fixed point are stored and used for the processing. First, we separate moving persons, as areas having value 0, from the unmoved background, which has value 1. This process is applied to every scene in the video. Then we pile up these binary images to produce a spatiotemporal space. After the major direction of the moving persons is searched for in the spatiotemporal space, the wormholes with that direction are filled with value 1. The residual wormholes indicate that behaviors different from those of the majority of people have been detected.

A Wormhole Algorithm
We assume that persons move straight, at least within the field of view of the camera used. Figure 2 (a) shows, in simplified form, the movements of persons extracted as binary regions from the video scenes. Under this assumption, they form straight wormholes in the time sequence of the binarized scenes, i.e. a spatiotemporal space, as shown in Fig. 2 (b). The direction of the majority of the wormholes is searched for, and the major wormholes are filled as shown in Fig. 2 (c). The residual wormholes are detected as behaviors different from those of the majority, as shown in Fig. 2 (d).
The spatiotemporal model $S_l$ has a size of $W_l \times H_l \times D_l$ in the coordinates $(X, Y, T)$, as shown in Fig. 3. Let another cubic model $S_c$ be defined in $S_l$; the center of the cube $S_c$ coincides with that of $S_l$. In order to obtain the direction of the majority of the wormholes, we produce projection images for various directions. The major direction is obtained as the one that gives the maximum contrast among them. The maximum contrast is achieved when the projection direction matches that of the majority of the wormholes. Let $\alpha$, $\beta$ and $\gamma$ be the rotation angles about the axes $X$, $Y$ and $T$, respectively. The pixel density of the projection image $I_p(u, v)$ is calculated by accumulating $g_w(u, v)$ along $w$, where $g_w(u, v)$ represents the voxel density at position $(u, v, w)$ in $S_c$. The contrast $C$ of the projection image is defined in (3), where $B$ denotes the number of pixels having value 0 in the projection image $I_p$.
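The idea of finding the majority direction by projection can be illustrated with a minimal sketch in a reduced (x, t) space. Everything below is an assumption made for illustration: the toy tracks, the helper names `make_volume`, `project` and `zero_count`, the integer slopes standing in for the rotation angles, and the use of the number of zero-valued projection pixels as a simple stand-in for the contrast in (3), whose exact form is not reproduced here.

```python
def make_volume(width, depth, tracks):
    """Binary (x, t) volume: 1 = background, 0 = person.

    tracks is a list of (x0, v): start column and velocity in pixels per frame.
    """
    vol = [[1] * width for _ in range(depth)]
    for x0, v in tracks:
        for t in range(depth):
            x = x0 + v * t
            if 0 <= x < width:
                vol[t][x] = 0
    return vol

def project(vol, slope):
    """Mean density along the lines x = u + slope * t, one value per column u."""
    depth, width = len(vol), len(vol[0])
    image = []
    for u in range(width):
        samples = [vol[t][u + slope * t] for t in range(depth)
                   if 0 <= u + slope * t < width]
        image.append(sum(samples) / len(samples))
    return image

def zero_count(image):
    """Number of projection pixels with value 0 (assumed stand-in for contrast)."""
    return sum(1 for value in image if value == 0)

# Three persons walk with velocity +2 (the majority), one with velocity -1.
vol = make_volume(width=40, depth=8, tracks=[(3, 2), (10, 2), (17, 2), (30, -1)])
scores = {s: zero_count(project(vol, s)) for s in range(-2, 3)}
majority = max(scores, key=scores.get)
print(scores, majority)
```

When the projection direction matches the majority velocity, each majority track collapses into a single fully dark pixel, so the score peaks at that direction; the person moving the other way produces only a smaller peak at its own direction.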
The coordinates $(u, v, w)$ are calculated from $(x, y, t)$ by combining the rotation matrices $R_T$, $R_Y$ and $R_X$ with the parallel shift matrices $P_C$ and $P_{-C}$. The angles $\alpha_m$, $\beta_m$ and $\gamma_m$ corresponding to the direction of the majority of the wormholes are obtained by using a steepest descent method. Considering $C$ in (3) as a function of $\alpha$, $\beta$ and $\gamma$, we find the direction of the maximum as
$$(\alpha_m, \beta_m, \gamma_m) = \arg\max_{\alpha, \beta, \gamma} C(\alpha, \beta, \gamma).$$
In the projection image at the angles $\alpha_m$, $\beta_m$ and $\gamma_m$, we call the areas of low pixel density wormhole areas. We define a wormhole area as one that has an extent larger than $E_p$ and a mean density lower than $B_p$. Every wormhole is filled with a polygonal prism whose base corresponds to the wormhole area recognized in the projection image (see Fig. 2 (c)). After they are filled, the residual wormholes are detected and recognized one by one, starting from the most prominent ones, by applying the same procedures as above.
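A rotation about the center of the model can be sketched with homogeneous 4 × 4 matrices as follows. The composition order used here, a shift by $-C$, then rotations about $X$, $Y$ and $T$, then a shift back by $C$, is an assumption, as are the function names and the sign conventions of the rotation matrices; the text above does not fix them.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def shift(dx, dy, dt):
    return [[1, 0, 0, dx], [0, 1, 0, dy], [0, 0, 1, dt], [0, 0, 0, 1]]

def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]

def rot_t(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def to_uvw(point, angles, center):
    """Map (x, y, t) to (u, v, w) by rotating about the model center.

    Assumed order: P_C * R_T * R_Y * R_X * P_-C, angles in radians.
    """
    alpha, beta, gamma = angles
    cx, cy, ct = center
    m = shift(cx, cy, ct)
    for factor in (rot_t(gamma), rot_y(beta), rot_x(alpha),
                   shift(-cx, -cy, -ct)):
        m = matmul(m, factor)
    vec = (point[0], point[1], point[2], 1)
    return tuple(sum(m[i][k] * vec[k] for k in range(4)) for i in range(3))

center = (240.0, 240.0, 240.0)
fixed = to_uvw(center, (0.3, 0.7, 1.1), center)
turned = to_uvw((241.0, 240.0, 240.0), (0.0, 0.0, math.pi / 2), center)
```

Shifting the center to the origin before rotating keeps the center fixed for any angles, so the model turns in place instead of swinging around the coordinate origin.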

Background Quick Updating algorithm
In a fixed camera system, we define pixels having almost the same density in two successively captured images as background pixels, whereas most of the conventional methods adopt the definition that pixels having the mode density in a time window are the background pixels. Figure 4 shows the processes of the BQU algorithm, where $A_{t-1}$ and $A_t$ are successive images and $S$ is a switch for updating the background pixel density. The switch $S$ is closed (ON), and the background pixel is updated, when the Manhattan distance $M_h$ in the RGB feature space between the densities of the pixels at the same position $(x, y)$ in the successive images satisfies
$$M_h(x, y) = \sum_{k \in \{R, G, B\}} \left| A_t^k(x, y) - A_{t-1}^k(x, y) \right| < 3,$$
where the superscript $k$ denotes the color channel. On the other hand, when the Manhattan distance is equal to or greater than 3, the switch $S$ is opened (OFF) and the background is not updated. The image BG is used as the background, and the result of the background subtraction is obtained as FG. In the RGB feature space, we measure the Euclidean distance $E_d$ between a background pixel $BG(x, y)$ and the current observation $A_t(x, y)$, where $x$ and $y$ denote the position of the pixel:
$$E_d(x, y) = \sqrt{\sum_{k \in \{R, G, B\}} \left( A_t^k(x, y) - BG^k(x, y) \right)^2}.$$
We regard a moving person as detected when the distance exceeds a threshold $R$, as
$$FG(x, y) = \begin{cases} 0, & E_d(x, y) > R \\ 1, & \text{otherwise.} \end{cases}$$
The zeros in $FG(x, y)$ produce a part of the wormhole in the spatiotemporal model $S_l$.
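The two distance tests of the BQU can be sketched per pixel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bqu_step`, the choice to copy the current pixel of $A_t$ into the background when the switch is closed, the use of the freshly updated background for the subtraction in the same step, and the threshold value 30 for $R$ are all assumptions.

```python
import math

def bqu_step(prev, curr, background, r_threshold, m_threshold=3):
    """Update the background and extract the foreground for one frame.

    Frames are lists of rows of (R, G, B) tuples. Returns (background, fg),
    where fg holds 0 for a moving person and 1 for the background.
    """
    new_background, fg = [], []
    for y, row in enumerate(curr):
        bg_row, fg_row = [], []
        for x, pixel in enumerate(row):
            # Manhattan distance between the successive frames A_{t-1} and A_t.
            manhattan = sum(abs(c - p) for c, p in zip(pixel, prev[y][x]))
            # Switch S closed (ON): refresh the background pixel.
            # Assumption: the new background value is copied from A_t.
            bg_pixel = pixel if manhattan < m_threshold else background[y][x]
            # Euclidean distance between the observation and the background.
            euclid = math.sqrt(sum((c - b) ** 2 for c, b in zip(pixel, bg_pixel)))
            fg_row.append(0 if euclid > r_threshold else 1)
            bg_row.append(bg_pixel)
        new_background.append(bg_row)
        fg.append(fg_row)
    return new_background, fg

# Toy 1x3 frame: a stable pixel, a pixel crossed by a person,
# and a pixel whose stored background is stale after a lighting change.
background = [[(100, 100, 100), (100, 100, 100), (90, 90, 90)]]
prev = [[(100, 100, 100), (100, 100, 100), (100, 100, 100)]]
curr = [[(101, 100, 100), (30, 40, 50), (100, 100, 101)]]
new_background, fg = bqu_step(prev, curr, background, r_threshold=30)
print(fg)
```

In this toy frame the stale background pixel is refreshed as soon as two successive frames agree, so it is not extracted as a person, while the pixel crossed by a person keeps its old background value and is marked with 0.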

Experiments and Discussion
In order to evaluate the performance of the BQU and the DDBM, a 20-minute video captured at a railway station was processed. The video was divided into color images with a size of 720 × 480 and stored in a folder. A Windows 10 PC with a 3.60 GHz CPU and 32 GB of memory was used for the processing in the experiments. In the experiments, instead of being read from a camera, the images were read sequentially from the folder.

Performance Evaluation of BQU algorithm
The performance of the Background Quick Updating algorithm was evaluated by comparing it with that of the conventional MODE algorithm, especially in extraction accuracy and in eliminating foot shadows. In the MODE, we set the time window to 16.7 [s] (= 1/30 [s] × 500 frames).
Figure 5 shows an image from the processed video, which was captured at a crowded railway station. The red box on the left-hand side indicates the area in which we numerically evaluated the performance of the BQU and the MODE, as shown in Fig. 6. We manually separated persons (black area) from the background (white) as shown in Fig. 6 (a), and used the result as the ground truth. Figures 6 (b) and (c) show the areas extracted as moving persons by the BQU and by the MODE, respectively. The precision and the recall were calculated from those images as shown in Table 1. We see that the BQU performs better than the MODE in both the precision and the recall.
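The precision and the recall of a person extraction can be computed from two binary masks as sketched below. The function name `precision_recall` and the toy masks are assumptions for illustration and are not the data behind Table 1; as in the text, value 0 marks a person and value 1 the background.

```python
def precision_recall(truth, extracted):
    """Pixel-wise precision and recall, treating value 0 (person) as positive."""
    tp = fp = fn = 0
    for truth_row, ext_row in zip(truth, extracted):
        for t, e in zip(truth_row, ext_row):
            if e == 0 and t == 0:
                tp += 1      # person pixel correctly extracted
            elif e == 0 and t == 1:
                fp += 1      # background (e.g. shadow) extracted as person
            elif e == 1 and t == 0:
                fn += 1      # person pixel missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy masks: 4 true person pixels, 3 of them extracted, 1 false extraction.
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
extracted = [[0, 0, 0, 1],
             [0, 1, 1, 1]]
precision, recall = precision_recall(truth, extracted)
print(precision, recall)
```

A shadow extracted together with a person increases the false positives and therefore lowers the precision, which is why shadow removal matters for this comparison.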
On the other hand, the red box on the right-hand side in Fig. 5 indicates the area in which we can see the superiority of the BQU, as shown in Fig. 7. The BQU clearly extracted two persons without shadows at their feet, whereas the MODE extracted them together with the shadows. The shadows decrease the performance in recognizing persons separately. Thus, it was confirmed that the BQU was able to extract moving persons while satisfying the conditions necessary for the DDBM.
The direction searching process was applied to the model, and the projection image that maximized the contrast $C$ in (3) was obtained as shown in Fig. 9. In the process, a lattice point search using the model $S_c$ was applied first, and the lattice point with the maximum contrast was then used as the starting point of the steepest descent method. Specifically, the contrast $C(\alpha, \beta, \gamma)$ was calculated for a total of 27 combinations of $\alpha$, $\beta$ and $\gamma$ selected from 45, 90 and 135 degrees. The steepest descent method gave the maximum contrast at the angles $(\alpha_m, \beta_m, \gamma_m) = (134.02, 44.55, 89.37)$ degrees. Figure 10 (a) shows the 3D view of the model $S_l$ from the direction $(\alpha_m, \beta_m, \gamma_m)$, with the duct on the ceiling removed for a better view of the inside. The black regions are the so-called wormholes produced by the movements of the majority of persons.
In order to recognize a person behaving differently from the majority, we applied the same procedures as in the preceding search for the majority movement to the wormholes remaining after the majority wormholes had been filled. The projection image in Fig. 11 (a) suggested that the dark regions were separated from each other, and therefore we could recognize persons one by one. We selected a patch marked by an ellipse and applied the steepest descent method so that the area of the patch had the minimum extent. Figure 11 (b) shows the minimized patch, binarized with the mean density. The direction angles were obtained as $(\alpha_m, \beta_m, \gamma_m) = (-110.53, -0.37, 39.89)$ degrees. The patch was projected back onto $S_l$, and we obtained the 3D region corresponding to the trajectory of the person in the spatiotemporal space. The 3D region was used as a window in the spatiotemporal space.
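The coarse-to-fine search for the major direction, a lattice point search over 27 angle combinations followed by a gradient-based refinement, can be sketched as follows. The contrast used here is a hypothetical smooth toy function with an arbitrarily chosen peak, not the contrast in (3); the function names, the step size and the numerical gradient are likewise assumptions.

```python
from itertools import product

PEAK = (130.0, 50.0, 85.0)  # hypothetical optimum of the toy contrast

def toy_contrast(angles):
    """Smooth stand-in for the contrast, maximal at PEAK (degrees)."""
    return -sum((a - p) ** 2 for a, p in zip(angles, PEAK))

def lattice_search(contrast, values=(45.0, 90.0, 135.0)):
    """Evaluate the contrast on the 3 x 3 x 3 lattice and return the best point."""
    lattice = list(product(values, repeat=3))
    return max(lattice, key=contrast), len(lattice)

def refine(contrast, start, step=0.25, eps=1e-3, iterations=24):
    """Gradient ascent with a central-difference gradient from the start point."""
    angles = list(start)
    for _ in range(iterations):
        grad = []
        for i in range(3):
            upper = angles[:]
            lower = angles[:]
            upper[i] += eps
            lower[i] -= eps
            grad.append((contrast(upper) - contrast(lower)) / (2 * eps))
        angles = [a + step * g for a, g in zip(angles, grad)]
    return tuple(angles)

start, n_points = lattice_search(toy_contrast)
best = refine(toy_contrast, start)
print(n_points, start, best)
```

The lattice stage only has to place the starting point in the basin of the global maximum; the refinement then converges from there, which is the division of labor described in the text.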

Performance Evaluation of DDBM
We regarded the 3D region as a 2D mask moving at a constant speed in the original video scenes, and considered that a person was recognized when the mask included him or her in the video scenes throughout his or her appearance. Figure 12 shows three scenes of the original video with the mask superimposed. We see that the mask includes the same person in every scene. Thus, the proposed method DDBM correctly recognized a person behaving differently from the majority. The proposed method is currently inefficient. The method took 95 [s] for extracting moving persons and building the 3D spatiotemporal model $S_l$ from the video scenes, 255 [s] for the lattice point search using the model $S_c$, and 3403 [s] for the fine search for the movement direction of the majority of persons using the steepest descent method with 24 iterations. After the model was filled, it took 1103 [s] to obtain the direction of a person behaving differently, and 126 [s] to produce the 2D mask for marking the recognized person.
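The reported processing times can be put in proportion to the 20-minute video with a short calculation. The stage labels below are paraphrases chosen for illustration; the numbers are the ones reported above.

```python
# Processing times in seconds, as reported in the text.
stages = {
    "person extraction and model building": 95,
    "lattice point search": 255,
    "steepest descent (majority direction)": 3403,
    "direction of the different-behavior person": 1103,
    "2D mask production": 126,
}
video_seconds = 20 * 60  # length of the processed video

total = sum(stages.values())
ratio = total / video_seconds
share = stages["steepest descent (majority direction)"] / total
print(f"total = {total} s, {ratio:.2f}x video length, steepest descent = {share:.0%}")
```

The total is 4982 [s], i.e. about 4.15 times the length of the video, and the steepest descent search for the majority direction alone accounts for about 68% of it.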
In the results shown in Fig. 12, the shape of the mask is not similar to that of the person recognized. We consider that this phenomenon is caused by the complexity of the movements of persons walking in a crowded area. They may change their walking direction or stop to avoid colliding with others. Those behaviors make their trajectories deviate from straight lines, whereas the proposed method DDBM assumes linear trajectories. The estimated trajectory patch therefore becomes a wider polygonal prism enclosing the entire complex trajectory. The projection technique may also include a part of the trajectory produced by another person's body.
(c) at time $t_3$
Fig. 12 A person recognized as one taking different behavior from the majority ($t_1 < t_2 < t_3$).

Conclusions
As a preliminary step toward detecting suspicious individuals in crowded places, a measurement method was proposed for observing the movements of persons in a public space such as a railway station. The method was based on the fact that a person occupies a 3D region, so that his or her sequential images captured by a security camera form a wormhole in the spatiotemporal image space. The proposed method was divided into two sub-methods: Background Quick Updating (BQU) and Detecting Different Behavior from the Majority (DDBM). The former introduced an inter-frame difference technique into the production of the background and extracted moving persons well from the original video scenes while avoiding their shadows. The latter built a 3D spatiotemporal model from the binarized extracted areas, in which the movements of persons were recorded as so-called wormholes. On the assumption that persons in a video scene move straight at a constant speed, the direction of the majority of the wormholes was searched for by a projection technique so that the contrast in the projection image reached the maximum. After the wormholes along the major direction were filled, the remaining ones were searched for and recognized as persons behaving differently. The proposed method successfully detected a person crossing the stream of the majority of persons in an actual video captured at a railway station.
Subjects for future study are to improve the method, especially in efficiency; to develop an automated algorithm for separately recognizing the wormholes remaining after the trajectories of the majority of persons are filled; and to extend the method from the detection of persons behaving differently to the finding of suspicious individuals. The development of an intelligent algorithm for the background subtraction is also a future task.

Fig. 1 Schematic diagram of the proposed method DDBM.
(a) person movements (b) wormholes in the spatiotemporal space (c) filling major holes (d) residual wormhole
Fig. 2 Extracted persons in a video scene, their wormholes in the spatiotemporal space and filtering in it.

Fig. 3 Spatiotemporal models $S_l$ and $S_c$.

Figure 8 (b) shows a 3D spatiotemporal model $S_l$ built from the binary images (see Fig. 8 (a)) in which moving persons were extracted by the BQU. Considering the symmetry, the size of the 3D model was set to 480 × 480 × 480 [pixel$^3$], where the size was limited by the smaller of the width and the height of the original images. The smaller model $S_c$, with a size of 270 × 270 × 270 [pixel$^3$], was also defined and used for searching the moving direction of the majority of persons captured in the video scenes.

Fig. 5 An image in the video used.
In order to fill the wormholes of the majority, the projection image (Fig. 9) was binarized by using the mean density as the threshold. The binary image was back-projected onto the model $S_l$ as shown in Fig. 10 (b). The white blocks filled the dark regions in $S_l$, i.e. the movements of the majority of persons, as shown in Fig. 10 (c). The remaining black regions in the model $S_l$ are parts of the trajectories of persons behaving differently from the majority, which are to be detected.
Figure 11 (a) shows the projection image giving the maximum contrast for the remaining wormholes.
(a) A binary image of persons (b) 3D spatiotemporal model
Fig. 8 A binary image of persons extracted by BQU (a) and the 3D spatiotemporal model built by piling the images (b) with coordinates (X, Y, T).

Fig. 9 Projection image along the major direction.
(a) possible trajectories (b) one of them with binarization
Fig. 11 Projection image for finding possible holes of person moving.
(a) at time $t_1$ (b) at time $t_2$

Table 1 Numerical evaluation of the person extraction.