Object Extraction Focused on Holding Action of Human Being

In recent years, robots have become a part of the household. To facilitate the greater integration of robots into human life, it has become necessary to develop high-level recognition functions for robots. We believe that the usage of objects, i.e., the mutual relationship between a human and an object, should be a feature of the human recognition function. In this study, we detect a basic interaction between a human and an object, and develop an object recognition method that uses an image and the interaction (action) captured when the interaction occurs. The general flow of image recognition processing is as follows: feature extraction, learning, and recognition. There is a problem, however: occlusion and similar colors make feature extraction difficult. In this paper, we focus on the interaction between a human and an object as a way of obtaining an image that is effective for feature extraction. We discuss an object extraction method that uses the joint position information of the human and depth information, and that is robust against similar colors and occlusion.


Introduction
Recent years have seen considerable progress in the development of robots and in facilitating their coexistence with humans. To achieve the latter, it has become necessary to develop high-level recognition functions such as object recognition. Object recognition by a computer in a complicated environment is difficult because it requires extracting effective object features to detect and recognize an object using only physical factors. On the other hand, humans obtain much more information through vision, and they also recognize objects through association and personal experience. This association is the interaction between a human and an object.
In this paper, an interaction refers to an interactive action, i.e., a specific action of a human touching an object. For example, consider a pencil, which is a stick, lying on a desk. The action of "writing" occurs when a human holds the pencil and moves its tip in contact with a piece of paper; the human then recognizes the stick to be a pencil. Therefore, a human's interaction with an object provides important information for recognizing the object. We believe that the usage of an object, i.e., the mutual relationship between a human and an object, should be a feature of the human recognition function. In this study, we detect a basic interaction between a human and an object, and develop an object recognition method that uses an image and the interaction (action) captured when the interaction occurs. In this paper, we focus on the holding action of a human being, and discuss an object extraction method that uses joint position information and depth information.

Conventional Method and Problems
The general flow of image recognition processing is as follows: input data, feature extraction from images, learning, and recognition. In this case, images are used as the input data. Feature extraction and learning from the input image have a great influence on recognition (1). Feature extraction depends on the input data: it is difficult to extract features from an image when similar colors are present or when occlusions cause the target object and other objects to overlap. For feature extraction, an image in which only the recognition target appears is appropriate; therefore, it is necessary to extract only the recognition target. Object extraction by edges and colors is used as a conventional method. However, with edges and colors it is difficult to extract an object under occlusion, and similar colors cause further problems.
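As a point of reference, the following is a minimal sketch of such conventional extraction by color and edges, written in Python with OpenCV; the file name and the HSV range are placeholder values chosen for illustration, not part of the original method. It also illustrates why the approach breaks down: when the background shares the object's color, or when another object occludes the target, the resulting mask no longer corresponds to the object alone.

```python
import cv2
import numpy as np

# Minimal sketch of conventional object extraction by color and edges.
# "scene.png" and the HSV range below are placeholders for illustration.
img = cv2.imread("scene.png")                      # BGR input image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Color-based extraction: keep pixels inside an assumed HSV range.
color_mask = cv2.inRange(hsv,
                         np.array([100, 80, 50]),
                         np.array([130, 255, 255]))

# Edge-based extraction: Canny edges followed by contour detection.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Under occlusion or with similar background colors, neither the color mask
# nor the edge contours isolate the target object reliably.
extracted = cv2.bitwise_and(img, img, mask=color_mask)
cv2.imwrite("extracted.png", extracted)
```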
In a prior study (2), we performed object extraction based not on edges and colors but on the joint position information of a human and depth information obtained from an RGB-D sensor. This method can extract only the object, but after a holding action it is difficult to handle occlusions owing to the hand.
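For illustration, the sketch below shows one step that this kind of extraction relies on: removing the detected human area from the depth frame before the object is segmented. It assumes the 512 × 424 depth frame and the per-pixel body-index frame have already been read from the Kinect v2 into NumPy arrays; in the Kinect v2 body-index frame, a pixel that belongs to no tracked body carries the value 255. The function name and the choice of zeroing the human pixels are assumptions made here; the procedure in the next section describes painting them white.

```python
import numpy as np

# Sketch: remove the detected human area from the depth frame before
# object extraction. Assumes depth_frame (uint16, 424x512) and body_index
# (uint8, 424x512) were already obtained from the Kinect v2.
NO_BODY = 255   # Kinect v2 body-index value for "no tracked body"

def eliminate_human_area(depth_frame: np.ndarray, body_index: np.ndarray) -> np.ndarray:
    """Return a copy of the depth frame with the human pixels blanked out."""
    cleaned = depth_frame.copy()
    cleaned[body_index != NO_BODY] = 0   # hypothetical choice: zero them out
    return cleaned
```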

Proposed Method
The action of "having" or "holding" is performed when a human touches an object.Therefore, it is expected that the object exists near the hand.However, occlusions resulting from the hand influence the following recognition processing when an object is extracted after an action.In this paper, we assumed that the target object exists near the hand.We focus on the action of "having" or "holding" just before the scene.We believe that we can acquire an image without occlusions owing the hand.This is effective for recognition by extracting an object just before an action.In this study, we performed an object extraction around the hand using an RGB-D sensor (Microsoft Corporation, Kinect for Windows v2).Fig. 1 and 2 show, respectively, an overview of the proposed method and the appearance of the Kinect v2.Table 1 and 2 list the specifications for the Kinect v2 and the PC that were used for our experiments.
The processing procedure for the object extraction method proposed in this paper is shown below. The resolution of the image is 512 × 424 pixels because we use a depth image.

I. Acquire the depth image
II. Eliminate the human area
III. Acquire the joint coordinates of the human
IV. Calculate the end-of-hand coordinates by using the hand and elbow coordinates
V. Acquire the depth information at the hand coordinates
VI. Segment by threshold processing using the depth information mentioned above
VII. Acquire the region of interest (ROI) around the end-of-hand coordinates
VIII. Perform contour detection in the ROI
IX. Calculate the area of the biggest contour
X. Extract the object by threshold processing using the area

First, we used the Kinect library to acquire the depth image. At the same time, we detected the human and eliminated the human area by painting the pixels of the detected human area white. Next, we acquired the skeleton coordinates of the human and extracted the hand and elbow coordinates from them. We then calculated the end-of-hand coordinates P(px, py) from the elbow coordinates (x1, y1) and the hand coordinates (x2, y2) using equations (1) and (2); the remaining steps are illustrated in the sketch at the end of this section.

We believe that we could not segment the human and the object because, in direction ④ of the sitting state, the object region was completely included in the human region. Therefore, the object region was detected as part of the human region and was eliminated together with it, and thus we could not extract the object.
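As a complement to the procedure listed above, the following is a minimal sketch of steps IV–X in Python with OpenCV and NumPy. It assumes the depth image (with the human area already eliminated) and the elbow and hand joint coordinates mapped into depth-image space are given. Equations (1) and (2) are not reproduced in this text, so the end-of-hand point is estimated here by extending the elbow-to-hand vector by an assumed factor; the ROI size, depth margin, and area threshold are likewise assumed values for illustration, not the values used in the experiments.

```python
import cv2
import numpy as np

# Assumed illustration parameters (not the values used in the experiments).
EXTEND = 0.3          # extrapolation factor beyond the hand joint (assumed)
ROI_HALF = 60         # half-size of the ROI in pixels (assumed)
DEPTH_MARGIN = 100    # +/- depth band around the sampled depth, in mm (assumed)
MIN_AREA = 500        # area threshold for the extracted contour (assumed)

def extract_object(depth, elbow, hand):
    """Sketch of steps IV-X: depth is the 424x512 depth image with the human
    area removed; elbow=(x1, y1) and hand=(x2, y2) are joint coordinates
    already mapped into depth-image space."""
    h, w = depth.shape
    x1, y1 = elbow
    x2, y2 = hand

    # IV. End-of-hand coordinates P(px, py): estimated here by extending the
    # elbow-to-hand vector (a stand-in for equations (1) and (2)).
    px = int(np.clip(x2 + EXTEND * (x2 - x1), 0, w - 1))
    py = int(np.clip(y2 + EXTEND * (y2 - y1), 0, h - 1))

    # V-VI. Threshold segmentation around the depth value sampled at P.
    d = int(depth[py, px])
    mask = ((depth > d - DEPTH_MARGIN) & (depth < d + DEPTH_MARGIN))
    mask = mask.astype(np.uint8) * 255

    # VII. Region of interest around the end-of-hand coordinates.
    x0, y0 = max(px - ROI_HALF, 0), max(py - ROI_HALF, 0)
    x3, y3 = min(px + ROI_HALF, w), min(py + ROI_HALF, h)
    roi = mask[y0:y3, x0:x3]

    # VIII-X. Contour detection, largest contour, area threshold.
    contours, _ = cv2.findContours(roi, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    if cv2.contourArea(largest) < MIN_AREA:
        return None
    obj_mask = np.zeros_like(roi)
    cv2.drawContours(obj_mask, [largest], -1, 255, thickness=-1)
    return obj_mask   # binary mask of the extracted object inside the ROI
```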

Conclusion
In this study, we assumed that the target object exists around the hand. We focused on the scene just before the "having" or "holding" action, and tried to acquire an image without occlusions resulting from the hand. This is effective for recognition because the object is extracted just before the action. Object extraction was performed based on the joint position information of the human and on depth information. As a result, we could acquire an image without occlusions caused by the hand. Therefore, the joint position information of a human and depth information are effective for object extraction. However, depending on the size of the object, it sometimes protruded from the extraction region. Therefore, it is necessary to adjust the range of the extraction region and the threshold value of the area. In our future studies, we will examine these values.