
Citation: Hao Wu, Zhao-Wei Chen, Guo-Hui Tian, Qing Ma, Meng-Lin Jiao. Item ownership relationship semantic learning strategy for personalized service robot. Machine Intelligence Research, vol. 17, no. 3, pp. 390–402, 2020. DOI: 10.1007/s11633-019-1206-7.
With the increasing integration of service robots into ordinary family life, users place ever higher demands on the standard of personalized robotic services[1-5]. Personalized service tasks are set up to meet the more specific needs of different users. For example, in the past we asked the robot to fetch a cup, whereas now we ask it to fetch my cup. The robot must not only be able to recognize objects and people; more importantly, it needs to know which cup is mine, which requires it to construct the ownership relationship between a person and their exclusive objects. The service tasks the robot performs are thus upgraded from "taking objects" to "taking a specified user's exclusive objects", and the robot needs knowledge of the ownership relationships between different users and their personal exclusive objects. Therefore, learning the relationship between humans and the items they carry is particularly important.
At present, the recognition of items carried by humans mainly means the recognition of hand-held objects. Most existing datasets for hand-held object recognition[6, 7] were collected from a first-person perspective. In contrast, our task uses the camera mounted on the robot to capture images in real time from a second-person perspective, where the shooting angle is not fixed and occlusion has to be handled. Hsieh et al.[8] proposed a novel ratio histogram to find important color bins for locating hand-held objects and their trajectories via a code book technique. Lv et al.[9] proposed an RGB-D hand-held object recognition method based on heterogeneous feature fusion for a task close to ours. For an open-ended and dynamic real-life environment, Li et al.[10] proposed a hand-held object recognition system that incrementally enhances its recognition ability from scratch during interaction with humans. However, in the above methods, people are assumed to be in a standard standing or sitting posture, which does not hold in complicated environments. Meanwhile, much recent work[11-15] addresses human clothing parsing, which also provides a reference for our work. However, instance-level recognition of the items carried on the whole body has not been addressed, and learning the attribution relationship between humans and the items they carry is a new research topic. This study enriches the recognition task to a new instance level of semantic relationship, which is more in line with how users describe the tasks a robot needs to perform and has higher practical application value.
Motivated by the previous discussion, we propose a simple but effective self-learning framework for the attribution relationships between humans and the items they wear or carry, which consists of three parts: 1) locating the human carrying items and realizing instance recognition; 2) completing user identification; 3) selecting an appropriate ownership relationship learning algorithm to establish the ownership relationship between the service objects and their carried items.
In our proposed method, we combine a human pose estimation model with a global object detection model to locate the items carried by humans. Then, we use a transferred convolutional neural network to extract object features and a distance-metric classifier to recognize the object instance. At the same time, face detection and recognition models are used to identify the service individual. Finally, on the basis of the former two, we propose an autonomous learning strategy for the ownership relationship between humans and the items they carry.
The remainder of this paper is organized as follows. Section 2 presents the overall ownership relationship semantic learning strategy, including the localization of human carrying items, the instance recognition method, the user identification method and the learning strategy of the ownership relationship. Section 3 summarizes the experiments and results. Section 4 concludes the paper.
Obtaining knowledge of the ownership relationship between humans and the items they carry is one of the urgent problems that a robot must solve to deliver personalized service. With this knowledge, the robot can execute specific tasks according to the commands of different users, thereby completing the personalized service accurately and efficiently. For this problem, we design an ownership relationship semantic learning strategy, which is shown in Fig. 1.
Firstly, the human pose estimation model OpenPose[16] is combined with the object detection model SSD (single shot multi-box detector)[17] to obtain the position information and category of the items carried by humans. The attribute of the object instance is then retrieved by the instance identification model. Secondly, we design face detection and recognition models based on convolutional neural networks: the former detects the face region and the latter extracts face features. A support vector machine (SVM)[18] classifier is then used to identify the service object.
Finally, a novel concept of an ownership relationship memory matrix is introduced to describe the ownership relationship, and the ownership relationship learning strategy based on a similarity measure is proposed to guide robots to complete autonomous learning of ownership relationship.
The object detection and localization module is composed of a global object detection model and a human pose estimation model. Fig. 2 shows the pipeline of detecting and localizing the items carried by humans.
Object detection algorithms based on neural networks have improved greatly in accuracy and other evaluation metrics in recent years. Traditional object detection methods[8] suffer from high time complexity and proposal redundancy, which result in low detection accuracy and speed. The two-stage detection algorithms represented by the R-CNN (regions with convolutional neural network features) series[19-21] and the one-stage detection algorithms represented by YOLO (you only look once)[22] and SSD[17] adopt more targeted region proposal strategies, which reduce the time complexity. Among them, the SSD detector performs detection on feature maps from different convolutional layers, improving the detection accuracy of small objects while maintaining a fast detection speed. Therefore, this paper selects the SSD object detection model as the basic global object detector for the items carried by humans.
The global object detection model is built on the SSD network structure. The SSD model leverages feature maps of different scales to regress default boxes, achieving detection of objects at multiple scales and with different shapes. The VGG-16[23] model is selected for feature extraction. We replace the fully connected layers of the original network with convolutional layers and add four convolutional layers as an auxiliary structure after them to complete the network construction. The output maps of six different convolutional layers are convolved with two different 3×3 convolution kernels, producing the class confidences of each category and the default box offsets; the former are used for category classification and the latter for bounding box coordinate regression. Besides, the focal loss[24] function is used to alleviate the low accuracy caused by the imbalance between different categories. The joint loss function (1) is defined to update the parameters.
L(x,c,l,g)=\dfrac{1}{N}\left(L_{conf}(x,c)+\alpha L_{loc}(x,l,g)\right) | (1) |
where N is the number of matched default boxes, Lconf(x, c) is a typical softmax classification loss function and Lloc(x, l, g) is a smooth L1 loss between the predicted box (l) and the ground truth box (g) parameters.
L_{conf}(x,c)=-\sum\limits_{i\in Pos}^{N} x_{ij}^{p}\log\left(\hat{c}_{i}^{p}\right)-\sum\limits_{i\in Neg}\log\left(\hat{c}_{i}^{0}\right)
where
\hat{c}_{i}^{p}=\dfrac{\exp\left(c_{i}^{p}\right)}{\sum_{p}\exp\left(c_{i}^{p}\right)}
L_{loc}(x,l,g)=\sum\limits_{i\in Pos}^{N}\sum\limits_{m\in\{cx,cy,w,h\}} x_{ij}^{k}\,{\rm smooth}_{L1}\left(l_{i}^{m}-\hat{g}_{j}^{m}\right) | (2) |
in which
{\rm smooth}_{L1}(x)=\begin{cases} 0.5x^{2}, & {\rm if}\ |x|<1 \\ |x|-0.5, & {\rm otherwise}. \end{cases} | (3) |
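To make the joint loss in (1)-(3) concrete, the following minimal NumPy sketch computes the softmax confidence term and the smooth L1 localization term for a set of matched default boxes. It uses the plain softmax form of (1); the paper additionally applies the focal loss to address class imbalance, which is not reproduced here. All array names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smooth_l1(x):
    # Piecewise definition from (3).
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def ssd_joint_loss(conf_logits, labels, loc_pred, loc_gt, pos_mask, alpha=1.0):
    """conf_logits: (D, C); labels: (D,); loc_pred/loc_gt: (D, 4); pos_mask: (D,) bool."""
    probs = softmax(conf_logits)
    n_pos = max(int(pos_mask.sum()), 1)                       # N in (1)
    # Confidence loss: -log p of the true class for positives, of background (class 0) for negatives.
    conf_pos = -np.log(probs[pos_mask, labels[pos_mask]] + 1e-9).sum()
    conf_neg = -np.log(probs[~pos_mask, 0] + 1e-9).sum()
    l_conf = conf_pos + conf_neg
    # Localization loss: smooth L1 over (cx, cy, w, h) offsets of positive boxes only, as in (2).
    l_loc = smooth_l1(loc_pred[pos_mask] - loc_gt[pos_mask]).sum()
    return (l_conf + alpha * l_loc) / n_pos
```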
Considering the slow convergence of model training caused by changes in the sample distribution, we introduce the batch normalization (BN) algorithm after the convolutional layers to normalize the data distribution and eliminate the internal covariate shift problem, thereby improving the speed of network training.
However, the global object detection model can only detect the position and category of items in the environment; it cannot judge whether an object is actually carried by a person. Therefore, the human pose estimation model is needed to spatially constrain the positions of the detected objects.
Human pose estimation is the process of detecting the locations of key skeletal points of the human body in an image. Research on human pose estimation mainly follows top-down[25, 26] and bottom-up[27, 28] approaches. The top-down approach first extracts proposal bounding boxes containing human bodies from the image and then estimates the pose of the person in each proposal box, so it is strongly affected by the extraction precision of the region proposals in the first step. The bottom-up approach first detects key points globally and then groups them into individuals according to heat maps, part affinity fields (PAFs) or other cues. In this paper, we use the bottom-up human pose estimation model OpenPose to complete the pose estimation task, which achieves high performance in both speed and accuracy. Combined with the object detection model for locating the items carried by humans, it offers good real-time performance. Fig. 3 illustrates the overall pipeline of the OpenPose model.
The entire model is divided into two branches: the CNN branch in the upper part generates part confidence maps (PCM) and the lower branch generates part affinity fields (PAFs), which are then concatenated as the input of the following stage. The network first uses the first 10 layers of a fine-tuned VGG-19 model to extract features from the input image, obtaining a feature map F from which an initial series of part confidence maps and part affinity fields is generated. Each of the subsequent six stages takes the PCM, PAFs and feature map F produced by the previous stage as input and continuously performs position correction and relationship constraint to obtain the corresponding outputs St and Lt, so as to complete the human pose estimation, where St and Lt are defined in (4) and (5), respectively.
S^{t}=\rho^{t}\left(F,S^{t-1},L^{t-1}\right),\ \forall t\geq 2 | (4) |
L^{t}=\varphi^{t}\left(F,S^{t-1},L^{t-1}\right),\ \forall t\geq 2 | (5) |
where ρt and φt denote the networks used for inference at stage t.
The generated part confidence maps and part affinity fields are passed to the following stage. On this basis, the bipartite matching method from graph theory is used to spatially constrain the positions of the joints, thereby completing the grouping of joints belonging to different people. Fig. 4 shows the part confidence maps generated by the odd-numbered stages of the seven processes. It can be clearly seen that the localization range of each joint point is continuously corrected and the localization accuracy is continuously improved.
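To make the stage-wise refinement in (4) and (5) concrete, the following schematic sketch mirrors the data flow only; the stage networks `rho` and `phi` are placeholders standing in for the trained CNN branches and are not OpenPose's actual implementation.

```python
def multi_stage_pose(feature_map, rho, phi, num_stages=7):
    """feature_map F comes from the first 10 layers of the fine-tuned VGG-19.
    rho and phi map stage index t to the corresponding (placeholder) stage network."""
    S = rho[1](feature_map)                  # initial part confidence maps S^1
    L = phi[1](feature_map)                  # initial part affinity fields L^1
    for t in range(2, num_stages + 1):
        S_prev, L_prev = S, L
        S = rho[t](feature_map, S_prev, L_prev)   # S^t = rho^t(F, S^{t-1}, L^{t-1}), cf. (4)
        L = phi[t](feature_map, S_prev, L_prev)   # L^t = phi^t(F, S^{t-1}, L^{t-1}), cf. (5)
    return S, L
```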
The object detection model first performs global detection of carried items on the environment images collected in real time. However, the detection results suffer from false detections and missed detections, and the detector alone cannot distinguish whether a detected object is actually carried by a person; items with inconspicuous features or small shapes are particularly prone to missed detection. Combining the human pose estimation model with our proposed re-detection algorithm can effectively reduce the probability of these problems occurring.
Firstly, all detected items are matched with the human key points. A detected item in the vicinity of an ankle, neck, head or wrist joint R(s) is assumed to be a carried item and is added to the collection of human carrying items O; the corresponding key point is added to the set of key points with carried items I, and key points without carried items are added to the set N. For missed detections, we propose a re-detection algorithm. Traversing the set N, we take the coordinate of key point ni as the center and the distance between ni and its adjacent key point as the side length to crop an image patch, and re-send the cropped patch R(ni) to the detection model for re-detection D(R(ni)); if a carried item is found, it is added to the set O, otherwise it is determined that there is no carried item at this position. These steps are repeated until all elements in N have been processed, after which the detection and localization of human carrying items is complete and the collection O is obtained. The re-detection algorithm is described in Algorithm 1, with a code sketch given after it.
Algorithm 1. Re-detection algorithm
Input: Image captured by the camera equipped on the robot; detected object proposal set B; human key point set S
Output: Image collection of human carrying items O.
1) Initialize O ← ∅
2) for s in S do
3) for b in B do
4) if b around R(s) then
5) O.append(b), S.remove(s)
6) for s in S do
7) Select image regions in R(s) as input of object detection model
8) if D(R(s)) is True then
9) O.append(R(s)).
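The following minimal Python sketch of Algorithm 1 is given under stated assumptions: `detector(image)` returns candidate boxes, `crop(image, center, size)` cuts a square patch, `near(box, joint)` is the spatial matching test, and each key point exposes `name`, `xy` and `distance_to_neighbor` attributes. All of these are placeholders, not the paper's implementation.

```python
def redetect_carrying_items(image, boxes, keypoints, detector, near, crop):
    CARRY_JOINTS = {"ankle", "neck", "head", "wrist"}   # joints considered in the text
    joints = [k for k in keypoints if k.name in CARRY_JOINTS]
    O, N = [], []                            # carried items / joints with no matched item
    remaining = list(boxes)
    # Step 1: keep detections that lie near the relevant joints.
    for joint in joints:
        matched = False
        for box in list(remaining):
            if near(box, joint):             # detection lies near this joint
                O.append(box)
                remaining.remove(box)
                matched = True
        if not matched:
            N.append(joint)
    # Step 2: re-detect around unmatched joints to recover missed detections.
    for joint in N:
        side = joint.distance_to_neighbor    # distance to the adjacent key point
        patch = crop(image, joint.xy, side)  # R(n_i) in the text
        O.extend(detector(patch))            # D(R(n_i)) in the text
    return O
```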
The instance identification of human carrying items is completed by a category-instance two-stage classification model. The category classification model classifies the category attribute of an item; according to the obtained category, the corresponding instance identification model is retrieved through the model mapping table to determine the instance attribute of the item.
The human visual system actively perceives a scene and uses the attention mechanism to select image regions of interest while ignoring background information. Inspired by this neural structure and behavior, we introduce an attention mechanism to extract the salient region of the image. The image obtained by foreground extraction is used as the input of the instance classification model. Fig. 5 shows the effect of image segmentation using the automatic segmentation algorithm.
Saliency region detection and foreground extraction are preconditions of object instance recognition. On this basis, the instance recognition model extracts the characteristics of the foreground region and a back-end classifier implements the object instance classification. Fig. 6 shows the pipeline of instance recognition.
After the robot captures the real-time environment image, the global object detection and positioning model is used to obtain the collection of human carrying items. Next, the category information of each item provided by the global object detection model is used to query the model mapping table and obtain the instance recognition model of the corresponding category. Meanwhile, the above automatic segmentation algorithm segments the image containing the carried item to obtain its foreground. The instance classification model extracts high-dimensional features from the foreground, which are used to obtain the instance attribute of the item.
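The category-to-instance lookup described above can be sketched as follows. `segment_foreground`, `feature_extractor` (the transferred CNN) and the per-category instance templates are assumptions standing in for the components of Fig. 6; the distance-metric classifier is shown here as a simple nearest-template rule.

```python
import numpy as np

def recognize_instance(region, category, instance_model_table,
                       segment_foreground, feature_extractor):
    # Query the model mapping table with the category from the global detector.
    templates = instance_model_table[category]          # e.g. "cup" -> {instance id: prototype feature}
    fg = segment_foreground(region)                      # saliency-based foreground extraction
    feat = feature_extractor(fg)                          # high-dimensional CNN feature
    # Distance-metric classification: the nearest stored prototype wins.
    dists = {inst: np.linalg.norm(feat - proto) for inst, proto in templates.items()}
    return min(dists, key=dists.get)                      # instance attribute of the item
```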
As an important identification technology, face recognition is widely used and is of important reference significance for our work. Fig. 7 shows the face recognition pipeline adopted in this paper. We use the WIDER FACE[29] and CelebA[30] datasets as training data, effectively reducing the influence of pose and illumination changes on face recognition. A face image normalization method further weakens the influence of pose on face recognition.
In the pre-processing stage, face normalization adjusts the pose and scale of faces of different sizes and angles. Normalizing them to a more standard face image helps to extract facial features and thus enhances the performance of the face recognition model. In this paper, the affine transformation method is used to normalize the face pose and scale normalization is performed by smooth interpolation.
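As an illustration of this normalization step, the following OpenCV sketch aligns a face with a three-point affine transform and smooth (bilinear) interpolation. The reference landmark positions and output size are illustrative assumptions, not the parameters used in the paper.

```python
import cv2
import numpy as np

# Assumed reference positions of left eye, right eye and nose tip in a 96x112 crop.
REF_POINTS = np.float32([[30, 40], [66, 40], [48, 72]])

def normalize_face(image, landmarks, out_size=(96, 112)):
    """landmarks: float32 array with the detected left eye, right eye and nose tip."""
    src = np.float32(landmarks[:3])
    M = cv2.getAffineTransform(src, REF_POINTS)            # pose normalization
    aligned = cv2.warpAffine(image, M, out_size,
                             flags=cv2.INTER_LINEAR)        # smooth interpolation for scale
    return aligned
```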
The multi-task cascaded convolutional networks (MTCNN) model is used as the face detection model, in which the face detection task is implemented by three cascaded sub-networks. The network is trained with the cross-entropy loss for face classification, and the Euclidean distance shown in (6) is used as the regression loss for face bounding box prediction, computed for each sample between the predicted bounding box and the ground truth box.
K_{i}^{box}=\left\| \hat{y}_{i}^{box}-y_{i}^{box} \right\|_{2}^{2} | (6) |
Fig. 8 shows the detection results of the subnetworks P-Net, R-Net, and O-Net in the MTCNN model.
Choosing the proper representation of the ownership relationship can improve the storage and querying efficiency. In this paper, we propose an ownership relationship memory matrix C to represent the ownership relationship, as shown in (7).
{{C}} = \left[ {\begin{array}{*{20}{c}} {{c_{00}}}&{{c_{01}}}& \cdots &{{c_{0m}}}\\ {{c_{10}}}&{{c_{11}}}& \cdots &{{c_{1m}}}\\ \vdots & \vdots &{{c_{ij}}}& \vdots \\ {{c_{n0}}}&{{c_{n1}}}& \cdots &{{c_{nm}}} \end{array}} \right]. | (7) |
In order to facilitate robots to represent the ownership relationship in different learning cycles, the ownership relationship memory matrix is divided into two specific forms: the short-term ownership matrix and the long-term ownership matrix. Therefore, Cij has different meanings. The former indicates the frequency of ownership of the service object i and the carrying item j, while the latter indicates the instance number corresponding to category j of the carrying item belonging to the service object i. Figs. 9 and 10 present the short-term and long-term memory matrix storage structures and corresponding specific instances.
The short-term memory matrix consists of six elements, namely the user's identification (uid) and five item instances, with unsigned integers as the field data type. In Fig. 10, there are three service objects, three cell phones and two caps in the short-term memory matrix. Each column records the ownership frequency of one instance with respect to each service object; for example, the ownership frequencies of phone 1 with respect to the three service objects are 0, 0 and 17. The service object with the largest value is taken as the owner, so the belonging service object of phone 1 in the current learning cycle is service object 2.
After multiple learning cycles, a long-term memory matrix of the ownership relationships of carried items is finally obtained; its storage structure and a specific instance are shown in Fig. 10. The three service objects may own mobile phones, cups, hats and shoes. The instances corresponding to service object No. 0 are mobile phone No. 1, cup No. 0 and shoes No. 0, respectively; because service object No. 0 does not own a hat, the instance number corresponding to the hat is –1. With a simple query statement, the instance number of a certain type of item owned by a specific service object can be retrieved, thereby realizing the query of the ownership relationship between humans and their carried items.
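A small sketch of the two memory matrices is given below. The matrix sizes, the category mapping and the update rule are illustrative assumptions based on the description above: each co-observation increments a short-term frequency, and at the end of a cycle the per-column maximum determines the owner recorded in the long-term matrix.

```python
import numpy as np

num_users, num_instances, num_categories = 3, 5, 2
short_term = np.zeros((num_users, num_instances), dtype=np.uint32)    # c_ij = ownership frequency
long_term = np.full((num_users, num_categories), -1, dtype=np.int32)  # owned instance number or -1

def observe(user_id, instance_id):
    # Every co-observation of a user and an item instance increments c_ij.
    short_term[user_id, instance_id] += 1

def close_learning_cycle(instance_category, instance_number):
    # The user with the highest frequency in a column is taken as that instance's owner;
    # the long-term matrix then records which instance of each category the user owns.
    owners = short_term.argmax(axis=0)                 # ownership vector, cf. (8)
    for inst, owner in enumerate(owners):
        if short_term[:, inst].sum() > 0:              # ignore instances never observed
            long_term[owner, instance_category[inst]] = instance_number[inst]
    return owners

# Example mirroring the short-term instance in the text: phone 1 observed 17 times with user 2.
for _ in range(17):
    observe(2, 0)
print(close_learning_cycle(instance_category=[0, 0, 0, 1, 1],
                           instance_number=[1, 2, 3, 1, 2]))
```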
During the learning process, service robots may have deviations or errors in the learned short-term ownership memory matrix due to environmental disturbances or human subjective factors. Therefore, corresponding mechanisms need to be adopted to exclude such situations. The vector representation of the short-term memory matrix is shown in (8).
{{U}} = {\left[ {{u_1},{u_2}, \cdots, {u_i}, \cdots ,{u_n}} \right]^{\rm{T}}}. | (8) |
In (8), ui represents the user number corresponding to instance i. The ownership vector is calculated from the short-term memory matrix: when each learning cycle ends, the corresponding short-term memory matrix is obtained, the belonging object of each item instance is determined by statistical calculation, and the ownership vector of the service period is thus obtained. Taking Fig. 10 as an example, the ownership vector is calculated as [2 0 1 2 0], whose entries correspond to phones 1–3 and hats 1–2, respectively.
The similarity between different ownership vectors is determined by the angle between them in the vector space. The set of ownership vectors with the highest degree of mutual similarity is returned to the robot as the more reliable ownership vectors, thereby yielding a short-term memory matrix with higher reliability. Using the cosine of the angle between vectors to represent their similarity, the similarity measure of vectors vi and vj is defined in (9).
{\rm{sim}}\left\langle {{v_i},{v_j}} \right\rangle = \cos\theta = \frac{{\mathop \sum \limits_{k = 1}^n v_i^k \times v_j^k}}{{\sqrt {\mathop \sum \limits_{k = 1}^n {{\left( {v_i^k} \right)}^2}} \times \sqrt {\mathop \sum \limits_{k = 1}^n {{\left( {v_j^k} \right)}^2}} }}. | (9) |
In addition, the set of short-term ownership matrices is obtained after learning through multiple learning cycles, which is defined as T. O is the final short-term ownership matrix obtained through the selection algorithm. The selection algorithm is shown in Fig. 11.
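The sketch below illustrates one plausible form of this vector-space selection, under the assumption that a cycle's ownership vector is kept when its average cosine similarity (9) to the other cycles exceeds a threshold; the exact rule in Fig. 11 and the threshold value (0.95 here) are assumptions, not the paper's algorithm verbatim.

```python
import numpy as np

def cosine_sim(vi, vj):
    # Similarity measure from (9).
    return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-9))

def select_reliable(ownership_vectors, threshold=0.95):
    """ownership_vectors: list of equal-length vectors, one per learning cycle."""
    n = len(ownership_vectors)
    kept = []
    for i in range(n):
        sims = [cosine_sim(ownership_vectors[i], ownership_vectors[j])
                for j in range(n) if j != i]
        # Cycles whose vectors deviate strongly from the others (e.g. the 5th in (10)) are excluded.
        if sum(sims) / max(len(sims), 1) >= threshold:
            kept.append(ownership_vectors[i])
    return kept
```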
The advantage of the vector-space similarity representation is that the ownership relationship can be transformed into vector form, which is easy to express and compute, so the reliability problem of the ownership relationship becomes a vector operation problem in vector space. The weight of each ownership vector can be obtained by a simple statistical calculation over the short-term memory matrix, making it a purely quantitative numerical problem.
The robot acquires the instance information and the user identity information through the human carrying item detection and identification module and the service object identification module. The short-term memory matrices of multiple learning cycles are obtained by correlating and updating the relationship between these two modules, and a representative short-term memory matrix is obtained by the ownership relationship selection algorithm. Based on this, the long-term memory matrix is calculated, completing the entire learning process of the relationship between humans and the items they carry.
At the beginning of each learning cycle, the robot uses its video capture device to collect images of the home environment in real time and sends them to the global object positioning and recognition module and the face recognition module, respectively, to obtain the instances of carried items and the identities of the service objects, which are stored in the ownership relation database in the form of a short-term memory matrix. This process is repeated until the learning cycle is completed, and the short-term memory matrix is then persisted in the ownership database. Once all learning cycles have been completed, the obtained short-term memory matrices are filtered by the ownership relationship selection algorithm and the ownership relationship is statistically calculated from the result. A long-term memory matrix with high reliability is then obtained and stored to support ownership relationship queries.
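The following high-level sketch ties the modules together for one complete learning run. Every callable (`capture`, `detect_items`, `recognize_users`, `select_reliable`, `build_long_term`) and the `db` object are placeholders for the components described above, not an actual implementation.

```python
import numpy as np

def learning_run(capture, detect_items, recognize_users,
                 select_reliable, build_long_term, db,
                 num_cycles=5, num_users=4, num_instances=12):
    matrices = []
    for _ in range(num_cycles):
        short_term = np.zeros((num_users, num_instances), dtype=np.uint32)
        for frame in capture():                        # real-time images for one cycle
            items = detect_items(frame)                # instance attributes of carried items
            users = recognize_users(frame)             # identities of service objects
            for u in users:
                for i in items:
                    short_term[u, i] += 1              # update ownership frequencies
        db.persist(short_term)                         # store the cycle's short-term matrix
        matrices.append(short_term)
    vectors = [m.argmax(axis=0) for m in matrices]     # ownership vectors, cf. (8)
    reliable = select_reliable(vectors)                # similarity-based selection
    long_term = build_long_term(reliable)              # statistics over the kept cycles
    db.persist(long_term)
    return long_term
```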
This task simulates the family environment in the laboratory and guides the robot to learn the ownership relationship autonomously. In the experiment, an experimental platform combining a robotic front-end and a back-end server was used. The robot front-end is a TurtleBot mobile robot platform, as shown in Fig. 12; as the video capture platform, it mainly collects real-time environment images. The TurtleBot robot consists of a Kobuki mobile base, a Microsoft Kinect v1, an RPLidar laser scanner and a notebook computer. The RPLidar is responsible for 3D environment scanning to realize obstacle avoidance, and the Kobuki mobile base moves the robot platform according to the instructions. RGB images of 640 pixels × 480 pixels are collected by the Kinect in the family environment and transmitted to the back-end server through the network for the subsequent detection and identification of carried items and identification of the service object, thereby implementing the learning of the ownership relationship. The back-end server configuration is as follows: Ubuntu 16.04 operating system, an i7-8700K processor, 32 GB of memory and an NVIDIA GTX 1080Ti GPU. All tasks except image acquisition in the entire ownership relationship learning process are completed on the back-end server, and data transmission between the front-end and back-end is performed through the network.
In the family environment, there may be only one person or multiple people at the same time. Therefore, experiments on the detection and localization of carried items are carried out for both cases, as shown in Fig. 13. From left to right, Fig. 13 shows the object detection results, human pose estimation results and object localization results. It can be seen that the object detection and localization method proposed in this paper achieves a good effect.
Insufficient training data has always been one of the important factors affecting model performance, and expanding the original dataset by data augmentation can alleviate this problem. The constructed dataset is expanded by data augmentation methods including rotation, scaling, flipping and brightness changes. Fig. 14 shows examples of augmented samples from part of the dataset.
After object detection and localization, the carried items were identified using the object instance identification model presented in this paper. Fig. 15 shows the results of some experiments.
As can be seen from Fig. 15, the model can not only identify the instance attribute of the object correctly but also has a confidence level of 0.98 or more for the identified object. In addition, this task identifies approximately 1 000 instances and obtains the recognition accuracy of the model, as shown in Fig. 16. It can be seen from the above that the item's instance identification model has an accuracy of more than 0.91 for each type of instance and has a very high degree of credibility. The model shows good feasibility and accuracy.
Our face detection datasets contain the cleaned WIDER FACE dataset and a face dataset constructed in the simulated family environment. The CelebA dataset, which contains face keypoint locations, is used to train the sub-network O-Net. The face detection data are divided into positive, negative and partial face samples according to the intersection over union (IoU) between a randomly cropped face candidate and the ground truth box: above 0.65 for positive samples, below 0.3 for negative samples and between 0.4 and 0.65 for partial faces. The face-landmark samples are generated directly from the CelebA dataset. These four parts correspond to different learning tasks: positive and negative samples are used for face classification, while positive and partial samples are used for face bounding box regression. In the same training batch, the samples are allocated according to the ratio of 1:3:1:2. The training of each model depends on the training of the previous network model, so data processing and model training alternate throughout the training process.
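A minimal sketch of this IoU-based partitioning of random crops is shown below; the thresholds follow the values quoted above, and discarding crops in the 0.3–0.4 band is an assumption rather than something stated in the paper.

```python
def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def label_crop(crop_box, gt_box):
    v = iou(crop_box, gt_box)
    if v > 0.65:
        return "positive"
    if v < 0.3:
        return "negative"
    if 0.4 <= v <= 0.65:
        return "partial"
    return "discard"        # crops between 0.3 and 0.4 are not used (assumed)
```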
After detection by the face detection sub-network P-Net, a large number of face candidate bounding boxes are generated, along with many non-face candidates, which would slow down the subsequent face detection. The non-maximum suppression (NMS) algorithm is used to further reduce the number of face candidate boxes. Fig. 17 shows the candidate boxes generated before and after the algorithm is applied; the number of candidate bounding boxes is reduced from 234 to 179.
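For reference, a standard greedy NMS routine is sketched below; the IoU threshold is generic, as the paper does not state the exact value it uses.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the current best box with the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        ious = inter / (area_i + area_r - inter + 1e-9)
        order = rest[ious <= iou_thresh]    # suppress heavily overlapping candidates
    return keep
```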
Before using the NMS algorithm, the average number of candidate bounding boxes generated by all pictures is about 148, and the number is reduced to 117 after using the algorithm. Fig. 18 shows a comparison of the number of candidate bounding boxes before and after using the NMS algorithm for 200 pictures.
In an unconstrained scene, faces appear in different poses and scales, which greatly affects the accuracy of face recognition. Face normalization normalizes the pose and scale of such faces; normalizing them to more standard face images helps extract facial features and thus improves recognition accuracy. As before, the affine transformation method is used to normalize the face pose and scale normalization is performed by smooth interpolation.
For about 80 test images with large pose variations, the faces are corrected using the affine transformation, and 20 groups of images before and after correction are selected to produce the histogram in Fig. 19. It can be clearly observed from Fig. 19 that the confidence of the face images after correction is greatly improved.
Face recognition experiments are performed on about 200 test images of 4 users captured under different conditions and the confidence threshold is set to 0.9. Fig. 20 shows the resulting confusion matrix. Each user's recognition accuracy rate reaches a high level, and the face recognition model shows a good recognition effect.
There are four users in the simulated family environment, numbered as 0, 1, 2 and 3, and taking five categories of mobile phones, cups, hats, shoes, and watches as examples to calculate the ownership matrix and construct ownership relationship, giving a total of 12 item instances. Fig. 21 shows partially acquired instance example images. Taking 30 minutes as a learning cycle, the robot gets 5 short-term ownership memory matrices through 5 learning cycles, as shown in Fig. 22.
According to (8), the vector representation of the short-term memory matrices is shown in (10), where each row is the ownership vector of one learning cycle and each entry gives the user number that owns the corresponding item instance.
{{{U}}_{0 \le {{i}} \le 5,0 \le {{j}} \le 12}}=\left\{\begin{aligned} & [0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;1\;\;2\;\;0]\\ & [0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;1\;\;2\;\;0]\\ & [0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;2\;\;2\;\;0]\\ & [0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;1\;\;2\;\;0]\\ & [2\;\;1\;\;2\;\;0\;\;1\;\;2\;\;0\;\;1\;\;2\;\;2\;\;1\;\;0]. \end{aligned}\right. | (10) |
The short-term ownership memory matrices are filtered by the selection algorithm proposed in this paper, and the 5th ownership relationship is excluded. The final long-term ownership memory matrix is obtained by statistics over the remaining four short-term ownership matrices, as shown in Fig. 23.
The ownership relationship memory matrix is introduced to solve the representation problem of the ownership relationship, with different implementations for different stages of the learning process, since the learning of the ownership relationship is continuous. Considering the possible interference during autonomous learning, a selection algorithm based on vector-space similarity is proposed; it removes partially interfered information from the ownership relationships obtained during autonomous learning and realizes effective learning of the ownership relationship.
This paper proposes a robotic autonomous cognitive system for the ownership relationship between humans and the items they carry, in response to the demand for personalized robotic services in the family environment; it provides a solution for the service robot to learn this ownership relationship independently. Firstly, the object detection model is combined with the human pose estimation model to locate the items carried by humans. Then, a convolutional neural network constructed with transfer learning extracts object features to complete instance recognition of the carried items, while the user's identity is recognized at the same time. Lastly, the long-term and short-term memory matrices are used to realize self-learning of the ownership relationship and a database query tool is used to answer ownership queries. The experimental results show that the proposed framework can guide service robots to complete self-learning of the ownership relationship with high efficiency. In future research, for the task description of the service objects, we will plan the personalized tasks performed by service robots in dynamic environments more reasonably and further improve the standard of robotic personalized service.
This work was supported by the Joint Funds of National Natural Science Foundation of China (Nos. U1813215 and 2018YFB1307101), National Natural Science Foundation of China (Nos. 61603213, 61773239, 61973187, 61973192 and 91748115), Shandong Provincial Natural Science Foundation, China (No. ZR2017MF014), Jinan Technology project (No. 20150219) and Taishan Scholars Programme of Shandong Province.
[1] H. Wu, X. J. Wu, Q. Ma, G. H. Tian. Cloud robot: Semantic map building for intelligent service task. Applied Intelligence, vol. 49, no. 2, pp. 319–334, 2019. DOI: 10.1007/s10489-018-1277-0.
[2] W. He, Z. J. Li, C. L. P. Chen. A survey of human-centered intelligent robots: Issues and challenges. IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 602–609, 2017. DOI: 10.1109/JAS.2017.7510604.
[3] Y. Yang, F. Qiu, H. Li, L. Zhang, M. L. Wang, M. Y. Fu. Large-scale 3D semantic mapping using stereo vision. International Journal of Automation and Computing, vol. 15, no. 2, pp. 194–206, 2018. DOI: 10.1007/s11633-018-1118-y.
[4] E. Daǧlarlı, S. F. Dağlarlı, G. Ö. Günel, H. Köse. Improving human-robot interaction based on joint attention. Applied Intelligence, vol. 47, no. 1, pp. 62–82, 2017. DOI: 10.1007/s10489-016-0876-x.
[5] T. M. Wang, Y. Tao, H. Liu. Current researches and future development trend of intelligent robot: A review. International Journal of Automation and Computing, vol. 15, no. 5, pp. 525–546, 2018. DOI: 10.1007/s11633-018-1115-1.
[6] J. Rivera-Rubio, S. Idrees, I. Alexiou, L. Hadjilucas, A. A. Bharath. Small hand-held object recognition test (SHORT). In Proceedings of IEEE Winter Conference on Applications of Computer Vision, IEEE, Steamboat Springs, USA, pp. 524–531, 2014. DOI: 10.1109/WACV.2014.6836057.
[7] J. Rivera-Rubio, S. Idrees, I. Alexiou, L. Hadjilucas, A. A. Bharath. A dataset for hand-held object recognition. In Proceedings of IEEE International Conference on Image Processing, IEEE, Paris, France, pp. 5881–5885, 2014. DOI: 10.1109/ICIP.2014.7026188.
[8] J. W. Hsieh, J. C. Cheng, L. C. Chen, C. H. Chuang, D. Y. Chen. Handheld object detection and its related event analysis using ratio histogram and mixture of HMMs. Journal of Visual Communication and Image Representation, vol. 25, no. 6, pp. 1399–1415, 2014. DOI: 10.1016/j.jvcir.2014.05.009.
[9] X. Lv, S. Q. Jiang, L. Herranz, S. Wang. RGB-D hand-held object recognition based on heterogeneous feature fusion. Journal of Computer Science and Technology, vol. 30, no. 2, pp. 340–352, 2015. DOI: 10.1007/s11390-015-1527-0.
[10] X. Li, S. Q. Jiang, X. Lv, C. P. Chen. Learning to recognize hand-held objects from scratch. In Proceedings of the 17th Pacific-Rim Conference on Multimedia, Springer, Xi′an, China, pp. 527–539, 2016. DOI: 10.1007/978-3-319-48896-7_52.
[11] K. Yamaguchi, M. Hadi Kiapour, L. E. Ortiz, T. L. Berg. Parsing clothing in fashion photographs. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, USA, pp. 3570–3577, 2012. DOI: 10.1109/CVPR.2012.6248101.
[12] X. D. Liang, C. Y. Xu, X. H. Shen, J. C. Yang, S. Liu, J. H. Tang, L. Lin, S. C. Yan. Human parsing with contextualized convolutional neural network. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 115–127, 2017. DOI: 10.1109/TPAMI.2016.2537339.
[13] K. Gong, X. D. Liang, D. Y. Zhang, X. H. Shen, L. Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 6757–6765, 2017. DOI: 10.1109/CVPR.2017.715.
[14] X. J. Chen, R. Mottaghi, X. B. Liu, S. Fidler, R. Urtasun, A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 1979–1986, 2014. DOI: 10.1109/CVPR.2014.254.
[15] J. S. Li, J. Zhao, Y. C. Wei, C. Y. Lang, Y. D. Li, T. Sim, S. C. Yan, J. S. Feng. Multiple-human parsing in the wild. [Online], Available: https://arxiv.org/abs/1705.07206, March 2018.
[16] Z. Cao, T. Simon, S. E. Wei, Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 1302–1310, 2017. DOI: 10.1109/CVPR.2017.143.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, A. C. Berg. SSD: Single shot MultiBox detector. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 21–37, 2016. DOI: 10.1007/978-3-319-46448-0_2.
[18] P. Felzenszwalb, D. McAllester, D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Anchorage, USA, 2008. DOI: 10.1109/CVPR.2008.4587597.
[19] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 580–587, 2014. DOI: 10.1109/CVPR.2014.81.
[20] R. Girshick. Fast R-CNN. In Proceedings of 2015 IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1440–1448, 2015. DOI: 10.1109/ICCV.2015.169.
[21] S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, MIT Press, Montreal, Canada, pp. 91–99, 2015.
[22] J. Redmon, S. Divvala, R. Girshick, A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 779–788, 2016. DOI: 10.1109/CVPR.2016.91.
[23] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. [Online], Available: https://arxiv.org/abs/1409.1556, April 2015.
[24] T. Y. Lin, P. Goyal, R. Girshick, K. M. He, P. Dollar. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, published online. DOI: 10.1109/TPAMI.2018.2858826.
[25] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, USA, pp. 3178–3185, 2012. DOI: 10.1109/CVPR.2012.6248052.
[26] G. Gkioxari, B. Hariharan, R. Girshick, J. Malik. Using k-poselets for detecting people and localizing their keypoints. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 3582–3589, 2014. DOI: 10.1109/CVPR.2014.458.
[27] L. Pishchulin, E. Insafutdinov, S. Y. Tang, B. Andres, M. Andriluka, P. Gehler, B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 4929–4937, 2016. DOI: 10.1109/CVPR.2016.533.
[28] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 34–50, 2016. DOI: 10.1007/978-3-319-46466-4_3.
[29] S. Yang, P. Luo, C. C. Loy, X. O. Tang. WIDER FACE: A face detection benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 5525–5533, 2016. DOI: 10.1109/CVPR.2016.596.
[30] Z. Liu, P. Luo, X. G. Wang, X. O. Tang. Deep learning face attributes in the wild. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 3730–3738, 2015. DOI: 10.1109/ICCV.2015.425.