Skill Learning for Robotic Insertion Based on One-shot Demonstration and Reinforcement Learning

Ying Li, De Xu

Citation: Y. Li, D. Xu. Skill learning for robotic insertion based on one-shot demonstration and reinforcement learning. International Journal of Automation and Computing. DOI: 10.1007/s11633-021-1290-3


    Author Bio:

    Ying Li received the B.Sc. degree in control science and engineering from North China Electric Power University (Baoding), China in 2016. He is a Ph.D. candidate at the Institute of Automation, Chinese Academy of Sciences (IACAS), China. His research interests include visual measurement, visual control, micro-assembly and machine learning. E-mail: liying2016@ia.ac.cn. ORCID iD: 0000-0002-0213-9247

    De Xu received the B.Sc. and M.Sc. degrees in control science and engineering from Shandong University of Technology, China in 1985 and 1990, respectively, and the Ph.D. degree in control science and engineering from Zhejiang University, China in 2001. He is a professor at the Institute of Automation, Chinese Academy of Sciences (IACAS), China. His research interests include visual measurement, visual control, intelligent control, visual positioning, microscopic vision, and micro-assembly. E-mail: de.xu@ia.ac.cn (Corresponding author). ORCID iD: 0000-0002-7221-1654

Publication history
  • Received: 2020-10-09
  • Accepted: 2021-03-02
  • Published online: 2021-03-24

    • Recently, precision assembly and manipulation have attracted much attention and been widely used in micro-electromechanical systems (MEMS) and industrial applications[1-4]. The contact force between components, measured by the force sensor, should be kept within a limited range to guarantee safety.

      Peg-in-hole assembly is a common assembly task, and automatic assembly methods have attracted much attention. Generally, assembly control methods can be classified into model-based methods and model-free methods. The former have been widely used in assembly tasks. In order to accomplish dual peg-in-hole assembly, Zhang et al.[5] analyzed the contact states and established the relationship between contact state and contact force. The jamming state was analyzed quantitatively and the corresponding control strategies were developed. In order to assemble three objects together, Liu et al.[6] modelled the contact states between each pair of components as a probability distribution. The three objects were adjusted simultaneously. Chen et al.[7] developed an error recovery method for wiring harness assembly. The dynamic model of mating connectors on a printed circuit board (PCB) was established and smooth insertion was achieved with moment control. The components in the above methods are rigid. If the components are deformable, the insertion tasks become more difficult and the contact states become more complex. Xing et al.[8] presented an efficient assembly method for multiple components connected in parallel by springs. An optimization method was developed based on the spring model. Efficiency is an important factor, and a passive alignment principle-based method was employed to accomplish assembly tasks with deformable components[9].

      However, the above methods are mainly based on mathematical descriptions of contact states, which may contain errors. The real contact states are far more complex and a precise model can hardly be obtained. Therefore, model-free precision insertion methods are highly needed.

      Recently, reinforcement learning (RL), combined with deep learning, has shown great potential in the field of artificial intelligence[10], and continuous control has been realized in simulated physics tasks[11]. RL has become a promising approach for robotic precision assembly[12,13]. Inoue et al.[14] proposed a Q-learning-based method for assembly with a robot arm, in which long short-term memory (LSTM) layers were used to approximate the Q-function. Li et al.[15] presented a skill acquisition method for the robotic assembly process. Unlike others, its reward function employed a two-class support vector machine (SVM) model to determine whether the assembly is successful. However, the actions in the above methods are discrete, while continuous actions are more suitable for the assembly process. Fan et al.[16] presented a learning framework for high-precision industrial assembly, which combined supervised learning with deep deterministic policy gradient (DDPG) based RL. Trajectory optimization served as a semi-supervisor to provide initial guidance for the actor-critic. In order to improve training efficiency, Vecerik et al.[17] developed a DDPG-based insertion method, which introduced human demonstrations into the learning process. Guided by the behavior cloning loss, the actor network can imitate the actions from demonstrations. As for deformable objects, Luo et al.[18] proposed a mirror descent guided policy search (MDGPS) method to insert a rigid peg into a deformable hole. Moreover, available prior knowledge can further improve the performance of RL. For example, Thomas et al.[19] developed a computer aided design (CAD) based RL method for robotic assembly, in which CAD data guide the RL through a geometric motion plan.

      The above RL-based methods can be used in real robotic assembly tasks. However, there are still some problems to be solved. Firstly, expert demonstrations can improve training efficiency[17], but it is tedious and unsafe to collect large amounts of demonstration data on real robotic systems. It is valuable to use as little demonstration data as possible to guide the training process of RL. Secondly, in order to obtain the optimal action policy, abundant exploration is needed. An efficient exploration strategy can accelerate the training process, and random exploration in the action space is not enough[14]. Therefore, an efficient exploration strategy for robotic assembly is highly needed. Thirdly, the model is usually trained for specific components[19] and must be retrained when meeting new components in real assembly tasks. Adaptability is an important factor and should be improved to meet the requirements of real assembly tasks. Furthermore, training an RL model on a real robotic system is time-consuming and inefficient. There is a gap between simulation and real robotic systems; in other words, models trained in simulation environments cannot be directly used on real robotic systems[20]. The training efficiency can be improved if this gap is bridged.

      In this paper, a DDPG-based insertion skill learning framework is proposed for robotic assembly. The main contributions of this work are as follows. 1) The final executed action consists of an expert action learned from one demonstration and a refinement action learned from RL, which improves the insertion efficiency. 2) An episode-step based exploration strategy is proposed to explore the state space more efficiently, which views the expert action as a benchmark and adjusts the exploration intensity dynamically. 3) A skill saving and selection mechanism is proposed to improve the adaptability of our method. Trained models for several typical components are saved in a skill pool, and the most appropriate model is selected for insertion tasks with a new component. 4) A simulation environment is established with the help of the force Jacobian matrix, which avoids the tedious training process on the real robotic system.

      The rest of this paper is organized as follows. Section 2 introduces the system configuration and problem formation. The insertion skill learning framework is detailed in Section 3. Sections 4 and 5 present the simulation and experiment results, respectively. Finally, this paper is concluded in Section 6.

    • The automated precision assembly system is designed as shown in Fig. 1. It consists of a 4 degree-of-freedom (DOF) adjustment platform, a 3-DOF manipulator, three microscopic cameras, a lighting system and a host computer. The optical axes of the three microscopic cameras are approximately orthogonal to each other, and the cameras can move along their moving platforms to adjust the distance between the objective lens and the objects so that clear images can be captured. Microscopic cameras 1−3 provide the mid-view, side-view and up-view of the objects, respectively.

      Figure 1.  System configuration and task description

      The world coordinate frame {W} is established on the base of the manipulator. The manipulator can move along the Xw, Yw and Zw axes. The platform coordinate frame {P} is established on the adjustment platform. The 4-DOF adjustment platform consists of three rotational DOFs around the Xp, Yp and Zp axes, respectively, and a translational DOF along the Zp axis. The camera coordinate frames {C1}, {C2} and {C3} are established on the three cameras, respectively. The force coordinate frame {F} is established on the force sensor.

    • For precision assembly, the goal is to learn the insertion policy through interacting with the environment. The insertion process can be modeled as a Markov decision process (MDP). At each time step t, the agent observes a state st $\in $ S, executes an action at $\in $ A, and receives a reward R. Then the state transitions to st+1 $\in $ S. The desired insertion policy μ, mapping from states to actions, is obtained by maximizing the sum of expected discounted rewards.

      In the insertion task, the state s is defined as

      $$ {s_t} = {\left[ {{f_x},{f_y},{f_z},{p_z}} \right]^{\rm{T}}} $$ (1)

      where fx, fy, and fz are the contact forces along Xf, Yf, and Zf axes, respectively; pz is the insertion depth along Zw axis. The action is defined as

      $$ {a_t} = {\left[ {{d_x},{d_y},{d_z}} \right]^{\rm{T}}} $$ (2)

      where dx and dy are the compliant adjustments along Xw and Yw axes, respectively; and dz is the insertion step along Zw axis.
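
      As a concrete illustration, the state and action vectors of (1) and (2) can be assembled directly from the force sensor readings and the commanded displacements. The snippet below is a minimal Python sketch; the unit conventions (mN for forces, μm for displacements) follow the experimental setup and are assumptions here.

      import numpy as np

      def make_state(f_x, f_y, f_z, p_z):
          # State of (1): contact forces along Xf, Yf, Zf (mN) and insertion depth along Zw (μm).
          return np.array([f_x, f_y, f_z, p_z])

      def make_action(d_x, d_y, d_z):
          # Action of (2): compliant adjustments along Xw, Yw and the insertion step along Zw (μm).
          return np.array([d_x, d_y, d_z])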

    • The insertion skill learning framework is based on DDPG and is shown in Fig. 2. The training process is divided into two stages: expert action learning and self-learning. The final action is composed of two parts, an expert action and a refinement action, which are obtained in the two stages, respectively. In the expert action learning stage, only one expert demonstration is collected and the expert action is learned from it. In the self-learning stage, the actor and critic networks are further trained with RL through interacting with the environment. Besides, an efficient exploration strategy is developed to accelerate the training process. After training, the agent has obtained the insertion skill and can accomplish the insertion tasks. A skill saving and selection mechanism is designed to improve adaptability to insertion tasks with different components.

      Figure 2.  Framework of insertion skill learning

    • In the RL training process, random actions might be harmful to the safety of the robotic system. It is beneficial to learn a stable and safe insertion skill from expert demonstrations. Therefore, a novel framework is proposed to leverage demonstrations and achieve better efficiency.

      1) Learning expert action from one-shot demonstration.

      A common method to accelerate the RL training process is to pretrain the networks with demonstrations from experts. A large number of demonstrations are usually needed in the pretraining stage to obtain adequate performance. However, data collection on real robotic systems is tedious and time-consuming. Therefore, a novel method is proposed to learn a stable and safe expert action from only one demonstration, which records the states and the corresponding actions.

      In the insertion process, the relationship between the contact force and the relative translational movements can be modeled with a force Jacobian matrix JF $\in $ R2×2:

      $$\left[ \begin{array}{l} {d_x} \\ {d_y} \end{array} \right] = {J_F}\left[ \begin{array}{l} {f_x} \\ {f_y} \end{array} \right].$$ (3)

      JF can be calibrated with the least squares method from the demonstration data. The expert action is represented as

      $$ a_t^e = [{d_{ex}},{d_{ey}},{d_{ez}}]^{\rm{T}} $$ (4)

      where dex and dey are the adjustments along the Xw and Yw axes, respectively, and dez is the insertion step along the Zw axis. dex and dey are calculated by (5):

      $$ \left[ \begin{array}{l} {d_{ex}}\\ {d_{ey}} \end{array} \right] = - \alpha {J_F}\left[ \begin{array}{l} {f_x}\\ {f_y} \end{array} \right] $$ (5)

      where α $\in $ [0, 1] is a constant. In practice, dez can be set to a small constant value for convenience.

      Thus, an expert action ate is obtained with only one demonstration. The demonstration data are mainly used to capture the properties of the components by calibrating JF, so it does not matter whether the demonstration itself is optimal. Therefore, our method is more efficient and convenient than traditional pretraining-based methods.
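
      The calibration of JF and the computation of the expert action can be sketched as follows. This is a minimal Python example under assumptions: the demonstration is given as paired arrays of radial forces and adjustments, α = 0.15 follows Table 1, and the constant insertion step dez = 50 μm is a hypothetical value.

      import numpy as np

      def calibrate_force_jacobian(forces_xy, displacements_xy):
          # Least-squares fit of JF in (3): d = JF f, using the recorded demonstration.
          # forces_xy: (N, 2) array of [fx, fy]; displacements_xy: (N, 2) array of [dx, dy].
          JF_T, _, _, _ = np.linalg.lstsq(forces_xy, displacements_xy, rcond=None)
          return JF_T.T          # JF, shape (2, 2)

      def expert_action(f_xy, JF, alpha=0.15, d_ez=50.0):
          # Expert action of (4) and (5): compliant radial adjustment plus a small constant insertion step.
          d_xy = -alpha * JF @ np.asarray(f_xy)
          return np.array([d_xy[0], d_xy[1], d_ez])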

      2) Neural networks based refinement action

      The proposed framework contains two main networks: an actor network and a critic network. The actor network takes a state as input and outputs the refinement action μ(st | θμ) with parameters θμ. There are five fully connected layers in the actor network. The ReLU activation function is used in the first four layers and the tanh activation function is used in the last output layer, whose output is the refinement action atr. The final output action at is the combination of the expert action ate and the refinement action atr:

      $$ {a_t} = {a_t}^e + {a_t}^r. $$ (6)

      The components of the final action at along the Xw and Yw axes are normalized within [−1, 1], and the component along the Zw axis is normalized within [0, 1]. The expert action ate is explainable and safe but not optimal. The refinement action works to improve the insertion efficiency. The final action at can thus meet the requirements of safety and high efficiency.

      The critic network takes the state and the refinement action atr as input and outputs the action value Q(st, atr | θQ) with parameters θQ. Two fully connected layers are employed to fuse the state and the action, and two further fully connected layers are used to approximate the action value.

      A target actor network μ′(st | θμ′) with parameters θμ′ and a target critic network Q′(st, atr | θQ′) with parameters θQ′ are employed to calculate the target values. Their structures are the same as those of the actor and critic networks, respectively.
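
      A possible realization of the two networks is sketched below in PyTorch. The framework choice and the hidden widths are assumptions, since the paper only specifies the number of layers and the activation functions.

      import torch
      import torch.nn as nn

      class Actor(nn.Module):
          # Five fully connected layers: ReLU on the first four, tanh on the output (refinement action).
          def __init__(self, state_dim=4, action_dim=3, hidden=64):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Linear(state_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
                  nn.Linear(hidden, action_dim), nn.Tanh(),
              )

          def forward(self, state):
              return self.net(state)

      class Critic(nn.Module):
          # Two layers fuse the state and refinement action; two further layers approximate Q(s, a^r).
          def __init__(self, state_dim=4, action_dim=3, hidden=64):
              super().__init__()
              self.fuse = nn.Sequential(
                  nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
              )
              self.head = nn.Sequential(
                  nn.Linear(hidden, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1),
              )

          def forward(self, state, action):
              return self.head(self.fuse(torch.cat([state, action], dim=-1)))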

      3) Preliminaries

      During training, the agent samples a minibatch of N state transitions from the replay buffer M to update the parameters of actor and critic networks.

      The critic network is trained by minimizing the loss L with parameter θQ:

      $$L\left( {{\theta ^Q}} \right) = \frac{{\rm{1}}}{N}\sum\limits_{i = 1}^N {{{\left( {{y_i} - Q\left( {{s_i},{a_i}|{\theta ^Q}} \right)} \right)}^2}} $$ (7)

      where yi is computed by

      $${y_i} = {R_{Mi}} + \gamma Q'\left( {{s_{i + 1}},\mu '\left( {{s_{i + 1}}|{\theta ^{\mu '}}} \right)|{\theta ^{Q'}}} \right).$$ (8)

      RMi is the reward which is detailed in Section 3.2; γ is a discount factor.

      The actor network is trained by maximizing J(θμ) with respect to parameters θμ:

      $$J\left( {{\theta ^\mu }} \right){\rm{ = }}E\left[ {Q\left( {{s_t},\mu \left( {{s_t}|{\theta ^\mu }} \right)} \right)} \right].$$ (9)

      The parameters of the actor network are updated by computing the policy gradient with the chain rule:

      $${\nabla _{{\theta ^\mu }}}J\left( {{\theta ^\mu }} \right) = \frac{1}{N}\sum\limits_{i = 1}^N {{\nabla _a}Q\left( {s,a|{\theta ^Q}} \right){|_{s = {s_i},a = \mu \left( {{s_i}} \right)}}{\nabla _{{\theta ^\mu }}}\mu \left( {s|{\theta ^\mu }} \right){|_{s = {s_i}}}} .$$ (10)

      The parameters of target networks are updated by slowly tracking the learned networks:

      $$\left\{ \begin{aligned} &{\theta ^{Q'}} = \tau {\theta ^Q} + \left( {1 - \tau } \right){\theta ^{Q'}} \\ &{\theta ^{\mu '}} = \tau {\theta ^\mu } + \left( {1 - \tau } \right){\theta ^{\mu '}} \end{aligned} \right.$$ (11)

      where τ is a factor between 0 and 1.
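
      For concreteness, one update iteration of (7)−(11) can be written as follows. This is an illustrative PyTorch sketch rather than the authors' code; the use of externally supplied Adam optimizers and the tensor layout of the sampled batch are assumptions, while γ = 0.99 and τ = 0.1 follow Table 1.

      import torch
      import torch.nn.functional as F

      def ddpg_update(batch, actor, critic, target_actor, target_critic,
                      actor_opt, critic_opt, gamma=0.99, tau=0.1):
          # batch: tensors sampled from the replay buffer M (states, refinement actions, rewards R_M, next states).
          s, a_r, r, s_next = batch

          # Critic: minimize the TD error of (7) with targets computed as in (8).
          with torch.no_grad():
              y = r + gamma * target_critic(s_next, target_actor(s_next))
          critic_loss = F.mse_loss(critic(s, a_r), y)
          critic_opt.zero_grad()
          critic_loss.backward()
          critic_opt.step()

          # Actor: ascend the policy gradient of (9)-(10), i.e., maximize Q(s, mu(s)).
          actor_loss = -critic(s, actor(s)).mean()
          actor_opt.zero_grad()
          actor_loss.backward()
          actor_opt.step()

          # Soft target updates, (11).
          for tgt, src in ((target_actor, actor), (target_critic, critic)):
              for p_t, p in zip(tgt.parameters(), src.parameters()):
                  p_t.data.mul_(1.0 - tau).add_(tau * p.data)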

      4) Self-learning stage

      During self-learning, the state transitions are collected by interacting with the environment and stored in the replay buffer M. The pseudo code of the self-learning stage is given in Algorithm 1.

      The actor network is trained with ${\nabla _{{\theta ^\mu }}}J\left( {{\theta ^\mu }} \right)$ and the critic network is trained with the loss L(θQ), as given in (9) and (7), with batch size N. The dynamic exploration strategy used in the training process is given in Section 3.3.

    • For precision assembly, safety is important and the contact force should be kept within a safe range. Besides, efficiency is another key factor, and the process is expected to finish with as few insertion steps as possible. Therefore, the designed reward function consists of two parts, the safety reward R1t and the efficiency reward R2t, as given in (12).

      $$\left\{ \begin{aligned} &{R_{1t}} = 1 - \frac{{{f_{rt}}}}{{{f_T}}} \\ &{R_{2t}} = - \left| {{d_{zt}} - {R_{1\left( {t - 1} \right)}}{D_T}} \right|{\rm{/}}{D_T} \end{aligned} \right.$$ (12)

      where fT is the maximum allowed radial contact force; DT is the maximum allowed insertion depth; frt is the radial contact force after executing the t-th action.

      $${f_{rt}}{\rm{ = }}\sqrt {f_{xt}^2 + f_{yt}^2} .$$ (13)

      Then the reward function RMt is calculated by

      $$ {R_{Mt}} = {R_{1t}} + {R_{2t}}. $$ (14)

      The reward R1t means that the agent receives a small reward if the contact force is large. fr(t−1) can be viewed as the contact force before executing the t-th action, and the term R1(t−1)DT provides an expected insertion depth: the larger the current contact force is, the smaller the insertion depth should be. The reward R2t indicates that the larger the difference between the real and expected insertion depths is, the smaller the reward will be.
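
      The reward of (12)−(14) can be computed step by step as sketched below. The values fT = 80 mN and DT = 150 μm follow the simulation settings of Section 4.1; treating DT as the per-step depth limit is an assumption made for illustration.

      import numpy as np

      def insertion_reward(f_xt, f_yt, d_zt, R1_prev, f_T=80.0, D_T=150.0):
          # Radial contact force after the t-th action, (13).
          f_rt = np.hypot(f_xt, f_yt)
          # Safety reward: smaller reward for larger contact force.
          R1 = 1.0 - f_rt / f_T
          # Efficiency reward: penalize deviation from the expected insertion depth R1(t-1)*DT.
          R2 = -abs(d_zt - R1_prev * D_T) / D_T
          # Total reward R_Mt of (14); R1 is returned for use at the next step.
          return R1 + R2, R1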

      Algorithm 1. Self-learning with dynamic exploration

      Initialize σa←0.1

      Initialize replay buffer M

      For episode =1, 2, ···, do:

       Reset the initial state s0

       For t=1, 2, ···, do:

         Compute the expert action ate and refinement action atr

         Compute the actions at and atc with (6) and (15)

         Execute action atc, observe reward RMt and the next state st+1

         If st+1 is a termination state do:

           break

         End if

         Store transition (st, atr, RMt, st+1) in M

         Sample a random minibatch of N transitions from M

         Calculate gradients and update parameters

         Update σa with (17)

          st ← st+1

       End for

       Calculate the cumulative reward and update σa with (16)

      End for

    • When training the RL model, the state space should be explored to improve the performance of the action policy. However, random exploration might be harmful to the safety of the robotic system; for example, the radial contact force may exceed the allowed range. An appropriate exploration strategy can encourage the agent to explore the state space more efficiently. Therefore, we develop an episode-step exploration strategy in which the exploration intensity is adjusted online according to the current performance of the agent.

      Gaussian noise is added to the action for random exploration.

      $$a_t^c = {a_t} + N\left( {0,{\sigma _a}I} \right)$$ (15)

      where σa is the standard deviation; atc is the output action with Gaussian noise.

      The parameter σa determines the exploration intensity. Generally, the exploration should be increased when the performance of the action policy is unsatisfactory. The average episode reward can indicate the performance of the action policy. Then a simple but effective episode-based exploration method is given as

      $${\sigma _a} = \left\{ \begin{aligned} & {\sigma _{t1}},\;{\rm{if}}\;\frac{1}{{{N_s}}}\sum\limits_{t = 0}^{{N_s}} {{R_{Mt}}} < 0 \\ & {\sigma _{t2}},\;{\rm{otherwise }} \end{aligned} \right.$$ (16)

      where Ns is the number of steps in the episode; σt1 and σt2 are two thresholds where σt1 > σt2; σa is adjusted after each episode.

      The episode-based method alone is insufficient because σa is only updated after an episode finishes, which introduces delay. Therefore, a step-based exploration method is developed as a supplement to the episode-based method.

      In general, the performance of the action atc is expected to be better than that of the sole expert action ate. Therefore, the expert action ate can be used as a benchmark to evaluate the performance of the agent after each step. Specifically, if the performance of atc is better than that of ate, the exploration should be decreased to generate stable outputs; otherwise, the exploration should be increased to search for a better policy. Another problem is that the actually executed action is atc rather than ate, which means the reward Re generated by ate cannot be obtained directly. However, Re can be estimated with the state before executing the action atc: the efficiency part R2t is calculated by (12), and the safety part R1t is calculated with the contact force before, rather than after, executing atc. Generally, the radial contact force decreases after executing ate, which means the reward Re estimated in this way is lower than the real one. Therefore, the reward Re is suitable as a benchmark to guide exploration. The step exploration method is then given as

      $${\sigma _a} \leftarrow {\sigma _a} - {\sigma _b}\tanh \left( {{R_{Mt}} - {R_e}} \right)$$ (17)

      where σb is a constant, and σa is limited within [σmin, σmax].
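
      The complete episode-step strategy of (15)−(17) can be wrapped up as in the sketch below. The threshold values follow Table 1; packaging the logic in a small class is purely an implementation choice for illustration.

      import numpy as np

      class EpisodeStepExploration:
          def __init__(self, sigma_t1=0.3, sigma_t2=0.1, sigma_b=0.3,
                       sigma_min=0.1, sigma_max=0.5):
              self.sigma_a = 0.1
              self.sigma_t1, self.sigma_t2 = sigma_t1, sigma_t2
              self.sigma_b = sigma_b
              self.sigma_min, self.sigma_max = sigma_min, sigma_max

          def noisy_action(self, a_t):
              # (15): add Gaussian noise to the combined action a_t = a_t^e + a_t^r.
              return a_t + np.random.normal(0.0, self.sigma_a, size=a_t.shape)

          def step_update(self, R_Mt, R_e):
              # (17): compare the executed action against the expert benchmark after each step.
              self.sigma_a -= self.sigma_b * np.tanh(R_Mt - R_e)
              self.sigma_a = float(np.clip(self.sigma_a, self.sigma_min, self.sigma_max))

          def episode_update(self, episode_rewards):
              # (16): reset sigma_a from the average episode reward after each episode.
              self.sigma_a = self.sigma_t1 if np.mean(episode_rewards) < 0 else self.sigma_t2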

    • In real robotic assembly tasks, the properties of the components are different, so a model trained with one kind of component might not be suitable for other components, and it is tedious to train a new model for each new component. In order to solve this problem, an insertion skill saving and selection mechanism is developed, whose flow chart is given in Fig. 3. Firstly, several typical components are selected and used to train the corresponding models with the proposed method. Then the trained parameters are saved in a skill pool Sp. The force Jacobian matrix of the i-th model in Sp is denoted as JFi.

      Figure 3.  Flow chart of insertion skill saving and selection mechanism

      Given a new component, one demonstration is first conducted and the force Jacobian matrix JFnew is calibrated. The distance Di between JFnew and JFi is computed by

      $${D_i} = \left\| {{J_{Fi}} - {J_{Fnew}}} \right\|_F^2.$$ (18)

      The model with the minimal distance is chosen as the most appropriate one, and the corresponding insertion skill is loaded from the skill pool and employed to guide the insertion task with the new component. Therefore, the adaptability to different components is improved.
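
      The selection step of (18) amounts to a nearest-neighbor search over the stored Jacobians, as in the sketch below. The (JFi, parameters) layout of the skill pool is a hypothetical data structure chosen for illustration.

      import numpy as np

      def select_skill(J_F_new, skill_pool):
          # skill_pool: list of (J_Fi, model_parameters) pairs saved after training.
          # Pick the model whose force Jacobian is closest in squared Frobenius distance, (18).
          distances = [np.linalg.norm(J_Fi - J_F_new, ord='fro') ** 2
                       for J_Fi, _ in skill_pool]
          best = int(np.argmin(distances))
          return skill_pool[best][1]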

    • This section demonstrates the feasibility of the proposed insertion skill learning method in a peg-in-hole assembly simulation environment.

    • The simulation environment used in this experiment is the same as that in [21]. The friction coefficient is set to 0.3 and the Hookean coefficient is set to 3.3 mN/μm. There are two cylindrical components to be assembled. The heights of the two components are 4 mm, and the diameters of the peg and the hole are 4 mm and 4.01 mm, respectively.

      The training parameters of the proposed insertion skill learning method are given in Table 1. To guarantee the safety of the insertion task, the insertion step length dz is limited within [0, 150 μm] and the adjustments dx and dy are limited within [−5 μm, 5 μm]. The maximum allowed radial contact force fT is set to 80 mN. The maximum number of insertion steps is set to 1000. If the radial contact force fr exceeds fT, the insertion task fails. The insertion task succeeds when |fz| > 1000 mN.

      Table 1.  Training parameters

      Parameters                 Values     Parameters                 Values
      Delayed update rate τ      0.1        Discount factor γ          0.99
      Learning rate              0.001      Self-learning episodes     200
      Batch size N               32         Size of M                  200
      Constant α                 0.15       Thresholds σt1 and σt2     0.3, 0.1
      Thresholds σmin, σmax      0.1, 0.5   Constant σb                0.3

      The initial orientation and position errors are set within 0.3 degree and 10 μm, respectively. During self-learning, the initial states are sampled randomly within this pose error range. The test dataset includes 100 initial states with random pose errors, and the trained model is evaluated on the test dataset.

    • Firstly, one insertion demonstration is conducted. The expert action is learned with the method detailed in Section 3.1 and evaluated on the test dataset. The success rate is 100% and the mean reward is 0.52. The distribution of the radial contact force is shown in Fig. 4. The number of insertion steps is about 54, and the contact force descends below 10 mN after about 10 steps from the beginning of the task.

      Figure 4.  Distribution of radial contact force

      It can be seen that the expert action meets the safety requirement, but its efficiency is low. Therefore, the performance of the agent should be further improved with the self-learning method introduced in Section 3.1.

    • To validate the effectiveness of the proposed dynamic exploration strategy, three different strategies are compared: episode-step exploration, episode exploration and step exploration. The initial insertion depths are set randomly. The self-learning stage terminates after 200 episodes. The training results are shown in Fig. 5. The curve converges after about 25 episodes with the episode-step exploration strategy. In contrast, the curves converge after about 65 and 45 episodes with episode exploration and step exploration, respectively. Compared with the episode exploration method, the step exploration method can adjust the exploration intensity in a more timely manner, and its convergence speed is faster. Therefore, the proposed episode-step exploration strategy improves the training efficiency.

      Figure 5.  Reward curves of three different exploration strategies

      After self-learning, the performance of the agent is evaluated on the test dataset. The mean reward is 0.91 and the success rate is 100%. The number of insertion steps is about 31, and the contact force descends below 10 mN after about 3 steps from the beginning of the task, which is more efficient than the results of the expert action.

      The distribution of the radial contact force is shown in Fig. 4. With the action at after self-learning, forces below 10 mN account for over 92% of the samples; in contrast, they account for only 55% when the expert action ate is used alone. Compared with the expert learning results, the contact force is kept within a smaller range after self-learning and the performance of the agent is improved.

    • The classic DDPG method[11], denoted as comparative method 1, is chosen as a comparative method, and the episode exploration strategy is adopted for it. The structure of its network is the same as that of our method except for the expert action part. The robotic assembly method in [21], denoted as comparative method 2, is chosen as another comparative method. In order to compare its performance with that of our method, its fuzzy reward system is replaced with our reward function, and the other parts are the same as those of the original method in [21].

      The training process finishes after 200 episodes. The training results are shown in Fig. 6. The convergence speed of comparative method 2 is much faster than that of comparative method 1, but its reward curve is always below that of our method. Then the trained models are evaluated on the test dataset. The final performances of comparative methods 1 and 2 are similar: the success rates are both 100%, and the mean rewards are 0.89 and 0.86, respectively. The mean reward of our method is 0.91, as given in Section 4.3, which is better than those of the two comparative methods. Some comparative results are given in Table 2. The average radial contact force (ACF) is computed for each insertion task, and the mean and standard deviation (STD) of the ACFs are computed. The mean and STD of the number of steps are also computed. All four values of our method are smaller than those of the two comparative methods.

      Figure 6.  Reward curves of different methods during training process

      Table 2.  Contact forces and steps of different methods

      Methods                 Mean of ACF (mN)   STD of ACF (mN)   Mean of steps   STD of steps
      Our method              5.53               1.26              30.21           0.43
      Comparative method 1    6.77               1.34              32.58           0.59
      Comparative method 2    10.13              1.66              34.32           0.65

      Our method outperforms the two comparative methods, and the reasons are as follows. As for comparative method 1, its exploration strategy is worse than ours; besides, our framework contains an expert action and a refinement action, which accelerates the training process. As for comparative method 2, its acquisition of the expert action is improper: it treats the contact forces as decoupled, which is not always the case, and the variance of its action-space noise decreases monotonically, whereas our method adjusts the exploration intensity more flexibly. Furthermore, the two comparative methods must retrain their models when meeting new components. In our method, in contrast, a skill pool is established with the force Jacobian matrix, and the most appropriate model is selected from the skill pool to directly accomplish new insertion tasks. Moreover, the gap between simulation and the real robotic system can be bridged with the method detailed in Section 5.2, which is based on the force Jacobian matrix, whereas the two comparative methods cannot do that. Therefore, the adaptability of our method is much better.

    • In real robotic applications, the properties of the components may differ and it is time-consuming to train a new model for each new component. The method detailed in Section 3.4 can solve this problem. In order to validate its performance, a set of experiments is conducted. Three typical components are chosen, whose Hookean coefficients are 0.5k0, k0 and 2k0, respectively, where k0 is the Hookean coefficient used in the aforementioned experiments. The corresponding models are trained with the method introduced in Section 3 and saved to the skill pool Sp.

      A new component, whose Hookean coefficient is 2.2k0, is chosen as the test component. The distances Di, i = 1, 2, 3, are computed with (18). D3 is the smallest, so the third model is chosen as the most appropriate one. The model is then evaluated on the test dataset: the mean reward is 0.924 and the success rate is 100%. The results are given in Table 3, where models 1−3 are the three models in the skill pool. When the other two models are tested with the new component, the mean rewards decrease noticeably. This validates the correctness of choosing the third model for insertion tasks with the new component. Therefore, it is important to choose an appropriate model to obtain better performance, and the proposed method provides a convenient and efficient approach for insertion tasks with a new component.

      Table 3.  Success rate and reward with new component

      Models     Success rate   Mean reward
      Model 1    0.98           0.75
      Model 2    1.0            0.84
      Model 3    1.0            0.92
    • An experimental system is established according to the scheme given in Section 2.1, as shown in Fig. 7. In this system, camera 1 and camera 2 are GC2450 cameras and camera 3 is a PointGrey camera. All three cameras are equipped with Navitar zoom lenses with magnification 0.47−4.5× and capture images at 15 frames per second with an image size of 2448×2050 pixels. The adjustment platform is composed of a Micos WT-100 for rotation around the Xp and Yp axes, a Sigma SGSP-40YAW for rotation around the Zp axis, and a Micos ES-100 for translation along the Zp axis. The rotation and translation resolutions of the adjustment platform are 0.001 degree, 0.02 degree, 0.02 degree and 1 μm, respectively. The manipulator is composed of Sugura KWG06030G stages for translation along the Xw, Yw and Zw axes with a resolution of 1 μm.

      Figure 7.  Experimental system and components

      Three kinds of components are employed to verify the effectiveness of the proposed insertion skill learning method. The insertion tasks are to insert components A, B and C into component D. Components A, B and C are electronic components, and component D is a breadboard with many holes. Components A, B and C are separately mounted on the manipulator in sequence, and component D is mounted on the adjustment platform. The diameters of the pegs are 1 mm. The heights of components A, B and C are 5 mm, 8 mm and 5 mm, respectively. Vision-based pose alignment is conducted before the insertion process[1]. The force sensor provides the contact force during the insertion tasks.

    • Usually, it takes many hours to train an RL model to obtain insertion skills on a real robotic system, which is very time-consuming, and safety cannot be guaranteed. In order to solve this problem, we propose a method to bridge the gap between simulation and the real robotic system, so that the insertion skills obtained in the simulation environment can be directly used on the real robotic system.

      In real robotic systems, the coordinate frames of the force sensor and the manipulator may not be identical. JX, the inverse of JF, indicates the relationship between the relative movements and the contact force. With it, a new simulation environment can be established that is similar to the real robotic environment.

      Components A and B are separately mounted on the manipulator in sequence, and the relative movements of the manipulator cause deformation offsets of the components. The corresponding force Jacobian matrices of components A and B are given in (19) and (20):

      $${J_{FA}} = \left[ {\begin{array}{*{20}{c}} { - 0.133\;0}&{ - 0.010\;1} \\ { - 0.004\;1}&{0.063\;1} \end{array}} \right]\;{\text{μ}}{\rm{m/mN}}$$ (19)
      $${J_{FB}} = \left[ {\begin{array}{*{20}{c}} { - 0.232\;8}&{ - 0.032\;1} \\ { - 0.036\;5}&{0.525\;1} \end{array}} \right]\;{\text{μ}}{\rm{m/mN}}$$ (20)

      where JFA and JFB are the force Jacobian matrices of components A and B, respectively.
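
      With a calibrated JF, a simple contact model can stand in for the real system during training. The sketch below is only an illustrative assumption of how such an environment could be built: it approximates the radial contact force as JX times the accumulated radial misalignment, which captures the linear force-displacement relationship but is not claimed to be the authors' exact simulator.

      import numpy as np

      class JacobianContactSim:
          # Minimal contact model built from a calibrated force Jacobian JF (illustrative assumption).
          def __init__(self, J_F, initial_offset_xy):
              self.J_X = np.linalg.inv(J_F)                      # JX = JF^{-1}, in mN/μm
              self.offset = np.asarray(initial_offset_xy, dtype=float)  # radial misalignment in μm

          def step(self, d_xy):
              # Apply a radial adjustment d_xy (μm) and return the resulting contact force (mN).
              self.offset += d_xy
              return self.J_X @ self.offset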

      Models for components A and B are trained in the corresponding simulation environments, and the resulting model A and model B are saved in the skill pool. Four experiments are conducted for each component with the learned insertion policy, and all eight experiments finished successfully. The distributions of the contact force are given in Fig. 8. In the insertion tasks with component A, contact forces below 50 mN account for over 90% of the samples; for component B, contact forces below 30 mN account for over 95%. The success of the experiments validates the effectiveness of our method. It saves a lot of time and provides a much more convenient way to train RL models for assembly tasks on real robotic systems.

      Figure 8.  Distributions of radial contact force with components A, B and C.

      In order to further validate the feasibility of the skill saving and selection mechanism, component C is treated as a new component in the insertion tasks. Component C is mounted on the manipulator, and the relative movements of the manipulator cause a deformation offset of the component. The corresponding force Jacobian matrix JFC is first obtained and given in (21):

      $${J_{FC}} = \left[ {\begin{array}{*{20}{c}} { - 0.230\;4}&{0.023\;7} \\ { - 0.000\;1}&{0.258\;4} \end{array}} \right]\;{\text{μ}}{\rm{m/mN}}.$$ (21)

      There are two trained models, model A and model B, in the skill pool. The distances {Di | i = 1, 2} between JFC and the force Jacobian matrices of the two models are computed with (18). The materials of components A and C are similar and, as expected, D1 is smaller than D2. Therefore, model A is selected to complete the insertion tasks with component C. The maximum insertion step is set to 50 μm. Four experiments are conducted with different initial states. Some details of one insertion task are shown in Fig. 9. The contact forces fx and fy are kept within a safe range during the insertion process. The adjustments dx and dy are employed to reduce the contact force, and the insertion step dz decreases as the contact force increases. The distribution of the radial contact force is shown in Fig. 8; contact forces below 30 mN account for over 96% of the samples. The above results verify the feasibility of the skill saving and selection mechanism, which promotes the adaptability to new components.

      Figure 9.  Results of one insertion task with component C: (a) Contact force fx and fy; (b) Contact force fz; (c) Actions; (d) The whole trajectory.

    • A DDPG-based skill learning framework is proposed for robotic insertion. Considering both safety and efficiency, the executed action is composed of two parts, an expert action and a refinement action, which are learned from one demonstration and from RL, respectively. The episode-step exploration strategy is designed to improve the training efficiency of RL. In order to improve the adaptability of the insertion skill learning method, a skill saving and selection mechanism is designed, which makes it convenient to select an appropriate model from the skill pool when meeting new components. To bridge the gap between simulation and real robotic systems, a simulation environment is established under the guidance of the force Jacobian matrix, so that models trained in the simulation environment can be directly used in real insertion tasks. The results of simulations and experiments show the effectiveness of the proposed insertion skill learning framework.

    • This work was supported by National Key Research and Development Program of China (No. 2018AAA0103005) and National Natural Science Foundation of China (No. 61873266).
