The following example reproduces the algorithm from the paper "一种无人机自主避障与目标追踪方法" (a method for UAV autonomous obstacle avoidance and target tracking) in Python with TensorFlow, implementing autonomous obstacle avoidance and target tracking for a UAV based on the MP-DQN framework.

### 1. Environment setup and parameter definitions
```python
import tensorflow as tf
import numpy as np
import random

# Environment parameters
GRID_SIZE = 12        # grid map size
MAX_STEPS = 60        # maximum steps per episode
EVADER_SPEED = 0.5    # the evader moves at half the tracker's speed

# MP-DQN algorithm parameters
LEARNING_RATE = 0.01       # learning rate
GAMMA = 0.9                # discount factor
TAU = 0.01                 # soft-update coefficient for the target network
BATCH_SIZE = 32            # number of samples per training batch
SUCCESS_POOL_RATIO = 0.85  # fraction of each batch drawn from the success pool
MAX_EPISODES = 1000        # maximum number of training episodes

# ε-inspire exploration parameters
EPSILON_MIN = 0
EPSILON_MAX = 0.99
EPSILON_INCREMENT = 0.001
FAIL_THRESHOLD = 50              # maximum number of consecutive failed episodes
FORGET_EXPLORATION_RATE = 0.8    # forgetting exploration rate
```

### 7. Testing phase

```python
test_episodes = 1000
success_count = 0
for episode in range(test_episodes):
    tracker_pos = [random.randint(0, GRID_SIZE - 1), random.randint(0, GRID_SIZE - 1)]
    evader_pos = [random.randint(0, GRID_SIZE - 1), random.randint(0, GRID_SIZE - 1)]
    for step in range(MAX_STEPS):
        state = [(evader_pos[0] - tracker_pos[0]) / GRID_SIZE,
                 (evader_pos[1] - tracker_pos[1]) / GRID_SIZE]
        for i in range(4):
            neighbor_x = tracker_pos[0] + [-1, 1, 0, 0][i]
            neighbor_y = tracker_pos[1] + [0, 0, -1, 1][i]
            if 0 <= neighbor_x < GRID_SIZE and 0 <= neighbor_y < GRID_SIZE:
                state.append(1)
            else:
                state.append(-1)
        state = np.array(state, dtype=np.float32)
        # the exploration rate is fixed to 0 during testing
        action = epsilon_inspire_policy(state, 0, eval_network)
        new_tracker_pos = [tracker_pos[0] + action[0], tracker_pos[1] + action[1]]
        is_out_of_bound = (new_tracker_pos[0] < 0 or new_tracker_pos[0] >= GRID_SIZE or
                           new_tracker_pos[1] < 0 or new_tracker_pos[1] >= GRID_SIZE)
        if is_out_of_bound:
            new_tracker_pos = tracker_pos
        tracker_pos = new_tracker_pos
        if np.array_equal(tracker_pos, evader_pos):
            success_count += 1
            break
        else:
            evader_action = random.choice([(0, -1), (0, 1), (-1, 0), (1, 0)])
            new_evader_pos = [evader_pos[0] + evader_action[0] * EVADER_SPEED,
                              evader_pos[1] + evader_action[1] * EVADER_SPEED]
            new_evader_pos = [max(0, min(GRID_SIZE - 1, int(pos))) for pos in new_evader_pos]
            evader_pos = new_evader_pos

print(f"Test success rate: {success_count / test_episodes}")
```

Explanation:
- During testing the exploration rate is fixed at 0, so the tracker always chooses actions from the trained evaluation network.
- The number of episodes in which the tracker catches the evader is counted and used to compute the test success rate.

The code above is only a simplified implementation example based on the paper. In practice it may need further optimization and extension, such as handling more complex environments or adjusting the network structure and parameters.

### 8. Directions for optimizing and extending the code
#### 8.1 Network structure optimization

The original code uses a simple fully connected network. For more complex environments or tasks, convolutional layers (e.g. `Conv2D`) may be needed to extract features more effectively. A simple modification example:

```python
class QNetwork(tf.keras.Model):
    def __init__(self):
        super(QNetwork, self).__init__()
        self.conv1 = tf.keras.layers.Conv2D(16, kernel_size=(3, 3), activation='relu',
                                            input_shape=(GRID_SIZE, GRID_SIZE, 1))
        self.flatten = tf.keras.layers.Flatten()
        self.dense1 = tf.keras.layers.Dense(60, activation='relu')
        self.dense2 = tf.keras.layers.Dense(4)  # 4 actions: up, down, left, right

    def call(self, x):
        x = tf.expand_dims(x, -1)  # add a channel dimension
        x = self.conv1(x)
        x = self.flatten(x)
        x = self.dense1(x)
        return self.dense2(x)

eval_network = QNetwork()
target_network = QNetwork()
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
```

Explanation:
- Adding a `Conv2D` convolutional layer lets the network capture the spatial features of the environment more effectively.
- The `Flatten` layer flattens the convolutional output so it can be fed into the fully connected layers.
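Note that this convolutional variant expects an image-like observation of shape `(GRID_SIZE, GRID_SIZE)` rather than the 6-dimensional feature vector used elsewhere in this example. A minimal sketch of how such a grid observation could be built is shown below; the `build_grid_state` helper and its cell encoding (1.0 for the tracker, -1.0 for the evader) are assumptions for illustration, not part of the paper.

```python
def build_grid_state(tracker_pos, evader_pos):
    # Hypothetical helper: encode positions as a 2-D map so the Conv2D
    # network can exploit spatial structure (cell values are illustrative).
    grid = np.zeros((GRID_SIZE, GRID_SIZE), dtype=np.float32)
    grid[tracker_pos[0], tracker_pos[1]] = 1.0
    grid[int(evader_pos[0]), int(evader_pos[1])] = -1.0
    return grid

# A batch of one grid observation fed through the convolutional network:
grid_state = build_grid_state([2, 3], [7, 8])
q_values = eval_network(tf.convert_to_tensor([grid_state]))  # shape (1, 4)
```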
#### 8.2 Increasing environment complexity

The original environment is a plain grid map; obstacles can be added. Obstacle positions can be generated randomly when the environment is initialized, and both the tracker and the evader can check for collisions when they move. The modified parts of the code:

```python
# Initialize obstacles
OBSTACLE_NUM = 5
obstacles = []
for _ in range(OBSTACLE_NUM):
    obstacle_pos = [random.randint(0, GRID_SIZE - 1), random.randint(0, GRID_SIZE - 1)]
    obstacles.append(obstacle_pos)

# Check for obstacles when moving
def is_collide_with_obstacle(pos, obstacles):
    for obstacle in obstacles:
        if np.array_equal(pos, obstacle):
            return True
    return False

# Add obstacle checks when the tracker and the evader move
# Tracker move
new_tracker_pos = [tracker_pos[0] + action[0], tracker_pos[1] + action[1]]
is_out_of_bound = (new_tracker_pos[0] < 0 or new_tracker_pos[0] >= GRID_SIZE or
                   new_tracker_pos[1] < 0 or new_tracker_pos[1] >= GRID_SIZE)
if is_out_of_bound or is_collide_with_obstacle(new_tracker_pos, obstacles):
    new_tracker_pos = tracker_pos
tracker_pos = new_tracker_pos

# Evader move
new_evader_pos = [evader_pos[0] + evader_action[0] * EVADER_SPEED,
                  evader_pos[1] + evader_action[1] * EVADER_SPEED]
new_evader_pos = [max(0, min(GRID_SIZE - 1, int(pos))) for pos in new_evader_pos]
if is_collide_with_obstacle(new_evader_pos, obstacles):
    new_evader_pos = evader_pos
evader_pos = new_evader_pos
```

Explanation:
- A number of obstacle positions are generated randomly and stored in the `obstacles` list.
- The `is_collide_with_obstacle` function checks whether a given position collides with an obstacle.
- Obstacle checks are added to the tracker's and the evader's moves so that neither can move onto an obstacle cell.
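Once obstacles exist, the neighbor features of the state could also reflect blocked cells. The sketch below is an assumed extension of the original state-construction loop (it is not in the source code): a neighboring cell is marked -1 when it is out of bounds or occupied by an obstacle.

```python
def get_state_with_obstacles(tracker_pos, evader_pos, obstacles):
    # Assumed extension: neighbor features are -1 for cells blocked by the
    # map edge or an obstacle, and 1 otherwise.
    state = [(evader_pos[0] - tracker_pos[0]) / GRID_SIZE,
             (evader_pos[1] - tracker_pos[1]) / GRID_SIZE]
    for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        neighbor = [tracker_pos[0] + dx, tracker_pos[1] + dy]
        blocked = (not (0 <= neighbor[0] < GRID_SIZE and 0 <= neighbor[1] < GRID_SIZE)
                   or is_collide_with_obstacle(neighbor, obstacles))
        state.append(-1 if blocked else 1)
    return np.array(state, dtype=np.float32)
```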
#### 8.3 Experience replay optimization

The original code uses plain uniform random sampling; prioritized experience replay (Prioritized Experience Replay) can be used instead. Prioritized replay adjusts the sampling probability according to how important each experience is (e.g. its TD error), so that more important experiences are sampled more frequently. A simple implementation example:

```python
import heapq

class PrioritizedExperiencePool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.priorities = []  # min-heap of (-priority, buffer slot)

    def store(self, experience, priority):
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
            heapq.heappush(self.priorities, (-priority, len(self.buffer) - 1))
        else:
            # Evict the lowest-priority experience (largest negated priority)
            lowest = max(self.priorities)
            self.priorities.remove(lowest)
            slot = lowest[1]
            self.buffer[slot] = experience
            self.priorities.append((-priority, slot))
            heapq.heapify(self.priorities)

    def sample(self, batch_size):
        # Pop the highest-priority experiences, then push them back
        popped = [heapq.heappop(self.priorities)
                  for _ in range(min(batch_size, len(self.priorities)))]
        for item in popped:
            heapq.heappush(self.priorities, item)
        return [self.buffer[index] for _, index in popped]

success_pool = PrioritizedExperiencePool(SUCCESS_POOL_CAPACITY)
fail_pool = PrioritizedExperiencePool(FAIL_POOL_CAPACITY)
temp_pool = ExperiencePool(TEMP_POOL_CAPACITY)
```

Explanation:
- The `PrioritizedExperiencePool` class uses a `heapq` heap to manage experience priorities.
- The `store` method records the priority of each experience as it is stored.
- The `sample` method samples according to priority, so that higher-priority experiences are more likely to be drawn.
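The text above names the TD error as a possible priority measure, but the snippet never shows where that value would come from. A rough sketch, assuming the evaluation and target networks and the action list from the base implementation; the `td_error_priority` helper itself is not from the paper:

```python
def td_error_priority(experience, eval_network, target_network, gamma=GAMMA):
    # Assumed helper: use the absolute TD error of one transition as its priority.
    state, action, reward, new_state, done = experience
    actions = [(0, -1), (0, 1), (-1, 0), (1, 0)]
    a_idx = actions.index(tuple(action))
    q = float(eval_network(tf.convert_to_tensor([state], dtype=tf.float32))[0, a_idx])
    next_q = float(tf.reduce_max(target_network(tf.convert_to_tensor([new_state], dtype=tf.float32))))
    target = reward + gamma * next_q * (1.0 - float(done))
    return abs(target - q)

# Usage when storing:
# success_pool.store(exp, td_error_priority(exp, eval_network, target_network))
```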
### 9. Improving code maintainability and readability

#### 9.1 Function encapsulation

Wrap complex logic in functions to improve readability and maintainability. For example, the state construction and the reward computation can each be encapsulated in a function:

```python
def get_state(tracker_pos, evader_pos, GRID_SIZE):
    state = [(evader_pos[0] - tracker_pos[0]) / GRID_SIZE,
             (evader_pos[1] - tracker_pos[1]) / GRID_SIZE]
    for i in range(4):
        neighbor_x = tracker_pos[0] + [-1, 1, 0, 0][i]
        neighbor_y = tracker_pos[1] + [0, 0, -1, 1][i]
        if 0 <= neighbor_x < GRID_SIZE and 0 <= neighbor_y < GRID_SIZE:
            state.append(1)
        else:
            state.append(-1)
    return np.array(state, dtype=np.float32)

# Use the helper in both the training and the testing phase
state = get_state(tracker_pos, evader_pos, GRID_SIZE)
new_state = get_state(tracker_pos, evader_pos, GRID_SIZE)
```

Explanation:
- The `get_state` function encapsulates the state-construction logic, making the code more concise and easier to understand and modify.
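The same idea can be applied to the movement logic that is repeated in the training and testing loops. A possible helper, assumed here for illustration rather than taken from the original code:

```python
def move_tracker(tracker_pos, action):
    # Assumed helper: apply an action and keep the tracker inside the grid.
    # Returns the (possibly unchanged) position and an out-of-bounds flag.
    new_pos = [tracker_pos[0] + action[0], tracker_pos[1] + action[1]]
    out_of_bound = not (0 <= new_pos[0] < GRID_SIZE and 0 <= new_pos[1] < GRID_SIZE)
    return (tracker_pos if out_of_bound else new_pos), out_of_bound

# tracker_pos, is_out_of_bound = move_tracker(tracker_pos, action)
```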
#### 9.2 Adding comments and docstrings

Add detailed comments and docstrings that describe what each function and class does and what its inputs and outputs are. For example:

```python
def calculate_reward(tracker_pos, evader_pos, action, done, is_out_of_bound):
    """Compute the reward value.

    Args:
        tracker_pos (list): position of the tracker.
        evader_pos (list): position of the evader.
        action (tuple): action taken by the tracker.
        done (bool): whether the episode has ended.
        is_out_of_bound (bool): whether the tracker moved out of bounds.

    Returns:
        float: the computed reward.
    """
    # Relative position
    o1 = (evader_pos[0] - tracker_pos[0]) / GRID_SIZE
    o2 = (evader_pos[1] - tracker_pos[1]) / GRID_SIZE
    # Terminal reward / penalty
    r_T = TERMINAL_REWARD if done else -TERMINAL_REWARD
    # Step reward / penalty
    r_S = STEP_OUT_PENALTY if is_out_of_bound else STEP_PENALTY
    # Distance reward / penalty
    d_L = np.linalg.norm(np.array(tracker_pos) - np.array(evader_pos))
    new_tracker_pos = np.array(tracker_pos) + np.array(action)
    d_C = np.linalg.norm(new_tracker_pos - np.array(evader_pos))
    sign = 1 if d_C < d_L else -1 if d_C > d_L else 0
    r_D = sign * (1 / (np.sqrt(2 * np.pi) * DISTANCE_REWARD_SIGMA)) * \
          np.exp(-(d_C - 0) ** 2 / (2 * DISTANCE_REWARD_SIGMA ** 2)) * DISTANCE_REWARD_SCALE
    # Direction reward / penalty
    dot_product = action[0] * o1 + action[1] * o2
    norm_action = np.linalg.norm(action)
    norm_target = np.linalg.norm([o1, o2])
    if norm_action == 0 or norm_target == 0:
        theta = 0
    else:
        theta = np.arccos(dot_product / (norm_action * norm_target))
    r_theta = DIRECTION_REWARD_SCALE / np.pi * (np.pi / 2 - theta) if theta < np.pi / 2 \
        else -DIRECTION_REWARD_SCALE / np.pi * (theta - np.pi / 2)
    return r_T + r_S + r_D + r_theta
```

Explanation:
- A docstring is added to the `calculate_reward` function describing its purpose, parameters and return value, which makes it easier for other developers to understand and use.

These optimizations and improvements make the code more robust and flexible, and easier to maintain and extend.

The remaining constants from section 1 (the experience pool capacities and the reward function parameters) are defined as follows:

```python
# Experience pool capacities
SUCCESS_POOL_CAPACITY = 2000
FAIL_POOL_CAPACITY = 2000
TEMP_POOL_CAPACITY = 15

# Reward function parameters
TERMINAL_REWARD = 20          # terminal reward / penalty
STEP_OUT_PENALTY = 1          # out-of-bounds penalty
STEP_PENALTY = 0.5            # other per-step penalty
DISTANCE_REWARD_SCALE = 2     # distance reward coefficient
DISTANCE_REWARD_SIGMA = 0.5   # distance reward range
DIRECTION_REWARD_SCALE = 5    # direction reward coefficient
```

Explanation:
- Environment parameters such as the grid map size, the maximum number of steps, and the speed relationship between the evader and the tracker are defined.
- Key MP-DQN hyperparameters such as the learning rate, discount factor and soft-update coefficient are set.
- Minimum and maximum exploration rates and related parameters are set for the ε-inspire exploration policy.
- The experience pool capacities and the parameters of each reward component are specified.
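How these ε-inspire parameters interact only becomes visible inside the training loop of section 6. As a reading aid, the schedule can be written as a small helper; this function is a sketch inferred from the training code below, not something defined in the original:

```python
def compute_epsilon(total_fail_count):
    # Sketch of the ε-inspire schedule used in section 6: exploration grows with
    # the number of consecutive failed episodes and is reset to EPSILON_MIN once
    # FAIL_THRESHOLD is reached.
    if total_fail_count >= FAIL_THRESHOLD:
        return EPSILON_MIN
    return min(EPSILON_MAX, EPSILON_MIN + total_fail_count * EPSILON_INCREMENT)

# compute_epsilon(0) -> 0, compute_epsilon(30) -> roughly 0.03
```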
### 2. Building the network model

```python
class QNetwork(tf.keras.Model):
    def __init__(self):
        super(QNetwork, self).__init__()
        self.dense1 = tf.keras.layers.Dense(60, activation='relu')
        self.dense2 = tf.keras.layers.Dense(4)  # 4 actions: up, down, left, right

    def call(self, x):
        x = self.dense1(x)
        return self.dense2(x)

eval_network = QNetwork()
target_network = QNetwork()
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
```

Explanation:
- A `QNetwork` class is defined, inheriting from `tf.keras.Model`; it contains two fully connected layers and outputs the values of the 4 actions.
- The evaluation network `eval_network` and the target network `target_network` are created, and the Adam optimizer is used.
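One detail the snippet does not show is aligning the two networks before training starts. A common practice (an addition here, not taken from the paper) is to copy the evaluation network's weights into the target network once both have been built:

```python
# Build both networks with a dummy forward pass (the state has 6 features:
# 2 relative-position values plus 4 neighbor flags), then start the target
# network from the same weights as the evaluation network.
dummy_state = tf.zeros((1, 6), dtype=tf.float32)
eval_network(dummy_state)
target_network(dummy_state)
target_network.set_weights(eval_network.get_weights())
```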
### 3. Experience pool class definition

```python
class ExperiencePool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.index = 0

    def store(self, experience):
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            self.buffer[self.index] = experience
        self.index = (self.index + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

success_pool = ExperiencePool(SUCCESS_POOL_CAPACITY)
fail_pool = ExperiencePool(FAIL_POOL_CAPACITY)
temp_pool = ExperiencePool(TEMP_POOL_CAPACITY)
```

Explanation:
- The `ExperiencePool` class manages an experience pool and provides storing and sampling of experiences.
- A success pool `success_pool`, a failure pool `fail_pool` and a temporary pool `temp_pool` are created.
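For reference, a quick usage sketch with the transition format used later in training, i.e. `(state, action, reward, new_state, done)`; the values below are made up purely for illustration:

```python
pool = ExperiencePool(capacity=100)
for _ in range(40):
    s = np.zeros(6, dtype=np.float32)        # dummy 6-dimensional state
    pool.store((s, (0, 1), -0.5, s, False))  # dummy transition
batch = pool.sample(32)                      # 32 randomly sampled transitions
```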
### 4. Reward calculation function

```python
def calculate_reward(tracker_pos, evader_pos, action, done, is_out_of_bound):
    # Relative position
    o1 = (evader_pos[0] - tracker_pos[0]) / GRID_SIZE
    o2 = (evader_pos[1] - tracker_pos[1]) / GRID_SIZE
    # Terminal reward / penalty
    r_T = TERMINAL_REWARD if done else -TERMINAL_REWARD
    # Step reward / penalty
    r_S = STEP_OUT_PENALTY if is_out_of_bound else STEP_PENALTY
    # Distance reward / penalty
    d_L = np.linalg.norm(np.array(tracker_pos) - np.array(evader_pos))
    new_tracker_pos = np.array(tracker_pos) + np.array(action)
    d_C = np.linalg.norm(new_tracker_pos - np.array(evader_pos))
    sign = 1 if d_C < d_L else -1 if d_C > d_L else 0
    r_D = sign * (1 / (np.sqrt(2 * np.pi) * DISTANCE_REWARD_SIGMA)) * \
          np.exp(-(d_C - 0) ** 2 / (2 * DISTANCE_REWARD_SIGMA ** 2)) * DISTANCE_REWARD_SCALE
    # Direction reward / penalty
    dot_product = action[0] * o1 + action[1] * o2
    norm_action = np.linalg.norm(action)
    norm_target = np.linalg.norm([o1, o2])
    if norm_action == 0 or norm_target == 0:
        theta = 0
    else:
        theta = np.arccos(dot_product / (norm_action * norm_target))
    r_theta = DIRECTION_REWARD_SCALE / np.pi * (np.pi / 2 - theta) if theta < np.pi / 2 \
        else -DIRECTION_REWARD_SCALE / np.pi * (theta - np.pi / 2)
    return r_T + r_S + r_D + r_theta
```

Explanation:
- Following the reward formulation in the paper, the terminal, step, distance and direction rewards/penalties are computed and their sum is returned as the total reward.
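A quick worked call makes the sign conventions easier to check. The positions and action below are made-up values: the tracker at (2, 2) moves right toward an evader at (5, 2), so the distance and direction terms are positive, while the terminal term contributes -TERMINAL_REWARD because the episode has not ended.

```python
# Made-up example: tracker at (2, 2), evader at (5, 2), action = move right (1, 0)
r = calculate_reward(tracker_pos=[2, 2], evader_pos=[5, 2],
                     action=(1, 0), done=False, is_out_of_bound=False)
print(r)  # roughly -17.0 with the parameter values defined in section 1
```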
### 5. ε-inspire exploration policy

```python
def epsilon_inspire_policy(state, epsilon, eval_network):
    if np.random.rand() < epsilon:
        return random.choice([(0, -1), (0, 1), (-1, 0), (1, 0)])
    else:
        state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)
        q_values = eval_network(state_tensor)
        action_index = tf.argmax(q_values, axis=1).numpy()[0]
        actions = [(0, -1), (0, 1), (-1, 0), (1, 0)]
        return actions[action_index]
```

Explanation:
- Following the ε-inspire policy, a random action is chosen with a certain probability; otherwise the action with the highest Q value from the evaluation network is selected.
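A usage sketch, assuming a 6-dimensional state as built in the training loop; with epsilon set to 0 the call is purely greedy, which is how the policy is used in the test phase.

```python
example_state = np.array([0.25, -0.1, 1, 1, -1, 1], dtype=np.float32)  # made-up state
greedy_action = epsilon_inspire_policy(example_state, 0.0, eval_network)   # always greedy
explore_action = epsilon_inspire_policy(example_state, 0.9, eval_network)  # mostly random
```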
### 6. MP-DQN training process

```python
total_fail_count = 0
for episode in range(MAX_EPISODES):
    tracker_pos = [random.randint(0, GRID_SIZE - 1), random.randint(0, GRID_SIZE - 1)]
    evader_pos = [random.randint(0, GRID_SIZE - 1), random.randint(0, GRID_SIZE - 1)]
    epsilon = EPSILON_MIN if total_fail_count >= FAIL_THRESHOLD else \
        min(EPSILON_MAX, EPSILON_MIN + total_fail_count * EPSILON_INCREMENT)
    for step in range(MAX_STEPS):
        state = [(evader_pos[0] - tracker_pos[0]) / GRID_SIZE,
                 (evader_pos[1] - tracker_pos[1]) / GRID_SIZE]
        for i in range(4):
            neighbor_x = tracker_pos[0] + [-1, 1, 0, 0][i]
            neighbor_y = tracker_pos[1] + [0, 0, -1, 1][i]
            if 0 <= neighbor_x < GRID_SIZE and 0 <= neighbor_y < GRID_SIZE:
                state.append(1)
            else:
                state.append(-1)
        state = np.array(state, dtype=np.float32)
        action = epsilon_inspire_policy(state, epsilon, eval_network)
        new_tracker_pos = [tracker_pos[0] + action[0], tracker_pos[1] + action[1]]
        is_out_of_bound = (new_tracker_pos[0] < 0 or new_tracker_pos[0] >= GRID_SIZE or
                           new_tracker_pos[1] < 0 or new_tracker_pos[1] >= GRID_SIZE)
        if is_out_of_bound:
            new_tracker_pos = tracker_pos
        tracker_pos = new_tracker_pos
        if np.array_equal(tracker_pos, evader_pos):
            done = True
        else:
            # The evader moves randomly
            evader_action = random.choice([(0, -1), (0, 1), (-1, 0), (1, 0)])
            new_evader_pos = [evader_pos[0] + evader_action[0] * EVADER_SPEED,
                              evader_pos[1] + evader_action[1] * EVADER_SPEED]
            new_evader_pos = [max(0, min(GRID_SIZE - 1, int(pos))) for pos in new_evader_pos]
            evader_pos = new_evader_pos
            done = False
        reward = calculate_reward(tracker_pos, evader_pos, action, done, is_out_of_bound)
        new_state = [(evader_pos[0] - tracker_pos[0]) / GRID_SIZE,
                     (evader_pos[1] - tracker_pos[1]) / GRID_SIZE]
        for i in range(4):
            neighbor_x = tracker_pos[0] + [-1, 1, 0, 0][i]
            neighbor_y = tracker_pos[1] + [0, 0, -1, 1][i]
            if 0 <= neighbor_x < GRID_SIZE and 0 <= neighbor_y < GRID_SIZE:
                new_state.append(1)
            else:
                new_state.append(-1)
        new_state = np.array(new_state, dtype=np.float32)
        temp_pool.store((state, action, reward, new_state, done))
        if len(temp_pool.buffer) == TEMP_POOL_CAPACITY:
            success_pool.store(temp_pool.buffer.pop(0))
        if done:
            if reward == TERMINAL_REWARD:
                for exp in temp_pool.buffer:
                    success_pool.store(exp)
            else:
                for exp in temp_pool.buffer:
                    fail_pool.store(exp)
            temp_pool.buffer = []
            break
        if len(success_pool.buffer) >= BATCH_SIZE and len(fail_pool.buffer) >= BATCH_SIZE:
            success_batch_size = int(BATCH_SIZE * SUCCESS_POOL_RATIO)
            success_batch = success_pool.sample(success_batch_size)
            fail_batch = fail_pool.sample(BATCH_SIZE - success_batch_size)
            batch = success_batch + fail_batch
            action_list = [(0, -1), (0, 1), (-1, 0), (1, 0)]
            states = np.array([exp[0] for exp in batch], dtype=np.float32)
            # Convert each stored action tuple to its index so it can be one-hot encoded
            actions = np.array([action_list.index(tuple(exp[1])) for exp in batch], dtype=np.int32)
            rewards = np.array([exp[2] for exp in batch], dtype=np.float32)
            new_states = np.array([exp[3] for exp in batch], dtype=np.float32)
            dones = np.array([exp[4] for exp in batch], dtype=np.float32)  # 1.0 when the episode ended
            actions_onehot = tf.one_hot(actions, 4)
            with tf.GradientTape() as tape:
                q_values = eval_network(states)
                q_value = tf.reduce_sum(tf.multiply(q_values, actions_onehot), axis=1)
                target_q_values = target_network(new_states)
                max_target_q = tf.reduce_max(target_q_values, axis=1)
                target = rewards + GAMMA * max_target_q * (1 - dones)
                loss = tf.reduce_mean(tf.square(target - q_value))
            gradients = tape.gradient(loss, eval_network.trainable_variables)
            optimizer.apply_gradients(zip(gradients, eval_network.trainable_variables))
            for target_weight, eval_weight in zip(target_network.trainable_variables,
                                                  eval_network.trainable_variables):
                target_weight.assign(TAU * eval_weight + (1 - TAU) * target_weight)
    if reward != TERMINAL_REWARD:
        total_fail_count += 1
    else:
        total_fail_count = 0
```

Explanation:
- The tracker and evader positions are initialized and actions are selected with the ε-inspire policy.
- The reward is computed, the tracker and evader positions are updated, and each experience is stored in the temporary pool.
- When the temporary pool is full, the overflowing data is moved into the success pool; when the episode ends, the temporary pool is flushed into the success or failure pool depending on whether the task succeeded.
- Once the success and failure pools hold enough data for sampling, a batch is drawn, the target values and loss are computed, and the evaluation and target networks are updated.
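The snippets above do not show how the trained weights get from the training loop to the test phase. One straightforward option is a TensorFlow checkpoint; the directory name used here is an arbitrary choice, not something specified in the paper.

```python
# After training: save the evaluation network's weights.
ckpt = tf.train.Checkpoint(model=eval_network)
ckpt_path = ckpt.save("checkpoints/mp_dqn_eval")

# Later, e.g. in a separate test run: rebuild the network and restore the weights.
eval_network = QNetwork()
tf.train.Checkpoint(model=eval_network).restore(ckpt_path)
```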