
Q-Learning is an important model-free algorithm in reinforcement learning. Its basic idea is to record and update state-action values in a Q table, so that in the end we obtain a "perfect" Q table: whatever state the agent is in, looking up this table tells it how to act.

The following very simple example illustrates the idea behind Q-Learning (it is largely based on: https://morvanzhou.github.io/tutorials/ ). The agent lives in a one-dimensional world: it can only move left or right along a line segment of fixed length, one cell at a time, and it only receives a +1 reward when it reaches the rightmost cell. Initially the agent sits at the leftmost cell and has no idea that a "treasure" with a reward is waiting at the other end.

Another article worth consulting: https://blog.csdn.net/Young_Gy/article/details/73485518

import numpy as np
import pandas as pd
import time

Define the system parameters

N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy: even when the Q table already holds a (best) value for the current state, there is a 10% chance of picking a random action
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 7   # maximum episodes
FRESH_TIME = 0.01    # fresh time for one move

TERMINAL='bang' # terminal state, set when the agent reaches the treasure at the far right
DEBUG=True # set to True to print more information while debugging

Function that builds the Q table, initialized with zeros

The Q table in this example is structured as follows. The leftmost column is the state; there are 6 states, i.e. the agent can move left and right across 6 cells:

   left  right
0   0.0    0.0
1   0.0    0.0
2   0.0    0.0
3   0.0    0.0
4   0.0    0.0
5   0.0    0.0

def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    print(table)    # show table
    return table

Policy

This is the policy part of reinforcement learning, and here it is very simple: if a uniform random sample is greater than the configured EPSILON, or if all action values for the current state are 0, explore by choosing an action at random; otherwise pick the action with the largest value in the Q table. The goal is to keep improving the action values stored in the Q table.

def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    # If all action values for the current state are 0, choose an action at random
    # If the uniform random sample is > EPSILON, also choose an action at random
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()): 
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.idxmax()
    return action_name
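
A quick sanity check of the epsilon-greedy behaviour (a sketch that assumes the cells above have been run; demo_table and samples are hypothetical names, not part of the original notebook):

# Pretend state 0 already prefers 'right', then see how often it is chosen.
demo_table = build_q_table(N_STATES, ACTIONS)
demo_table.loc[0, 'right'] = 0.5

samples = [choose_action(0, demo_table) for _ in range(1000)]
print('fraction of right choices:', samples.count('right') / len(samples))
# Roughly 0.95: 'right' is chosen greedily 90% of the time, plus about half of the 10% random picks.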

Interacting with the environment

The environment receives the agent's action and executes it, then returns the next state and the corresponding reward. Only when the agent reaches the rightmost cell does the environment hand out a +1 reward; in every other case the reward is 0.

def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    # S_: next state
    # R: reward for action A
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = TERMINAL
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
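
A couple of spot checks of the environment dynamics (illustrative calls only, assuming the definitions above have been run):

print(get_env_feedback(4, 'right'))   # ('bang', 1): stepping onto the treasure ends the episode with reward 1
print(get_env_feedback(0, 'left'))    # (0, 0): bumping into the left wall keeps the agent in state 0
print(get_env_feedback(2, 'right'))   # (3, 0): an ordinary move to the right gives no reward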

Rendering the environment

This is also part of the agent-environment interaction: drawing the current state of the environment.

def update_env(S, episode, step_counter):
    # This is how environment be updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == TERMINAL:
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
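
For example, rendering state 2 looks like this (a hypothetical call, just to show the output format):

update_env(2, 0, 0)   # prints '--o--T': the agent 'o' is in cell 2, the treasure 'T' is at the far right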

Implementing the game

rl = reinforcement learning

Two concepts are worth distinguishing carefully here:

  • q_predict, the Q prediction: the value currently stored for (S, A) in the Q table (its Q value for short), i.e. how valuable taking action A in state S is expected to be. It is read before the environment has received and executed A, so it is a prediction; equivalently, it is the Q value left over from the previous update of (S, A), if there was one.
  • q_target, the Q target (the "actual" Q value after (S, A) has been executed): once the environment has received and executed A and returned S_ (the next state) and R (the reward), q_target can be computed from the Q-Learning update rule shown right after this list. It is called the actual value because at this point A has really been executed, so it is based on what actually happened.
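
For reference, the update implemented in rl() below is the standard tabular Q-Learning rule (the formula itself is not spelled out in the original text):

Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \big]

Here q_predict corresponds to Q(S, A) and q_target to R + \gamma \max_{a} Q(S', a); when S' is terminal, q_target is simply R.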

A diagram helps to make this concrete:

The figure below illustrates the Q-Learning algorithm (see: https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html ):

def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            # The value currently stored for (S, A) in the Q table is the Q prediction, i.e. the current estimate of the value of this (S, A) pair.
            q_predict = q_table.loc[S, A]
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            if S_ != TERMINAL:
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] = q_predict + ALPHA * (q_target - q_predict)  # update
            if DEBUG and q_target != q_predict:
                print(' %s episode,S(%s),A(%s),R(%.6f),S_(%s),q_p(%.6f),q_t(%.6f),q_tab[S,A](%.6f)' % (episode,S,A,R,S_,q_predict,q_target,q_table.loc[S,A]))
                #print(q_table)
            S = S_  # move to next state

            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table

Run the reinforcement-learning training

Unfortunately, I have not yet figured out how to keep refreshing the training progress on a single line in Jupyter; pointers from more experienced readers are welcome. For now you can watch the agent's training by turning on the DEBUG switch.
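
One possible workaround (an untested sketch, not from the original; it assumes the notebook is running under IPython) is to clear the cell output between frames with IPython.display.clear_output, for example in a variant of update_env:

# Hypothetical Jupyter-friendly variant of update_env.
from IPython.display import clear_output

def update_env_notebook(S, episode, step_counter):
    env_list = ['-'] * (N_STATES - 1) + ['T']
    clear_output(wait=True)               # replace the previous frame instead of printing a new line
    if S == TERMINAL:
        print('Episode %s: total_steps = %s' % (episode + 1, step_counter))
        time.sleep(2)
    else:
        env_list[S] = 'o'
        print(''.join(env_list))
        time.sleep(FRESH_TIME)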

q_table = rl()
print('\r\nQ-table after training:\n')
print(q_table)
   left  right
0   0.0    0.0
1   0.0    0.0
2   0.0    0.0
3   0.0    0.0
4   0.0    0.0
5   0.0    0.0
----oT 0 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.000000),q_t(1.000000),q_tab[S,A](0.100000)
---o-T 1 episode,S(3),A(right),R(0.000000),S_(4),q_p(0.000000),q_t(0.090000),q_tab[S,A](0.009000)
----oT 1 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.100000),q_t(1.000000),q_tab[S,A](0.190000)
--o--T 2 episode,S(2),A(right),R(0.000000),S_(3),q_p(0.000000),q_t(0.008100),q_tab[S,A](0.000810)
---o-T 2 episode,S(3),A(right),R(0.000000),S_(4),q_p(0.009000),q_t(0.171000),q_tab[S,A](0.025200)
----oT 2 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.190000),q_t(1.000000),q_tab[S,A](0.271000)
-o---T 3 episode,S(1),A(right),R(0.000000),S_(2),q_p(0.000000),q_t(0.000729),q_tab[S,A](0.000073)
--o--T 3 episode,S(2),A(right),R(0.000000),S_(3),q_p(0.000810),q_t(0.022680),q_tab[S,A](0.002997)
---o-T 3 episode,S(3),A(right),R(0.000000),S_(4),q_p(0.025200),q_t(0.243900),q_tab[S,A](0.047070)
----oT 3 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.271000),q_t(1.000000),q_tab[S,A](0.343900)
o----T 4 episode,S(0),A(right),R(0.000000),S_(1),q_p(0.000000),q_t(0.000066),q_tab[S,A](0.000007)
-o---T 4 episode,S(1),A(right),R(0.000000),S_(2),q_p(0.000073),q_t(0.002697),q_tab[S,A](0.000335)
--o--T 4 episode,S(2),A(right),R(0.000000),S_(3),q_p(0.002997),q_t(0.042363),q_tab[S,A](0.006934)
---o-T 4 episode,S(3),A(right),R(0.000000),S_(4),q_p(0.047070),q_t(0.309510),q_tab[S,A](0.073314)
----oT 4 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.343900),q_t(1.000000),q_tab[S,A](0.409510)
o----T 5 episode,S(0),A(right),R(0.000000),S_(1),q_p(0.000007),q_t(0.000302),q_tab[S,A](0.000036)
-o---T 5 episode,S(1),A(right),R(0.000000),S_(2),q_p(0.000335),q_t(0.006240),q_tab[S,A](0.000926)
--o--T 5 episode,S(2),A(right),R(0.000000),S_(3),q_p(0.006934),q_t(0.065983),q_tab[S,A](0.012839)
---o-T 5 episode,S(3),A(right),R(0.000000),S_(4),q_p(0.073314),q_t(0.368559),q_tab[S,A](0.102839)
----oT 5 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.409510),q_t(1.000000),q_tab[S,A](0.468559)
o----T 6 episode,S(0),A(right),R(0.000000),S_(1),q_p(0.000036),q_t(0.000833),q_tab[S,A](0.000116)
-o---T 6 episode,S(1),A(right),R(0.000000),S_(2),q_p(0.000926),q_t(0.011555),q_tab[S,A](0.001989)
--o--T 6 episode,S(2),A(right),R(0.000000),S_(3),q_p(0.012839),q_t(0.092555),q_tab[S,A](0.020810)
---o-T 6 episode,S(3),A(right),R(0.000000),S_(4),q_p(0.102839),q_t(0.421703),q_tab[S,A](0.134725)
----oT 6 episode,S(4),A(right),R(1.000000),S_(bang),q_p(0.468559),q_t(1.000000),q_tab[S,A](0.521703)
                                
Q-table after training:

   left     right
0   0.0  0.000116
1   0.0  0.001989
2   0.0  0.020810
3   0.0  0.134725
4   0.0  0.521703
5   0.0  0.000000
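
Once training has finished, the greedy policy can be read straight off the table (a hypothetical follow-up, not in the original notebook):

# Best action per state according to the trained Q table.
print(q_table.idxmax(axis=1))
# States 0-4 all prefer 'right', which is indeed the shortest path to the treasure;
# row 5 (the treasure cell itself) was never updated, so both of its values are still 0.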