The Gambler's Problem
A gambler wins or loses money on flips of a coin. If the coin comes up heads, he wins an amount equal to his stake; if it comes up tails, he loses the stake. The game ends when the gambler reaches 100 yuan or loses all of his capital. The task: for different probabilities of heads and different amounts of starting capital, find the probability that the gambler reaches 100 yuan (perhaps a limit on the number of rounds should also be added, say at most 100 coin flips).
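To make the setup concrete, the game itself takes only a few lines of Python. The sketch below is merely an illustration of the rules; the names `simulate_episode` and `stake_fn` are my own and do not appear in the implementation later in this post.

import random

def simulate_episode(capital, p_h, stake_fn, goal=100):
    """Play one game: flip a coin each round until the gambler hits 0 or `goal`."""
    while 0 < capital < goal:
        stake = stake_fn(capital)    # how much to bet at the current capital
        if random.random() < p_h:    # heads: win an amount equal to the stake
            capital += stake
        else:                        # tails: lose the stake
            capital -= stake
    return capital == goal           # True if the 100-yuan goal was reached

# e.g. always bet everything that is still useful (never more than is needed)
print(simulate_episode(10, 0.4, lambda c: min(c, 100 - c)))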
Problem Analysis
Clearly, different amounts of capital call for different strategies. If the gambler has only 1 yuan, the only way he can ever reach 100 yuan is to stake everything. But as his capital grows, he has to think about how much of it to put at risk. If he already holds 51 yuan, for instance, staking all 51 is clearly wrong: if he wins, he overshoots the goal of winning exactly 100 yuan, and if he loses, he is left with nothing. So betting all 51 yuan is a poor strategy.
Let the state space s be {1, 2, ..., 99}, the gambler's current capital. Let the action space a be {0, 1, 2, ..., min(s, 100-s)}, the stake the gambler puts down. The problem then becomes: for each amount of capital, which (optimal) betting policy maximizes the gambler's chance of reaching 100 yuan? This is a policy-optimization problem, and value iteration can be used to find the optimal policy.
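As a quick illustration of these two sets, here is a small sketch; the names `states` and `actions` are mine and are not reused by the implementation below.

states = range(1, 100)  # the gambler's capital s = 1 .. 99

def actions(s):
    # legal stakes: 0 .. min(s, 100 - s) -- never more than you hold,
    # and never more than you still need to reach 100
    return range(0, min(s, 100 - s) + 1)

print(list(actions(1)))   # with 1 yuan the only real choice is to go all-in: [0, 1]
print(list(actions(51)))  # stakes run only up to 49, so "bet all 51" is not even a legal action
print(list(actions(98)))  # [0, 1, 2]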
Code Implementation
The code mainly follows https://github.com/dennybritz/reinforcement-learning . It is written in a straight-line style (not object-oriented), so it is fairly easy to follow. Two points matter most:
- The value-iteration procedure: first greedily update the value of every state until the value function converges, and only then compute the optimal policy in a single pass.
- Understanding eq. 4.10 in the Sutton book is essential: under the greedy update, the value of the current state is the largest of its action values (take a current capital of 98 yuan as an example). The equation is listed again below:
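For reference, eq. 4.10 of the Sutton book is the value-iteration update

$$v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr].$$

In this problem each stake $a$ has only two possible outcomes (heads or tails), so the update that the code implements reduces to

$$V(s) \leftarrow \max_{1 \le a \le \min(s,\,100-s)} \Bigl\{ p_h\,\bigl[R(s+a) + \gamma V(s+a)\bigr] + (1-p_h)\,\bigl[R(s-a) + \gamma V(s-a)\bigr] \Bigr\}.$$

With a capital of s = 98, for example, the only candidate stakes are a ∈ {1, 2}: staking 2 reaches the goal on heads and collects the only non-zero reward R(100) = 1, while staking 1 can only move to state 97 or 99; whichever of the two action values is larger becomes the new V(98).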
import numpy as np
import matplotlib.pyplot as plt


def value_iteration_for_gamblers(p_h, theta=0.0001, gamma=1.0):
    """
    Args:
        p_h: Probability of the coin coming up heads
        theta: stopping threshold for the iteration
        gamma: discount factor
    """
    # The reward is zero on all transitions except those on which the gambler reaches his goal,
    # when it is +1.
    R = np.zeros(101)
    R[100] = 1

    # We introduce two dummy states corresponding to termination with capital of 0 and 100
    V = np.zeros(101)

    def one_step_lookahead(s, V, R):
        """
        Helper function to calculate the value for all actions in a given state.

        Args:
            s: The gambler's capital (the current state). Integer.
            V: The vector that contains values at each state.
            R: The reward vector.

        Returns:
            A vector containing the expected value of each action.
            Its length equals the number of actions.
        """
        A = np.zeros(101)
        stakes = range(1, min(s, 100 - s) + 1)  # Your minimum bet is 1, maximum bet is min(s, 100-s).
        for a in stakes:
            # R[s+a], R[s-a] are immediate rewards.
            # V[s+a], V[s-a] are values of the next states.
            # This is the core of the Bellman equation: the expected value of your action is
            # the sum of immediate rewards and the value of the next state.
            A[a] = p_h * (R[s + a] + V[s + a] * gamma) + (1 - p_h) * (
                R[s - a] + V[s - a] * gamma)
        return A
    # How many sweeps over the state space have been done so far
    sweep = 0
    while True:
        print("=============sweep=" + str(sweep) + "==============")
        # Stopping condition
        delta = 0
        # Update each state...
        for s in range(1, 100):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V, R)
            # Verbose debugging output; comment out the prints below if they are too noisy.
            print("s=")
            print(s)
            print("A=")
            print(A)
            print("V=")
            print(V)
            print("R=")
            print(R)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function. Ref: Sutton book eq. 4.10.
            V[s] = best_action_value
        sweep = sweep + 1
        # Plot the state-value curve after every sweep to watch how the estimates evolve
        plt.plot(range(100), V[:100])
        # Check if we can stop
        if delta < theta:
            plt.xlabel('Capital')
            plt.ylabel('Value Estimates')
            plt.title('Value Estimates vs. State (Capital), p_h=' + str(p_h))
            plt.show()
            break
    # Create a deterministic policy using the optimal value function
    policy = np.zeros(100)
    for s in range(1, 100):
        # One step lookahead to find the best action for this state
        A = one_step_lookahead(s, V, R)
        best_action = np.argmax(A)
        # Always take the best action
        policy[s] = best_action

    return policy, V
def draw_value_estimates(p_h, v):
    x = range(100)
    y = v[:100]
    plt.plot(x, y)
    plt.xlabel('Capital')
    plt.ylabel('Value Estimates')
    plt.title('Value Estimates vs. State (Capital), p_h=' + str(p_h))
    plt.show()


def draw_policy(p_h, policy):
    x = range(100)
    y = policy
    plt.bar(x, y, align='center', alpha=0.5)
    plt.xlabel('Capital')
    plt.ylabel('Final policy (stake)')
    plt.title('Final Policy (stake) vs. State (Capital), p_h=' + str(p_h))
    plt.show()
if __name__ == '__main__':
    for p_h in (0.4, 0.55):
        policy, v = value_iteration_for_gamblers(p_h)

        print("Optimized Policy (p_h=" + str(p_h) + "):")
        print(policy)
        print("")

        print("Optimized Value Function (p_h=" + str(p_h) + "):")
        print(v)
        print("")

        # draw_value_estimates(p_h, v)
        draw_policy(p_h, policy)
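Once the script above has run (or its definitions have been imported), the returned arrays can be queried directly: `policy[s]` is the stake to place when the current capital is s, and because gamma = 1 and the only reward is the +1 at the goal, `v[s]` is the estimated probability of eventually reaching 100 from that capital under the optimal policy. A small usage sketch, assuming the definitions above are in scope:

policy, v = value_iteration_for_gamblers(0.4)
for s in (1, 25, 50, 75, 99):
    print("capital:", s, "stake:", int(policy[s]), "value:", round(float(v[s]), 4))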
When p_h = 0.4, the evolution of the state-value function over the sweeps is shown in the figure below:
The optimal policy is shown in the figure below:
When p_h = 0.55, the evolution of the state-value function over the sweeps is shown in the figure below:
The optimal policy is shown in the figure below:
As the figures show, when the probability of heads is 0.55 the optimal policy is to stake just 1 yuan almost everywhere, which seems baffling. Could the computation be wrong?
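One way to sanity-check this result without re-deriving the value iteration: if the gambler always stakes exactly 1 yuan, the game is the classic gambler's-ruin random walk, whose success probability has a well-known closed form, (1 - r^s) / (1 - r^100) with r = (1 - p_h) / p_h. The sketch below (my addition; `ruin_success_prob` does not appear in the code above) evaluates it for p_h = 0.55.

def ruin_success_prob(capital, p_h, goal=100):
    """P(reach `goal` before going broke) when staking 1 yuan per round with win probability p_h."""
    if p_h == 0.5:
        return capital / goal   # fair coin: success probability is linear in the capital
    r = (1 - p_h) / p_h         # ratio of the losing to the winning probability
    return (1 - r ** capital) / (1 - r ** goal)

for s in (1, 10, 25, 50, 75, 99):
    print(s, ruin_success_prob(s, 0.55))

Because the coin is favourable (p_h > 0.5), these probabilities climb towards 1 quickly as the capital grows, so the timid "always stake 1" policy is consistent with maximizing the chance of reaching 100; the surprising output does not by itself indicate a bug.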