Example: Defining an online solver
In this example, we will define a simple online solver that works for both POMDPs and MDPs. In order to focus on the code structure, we will not create an algorithm that finds an optimal policy, but rather one that computes a greedy policy, that is, a policy that maximizes the expected immediate reward. For information on using this solver in a simulation, see Running Simulations.
In order to handle the widest range of problems, we will use @gen to generate Monte Carlo samples to estimate the reward, so the solver works even when only a generative simulator is available. We begin by creating the necessary types and the solve function. The only solver parameter is the number of samples used to estimate the reward at each step, and the solve function does nothing more than create a planner with the appropriate (PO)MDP problem definition.
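Concretely, the planner will choose the action with the highest sample-mean immediate reward; for the current state s (or states drawn from the current belief b in the POMDP case), the greedy rule implemented below is

a^* = \operatorname{argmax}_{a \in \mathcal{A}} \; \frac{1}{N} \sum_{i=1}^{N} r_i,

where N is num_samples and each r_i is an immediate reward sampled from the generative model via @gen.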
using POMDPs

# The solver holds only the algorithm's hyperparameters.
struct MonteCarloGreedySolver <: Solver
    num_samples::Int
end

# The planner (a Policy) produced by solve holds the (PO)MDP model together
# with the solver parameters; all of the work happens online in action.
struct MonteCarloGreedyPlanner{M} <: Policy
    m::M
    num_samples::Int
end

POMDPs.solve(sol::MonteCarloGreedySolver, m) = MonteCarloGreedyPlanner(m, sol.num_samples)
Next, we define the action function, which is where the online work takes place.
MDP Case
function POMDPs.action(p::MonteCarloGreedyPlanner{<:MDP}, s)
    best_reward = -Inf
    local best_action
    for a in actions(p.m)
        # estimate the expected immediate reward of taking a in s with
        # num_samples draws from the generative model
        reward_sum = sum(@gen(:r)(p.m, s, a) for _ in 1:p.num_samples)
        if reward_sum >= best_reward
            best_reward = reward_sum
            best_action = a
        end
    end
    return best_action
end
POMDP Case
function POMDPs.action(p::MonteCarloGreedyPlanner{<:POMDP}, b)
    best_reward = -Inf
    local best_action
    for a in actions(p.m)
        # estimate the expected immediate reward under the belief by drawing
        # a fresh state from b for each generative-model sample
        reward_sum = sum(@gen(:r)(p.m, rand(b), a) for _ in 1:p.num_samples)
        if reward_sum >= best_reward
            best_reward = reward_sum
            best_action = a
        end
    end
    return best_action
end
# output
Verification
We can now verify that the online planner works in some simple cases:
using POMDPModels

# In a 2x1 grid world with a reward of 1.0 at (2,1), the greedy action from
# (1,1) should be to move right.
gw = SimpleGridWorld(size=(2,1), rewards=Dict(GWPos(2,1)=>1.0))
solver = MonteCarloGreedySolver(1000)
planner = solve(solver, gw)

action(planner, GWPos(1,1))
# output
:right
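Because the planner is an ordinary Policy, it can also be plugged directly into the standard simulation tools mentioned at the top of this page. Below is a minimal sketch, assuming POMDPTools is installed, that steps through a short grid world episode with the planner defined above (the trajectory is random, so the output will vary).

using POMDPTools: stepthrough

# Step through up to 5 transitions of the grid world with the greedy planner,
# starting from a state drawn from the model's initial state distribution.
for (s, a, r) in stepthrough(gw, planner, "s,a,r", max_steps=5)
    @show s, a, r
end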
using POMDPModels
using POMDPTools: Deterministic, Uniform

tiger = TigerPOMDP()
solver = MonteCarloGreedySolver(1000)
planner = solve(solver, tiger)

# if the tiger's location is known, the greedy action is to open the other door
@assert action(planner, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT
@assert action(planner, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT
# note: under Uniform(states(tiger)), the greedy action should be TIGER_LISTEN, since listening has
# expected reward -1 while opening either door has expected reward -45, but the choice relies on
# noisy Monte Carlo estimates
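Similarly, the POMDP planner can be run in a full belief-space simulation. The following is a minimal sketch, assuming POMDPTools is available; it rolls out the tiger planner for up to ten steps with a discrete belief updater and returns the discounted reward sum (the value will vary from run to run).

using POMDPTools: RolloutSimulator, DiscreteUpdater

# Simulate the greedy planner on the tiger problem, updating an exact discrete
# belief after each observation; returns the discounted sum of rewards.
sim = RolloutSimulator(max_steps=10)
r_total = simulate(sim, tiger, planner, DiscreteUpdater(tiger))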