Example: Defining an online solver

In this example, we will define a simple online solver that works for both POMDPs and MDPs. In order to focus on the code structure, the algorithm will not compute an optimal policy, but rather a greedy policy, that is, one that maximizes the expected immediate reward. For information on using this solver in a simulation, see Running Simulations.
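
Concretely, at a state s (or with states drawn from a belief b in the POMDP case), the greedy planner developed below selects

a^*(s) = \arg\max_{a \in A} \mathbb{E}[R(s, a)] \approx \arg\max_{a \in A} \frac{1}{N} \sum_{i=1}^{N} r_i

where N is the number of Monte Carlo samples (the num_samples solver parameter below) and r_i is the i-th sampled immediate reward for action a. The code sums rather than averages the samples; since N is the same for every action, the selected action is unchanged.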

In order to handle the widest range of problems, we will use @gen to generate Monte Carlo samples to estimate the reward, so the solver works even if only a simulator (generative model) is available. We begin by creating the necessary types and the solve function. The only solver parameter is the number of samples used to estimate the reward at each step, and the solve function does nothing more than wrap the (PO)MDP problem definition and that parameter in a planner.

using POMDPs

# The solver type holds the algorithm parameters: here, only the number of Monte
# Carlo samples used to estimate the immediate reward of each action.
struct MonteCarloGreedySolver <: Solver
    num_samples::Int
end

# The planner is the online Policy returned by solve; it stores the (PO)MDP model
# together with the solver parameters so they are available at decision time.
struct MonteCarloGreedyPlanner{M} <: Policy
    m::M
    num_samples::Int
end

POMDPs.solve(sol::MonteCarloGreedySolver, m) = MonteCarloGreedyPlanner(m, sol.num_samples)
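
To make the role of @gen concrete, here is a minimal standalone sketch (it assumes the POMDPModels package is available and uses SimpleGridWorld purely for illustration) that draws a single immediate-reward sample from a model's generative interface; the planner repeats this call num_samples times per action and sums the results:

using POMDPs
using POMDPModels # assumed available; provides SimpleGridWorld and GWPos

example_gw = SimpleGridWorld()
r = @gen(:r)(example_gw, GWPos(1, 1), :right) # one sampled immediate reward for action :right in state (1,1)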

Next, we define the action function where the online work takes place.

MDP Case

function POMDPs.action(p::MonteCarloGreedyPlanner{<:MDP}, s)
    best_reward = -Inf
    local best_action
    for a in actions(p.m)
        # Estimate the immediate reward of action a with num_samples generative samples.
        reward_sum = sum(@gen(:r)(p.m, s, a) for _ in 1:p.num_samples)
        if reward_sum >= best_reward
            best_reward = reward_sum
            best_action = a
        end
    end
    return best_action
end

POMDP Case

function POMDPs.action(p::MonteCarloGreedyPlanner{<:POMDP}, b)
    best_reward = -Inf
    local best_action
    for a in actions(p.m)
        # Estimate the expected immediate reward under belief b by drawing a fresh
        # state from b for each of the num_samples generative samples.
        reward_sum = sum(@gen(:r)(p.m, rand(b), a) for _ in 1:p.num_samples)
        if reward_sum >= best_reward
            best_reward = reward_sum
            best_action = a
        end
    end
    return best_action
end

# output

Verification

We can now verify that the online planner works in some simple cases:

using POMDPModels

gw = SimpleGridWorld(size=(2,1), rewards=Dict(GWPos(2,1)=>1.0))
solver = MonteCarloGreedySolver(1000)
planner = solve(solver, gw)

action(planner, GWPos(1,1))

# output

:right

using POMDPModels
using POMDPTools: Deterministic, Uniform

tiger = TigerPOMDP()
solver = MonteCarloGreedySolver(1000)

planner = solve(solver, tiger)

@assert action(planner, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT
@assert action(planner, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT
# note: action(planner, Uniform(states(tiger))) relies on noisy Monte Carlo estimates
# of the open-door rewards, so it may be unreliable with a small number of samples
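
For intuition about the uniform-belief case: with the default TigerPOMDP rewards (-1 for listening, +10 for escaping, -100 for finding the tiger), the expected immediate reward under a uniform belief is -1 for listening and 0.5 * 10 + 0.5 * (-100) = -45 for opening either door, so with a large enough sample budget the greedy planner should settle on TIGER_LISTEN. The sketch below is illustrative only: the sample count of 100_000 is an arbitrary choice, and the final loop shows how the planner, being an ordinary Policy, plugs into the simulation tools described in Running Simulations.

using POMDPTools: stepthrough

# With a larger (arbitrarily chosen) sample budget, the uniform-belief estimate
# should favor listening:
precise_planner = solve(MonteCarloGreedySolver(100_000), tiger)
action(precise_planner, Uniform(states(tiger))) # expected to be TIGER_LISTEN

# The planner is an ordinary Policy, so it works directly with the simulation tools,
# for example stepping through a small grid world MDP:
sim_m = SimpleGridWorld()
sim_planner = solve(MonteCarloGreedySolver(100), sim_m)
for (s, a, r) in stepthrough(sim_m, sim_planner, "s,a,r", max_steps=5)
    @show s a r
end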