Example: Defining an offline solver

In this example, we will define a simple offline solver that works for both POMDPs and MDPs. To keep the focus on the code structure, the solver will not find an optimal policy, but rather a greedy policy, that is, one that maximizes the expected immediate reward. For information on using this solver in a simulation, see Running Simulations.
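
Concretely, for an MDP with transition probabilities T(s′ | s, a) and rewards R(s, a, s′), the greedy policy chooses π(s) = argmax_a Σ_{s′} T(s′ | s, a) R(s, a, s′); for a POMDP, the same quantity is maximized in expectation over the current belief.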

We begin by creating a solver type. Since there are no adjustable parameters for the solver, it is an empty type, but for a more complex solver, parameters would usually be included as type fields.

using POMDPs

struct GreedyOfflineSolver <: Solver end
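
For illustration, a hypothetical solver with tunable parameters might look like the following sketch (the name and fields are made up for this example):

struct MyIterativeSolver <: Solver
    max_iterations::Int # hypothetical iteration limit
    tolerance::Float64  # hypothetical convergence threshold
end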

Next, we define the functions that will make the solver work for both MDPs and POMDPs.

MDP Case

Finding a greedy policy for an MDP consists of determining the action that has the best reward for each state. First, we create a simple policy object that holds a greedy action for each state.

struct DictPolicy{S,A} <: Policy
    actions::Dict{S,A}
end

POMDPs.action(p::DictPolicy, s) = p.actions[s]
Note

A POMDPTools.VectorPolicy could be used here. We include this example to show how to define a custom policy.

The solve function calculates the best greedy action for each state and saves it in a policy. To have the widest possible compatibility with problem implementations, we use reward(m, s, a, sp) rather than reward(m, s, a), which means we must compute the expectation of the reward over transitions to every possible next state.

function POMDPs.solve(::GreedyOfflineSolver, m::MDP)

    best_actions = Dict{statetype(m), actiontype(m)}()

    for s in states(m)
        if !isterminal(m, s)
            best = -Inf
            for a in actions(m)
                # expected immediate reward of taking a in s
                td = transition(m, s, a)
                r = 0.0
                for sp in support(td)
                    r += pdf(td, sp) * reward(m, s, a, sp)
                end
                if r >= best
                    best_actions[s] = a
                    best = r
                end
            end
        end
    end
    
    return DictPolicy(best_actions)
end
Note

This implementation sticks to the basic POMDPs.jl interface functions, but tools such as POMDPTools.StateActionReward, POMDPTools.ordered_states, and POMDPTools.weighted_iterator could be used for a more concise and efficient implementation.
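
For example, a more compact version of the MDP case might look something like the sketch below. It assumes POMDPTools is available, relies on Julia 1.7's two-argument argmax, defines a small expected_reward helper just for this example, and arbitrarily assigns the first action to terminal states since their action is never used:

using POMDPTools: ordered_states, weighted_iterator, VectorPolicy

# helper for this sketch: expected immediate reward of taking a in s
expected_reward(m, s, a) = sum(p*reward(m, s, a, sp) for (sp, p) in weighted_iterator(transition(m, s, a)))

function greedy_mdp_policy(m::MDP)
    acts = [isterminal(m, s) ? first(actions(m)) : argmax(a -> expected_reward(m, s, a), actions(m))
            for s in ordered_states(m)]
    return VectorPolicy(m, acts) # actions indexed by stateindex(m, s)
end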

We can now verify that the policy produces the greedy action on a small example from POMDPModels:

using POMDPModels

gw = SimpleGridWorld(size=(2,1), rewards=Dict(GWPos(2,1)=>1.0))
policy = solve(GreedyOfflineSolver(), gw)

action(policy, GWPos(1,1))

# output

:right
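
As a quick preview of the Running Simulations workflow, a policy like this can also be evaluated with a simulator from POMDPTools. A minimal sketch (RolloutSimulator returns the discounted reward of a single simulated episode):

using POMDPTools: RolloutSimulator

sim = RolloutSimulator(max_steps=10)
r = simulate(sim, gw, policy, GWPos(1,1)) # discounted reward of one rollout starting at (1,1)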

POMDP Case

For a POMDP, the greedy solution is the action that maximizes the expected immediate reward under the current belief. Since there are infinitely many possible beliefs, the greedy action cannot be precomputed for every belief offline. However, the greedy policy can be represented as an alpha vector policy in which each action has an associated alpha vector whose entries are the expected immediate reward for taking that action in the corresponding state.
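
Concretely, if α_a denotes the alpha vector for action a, the greedy action at a belief b is the one maximizing Σ_s b(s) α_a[s], so storing one vector per action is enough to represent the policy over all beliefs.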

Again, because a POMDP may define reward(m, s, a, sp, o) rather than reward(m, s, a), we use the former and calculate the expectation over all next states and observations.
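
With T(s′ | s, a) the transition probability and Z(o | s, a, s′) the observation probability, the entry computed below for each state s is α_a[s] = Σ_{s′} T(s′ | s, a) Σ_o Z(o | s, a, s′) R(s, a, s′, o).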

using POMDPTools: AlphaVectorPolicy

function POMDPs.solve(::GreedyOfflineSolver, m::POMDP)

    alphas = Vector{Float64}[]

    for a in actions(m)
        alpha = zeros(length(states(m)))
        for s in states(m)
            if !isterminal(m, s)
                # expected immediate reward of taking a in s, over next states and observations
                r = 0.0
                td = transition(m, s, a)
                for sp in support(td)
                    tp = pdf(td, sp)
                    od = observation(m, s, a, sp)
                    for o in support(od)
                        r += tp * pdf(od, o) * reward(m, s, a, sp, o)
                    end
                end
                alpha[stateindex(m, s)] = r
            end
        end
        push!(alphas, alpha)
    end
    
    return AlphaVectorPolicy(m, alphas, collect(actions(m)))
end

We can now verify that a policy created by the solver determines the correct greedy actions:

using POMDPModels
using POMDPTools: Deterministic, Uniform

tiger = TigerPOMDP()
policy = solve(GreedyOfflineSolver(), tiger)

@assert action(policy, Deterministic(TIGER_LEFT)) == TIGER_OPEN_RIGHT
@assert action(policy, Deterministic(TIGER_RIGHT)) == TIGER_OPEN_LEFT
@assert action(policy, Uniform(states(tiger))) == TIGER_LISTEN
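
As in the MDP case, this policy can be plugged into a simulation; the only difference is that a belief updater is needed. A minimal sketch using tools from POMDPTools:

using POMDPTools: RolloutSimulator, DiscreteUpdater

up = DiscreteUpdater(tiger) # maintains a discrete belief over the tiger's location
r = simulate(RolloutSimulator(max_steps=10), tiger, policy, up) # discounted reward of one episode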