Implemented Policies

POMDPTools currently provides the following policy types:

  • a wrapper to turn a function into a Policy
  • an alpha vector policy type
  • a random policy
  • a stochastic policy type
  • exploration policies
  • a vector policy type
  • a wrapper to collect statistics and errors about policies

In addition, it provides the showpolicy function for printing policies in a format similar to the way matrices are printed in the REPL, and the evaluate function for evaluating MDP policies.


Function Policy

FunctionPolicy

Wraps a function mapping states to actions into a Policy: when action(p, s) is called on p = FunctionPolicy(f), f(s) is returned.
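
A minimal sketch of the wrapper; the rule here (always choosing :up) is arbitrary:

```julia
using POMDPs, POMDPTools

# a policy that always chooses :up, regardless of the state
always_up = FunctionPolicy(s -> :up)

a = action(always_up, (1, 1))  # the state can be anything the function accepts
```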

Alpha Vector Policy

Represents a policy with a set of alpha vectors (See AlphaVectorPolicy constructor docstring). In addition to finding the optimal action with action, the alpha vectors can be accessed with alphavectors or alphapairs.

Determining the estimated value and optimal action depends on calculating the dot product between alpha vectors and a belief vector. POMDPTools.Policies.beliefvec(pomdp, b) is used to create this vector and can be overridden for new belief types for efficiency.

AlphaVectorPolicy(pomdp::POMDP, alphas, action_map)

Construct a policy from alpha vectors.


  • alphas: an |S| x (number of alpha vecs) matrix or a vector of alpha vectors.

  • action_map: a vector of the actions corresponding to each alpha vector

    AlphaVectorPolicy{P<:POMDP, A}

Represents a policy with a set of alpha vectors.

Use action to get the best action for a belief, and alphavectors and alphapairs to access the underlying alpha vectors.


  • pomdp::P the POMDP problem
  • n_states::Int the number of states in the POMDP
  • alphas::Vector{Vector{Float64}} the list of alpha vectors
  • action_map::Vector{A} a list of actions corresponding to the alpha vectors
POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b)

Return a vector-like representation of the belief b suitable for calculating the dot product with the alpha vectors.
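
As a sketch, a policy for the classic TigerPOMDP from POMDPModels can be built by hand (the alpha-vector values below are illustrative, not the output of a solver):

```julia
using POMDPs, POMDPTools, POMDPModels

m = TigerPOMDP()  # 2 states: tiger behind the left or the right door

# |S| x (number of alpha vectors) matrix; illustrative values only
alphas = [-10.0  100.0;
          100.0  -10.0]
acts = [1, 2]  # TigerPOMDP actions are integers; one action per alpha vector
p = AlphaVectorPolicy(m, alphas, acts)

b = DiscreteBelief(m, [0.5, 0.5])  # uniform belief over the two states
v = value(p, b)                    # max over alpha vectors of dot(alpha, b)
a = action(p, b)                   # action of the maximizing alpha vector
```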


Random Policy

A policy that returns a randomly selected action using rand(rng, actions(pomdp)).

RandomPolicy{RNG<:AbstractRNG, P<:Union{POMDP,MDP}, U<:Updater}

a generic policy that uses the actions function to create a list of actions and then randomly samples an action from it.




  • rng::RNG a random number generator
  • problem::P the POMDP or MDP problem
  • updater::U a belief updater (default to NothingUpdater in the above constructor)
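
For instance (using SimpleGridWorld from POMDPModels as a stand-in problem):

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()
p = RandomPolicy(m)

s = rand(initialstate(m))
a = action(p, s)  # a uniformly sampled element of actions(m)
```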

Stochastic Policies

Types for representing randomized policies:

  • StochasticPolicy samples actions from an arbitrary distribution.
  • UniformRandomPolicy samples actions uniformly (see RandomPolicy for a similar use)
  • CategoricalTabularPolicy samples actions from a categorical distribution with weights given by a ValuePolicy.

StochasticPolicy{D, RNG <: AbstractRNG}

Represents a stochastic policy. Actions are sampled from an arbitrary distribution.


`StochasticPolicy(distribution; rng=Random.default_rng())`


  • distribution::D
  • rng::RNG a random number generator
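
For example, a biased coin over two hypothetical actions, using the SparseCat distribution from POMDPTools:

```julia
using POMDPs, POMDPTools

# choose :left 80% of the time and :right 20% of the time
p = StochasticPolicy(SparseCat([:left, :right], [0.8, 0.2]))

a = action(p, nothing)  # the state argument is ignored; an action is sampled
```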

CategoricalTabularPolicy

represents a stochastic policy sampling an action from a categorical distribution with weights given by a ValuePolicy


CategoricalTabularPolicy(mdp::Union{POMDP,MDP}; rng=Random.default_rng())


  • stochastic::StochasticPolicy
  • value::ValuePolicy

Vector Policies

Tabular policies include the following:

  • VectorPolicy holds a vector of actions, one for each state, ordered according to stateindex.
  • ValuePolicy holds a matrix of values for state-action pairs and chooses the action with the highest value at the given state

VectorPolicy{S,A}

A generic MDP policy that consists of a vector of actions. The entry at stateindex(mdp, s) is the action that will be taken in state s.


  • mdp::MDP{S,A} the MDP problem
  • act::Vector{A} a vector of size |S| mapping state indices to actions
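
A sketch with SimpleGridWorld from POMDPModels (the choice of always moving :up is arbitrary):

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld(size=(2, 2))
# one action per state, ordered according to stateindex(m, s)
acts = fill(:up, length(states(m)))
p = VectorPolicy(m, acts)

s = rand(initialstate(m))
a = action(p, s)  # :up
```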
ValuePolicy{P<:Union{POMDP,MDP}, T<:AbstractMatrix{Float64}, A}

A generic MDP policy that consists of a value table. In state s, the action with the highest value in row stateindex(mdp, s) of the table is taken. It is expected that the order of the actions in the value table is consistent with the order of the actions in act. If act is not explicitly set in the constructor, it is ordered according to actionindex.


  • mdp::P the MDP problem
  • value_table::T the value table as a |S|x|A| matrix
  • act::Vector{A} the possible actions

Value Dict Policy

ValueDictPolicy holds a dictionary of values, where each key is a state-action tuple, and chooses the action with the highest value at the given state. It allows one to write solvers without enumerating state and action spaces, but actions and states must support Base.isequal() and Base.hash().


A generic MDP policy that consists of a Dict storing Q-values for state-action pairs. If there are no entries higher than a default value, this will fall back to a default policy.

Keyword Arguments

  • value_table::AbstractDict the value dict; each key is an (s, a) tuple.
  • default_value::Float64 the default value of value_table.
  • default_policy::Policy the policy taken when no action has a value higher than default_value

Exploration Policies

Exploration policies are often useful in reinforcement learning algorithms for choosing an action that is different from the action given by the policy being learned (the on_policy).

Exploration policies are subtypes of the abstract ExplorationPolicy type and follow the interface action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s). Here k is used to compute the value of the exploration parameter (see Schedule), and s is the current state or observation in which the agent is taking an action.

The action method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four argument version of action where on_policy is the policy being learned (e.g. tabular policy or neural network policy).

This package provides two exploration policies: EpsGreedyPolicy and SoftmaxPolicy.

EpsGreedyPolicy <: ExplorationPolicy

represents an epsilon-greedy policy, sampling a random action with probability eps or returning an action from a given policy otherwise. The evolution of epsilon can be controlled using a schedule. This feature is useful for using these policies in reinforcement learning algorithms.


EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.default_rng(), schedule=ConstantSchedule)

If a function is passed for eps, eps(k) is called to compute the value of epsilon when calling action(exploration_policy, on_policy, k, s).


  • eps::Function
  • rng::AbstractRNG
  • m::M the POMDP or MDP problem

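
For example, wrapping a fixed greedy rule (a stand-in for a learned policy) with epsilon-greedy exploration:

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()
on_policy = FunctionPolicy(s -> :up)  # stand-in for the policy being learned
expl = EpsGreedyPolicy(m, 0.1)        # explore with probability 0.1

k = 1                      # step counter, used when eps is a function of k
s = rand(initialstate(m))
a = action(expl, on_policy, k, s)  # :up with probability 0.9, random otherwise
```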
SoftmaxPolicy <: ExplorationPolicy

represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution broader or narrower.


SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.default_rng())

If a function is passed for temperature, temperature(k) is called to compute the value of the temperature when calling action(exploration_policy, on_policy, k, s)


  • temperature::Function
  • rng::AbstractRNG
  • actions::A an indexable list of actions


Schedule

Exploration policies often rely on a key parameter: $\epsilon$ in $\epsilon$-greedy and the temperature in softmax, for example. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon-greedy policy with an exponential decay schedule as follows:

    m # your mdp or pomdp model
    exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))

POMDPTools exports a linear decay schedule object that can be used as well.


A schedule that linearly decreases a value from start to stop over steps steps; after the decay is finished, the value stays constant at stop.


LinearDecaySchedule(;start, stop, steps)
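
For example, a sketch of epsilon decaying linearly from 1.0 to 0.01 over the first 1000 calls:

```julia
using POMDPs, POMDPTools, POMDPModels

schedule = LinearDecaySchedule(start=1.0, stop=0.01, steps=1000)

m = SimpleGridWorld()
exploration_policy = EpsGreedyPolicy(m, schedule)

eps_early = schedule(1)      # close to 1.0
eps_late = schedule(10_000)  # 0.01 once the decay is finished
```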


Playback Policy

A policy that replays a fixed sequence of actions. When all actions are used, a backup policy is used.

PlaybackPolicy{A<:AbstractArray, P<:Policy, V<:AbstractArray{<:Real}}

a policy that applies a fixed sequence of actions until they are all used and then falls back onto a backup policy until the end of the episode.


`PlaybackPolicy(actions::AbstractArray, backup_policy::Policy; logpdfs::AbstractArray{Float64, 1} = Float64[])`


  • actions::Vector{A} a vector of actions to play back
  • backup_policy::Policy the policy to use when all prescribed actions have been taken but the episode continues
  • logpdfs::Vector{Float64} the log probability (density) of actions
  • i::Int64 the current action index
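
For example, replaying three fixed moves in SimpleGridWorld before handing control to a random policy:

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()
p = PlaybackPolicy([:up, :up, :right], RandomPolicy(m))

s = rand(initialstate(m))
a1 = action(p, s)  # :up   (first prescribed action)
a2 = action(p, s)  # :up
a3 = action(p, s)  # :right
a4 = action(p, s)  # sampled from the backup policy from now on
```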

Utility Wrapper

A wrapper for policies to collect statistics and handle errors.


Flexible utility wrapper for a policy designed for collecting statistics about planning.

Carries a function, a policy, and optionally a payload (that can be any type).

The function should typically be defined with the do syntax. Each time action is called on the wrapper, this function will be called.

If there is no payload, it will be called with two arguments: the policy and the state/belief. If there is a payload, it will be called with three arguments: the policy, the payload, and the current state or belief. The function should return an appropriate action. The idea is that, in this function, action(policy, s) should be called, statistics from the policy/planner should be collected and saved in the payload, exceptions can be handled, and the action should be returned.


PolicyWrapper(policy::Policy; payload=nothing)


using POMDPModels
using POMDPTools

mdp = SimpleGridWorld()
policy = RandomPolicy(mdp)
counts = Dict(a=>0 for a in actions(mdp))

# with a payload
statswrapper = PolicyWrapper(policy, payload=counts) do policy, counts, s
    a = action(policy, s)
    counts[a] += 1
    return a
end

h = simulate(HistoryRecorder(max_steps=100), mdp, statswrapper)
for (a, count) in payload(statswrapper)
    println("policy chose action $a $count of $(n_steps(h)) times.")
end

# without a payload
errwrapper = PolicyWrapper(policy) do policy, s
    a = try
        action(policy, s)
    catch ex
        @warn("Caught error in policy; using default")
        :left
    end
    return a
end

h = simulate(HistoryRecorder(max_steps=100), mdp, errwrapper)


  • f::F
  • policy::P
  • payload::PL

Pretty Printing Policies

showpolicy([io], [mime], m::MDP, p::Policy)
showpolicy([io], [mime], statelist::AbstractVector, p::Policy)
showpolicy(...; pre=" ")

Print the states in m or statelist and the actions from policy p corresponding to those states.

For the MDP version, if io[:limit] is true, will only print enough states to fill the display.
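
For example (printing to a buffer here, but showpolicy(m, p) prints directly to stdout):

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld(size=(2, 2))
p = FunctionPolicy(s -> :up)

buf = IOBuffer()
showpolicy(buf, m, p)  # each state followed by the action p returns for it
output = String(take!(buf))
```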


Policy Evaluation

The evaluate function provides a policy evaluation tool for MDPs:

evaluate(m::MDP, p::Policy)
evaluate(m::MDP, p::Policy; rewardfunction=POMDPs.reward)

Calculate the value for a policy on an MDP using the approach in equation 4.2.2 of Kochenderfer, Decision Making Under Uncertainty, 2015.

Returns a DiscreteValueFunction, which maps states to values.


using POMDPTools, POMDPModels
m = SimpleGridWorld()
u = evaluate(m, FunctionPolicy(x->:left))
u([1,1]) # value of always moving left starting at state [1,1]