Implemented Policies

POMDPTools currently provides the following policy types:

a wrapper to turn a function into a Policy
an alpha vector policy type
a random policy
a stochastic policy type
exploration policies
a vector policy type
a wrapper to collect statistics and errors about policies

In addition, it provides the showpolicy function, which prints policies much like matrices are printed in the REPL, and the evaluate function for evaluating MDP policies.

Function

Wraps a Function mapping states to actions into a Policy.

POMDPTools.Policies.FunctionPolicy — Type

FunctionPolicy

Policy p=FunctionPolicy(f) returns f(x) when action(p, x) is called.

POMDPTools.Policies.FunctionSolver — Type

FunctionSolver

Solver for a FunctionPolicy.

Alpha Vector Policy

Represents a policy with a set of alpha vectors (see the AlphaVectorPolicy constructor docstring). In addition to finding the optimal action with action, the alpha vectors can be accessed with alphavectors or alphapairs.

Determining the estimated value and optimal action depends on calculating the dot product between the alpha vectors and a belief vector. POMDPTools.Policies.beliefvec(pomdp, n_states, b) is used to create this vector and can be overridden for new belief types for efficiency.
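As a rough illustration of the dot-product computation described above, the following plain-Julia sketch picks the action whose alpha vector gives the highest value for a belief vector. It does not call POMDPTools; the two-state problem, the numbers, and the helper name best_action are all hypothetical.

```julia
using LinearAlgebra

# Hypothetical two-state problem: each alpha vector has one entry per state.
alphas = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
action_map = [:left, :right, :listen]   # action associated with each alpha vector

# Return the estimated value and the best action for the belief vector `b`.
function best_action(alphas, action_map, b)
    values = [dot(alpha, b) for alpha in alphas]
    i = argmax(values)
    return values[i], action_map[i]
end

b = [0.5, 0.5]                           # uniform belief over the two states
value, a = best_action(alphas, action_map, b)
println("value = $value, action = $a")  # the third alpha vector wins here, so a is :listen
```

With a belief concentrated on the first state, such as [0.9, 0.1], the first alpha vector dominates and the sketch returns :left instead.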

POMDPTools.Policies.AlphaVectorPolicy — Type

AlphaVectorPolicy(pomdp::POMDP, alphas, action_map)

Construct a policy from alpha vectors.

Arguments

alphas: an |S| x (number of alpha vecs) matrix or a vector of alpha vectors.
action_map: a vector of the actions corresponding to each alpha vector.

AlphaVectorPolicy{P<:POMDP, A}

Represents a policy with a set of alpha vectors.

Use action to get the best action for a belief, and alphavectors and alphapairs to access the alpha vectors and their associated actions.

Fields

pomdp::P the POMDP problem
n_states::Int the number of states in the POMDP
alphas::Vector{Vector{Float64}} the list of alpha vectors
action_map::Vector{A} a list of actions corresponding to the alpha vectors

POMDPTools.Policies.alphavectors — Function

Return the alpha vectors.

POMDPTools.Policies.alphapairs — Function

Return an iterator of alpha vector-action pairs in the policy.

POMDPTools.Policies.beliefvec — Function

POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b)

Return a vector-like representation of the belief b suitable for calculating the dot product with the alpha vectors.

Random Policy

A policy that returns a randomly selected action using rand(rng, actions(pomdp)).

POMDPTools.Policies.RandomPolicy — Type

RandomPolicy{RNG<:AbstractRNG, P<:Union{POMDP,MDP}, U<:Updater}

A generic policy that uses the actions function to create a list of actions and then randomly samples an action from it.

Constructor:

`RandomPolicy(problem::Union{POMDP,MDP};
         rng=Random.default_rng(),
         updater=NothingUpdater())`

Fields

rng::RNG a random number generator
problem::P the POMDP or MDP problem
updater::U a belief updater (defaults to NothingUpdater in the above constructor)

POMDPTools.Policies.RandomSolver — Type

Solver that produces a random policy.

Stochastic Policies

Types for representing randomized policies:

StochasticPolicy samples actions from an arbitrary distribution.
UniformRandomPolicy samples actions uniformly (see RandomPolicy for a similar use).
CategoricalTabularPolicy samples actions from a categorical distribution with weights given by a ValuePolicy.
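To make the idea of sampling from a categorical distribution concrete, here is a minimal plain-Julia sketch. It is not the library's implementation: sample_categorical is a hypothetical helper, and it assumes nonnegative weights with a positive sum. How CategoricalTabularPolicy derives its weights from a ValuePolicy is not shown here.

```julia
using Random

# Draw one item with probability proportional to its weight.
# Assumes nonnegative weights whose sum is positive (hypothetical helper).
function sample_categorical(rng::AbstractRNG, items, weights)
    u = rand(rng) * sum(weights)   # uniform draw in [0, total weight)
    acc = 0.0
    for (item, w) in zip(items, weights)
        acc += w
        if u < acc
            return item
        end
    end
    return last(items)             # guard against floating-point round-off
end

rng = MersenneTwister(0)
a = sample_categorical(rng, [:left, :right, :stay], [0.2, 0.7, 0.1])
println("sampled action: $a")
```

An item with weight zero is never returned, and passing equal weights reduces the sketch to uniform sampling.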

POMDPTools.Policies.StochasticPolicy — Type

StochasticPolicy{D, RNG <: AbstractRNG}

Represents a stochastic policy. Actions are sampled from an arbitrary distribution.

Constructor:

`StochasticPolicy(distribution; rng=Random.default_rng())`

Fields

distribution::D
rng::RNG a random number generator

POMDPTools.Policies.CategoricalTabularPolicy — Type

CategoricalTabularPolicy

Represents a stochastic policy that samples an action from a categorical distribution with weights given by a ValuePolicy.

Constructor:

CategoricalTabularPolicy(mdp::Union{POMDP,MDP}; rng=Random.default_rng())

Fields

stochastic::StochasticPolicy
value::ValuePolicy

Vector Policies

Tabular policies including the following:

VectorPolicy holds a vector of actions, one for each state, ordered according to stateindex.
ValuePolicy holds a matrix of values for state-action pairs and chooses the action with the highest value at the given state
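The two lookups described above can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: the states, actions, and tables are made up, and the dictionary stateindex_of is a hypothetical stand-in for stateindex.

```julia
# Hypothetical three-state, two-action problem.
states = [:s1, :s2, :s3]
acts = [:left, :right]
stateindex_of = Dict(s => i for (i, s) in enumerate(states))  # stand-in for stateindex

# VectorPolicy-style lookup: one stored action per state index.
act_vector = [:left, :right, :left]
vector_action(s) = act_vector[stateindex_of[s]]

# ValuePolicy-style lookup: an |S| x |A| table, pick the highest value in the row of s.
value_table = [1.0  0.5;
               0.2  0.9;
               0.0 -1.0]
value_action(s) = acts[argmax(value_table[stateindex_of[s], :])]

println(vector_action(:s2))  # stored action for the second state
println(value_action(:s2))   # action with the highest value in the second row
```

The column order of value_table has to match the order of acts, which mirrors the consistency requirement between the value table and the action list.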

POMDPTools.Policies.VectorPolicy — Type

VectorPolicy{S,A}

A generic MDP policy that consists of a vector of actions. The entry at stateindex(mdp, s) is the action that will be taken in state s.

Fields

mdp::MDP{S,A} the MDP problem
act::Vector{A} a vector of size |S| mapping state indices to actions

POMDPTools.Policies.VectorSolver — Type

VectorSolver{A}

Solver for VectorPolicy. Doesn't do any computation - just sets the action vector.

Fields

act::Vector{A} the action vector

POMDPTools.Policies.ValuePolicy — Type

ValuePolicy{P<:Union{POMDP,MDP}, T<:AbstractMatrix{Float64}, A}

A generic MDP policy that consists of a value table. The row at stateindex(mdp, s) holds the values of the actions in state s, and the action with the highest value is taken. It is expected that the order of the actions in the value table is consistent with the order of the actions in act. If act is not explicitly set in the construction, act is ordered according to actionindex.

Fields

mdp::P the MDP problem
value_table::T the value table as a |S|x|A| matrix
act::Vector{A} the possible actions

Value Dict Policy

ValueDictPolicy holds a dictionary of values, where each key is a state-action tuple, and chooses the action with the highest value at the given state. It allows one to write solvers without enumerating the state and action spaces, but actions and states must support Base.isequal() and Base.hash().
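The dictionary lookup can be pictured with the plain-Julia sketch below. It is not the library's code: dict_action is a hypothetical helper, and a single default_action stands in for a fallback policy.

```julia
# Pick the action with the highest stored value for state `s`.
# Missing (s, a) keys count as `default_value`; if nothing beats it,
# return `default_action` (a stand-in for a fallback policy).
function dict_action(q::AbstractDict, s, available_actions, default_value, default_action)
    best_a = default_action
    best_v = default_value
    for a in available_actions
        v = get(q, (s, a), default_value)
        if v > best_v
            best_v, best_a = v, a
        end
    end
    return best_a
end

# Hypothetical Q-values; only visited state-action pairs are stored.
q = Dict((:s1, :left) => 0.2, (:s1, :right) => 0.7)

println(dict_action(q, :s1, [:left, :right], 0.0, :stay))  # a stored value beats the default
println(dict_action(q, :s2, [:left, :right], 0.0, :stay))  # unseen state, falls back
```

Because the keys are tuples in a Dict, the states and actions only need working isequal and hash methods, with no index into an enumerated space.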

POMDPTools.Policies.ValueDictPolicy — Type

ValueDictPolicy(mdp)

A generic MDP policy that consists of a Dict storing Q-values for state-action pairs. If there are no entries higher than a default value, this will fall back to a default policy.

Keyword Arguments

value_table::AbstractDict the value dict, keyed by (s, a) tuples.
default_value::Float64 the default value of value_table.
default_policy::Policy the policy used when no action has a value higher than default_value.

Exploration Policies

Exploration policies are often useful in reinforcement learning algorithms for choosing an action that differs from the action given by the policy being learned (on_policy).

Exploration policies are subtypes of the abstract ExplorationPolicy type and implement the following interface: action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s). k is used to compute the value of the exploration parameter (see Schedule), and s is the current state or observation in which the agent is taking an action.

The action method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four-argument version of action, where on_policy is the policy being learned (e.g. a tabular policy or a neural network policy).

This package provides two exploration policies: EpsGreedyPolicy and SoftmaxPolicy
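The two exploration rules can be sketched in plain Julia as shown below. This is a sketch under assumptions, not the package's implementation: eps_greedy and softmax_probs are hypothetical helpers, and the greedy action and action values are passed in directly instead of being queried from an on_policy.

```julia
using Random

# Epsilon-greedy: with probability epsilon(k) pick a uniformly random action,
# otherwise keep the action proposed by the policy being learned.
function eps_greedy(rng::AbstractRNG, epsilon::Function, k, greedy_action, available_actions)
    return rand(rng) < epsilon(k) ? rand(rng, available_actions) : greedy_action
end

# Softmax: turn action values into sampling probabilities.
# Subtracting the maximum leaves the result unchanged and avoids overflow.
function softmax_probs(values, temperature)
    z = exp.((values .- maximum(values)) ./ temperature)
    return z ./ sum(z)
end

rng = MersenneTwister(1)
a = eps_greedy(rng, k -> 0.05 * 0.9^(k / 10), 20, :up, [:up, :down, :left, :right])
p = softmax_probs([1.0, 2.0, 0.5], 0.5)
println("action: $a, softmax probabilities: $p")
```

A lower temperature concentrates the probability mass on the highest-valued action, while a higher temperature moves the distribution toward uniform.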

POMDPTools.Policies.EpsGreedyPolicy — Type

EpsGreedyPolicy <: ExplorationPolicy

Represents an epsilon-greedy policy, which samples a random action with probability eps and otherwise returns an action from a given policy. The evolution of epsilon can be controlled using a schedule, which is useful when using these policies in reinforcement learning algorithms.

Constructor:

EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.default_rng(), schedule=ConstantSchedule)

If a function is passed for eps, eps(k) is called to compute the value of epsilon when calling action(exploration_policy, on_policy, k, s).

Fields

eps::Function
rng::AbstractRNG
m::M the POMDP or MDP problem

POMDPTools.Policies.SoftmaxPolicy — Type

SoftmaxPolicy <: ExplorationPolicy

Represents a softmax policy, which samples a random action according to a softmax function. The softmax function converts the action values of on_policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution more or less wide.

Constructor

SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.default_rng())

If a function is passed for temperature, temperature(k) is called to compute the value of the temperature when calling action(exploration_policy, on_policy, k, s).

Fields

temperature::Function
rng::AbstractRNG
actions::A an indexable list of actions

Schedule

Exploration policies often rely on a key parameter, for example $\epsilon$ in $\epsilon$-greedy or the temperature in softmax. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon-greedy policy with an exponential decay schedule as follows:

    m # your mdp or pomdp model
    exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))

POMDPTools exports a linear decay schedule object that can be used as well.

POMDPTools.Policies.LinearDecaySchedule — Type

LinearDecaySchedule

A schedule that linearly decreases a value from start to stop in steps steps. Once the value reaches stop, it stays constant.

Constructor

LinearDecaySchedule(;start, stop, steps)
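The linear decay described here can be written as a one-line formula. The plain-Julia sketch below is an assumption about the intended behavior, not the package's source: linear_decay is a hypothetical function, and it assumes start is greater than stop.

```julia
# Decrease linearly from `start` at k = 0 to `stop` at k = steps,
# then hold the value at `stop`. Assumes start > stop.
function linear_decay(k; start, stop, steps)
    rate = (start - stop) / steps
    return max(stop, start - k * rate)
end

# A schedule is just a function of the step counter k.
eps_schedule = k -> linear_decay(k; start=1.0, stop=0.1, steps=100)

for k in (0, 50, 100, 1000)
    println("k = $k, epsilon = $(eps_schedule(k))")
end
```

Since a schedule is an ordinary function of k, a closure like eps_schedule has the same shape as the exponential decay function shown earlier in this section.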

Playback Policy

A policy that replays a fixed sequence of actions. When all actions are used, a backup policy is used.
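The replay-then-fallback behavior can be illustrated with a small plain-Julia sketch. The Playback struct and next_action! function are hypothetical names chosen for this illustration, and a plain function stands in for the backup policy.

```julia
# Holds the recorded actions and the index of the next one to replay.
mutable struct Playback{A}
    actions::Vector{A}
    i::Int
end

# Replay the next recorded action; once they are exhausted, ask the backup.
function next_action!(p::Playback, backup::Function, s)
    if p.i <= length(p.actions)
        a = p.actions[p.i]
        p.i += 1
        return a
    end
    return backup(s)
end

p = Playback([:up, :up, :right], 1)
backup = s -> :stay                  # hypothetical fallback rule
chosen = [next_action!(p, backup, nothing) for _ in 1:5]
println(chosen)                      # three replayed actions, then the fallback twice
```

The index is mutable state, so a fresh object (or a reset index) is needed before replaying the same sequence in another episode.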

POMDPTools.Policies.PlaybackPolicy — Type

PlaybackPolicy{A<:AbstractArray, P<:Policy, V<:AbstractArray{<:Real}}

A policy that applies a fixed sequence of actions until they are all used and then falls back onto a backup policy until the end of the episode.

Constructor:

`PlaybackPolicy(actions::AbstractArray, backup_policy::Policy; logpdfs::AbstractArray{Float64, 1} = Float64[])`

Fields

actions::Vector{A} a vector of actions to play back
backup_policy::Policy the policy to use when all prescribed actions have been taken but the episode continues
logpdfs::Vector{Float64} the log probability (density) of actions
i::Int64 the current action index

Utility Wrapper

A wrapper for policies to collect statistics and handle errors.

POMDPTools.Policies.PolicyWrapper — Type

PolicyWrapper

Flexible utility wrapper for a policy designed for collecting statistics about planning.

Carries a function, a policy, and optionally a payload (that can be any type).

The function should typically be defined with the do syntax. Each time action is called on the wrapper, this function will be called.

If there is no payload, it will be called with two arguments: the policy and the state/belief. If there is a payload, it will be called with three arguments: the policy, the payload, and the current state or belief. The function should return an appropriate action. The idea is that, in this function, action(policy, s) should be called, statistics from the policy/planner should be collected and saved in the payload, exceptions can be handled, and the action should be returned.

Constructor

PolicyWrapper(policy::Policy; payload=nothing)

Example

using POMDPs
using POMDPTools
using POMDPModels

mdp = SimpleGridWorld()
policy = RandomPolicy(mdp)
counts = Dict(a=>0 for a in actions(mdp))

# with a payload
statswrapper = PolicyWrapper(policy, payload=counts) do policy, counts, s
    a = action(policy, s)
    counts[a] += 1
    return a
end

h = simulate(HistoryRecorder(max_steps=100), mdp, statswrapper)
for (a, count) in payload(statswrapper)
    println("policy chose action $a $count of $(n_steps(h)) times.")
end

# without a payload
errwrapper = PolicyWrapper(policy) do policy, s
    try
        # return from inside the block: a variable first assigned in `try` is not visible after `end`
        return action(policy, s)
    catch ex
        @warn("Caught error in policy; using default")
        return :left
    end
end

h = simulate(HistoryRecorder(max_steps=100), mdp, errwrapper)

Fields

f::F
policy::P
payload::PL

Pretty Printing Policies

POMDPTools.Policies.showpolicy — Function

showpolicy([io], [mime], m::MDP, p::Policy)
showpolicy([io], [mime], statelist::AbstractVector, p::Policy)
showpolicy(...; pre=" ")

Print the states in m or statelist and the actions from policy p corresponding to those states.

For the MDP version, if io[:limit] is true, will only print enough states to fill the display.

Policy Evaluation

The evaluate function provides a policy evaluation tool for MDPs:

POMDPTools.Policies.evaluate — Function

evaluate(m::MDP, p::Policy)
evaluate(m::MDP, p::Policy; rewardfunction=POMDPs.reward)

Calculate the value for a policy on an MDP using the approach in equation 4.2.2 of Kochenderfer, Decision Making Under Uncertainty, 2015.

Returns a DiscreteValueFunction, which maps states to values.

Example

using POMDPTools, POMDPModels
m = SimpleGridWorld()
u = evaluate(m, FunctionPolicy(x->:left))
u([1,1]) # value of always moving left starting at state [1,1]
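For intuition about what policy evaluation computes, the plain-Julia sketch below solves the linear system for a fixed policy on a tiny discrete MDP. It uses the standard closed form U = r + gamma * T * U; whether evaluate performs exactly this computation is an assumption, and the two-state transition matrix, rewards, and discount are made up.

```julia
using LinearAlgebra

# Hypothetical two-state MDP under a fixed policy.
T = [0.9 0.1;
     0.0 1.0]        # T[s, sp]: probability of moving from s to sp under the policy
r = [1.0, 0.0]       # expected immediate reward in each state under the policy
gamma = 0.95         # discount factor

# Policy evaluation: solve U = r + gamma * T * U, i.e. (I - gamma * T) U = r.
U = (I - gamma * T) \ r

println("state values: $U")
```

The second state is absorbing with zero reward, so its value is zero, and the first state's value is 1 / (1 - 0.95 * 0.9).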
