Implemented Policies
POMDPTools currently provides the following policy types:
- a wrapper to turn a function into a `Policy`
- an alpha vector policy type
- a random policy
- a stochastic policy type
- exploration policies
- a vector policy type
- a wrapper to collect statistics and errors about policies
In addition, it provides the `showpolicy` function for printing policies similar to the way that matrices are printed in the REPL, and the `evaluate` function for evaluating MDP policies.
Function
Wraps a `Function` mapping states to actions into a `Policy`.
POMDPTools.Policies.FunctionPolicy — Type
`FunctionPolicy`
Policy `p = FunctionPolicy(f)` returns `f(x)` when `action(p, x)` is called.
POMDPTools.Policies.FunctionSolver — Type
`FunctionSolver`
Solver for a FunctionPolicy.
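For example, the following minimal sketch wraps an anonymous function into a policy that always chooses `:left` (the state value `1` is just a stand-in):

```julia
using POMDPs, POMDPTools

# a policy that returns :left for every state or belief passed to it
p = FunctionPolicy(s -> :left)

action(p, 1)  # returns :left; the argument is ignored by this particular function
```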
Alpha Vector Policy
Represents a policy with a set of alpha vectors (see the `AlphaVectorPolicy` constructor docstring). In addition to finding the optimal action with `action`, the alpha vectors can be accessed with `alphavectors` or `alphapairs`.
Determining the estimated value and optimal action depends on calculating the dot product between alpha vectors and a belief vector. `POMDPTools.Policies.beliefvec(pomdp, b)` is used to create this vector and can be overridden for new belief types for efficiency.
POMDPTools.Policies.AlphaVectorPolicy — Type
`AlphaVectorPolicy(pomdp::POMDP, alphas, action_map)`
Construct a policy from alpha vectors.
Arguments
- `alphas`: an |S| x (number of alpha vecs) matrix or a vector of alpha vectors
- `action_map`: a vector of the actions corresponding to each alpha vector
`AlphaVectorPolicy{P<:POMDP, A}`
Represents a policy with a set of alpha vectors.
Use `action` to get the best action for a belief, and `alphavectors` and `alphapairs` to access the alpha vectors and their associated actions.
Fields
- `pomdp::P`: the POMDP problem
- `n_states::Int`: the number of states in the POMDP
- `alphas::Vector{Vector{Float64}}`: the list of alpha vectors
- `action_map::Vector{A}`: a list of actions corresponding to the alpha vectors
POMDPTools.Policies.alphavectors — Function
Return the alpha vectors.
POMDPTools.Policies.alphapairs — Function
Return an iterator of alpha vector-action pairs in the policy.
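As a rough sketch, the following constructs an alpha vector policy for the two-state `BabyPOMDP` from POMDPModels with two hand-made alpha vectors. The numbers are illustrative only (not the output of a solver), and a plain probability vector ordered by `stateindex` is assumed to be accepted as the belief:

```julia
using POMDPs, POMDPTools, POMDPModels

pomdp = BabyPOMDP()                     # two states, two actions (feed or don't feed)

# two hand-made alpha vectors, one associated with each action (values are made up)
alphas = [[-10.0, 0.0], [-25.0, -5.0]]
acts = [true, false]

policy = AlphaVectorPolicy(pomdp, alphas, acts)

b = [0.2, 0.8]                          # belief as a probability vector ordered by stateindex
action(policy, b)                       # action of the alpha vector maximizing dot(alpha, b)
value(policy, b)                        # the corresponding value estimate
collect(alphapairs(policy))             # alpha-vector => action pairs
```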
POMDPTools.Policies.beliefvec — Function
`POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b)`
Return a vector-like representation of the belief `b` suitable for calculating the dot product with the alpha vectors.
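If you define a custom belief type, you can add a `beliefvec` method for it so that alpha vector policies can evaluate it efficiently. A hypothetical sketch (`MyBelief` is not part of POMDPTools):

```julia
using POMDPs, POMDPTools

# a hypothetical belief type that already stores its probabilities as a vector
struct MyBelief
    probs::Vector{Float64}   # one entry per state, ordered by stateindex
end

# let alpha vector policies use the stored vector directly, avoiding any conversion
POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b::MyBelief) = b.probs
```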
Random Policy
A policy that returns a randomly selected action using `rand(rng, actions(pomdp))`.
POMDPTools.Policies.RandomPolicy — Type
`RandomPolicy{RNG<:AbstractRNG, P<:Union{POMDP,MDP}, U<:Updater}`
A generic policy that uses the actions function to create a list of actions and then randomly samples an action from it.
Constructor:
`RandomPolicy(problem::Union{POMDP,MDP}; rng=Random.default_rng(), updater=NothingUpdater())`
Fields
- `rng::RNG`: a random number generator
- `problem::P`: the POMDP or MDP problem
- `updater::U`: a belief updater (defaults to `NothingUpdater` in the above constructor)
POMDPTools.Policies.RandomSolver — Type
Solver that produces a random policy.
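A quick sketch of constructing and querying a random policy; `GWPos` is the grid-world state type from POMDPModels, and the state passed to `action` does not affect the sampled action here:

```julia
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
p = RandomPolicy(mdp)

action(p, GWPos(1, 1))   # a randomly sampled element of actions(mdp)
```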
Stochastic Policies
Types for representing randomized policies:
- `StochasticPolicy` samples actions from an arbitrary distribution.
- `UniformRandomPolicy` samples actions uniformly (see `RandomPolicy` for a similar use).
- `CategoricalTabularPolicy` samples actions from a categorical distribution with weights given by a `ValuePolicy`.
POMDPTools.Policies.StochasticPolicy — Type
`StochasticPolicy{D, RNG <: AbstractRNG}`
Represents a stochastic policy. Actions are sampled from an arbitrary distribution.
Constructor:
`StochasticPolicy(distribution; rng=Random.default_rng())`
Fields
- `distribution::D`
- `rng::RNG`: a random number generator
POMDPTools.Policies.CategoricalTabularPolicy — Type
`CategoricalTabularPolicy`
Represents a stochastic policy that samples an action from a categorical distribution with weights given by a `ValuePolicy`.
Constructor:
`CategoricalTabularPolicy(mdp::Union{POMDP,MDP}; rng=Random.default_rng())`
Fields
- `stochastic::StochasticPolicy`
- `value::ValuePolicy`
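For instance, a minimal sketch of a state-independent `StochasticPolicy` that samples `:left` 70% of the time and `:right` 30% of the time (any distribution supporting `rand` could be used; the distribution and probabilities here are made up):

```julia
using POMDPs, POMDPTools
using Random

# sample :left 70% of the time and :right 30% of the time; the state is not used
d = SparseCat([:left, :right], [0.7, 0.3])
p = StochasticPolicy(d; rng=MersenneTwister(1))

action(p, nothing)   # draws an action from d
```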
Vector Policies
Tabular policies including the following:
- `VectorPolicy` holds a vector of actions, one for each state, ordered according to `stateindex`.
- `ValuePolicy` holds a matrix of values for state-action pairs and chooses the action with the highest value at the given state.
POMDPTools.Policies.VectorPolicy — Type
`VectorPolicy{S,A}`
A generic MDP policy that consists of a vector of actions. The entry at `stateindex(mdp, s)` is the action that will be taken in state `s`.
Fields
- `mdp::MDP{S,A}`: the MDP problem
- `act::Vector{A}`: a vector of size |S| mapping state indices to actions
POMDPTools.Policies.VectorSolver — Type
`VectorSolver{A}`
Solver for VectorPolicy. Doesn't do any computation - just sets the action vector.
Fields
- `act::Vector{A}`: the action vector
POMDPTools.Policies.ValuePolicy — Type
`ValuePolicy{P<:Union{POMDP,MDP}, T<:AbstractMatrix{Float64}, A}`
A generic MDP policy that consists of a value table. The action with the highest value in the row at `stateindex(mdp, s)` will be taken in state `s`. It is expected that the order of the actions in the value table is consistent with the order of the actions in `act`. If `act` is not explicitly set in the construction, `act` is ordered according to `actionindex`.
Fields
- `mdp::P`: the MDP problem
- `value_table::T`: the value table as a |S|x|A| matrix
- `act::Vector{A}`: the possible actions
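A small sketch of a `VectorPolicy` for a tiny grid world, assigning the same action to every state (a real solver would of course compute these; `GWPos` is the grid-world state type from POMDPModels):

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld(size=(2, 2))

# one action per state, ordered by stateindex(m, s)
acts = fill(:right, length(states(m)))
p = VectorPolicy(m, acts)

action(p, GWPos(1, 1))   # :right
```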
Value Dict Policy
`ValueDictPolicy` holds a dictionary of values, where the key is a state-action tuple, and chooses the action with the highest value at the given state. It allows one to write solvers without enumerating state and action spaces, but actions and states must support `Base.isequal()` and `Base.hash()`.
POMDPTools.Policies.ValueDictPolicy — Type
`ValueDictPolicy(mdp)`
A generic MDP policy that consists of a `Dict` storing Q-values for state-action pairs. If there are no entries higher than a default value, this will fall back to a default policy.
Keyword Arguments
- `value_table::AbstractDict`: the value dict; the key is a (s, a) Tuple
- `default_value::Float64`: the default value of `value_dict`
- `default_policy::Policy`: the policy taken when no action has a value higher than `default_value`
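As a sketch, the Q-values below are written by hand just to show the lookup behavior, using the `value_table` keyword from the docstring above; states without entries fall back to the default policy:

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()

# hand-written Q-values for a single state (illustrative only)
q = Dict((GWPos(1, 1), :right) => 1.0,
         (GWPos(1, 1), :up)    => 0.5)

p = ValueDictPolicy(m; value_table=q)

action(p, GWPos(1, 1))   # :right, the highest-valued action stored for this state
```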
Exploration Policies
Exploration policies are often useful for reinforcement learning algorithms to choose an action that is different from the action given by the policy being learned (`on_policy`).
Exploration policies are subtypes of the abstract `ExplorationPolicy` type and implement the following interface: `action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s)`. `k` is used to compute the value of the exploration parameter (see Schedule), and `s` is the current state or observation in which the agent is taking an action.
The `action` method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four-argument version of `action`, where `on_policy` is the policy being learned (e.g. a tabular policy or a neural network policy).
This package provides two exploration policies: `EpsGreedyPolicy` and `SoftmaxPolicy`.
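A minimal sketch of the four-argument `action` call, using a `FunctionPolicy` as a stand-in for the policy being learned (the grid world and `GWPos` state are just placeholders):

```julia
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
on_policy = FunctionPolicy(s -> :right)   # stand-in for the policy being learned
explorer = EpsGreedyPolicy(mdp, 0.1)      # explore with probability 0.1

k = 1                                     # step counter used by the schedule
s = GWPos(1, 1)
action(explorer, on_policy, k, s)         # :right most of the time, a random action otherwise
```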
POMDPTools.Policies.EpsGreedyPolicy — Type
`EpsGreedyPolicy <: ExplorationPolicy`
Represents an epsilon-greedy policy, sampling a random action with probability `eps` or returning an action from a given policy otherwise. The evolution of epsilon can be controlled using a schedule. This feature is useful for using these policies in reinforcement learning algorithms.
Constructor:
`EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.default_rng(), schedule=ConstantSchedule)`
If a function is passed for `eps`, `eps(k)` is called to compute the value of epsilon when calling `action(exploration_policy, on_policy, k, s)`.
Fields
- `eps::Function`
- `rng::AbstractRNG`
- `m::M`: the POMDP or MDP problem
POMDPTools.Policies.SoftmaxPolicy — Type
`SoftmaxPolicy <: ExplorationPolicy`
Represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on-policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution more or less wide.
Constructor
`SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.default_rng())`
If a function is passed for `temperature`, `temperature(k)` is called to compute the value of the temperature when calling `action(exploration_policy, on_policy, k, s)`.
Fields
- `temperature::Function`
- `rng::AbstractRNG`
- `actions::A`: an indexable list of actions
Schedule
Exploration policies often rely on a key parameter: $\epsilon$ in $\epsilon$-greedy and the temperature in softmax, for example. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon-greedy policy with an exponential decay schedule as follows:
m # your mdp or pomdp model
exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))
POMDPTools exports a linear decay schedule object that can be used as well.
POMDPTools.Policies.LinearDecaySchedule — Type
`LinearDecaySchedule`
A schedule that linearly decreases a value from `start` to `stop` over `steps` steps. Once the value reaches `stop`, it stays constant.
Constructor
`LinearDecaySchedule(;start, stop, steps)`
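For example, assuming the schedule object is callable like a function of `k` and can therefore be passed anywhere `eps(k)` is expected, epsilon can be decayed linearly like this:

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()

# decay epsilon from 1.0 to 0.05 over the first 1000 steps, then hold it at 0.05
schedule = LinearDecaySchedule(start=1.0, stop=0.05, steps=1000)
exploration_policy = EpsGreedyPolicy(m, schedule)
```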
Playback Policy
A policy that replays a fixed sequence of actions. After all of the prescribed actions have been played, a backup policy is used.
POMDPTools.Policies.PlaybackPolicy — Type
`PlaybackPolicy{A<:AbstractArray, P<:Policy, V<:AbstractArray{<:Real}}`
A policy that applies a fixed sequence of actions until they are all used and then falls back onto a backup policy until the end of the episode.
Constructor:
`PlaybackPolicy(actions::AbstractArray, backup_policy::Policy; logpdfs::AbstractArray{Float64, 1} = Float64[])`
Fields
- `actions::Vector{A}`: a vector of actions to play back
- `backup_policy::Policy`: the policy to use when all prescribed actions have been taken but the episode continues
- `logpdfs::Vector{Float64}`: the log probability (density) of actions
- `i::Int64`: the current action index
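A short sketch: play two fixed actions and then defer to a random backup policy. The policy keeps an internal index, so repeated `action` calls advance through the sequence (the grid world and actions are just placeholders):

```julia
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
p = PlaybackPolicy([:up, :up], RandomPolicy(mdp))

s = GWPos(1, 1)
action(p, s)   # :up (first prescribed action)
action(p, s)   # :up (second prescribed action)
action(p, s)   # now sampled from the random backup policy
```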
Utility Wrapper
A wrapper for policies to collect statistics and handle errors.
POMDPTools.Policies.PolicyWrapper — Type
`PolicyWrapper`
Flexible utility wrapper for a policy designed for collecting statistics about planning.
Carries a function, a policy, and optionally a payload (that can be any type).
The function should typically be defined with the do syntax. Each time `action` is called on the wrapper, this function will be called.
If there is no payload, it will be called with two arguments: the policy and the state/belief. If there is a payload, it will be called with three arguments: the policy, the payload, and the current state or belief. The function should return an appropriate action. The idea is that, in this function, `action(policy, s)` should be called, statistics from the policy/planner should be collected and saved in the payload, exceptions can be handled, and the action should be returned.
Constructor
PolicyWrapper(policy::Policy; payload=nothing)
Example
using POMDPModels
using POMDPTools
mdp = SimpleGridWorld()
policy = RandomPolicy(mdp)
counts = Dict(a=>0 for a in actions(mdp))
# with a payload
statswrapper = PolicyWrapper(policy, payload=counts) do policy, counts, s
a = action(policy, s)
counts[a] += 1
return a
end
h = simulate(HistoryRecorder(max_steps=100), mdp, statswrapper)
for (a, count) in payload(statswrapper)
println("policy chose action $a $count of $(n_steps(h)) times.")
end
# without a payload
errwrapper = PolicyWrapper(policy) do policy, s
try
a = action(policy, s)
catch ex
@warn("Caught error in policy; using default")
a = :left
end
return a
end
h = simulate(HistoryRecorder(max_steps=100), mdp, errwrapper)
Fields
- `f::F`
- `policy::P`
- `payload::PL`
Pretty Printing Policies
POMDPTools.Policies.showpolicy — Function
`showpolicy([io], [mime], m::MDP, p::Policy)`
`showpolicy([io], [mime], statelist::AbstractVector, p::Policy)`
`showpolicy(...; pre=" ")`
Print the states in `m` or `statelist` and the actions from policy `p` corresponding to those states.
For the MDP version, if `io[:limit]` is `true`, only enough states to fill the display will be printed.
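For example, a small sketch printing the policy of a tiny grid world (the policy here is an arbitrary `FunctionPolicy`):

```julia
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld(size=(2, 2))
p = FunctionPolicy(s -> :left)

showpolicy(m, p)   # prints each state together with the action chosen by p
```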
Policy Evaluation
The `evaluate` function provides a policy evaluation tool for MDPs:
POMDPTools.Policies.evaluate — Function
`evaluate(m::MDP, p::Policy)`
`evaluate(m::MDP, p::Policy; rewardfunction=POMDPs.reward)`
Calculate the value for a policy on an MDP using the approach in equation 4.2.2 of Kochenderfer, Decision Making Under Uncertainty, 2015.
Returns a DiscreteValueFunction, which maps states to values.
Example
using POMDPTools, POMDPModels
m = SimpleGridWorld()
u = evaluate(m, FunctionPolicy(x->:left))
u([1,1]) # value of always moving left starting at state [1,1]