# Implemented Policies

POMDPTools currently provides the following policy types:

- a wrapper to turn a function into a `Policy`
- an alpha vector policy type
- a random policy
- a stochastic policy type
- exploration policies
- a vector policy type
- a wrapper to collect statistics and errors about policies

In addition, it provides the `showpolicy` function for printing policies similar to the way that matrices are printed in the REPL, and the `evaluate` function for evaluating MDP policies.

## Function

Wraps a `Function` mapping states to actions into a `Policy`.

`POMDPTools.Policies.FunctionPolicy` — Type

`FunctionPolicy`

Policy `p = FunctionPolicy(f)` returns `f(x)` when `action(p, x)` is called.

`POMDPTools.Policies.FunctionSolver` — Type

`FunctionSolver`

Solver for a `FunctionPolicy`.
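For instance, a policy that always moves up can be written as a one-liner (a minimal sketch; `SimpleGridWorld` and its `:up` action come from POMDPModels and are used here only for illustration):

```
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()
always_up = FunctionPolicy(s -> :up)  # f maps a state to an action

s = first(states(m))
action(always_up, s)  # returns :up regardless of the state
```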

## Alpha Vector Policy

Represents a policy with a set of alpha vectors (see the `AlphaVectorPolicy` constructor docstring). In addition to finding the optimal action with `action`, the alpha vectors can be accessed with `alphavectors` or `alphapairs`.

Determining the estimated value and optimal action depends on calculating the dot product between alpha vectors and a belief vector. `POMDPTools.Policies.beliefvec(pomdp, b)` is used to create this vector and can be overridden for new belief types for efficiency.

`POMDPTools.Policies.AlphaVectorPolicy` — Type

`AlphaVectorPolicy(pomdp::POMDP, alphas, action_map)`

Construct a policy from alpha vectors.

**Arguments**

- `alphas`: an |S| x (number of alpha vecs) matrix or a vector of alpha vectors.
- `action_map`: a vector of the actions corresponding to each alpha vector

`AlphaVectorPolicy{P<:POMDP, A}`

Represents a policy with a set of alpha vectors.

Use `action` to get the best action for a belief, and `alphavectors` and `alphapairs` to access the alpha vectors.

**Fields**

- `pomdp::P` the POMDP problem
- `n_states::Int` the number of states in the POMDP
- `alphas::Vector{Vector{Float64}}` the list of alpha vectors
- `action_map::Vector{A}` a list of actions corresponding to the alpha vectors

`POMDPTools.Policies.alphavectors` — Function

Return the alpha vectors.

`POMDPTools.Policies.alphapairs` — Function

Return an iterator of alpha vector-action pairs in the policy.

`POMDPTools.Policies.beliefvec` — Function

`POMDPTools.Policies.beliefvec(m::POMDP, n_states::Int, b)`

Return a vector-like representation of the belief `b` suitable for calculating the dot product with the alpha vectors.
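As an illustration, the sketch below builds a policy from two hand-written alpha vectors for the classic tiger problem (the alpha vector values and the `action_map` entries are made up for illustration, not the output of any solver):

```
using POMDPs, POMDPTools, POMDPModels

pomdp = TigerPOMDP()  # 2 states, so each alpha vector has 2 entries

# two hypothetical alpha vectors, each paired with an action (here, Int actions 1 and 2)
alphas = [[10.0, -5.0], [-5.0, 10.0]]
policy = AlphaVectorPolicy(pomdp, alphas, [1, 2])

b = uniform_belief(pomdp)  # uniform belief over the two states
a = action(policy, b)      # action of the maximizing alpha vector
v = value(policy, b)       # maximum dot product of an alpha vector with b
```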

## Random Policy

A policy that returns a randomly selected action using `rand(rng, actions(pomdp))`.

`POMDPTools.Policies.RandomPolicy` — Type

`RandomPolicy{RNG<:AbstractRNG, P<:Union{POMDP,MDP}, U<:Updater}`

A generic policy that uses the `actions` function to create a list of actions and then randomly samples an action from it.

Constructor:

```
RandomPolicy(problem::Union{POMDP,MDP};
             rng=Random.default_rng(),
             updater=NothingUpdater())
```

**Fields**

- `rng::RNG` a random number generator
- `problem::P` the POMDP or MDP problem
- `updater::U` a belief updater (defaults to `NothingUpdater` in the above constructor)

`POMDPTools.Policies.RandomSolver` — Type

Solver that produces a random policy.
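A quick sketch of both the policy and the solver (assuming `SimpleGridWorld` from POMDPModels):

```
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
p = RandomPolicy(mdp)

s = first(states(mdp))
action(p, s)  # a uniformly random action for s

p2 = solve(RandomSolver(), mdp)  # the solver simply constructs a RandomPolicy
```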

## Stochastic Policies

Types for representing randomized policies:

- `StochasticPolicy` samples actions from an arbitrary distribution.
- `UniformRandomPolicy` samples actions uniformly (see `RandomPolicy` for a similar use).
- `CategoricalTabularPolicy` samples actions from a categorical distribution with weights given by a `ValuePolicy`.

`POMDPTools.Policies.StochasticPolicy` — Type

`StochasticPolicy{D, RNG <: AbstractRNG}`

Represents a stochastic policy. Actions are sampled from an arbitrary distribution.

Constructor:

`StochasticPolicy(distribution; rng=Random.default_rng())`

**Fields**

- `distribution::D`
- `rng::RNG` a random number generator

`POMDPTools.Policies.CategoricalTabularPolicy` — Type

`CategoricalTabularPolicy`

Represents a stochastic policy sampling an action from a categorical distribution with weights given by a `ValuePolicy`.

Constructor:

`CategoricalTabularPolicy(mdp::Union{POMDP,MDP}; rng=Random.default_rng())`

**Fields**

- `stochastic::StochasticPolicy`
- `value::ValuePolicy`
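For example, a policy that mostly moves up but sometimes moves right could be sketched with a `SparseCat` distribution over actions (a sketch assuming `SimpleGridWorld`; the weights are arbitrary, and any distribution supporting `rand` should work):

```
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
dist = SparseCat([:up, :right], [0.8, 0.2])  # distribution over actions
p = StochasticPolicy(dist)

s = first(states(mdp))
action(p, s)  # samples :up or :right, ignoring the state
```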

## Vector Policies

Tabular policies include the following:

- `VectorPolicy` holds a vector of actions, one for each state, ordered according to `stateindex`.
- `ValuePolicy` holds a matrix of values for state-action pairs and chooses the action with the highest value at the given state.

`POMDPTools.Policies.VectorPolicy` — Type

`VectorPolicy{S,A}`

A generic MDP policy that consists of a vector of actions. The entry at `stateindex(mdp, s)` is the action that will be taken in state `s`.

**Fields**

- `mdp::MDP{S,A}` the MDP problem
- `act::Vector{A}` a vector of size |S| mapping state indices to actions

`POMDPTools.Policies.VectorSolver` — Type

`VectorSolver{A}`

Solver for `VectorPolicy`. Doesn't do any computation - just sets the action vector.

**Fields**

- `act::Vector{A}` the action vector

`POMDPTools.Policies.ValuePolicy` — Type

`ValuePolicy{P<:Union{POMDP,MDP}, T<:AbstractMatrix{Float64}, A}`

A generic MDP policy that consists of a value table. The entry at `stateindex(mdp, s)` is the action that will be taken in state `s`. It is expected that the order of the actions in the value table is consistent with the order of the actions in `act`. If `act` is not explicitly set in the construction, `act` is ordered according to `actionindex`.

**Fields**

- `mdp::P` the MDP problem
- `value_table::T` the value table as a |S|x|A| matrix
- `act::Vector{A}` the possible actions
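A minimal sketch of constructing a `VectorPolicy` by hand (assuming `SimpleGridWorld`; a real solver would compute the action vector rather than filling it with a constant):

```
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()

# one action per state, indexed by stateindex(mdp, s)
acts = fill(:up, length(states(mdp)))
p = VectorPolicy(mdp, acts)

s = first(states(mdp))
action(p, s)  # :up, since every entry of the vector is :up
```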

## Value Dict Policy

`ValueDictPolicy` holds a dictionary of values, where the key is a state-action tuple, and chooses the action with the highest value at the given state. It allows one to write solvers without enumerating state and action spaces, but actions and states must support `Base.isequal()` and `Base.hash()`.

`POMDPTools.Policies.ValueDictPolicy` — Type

`ValueDictPolicy(mdp)`

A generic MDP policy that consists of a `Dict` storing Q-values for state-action pairs. If there are no entries higher than a default value, this will fall back to a default policy.

**Keyword Arguments**

- `value_table::AbstractDict` the value dict; the key is an (s, a) Tuple.
- `default_value::Float64` the default value of `value_dict`.
- `default_policy::Policy` the policy taken when no action has a value higher than `default_value`

## Exploration Policies

Exploration policies are often useful for reinforcement learning algorithms to choose an action that is different than the action given by the policy being learned (`on_policy`).

Exploration policies are subtypes of the abstract `ExplorationPolicy` type and follow the interface `action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s)`. `k` is used to compute the value of the exploration parameter (see Schedule), and `s` is the current state or observation in which the agent is taking an action.

The `action` method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four-argument version of `action` where `on_policy` is the policy being learned (e.g. a tabular policy or a neural network policy).

This package provides two exploration policies: `EpsGreedyPolicy` and `SoftmaxPolicy`.

`POMDPTools.Policies.EpsGreedyPolicy` — Type

`EpsGreedyPolicy <: ExplorationPolicy`

Represents an epsilon-greedy policy, sampling a random action with probability `eps` or returning an action from a given policy otherwise. The evolution of epsilon can be controlled using a schedule. This feature is useful for using those policies in reinforcement learning algorithms.

**Constructor:**

`EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.default_rng(), schedule=ConstantSchedule)`

If a function is passed for `eps`, `eps(k)` is called to compute the value of epsilon when calling `action(exploration_policy, on_policy, k, s)`.

**Fields**

- `eps::Function`
- `rng::AbstractRNG`
- `m::M` the POMDP or MDP problem

`POMDPTools.Policies.SoftmaxPolicy` — Type

`SoftmaxPolicy <: ExplorationPolicy`

Represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on-policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution more or less wide.

**Constructor**

`SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.default_rng())`

If a function is passed for `temperature`, `temperature(k)` is called to compute the value of the temperature when calling `action(exploration_policy, on_policy, k, s)`.

**Fields**

- `temperature::Function`
- `rng::AbstractRNG`
- `actions::A` an indexable list of actions
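To make the four-argument interface concrete, here is a minimal sketch using `EpsGreedyPolicy` with a fixed on-policy (the `FunctionPolicy` is a stand-in; in a real RL loop `on_policy` would be the policy being learned and `k` the training step):

```
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
on_policy = FunctionPolicy(s -> :up)  # stand-in for the learned policy
expl = EpsGreedyPolicy(mdp, 0.1)      # explore with probability 0.1

s = first(states(mdp))
k = 1                                 # step counter, used by schedules
a = action(expl, on_policy, k, s)     # :up, or a random action with probability 0.1
```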

### Schedule

Exploration policies often rely on a key parameter: $\epsilon$ in $\epsilon$-greedy and the temperature in softmax, for example. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon-greedy policy with an exponential decay schedule as follows:

```
m # your mdp or pomdp model
exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))
```

`POMDPTools` exports a linear decay schedule object that can be used as well.

`POMDPTools.Policies.LinearDecaySchedule` — Type

`LinearDecaySchedule`

A schedule that linearly decreases a value from `start` to `stop` in `steps` steps. Once the value reaches `stop`, it stays constant.

**Constructor**

`LinearDecaySchedule(;start, stop, steps)`
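For instance (a sketch; it assumes a `LinearDecaySchedule` is callable with the step counter `k` and can be passed anywhere an `eps` function is accepted):

```
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
schedule = LinearDecaySchedule(start=1.0, stop=0.05, steps=1000)
expl = EpsGreedyPolicy(mdp, schedule)  # epsilon decays as k grows

schedule(1)       # near 1.0 early in training
schedule(10_000)  # stays at 0.05 after `steps` steps
```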

## Playback Policy

A policy that replays a fixed sequence of actions. When all actions are used, a backup policy is used.

`POMDPTools.Policies.PlaybackPolicy` — Type

`PlaybackPolicy{A<:AbstractArray, P<:Policy, V<:AbstractArray{<:Real}}`

A policy that applies a fixed sequence of actions until they are all used and then falls back onto a backup policy until the end of the episode.

Constructor:

`PlaybackPolicy(actions::AbstractArray, backup_policy::Policy; logpdfs::AbstractArray{Float64, 1} = Float64[])`

**Fields**

- `actions::Vector{A}` a vector of actions to play back
- `backup_policy::Policy` the policy to use when all prescribed actions have been taken but the episode continues
- `logpdfs::Vector{Float64}` the log probability (density) of actions
- `i::Int64` the current action index
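A minimal sketch (assuming `SimpleGridWorld`): the first three calls to `action` replay the scripted moves, and later calls defer to the random backup policy.

```
using POMDPs, POMDPTools, POMDPModels

mdp = SimpleGridWorld()
backup = RandomPolicy(mdp)
p = PlaybackPolicy([:up, :up, :right], backup)

s = first(states(mdp))
action(p, s)  # :up (first scripted action)
action(p, s)  # :up
action(p, s)  # :right
action(p, s)  # a random action from the backup policy
```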

## Utility Wrapper

A wrapper for policies to collect statistics and handle errors.

`POMDPTools.Policies.PolicyWrapper` — Type

`PolicyWrapper`

Flexible utility wrapper for a policy designed for collecting statistics about planning.

Carries a function, a policy, and optionally a payload (that can be any type).

The function should typically be defined with the do syntax. Each time `action` is called on the wrapper, this function will be called.

If there is no payload, it will be called with two arguments: the policy and the state/belief. If there is a payload, it will be called with three arguments: the policy, the payload, and the current state or belief. The function should return an appropriate action. The idea is that, in this function, `action(policy, s)` should be called, statistics from the policy/planner should be collected and saved in the payload, exceptions can be handled, and the action should be returned.

Constructor

`PolicyWrapper(policy::Policy; payload=nothing)`

**Example**

```
using POMDPs
using POMDPModels
using POMDPTools

mdp = SimpleGridWorld()
policy = RandomPolicy(mdp)
counts = Dict(a => 0 for a in actions(mdp))

# with a payload
statswrapper = PolicyWrapper(policy, payload=counts) do policy, counts, s
    a = action(policy, s)
    counts[a] += 1
    return a
end

h = simulate(HistoryRecorder(max_steps=100), mdp, statswrapper)
for (a, count) in payload(statswrapper)
    println("policy chose action $a $count of $(length(h)) times.")
end

# without a payload
errwrapper = PolicyWrapper(policy) do policy, s
    a = try
        action(policy, s)
    catch ex
        @warn("Caught error in policy; using default")
        :left
    end
    return a
end

h = simulate(HistoryRecorder(max_steps=100), mdp, errwrapper)
```

**Fields**

- `f::F`
- `policy::P`
- `payload::PL`

## Pretty Printing Policies

`POMDPTools.Policies.showpolicy` — Function

```
showpolicy([io], [mime], m::MDP, p::Policy)
showpolicy([io], [mime], statelist::AbstractVector, p::Policy)
showpolicy(...; pre=" ")
```

Print the states in `m` or `statelist` and the actions from policy `p` corresponding to those states.

For the MDP version, if `io[:limit]` is `true`, only enough states to fill the display will be printed.
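For instance (a sketch assuming `SimpleGridWorld`):

```
using POMDPs, POMDPTools, POMDPModels

m = SimpleGridWorld()
p = FunctionPolicy(s -> :up)
showpolicy(m, p)  # prints each state alongside the action the policy chooses
```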

# Policy Evaluation

The `evaluate` function provides a policy evaluation tool for MDPs:

`POMDPTools.Policies.evaluate` — Function

```
evaluate(m::MDP, p::Policy)
evaluate(m::MDP, p::Policy; rewardfunction=POMDPs.reward)
```

Calculate the value for a policy on an MDP using the approach in equation 4.2.2 of Kochenderfer, *Decision Making Under Uncertainty*, 2015.

Returns a `DiscreteValueFunction`, which maps states to values.

**Example**

```
using POMDPTools, POMDPModels
m = SimpleGridWorld()
u = evaluate(m, FunctionPolicy(x->:left))
u([1,1]) # value of always moving left starting at state [1,1]
```