Exploration Policies
Exploration policies are often useful in reinforcement learning algorithms to choose an action that is different from the action given by the policy being learned (the on_policy).
Exploration policies are subtypes of the abstract ExplorationPolicy type and follow this interface: action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s). The argument k is used to compute the value of the exploration parameter (see Schedule), and s is the current state or observation in which the agent is taking an action.
The action method is exported by POMDPs.jl. To use exploration policies in a solver, you must use the four-argument version of action, where on_policy is the policy being learned (e.g. a tabular policy or a neural network policy).
This package provides two exploration policies: EpsGreedyPolicy and SoftmaxPolicy.
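As a minimal sketch of this interface (assuming SimpleGridWorld from POMDPModels as the problem and a placeholder FunctionPolicy standing in for the policy being learned), an exploration action can be requested as follows:

using POMDPs, POMDPModels, POMDPPolicies

m = SimpleGridWorld()                               # example problem
on_policy = FunctionPolicy(s -> first(actions(m)))  # placeholder for the policy being learned
exploration_policy = EpsGreedyPolicy(m, 0.1)        # explore with probability 0.1

s = first(states(m))  # some current state
k = 1                 # current training step, used by the schedule
a = action(exploration_policy, on_policy, k, s)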
POMDPPolicies.EpsGreedyPolicy — Type
EpsGreedyPolicy <: ExplorationPolicy
represents an epsilon-greedy policy, sampling a random action with probability eps or returning an action from the given policy otherwise. The evolution of epsilon can be controlled using a schedule. This feature is useful for using these policies in reinforcement learning algorithms.
Constructor:
EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.GLOBAL_RNG, schedule=ConstantSchedule)
If a function is passed for eps, eps(k) is called to compute the value of epsilon when calling action(exploration_policy, on_policy, k, s).
Fields
eps::Function
rng::AbstractRNG
actions::A an indexable list of actions
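For example (a sketch with an arbitrary decay rate, reusing the model m from above), eps can be passed as a function of the step k:

# epsilon decays linearly with the training step k but never drops below 0.01
exploration_policy = EpsGreedyPolicy(m, k -> max(0.01, 1.0 - 1e-4*k))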
POMDPPolicies.SoftmaxPolicy — Type
SoftmaxPolicy <: ExplorationPolicy
represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on_policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution wider or narrower.
Constructor
SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.GLOBAL_RNG)
If a function is passed for temperature, temperature(k) is called to compute the value of the temperature when calling action(exploration_policy, on_policy, k, s).
Fields
temperature::Function
rng::AbstractRNG
actions::A an indexable list of actions
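A short sketch of how SoftmaxPolicy could be used (reusing the model m from above; a ValuePolicy is chosen here because the on_policy must support actionvalues, and its default zero-initialized value table makes the softmax distribution uniform):

on_policy = ValuePolicy(m)                   # value-based policy with a zero-initialized value table
exploration_policy = SoftmaxPolicy(m, 10.0)  # a higher temperature flattens the distribution

s = first(states(m))
a = action(exploration_policy, on_policy, 1, s)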
Schedule
Exploration policies often rely on a key parameter: for example, $\epsilon$ in $\epsilon$-greedy and the temperature in softmax. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon-greedy policy with an exponential decay schedule as follows:
# m is your MDP or POMDP model
exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))
POMDPPolicies.jl exports a linear decay schedule object that can be used as well.
POMDPPolicies.LinearDecaySchedule — Type
LinearDecaySchedule
A schedule that linearly decreases a value from start to stop over steps steps. Once the value reaches stop, it stays constant.
Constructor
LinearDecaySchedule(;start, stop, steps)
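For instance (a sketch reusing the model m from above), a LinearDecaySchedule is callable and can be passed directly as the eps argument of an EpsGreedyPolicy:

schedule = LinearDecaySchedule(start=1.0, stop=0.01, steps=1000)
exploration_policy = EpsGreedyPolicy(m, schedule)

schedule(1)       # close to 1.0 at the first step
schedule(10_000)  # 0.01 once the decay is over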