Exploration Policies

Exploration policies are often useful in reinforcement learning algorithms for choosing an action that differs from the action given by the policy being learned (the on_policy).

Exploration policies are subtypes of the abstract ExplorationPolicy type and follow this interface: action(exploration_policy::ExplorationPolicy, on_policy::Policy, k, s). Here k is used to compute the value of the exploration parameter (see Schedule), and s is the current state or observation in which the agent is taking an action.

The action method is exported by POMDPs.jl. To use exploration policies in a solver, you must use this four-argument version of action, where on_policy is the policy being learned (e.g. a tabular policy or a neural network policy).
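As a minimal sketch of how a solver loop might call this interface (assuming a recent POMDPs.jl where initialstate returns a distribution, the SimpleGridWorld model from POMDPModels as the problem, and a FunctionPolicy as a stand-in for the policy being learned):

    using POMDPs, POMDPPolicies, POMDPModels

    m = SimpleGridWorld()                  # example MDP from POMDPModels
    on_policy = FunctionPolicy(s -> :up)   # stand-in for the learned policy
    exploration_policy = EpsGreedyPolicy(m, 0.1)

    s = rand(initialstate(m))              # sample a starting state
    for k in 1:10
        # k drives the exploration parameter; s is the current state
        a = action(exploration_policy, on_policy, k, s)
        # ... perform the learning update for (s, a) here ...
    end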

This package provides two exploration policies: EpsGreedyPolicy and SoftmaxPolicy.

POMDPPolicies.EpsGreedyPolicy (Type)
EpsGreedyPolicy <: ExplorationPolicy

represents an epsilon-greedy policy: it samples a random action with probability eps, and otherwise returns the action of the given policy. The evolution of epsilon can be controlled using a schedule, which is useful when using these policies in reinforcement learning algorithms.

Constructor:

EpsGreedyPolicy(problem::Union{MDP, POMDP}, eps::Union{Function, Float64}; rng=Random.GLOBAL_RNG, schedule=ConstantSchedule)

If a function is passed for eps, eps(k) is called to compute the value of epsilon when calling action(exploration_policy, on_policy, k, s).

Fields

  • eps::Function
  • rng::AbstractRNG
  • actions::A: an indexable list of actions
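For example (assuming m is your MDP or POMDP), eps can be either a constant or a function of the call count k:

    # constant epsilon of 0.05
    exploration_policy = EpsGreedyPolicy(m, 0.05)

    # epsilon computed from the call count k, decaying toward 0.01
    exploration_policy = EpsGreedyPolicy(m, k -> max(0.01, 1.0 - k/1000))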
POMDPPolicies.SoftmaxPolicy (Type)
SoftmaxPolicy <: ExplorationPolicy

represents a softmax policy, sampling a random action according to a softmax function. The softmax function converts the action values of the on_policy into probabilities that are used for sampling. A temperature parameter or function can be used to make the resulting distribution more or less spread out.

Constructor

SoftmaxPolicy(problem, temperature::Union{Function, Float64}; rng=Random.GLOBAL_RNG)

If a function is passed for temperature, temperature(k) is called to compute the value of the temperature when calling action(exploration_policy, on_policy, k, s).

Fields

  • temperature::Function
  • rng::AbstractRNG
  • actions::A: an indexable list of actions
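For example (assuming m is your MDP or POMDP; note that the on_policy passed to action must expose action values, e.g. through POMDPPolicies.actionvalues):

    # fixed temperature: higher temperatures give a flatter, more exploratory distribution
    exploration_policy = SoftmaxPolicy(m, 0.5)

    # temperature computed from the call count k, cooling down over time
    exploration_policy = SoftmaxPolicy(m, k -> 10.0/(k + 1))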

Schedule

Exploration policies often rely on a key parameter: $\epsilon$ in $\epsilon$-greedy and the temperature in softmax, for example. Reinforcement learning algorithms often require a decay schedule for these parameters. Schedules can be passed to an exploration policy as functions. For example, one can define an epsilon-greedy policy with an exponential decay schedule as follows:

    m # your mdp or pomdp model
    exploration_policy = EpsGreedyPolicy(m, k->0.05*0.9^(k/10))

POMDPPolicies.jl exports a linear decay schedule object that can be used as well.

POMDPPolicies.LinearDecaySchedule (Type)
LinearDecaySchedule

A schedule that linearly decreases a value from start to stop over steps steps. After steps steps, the value stays constant at stop.

Constructor

LinearDecaySchedule(;start, stop, steps)

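A usage sketch (assuming m is your MDP or POMDP; LinearDecaySchedule instances are callable, so they can be passed wherever a function of k is accepted):

    schedule = LinearDecaySchedule(start=1.0, stop=0.01, steps=1000)
    exploration_policy = EpsGreedyPolicy(m, schedule)

    schedule(1)     # close to 1.0 early on
    schedule(500)   # roughly halfway between 1.0 and 0.01
    schedule(2000)  # 0.01, constant once k exceeds steps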