Simulations Examples
In these simulation examples, we will use the crying baby POMDPs defined in the Defining a POMDP section (i.e. quick_crying_baby_pomdp, explicit_crying_baby_pomdp, gen_crying_baby_pomdp, and tabular_crying_baby_pomdp).
Stepthrough
The stepthrough simulator provides a window into the simulation with a for-loop syntax.
Within the body of the for loop, we have access to the belief, the state, the action, the observation, and the reward at each step. We also accumulate the sum of the rewards in this example, but note that this sum is not discounted.
policy = RandomPolicy(quick_crying_baby_pomdp)
r_sum = 0.0
step = 0
for (b, s, a, o, r) in stepthrough(quick_crying_baby_pomdp, policy, DiscreteUpdater(quick_crying_baby_pomdp), "b,s,a,o,r"; max_steps=4)
step += 1
println("Step $step")
println("b = sated => $(b.b[1]), hungry => $(b.b[2])")
@show s
@show a
@show o
@show r
r_sum += r
@show r_sum
println()
end
Step 1
b = sated => 1.0, hungry => 0.0
s = :sated
a = :feed
o = :quiet
r = -5.0
r_sum = -5.0
Step 2
b = sated => 1.0, hungry => 0.0
s = :sated
a = :feed
o = :quiet
r = -5.0
r_sum = -10.0
Step 3
b = sated => 1.0, hungry => 0.0
s = :sated
a = :feed
o = :quiet
r = -5.0
r_sum = -15.0
Step 4
b = sated => 1.0, hungry => 0.0
s = :sated
a = :sing
o = :quiet
r = -0.5
r_sum = -15.5
Rollout Simulations
While stepthrough is a flexible and convenient tool for many user-facing demonstrations, it is often less error-prone to use the standard simulate function with a Simulator object. The simplest Simulator is the RolloutSimulator. It simply runs a simulation and returns the discounted reward.
policy = RandomPolicy(explicit_crying_baby_pomdp)
sim = RolloutSimulator(max_steps=10)
r_sum = simulate(sim, explicit_crying_baby_pomdp, policy)
println("Total discounted reward: $r_sum")
Total discounted reward: -44.97873584450001
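Because a single rollout is a random sample, the discounted reward from one simulation is usually averaged over many independent rollouts to estimate a policy's value. The following is a minimal sketch of this idea; the number of rollouts and the per-rollout seeding scheme are illustrative choices, not part of the example above.
using Statistics: mean
using Random: MersenneTwister
n_rollouts = 100  # illustrative number of rollouts
rollout_rewards = zeros(n_rollouts)
for i in 1:n_rollouts
    # Seed each rollout separately so the estimate is reproducible
    seeded_sim = RolloutSimulator(rng=MersenneTwister(i), max_steps=10)
    rollout_rewards[i] = simulate(seeded_sim, explicit_crying_baby_pomdp, policy)
end
println("Mean discounted reward over $n_rollouts rollouts: $(mean(rollout_rewards))")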
Recording Histories
Sometimes it is important to record the entire history of a simulation for further examination. This can be accomplished with a HistoryRecorder.
policy = RandomPolicy(tabular_crying_baby_pomdp)
hr = HistoryRecorder(max_steps=5)
history = simulate(hr, tabular_crying_baby_pomdp, policy, DiscreteUpdater(tabular_crying_baby_pomdp), Deterministic(1))
The history object produced by a HistoryRecorder is a SimHistory, documented in the POMDPTools simulator section on Histories. The information in this object can be accessed in several ways. For example, there is a function:
discounted_reward(history)
-26.001000000000005
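A few other summary functions, such as n_steps and undiscounted_reward, can be called on the history in the same way; the sketch below shows the calls without their printed output.
n_steps(history)              # number of steps recorded in the history
undiscounted_reward(history)  # sum of the rewards without discounting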
Accessor functions like state_hist and action_hist can also be used to access parts of the history:
state_hist(history)
6-element Vector{Int64}:
2
2
2
2
2
1
collect(action_hist(history))
5-element Vector{Int64}:
3
3
2
2
1
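Analogous accessors exist for the other logged quantities; as a sketch (output omitted), observation_hist and reward_hist can be used in the same way:
collect(observation_hist(history))  # observations received at each step
collect(reward_hist(history))       # rewards received at each step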
Keeping track of which states, actions, and observations belong together can be tricky (for example, since there is a starting state and an ending state, but no action is taken from the ending state, the list of actions has a different length than the list of states). It is often better to think of histories in terms of steps that include both starting and ending states.
The most powerful function for accessing the information in a SimHistory is the eachstep function, which returns an iterator through named tuples representing each step in the history. The eachstep function is similar to the stepthrough function above, except that it iterates through the immutable steps of a previously simulated history instead of conducting the simulation as the for loop is carried out.
r_sum = 0.0
step = 0
for step_i in eachstep(history, "b,s,a,o,r")
step += 1
println("Step $step")
println("step_i.b = sated => $(step_i.b.b[1]), hungry => $(step_i.b.b[2])")
@show step_i.s
@show step_i.a
@show step_i.o
@show step_i.r
r_sum += step_i.r
@show r_sum
println()
end
Step 1
step_i.b = sated => 1.0, hungry => 0.0
step_i.s = 2
step_i.a = 3
step_i.o = 1
step_i.r = 0.0
r_sum = 0.0
Step 2
step_i.b = sated => 0.5294117647058822, hungry => 0.47058823529411764
step_i.s = 2
step_i.a = 3
step_i.o = 2
step_i.r = 0.0
r_sum = 0.0
Step 3
step_i.b = sated => 0.8037486218302095, hungry => 0.19625137816979055
step_i.s = 2
step_i.a = 2
step_i.o = 1
step_i.r = -10.5
r_sum = -10.5
Step 4
step_i.b = sated => 0.0, hungry => 1.0
step_i.s = 2
step_i.a = 2
step_i.o = 1
step_i.r = -10.5
r_sum = -21.0
Step 5
step_i.b = sated => 0.0, hungry => 1.0
step_i.s = 2
step_i.a = 1
step_i.o = 2
step_i.r = -15.0
r_sum = -36.0
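The steps produced by eachstep can also be collected for custom analysis. As a sketch, the discounted reward can be recomputed from the individual step rewards and compared with discounted_reward(history):
# Recompute the discounted reward from the individual step rewards
rs = [step.r for step in eachstep(history)]
γ = discount(tabular_crying_baby_pomdp)
manual_discounted_reward = sum(γ^(t - 1) * r for (t, r) in enumerate(rs))
# This should match discounted_reward(history) up to floating point error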
Parallel Simulations
It is often useful to evaluate a policy by running many simulations. The parallel simulator is the most effective tool for this. To use the parallel simulator, first create a list of Sim objects, each of which contains all of the information needed to run a simulation. Then run the simulations using run_parallel, which will return a DataFrame with the results.
In this example, we will compare the performance of the policies we computed in the Using Different Solvers section (i.e. sarsop_policy, pomcp_planner, and heuristic_policy). To evaluate the policies, we will run 100 simulations for each policy. We can do this by adding 100 Sim objects for each policy to the list.
using DataFrames
using StatsBase: std
using Statistics: mean
# Defining parameters for the simulations
number_of_sim_to_run = 100
max_steps = 20
starting_seed = 1
# We will also compare against a random policy
rand_policy = RandomPolicy(quick_crying_baby_pomdp, rng=MersenneTwister(1))
# Create the list of Sim objects
sim_list = []
# Add 100 Sim objects of each policy to the list.
for sim_number in 1:number_of_sim_to_run
seed = starting_seed + sim_number
# Add the SARSOP policy
push!(sim_list, Sim(
quick_crying_baby_pomdp,
rng=MersenneTwister(seed),
sarsop_policy,
max_steps=max_steps,
metadata=Dict(:policy => "sarsop", :seed => seed))
)
# Add the POMCP policy
push!(sim_list, Sim(
quick_crying_baby_pomdp,
rng=MersenneTwister(seed),
pomcp_planner,
max_steps=max_steps,
metadata=Dict(:policy => "pomcp", :seed => seed))
)
# Add the heuristic policy
push!(sim_list, Sim(
quick_crying_baby_pomdp,
rng=MersenneTwister(seed),
heuristic_policy,
max_steps=max_steps,
metadata=Dict(:policy => "heuristic", :seed => seed))
)
# Add the random policy
push!(sim_list, Sim(
quick_crying_baby_pomdp,
rng=MersenneTwister(seed),
rand_policy,
max_steps=max_steps,
metadata=Dict(:policy => "random", :seed => seed))
)
end
# Run the simulations in parallel
data = run_parallel(sim_list)
# Define a function to calculate the mean and confidence interval
function mean_and_ci(x)
m = mean(x)
ci = 1.96 * std(x) / sqrt(length(x)) # 95% confidence interval
return (mean = m, ci = ci)
end
# Calculate the mean and confidence interval for each policy
grouped_df = groupby(data, :policy)
result = combine(grouped_df, :reward => mean_and_ci => AsTable)
| Row | policy    | mean     | ci      |
|-----|-----------|----------|---------|
|     | String?   | Float64  | Float64 |
| 1   | sarsop    | -15.2408 | 1.84332 |
| 2   | pomcp     | -20.1549 | 1.73457 |
| 3   | heuristic | -15.9499 | 2.07676 |
| 4   | random    | -28.1802 | 2.5492  |
By default, the parallel simulator only returns the reward from each simulation, but more information can be gathered by specifying a function to analyze the Sim-history pair and record additional statistics. See the POMDPTools simulator section (Specifying information to be recorded) for more information.
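For example, a function can be supplied to run_parallel using Julia's do-block syntax; it receives each Sim and its simulated history and returns pairs that become columns of the resulting DataFrame. The sketch below records the number of steps alongside the discounted reward; the extra column name is an illustrative choice.
data = run_parallel(sim_list) do sim, hist
    # Each pair becomes a column in the returned DataFrame
    return [:n_steps => n_steps(hist), :reward => discounted_reward(hist)]
end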