Boltzmann Wealth Model with Reinforcement Learning

This is a tutorial for the ReinforcementLearningABM model type, which combines reinforcement learning (RL) with agent based modelling. It demonstrates the integration using the Boltzmann wealth distribution model: agents move around a grid and exchange wealth when they encounter other agents, but their movement decisions are learned through reinforcement learning rather than being random.

The model showcases how RL agents can learn to optimize their behavior to achieve specific goals - in this case, reducing wealth inequality as measured by the Gini coefficient.

`Crux` needed!

This functionality is provided as a package extension. To access it, you need to be `using Crux`.

Reinforcement Learning fundamentals assumed

The discussion in this tutorial assumes you are familiar with the fundamentals of reinforcement learning. If not, you can start your journey from the Wikipedia page.

Model specification

The Boltzmann wealth model is a classic example in econophysics where agents represent economic actors who exchange wealth. The traditional model uses random movement, but here we replace that with learned behavior using reinforcement learning.

Rules:

  • Agents move on a 2D periodic grid
  • When agents occupy the same position, they may exchange wealth
  • Wealth flows from richer to poorer agents
  • Agent movement is learned through RL to minimize wealth inequality

RL Integration:

  • Actions: Stay, move North, South, East, or West (5 discrete actions)
  • Observations: Local neighborhood information and agent's relative wealth
  • Reward: Reduction in Gini coefficient (wealth inequality measure)
  • Goal: Learn movement patterns that promote wealth redistribution

Loading packages and defining the agent type

This part is the same as with the normal usage of Agents.jl

using Agents, Random, Statistics, Distributions

while we also need to import the library required for the RL extension

using Crux

and define the agent type as usual

@agent struct RLBoltzmannAgent(GridAgent{2})
    wealth::Int
end

Utility functions

First, we define the Gini coefficient calculation, which measures wealth inequality. A Gini coefficient of 0 represents perfect equality, while 1 represents maximum inequality.
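For reference, with the wealths sorted in ascending order (w_1 ≤ w_2 ≤ … ≤ w_n), the implementation below uses the form

G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, w_i}{n \sum_{i=1}^{n} w_i}

which is 0 when all wealths are equal and approaches 1 as a single agent accumulates all the wealth.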

function gini(model::ABM)
    wealths = [a.wealth for a in allagents(model)]
    n, sum_wi = length(wealths), sum(wealths)
    (n <= 1 || sum_wi == 0.0) && return 0.0
    num = sum((2i - n - 1) * w for (i, w) in enumerate(sort!(wealths)))
    den = n * sum_wi
    return num / den
end
gini (generic function with 1 method)

Agent stepping function

The agent stepping function defines how agents behave in response to RL actions. Unlike traditional ABM where this might contain random movement, here the movement is determined by the RL policy based on the chosen action.

This function augments the standard Agents.jl agent_step! signature by including an additional third argument action::Int that represents the action chosen by the RL policy. This action is optimized during training by the RL framework. The correspondence between integers and "actual agent actions" is decided by the user.

function boltzmann_rl_step!(agent::RLBoltzmannAgent, model, action::Int)
    # Action definitions: 1=stay, 2=north, 3=south, 4=east, 5=west
    dirs = ((0, 0), (0, 1), (0, -1), (1, 0), (-1, 0))
    walk!(agent, dirs[action], model; ifempty=false)

    # Wealth exchange mechanism
    other = random_agent_in_position(agent.pos, model, a -> a.id != agent.id)
    if !isnothing(other)
        # Transfer wealth from richer to poorer agent
        if other.wealth > agent.wealth && other.wealth > 0
            agent.wealth += 1
            other.wealth -= 1
        end
    end
end
boltzmann_rl_step! (generic function with 1 method)

RL-specific functions

The following functions define how the RL environment interacts with the ABM:

  • Observation function: Extracts relevant state information for the RL agent
  • Reward function: Defines what behavior we want to encourage
  • Terminal function: Determines when an episode ends

Observation function

The observation function is a key component of reinforcement learning. In our scenario, it focuses on the local neighborhood of the agent and the wealth distribution in this neighborhood (as well as the agent's own wealth). Note that the observation function must return the observed quantities as a Vector{Float32}.

function global_to_local(neighbor_pos, center_pos, radius, grid_dims) # helper function
    function transform_dim(neighbor_coord, center_coord, dim_size)
        local_center = radius + 1
        delta = neighbor_coord - center_coord
        delta > radius && return local_center + (delta - dim_size)
        delta < -radius && return local_center + (delta + dim_size)
        return local_center + delta
    end
    return ntuple(i -> transform_dim(neighbor_pos[i], center_pos[i], grid_dims[i]), length(grid_dims))
end

function boltzmann_get_observation(agent, model::ABM)
    width, height = spacesize(model)
    observation_radius = model.observation_radius

    grid_size = 2 * observation_radius + 1
    # 2 channels: occupancy and relative wealth
    neighborhood_grid = zeros(Float32, grid_size, grid_size, 2)

    for pos in nearby_positions(agent.pos, model, observation_radius)
        lpos = global_to_local(pos, agent.pos, observation_radius, spacesize(model))
        k = 0
        for neighbor in agents_in_position(pos, model)
            neighbor.id == agent.id && continue
            neighborhood_grid[lpos..., 1] = 1.0
            wealth_diff = Float32(neighbor.wealth - agent.wealth)
            wealth_sum = Float32(neighbor.wealth + agent.wealth)
            if wealth_sum > 0
                k += 1
                neighborhood_grid[lpos..., 2] += wealth_diff / wealth_sum
            end
        end
        # Average the relative wealth over all neighbors sharing this position
        k != 0 && (neighborhood_grid[lpos..., 2] /= k)
    end

    total_wealth = sum(a.wealth for a in allagents(model))
    normalized_wealth = total_wealth > 0 ? Float32(agent.wealth / total_wealth) : 0.0f0
    normalized_pos_x = Float32(agent.pos[1] / width)
    normalized_pos_y = Float32(agent.pos[2] / height)

    return vcat(
        normalized_wealth,
        normalized_pos_x,
        normalized_pos_y,
        vec(neighborhood_grid)
    )
end
boltzmann_get_observation (generic function with 1 method)
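
To build intuition for the `global_to_local` helper: on a periodic (10, 10) grid with observation radius 4, a neighbor at position (10, 5) seen from an agent at (1, 5) is one cell to the west across the boundary, so it maps to local grid coordinates (4, 5), i.e. one cell west of the local center (5, 5):

global_to_local((10, 5), (1, 5), 4, (10, 10))  # returns (4, 5)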

Reward function

The reward function is also a key component of reinforcement learning. Here, the reward function encourages agents to reduce wealth inequality by rewarding decreases in the Gini coefficient. This creates an incentive for agents to learn movement patterns that promote wealth redistribution. The decrease is scaled by a factor of 100, positive rewards are damped as the episode progresses (by dividing by the model time plus one), and a small penalty of -0.1 is given whenever the action did not reduce inequality. As the Gini coefficient depends only on the global model state, all agents receive the same reward, and hence the reward function does not actually use its agent argument.

function boltzmann_calculate_reward(agent, action, previous_model, current_model)

    initial_gini = gini(previous_model)
    final_gini = gini(current_model)

    # Reward decrease in Gini coefficient
    reward = (initial_gini - final_gini) * 100
    reward > 0 && (reward = reward / (abmtime(current_model) + 1))
    # Small penalty for neutral actions
    reward <= 0.0 && (reward = -0.1f0)

    return reward
end
boltzmann_calculate_reward (generic function with 1 method)

Terminal condition

Define when an RL episode should end. Here, episodes terminate when wealth inequality (Gini coefficient) drops below a threshold, indicating success.

boltzmann_is_terminal_rl(env) = gini(env) < 0.1
boltzmann_is_terminal_rl (generic function with 1 method)

Model initialization

The following functions handle model creation and RL configuration setup. We first create a separate function that generates a ReinforcementLearningABM without an RL configuration. This function will also be used during training to generate new models while updating the overall model policies.

function create_fresh_boltzmann_model(num_agents, dims, initial_wealth, observation_radius, seed=rand(Int))
    rng = MersenneTwister(seed)
    space = GridSpace(dims; periodic=true)

    properties = Dict{Symbol,Any}(:observation_radius => observation_radius)

    model = ReinforcementLearningABM(RLBoltzmannAgent, space;
        agent_step=boltzmann_rl_step!,
        properties=properties, rng=rng, scheduler=Schedulers.Randomly()
    )

    # Add agents with random initial wealth
    for _ in 1:num_agents
        add_agent_single!(RLBoltzmannAgent, model, rand(rng, 1:initial_wealth))
    end

    return model
end
create_fresh_boltzmann_model (generic function with 2 methods)
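
With this constructor in place, the stepping function can already be exercised by hand, outside of any RL training. A minimal sketch (the action encoding follows `boltzmann_rl_step!`; the exact positions depend on the seed):

m = create_fresh_boltzmann_model(5, (10, 10), 10, 4, 42)
a = random_agent(m)
a.pos                        # position before the step
boltzmann_rl_step!(a, m, 2)  # action 2 = move one cell north
a.pos                        # the y coordinate has advanced by one (wrapping at the boundary)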

We then use this function to create another keyword-based function, which initializes an RL model with the RL training functions we have created so far:

function initialize_boltzmann_rl_model(; num_agents=10, dims=(10, 10), initial_wealth=10, observation_radius=4)
    # Create the main model using the initialization function
    model = create_fresh_boltzmann_model(num_agents, dims, initial_wealth, observation_radius)

    # `RLConfig` specifies the learning environment parameters
    rl_config = RLConfig(;
        model_init_fn = () -> create_fresh_boltzmann_model(num_agents, dims, initial_wealth, observation_radius),
        observation_fn = boltzmann_get_observation,
        reward_fn = boltzmann_calculate_reward,
        terminal_fn = boltzmann_is_terminal_rl,
        agent_step_fn = boltzmann_rl_step!,
        action_spaces = Dict(
            RLBoltzmannAgent => Crux.DiscreteSpace(5)  ## 5 possible actions
        ),
        observation_spaces = Dict(
            # Observation space: (2*radius+1)² grid cells * 2 channels + 3 agent features
            RLBoltzmannAgent => Crux.ContinuousSpace((((2 * observation_radius + 1)^2 * 2) + 3,), Float32)
        ),
        training_agent_types = [RLBoltzmannAgent]
    )

    # Set the RL configuration to the model
    set_rl_config!(model, rl_config)

    return model
end
initialize_boltzmann_rl_model (generic function with 1 method)
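
Before training, it is worth sanity-checking that the observation function and the declared observation space agree. A quick check, using only the functions defined above:

check_model = initialize_boltzmann_rl_model()
obs = boltzmann_get_observation(random_agent(check_model), check_model)
length(obs)   # (2*4 + 1)^2 * 2 + 3 = 165, matching the ContinuousSpace above
eltype(obs)   # Float32, as required for observations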

Training the RL agents

Now we create and train our model. The agents will learn through trial and error which movement patterns best achieve the goal of reducing wealth inequality.

Create and train the Boltzmann RL model

boltzmann_rl_model = initialize_boltzmann_rl_model()
ReinforcementLearningABM{GridSpace{2, true}, Main.RLBoltzmannAgent, Dict{Int64, Main.RLBoltzmannAgent}, Tuple{DataType}, typeof(dummystep), typeof(dummystep), Agents.Schedulers.Randomly, Dict{Symbol, Any}, Random.MersenneTwister}(Dict{Int64, Main.RLBoltzmannAgent}(5 => Main.RLBoltzmannAgent(5, (8, 6), 10), 4 => Main.RLBoltzmannAgent(4, (4, 3), 10), 6 => Main.RLBoltzmannAgent(6, (5, 1), 4), 7 => Main.RLBoltzmannAgent(7, (7, 8), 1), 2 => Main.RLBoltzmannAgent(2, (6, 2), 6), 10 => Main.RLBoltzmannAgent(10, (8, 3), 10), 9 => Main.RLBoltzmannAgent(9, (3, 1), 6), 8 => Main.RLBoltzmannAgent(8, (9, 7), 7), 3 => Main.RLBoltzmannAgent(3, (10, 5), 7), 1 => Main.RLBoltzmannAgent(1, (7, 10), 8)…), Agents.dummystep, Agents.dummystep, GridSpace with size (10, 10), metric=chebyshev, periodic=true, Agents.Schedulers.Randomly(Int64[]), Dict{Symbol, Any}(:observation_radius => 4), Random.MersenneTwister(-309953280522471625, (0, 1254, 0, 0, 0, 20)), (Main.RLBoltzmannAgent,), true, Base.RefValue{Int64}(10), Base.RefValue{Int64}(0), Base.RefValue{Any}(RLConfig(Main.var"#11#12"{Int64, Tuple{Int64, Int64}, Int64, Int64}(10, (10, 10), 10, 4), Main.boltzmann_get_observation, Main.boltzmann_calculate_reward, Main.boltzmann_is_terminal_rl, Main.boltzmann_rl_step!, Dict{Type, Any}(Main.RLBoltzmannAgent => DiscreteSpace
  N: Int64 5
  vals: Array{Int64}((5,)) [1, 2, 3, 4, 5]
), Dict{Type, Any}(Main.RLBoltzmannAgent => Crux.ContinuousSpace{Tuple{Int64}}
  dims: Tuple{Int64}
  type: primitive type Float32 <: AbstractFloat
  μ: Float32 0.0f0
  σ: Float32 1.0f0
), Type[Main.RLBoltzmannAgent], Dict{Type, Float64}(), Dict{Type, Any}())), Dict{Type, Any}(), Dict{Type, Any}(Main.RLBoltzmannAgent => nothing), Base.RefValue{Bool}(false), Base.RefValue{Any}(nothing), Base.RefValue{Int64}(1))

Train the Boltzmann agents

train_model!(
    boltzmann_rl_model;
    training_steps=200000,
    max_steps=50,
    solver_params=Dict(
        :ΔN => 200,            # Custom batch size for PPO updates
        :log => (period=1000,) # Log every 1000 steps
    ))
ReinforcementLearningABM{GridSpace{2, true}, Main.RLBoltzmannAgent, Dict{Int64, Main.RLBoltzmannAgent}, Tuple{DataType}, typeof(dummystep), typeof(dummystep), Agents.Schedulers.Randomly, Dict{Symbol, Any}, Random.MersenneTwister}(Dict{Int64, Main.RLBoltzmannAgent}(5 => Main.RLBoltzmannAgent(5, (8, 3), 6), 4 => Main.RLBoltzmannAgent(4, (4, 6), 9), 6 => Main.RLBoltzmannAgent(6, (7, 7), 1), 7 => Main.RLBoltzmannAgent(7, (6, 2), 2), 2 => Main.RLBoltzmannAgent(2, (8, 2), 2), 10 => Main.RLBoltzmannAgent(10, (4, 5), 7), 9 => Main.RLBoltzmannAgent(9, (7, 1), 10), 8 => Main.RLBoltzmannAgent(8, (2, 2), 10), 3 => Main.RLBoltzmannAgent(3, (9, 1), 3), 1 => Main.RLBoltzmannAgent(1, (9, 3), 10)…), Agents.dummystep, Agents.dummystep, GridSpace with size (10, 10), metric=chebyshev, periodic=true, Agents.Schedulers.Randomly(Int64[]), Dict{Symbol, Any}(:observation_radius => 4), Random.MersenneTwister(-309953280522471625, (0, 185964, 184962, 392, 183708, 258)), (Main.RLBoltzmannAgent,), true, Base.RefValue{Int64}(10), Base.RefValue{Int64}(0), Base.RefValue{Any}(RLConfig(Main.var"#11#12"{Int64, Tuple{Int64, Int64}, Int64, Int64}(10, (10, 10), 10, 4), Main.boltzmann_get_observation, Main.boltzmann_calculate_reward, Main.boltzmann_is_terminal_rl, Main.boltzmann_rl_step!, Dict{Type, Any}(Main.RLBoltzmannAgent => DiscreteSpace
  N: Int64 5
  vals: Array{Int64}((5,)) [1, 2, 3, 4, 5]
), Dict{Type, Any}(Main.RLBoltzmannAgent => Crux.ContinuousSpace{Tuple{Int64}}
  dims: Tuple{Int64}
  type: primitive type Float32 <: AbstractFloat
  μ: Float32 0.0f0
  σ: Float32 1.0f0
), Type[Main.RLBoltzmannAgent], Dict{Type, Float64}(), Dict{Type, Any}())), Dict{Type, Any}(Main.RLBoltzmannAgent => ActorCritic{DiscreteNetwork, ContinuousNetwork}(DiscreteNetwork(Chain(Dense(165 => 64, relu), Dense(64 => 64, relu), Dense(64 => 5)), [1, 2, 3, 4, 5], Crux.var"#157#158"(), false, Flux.cpu), ContinuousNetwork(Chain(Dense(165 => 64, relu), Dense(64 => 64, relu), Dense(64 => 1)), 1, Flux.cpu))), Dict{Type, Any}(Main.RLBoltzmannAgent => OnPolicySolver
  agent: PolicyParams{ActorCritic{DiscreteNetwork, ContinuousNetwork}, DiscreteSpace}
  S: Crux.ContinuousSpace{Tuple{Int64}}
  N: Int64 200000
  ΔN: Int64 200
  max_steps: Int64 50
  log: LoggerParams
  i: Int64 200000
  param_optimizers: Dict{Any, TrainingParams}
  a_opt: TrainingParams
  c_opt: TrainingParams
  𝒫: @NamedTuple{ϵ::Float32, λp::Float32, λe::Float32}
  interaction_storage: Nothing nothing
  post_sample_callback: record_avgr (function of type Crux.var"#record_avgr#687"{Crux.var"#record_avgr#681#688"})
  post_batch_callback: #695 (function of type Crux.var"#695#696"{Crux.var"#697#698"})
  λ_gae: Float32 0.95f0
  required_columns: Array{Symbol}((3,))
  Vc: Nothing nothing
  cost_opt: Nothing nothing
), Base.RefValue{Bool}(false), Base.RefValue{Any}(Main.RLBoltzmannAgent), Base.RefValue{Int64}(1))

Plot the learning curve to see how agents improved over training

plot_learning(boltzmann_rl_model.training_history[RLBoltzmannAgent])
Example block output

Running the trained model

After training, we create a fresh model instance and apply the learned policies to see how well the agents perform.

First, create a fresh model instance for simulation with the same parameters

fresh_boltzmann_model = initialize_boltzmann_rl_model()
ReinforcementLearningABM{GridSpace{2, true}, Main.RLBoltzmannAgent, Dict{Int64, Main.RLBoltzmannAgent}, Tuple{DataType}, typeof(dummystep), typeof(dummystep), Agents.Schedulers.Randomly, Dict{Symbol, Any}, Random.MersenneTwister}(Dict{Int64, Main.RLBoltzmannAgent}(5 => Main.RLBoltzmannAgent(5, (1, 10), 10), 4 => Main.RLBoltzmannAgent(4, (2, 9), 6), 6 => Main.RLBoltzmannAgent(6, (9, 9), 6), 7 => Main.RLBoltzmannAgent(7, (6, 7), 8), 2 => Main.RLBoltzmannAgent(2, (2, 1), 8), 10 => Main.RLBoltzmannAgent(10, (9, 4), 10), 9 => Main.RLBoltzmannAgent(9, (5, 4), 9), 8 => Main.RLBoltzmannAgent(8, (5, 10), 10), 3 => Main.RLBoltzmannAgent(3, (10, 3), 6), 1 => Main.RLBoltzmannAgent(1, (3, 5), 10)…), Agents.dummystep, Agents.dummystep, GridSpace with size (10, 10), metric=chebyshev, periodic=true, Agents.Schedulers.Randomly(Int64[]), Dict{Symbol, Any}(:observation_radius => 4), Random.MersenneTwister(7589413966511477907, (0, 1254, 0, 0, 0, 21)), (Main.RLBoltzmannAgent,), true, Base.RefValue{Int64}(10), Base.RefValue{Int64}(0), Base.RefValue{Any}(RLConfig(Main.var"#11#12"{Int64, Tuple{Int64, Int64}, Int64, Int64}(10, (10, 10), 10, 4), Main.boltzmann_get_observation, Main.boltzmann_calculate_reward, Main.boltzmann_is_terminal_rl, Main.boltzmann_rl_step!, Dict{Type, Any}(Main.RLBoltzmannAgent => DiscreteSpace
  N: Int64 5
  vals: Array{Int64}((5,)) [1, 2, 3, 4, 5]
), Dict{Type, Any}(Main.RLBoltzmannAgent => Crux.ContinuousSpace{Tuple{Int64}}
  dims: Tuple{Int64}
  type: primitive type Float32 <: AbstractFloat
  μ: Float32 0.0f0
  σ: Float32 1.0f0
), Type[Main.RLBoltzmannAgent], Dict{Type, Float64}(), Dict{Type, Any}())), Dict{Type, Any}(), Dict{Type, Any}(Main.RLBoltzmannAgent => nothing), Base.RefValue{Bool}(false), Base.RefValue{Any}(nothing), Base.RefValue{Int64}(1))

And copy the trained policies to the fresh model

copy_trained_policies!(fresh_boltzmann_model, boltzmann_rl_model)
ReinforcementLearningABM{GridSpace{2, true}, Main.RLBoltzmannAgent, Dict{Int64, Main.RLBoltzmannAgent}, Tuple{DataType}, typeof(dummystep), typeof(dummystep), Agents.Schedulers.Randomly, Dict{Symbol, Any}, Random.MersenneTwister}(Dict{Int64, Main.RLBoltzmannAgent}(5 => Main.RLBoltzmannAgent(5, (1, 10), 10), 4 => Main.RLBoltzmannAgent(4, (2, 9), 6), 6 => Main.RLBoltzmannAgent(6, (9, 9), 6), 7 => Main.RLBoltzmannAgent(7, (6, 7), 8), 2 => Main.RLBoltzmannAgent(2, (2, 1), 8), 10 => Main.RLBoltzmannAgent(10, (9, 4), 10), 9 => Main.RLBoltzmannAgent(9, (5, 4), 9), 8 => Main.RLBoltzmannAgent(8, (5, 10), 10), 3 => Main.RLBoltzmannAgent(3, (10, 3), 6), 1 => Main.RLBoltzmannAgent(1, (3, 5), 10)…), Agents.dummystep, Agents.dummystep, GridSpace with size (10, 10), metric=chebyshev, periodic=true, Agents.Schedulers.Randomly(Int64[]), Dict{Symbol, Any}(:observation_radius => 4), Random.MersenneTwister(7589413966511477907, (0, 1254, 0, 0, 0, 21)), (Main.RLBoltzmannAgent,), true, Base.RefValue{Int64}(10), Base.RefValue{Int64}(0), Base.RefValue{Any}(RLConfig(Main.var"#11#12"{Int64, Tuple{Int64, Int64}, Int64, Int64}(10, (10, 10), 10, 4), Main.boltzmann_get_observation, Main.boltzmann_calculate_reward, Main.boltzmann_is_terminal_rl, Main.boltzmann_rl_step!, Dict{Type, Any}(Main.RLBoltzmannAgent => DiscreteSpace
  N: Int64 5
  vals: Array{Int64}((5,)) [1, 2, 3, 4, 5]
), Dict{Type, Any}(Main.RLBoltzmannAgent => Crux.ContinuousSpace{Tuple{Int64}}
  dims: Tuple{Int64}
  type: primitive type Float32 <: AbstractFloat
  μ: Float32 0.0f0
  σ: Float32 1.0f0
), Type[Main.RLBoltzmannAgent], Dict{Type, Float64}(), Dict{Type, Any}())), Dict{Type, Any}(Main.RLBoltzmannAgent => ActorCritic{DiscreteNetwork, ContinuousNetwork}(DiscreteNetwork(Chain(Dense(165 => 64, relu), Dense(64 => 64, relu), Dense(64 => 5)), [1, 2, 3, 4, 5], Crux.var"#157#158"(), false, Flux.cpu), ContinuousNetwork(Chain(Dense(165 => 64, relu), Dense(64 => 64, relu), Dense(64 => 1)), 1, Flux.cpu))), Dict{Type, Any}(Main.RLBoltzmannAgent => nothing), Base.RefValue{Bool}(false), Base.RefValue{Any}(nothing), Base.RefValue{Int64}(1))

Let's visualize the initial state and run a simulation to see the trained behavior.

using CairoMakie
using CairoMakie.Makie.ColorSchemes

function agent_color(agent) # Custom color function based on wealth
    max_expected_wealth = 10
    clamped_wealth = clamp(agent.wealth, 0, max_expected_wealth)
    normalized_wealth = clamped_wealth / max_expected_wealth
    # Color scheme: red (poor) to green (rich)
    return ColorSchemes.RdYlGn_4[normalized_wealth]
end
function agent_size(agent) # Custom size function based on wealth
    max_expected_wealth = 10
    clamped_wealth = clamp(agent.wealth, 0, max_expected_wealth)
    size_factor = clamped_wealth / max_expected_wealth
    return 10 + size_factor * 15
end

fig, ax = abmplot(fresh_boltzmann_model;
    agent_color=agent_color,
    agent_size=agent_size,
    agent_marker=:circle
)
ax.title = "Boltzmann Wealth Distribution (Initial State)"
ax.xlabel = "X Position"
ax.ylabel = "Y Position"
fig
Example block output

Run simulation with trained agents on the fresh model

initial_gini = gini(fresh_boltzmann_model)
"Initial Gini coefficient: $initial_gini"
"Initial Gini coefficient: 0.10963855421686747"

Step the model forward to see the trained behavior

Agents.step!(fresh_boltzmann_model, 10)

Check the results after simulation

final_gini = gini(fresh_boltzmann_model)
"Final Gini coefficient: $final_gini"
"Final Gini coefficient: 0.10963855421686747"

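Rather than inspecting only the start and end points, we can also record the Gini coefficient after every step. A small sketch using only the functions defined above:

traj_model = initialize_boltzmann_rl_model()
copy_trained_policies!(traj_model, boltzmann_rl_model)
ginis = [gini(traj_model)]
for _ in 1:50
    Agents.step!(traj_model, 1)
    push!(ginis, gini(traj_model))
end
ginis   # trajectory of wealth inequality under the trained policy
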
Plot the final state

fig, ax = abmplot(fresh_boltzmann_model;
    agent_color=agent_color,
    agent_size=agent_size,
    agent_marker=:circle
)
ax.title = "Boltzmann Wealth Distribution (After 10 RL Steps)"
ax.xlabel = "X Position"
ax.ylabel = "Y Position"
fig
Example block output

Finally, let's create a video showing the trained agents in action over multiple steps on a bigger scale, and compare visually with a random policy

Agents move randomly here because no trained policy has been copied to the model

fresh_boltzmann_model = initialize_boltzmann_rl_model(; num_agents=500, dims=(100, 100))
plotkwargs = (;
    agent_color=agent_color,
    agent_size=agent_size,
    agent_marker=:circle
)
abmvideo("rn_boltzmann.mp4", fresh_boltzmann_model; frames=100,
    framerate=20,
    title="Boltzmann Wealth Model with Random Agents",
    plotkwargs...
)

We now copy the trained policies and the agents are... smarter!

fresh_boltzmann_model = initialize_boltzmann_rl_model(; num_agents=500, dims=(100, 100))
copy_trained_policies!(fresh_boltzmann_model, boltzmann_rl_model)
abmvideo("rl_boltzmann.mp4", fresh_boltzmann_model; frames=100,
    framerate=20,
    title="Boltzmann Wealth Model with RL Agents",
    plotkwargs...
)
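
The visual comparison can also be quantified: run a random-policy model and a trained-policy model side by side and compare their final Gini coefficients (a sketch; the exact numbers vary from run to run):

random_model = initialize_boltzmann_rl_model(; num_agents=500, dims=(100, 100))
trained_model = initialize_boltzmann_rl_model(; num_agents=500, dims=(100, 100))
copy_trained_policies!(trained_model, boltzmann_rl_model)
Agents.step!(random_model, 100)
Agents.step!(trained_model, 100)
gini(random_model), gini(trained_model)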

Key takeaways

This example demonstrated several important concepts:

  1. RL-ABM Integration: How to integrate reinforcement learning with agent-based modeling using the ReinforcementLearningABM type.

  2. Custom Reward Design: The reward function encourages behavior that reduces wealth inequality, showing how RL can optimize for specific outcomes.

  3. Observation Engineering: Agents observe their local neighborhood, providing them with relevant information for decision-making.

  4. Policy Transfer: Trained policies can be copied to fresh model instances, enabling evaluation and deployment of learned behaviors.