Core concepts

PolicyEngine.py is a Python package for tax-benefit microsimulation analysis. It provides a unified interface for running policy simulations, analysing distributional impacts, and visualising results across different countries.

Architecture overview

The package is organised around several core concepts:

Tax-benefit models

Tax-benefit models define the rules and calculations for a country’s tax and benefit system. Each model version contains:

Using a tax-benefit model

from policyengine.tax_benefit_models.uk import uk_latest
from policyengine.tax_benefit_models.us import us_latest

# UK model includes variables like:
# - income_tax, national_insurance, universal_credit
# - Parameters like personal allowance, NI thresholds

# US model includes variables like:
# - income_tax, payroll_tax, eitc, ctc, snap
# - Parameters like standard deduction, EITC rates

Datasets

Datasets contain microdata representing a population. Each dataset has:

Dataset structure

from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset

dataset = PolicyEngineUKDataset(
    name="FRS 2023-24",
    description="Family Resources Survey microdata",
    filepath="./data/frs_2023_24_year_2026.h5",
    year=2026,
)

# Access entity-level data
person_data = dataset.data.person      # MicroDataFrame
household_data = dataset.data.household
benunit_data = dataset.data.benunit    # Benefit unit (UK only)

Creating custom datasets

You can create custom datasets for scenario analysis:

import pandas as pd
from microdf import MicroDataFrame
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, UKYearData

# Create person data
person_df = MicroDataFrame(
    pd.DataFrame({
        "person_id": [0, 1, 2],
        "person_household_id": [0, 0, 1],
        "person_benunit_id": [0, 0, 1],
        "age": [35, 8, 40],
        "employment_income": [30000, 0, 50000],
        "person_weight": [1.0, 1.0, 1.0],
    }),
    weights="person_weight"
)

# Create household data
household_df = MicroDataFrame(
    pd.DataFrame({
        "household_id": [0, 1],
        "region": ["LONDON", "SOUTH_EAST"],
        "rent": [15000, 12000],
        "household_weight": [1.0, 1.0],
    }),
    weights="household_weight"
)

# Create benunit data
benunit_df = MicroDataFrame(
    pd.DataFrame({
        "benunit_id": [0, 1],
        "would_claim_uc": [True, True],
        "benunit_weight": [1.0, 1.0],
    }),
    weights="benunit_weight"
)

dataset = PolicyEngineUKDataset(
    name="Custom scenario",
    description="Single parent vs single adult",
    filepath="./custom.h5",
    year=2026,
    data=UKYearData(
        person=person_df,
        household=household_df,
        benunit=benunit_df,
    )
)

Data loading

Before running simulations, you need representative microdata. The package provides three functions for managing datasets:

from policyengine.tax_benefit_models.us import ensure_datasets

# First run: downloads from HuggingFace, computes variables, saves to ./data/
# Subsequent runs: loads from disk instantly
datasets = ensure_datasets(
    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
    years=[2026],
    data_folder="./data",
)
dataset = datasets["enhanced_cps_2024_2026"]

# UK equivalent
from policyengine.tax_benefit_models.uk import ensure_datasets

datasets = ensure_datasets(
    datasets=["hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"],
    years=[2026],
    data_folder="./data",
)
dataset = datasets["enhanced_frs_2023_24_2026"]

All datasets are stored as HDF5 files on disk. No database server is required.

Simulations

Simulations apply tax-benefit models to datasets, calculating all variables for the specified year.

Running a simulation

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)
simulation.run()

# Access output data
output_person = simulation.output_dataset.data.person
output_household = simulation.output_dataset.data.household

# Check calculated variables
print(output_household[["household_id", "household_net_income", "household_tax"]])

Simulation lifecycle: run() vs ensure()

The Simulation class provides two methods for computing results:

| Method | Behaviour |
| --- | --- |
| simulation.run() | Always recomputes from scratch. No caching. |
| simulation.ensure() | Checks the in-memory LRU cache, then tries loading from disk, then falls back to run() + save(). |

# One-off computation (no caching)
simulation.run()

# Cache-or-compute (preferred for production use)
simulation.ensure()

ensure() uses a module-level LRU cache (max 100 simulations) and saves output datasets as HDF5 files alongside the input dataset. On repeated calls, it returns cached results instantly. For baseline-vs-reform comparisons, economic_impact_analysis() calls ensure() internally, so you rarely need to call it yourself.

Accessing calculated variables

After running a simulation, you can access the calculated variables from the output dataset:

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)
simulation.run()

# Access specific variables
output = simulation.output_dataset.data
person_data = output.person[["person_id", "age", "employment_income", "income_tax"]]
household_data = output.household[["household_id", "household_net_income"]]
benunit_data = output.benunit[["benunit_id", "universal_credit", "child_benefit"]]

Policies

Policies modify tax-benefit system parameters through parametric reforms.

Creating a policy

import datetime

from policyengine.core import Policy, Parameter, ParameterValue
from policyengine.tax_benefit_models.uk import uk_latest

# Define parameter to modify
parameter = Parameter(
    name="gov.hmrc.income_tax.allowances.personal_allowance.amount",
    tax_benefit_model_version=uk_latest,
    description="Personal allowance for income tax",
    data_type=float,
)

# Set new value
parameter_value = ParameterValue(
    parameter=parameter,
    start_date=datetime.date(2026, 1, 1),
    end_date=datetime.date(2026, 12, 31),
    value=15000,  # Increase from ~£12,570 to £15,000
)

policy = Policy(
    name="Increased personal allowance",
    description="Raises personal allowance to £15,000",
    parameter_values=[parameter_value],
)

Running a reform simulation

# Baseline simulation
baseline = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
)
baseline.run()

# Reform simulation
reform = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
    policy=policy,
)
reform.run()
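
To see what a baseline-versus-reform comparison measures, the toy calculation below applies one flat rate to income above the personal allowance. It is a deliberately simplified sketch: toy_income_tax and BASIC_RATE are hypothetical, and the real model applies the full UK tax and benefit rules rather than a single rate.

```python
BASIC_RATE = 0.20  # toy single-rate tax, not the real UK schedule


def toy_income_tax(income, personal_allowance):
    """Tax only the income above the personal allowance, at one flat rate."""
    return BASIC_RATE * max(0.0, income - personal_allowance)


incomes = [30_000, 0, 50_000]  # employment incomes of three people

# Baseline uses the current allowance; the reform raises it to 15,000
baseline_tax = [toy_income_tax(y, 12_570) for y in incomes]
reform_tax = [toy_income_tax(y, 15_000) for y in incomes]

# Positive values mean the person pays less tax under the reform
gains = [b - r for b, r in zip(baseline_tax, reform_tax)]
```

In this toy setting each taxpayer gains about £486 (20% of the £2,430 increase in the allowance), while the person with no income is unaffected.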

Combining policies

Policies can be combined using the + operator:

combined = policy_a + policy_b
# Concatenates parameter_values and chains simulation_modifiers
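
The behaviour in the comment above can be sketched with a small stand-in class. SketchPolicy is hypothetical and exists only for illustration; the naming of the combined policy and the handling of missing modifiers are assumptions, not the package's documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SketchPolicy:
    """Hypothetical stand-in for Policy, for illustration only."""

    name: str
    parameter_values: list = field(default_factory=list)
    simulation_modifier: Optional[Callable] = None

    def __add__(self, other):
        def chained(sim):
            # Apply the left-hand modifier first, then the right-hand one
            for modifier in (self.simulation_modifier, other.simulation_modifier):
                if modifier is not None:
                    sim = modifier(sim)
            return sim

        has_modifier = self.simulation_modifier or other.simulation_modifier
        return SketchPolicy(
            name=f"{self.name} + {other.name}",
            parameter_values=self.parameter_values + other.parameter_values,
            simulation_modifier=chained if has_modifier else None,
        )


# A plain list stands in for the simulation object so the order is visible
policy_a = SketchPolicy("A", ["pa_value"], lambda sim: sim + ["a applied"])
policy_b = SketchPolicy("B", ["ni_value"], lambda sim: sim + ["b applied"])
combined = policy_a + policy_b
```

Because the modifiers are chained in order, policy_a + policy_b and policy_b + policy_a can differ whenever both touch the same part of the simulation.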

Simulation modifiers

For reforms that cannot be expressed as parameter value changes, Policy accepts a simulation_modifier callable that directly manipulates the underlying policyengine_core simulation:

def my_modifier(sim):
    """Custom reform logic applied to the core simulation object."""
    p = sim.tax_benefit_system.parameters
    # Modify parameters programmatically
    return sim

policy = Policy(
    name="Custom reform",
    simulation_modifier=my_modifier,
)

Note: the UK model supports simulation_modifier. The US model currently only uses the parameter_values path.

Dynamic behavioural responses

The Dynamic class is structurally identical to Policy and represents behavioural responses to policy changes (e.g., labour supply elasticities). It is applied after the policy in the simulation pipeline.

from policyengine.core.dynamic import Dynamic

dynamic = Dynamic(
    name="Labour supply response",
    parameter_values=[...],  # Same format as Policy
)

simulation = Simulation(
    dataset=dataset,
    tax_benefit_model_version=uk_latest,
    policy=policy,
    dynamic=dynamic,
)

Dynamic responses can also be combined using the + operator and support simulation_modifier callables.

Outputs

Output classes provide structured analysis of simulation results.

Aggregate

Calculate aggregate statistics (sum, mean, count) for any variable:

from policyengine.outputs.aggregate import Aggregate, AggregateType

# Total universal credit spending
agg = Aggregate(
    simulation=simulation,
    variable="universal_credit",
    aggregate_type=AggregateType.SUM,
    entity="benunit",  # Map to benunit level
)
agg.run()
print(f"Total UC spending: £{agg.result / 1e9:.1f}bn")

# Mean household income in top decile
agg = Aggregate(
    simulation=simulation,
    variable="household_net_income",
    aggregate_type=AggregateType.MEAN,
    filter_variable="household_net_income",
    quantile=10,
    quantile_eq=10,  # 10th decile
)
agg.run()
print(f"Mean income in top decile: £{agg.result:,.0f}")
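
Survey aggregates are weighted: each record stands for as many units as its weight. The sketch below shows the three aggregate types under the assumption that a count is the sum of weights; weighted_aggregate is a hypothetical helper, not part of the package.

```python
def weighted_aggregate(values, weights, how):
    """Weighted sum, mean, or count over survey records."""
    if how == "sum":
        return sum(v * w for v, w in zip(values, weights))
    if how == "mean":
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)
    if how == "count":
        return sum(weights)  # assumed: each record counts for its weight
    raise ValueError(f"unknown aggregate type: {how}")


# Three benefit units with annual UC amounts and survey weights
uc = [0.0, 4_800.0, 7_200.0]
weights = [1_000.0, 500.0, 250.0]

total_uc = weighted_aggregate(uc, weights, "sum")
mean_uc = weighted_aggregate(uc, weights, "mean")
n_benunits = weighted_aggregate(uc, weights, "count")
```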

ChangeAggregate

Analyse impacts of policy reforms:

from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType

# Count winners and losers
winners = ChangeAggregate(
    baseline_simulation=baseline,
    reform_simulation=reform,
    variable="household_net_income",
    aggregate_type=ChangeAggregateType.COUNT,
    change_geq=1,  # Gain at least £1
)
winners.run()
print(f"Winners: {winners.result / 1e6:.1f}m households")

losers = ChangeAggregate(
    baseline_simulation=baseline,
    reform_simulation=reform,
    variable="household_net_income",
    aggregate_type=ChangeAggregateType.COUNT,
    change_leq=-1,  # Lose at least £1
)
losers.run()
print(f"Losers: {losers.result / 1e6:.1f}m households")

# Revenue impact
revenue = ChangeAggregate(
    baseline_simulation=baseline,
    reform_simulation=reform,
    variable="household_tax",
    aggregate_type=ChangeAggregateType.SUM,
)
revenue.run()
print(f"Revenue change: £{revenue.result / 1e9:.1f}bn")
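
The change_geq and change_leq thresholds can be read as filters on the per-household difference between reform and baseline. A minimal sketch under that assumption follows; count_changes is a hypothetical helper and the figures are made up for illustration.

```python
def count_changes(baseline, reform, weights, change_geq=None, change_leq=None):
    """Weighted count of records whose change (reform - baseline) passes the filters."""
    total = 0.0
    for b, r, w in zip(baseline, reform, weights):
        change = r - b
        if change_geq is not None and change < change_geq:
            continue  # change is below the lower threshold
        if change_leq is not None and change > change_leq:
            continue  # change is above the upper threshold
        total += w
    return total


baseline_income = [20_000.0, 35_000.0, 60_000.0, 15_000.0]
reform_income = [20_486.0, 35_486.0, 59_000.0, 15_000.0]
household_weight = [1_200.0, 900.0, 300.0, 600.0]

winners = count_changes(baseline_income, reform_income, household_weight, change_geq=1)
losers = count_changes(baseline_income, reform_income, household_weight, change_leq=-1)
unchanged = sum(household_weight) - winners - losers
```

With thresholds of +1 and -1, households whose income moves by less than £1 fall into neither group, which is why the remainder is computed separately.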

Entity mapping

The package automatically handles entity mapping when variables are defined at different entity levels.

Entity hierarchy

UK:

household
    └── benunit (benefit unit)
            └── person

US:

household
    ├── tax_unit
    ├── spm_unit
    ├── family
    └── marital_unit
            └── person

Automatic mapping

When you request a person-level variable (like ssi) at household level, the package:

  1. Sums person-level values within each household (aggregation)

  2. Returns household-level data with proper weights

# SSI is defined at person level, but we want household-level totals
agg = Aggregate(
    simulation=simulation,
    variable="ssi",  # Person-level variable
    entity="household",  # Target household level
    aggregate_type=AggregateType.SUM,
)
# Internally maps person → household by summing SSI for all persons in each household

When you request a household-level variable at person level, the package replicates each household's value to all persons in that household (expansion).
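
Both directions described above can be sketched with plain Python lists. This illustrates the idea only; the package works on MicroDataFrame objects, and the figures here are invented.

```python
from collections import defaultdict

# Join key: the household each person belongs to
person_household_id = [0, 0, 1]

# Aggregation: sum a person-level variable within each household
ssi = [0.0, 3_000.0, 9_000.0]
household_ssi = defaultdict(float)
for hh, value in zip(person_household_id, ssi):
    household_ssi[hh] += value

# Expansion: copy a household-level variable to every member
rent = {0: 15_000, 1: 12_000}
person_rent = [rent[hh] for hh in person_household_id]
```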

Direct entity mapping

You can also map data between entities directly using the map_to_entity method:

# Map person income to household level (sum)
household_income = dataset.data.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)

# Map household rent to person level (project/broadcast)
person_rent = dataset.data.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["rent"],
    how="project"
)

Mapping with custom values

You can map custom value arrays instead of existing columns:

# Map custom per-person values to household level
import numpy as np

# Create custom values (e.g., imputed data)
custom_values = np.array([100, 200, 150, 300])

household_totals = dataset.data.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=custom_values,
    how="sum"
)

Aggregation methods

The how parameter controls how values are mapped:

Person → Group (aggregation):

# `data` is shorthand for the dataset's entity tables
data = dataset.data

# Sum person incomes to household level
household_income = data.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum"
)

# Take first person's age as household reference
household_age = data.map_to_entity(
    source_entity="person",
    target_entity="household",
    columns=["age"],
    how="first"
)

Group → Person (expansion):

# Broadcast household rent to each person
person_rent = data.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["rent"],
    how="project"
)

# Split household savings equally per person
person_savings = data.map_to_entity(
    source_entity="household",
    target_entity="person",
    columns=["total_savings"],
    how="divide"
)

Group → Group (via person entity):

# UK: Sum benunit benefits to household level
household_benefits = data.map_to_entity(
    source_entity="benunit",
    target_entity="household",
    columns=["universal_credit"],
    how="sum"
)

# US: Map tax unit income to household, splitting by members
household_from_tax = data.map_to_entity(
    source_entity="tax_unit",
    target_entity="household",
    columns=["taxable_income"],
    how="divide"
)
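
One way a group-to-group mapping can route through the person entity is to split each source-group value among its members and then sum those shares within the target group. The sketch below shows this for benefit units and households; it is an assumed approach for illustration, and the package's actual implementation may differ.

```python
from collections import defaultdict

# The person table links each benefit unit to a household
person_benunit_id = [0, 0, 1, 2]
person_household_id = [0, 0, 0, 1]

# Benefit-unit-level values to be mapped to households
benunit_uc = {0: 6_000.0, 1: 2_400.0, 2: 0.0}

# Step 1: divide each benefit unit's value equally among its members
members = defaultdict(int)
for bu in person_benunit_id:
    members[bu] += 1
person_share = [benunit_uc[bu] / members[bu] for bu in person_benunit_id]

# Step 2: sum the person-level shares within each household
household_uc = defaultdict(float)
for hh, share in zip(person_household_id, person_share):
    household_uc[hh] += share
```

Splitting before summing avoids double counting: simply broadcasting the benefit unit value to each member and summing would count a two-person benefit unit twice.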

Visualisation

The package includes utilities for creating PolicyEngine-branded visualisations:

from policyengine.utils.plotting import format_fig, COLORS
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=[1, 2, 3], y=[4, 5, 6]))

format_fig(
    fig,
    title="My chart",
    xaxis_title="X axis",
    yaxis_title="Y axis",
    height=600,
    width=800,
)
fig.show()

Brand colours

COLORS = {
    "primary": "#319795",        # Teal
    "success": "#22C55E",        # Green
    "warning": "#FEC601",        # Yellow
    "error": "#EF4444",          # Red
    "info": "#1890FF",           # Blue
    "blue_secondary": "#026AA2", # Dark blue
    "gray": "#667085",           # Gray
}

Common workflows

1. Analyse employment income variation

See UK employment income variation for a complete example of:

2. Policy reform analysis

See UK policy reform analysis for:

3. Distributional analysis

See US income distribution for:

Best practices

Creating custom datasets

  1. Always set would_claim variables: Benefits won’t be claimed unless explicitly enabled

    "would_claim_uc": [True] * n_benunits  # benunit-level, as in the custom dataset example

  2. Set disability variables explicitly: Prevents random UC spikes from the LCWRA element

    "is_disabled_for_benefits": [False] * n_people
    "uc_limited_capability_for_WRA": [False] * n_people

  3. Include required join keys: Person data needs entity membership

    "person_household_id": household_ids
    "person_benunit_id": benunit_ids  # UK only

  4. Set required household fields: Vary by country

    # UK
    "region": ["LONDON"] * n_households
    "tenure_type": ["RENT_PRIVATELY"] * n_households

    # US
    "state_code": ["CA"] * n_households

Performance optimisation

  1. Single simulation for variations: Create all scenarios in one dataset, run once

  2. Custom variable selection: Only calculate needed variables

  3. Filter efficiently: Use quantile filters for decile analysis

  4. Parallel analysis: Multiple Aggregate calls can run independently
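
As a sketch of the first point, the columns for many single-adult scenarios can be generated in one pass and then wrapped in MicroDataFrame objects as in the custom dataset example. The column names follow that example; the income grid and the fixed age are arbitrary assumptions.

```python
# One household per scenario, each with a single adult at a different income
incomes = list(range(0, 100_001, 10_000))
n = len(incomes)
ids = list(range(n))

person = {
    "person_id": ids,
    "person_household_id": ids,  # person i lives in household i
    "person_benunit_id": ids,    # and forms benefit unit i
    "age": [35] * n,
    "employment_income": incomes,
    "person_weight": [1.0] * n,
}
household = {
    "household_id": ids,
    "region": ["LONDON"] * n,
    "household_weight": [1.0] * n,
}
benunit = {
    "benunit_id": ids,
    "would_claim_uc": [True] * n,
    "benunit_weight": [1.0] * n,
}
```

Running one simulation over this dataset yields a result row for every income level, instead of one simulation per scenario.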

Data integrity

  1. Check weights: Ensure weights sum to expected population

  2. Validate join keys: All persons should link to valid households

  3. Review output ranges: Check calculated values are reasonable

  4. Test edge cases: Zero income, high income, disabled, elderly
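
The first two checks can be automated before a simulation is run. The helper below is a minimal sketch: check_dataset, its parameters and the 1% tolerance are hypothetical choices, not part of the package.

```python
def check_dataset(person_household_id, household_id, household_weight,
                  expected_households, tolerance=0.01):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    known = set(household_id)

    # Every person must link to a household that exists
    orphans = sorted({hh for hh in person_household_id if hh not in known})
    if orphans:
        problems.append(f"persons reference unknown households: {orphans}")

    # Every household should contain at least one person
    empty = sorted(known - set(person_household_id))
    if empty:
        problems.append(f"households with no persons: {empty}")

    if any(w < 0 for w in household_weight):
        problems.append("negative household weights")

    # Weights should sum to roughly the expected number of households
    total = sum(household_weight)
    if abs(total - expected_households) > tolerance * expected_households:
        problems.append(
            f"weights sum to {total:,.0f}, expected about {expected_households:,.0f}"
        )
    return problems


ok = check_dataset([0, 0, 1], [0, 1], [1.0, 1.0], expected_households=2)
bad = check_dataset([0, 0, 5], [0, 1], [1.0, 1.0], expected_households=10)
```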

Next steps