Imputations

PolicyEngine UK Data enhances the Family Resources Survey (FRS) with variables from other surveys using statistical imputation. Most imputations use Quantile Regression Forests (QRF), which predict the full conditional distribution of each target variable given the predictor variables. The capital gains and student loan plan steps use other methods, described in their own sections.
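To illustrate what imputing from a conditional distribution (rather than a conditional mean) looks like, here is a minimal sketch. It is a deliberately simplified stand-in, not a quantile regression forest: donor records are grouped by a single binned predictor, and the imputed value is read off at a random quantile of that group's empirical distribution. The function names, the toy donor data, and the choice to draw a random quantile are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def fit_conditional_sampler(records, predictor, target, bin_width):
    """Group donor-survey target values by a binned predictor (a crude 'leaf')."""
    groups = defaultdict(list)
    for row in records:
        groups[int(row[predictor] // bin_width)].append(row[target])
    return {key: sorted(values) for key, values in groups.items()}


def draw(sampler, value, bin_width, rng):
    """Return the target value at a random quantile of the matching group."""
    values = sampler[int(value // bin_width)]
    q = rng.random()  # random quantile in [0, 1)
    return values[min(int(q * len(values)), len(values) - 1)]


# Hypothetical donor records (not real survey data).
donors = [
    {"income": 12_000, "savings": 500},
    {"income": 18_000, "savings": 2_000},
    {"income": 15_000, "savings": 0},
    {"income": 55_000, "savings": 20_000},
    {"income": 58_000, "savings": 35_000},
]
sampler = fit_conditional_sampler(donors, "income", "savings", bin_width=20_000)
rng = random.Random(0)
imputed = draw(sampler, 16_000, 20_000, rng)
```

Drawing from the distribution instead of predicting its mean keeps the spread of the imputed variable realistic, which matters when the target is highly skewed, as wealth and spending are.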

Imputation Pipeline Order

The imputations are applied in this order (dependencies noted):

  1. Wealth (from WAS)

  2. Consumption (from LCFS); requires num_vehicles from the wealth imputation

  3. VAT (from ETB)

  4. Public Services (from ETB)

  5. Income (from SPI)

  6. Capital Gains (from Advani-Summers data)

  7. Salary Sacrifice (from FRS subsample)

  8. Student Loan Plan (rule-based, from age)
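The ordering constraint above can be sketched as a small dependency check. The step names and the `requires`/`provides` structure are hypothetical and are not taken from the PolicyEngine codebase; the only dependency encoded is the one stated in the list (consumption needs num_vehicles from wealth).

```python
# Each step: (name, variables it requires, variables it provides).
# Only the dependency named in the list above is encoded; the rest are left empty.
PIPELINE = [
    ("wealth", set(), {"num_vehicles"}),
    ("consumption", {"num_vehicles"}, set()),
    ("vat", set(), set()),
    ("public_services", set(), set()),
    ("income", set(), set()),
    ("capital_gains", set(), set()),
    ("salary_sacrifice", set(), set()),
    ("student_loan_plan", set(), set()),
]


def check_order(pipeline):
    """Raise if a step runs before the variables it needs have been imputed."""
    available = set()
    for name, requires, provides in pipeline:
        missing = requires - available
        if missing:
            raise ValueError(f"{name} needs {sorted(missing)} from an earlier step")
        available |= provides
    return [name for name, _, _ in pipeline]


order = check_order(PIPELINE)
```

Running consumption before wealth would raise a `ValueError`, which is the failure the documented order avoids.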


Wealth Imputation

Source: Wealth and Assets Survey (WAS) Round 7 (2018-2020)

Imputes household wealth components using demographic and income predictors.

Predictors

| Variable | Description |
|---|---|
| household_net_income | Total household income after taxes |
| num_adults | Number of adults in household |
| num_children | Number of children in household |
| private_pension_income | Income from private pensions |
| employment_income | Income from employment |
| self_employment_income | Income from self-employment |
| capital_income | Income from capital/investments |
| num_bedrooms | Number of bedrooms in dwelling |
| council_tax | Annual council tax payment |
| is_renting | Whether household rents (vs owns) |
| region | UK region |

Outputs

| Variable | Description |
|---|---|
| owned_land | Value of owned land |
| property_wealth | Total property wealth |
| corporate_wealth | Shares, pensions, investment ISAs |
| gross_financial_wealth | Total financial assets |
| net_financial_wealth | Financial assets minus liabilities |
| main_residence_value | Value of main home |
| other_residential_property_value | Value of other properties |
| non_residential_property_value | Value of non-residential property |
| savings | Savings account balances |
| num_vehicles | Number of vehicles owned |

Consumption Imputation

Source: Living Costs and Food Survey (LCFS) 2021-22

Imputes household spending patterns for indirect tax modeling.

Predictors

| Variable | Description |
|---|---|
| is_adult | Number of adults |
| is_child | Number of children |
| region | UK region |
| employment_income | Employment income |
| self_employment_income | Self-employment income |
| private_pension_income | Private pension income |
| household_net_income | Total household income |
| has_fuel_consumption | Whether household buys petrol/diesel (from WAS) |

Outputs

| Variable | Description |
|---|---|
| food_and_non_alcoholic_beverages_consumption | Food spending |
| alcohol_and_tobacco_consumption | Alcohol/tobacco spending |
| clothing_and_footwear_consumption | Clothing spending |
| housing_water_and_electricity_consumption | Housing costs |
| household_furnishings_consumption | Furnishings spending |
| health_consumption | Health spending |
| transport_consumption | Transport spending |
| communication_consumption | Communication spending |
| recreation_consumption | Recreation spending |
| education_consumption | Education spending |
| restaurants_and_hotels_consumption | Restaurants/hotels spending |
| miscellaneous_consumption | Other spending |
| petrol_spending | Petrol fuel spending |
| diesel_spending | Diesel fuel spending |
| domestic_energy_consumption | Home energy spending |

Bridging WAS Vehicle Ownership to LCFS Fuel Spending

LCFS two-week diaries undercount fuel purchasers: 58% of LCFS households record a fuel purchase, compared with 78% of households that own a vehicle (NTS 2024). We bridge this gap using WAS vehicle data:

  1. In WAS: Create has_fuel_consumption from vehicle ownership:

    • has_fuel_consumption = (num_vehicles > 0) AND (random < 0.90), where random is a uniform draw between 0 and 1

    • The 90% accounts for EVs/PHEVs that don’t buy petrol/diesel

    • Source: NTS 2024 shows 59% petrol + 30% diesel + ~1% hybrid fuel use, about 90% in total

  2. Train QRF: Predict has_fuel_consumption from demographics (income, adults, children, region)

  3. Apply to LCFS: Impute has_fuel_consumption to LCFS households before training consumption model

  4. At FRS imputation time: Compute has_fuel_consumption directly from num_vehicles (already calibrated to NTS targets)

  5. Zero non-fuel households: After imputation, set petrol_spending and diesel_spending to zero for households where has_fuel_consumption = 0

This ensures fuel duty incidence aligns with actual vehicle ownership (about 70% of households: 78% owning a vehicle × 90% of those using petrol or diesel) rather than with whether a household happened to buy fuel during its two-week LCFS diary.
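A minimal sketch of the flag construction and the final zeroing step, with a check of the 78% × 90% arithmetic. The household records and spending amounts are placeholders, and the function names are hypothetical; only the 0.90 factor and the NTS ownership shares (22% none, 44% one, 34% two or more) come from this page.

```python
import random

ICE_SHARE = 0.90  # share of vehicle-owning households assumed to buy petrol/diesel


def has_fuel_consumption(num_vehicles, rng):
    """Flag a household as a fuel buyer from its vehicle count."""
    return num_vehicles > 0 and rng.random() < ICE_SHARE


def zero_non_fuel(household):
    """Zero fuel spending for households flagged as non-fuel."""
    if household["has_fuel_consumption"]:
        return household
    return {**household, "petrol_spending": 0.0, "diesel_spending": 0.0}


rng = random.Random(0)
# Hypothetical households drawn with the NTS ownership shares.
vehicle_counts = rng.choices([0, 1, 2], weights=[22, 44, 34], k=20_000)
households = [
    zero_non_fuel({
        "num_vehicles": n,
        "has_fuel_consumption": has_fuel_consumption(n, rng),
        "petrol_spending": 800.0,  # placeholder imputed value
        "diesel_spending": 300.0,  # placeholder imputed value
    })
    for n in vehicle_counts
]
fuel_share = sum(h["has_fuel_consumption"] for h in households) / len(households)
expected_share = 0.78 * ICE_SHARE  # about 0.70
print(f"simulated: {fuel_share:.3f}, expected: {expected_share:.3f}")
```

The simulated share should land close to the expected value, up to sampling noise.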


VAT Imputation

Source: Effects of Taxes and Benefits (ETB) 1977-2021

Imputes the share of household spending subject to full-rate VAT.

Predictors

| Variable | Description |
|---|---|
| is_adult | Number of adults |
| is_child | Number of children |
| is_SP_age | Number at State Pension age |
| household_net_income | Total household income |

Outputs

| Variable | Description |
|---|---|
| full_rate_vat_expenditure_rate | Share of spending at 20% VAT |

Income Imputation

Source: Survey of Personal Incomes (SPI) 2020-21

Imputes detailed income components to create “synthetic taxpayers” with higher incomes than typically captured in the FRS. These records initially have zero weight but can be upweighted during calibration to match HMRC income distribution targets.

Predictors

| Variable | Description |
|---|---|
| age | Person’s age |
| gender | Male/Female |
| region | UK region |

Outputs

| Variable | Description |
|---|---|
| employment_income | Income from employment |
| self_employment_income | Income from self-employment |
| savings_interest_income | Interest on savings |
| dividend_income | Dividend income |
| private_pension_income | Private pension income |
| property_income | Rental/property income |

Capital Gains Imputation

Source: Advani-Summers capital gains distribution data

Uses a gradient-based optimization approach rather than QRF. The dataset is doubled, with one half receiving imputed capital gains amounts. Weights are then optimized to match the empirical relationship between total income and capital gains incidence.

Method

  1. Double the dataset (original + clone)

  2. Assign capital gains to one adult per household in the cloned half

  3. Optimize blend weights to match income-band capital gains incidence from Advani-Summers data
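The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the target incidences are made-up numbers rather than Advani-Summers figures, the loss is a plain squared error per income band, and each household's weight is split between its original record and its gains-holding clone by a per-band share. All function names are hypothetical.

```python
def optimise_blend(targets, lr=0.1, steps=500):
    """Gradient descent on sum_b (share_b - target_b)^2, shares clipped to [0, 1]."""
    shares = {band: 0.5 for band in targets}
    for _ in range(steps):
        for band, target in targets.items():
            grad = 2.0 * (shares[band] - target)
            shares[band] = min(1.0, max(0.0, shares[band] - lr * grad))
    return shares


def blend_weights(households, shares):
    """Double the dataset and split each weight between original and clone."""
    blended = []
    for hh in households:
        s = shares[hh["band"]]
        blended.append({**hh, "has_gains": False, "weight": hh["weight"] * (1 - s)})
        blended.append({**hh, "has_gains": True, "weight": hh["weight"] * s})
    return blended


def incidence(blended, band):
    """Weighted share of records in a band that hold capital gains."""
    total = sum(r["weight"] for r in blended if r["band"] == band)
    gains = sum(r["weight"] for r in blended if r["band"] == band and r["has_gains"])
    return gains / total


# Hypothetical targets and households (illustrative only).
targets = {"low": 0.02, "middle": 0.10, "high": 0.40}
households = [
    {"band": "low", "weight": 1200.0},
    {"band": "low", "weight": 800.0},
    {"band": "middle", "weight": 1500.0},
    {"band": "high", "weight": 300.0},
]
shares = optimise_blend(targets)
blended = blend_weights(households, shares)
```

Because each household's two records share its original weight, the total population weight is unchanged; only the split between records with and without gains moves.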


Salary Sacrifice Imputation

Source: FRS 2023-24 (respondents asked about salary sacrifice)

Imputes pension contributions made via salary sacrifice arrangements.

Predictors

| Variable | Description |
|---|---|
| age | Person’s age |
| employment_income | Employment income |

Outputs

| Variable | Description |
|---|---|
| pension_contributions_via_salary_sacrifice | Annual pension contributions made via salary sacrifice |

Training Data


Student Loan Plan Imputation

Source: Rule-based (not QRF)

Assigns student loan plan type based on age and reported repayments.

Logic

  1. If student_loan_repayments > 0, person has a loan

  2. Estimate university start year = simulation_year - age + 18 (birth year plus 18, assuming entry at age 18)

  3. Assign plan:

    • Plan 1: Started before September 2012

    • Plan 2: Started September 2012 - August 2023

    • Plan 5: Started September 2023 onwards
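The rule above can be written as a short function. This is a sketch: the function name and return labels are hypothetical, and because only a start year is estimated, it assumes the course begins in September of that year, so a start year of 2012 counts as Plan 2 and 2023 as Plan 5.

```python
def student_loan_plan(age, student_loan_repayments, simulation_year):
    """Assign a plan type from age and reported repayments; None if no loan."""
    if student_loan_repayments <= 0:
        return None  # no repayments reported, so no loan is assumed
    start_year = simulation_year - age + 18  # estimated university start year
    if start_year < 2012:
        return "PLAN_1"  # started before September 2012
    if start_year < 2023:
        return "PLAN_2"  # started September 2012 to August 2023
    return "PLAN_5"      # started September 2023 onwards


# A 30-year-old with repayments in 2025 has an estimated start year of 2013.
plan = student_loan_plan(age=30, student_loan_repayments=500, simulation_year=2025)
```

Here `plan` is `"PLAN_2"`, since 2025 - 30 + 18 = 2013.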


Calibration Targets

After imputation, household weights are calibrated to match aggregate statistics from:

| Source | Targets |
|---|---|
| OBR | Tax revenues, benefit expenditures (20 programs) |
| ONS | Age/region populations, family types, tenure |
| HMRC | Income distributions by band (7 income types × 14 bands) |
| DWP | Universal Credit statistics, two-child limit |
| NTS | Vehicle ownership (22% none, 44% one, 34% two+) |
| Council Tax | Households by council tax band |
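The calibration method itself is not described on this page. As a single-target illustration of what "matching an aggregate statistic" means, the sketch below rescales household weights so that weighted vehicle ownership shares equal the NTS figures. It uses simple post-stratification, which is an assumption chosen for clarity and cannot handle the many simultaneous targets listed above; the household records and function names are hypothetical.

```python
NTS_TARGETS = {"none": 0.22, "one": 0.44, "two_plus": 0.34}


def vehicle_group(num_vehicles):
    """Map a vehicle count to the NTS ownership category."""
    if num_vehicles == 0:
        return "none"
    return "one" if num_vehicles == 1 else "two_plus"


def poststratify(households, targets):
    """Rescale weights within each group so group shares match the targets."""
    total = sum(h["weight"] for h in households)
    group_totals = {group: 0.0 for group in targets}
    for h in households:
        group_totals[vehicle_group(h["num_vehicles"])] += h["weight"]
    calibrated = []
    for h in households:
        group = vehicle_group(h["num_vehicles"])
        factor = targets[group] * total / group_totals[group]
        calibrated.append({**h, "weight": h["weight"] * factor})
    return calibrated


# Hypothetical households with equal starting weights.
households = [
    {"num_vehicles": 0, "weight": 100.0},
    {"num_vehicles": 1, "weight": 100.0},
    {"num_vehicles": 2, "weight": 100.0},
    {"num_vehicles": 3, "weight": 100.0},
]
calibrated = poststratify(households, NTS_TARGETS)
```

The total weight is preserved while the share of weight in each ownership group moves to its target.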