# Methodology
On this page, we'll walk step by step through the process we use to create PolicyEngine's dataset.
1. Family Resources Survey: we'll start with the FRS, looking at how close it is to reality. To give ourselves a concrete starting point, we'll assume benefit payments are as reported in the survey.
2. FRS (+ tax-benefit model): we need to make sure that our tax-benefit model isn't doing anything unexpected. If we turn on simulation of taxes and benefits, does anything look unexpected? If not, great: we've turned a household survey into something useful for policy analysis. We'll also take stock here of what we're missing from reality.
3. Wealth and consumption: the most obvious things we're missing are wealth and consumption. We'll impute those here.
4. Fine-tuning: we'll use reweighting to make some final adjustments, ensuring our dataset is as close to reality as possible.
5. Validation: we'll compare our dataset to the UK's official statistics and see how we're doing.
## The Family Resources Survey
First, we'll start with the FRS as-is. Skipping over the technical details of how we actually feed this data into the model (you can find that in policyengine_uk_data/datasets/frs/), we need to decide how we're actually going to measure 'close to reality'. We need to define an objective function; if our final dataset improves it a lot, we can call that a success.
We'll define this objective function using public statistics that we can generally agree are of high importance in describing the UK household sector. These are statistics that, if the survey gets them wrong, we'd expect to cause inaccuracy in our model; if we get them all mostly right, we can be reasonably confident that we have an accurate tax-benefit model.
For this, we’ve gone through and collected:
- Demographics from the ONS: ten-year age band populations by region of the UK, national family type populations, and national tenure type populations.
- Incomes from HMRC: for each of 14 total income bands, the number of people with, and the combined amount of, each of the seven income types that account for over 99% of total income: employment, self-employment, State Pension, private pension, property, savings interest, and dividends.
- Tax-benefit programs from the DWP and OBR: statistics on caseloads, expenditures and revenues for all 20 major tax-benefit programs.
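To make the objective concrete, here's a minimal sketch of how a candidate dataset might be scored against these targets. The statistic names and numbers are invented stand-ins for illustration, not the real target set:

```python
import pandas as pd

# Hypothetical targets and estimates, one per statistic; names and values
# are illustrative placeholders, not the actual collected statistics.
targets = pd.Series({
    "ons/population_london_age_30_40": 1.3e6,
    "hmrc/employment_income_band_5_amount": 98e9,
    "dwp/universal_credit_expenditure": 38e9,
})
estimates = pd.Series({
    "ons/population_london_age_30_40": 1.29e6,
    "hmrc/employment_income_band_5_amount": 81e9,
    "dwp/universal_credit_expenditure": 30e9,
})

comparison = pd.DataFrame({"estimate": estimates, "target": targets})
comparison["error"] = comparison.estimate - comparison.target
comparison["rel_error"] = comparison.error / comparison.target
comparison["abs_rel_error"] = comparison.rel_error.abs()

# One headline number to improve: mean absolute relative error.
objective = comparison.abs_rel_error.mean()
print(f"Mean absolute relative error: {objective:.1%}")
```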
Let's first take a look at the initial FRS, our starting point and generally considered the best dataset to use (most major tax-benefit models use it essentially unmodified), and see how close it is to reproducing these statistics.
The table below shows the result, and it's really quite bad: look at the relative errors.
*Table (interactive, not rendered here): FRS estimate versus target for each statistic. Columns: name, estimate, target, error, abs_error, rel_error, abs_rel_error, type.*
It’s easier to understand ‘what kind of bad’ this is by splitting out the statistics into those three categories. Here’s a histogram of the absolute relative errors.
A few notes:
- We're comparing statistics over the same time period (2022), and only making minimal adjustments to them: OBR statistics are taken directly from the latest EFO, ONS statistics are the most recent projections for 2022, and HMRC statistics are uprated from 2021 to 2022 using the same standard uprating factors we use in the model (only a one-year adjustment; a tiny sketch of this arithmetic follows these notes).
- Demographics look basically fine. That's expected, because the DWP applies an optimisation algorithm to the household weights to bring them as close as possible to a similar set of demographic statistics. It's a good sign that we use slightly different statistics than the weights were trained on and still get good accuracy.
- Incomes don't look great at all. We'll take a closer look below to understand why, but the FRS is well known to under-report income significantly.
- Tax-benefit programs also look poor, and this is a concern: we're using this dataset to answer questions about tax-benefit programs, and the FRS isn't even providing a good representation of them under baseline law.
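For clarity, the uprating step is just a one-year growth adjustment; both numbers below are invented for illustration, and the real factors come from the model's standard uprating parameters:

```python
# Uprate a 2021 HMRC aggregate to 2022 with a one-year growth factor.
# Both values are hypothetical placeholders, not actual statistics.
employment_income_2021 = 710e9  # £, hypothetical HMRC aggregate
growth_2021_to_2022 = 1.06  # hypothetical earnings growth factor
employment_income_2022 = employment_income_2021 * growth_2021_to_2022
print(f"£{employment_income_2022 / 1e9:.0f}bn")
```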
There are a few interesting things here:
- The FRS over-estimates incomes in the upper-middle of the distribution and under-estimates them at the top. The likely reason: the FRS misses the very top of the distribution completely, and the weight optimisation (which scales up the working-age age groups to hit their population targets) then inflates the middle of the distribution, overcompensating.
- Some income types, notably capital incomes, are severely under-estimated across all bands. This probably reflects issues with survey questionnaire design more than sampling bias.
OK, so what can we do about it?
## Simulating benefits
First, let’s turn on the model and check nothing unexpected happens. The table below shows each of our known statistics, and how they changed after replacing reported benefits with simulated benefits.
*Table (interactive, not rendered here): each statistic, indexed by name, before and after switching to simulated benefits. Columns: estimate, target, error, abs_error, rel_error, abs_rel_error and type, each reported for the original and simulated datasets, plus change_in_abs_rel_error.*
Again, a few notes:
- You might be thinking: 'why do some of the HMRC income statistics change?' That's because of the State Pension, which is simulated in the model. The State Pension is a component of total income, so adjusting someone's State Pension payment slightly can move them from one income band to another.
- Some of the tax-benefit statistics change, some for the better and some for the worse. This is expected for a variety of reasons; one is that incomes and benefits are often out of sync with each other in the data (income in the survey week might not match income in the benefit's assessment period).
## Adding imputations
Now, let's add in the imputations for wealth and consumption. For this, we train quantile regression forests (essentially, random forest models that capture the conditional distribution of the data rather than just its mean) to predict wealth and consumption variables in the FRS from variables it shares with other surveys; a minimal sketch follows the list below.
The datasets we use are:
- The Wealth and Assets Survey (WAS) for wealth imputations.
- The Living Costs and Food Survey (LCFS) for most consumption imputations.
- The Effects of Taxes and Benefits on Household Income (ETB) for the amount of consumption that is subject to the full VAT rate: different households have different profiles in terms of the share of their spending that falls on VATable items.
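Here's a minimal sketch of the imputation idea, using the open-source quantile-forest package as a stand-in for whichever QRF implementation is used in practice; the predictor and target variables are synthetic placeholders, not the real WAS/FRS columns:

```python
import numpy as np
from quantile_forest import RandomForestQuantileRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: X_was are predictors both surveys share (age,
# region, income, tenure, ...), y_was is a WAS wealth variable, and
# X_frs are the same predictors observed in the FRS.
X_was = rng.normal(size=(5_000, 4))
y_was = np.exp(X_was @ np.array([0.5, 0.2, 0.1, 0.3]) + rng.normal(size=5_000))
X_frs = rng.normal(size=(2_000, 4))

qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
qrf.fit(X_was, y_was)

# Predict a grid of conditional quantiles for each FRS record, then draw
# one quantile at random per record, so imputed values reproduce the
# conditional distribution of wealth rather than collapsing to the median.
grid = np.linspace(0.05, 0.95, 19)
preds = qrf.predict(X_frs, quantiles=list(grid))  # shape: (2_000, 19)
imputed_wealth = preds[np.arange(len(X_frs)), rng.integers(0, len(grid), len(X_frs))]
```

Sampling a random quantile per record, rather than always taking the median, is what preserves the spread of wealth conditional on the predictors, which is exactly what distribution-level statistics are sensitive to.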
Below is a table showing how adding these imputations alone changes our objective statistics (filtered to just the rows that changed). Not bad pre-calibration performance! And we've picked up an extra £200bn in taxes.
*Table (interactive, not rendered here): each changed statistic before (simulated) and after (imputed) adding the wealth and consumption imputations, with the same estimate, target and error columns as above, plus change_in_abs_rel_error.*
## Calibration
Now we've got a dataset that performs pretty well without explicitly targeting the official statistics we care about. So it's time to add the final touch: calibrating the weights to explicitly minimise error against the target set. A minimal sketch of the idea is below; the table after it shows the results.
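This sketch shows gradient-based reweighting, assuming PyTorch; the contribution matrix and targets are synthetic stand-ins for the real statistics, and this illustrates the idea rather than the production calibration code:

```python
import torch

torch.manual_seed(0)
n_households, n_targets = 1_000, 20

# Synthetic stand-ins: M[i, j] is household j's contribution to statistic
# i (its employment income, say, or 1 if it contains someone in a given
# age band and region), and `targets` holds the official values.
M = torch.rand(n_targets, n_households)
targets = M @ torch.full((n_households,), 2.0)

# Optimise log-weights so the weights themselves stay positive.
log_weights = torch.zeros(n_households, requires_grad=True)
optimizer = torch.optim.Adam([log_weights], lr=0.1)

for _ in range(1_000):
    optimizer.zero_grad()
    estimates = torch.exp(log_weights) @ M.T
    # Relative errors put £bn aggregates and million-person counts on a
    # comparable scale before averaging.
    loss = (((estimates - targets) / targets) ** 2).mean()
    loss.backward()
    optimizer.step()

calibrated_weights = torch.exp(log_weights).detach()
```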
*Table (interactive, not rendered here): each statistic before (imputed) and after (calibrated) the reweighting, with the same estimate, target and error columns as above, plus change_in_abs_rel_error.*
Let’s also look at incomes.
So, what seems to be happening here is that the FRS just doesn't have enough high-income records for calibration to work straight away. The optimiser can't simply assign very high weights to the few rich people we do have, because doing so would hurt performance on the demographic statistics.
So, we need a way to add more high-income records. What we'll do is (see the sketch after this list):

1. Train a QRF model to predict the distributions of Survey of Personal Incomes income variables from FRS demographic variables.
2. For each FRS person, add an 'imputed income' clone with zero weight.
3. Run the calibration again.
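Here's a sketch of step 2, assuming a pandas person table; all column names are illustrative, not the real FRS variables:

```python
import pandas as pd

# A toy person table standing in for the FRS (columns are illustrative).
frs_persons = pd.DataFrame({
    "age": [34, 67, 45],
    "region": ["LONDON", "WALES", "NORTH_EAST"],
    "employment_income": [30_000.0, 0.0, 55_000.0],
    "weight": [1_500.0, 2_100.0, 1_800.0],
})

clones = frs_persons.copy()
clones["weight"] = 0.0  # invisible until calibration raises their weights
# In the real pipeline, the clones' income variables are replaced with
# draws from the SPI-trained QRF; here we just mark them for that step.
clones["income_source"] = "spi_imputed"
frs_persons["income_source"] = "frs_reported"

enhanced_persons = pd.concat([frs_persons, clones], ignore_index=True)
```

Because the clones start at zero weight, they change nothing until the calibration decides it needs them; the optimiser can then raise their weights to fill in the missing top of the income distribution without distorting the demographic targets.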
## The Enhanced FRS
Let’s see how this new dataset performs.
*Table (interactive, not rendered here): each statistic before (calibrated) and after (enhanced) adding the SPI-imputed high-income records, with the same estimate, target and error columns as above, plus change_in_abs_rel_error.*
And finally, let’s look at those incomes again.
Everything looks healthy here! We’ve got a dataset that’s close to reality, and we can have confidence in our tax-benefit model.