Writing datasets#
A common use case of PolicyEngine Core country models is not just simulating for a few households, but thousands in the form of microsimulation on survey data. This technique can be used to simulate the impact of a policy on a population, or to compare the impact of different policies on the same population. To do this, we need to be able to load data into PolicyEngine Core, and to do this we use a standardised format using the Dataset
class.
Example#
Here’s the Country Template’s default example for a dataset.
from policyengine_core.country_template.constants import COUNTRY_DIR
from policyengine_core.data import Dataset
from policyengine_core.periods import ETERNITY, MONTH, period
class CountryTemplateDataset(Dataset):
# Specify metadata used to describe and store the dataset.
name = "country_template_dataset"
label = "Country template dataset"
folder_path = COUNTRY_DIR / "data" / "storage"
data_format = Dataset.TIME_PERIOD_ARRAYS
# The generation function is the most important part: it defines
# how the dataset is generated from the raw data for a given year.
def generate(self, year: int) -> None:
person_id = [0, 1, 2]
household_id = [0, 1]
person_household_id = [0, 0, 1]
person_household_role = ["parent", "child", "parent"]
salary = [100, 0, 200]
salary_time_period = period("2022-01")
weight = [1e6, 1.2e6]
weight_time_period = period("2022")
data = {
"person_id": {ETERNITY: person_id},
"household_id": {ETERNITY: household_id},
"person_household_id": {ETERNITY: person_household_id},
"person_household_role": {ETERNITY: person_household_role},
"salary": {salary_time_period: salary},
"household_weight": {weight_time_period: weight},
}
self.save_variable_values(year, data)
# Important: we must instantiate datasets. This tests their validity and adds dynamic logic.
CountryTemplateDataset = CountryTemplateDataset()
Dataset API#
PolicyEngine Core also includes two subclasses of Dataset
:
PublicDataset
- a dataset that is publicly available, and can be downloaded from a URL. Includes adownload
method to download the dataset.PrivateDataset
- a dataset that is not publicly available, and must be downloaded from a private URL (specifically, Google Cloud buckets). Includes adownload
method to download the dataset, and aupload
method to upload the dataset.
See Data for the API reference.