Data#

The policyengine_core.data module is an addition to the original OpenFisca Core, from which this project was forked. It provides a standardised class definition of a Dataset, which contains all the data needed to instantiate an OpenFisca simulation with data that is representative at the national (or a lower) level.

Dataset#

class policyengine_core.data.Dataset(require: bool = False)[source]#

Bases: object

The Dataset class is a base class for datasets used directly or indirectly for microsimulation models. A dataset defines a generation function to create it from other data, and this class provides common features like storage, metadata and loading.

ARRAYS = 'arrays'#

FLAT_FILE = 'flat_file'#

TABLES = 'tables'#

TIME_PERIOD_ARRAYS = 'time_period_arrays'#

data_format: str = None#: The format of the dataset. This can be one of Dataset.ARRAYS, Dataset.TIME_PERIOD_ARRAYS or Dataset.TABLES. If Dataset.ARRAYS, the dataset is stored as a collection of arrays. If Dataset.TIME_PERIOD_ARRAYS, the dataset is stored as a collection of arrays, with one array per time period. If Dataset.TABLES, the dataset is stored as a collection of tables (DataFrames).

download(url: str = None)[source]#

Downloads a file to the dataset’s file path.

Parameters:: url (str) – The url to download.

property exists: bool#

Checks whether the dataset exists.

Returns:: Whether the dataset exists.
Return type:: bool

file_path: Path = None#: The path to the dataset file. This is used to load the dataset from a file.

static from_dataframe(dataframe: DataFrame, time_period: str = None)[source]#

Creates a dataset from a DataFrame.

Returns:: The dataset.
Return type:: Dataset

static from_file(file_path: str, time_period: str = None)[source]#

Creates a dataset from a file.

Parameters:: file_path (str) – The file path to create the dataset from.
Returns:: The dataset.
Return type:: Dataset

generate()[source]#

Generates the dataset for a given year (all datasets should implement this method).

Raises:: NotImplementedError – If the function has not been overridden.
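
The override contract described above can be sketched without the library installed. This is a minimal stand-in, not the policyengine_core implementation: ToyDataset and IncomeDataset are hypothetical names, and the in-memory store attribute is an assumption made only so the example runs on its own.

```python
from typing import Dict, List


class ToyDataset:
    """Hypothetical stand-in that mirrors only the documented generate() contract."""

    name: str = None
    label: str = None

    def __init__(self) -> None:
        # Assumption: an in-memory dict replaces the on-disk storage of the real class.
        self.store: Dict[str, Dict[str, List[float]]] = {}

    def generate(self) -> None:
        # The base class cannot build any data itself, so subclasses must override this.
        raise NotImplementedError(f"{type(self).__name__} must override generate().")


class IncomeDataset(ToyDataset):
    name = "income_2022"
    label = "Income (2022)"

    def generate(self) -> None:
        # A real dataset would derive these values from other data; here they are hard-coded.
        self.store = {"employment_income": {"2022": [25000, 25000, 30000, 30000]}}


dataset = IncomeDataset()
dataset.generate()
```

Calling generate() on ToyDataset directly raises NotImplementedError, which is the behaviour documented for a dataset that has not overridden the method.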

label: str = None#: The label of the dataset. This is used for logging and is used as the key in the datasets dictionary.

load(key: str = None, mode: str = 'r') → Union[File, array, DataFrame, HDFStore][source]#

Loads the dataset for a given year, returning an H5 file reader. You can then access the dataset like a dictionary (e.g. Dataset.load(2022)["variable"]).

Parameters:

key (str, optional) – The key to load. Defaults to None.
mode (str, optional) – The mode to open the file with. Defaults to “r”.

Returns:

The dataset.

Return type:

Union[h5py.File, np.array, pd.DataFrame, pd.HDFStore]

load_dataset()[source]#

Loads a complete dataset from disk.

Returns:: The dataset.
Return type:: Dict[str, Dict[str, Sequence]]

name: str = None#: The name of the dataset. This is used to generate filenames and is used as the key in the datasets dictionary.

remove()[source]#: Removes the dataset from disk.

save(key: str, values: Union[array, DataFrame])[source]#

Overwrites the values for key with values.

Parameters:

key (str) – The key to save.
values (Union[np.array, pd.DataFrame]) – The values to save.

save_dataset(data, file_path: str = None) → None[source]#

Writes a complete dataset to disk.

Parameters:: data – The data to save.

>>> import numpy as np
>>> from typing import Dict, Sequence
>>> example_data: Dict[str, Dict[str, Sequence]] = {
...     "employment_income": {
...         "2022": np.array([25000, 25000, 30000, 30000]),
...     },
... }
>>> example_data["employment_income"]["2022"] = [25000, 25000, 30000, 30000]
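
Before writing such a structure to disk, it can help to confirm that it really nests as variable, then time period, then values. The helper below is a hypothetical sketch (summarise_dataset is not part of the library) that checks the nesting and reports how many values each entry holds. It uses plain lists so that it runs without NumPy.

```python
from collections.abc import Mapping
from typing import Dict, Sequence


def summarise_dataset(data: Dict[str, Dict[str, Sequence]]) -> Dict[str, Dict[str, int]]:
    """Check the variable -> time period -> values nesting and count values per entry."""
    summary: Dict[str, Dict[str, int]] = {}
    for variable, by_period in data.items():
        if not isinstance(by_period, Mapping):
            raise TypeError(f"{variable!r} should map time periods to values")
        summary[variable] = {}
        for period, values in by_period.items():
            if not isinstance(period, str):
                raise TypeError(f"time period {period!r} of {variable!r} should be a string")
            summary[variable][period] = len(values)
    return summary


example_data = {
    "employment_income": {"2022": [25000, 25000, 30000, 30000]},
}
summary = summarise_dataset(example_data)
```

The check deliberately does not compare lengths across variables, since nothing in this reference states that all variables must hold the same number of values.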

store_file(file_path: str)[source]#

Moves a file to the dataset’s file path.

Parameters:: file_path (str) – The file path to move.
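
As an illustration of what moving a file to the dataset's file path can look like, here is a sketch built on the standard library only. It assumes a plain filesystem move; move_into_place and the file names are hypothetical, and the library's own method may handle more cases.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_place(source: str, dataset_file_path: Path) -> None:
    """Move source to dataset_file_path, creating the parent folder first."""
    dataset_file_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(source, str(dataset_file_path))


with tempfile.TemporaryDirectory() as folder:
    # A placeholder file stands in for a dataset obtained elsewhere.
    source = Path(folder) / "downloaded.h5"
    source.write_bytes(b"placeholder contents")

    target = Path(folder) / "storage" / "income_2022.h5"
    move_into_place(str(source), target)

    # After the move, only the target location holds the file.
    moved = target.exists() and not source.exists()
```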

time_period: str = None#: The time period of the dataset. This is used to automatically enter the values in the correct time period if the data type is Dataset.ARRAYS.
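
One way to read this: data in the Dataset.ARRAYS format is keyed by variable only, and time_period supplies the missing period key. The sketch below assumes the target is the nested variable-then-period layout shown for save_dataset; assign_time_period is a hypothetical helper, not a library function.

```python
from typing import Dict, Sequence


def assign_time_period(
    arrays: Dict[str, Sequence], time_period: str
) -> Dict[str, Dict[str, Sequence]]:
    """Nest each variable's values under a single time period key."""
    return {variable: {time_period: values} for variable, values in arrays.items()}


flat = {"employment_income": [25000, 25000, 30000, 30000]}
nested = assign_time_period(flat, "2022")
```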

url: str = None#: The URL to download the dataset from. This is used to download the dataset if it does not exist.
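
The download-if-missing behaviour can be sketched as a small guard around a fetch step. Everything here is an assumption for illustration: ensure_available and fake_fetch are hypothetical, the URL is a placeholder, and no network access takes place.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_available(file_path: Path, url: str, fetch: Callable[[str, Path], None]) -> bool:
    """Fetch the file only when it is missing; return True if a fetch happened."""
    if file_path.exists():
        return False
    fetch(url, file_path)
    return True


def fake_fetch(url: str, destination: Path) -> None:
    # Stand-in for a real download: writes placeholder bytes instead of using the network.
    destination.write_bytes(b"placeholder contents")


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "income_2022.h5"
    first_call = ensure_available(path, "https://example.com/income_2022.h5", fake_fetch)
    second_call = ensure_available(path, "https://example.com/income_2022.h5", fake_fetch)
```

The first call fetches because the file is absent; the second finds the file and does nothing.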

property variables: List[str]#

Returns the variables in the dataset.

Returns:: The variables in the dataset.
Return type:: List[str]
