Data#

The policyengine_core.data module is new relative to OpenFisca Core, from which this package was forked. It provides a standardised class definition, Dataset, which holds all the data needed to instantiate an OpenFisca simulation with data that is representative at the national (or a lower) level.

Dataset#

class policyengine_core.data.Dataset(require: bool = False)[source]#

Bases: object

The Dataset class is a base class for datasets used directly or indirectly for microsimulation models. A dataset defines a generation function to create it from other data, and this class provides common features like storage, metadata and loading.

ARRAYS = 'arrays'#
FLAT_FILE = 'flat_file'#
TABLES = 'tables'#
TIME_PERIOD_ARRAYS = 'time_period_arrays'#
data_format: str = None#

The format of the dataset. This can be either Dataset.ARRAYS, Dataset.TIME_PERIOD_ARRAYS or Dataset.TABLES. If Dataset.ARRAYS, the dataset is stored as a collection of arrays. If Dataset.TIME_PERIOD_ARRAYS, the dataset is stored as a collection of arrays, with one array per time period. If Dataset.TABLES, the dataset is stored as a collection of tables (DataFrames).
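As a rough illustration of how the two array-based formats differ, the sketch below uses plain Python lists in place of NumPy arrays. The flat {variable: values} layout shown for Dataset.ARRAYS and the helper to_time_period_arrays are assumptions made for this example and are not part of policyengine_core. The nested layout follows the Dict[str, Dict[str, Sequence]] type that load_dataset is documented to return.

```python
from typing import Dict, Sequence

# Assumed layout for Dataset.ARRAYS: one array per variable.
arrays_data: Dict[str, Sequence] = {
    "employment_income": [25000, 25000, 30000, 30000],
}

# Layout for Dataset.TIME_PERIOD_ARRAYS: one array per variable per time period.
time_period_arrays_data: Dict[str, Dict[str, Sequence]] = {
    "employment_income": {
        "2022": [25000, 25000, 30000, 30000],
    },
}


def to_time_period_arrays(
    data: Dict[str, Sequence], time_period: str
) -> Dict[str, Dict[str, Sequence]]:
    """Hypothetical helper: file every flat array under a single time period."""
    return {variable: {time_period: values} for variable, values in data.items()}


# Lifting the flat layout into "2022" gives the nested layout above.
lifted = to_time_period_arrays(arrays_data, "2022")
```

This mirrors what the time_period attribute is described as doing for Dataset.ARRAYS data: it supplies the period that the flat arrays belong to.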

download(url: str = None)[source]#

Downloads a file to the dataset’s file path.

Parameters:

url (str) – The URL to download the file from.

property exists: bool#

Checks whether the dataset exists.

Returns:

Whether the dataset exists.

Return type:

bool

file_path: Path = None#

The path to the dataset file. This is used to load the dataset from a file.

static from_dataframe(dataframe: DataFrame, time_period: str = None)[source]#

Creates a dataset from a DataFrame.

Returns:

The dataset.

Return type:

Dataset

static from_file(file_path: str, time_period: str = None)[source]#

Creates a dataset from a file.

Parameters:

file_path (str) – The file path to create the dataset from.

Returns:

The dataset.

Return type:

Dataset

generate()[source]#

Generates the dataset for a given year (all datasets should implement this method).

Raises:

NotImplementedError – If the function has not been overridden.
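The override pattern can be sketched without the library installed. In the example below, SketchDataset is a stand-in written for this illustration (it stores JSON rather than HDF5 and is not the policyengine_core implementation), and ExampleIncomes is a hypothetical subclass. Only the method and attribute names are taken from this page.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Sequence


class SketchDataset:
    """Stand-in for the documented interface (NOT the policyengine_core class).

    It stores data as JSON so the example has no third-party dependencies.
    """

    name: str = None
    label: str = None
    file_path: Path = None
    time_period: str = None

    def generate(self) -> None:
        # The base class has nothing to generate; subclasses must override.
        raise NotImplementedError("Subclasses should override generate().")

    @property
    def exists(self) -> bool:
        return self.file_path.exists()

    def save_dataset(self, data: Dict[str, Dict[str, Sequence]]) -> None:
        self.file_path.write_text(json.dumps(data))

    def load_dataset(self) -> Dict[str, Dict[str, Sequence]]:
        return json.loads(self.file_path.read_text())

    def remove(self) -> None:
        self.file_path.unlink()


class ExampleIncomes(SketchDataset):
    """Hypothetical dataset that builds its values in generate()."""

    name = "example_incomes"
    label = "Example incomes"
    time_period = "2022"
    file_path = Path(tempfile.mkdtemp()) / "example_incomes.json"

    def generate(self) -> None:
        # Build the data from some other source, then write it to disk.
        self.save_dataset(
            {"employment_income": {self.time_period: [25000, 25000, 30000, 30000]}}
        )


dataset = ExampleIncomes()
if not dataset.exists:
    dataset.generate()  # create the file on first use
loaded = dataset.load_dataset()
dataset.remove()  # clean up the temporary file
```

A real subclass would inherit from policyengine_core.data.Dataset and also set data_format. The stand-in is used here only to keep the example self-contained.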

label: str = None#

The label of the dataset. This is used for logging and as the key in the datasets dictionary.

load(key: str = None, mode: str = 'r') → Union[File, array, DataFrame, HDFStore][source]#

Loads the dataset for a given year, returning an H5 file reader. You can then access the dataset like a dictionary (e.g. Dataset.load(2022)["variable"]).

Parameters:
  • key (str, optional) – The key to load. Defaults to None.

  • mode (str, optional) – The mode to open the file with. Defaults to 'r'.

Returns:

The dataset.

Return type:

Union[h5py.File, np.array, pd.DataFrame, pd.HDFStore]

load_dataset()[source]#

Loads a complete dataset from disk.

Returns:

The dataset.

Return type:

Dict[str, Dict[str, Sequence]]

name: str = None#

The name of the dataset. This is used to generate filenames and as the key in the datasets dictionary.

remove()[source]#

Removes the dataset from disk.

save(key: str, values: Union[array, DataFrame])[source]#

Overwrites the values for key with values.

Parameters:
  • key (str) – The key to save.

  • values (Union[np.array, pd.DataFrame]) – The values to save.

save_dataset(data, file_path: str = None) → None[source]#

Writes a complete dataset to disk.

Parameters:

data – The data to save.

>>> from typing import Dict, Sequence
>>> import numpy as np
>>> example_data: Dict[str, Dict[str, Sequence]] = {
...     "employment_income": {
...         "2022": np.array([25000, 25000, 30000, 30000]),
...     },
... }
>>> example_data["employment_income"]["2022"] = [25000, 25000, 30000, 30000]

store_file(file_path: str)[source]#

Moves a file to the dataset’s file path.

Parameters:

file_path (str) – The file path to move.

time_period: str = None#

The time period of the dataset. This is used to automatically enter the values in the correct time period if data_format is Dataset.ARRAYS.

url: str = None#

The URL to download the dataset from. This is used to download the dataset if it does not exist.
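The "download only if missing" behaviour described here can be sketched with the standard library alone. The helper ensure_local_copy is hypothetical and is not how policyengine_core performs downloads. It only illustrates the check-then-fetch logic, using a local file:// URL so that no network access is needed.

```python
import tempfile
import urllib.request
from pathlib import Path


def ensure_local_copy(file_path: Path, url: str) -> bool:
    """Hypothetical helper: fetch url into file_path only when it is missing.

    Returns True if a download happened, False if the file was already there.
    """
    if file_path.exists():
        return False
    file_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(file_path))
    return True


# Demonstrate with a local file:// URL so the example needs no network access.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.h5"
source.write_bytes(b"placeholder contents")
target = workdir / "data" / "example.h5"

first = ensure_local_copy(target, source.as_uri())   # file is fetched
second = ensure_local_copy(target, source.as_uri())  # file already present
```

The second call returns without fetching anything, which is the property that makes a url attribute safe to consult on every load.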

property variables: List[str]#

Returns the variables in the dataset.

Returns:

The variables in the dataset.

Return type:

List[str]

PublicDataset#

PrivateDataset#