This module is a new feature that was not part of the original OpenFisca Core, from which this project was forked. It provides a standardised class definition of a Dataset, which contains all the data needed to instantiate an OpenFisca simulation with nationally (or lower) representative data.


class bool = False)[source]#

Bases: object

The Dataset class is a base class for datasets used directly or indirectly for microsimulation models. A dataset defines a generation function to create it from other data, and this class provides common features like storage, metadata and loading.

ARRAYS = 'arrays'#
TABLES = 'tables'#
TIME_PERIOD_ARRAYS = 'time_period_arrays'#
data_format: str = None#

The format of the dataset. This can be either Dataset.ARRAYS, Dataset.TIME_PERIOD_ARRAYS or Dataset.TABLES. If Dataset.ARRAYS, the dataset is stored as a collection of arrays. If Dataset.TIME_PERIOD_ARRAYS, the dataset is stored as a collection of arrays, with one array per time period. If Dataset.TABLES, the dataset is stored as a collection of tables (DataFrames).
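
As a rough illustration of how the three layouts differ, the sketch below uses plain Python lists in place of NumPy arrays and pandas DataFrames so that it runs without dependencies. The variable names, table name and values are made up, and the exact on-disk layout used by the library may differ.

```python
# Hypothetical data; plain lists stand in for NumPy arrays and DataFrames.

# Dataset.ARRAYS: one array per variable.
arrays_layout = {
    "employment_income": [25000, 25000, 30000, 30000],
}

# Dataset.TIME_PERIOD_ARRAYS: one array per variable per time period.
time_period_arrays_layout = {
    "employment_income": {
        "2022": [25000, 25000, 30000, 30000],
        "2023": [26000, 26000, 31000, 31000],
    },
}

# Dataset.TABLES: one table per key (here: column name -> column values).
tables_layout = {
    "person": {
        "person_id": [1, 2, 3, 4],
        "employment_income": [25000, 25000, 30000, 30000],
    },
}
```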

download(url: Optional[str] = None)[source]#

Downloads a file to the dataset’s file path.


url (str, optional) – The URL to download. Defaults to None.

property exists: bool#

Checks whether the dataset exists.


Whether the dataset exists.

Return type

bool
file_path: pathlib.Path = None#

The path to the dataset file. This is used to load the dataset from a file.


Generates the dataset for a given year (all datasets should implement this method).


NotImplementedError – If the function has not been overridden.
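
The override requirement can be sketched as follows. This is a minimal stand-in, not the library's code: the class names are hypothetical, and the method name `generate` is an assumption, since the signature is not shown on this page.

```python
class BaseDataset:
    """Hypothetical stand-in for a dataset base class (not the library's code)."""

    def generate(self):
        # The base class cannot know how to build the data, so it refuses.
        raise NotImplementedError("Subclasses must override generate().")


class ExampleDataset(BaseDataset):
    def generate(self):
        # A real implementation would build the data and write it to disk.
        return {"employment_income": {"2022": [25000, 30000]}}


try:
    BaseDataset().generate()
    base_raised = False
except NotImplementedError:
    base_raised = True

generated = ExampleDataset().generate()
```

Calling the method on the base class raises, while the subclass that overrides it returns its data.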

label: str = None#

The label of the dataset. This is used for logging and as the key in the datasets dictionary.

load(key: Optional[str] = None, mode: str = 'r') → Union[h5py._hl.files.File, numpy.array, pandas.core.frame.DataFrame, pandas.io.pytables.HDFStore][source]#

Loads the dataset for a given year, returning an H5 file reader. You can then access the dataset like a dictionary (e.g. Dataset.load(2022)[“variable”]).

  • key (str, optional) – The key to load. Defaults to None.

  • mode (str, optional) – The mode to open the file with. Defaults to “r”.


The dataset.

Return type

Union[h5py.File, np.array, pd.DataFrame, pd.HDFStore]


Loads a complete dataset from disk.


The dataset.

Return type

Dict[str, Dict[str, Sequence]]
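
To make the nested return shape concrete, here is a small sketch that walks a dictionary of the documented type (variable name, then time period, then values). The data and the `summarise` helper are hypothetical and only illustrate the structure.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical data in the documented shape: variable -> time period -> values.
data: Dict[str, Dict[str, Sequence]] = {
    "employment_income": {"2022": [25000, 25000, 30000, 30000]},
    "age": {"2022": [34, 36, 51, 49]},
}


def summarise(dataset: Dict[str, Dict[str, Sequence]]) -> List[Tuple[str, str, int]]:
    """Return (variable, time period, number of values) for each array."""
    return [
        (variable, period, len(values))
        for variable, periods in dataset.items()
        for period, values in periods.items()
    ]


summary = summarise(data)
```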

name: str = None#

The name of the dataset. This is used to generate filenames and as the key in the datasets dictionary.


Removes the dataset from disk.

save(key: str, values: Union[numpy.array, pandas.core.frame.DataFrame])[source]#

Overwrites the values for key with values.

  • key (str) – The key to save.

  • values (Union[np.array, pd.DataFrame]) – The values to save.

save_dataset(data, file_path: Optional[str] = None) → None[source]#

Writes a complete dataset to disk.


data – The data to save.

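>>> from typing import Dict, Sequence
>>> import numpy as np
>>> example_data: Dict[str, Dict[str, Sequence]] = {
...     "employment_income": {
...         "2022": np.array([25000, 25000, 30000, 30000]),
...     },
... }
>>> # A plain list also satisfies the Sequence type hint.
>>> example_data["employment_income"]["2022"] = [25000, 25000, 30000, 30000]
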
store_file(file_path: str)[source]#

Moves a file to the dataset’s file path.


file_path (str) – The file path to move.
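
Moving a file into a dataset's storage location might look like the sketch below. It assumes `shutil.move` and a temporary directory for demonstration; the standalone `store_file` function and the file names are hypothetical, and the library's own method may work differently.

```python
import shutil
import tempfile
from pathlib import Path


def store_file(source: str, destination: Path) -> Path:
    """Move `source` to `destination`, creating parent folders if needed."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(source, str(destination))
    return destination


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "downloaded.h5"
    source.write_bytes(b"placeholder")
    target = Path(tmp) / "storage" / "example_dataset.h5"
    store_file(str(source), target)
    # After a move, the file exists only at the destination.
    moved = target.exists() and not source.exists()
```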

time_period: str = None#

The time period of the dataset. This is used to automatically enter the values in the correct time period if the data format is Dataset.ARRAYS.
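
One way to picture this is as wrapping each flat array under the dataset's time period. The helper below is a hypothetical sketch of that idea, not the library's implementation, and uses plain lists in place of arrays.

```python
def with_time_period(arrays, time_period):
    """Wrap each flat array as {time_period: array}."""
    return {name: {time_period: values} for name, values in arrays.items()}


# Hypothetical flat data, as it might be stored in the ARRAYS format.
flat = {"employment_income": [25000, 25000, 30000, 30000]}
by_period = with_time_period(flat, "2022")
```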

url: str = None#

The URL to download the dataset from. This is used to download the dataset if it does not exist.
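
The "download only if missing" behaviour can be sketched as below. The `ensure_downloaded` helper, the injected `fetch` callable and the placeholder URL are all assumptions made so that the example runs offline; they are not part of the library's API.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_downloaded(file_path: Path, url: str, fetch: Callable[[str], bytes]) -> bool:
    """Fetch `url` into `file_path` only if the file is missing.

    Returns True if a download happened, False if the file already existed.
    """
    if file_path.exists():
        return False
    file_path.write_bytes(fetch(url))
    return True


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request so the sketch runs offline.
    return b"placeholder bytes"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_dataset.h5"
    first = ensure_downloaded(path, "https://example.com/data.h5", fake_fetch)
    second = ensure_downloaded(path, "https://example.com/data.h5", fake_fetch)
```

The first call writes the file; the second finds it on disk and skips the fetch.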

property variables: List[str]#

Returns the variables in the dataset.


The variables in the dataset.

Return type

List[str]