Data
The policyengine_core.data module is a new feature added since the fork from the original OpenFisca Core. It provides a standardised class definition of a Dataset, which contains all the data needed to instantiate an OpenFisca simulation with data that is representative at the national (or a lower) level.
Dataset
- class policyengine_core.data.Dataset(require: bool = False)[source]#
Bases: object
The Dataset class is a base class for datasets used directly or indirectly for microsimulation models. A dataset defines a generation function to create it from other data, and this class provides common features like storage, metadata and loading.
- ARRAYS = 'arrays'#
- FLAT_FILE = 'flat_file'#
- TABLES = 'tables'#
- TIME_PERIOD_ARRAYS = 'time_period_arrays'#
- data_format: str = None#
The format of the dataset. This can be one of Dataset.ARRAYS, Dataset.TIME_PERIOD_ARRAYS or Dataset.TABLES. If Dataset.ARRAYS, the dataset is stored as a collection of arrays. If Dataset.TIME_PERIOD_ARRAYS, the dataset is stored as a collection of arrays, with one array per time period. If Dataset.TABLES, the dataset is stored as a collection of tables (DataFrames).
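The three layouts described above can be pictured with plain Python containers. This is only an illustrative sketch, not the library's storage code: lists stand in for arrays, a list of row dictionaries stands in for a DataFrame, and the names arrays_layout, time_period_arrays_layout, tables_layout and describe_layout are hypothetical.

```python
# Hypothetical, library-free illustration of the three layouts.
# Plain lists stand in for NumPy arrays, and a list of row dicts
# stands in for a pandas DataFrame.

# ARRAYS: one array per variable.
arrays_layout = {
    "employment_income": [25000, 25000, 30000, 30000],
}

# TIME_PERIOD_ARRAYS: one array per variable and per time period.
time_period_arrays_layout = {
    "employment_income": {
        "2022": [25000, 25000, 30000, 30000],
        "2023": [26000, 26000, 31000, 31000],
    },
}

# TABLES: one table per table name.
tables_layout = {
    "person": [
        {"person_id": 1, "employment_income": 25000},
        {"person_id": 2, "employment_income": 30000},
    ],
}


def describe_layout(data):
    """Guess which layout a mapping follows by inspecting its first value."""
    first = next(iter(data.values()))
    if isinstance(first, dict):
        return "time_period_arrays"
    if first and isinstance(first[0], dict):
        return "tables"
    return "arrays"
```

The strings returned by describe_layout match the values of the ARRAYS, TIME_PERIOD_ARRAYS and TABLES constants listed above.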
- download(url: str = None)[source]#
Downloads a file to the dataset’s file path.
- Parameters:
url (str) – The url to download.
- property exists: bool#
Checks whether the dataset exists.
- Returns:
Whether the dataset exists.
- Return type:
bool
- file_path: Path = None#
The path to the dataset file. This is used to load the dataset from a file.
- static from_dataframe(dataframe: DataFrame, time_period: str = None)[source]#
Creates a dataset from a DataFrame.
- Returns:
The dataset.
- Return type:
Dataset
- static from_file(file_path: str, time_period: str = None)[source]#
Creates a dataset from a file.
- Parameters:
file_path (str) – The file path to create the dataset from.
- Returns:
The dataset.
- Return type:
Dataset
- generate()[source]#
Generates the dataset for a given year (all datasets should implement this method).
- Raises:
NotImplementedError – If the function has not been overridden.
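The override pattern that generate() implies can be sketched as follows. To keep the example runnable without the library installed, SketchDataset is a hypothetical stand-in for the base class and only mirrors the documented attributes and the NotImplementedError behaviour. ExampleIncomeDataset and its values are likewise assumptions, and a real subclass would typically store its data rather than return it.

```python
class SketchDataset:
    """Stand-in for the documented base class (not the real implementation)."""

    name = None
    label = None
    data_format = None
    time_period = None

    def generate(self):
        # Mirrors the documented behaviour: the base class cannot build data.
        raise NotImplementedError("Subclasses must override generate().")


class ExampleIncomeDataset(SketchDataset):
    """Hypothetical dataset that declares its metadata and overrides generate()."""

    name = "example_income_2022"
    label = "Example income data (2022)"
    data_format = "arrays"  # the value of Dataset.ARRAYS in the listing above
    time_period = "2022"

    def generate(self):
        # A real dataset would build its data from other sources and store it;
        # this sketch simply returns the values.
        return {"employment_income": [25000, 25000, 30000, 30000]}
```

Calling generate() on the stand-in base class raises NotImplementedError, while the subclass returns its data.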
- label: str = None#
The label of the dataset. This is used for logging and is used as the key in the datasets dictionary.
- load(key: str = None, mode: str = 'r') Union[File, array, DataFrame, HDFStore] [source]#
Loads the dataset for a given year, returning an H5 file reader. You can then access the dataset like a dictionary (e.g. Dataset.load(2022)["variable"]).
- Parameters:
key (str, optional) – The key to load. Defaults to None.
mode (str, optional) – The mode to open the file with. Defaults to “r”.
- Returns:
The dataset.
- Return type:
Union[h5py.File, np.array, pd.DataFrame, pd.HDFStore]
- load_dataset()[source]#
Loads a complete dataset from disk.
- Returns:
The dataset.
- Return type:
Dict[str, Dict[str, Sequence]]
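A value of type Dict[str, Dict[str, Sequence]] can be walked with two nested loops. The sketch below assumes the outer keys are variable names and the inner keys are time periods, as in the save_dataset example on this page. The summarise helper and the sample values are hypothetical, and plain lists stand in for arrays.

```python
from typing import Dict, Sequence


def summarise(data: Dict[str, Dict[str, Sequence]]) -> Dict[str, Dict[str, float]]:
    """Return the sum of each variable's values for each time period."""
    return {
        variable: {
            period: float(sum(values)) for period, values in by_period.items()
        }
        for variable, by_period in data.items()
    }


# Sample data in the assumed variable -> period -> values shape.
example_data = {
    "employment_income": {
        "2022": [25000, 25000, 30000, 30000],
    },
}

totals = summarise(example_data)
```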
- name: str = None#
The name of the dataset. This is used to generate filenames and is used as the key in the datasets dictionary.
- save(key: str, values: Union[array, DataFrame])[source]#
Overwrites the values for key with values.
- Parameters:
key (str) – The key to save.
values (Union[np.array, pd.DataFrame]) – The values to save.
- save_dataset(data, file_path: str = None) None [source]#
Writes a complete dataset to disk.
- Parameters:
data – The data to save.
>>> example_data: Dict[str, Dict[str, Sequence]] = {
...     "employment_income": {
...         "2022": np.array([25000, 25000, 30000, 30000]),
...     },
... }
>>> example_data["employment_income"]["2022"] = [25000, 25000, 30000, 30000]
- store_file(file_path: str)[source]#
Moves a file to the dataset’s file path.
- Parameters:
file_path (str) – The file path to move.
- time_period: str = None#
The time period of the dataset. This is used to automatically enter the values in the correct time period if the data format is Dataset.ARRAYS.
- url: str = None#
The URL to download the dataset from. This is used to download the dataset if it does not exist.
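The download-if-missing behaviour described for url can be sketched in isolation. This is not the library's implementation: ensure_file and fake_fetch are hypothetical helpers, the URL is a placeholder, and the downloader writes a local file instead of using the network so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def ensure_file(
    file_path: Path,
    url: Optional[str],
    fetch: Callable[[str, Path], None],
) -> bool:
    """Fetch `url` into `file_path` only when the file is missing.

    Returns True if a download was performed, False if the file already existed.
    """
    if file_path.exists():
        return False
    if url is None:
        raise FileNotFoundError(f"{file_path} is missing and no URL is set.")
    fetch(url, file_path)
    return True


def fake_fetch(url: str, destination: Path) -> None:
    # Hypothetical downloader: writes a placeholder instead of using the network.
    destination.write_text(f"downloaded from {url}")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.h5"
    first = ensure_file(target, "https://example.com/example.h5", fake_fetch)
    second = ensure_file(target, "https://example.com/example.h5", fake_fetch)
```

Here first is True because the file had to be fetched, and second is False because the file already existed by then.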
- property variables: List[str]#
Returns the variables in the dataset.
- Returns:
The variables in the dataset.
- Return type:
List[str]