datasetgen.generator

Module Contents

datasetgen.generator._DEFAULT_SEED = 42[source]
datasetgen.generator._make_empty_df() → pd.DataFrame[source]

Generates an empty DataFrame with the columns indicated in the COLUMNS dict.

Returns

a new DataFrame

Return type

pd.DataFrame
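
A minimal sketch of how such an empty frame could be built, assuming COLUMNS maps column names to dtypes. The real contents of COLUMNS are not shown in this reference, so the entries and dtypes below are placeholders, and make_empty_df is a hypothetical stand-in rather than the module's own function.

```python
import pandas as pd

# Hypothetical schema: the real COLUMNS dict is not shown in this reference.
COLUMNS = {
    "reqDay": "int64",
    "JobSuccess": "bool",
    "NumCPU": "int64",
}


def make_empty_df(columns: dict) -> pd.DataFrame:
    """Build a zero-row DataFrame with one typed column per dict entry."""
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in columns.items()}
    )


empty = make_empty_df(COLUMNS)
print(list(empty.columns))
print(len(empty))
```

Declaring the dtypes up front keeps later appends from silently falling back to object columns.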

class datasetgen.generator.Day(date: datetime.date, df: pd.DataFrame = None)[source]

Bases: object

Initialize the basic information for the current day.

Parameters

date (datetime.date) – The current date of the Day object

__repr__(self)[source]
property df(self)[source]
reset_index(self)[source]

Reset the dataframe index in place.

Returns

self

Return type

Day

bulk_append(self, rows: List[dict])[source]

Insert multiple rows into the day’s dataframe.

For each row, it sets these default values:
  • reqDay = int(time.mktime(self._date.timetuple()))

  • JobSuccess = True

  • SiteName = 0

  • DataType = 0

  • FileType = 0

Also, if there is no information about the job, this function generates random fake CPU work information using gen_fake_cpu_work:

  • NumCPU

  • WrapWC

  • WrapCPU

  • CPUTime

  • IOTime

Parameters

rows (List[dict]) – List of rows

Returns

self

Return type

Day
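
The default-filling step described above can be sketched with the standard library alone. This is an illustration, not the library's code: apply_defaults and the "Filename" key are hypothetical, and it assumes that values already present in a row take precedence over the defaults, which the reference does not state.

```python
import datetime
import time
from typing import List


def apply_defaults(date: datetime.date, rows: List[dict]) -> List[dict]:
    """Return copies of rows with the documented defaults filled in."""
    defaults = {
        # Midnight of the given day in local time, as a Unix timestamp.
        "reqDay": int(time.mktime(date.timetuple())),
        "JobSuccess": True,
        "SiteName": 0,
        "DataType": 0,
        "FileType": 0,
    }
    # Assumption: values supplied by the caller override the defaults.
    return [{**defaults, **row} for row in rows]


rows = apply_defaults(
    datetime.date(2020, 1, 1),
    [{"Filename": "a"}, {"Filename": "b", "JobSuccess": False}],
)
print(rows[0]["JobSuccess"], rows[1]["JobSuccess"])
```

Note that time.mktime interprets the date in the local time zone, so the reqDay value differs between machines in different zones.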

append(self, row: dict)[source]

Insert a single row into the day’s dataframe.

It sets these default values:
  • reqDay = int(time.mktime(self._date.timetuple()))

  • JobSuccess = True

  • SiteName = 0

  • DataType = 0

  • FileType = 0

If there is no information about the job, this function generates random fake CPU work information using gen_fake_cpu_work:

  • NumCPU

  • WrapWC

  • WrapCPU

  • CPUTime

  • IOTime

Parameters

row (dict) – the current row’s columns

Returns

self

Return type

Day
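
As a rough sketch of the CPU fallback for a single row: the signature and output of gen_fake_cpu_work are not documented here, so fake_cpu_work below is a hypothetical stand-in with arbitrary values. The test for "no information about the job" (none of the five fields present) is also an assumption.

```python
import random

CPU_FIELDS = ("NumCPU", "WrapWC", "WrapCPU", "CPUTime", "IOTime")


def fake_cpu_work(rng: random.Random) -> dict:
    """Hypothetical stand-in for gen_fake_cpu_work; values are arbitrary."""
    return {field: rng.randint(1, 100) for field in CPU_FIELDS}


def fill_cpu_info(row: dict, rng: random.Random) -> dict:
    """Add fake CPU fields only when the row carries none of them."""
    if any(field in row for field in CPU_FIELDS):
        return dict(row)  # assumption: partial job info is left untouched
    return {**row, **fake_cpu_work(rng)}


rng = random.Random(42)
without_info = fill_cpu_info({"JobSuccess": True}, rng)
with_info = fill_cpu_info({"NumCPU": 4}, rng)
print(sorted(without_info))
print(with_info)
```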

save(self, dest_folder: PurePath = Path('.'))[source]

Export the current day’s dataframe as a zipped CSV file.

Parameters

dest_folder (PurePath, optional) – the destination directory, defaults to Path('.')

Returns

self

Return type

Day
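
The reference does not say which compression or file naming save uses. The sketch below assumes gzip and a hypothetical "<name>.csv.gz" file name, and uses only the standard library as a stand-in for the DataFrame export. With pandas, DataFrame.to_csv with compression="gzip" gives a comparable result.

```python
import csv
import gzip
import tempfile
from pathlib import Path
from typing import List


def save_rows(rows: List[dict], dest_folder: Path, name: str) -> Path:
    """Write rows to <dest_folder>/<name>.csv.gz and return the path.

    Assumes at least one row, whose keys become the CSV header.
    """
    dest_folder.mkdir(parents=True, exist_ok=True)
    target = dest_folder / f"{name}.csv.gz"
    with gzip.open(target, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = save_rows([{"reqDay": 1577836800, "JobSuccess": True}], Path(tmp), "day")
    # Read the file back to check the round trip; CSV values come back as text.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        restored = list(csv.DictReader(handle))

print(path.name)
print(restored)
```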

class datasetgen.generator.Generator(config: dict = {}, num_days: int = -1, num_req_x_day: int = -1, start_date: datetime.date = datetime.date(2020, 1, 1), seed: int = _DEFAULT_SEED, dest_folder: PurePath = Path('.'))[source]

Bases: object

The main generator object that creates datasets.

Initialize the generator.

Parameters
  • config (dict, optional) – A dictionary with the configuration to use, defaults to {}

  • num_days (int, optional) – number of days to generate, defaults to -1

  • num_req_x_day (int, optional) – number of requests per day, defaults to -1

  • start_date (datetime.date, optional) – the starting date of the generated data, defaults to datetime.date(2020, 1, 1)

  • seed (int, optional) – the random generator seed, defaults to _DEFAULT_SEED

  • dest_folder (PurePath, optional) – the folder where to store the dataset, defaults to Path('.')

property seed(self)[source]
__update_seeds(self)[source]

Initialize the random generator seeds.

Internal Python random generator seed and NumPy random seed.
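
Seeding both generators is what makes a run reproducible. A minimal sketch, assuming the method simply forwards one seed to random.seed and numpy.random.seed (update_seeds is a hypothetical free function, not the class's private method):

```python
import random

import numpy as np


def update_seeds(seed: int) -> None:
    """Seed Python's global generator and NumPy's legacy global generator."""
    random.seed(seed)
    np.random.seed(seed)


update_seeds(42)
first = (random.random(), np.random.rand())

update_seeds(42)
second = (random.random(), np.random.rand())

print(first == second)  # True: the same seed replays the same draws
```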

property df(self)[source]

Returns a new DataFrame that contains all the days’ dataframes.

Returns

the concatenated dataframe

Return type

pd.DataFrame
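
A sketch of the concatenation, assuming it amounts to pd.concat over the per-day frames. Whether the library resets the index is not stated, so ignore_index=True is an assumption, as are the toy column values.

```python
import pandas as pd

# Two toy per-day frames standing in for the days' dataframes.
day_frames = [
    pd.DataFrame({"reqDay": [1, 1], "JobSuccess": [True, False]}),
    pd.DataFrame({"reqDay": [2], "JobSuccess": [True]}),
]

# Assumption: rows are renumbered 0..n-1 across all days.
combined = pd.concat(day_frames, ignore_index=True)
print(len(combined))
```

pd.concat returns a new frame, so the per-day frames are left unchanged.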

property df_stats(self)[source]

Returns the concatenated days’ dataframes and some useful statistics.

Returns

a tuple with several DataFrames

Return type

Tuple[pd.DataFrame]

property days(self)[source]

Returns a list of days’ DataFrames.

Returns

a list with days’ DataFrames

Return type

List[pd.DataFrame]

property num_req_x_day(self)[source]
property num_days(self)[source]
property tot_num_requests(self)[source]
property dest_folder(self)[source]
clean(self)[source]

Delete all day dataframes.

prepare(self, function_name: str, kwargs: dict, max_buf_len: int = 1024)[source]

Prepare the dataset.

This method calls the generator functions.

Parameters
  • function_name (str) – The function to use during the preparation

  • kwargs (dict) – arguments of generator function

  • max_buf_len (int, optional) – size of row buffer, defaults to 1024

Yields

status percentage of the preparation

Return type

int
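
The buffering and progress reporting can be pictured with the following sketch. It is not the library's implementation: the free function prepare, its rows, total and sink parameters, and the flush-on-full policy are all assumptions made for illustration. In the library the flush step would presumably hand the buffer to something like Day.bulk_append, but that is a guess.

```python
from typing import Dict, Iterable, Iterator, List


def prepare(rows: Iterable[Dict], total: int, sink: List[List[Dict]],
            max_buf_len: int = 1024) -> Iterator[int]:
    """Buffer rows, flush full buffers to sink, and yield percent done."""
    buffer: List[Dict] = []
    done = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_buf_len:
            sink.append(buffer)  # hypothetical flush target
            done += len(buffer)
            buffer = []
            yield 100 * done // total
    if buffer:  # flush the final, partially filled buffer
        sink.append(buffer)
        done += len(buffer)
        yield 100 * done // total


flushed: List[List[Dict]] = []
source = ({"id": i} for i in range(10))
progress = list(prepare(source, total=10, sink=flushed, max_buf_len=4))
print(progress)  # [40, 80, 100]
```

Because prepare is a generator, nothing happens until it is iterated; a caller typically drives it in a for loop and updates a progress bar from each yielded value.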

_open_dataset_file(self, filename: str)[source]

Open a single dataset day.

Returns

the current day data

Return type

Day

open_data(self, folder: str)[source]

Open dataset from a folder.

Parameters

folder (str) – the dataset folder

save(self)[source]

Exports all days’ DataFrames in dest_folder.