howso.utilities#
Classes
Provides accessor methods for and dict-like access to inferred feature attributes.
Feature type enum.
Implements a thread-safe context manager for switching locales temporarily.
A dict-like object containing feature attributes for multiple tables.
Monitor progress of a task.
A dict-like object containing feature attributes for a single table or DataFrame.
Simple context manager to capture run duration of the inner context.
Return a callable that, when called, simply prints msg and cleanly exits.
Functions
Check and fix type problems with the data and reshape it.
Build a DataFrame from the response from react_series.
Check if features in features dict matches expected_feature_names.
Converts a Howso dict confusion matrix into the same format as sklearn.metrics.confusion_matrix.
Check if datetime format is ISO8601.
Convert date into epoch (i.e., seconds counted from Jan 1st 1970).
Update dict base with updates from dict updates in a "deep" fashion.
Deserialize case data into a DataFrame.
Determine which specific ISO8601 format the passed in date is in.
Print based on debug levels.
Convert epoch to date if epoch is not None or nan else, return as it is.
Format DataFrame columns to original type using feature attributes.
Decompose kwargs into a tuple of return values.
Calculates the absolute value of a matrix for feature pairs.
Return a dict-like feature attributes object with useful accessor methods.
Check if a given string is a valid uuid.
Preprocess a matrix including options to normalize, take the absolute value, and fill in the diagonals.
Return number of dimensions for a list.
Replace values of Double.MAX_VALUE (1.79769313486232E+308) with Infinity.
Replace None values with NaN values.
Replace None values with NaN values.
Reshapes X as a matrix and y as a vector.
Convert seconds to a time object.
Serialize case data into list of lists.
Serialize datetimes in the given list of cases, in-place.
Convert a time object to seconds since midnight.
Validate the case_indices parameter to the react() method of a Howso client.
Check that the passed in datetime value adheres to the ISO 8601 format.
Validate the feature types in features.
Validate the shape of a list.
This module contains various utilities for the Howso clients.
- exception howso.utilities.StopExecution#
Bases:
Exception
Raise a StopExecution as this is a cleaner exit() for Notebooks.
- class howso.utilities.FeatureAttributesBase(feature_attributes, params={}, unsupported=[])#
Bases:
dict
Provides accessor methods for and dict-like access to inferred feature attributes.
- Parameters:
feature_attributes (Mapping)
params (Dict, default: {})
unsupported (List[str], default: [])
- get_names(*, types=None, without=None)#
Get feature names associated with this FeatureAttributes object.
- Parameters:
types (str | Container | None, default: None) – (Optional) A feature type as a string (e.g., ‘continuous’) or a list of feature types to limit the output feature names.
without (Iterable[str] | None, default: None) – (Optional) An Iterable of feature names to exclude from the return object.
- Returns:
A list of feature names.
- Return type:
List[str]
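The filtering that get_names describes can be sketched on a plain dict. This is a minimal illustration, not the library's implementation: the helper name get_names_sketch and the sample features are hypothetical, and it assumes each feature's attributes store the feature type under a "type" key, as in the examples elsewhere on this page.

```python
from typing import Container, Iterable, List, Mapping, Optional, Union


def get_names_sketch(
    feature_attributes: Mapping[str, Mapping],
    *,
    types: Optional[Union[str, Container]] = None,
    without: Optional[Iterable[str]] = None,
) -> List[str]:
    """Hypothetical stand-in: filter feature names by type and exclusions."""
    # A single type string is treated like a one-element list of types.
    if isinstance(types, str):
        types = [types]
    excluded = set(without or [])
    names = []
    for name, attrs in feature_attributes.items():
        if types is not None and attrs.get("type") not in types:
            continue
        if name in excluded:
            continue
        names.append(name)
    return names


# Hypothetical feature attributes, for illustration only.
features = {
    "age": {"type": "continuous"},
    "color": {"type": "nominal"},
    "height": {"type": "continuous"},
}
continuous_names = get_names_sketch(features, types="continuous", without=["height"])
print(continuous_names)  # ['age']
```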
- get_parameters()#
Get the keyword arguments used with the initial call to infer_feature_attributes.
- Returns:
A dictionary containing the kwargs used in the call to infer_feature_attributes.
- Return type:
dict
- to_json()#
Get a JSON string representation of this FeatureAttributes object.
- Returns:
A JSON representation of the inferred feature attributes.
- Return type:
str
- abstract validate(data, coerce=False, raise_errors=False, validate_bounds=True, allow_missing_features=False, localize_datetimes=True)#
Validate the given data against this FeatureAttributes object.
Check that feature bounds and data types loosely describe the data. Optionally attempt to coerce the data into conformity.
- Parameters:
data (Any) – The data to validate.
coerce – Whether to attempt to coerce DataFrame columns into correct data types. Coerced datetimes will be localized to UTC.
raise_errors – If True, raises a ValueError if nonconforming columns are found; otherwise, issues a warning.
validate_bounds – Whether to validate the data against the attributes’ inferred bounds.
allow_missing_features – Allows features that are missing from the DataFrame to be ignored.
localize_datetimes – Whether to localize datetime features to UTC.
- Returns:
None or the coerced DataFrame if ‘coerce’ is True and there were no errors.
- class howso.utilities.FeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#
Bases:
Enum
Feature type enum.
- class howso.utilities.LocaleOverride(language_code, encoding=None, category=6)#
Bases:
object
Implements a thread-safe context manager for switching locales temporarily.
Background#
Python’s locale.setlocale() is not thread safe. In order to work with alternate locales temporarily, this ContextDecorator will use a thread lock on __enter__ and release said lock on __exit__.
Important Notes#
All other threads will be blocked within the scope of the context. It is important to avoid time-consuming execution inside.
Example Usage#
>>> # Parse a date string from French and format it in English.
>>> #
>>> # System locale is 'en-us' (in this example).
>>> import locale
>>> from datetime import datetime
>>> dt_format = '<some format>'
>>> dt_value = '<some French date string>'
>>> with LocaleOverride('fr-fr', category=locale.LC_TIME):
>>>     # We're in the French date-formatting zone here...
>>>     dt_obj = datetime.strptime(dt_value, dt_format)
>>>
>>> # Back in the 'en-us' locale again.
>>> dt_value = dt_obj.strftime(dt_format)
- Parameters:
language_code – A language code, usually given as one of:
2 lowercase letters for the base language. Ex: fr for French.
5 characters such as fr_CA, where the first 2 designate the base language (French in this example), followed by an _, followed by 2 uppercase characters designating the country-specific dialect (Canada in this example). This example designates the French-Canadian locale.
Any of the above, plus an optional encoding following a ‘.’. Ex: fr_FR.UTF-8
encoding – An encoding such as ‘UTF-8’ or ‘ISO8859-1’. If not provided and there is no embedded encoding within the language_code parameter, ‘UTF-8’ is used. If an encoding is embedded in the language_code parameter and an explicit encoding is provided here, the embedded encoding is dropped and ignored.
category – One of the constants defined in the locale module. See https://docs.python.org/3.9/library/locale.html for details. locale.LC_ALL is used if nothing is provided.
- restore()#
Restore the original locale and release the thread lock.
Use this method directly to restore the current context when not using this class as a context manager.
- setup()#
Set a thread lock and the locale as desired.
Use this method directly to setup a locale context when not using this class as a context manager.
- class howso.utilities.MultiTableFeatureAttributes(feature_attributes, params={}, unsupported=[])#
Bases:
FeatureAttributesBase
A dict-like object containing feature attributes for multiple tables.
- Parameters:
feature_attributes (Mapping)
params (Dict, default: {})
unsupported (List[str], default: [])
- class howso.utilities.ProgressTimer(total_ticks=100, *, start_tick=0)#
Bases:
Timer
Monitor progress of a task.
- Parameters:
total_ticks (int, default: 100) – The total number of ticks in the progress meter.
start_tick (int, default: 0) – The starting tick.
- reset()#
Reset the progress timer.
- Return type:
None
- start()#
Start the progress timer.
- Return type:
- update(ticks=1)#
Update the progress by given ticks.
- Parameters:
ticks (int, default: 1) – The number of ticks to increment/decrement by.
- Return type:
None
- property is_complete: bool#
If progress has reached completion.
- property progress: float#
The current progress percentage.
- property tick_duration: timedelta | None#
The duration since the last tick.
- Returns:
The duration since the last tick, or None if not yet started.
- property time_remaining: timedelta#
The estimated time remaining.
- Returns:
The time estimated to be remaining.
- Raises:
ValueError – If timer not yet started.
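One plausible way to estimate the time remaining from tick counts is linear extrapolation. The sketch below assumes that approach; it is not taken from the ProgressTimer source, and the function name estimate_remaining is hypothetical.

```python
from datetime import timedelta


def estimate_remaining(elapsed: timedelta, current_tick: int, total_ticks: int) -> timedelta:
    """Extrapolate remaining time from the average duration per completed tick."""
    if current_tick <= 0:
        # Nothing has completed yet, so there is no rate to extrapolate from.
        raise ValueError("At least one tick must be completed to estimate.")
    average_per_tick = elapsed / current_tick
    return average_per_tick * (total_ticks - current_tick)


# 25 of 100 ticks took 30 seconds, so 75 ticks remain at 1.2 seconds each.
remaining = estimate_remaining(timedelta(seconds=30), current_tick=25, total_ticks=100)
print(remaining)  # 0:01:30
```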
- class howso.utilities.SingleTableFeatureAttributes(feature_attributes, params={}, unsupported=[])#
Bases:
FeatureAttributesBase
A dict-like object containing feature attributes for a single table or DataFrame.
- Parameters:
feature_attributes (Mapping)
params (Dict, default: {})
unsupported (List[str], default: [])
- has_unsupported_data(feature_name)#
Returns whether the given feature has data that is unsupported by Howso Engine.
- Parameters:
feature_name (str) – The feature to check.
- Returns:
Whether feature_name was determined to have unsupported data.
- Return type:
bool
- to_dataframe(*, include_all=False)#
Return a DataFrame of the feature attributes.
Among other reasons, this is useful for presenting feature attributes in a Jupyter notebook or other medium.
- Parameters:
include_all (bool, default: False)
- Returns:
A DataFrame representation of the inferred feature attributes.
- Return type:
DataFrame
- validate(**kwargs)#
Validate the given single table data against this FeatureAttributes object.
Check that feature bounds and data types loosely describe the data. Optionally attempt to coerce the data into conformity.
- Parameters:
data (Any) – The data to validate (single table only).
coerce – Whether to attempt to coerce DataFrame columns into correct data types.
raise_errors – If True, raises a ValueError if nonconforming columns are found; otherwise, issues a warning.
validate_bounds – Whether to validate the data against the attributes’ inferred bounds.
allow_missing_features – Allows features that are missing from the DataFrame to be ignored.
localize_datetimes – Whether to localize datetime features to UTC.
- Returns:
None or the coerced DataFrame if ‘coerce’ is True and there were no errors.
- class howso.utilities.Timer#
Bases:
object
Simple context manager to capture run duration of the inner context.
Usage:
with Timer() as my_timer:
    # Perform the time-consuming task here...
    ...
print(f"The task took {my_timer.duration}.")
Results in:
"The task took 1:30:10.454419"
- end()#
End the timer.
- Return type:
None
- reset()#
Reset the timer.
- Return type:
None
- property duration: timedelta | None#
The total computed duration of the timer.
- Returns:
The total duration of the timer. When the timer has not yet ended, the duration between now and when the timer started will be returned. If the timer has not yet started, returns None.
- property has_ended: bool#
If the timer has ended.
- property has_started: bool#
If the timer has started.
- property seconds: float | None#
The total seconds representing the duration of timer instance.
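The duration semantics described above (None before starting, a running value while active, a fixed value after ending) can be sketched with a small context manager. The class SimpleTimer is a hypothetical stand-in built on time.monotonic, not the library's Timer.

```python
import time
from datetime import timedelta
from typing import Optional


class SimpleTimer:
    """Hypothetical stand-in for a duration-capturing context manager."""

    def __init__(self) -> None:
        self._start: Optional[float] = None
        self._end: Optional[float] = None

    def __enter__(self) -> "SimpleTimer":
        self._start = time.monotonic()
        self._end = None
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self._end = time.monotonic()

    @property
    def duration(self) -> Optional[timedelta]:
        if self._start is None:
            # Never started: there is no duration to report.
            return None
        # While still running, measure up to "now"; afterwards, up to the end.
        end = self._end if self._end is not None else time.monotonic()
        return timedelta(seconds=end - self._start)


unstarted = SimpleTimer()
with SimpleTimer() as timer:
    sum(range(1000))  # Placeholder for a time-consuming task.
print(f"The task took {timer.duration}.")
```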
- class howso.utilities.UserFriendlyExit(verbose=False)#
Bases:
object
Return a callable that, when called, simply prints msg and cleanly exits.
- Parameters:
verbose – If True, emit more information
- howso.utilities.align_data(x, y=None)#
Check and fix type problems with the data and reshape it.
X is a Matrix and y is a vector.
- Parameters:
x – Feature values ndarray.
y – Target values ndarray.
- howso.utilities.build_react_series_df(react_series_response, series_index=None)#
Build a DataFrame from the response from react_series.
If series_index is set, use that as a name for an additional feature that will be the series index.
- Parameters:
react_series_response – The response dictionary from a call to react_series.
series_index – The name of the series index feature, which will index each series in the form ‘series_<idx>’, e.g., series_1, series_2, …, series_n. If None, does not include the series index feature in the returned DataFrame.
- Returns:
A Pandas DataFrame defined by the action features and series data in the react_series response. Optionally includes a series index feature.
- howso.utilities.check_feature_names(features, expected_feature_names, raise_error=False)#
Check if the feature names in the features dict match expected_feature_names.
- Parameters:
features (Mapping) – A feature dictionary that maps feature names to their attributes.
expected_feature_names (Collection) – A list (or a set) of expected column names in the given features dictionary.
raise_error (bool, default: False) – Whether to raise a ValueError if the feature names in features do not match expected_feature_names.
- Returns:
True if the feature names in features match the expected feature names passed via expected_feature_names. Otherwise, False.
- Raises:
ValueError – If raise_error is True and the feature names in the features dict do not match expected_feature_names.
- Return type:
bool
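A check of this kind can be sketched as a set comparison. The sketch assumes that order and duplicates are irrelevant, which the description above does not state explicitly; the name check_names_sketch is hypothetical.

```python
from typing import Collection, Mapping


def check_names_sketch(
    features: Mapping, expected_feature_names: Collection, raise_error: bool = False
) -> bool:
    """Compare feature names as sets (assumption: order does not matter)."""
    matches = set(features.keys()) == set(expected_feature_names)
    if not matches and raise_error:
        raise ValueError(
            "Feature names in `features` do not match `expected_feature_names`."
        )
    return matches


# Hypothetical feature attributes, for illustration only.
features = {"age": {"type": "continuous"}, "color": {"type": "nominal"}}
same = check_names_sketch(features, ["color", "age"])
different = check_names_sketch(features, ["age"])
print(same, different)  # True False
```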
- howso.utilities.date_format_is_iso(f)#
Check if datetime format is ISO8601.
Does format match the iso8601 set that can be handled by the C parser? Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different but must be consistent. Leading 0s in dates and times are optional.
Sourced from Pandas: pandas-dev/pandas
- howso.utilities.date_to_epoch(date_obj, time_format)#
Convert date into epoch (i.e., seconds counted from Jan 1st 1970).
Note
If date_obj is None or nan, it will be returned as is.
- Parameters:
date_obj (date | datetime | time | str) – Time object.
time_format (str) – Specify format of the time. Ex: %a %b %d %H:%M:%S %Y
- Returns:
The epoch date as a floating point value or ‘np.nan’, et al.
- Return type:
str | float | None
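For string input, the conversion can be sketched with datetime.strptime and datetime.timestamp. This is an illustration only: the name to_epoch_sketch is hypothetical, and treating timezone-naive values as UTC is an assumption made here to keep the result independent of the machine's time zone.

```python
from datetime import datetime, timezone


def to_epoch_sketch(date_str: str, time_format: str) -> float:
    """Parse a date string and return seconds since Jan 1st 1970 (UTC)."""
    parsed = datetime.strptime(date_str, time_format)
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive datetimes are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


epoch = to_epoch_sketch("1970-01-02 00:00:00", "%Y-%m-%d %H:%M:%S")
print(epoch)  # 86400.0
```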
- howso.utilities.deep_update(base, updates)#
Update dict base with updates from dict updates in a “deep” fashion.
NOTE: This is a recursive function. Care should be taken to ensure that neither of the input dictionaries is self-referencing.
- Parameters:
base – A dictionary
updates – A dictionary of updates
- Returns:
dict
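The difference between a "deep" update and dict.update can be sketched as follows. The function deep_update_sketch is a hypothetical illustration of the idea, not the library's code; it modifies base in place and returns it.

```python
from collections.abc import Mapping


def deep_update_sketch(base: dict, updates: Mapping) -> dict:
    """Recursively merge `updates` into `base`, descending into nested dicts."""
    for key, value in updates.items():
        if isinstance(value, Mapping) and isinstance(base.get(key), Mapping):
            # Both sides are mappings: merge them instead of replacing.
            deep_update_sketch(base[key], value)
        else:
            base[key] = value
    return base


base = {"bounds": {"min": 0, "max": 10}, "type": "continuous"}
updates = {"bounds": {"max": 20, "allow_null": True}, "data_type": "number"}
merged = deep_update_sketch(base, updates)
print(merged["bounds"])  # {'min': 0, 'max': 20, 'allow_null': True}
```

A plain base.update(updates) would have replaced the whole "bounds" dict and dropped "min"; the recursive version keeps it.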
- howso.utilities.deserialize_cases(data, columns, features=None)#
Deserialize case data into a DataFrame.
If feature attributes contain original typing information, columns will be converted to the same data type as original training cases.
- Parameters:
data (Iterable[Iterable[Any] | Mapping[str, Any]]) – The context data.
columns (Iterable[str]) – The case column mapping. The order corresponds to the order of cases in output. columns must be provided for non-DataFrame Iterables.
The order corresponds to how the data will be mapped to columns in the output. Ignored for list of dict where the dict key is the column name.
features (Mapping | None, default: None) – (Optional) The dictionary of feature name to feature attributes.
If not specified, no column typing will be attempted.
- Returns:
The deserialized data.
- Return type:
DataFrame
- howso.utilities.determine_iso_format(str_date, fname)#
Determine which specific ISO8601 format the passed in date is in.
Specifically if it’s just a date, if it’s zoned, and if zoned, whether it’s a zone or an offset.
- Parameters:
str_date (str) – The datetime passed in as a string.
fname (str) – Name of feature to guess bounds for.
- Returns:
The ISO_8601 format string that most matches the passed in date.
- Return type:
str
- howso.utilities.dprint(debug, *argc, **kwargs)#
Print based on debug levels.
- Parameters:
debug – If true, user_debug level would be 1. Possible levels: 1, 2, 3 (print all)
kwargs –
default_priority (int, default 1) – The message is printed only if debug >= default_priority.
Examples
>>> dprint(True, "hello", "howso", priority=1)
`hello howso`
- howso.utilities.epoch_to_date(epoch, time_format, tzinfo=None)#
Convert epoch to date if epoch is not None or nan; otherwise, return it as is.
- Parameters:
epoch (str | float) – The epoch date as a floating point value (or str if np.nan, et al.)
time_format (str) – Specify format of the time. Ex: %a %b %d %H:%M:%S %Y
tzinfo (tzinfo | None, default: None) – Time zone information to include in datetime.
- Returns:
A date string in the format similar to “Wed May 21 00:00:00 2008”
- Return type:
str
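The pass-through behavior for None and NaN, followed by formatting, can be sketched with the standard library. The name from_epoch_sketch is hypothetical, and defaulting to UTC when no tzinfo is given is an assumption of this sketch rather than documented behavior.

```python
import math
from datetime import datetime, timezone


def from_epoch_sketch(epoch, time_format, tzinfo=None):
    """Format an epoch value as a date string; pass None and NaN through."""
    if epoch is None or (isinstance(epoch, float) and math.isnan(epoch)):
        return epoch
    # Assumption for this sketch: use UTC when no time zone is supplied.
    moment = datetime.fromtimestamp(float(epoch), tz=tzinfo or timezone.utc)
    return moment.strftime(time_format)


formatted = from_epoch_sketch(86400.0, "%Y-%m-%d %H:%M:%S")
print(formatted)  # 1970-01-02 00:00:00
missing = from_epoch_sketch(None, "%Y-%m-%d")
not_a_number = from_epoch_sketch(float("nan"), "%Y-%m-%d")
```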
- howso.utilities.format_confusion_matrix(confusion_matrix)#
Converts a Howso dict confusion matrix into the same format as sklearn.metrics.confusion_matrix.
- Parameters:
confusion_matrix (dict[str, dict[str, int]]) – Confusion matrix in dictionary form. This is the standard form of confusion matrices returned when retrieving Howso’s prediction stats through howso.engine.Trainee.react_aggregate().
- Return type:
tuple[ndarray, list[str]]
- Returns:
ndarray – The array of the confusion matrix values.
list of str – List of the confusion matrix row labels. These labels denote the labels of the confusion matrix going top to bottom and left to right.
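The conversion from a nested dict to a square matrix can be sketched in plain Python. Several points are assumptions made for illustration: outer keys are taken to be actual labels and inner keys predicted labels, missing pairs count as 0, labels are sorted, and a list of lists is returned rather than an ndarray. The name confusion_dict_to_rows is hypothetical.

```python
from typing import Dict, List, Tuple


def confusion_dict_to_rows(
    confusion_matrix: Dict[str, Dict[str, int]],
) -> Tuple[List[List[int]], List[str]]:
    """Turn a dict-of-dicts confusion matrix into rows plus ordered labels."""
    # Collect every label that appears as an outer or an inner key.
    labels = sorted(
        set(confusion_matrix)
        | {inner for row in confusion_matrix.values() for inner in row}
    )
    # Assumption: a pair that is absent from the dict has a count of 0.
    rows = [
        [confusion_matrix.get(outer, {}).get(inner, 0) for inner in labels]
        for outer in labels
    ]
    return rows, labels


# Hypothetical counts, for illustration only.
counts = {"cat": {"cat": 5, "dog": 1}, "dog": {"dog": 7}}
rows, labels = confusion_dict_to_rows(counts)
print(labels)  # ['cat', 'dog']
print(rows)    # [[5, 1], [0, 7]]
```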
- howso.utilities.format_dataframe(df, features)#
Format DataFrame columns to original type using feature attributes.
Note
Modifies DataFrame in place.
- Parameters:
df (DataFrame) – The DataFrame to format columns of.
features (Mapping) – The dictionary of feature name to feature attributes.
- Returns:
The formatted data.
- Return type:
DataFrame
- howso.utilities.get_kwargs(kwargs, descriptors, warn_on_extra=False)#
Decompose kwargs into a tuple of return values.
Each tuple corresponds to a descriptor in ‘descriptors’. Optionally issue a warning on any items in kwargs that are not “consumed” by the descriptors.
- Parameters:
kwargs – Mapping of keys and values (kwargs)
descriptors –
An iterable of descriptors for how to handle each item in kwargs. Each descriptor can be a mapping, another iterable, or a single string.
If a mapping, it must at least include the key: ‘key’ but can also optionally include the keys: ‘default’ and ‘test’.
If a non-mapping iterable, the values will be interpreted as ‘key’, ‘default’, ‘test’, in that order. Only the first is absolutely required; the remaining will be evaluated to None if not provided.
If a string is provided, it is used as the ‘key’. ‘default’ and ‘test’ are set to None.
If a ‘key’ is not found in the kwargs, then the ‘default’ value is returned.
If a descriptor contains a ‘test’, it should be a callable that returns a boolean. If False, the ‘default’ value is returned.
If the ‘default’ provided is an instance of an Exception, then, the exception is raised when the ‘key’ is not present, or the ‘test’ fails.
warn_on_extra – If True, will issue warnings about any keys provided in kwargs that were not consumed by the descriptors. Default is False
- Returns:
A tuple of the found values in the same order as the provided descriptor.
- Raises:
May raise any exception given as a 'default' in the descriptors.
Usage#
An example of usage showing various ways to use descriptors:
>>> def my_method(self, required, **kwargs):
>>>     apple, banana, cherry, durian, elderberry = get_kwargs(kwargs, (
>>>         # A simple string is interpreted as the 'key' with a 'default' of
>>>         # `None` and no test. Very common use-case made simple.
>>>         'apple',
>>>
>>>         # Another common use-case. Set value to 5 if not in kwargs.
>>>         # This also shows using a tuple for the descriptor.
>>>         ('banana', 5),
>>>
>>>         # Verbose input including a test using dict
>>>         {'key': 'cherry', 'default': 5, 'test': lambda x: x > 0},
>>>
>>>         # The test, `is_durian`, is defined elsewhere
>>>         ('durian', None, is_durian),
>>>
>>>         # Full example using iterable descriptor rather than mapping.
>>>         ('elderberry', ValueError('"elderberry" must be > 5.'),
>>>          lambda x: x > 5),
>>>     ))
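The descriptor rules listed under Parameters can be sketched as a small resolver. This is a simplified illustration, not the library's get_kwargs: the name get_kwargs_sketch is hypothetical and the warn_on_extra behavior is omitted.

```python
from collections.abc import Mapping


def get_kwargs_sketch(kwargs, descriptors):
    """Resolve each descriptor to a value from kwargs or to its default."""
    values = []
    for descriptor in descriptors:
        if isinstance(descriptor, str):
            key, default, test = descriptor, None, None
        elif isinstance(descriptor, Mapping):
            key = descriptor["key"]
            default = descriptor.get("default")
            test = descriptor.get("test")
        else:
            # Non-mapping iterable: pad with None so 1 to 3 items unpack safely.
            parts = list(descriptor) + [None, None]
            key, default, test = parts[:3]

        if key in kwargs and (test is None or test(kwargs[key])):
            values.append(kwargs[key])
        elif isinstance(default, Exception):
            # An exception instance as the default means "this key is required".
            raise default
        else:
            values.append(default)
    return tuple(values)


apple, banana, cherry = get_kwargs_sketch(
    {"apple": 1, "cherry": -2},
    (
        "apple",
        ("banana", 5),
        {"key": "cherry", "default": 5, "test": lambda x: x > 0},
    ),
)
print(apple, banana, cherry)  # 1 5 5
```

Here cherry falls back to its default because the supplied value fails the test, and banana falls back because the key is absent.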
- howso.utilities.get_matrix_diff(matrix)#
Calculates the absolute value of a matrix for feature pairs.
- Parameters:
matrix (DataFrame) – The matrix in DataFrame format.
- Returns:
Sorted dictionary of absolute differences between the feature value pairs. The values are stored in a dictionary with keys consisting of a tuple of the features.
- Return type:
dict
- howso.utilities.infer_feature_attributes(data, *, tables=None, time_feature_name=None, **kwargs)#
Return a dict-like feature attributes object with useful accessor methods.
The returned object is a subclass of FeatureAttributesBase that is appropriate for the provided data type.
- Parameters:
data (DataFrame | SQLRelationalDatastoreProtocol) – The data source to infer feature attributes from. Must be a supported data type.
tables (Iterable[TableNameProtocol] | None, default: None) – (Optional, required for datastores) An Iterable of table names to infer feature attributes for.
If included, feature attributes will be generated in the form {table_name: {feature_attribute: value}}.
time_feature_name (str | None, default: None) – (Optional, required for time series) The name of the time feature.
features –
(Optional) A partially filled features dict. If partially filled attributes for a feature are passed in, those parameters will be retained as is and the rest of the attributes will be inferred.
- For example:
>>> from pprint import pprint
>>> df.head(2)
...    sepal-length  sepal-width  petal-length  petal-width  target
... 0           6.7          3.0           5.2          2.3       2
... 1           6.0          2.2           5.0          1.5       2
>>> # Partially filled features dict
>>> partial_features = {
...     "sepal-length": {
...         "type": "continuous",
...         'bounds': {
...             'min': 2.72,
...             'max': 3,
...             'allow_null': True
...         },
...     },
...     "sepal-width": {
...         "type": "continuous"
...     }
... }
>>> # Infer rest of the attributes
>>> features = infer_feature_attributes(
...     df, features=partial_features
... )
>>> # Inferred Feature dictionary
>>> pprint(features)
... {
...     'sepal-length': {
...         'bounds': {
...             'allow_null': True, 'max': 3, 'min': 2.72
...         },
...         'type': 'continuous',
...         'sample': 2.86
...     },
...     'sepal-width': {
...         'bounds': {
...             'allow_null': True, 'max': 7.38905609893065,
...             'min': 1.0
...         },
...         'type': 'continuous',
...         'sample': 4.56
...     },
...     'petal-length': {
...         'bounds': {
...             'allow_null': True, 'max': 7.38905609893065,
...             'min': 1.0
...         },
...         'type': 'continuous',
...         'sample': 5.52
...     },
...     'petal-width': {
...         'bounds': {
...             'allow_null': True, 'max': 2.718281828459045,
...             'min': 0.049787068367863944
...         },
...         'type': 'continuous',
...         'sample': 1.33
...     },
...     'target': {
...         'bounds': {'allow_null': True},
...         'type': 'nominal',
...         'sample': 1
...     }
... }
Note that valid ‘data_type’ values for both nominal and continuous types are: ‘string’, ‘number’, ‘json’, ‘amalgam’, and ‘yaml’. The ‘boolean’ data_type is valid only when type is nominal. ‘string_mixable’ is valid only when type is continuous (predicted values may result in interpolated strings containing a combination of characters from multiple original values).
infer_bounds – (Optional) If True, bounds will be inferred for the features if the feature column has at least one non NaN value
datetime_feature_formats –
(Optional) Dict defining a custom (non-ISO8601) datetime format and an optional locale for features with datetimes. By default datetime features are assumed to be in ISO8601 format. Non-English datetimes must have locales specified. If locale is omitted, the default system locale is used. The keys are the feature name, and the values are a tuple of date time format and locale string.
Example:
{ "start_date": ("%Y-%m-%d %A %H.%M.%S", "es_ES"), "end_date": "%Y-%m-%d" }
delta_boundaries –
(Optional) For time series, specify the delta boundaries in the form {“feature” : {“min|max” : {order : value}}}. Works with partial values by specifying only particular order of derivatives you would like to overwrite. Invalid orders will be ignored.
Examples:
{ "stock_value": { "min": { '0' : 0.178, '1': 3.4582e-3, '2': None } } }
derived_orders – (Optional) Dict of features to the number of orders of derivatives that should be derived instead of synthesized. For example, for a feature with a 3rd order of derivative, setting its derived_orders to 2 will synthesize the 3rd order derivative value, and then use that synthesized value to derive the 2nd and 1st order.
include_sample – Set to True to include a sample of each feature’s data in the output.
lags –
(Optional) A list containing the specific indices of the desired lag features to derive for each feature (not including the series time feature). Specifying derived lag features for the feature specified by time_feature_name must be done using a dictionary. A dictionary can be used to specify a list of specific lag indices for specific features. For example: {“feature1”: [1, 3, 5]} would derive three different lag features for feature1. The resulting lag features hold values 1, 3, and 5 timesteps behind the current timestep respectively.
Note
Using the lags parameter will override the num_lags parameter per feature
Note
A lag feature is a feature that provides a “lagging value” to a case by holding the value of a feature from a previous timestep. These lag features allow for cases to hold more temporal information.
num_lags –
(Optional) An integer specifying the number of lag features to derive for each feature (not including the series time feature). Specifying derived lag features for the feature specified by time_feature_name must be done using a dictionary. A dictionary can be used to specify numbers of lags for specific features. Features that are not specified will default to 1 lag feature.
Note
The num_lags parameter will be overridden by the lags parameter per feature.
orders_of_derivatives – (Optional) Dict of features and their corresponding order of derivatives for the specified type (delta/rate). If provided will generate the specified number of derivatives and boundary values. If set to 0, will not generate any delta/rate features. By default all continuous features have an order value of 1.
rate_boundaries –
(Optional) For time series, specify the rate boundaries in the form {“feature” : {“min|max” : {order : value}}}. Works with partial values by specifying only particular order of derivatives you would like to overwrite. Invalid orders will be ignored.
Examples:
{ "stock_value": { "min": { '0' : 0.178, '1': 3.4582e-3, '2': None } } }
tight_bounds – (Optional) Set tight min and max bounds for the features specified in the Iterable.
time_feature_is_universal – If True, the time feature will be treated as universal and future data is excluded while making predictions. If False, the time feature will not be treated as universal and only future data within the same series is excluded while making predictions. It is recommended to set this value to True if there is any possibility of global relevancy of time, which is the default behavior.
time_series_type_default – (Optional) Type specifying how time series is generated. One of ‘rate’ or ‘delta’, default is ‘rate’. If ‘rate’, it uses the difference of the current value from its previous value divided by the change in time since the previous value. When ‘delta’ is specified, just uses the difference of the current value from its previous value regardless of the elapsed time.
time_series_types_override – (Optional) Dict of features and their corresponding time series type, one of ‘rate’ or ‘delta’, used to override time_series_type_default for the specified features.
mode_bound_features – (Optional) Explicit list of feature names to use mode bounds for when inferring loose bounds. If None, assumes all features. A mode bound is used instead of a loose bound when the mode for the feature is the same as an original bound, as it may represent an application-specific min/max.
id_feature_name – (Optional) The name(s) of the ID feature(s).
time_invariant_features – (Optional) Names of time-invariant features.
attempt_infer_extended_nominals –
(Optional) If set to True, detections of extended nominals will be attempted. If the detection fails, the categorical variables will be set to int-id subtype.
Note
Please refer to kwargs for other parameters related to extended nominals.
nominal_substitution_config – (Optional) Configuration of the nominal substitution engine and the nominal generators and detectors.
include_extended_nominal_probabilities – (Optional) If true, extended nominal probabilities will be appended as metadata into the feature object.
datetime_feature_formats –
(optional) Dict defining a custom (non-ISO8601) datetime format and an optional locale for columns with datetimes. By default datetime columns are assumed to be in ISO8601 format. Non-English datetimes must have locales specified. If locale is omitted, the default system locale is used. The keys are the column name, and the values are a tuple of date time format and locale string:
Example:
{ "start_date" : ("%Y-%m-%d %A %H.%M.%S", "es_ES"), "end_date" : "%Y-%m-%d" }
ordinal_feature_values –
(optional) Dict for ordinal string features defining an ordered list of string values for each feature, ordered low to high. If specified will set ‘type’ to be ‘ordinal’ for all features in this map.
Example:
{ "grade" : [ "F", "D", "C", "B", "A" ], "size" : [ "small", "medium", "large", "huge" ] }
dependent_features –
Dict mapping a feature to a list of other feature(s) that it depends on or that are dependent on it. This restricts the cases that can be selected as neighbors (such as in react()) to ones that satisfy the dependency, if possible. If this is not possible, either due to insufficient data which satisfy the dependency or because dependencies are probabilistic, the dependency may not be maintained. Be aware that dependencies introduce further constraints to data and so several dependencies or dependencies on already constrained datasets may restrict which operations are possible while maintaining the dependency. As a rule of thumb, sets of features that have dependency relationships should generally not include more than 1 continuous feature, unless the continuous features have a small number of values that are commonly used.
- Examples:
If there’s a feature name ‘measurement’ that contains measurements such as BMI, heart rate and weight, while the feature ‘measurement_amount’ contains the numerical values corresponding to the measurement, dependent features could be passed in as follows:
{ "measurement": [ "measurement_amount" ] }
Since dependence directionality is not important, this will also work:
{ "measurement_amount": [ "measurement" ] }
include_sample – If True, include a sample field containing a sample of the data from each feature in the output feature attributes dictionary.
max_workers –
If unset or set to None (recommended), let the ProcessPoolExecutor choose the best maximum number of process pool workers to process columns in a multi-process fashion. In this case, if the product of the data’s rows and columns < 25,000,000, multiprocessing will not be used.
If defined with an integer > 0, manually set the number of max workers. Otherwise, the feature attributes will be calculated serially. Setting this parameter to zero (0) will disable multiprocessing.
- Returns:
A subclass of FeatureAttributesBase (Single/MultiTableFeatureAttributes) that extends dict, thus providing dict-like access to feature attributes and useful accessor methods.
- Return type:
Examples
# 'data' is a DataFrame
>> attrs = infer_feature_attributes(data)
# Can access feature attributes like a dict
>> attrs
{
    "feature_one": {
        "type": "continuous",
        "bounds": {"allow_null": True},
    },
    "feature_two": {
        "type": "nominal",
    }
}
>> attrs["feature_one"]
{
    "type": "continuous",
    "bounds": {"allow_null": True}
}
# Or can call methods to do other stuff
>> attrs.get_parameters()
{'type': "continuous"}
# Now 'data' is an object that implements SQLRelationalDatastoreProtocol
>> attrs = infer_feature_attributes(data, tables=tables)
>> attrs
{
    "table_1": {
        "feature_one": {
            "type": "continuous",
            "bounds": {"allow_null": True},
        },
        "feature_two": {
            "type": "nominal",
        }
    },
    "table_2" : {...},
}
>> attrs.to_json()
'{"table_1" : {...}}'
- howso.utilities.is_valid_uuid(value, version=4)#
Check if a given string is a valid uuid.
- Parameters:
value – The value to test
version – The uuid version (Default: 4)
- Returns:
True if value is a valid uuid string
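As a rough illustration of what such a check involves, the sketch below uses only the standard library's uuid module. The name is_valid_uuid_sketch is hypothetical, and this is not necessarily how howso.utilities.is_valid_uuid is implemented; it only shows one way to test both the format and the version of a UUID string.

```python
import uuid


def is_valid_uuid_sketch(value, version=4):
    """Return True if `value` is a UUID string of the given version.

    Hypothetical stand-in for illustration; not the library's code.
    """
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Malformed strings raise ValueError; non-string inputs raise
        # AttributeError or TypeError inside uuid.UUID.
        return False
    # `version` is None for UUIDs that are not RFC 4122 variants.
    return parsed.version == version


print(is_valid_uuid_sketch("c9bf9e57-1685-4c89-bafb-ff5af830be8a"))
print(is_valid_uuid_sketch("not-a-uuid"))
```

Parsing first and comparing the version afterwards avoids the coercion that uuid.UUID(value, version=4) would apply to the version bits.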
- howso.utilities.matrix_processing(matrix, normalize=False, normalize_method='fractional', ignore_diagonals_normalize=True, absolute=False, fill_diagonal=False, fill_diagonal_value=1)#
Preprocess a matrix including options to normalize, take the absolute value, and fill in the diagonals.
The order of operations for this method is: first it normalizes, then takes the absolute value, and lastly fills in the diagonals. This method automatically sorts the matrix indexes.
- Parameters:
matrix (pd.DataFrame) – Matrix in Dataframe form.
normalize (bool, default: False) – Whether to normalize the matrix row-wise. The normalization method is set by the normalize_method parameter.
normalize_method (Iterable[NormalizeMethod | Callable] | NormalizeMethod | Callable, default: 'fractional') –
The normalization method. The method may be either one of the strings below, each corresponding to a default method, or a custom Callable.
These methods may be passed in as an individual string or in an iterable, in which case they will be processed sequentially.
Default Methods:
- ‘relative’: normalizes each row by dividing each value by the maximum absolute value in the row.
- ‘fractional’: normalizes each row by dividing each value by the sum of the values in the row, so the relative values sum to 1.
- ‘fractional_absolute’: normalizes each row by dividing each value by the sum of absolute values in the row.
Custom Callable:
- If a custom Callable is provided, it will be passed on to the DataFrame apply function: matrix.apply(Callable)
ignore_diagonals_normalize (bool, default: True) – Whether to ignore the diagonals when normalizing the matrix. Useful for matrices where the diagonals are a constant value, such as correlation matrices.
absolute (bool, default: False) – Whether to transform the matrix values into their absolute values.
fill_diagonal (bool, default: False) – Whether to fill in the diagonals of the matrix. If set to True, the diagonal values will be filled in with fill_diagonal_value.
fill_diagonal_value (float | int, default: 1) – The value to fill in the diagonals with. fill_diagonal must be set to True in order for the diagonal values to be filled in. If fill_diagonal is False, this parameter is ignored.
- Returns:
Dataframe of the result.
- Return type:
pd.DataFrame
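The arithmetic behind the three default normalization methods can be sketched on a single row. The helper below is hypothetical and works on a plain Python list rather than a DataFrame; it leaves out the diagonal handling, index sorting, and custom Callable support described above, and makes no claim about how the library guards against zero denominators.

```python
def normalize_row(row, method="fractional"):
    """Normalize one row of numbers (illustrative helper, not library code)."""
    if method == "relative":
        # Divide by the largest absolute value in the row.
        denominator = max(abs(value) for value in row)
    elif method == "fractional":
        # Divide by the signed sum, so the results sum to 1.
        denominator = sum(row)
    elif method == "fractional_absolute":
        # Divide by the sum of absolute values.
        denominator = sum(abs(value) for value in row)
    else:
        raise ValueError(f"Unknown method: {method!r}")
    # A zero denominator raises ZeroDivisionError in this sketch.
    return [value / denominator for value in row]


row = [1.0, -2.0, 5.0]
print(normalize_row(row, "relative"))             # [0.2, -0.4, 1.0]
print(normalize_row(row, "fractional"))           # [0.25, -0.5, 1.25]
print(normalize_row(row, "fractional_absolute"))  # [0.125, -0.25, 0.625]
```

Note that 'fractional' can produce values outside [-1, 1] when a row mixes signs, which is one reason to prefer 'fractional_absolute' for such rows.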
- howso.utilities.num_list_dimensions(obj)#
Return number of dimensions for a list.
Assumption is that the input nested lists are also lists, or a list of DataFrames.
- Parameters:
obj (list) – The nested list of objects.
- Returns:
The number of dimensions in the passed in list.
- Return type:
int
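One way to count the dimensions of a nested list is to follow the first element at each level. The sketch below is a hypothetical helper under that assumption; it handles plain lists only (no DataFrames) and does not check that sibling elements have the same depth.

```python
def count_list_dimensions(obj):
    """Count nesting depth by descending through first elements.

    Illustrative only; assumes every level is a list or a leaf value.
    """
    depth = 0
    while isinstance(obj, list):
        depth += 1
        if not obj:
            # An empty list still counts as one dimension.
            break
        obj = obj[0]
    return depth


print(count_list_dimensions([1, 2, 3]))         # 1
print(count_list_dimensions([[1, 2], [3, 4]]))  # 2
```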
- howso.utilities.replace_doublemax_with_infinity(dat)#
Replace values of Double.MAX_VALUE (1.79769313486232E+308) with Infinity.
For use when retrieving data from Howso.
- Parameters:
dat (Any) – The data in which to replace Double.MAX_VALUE values with infinity.
- Returns:
The same value back, with float max values converted to infinity.
- Return type:
Any
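The idea can be sketched as a recursive walk over nested lists and dicts. The function name below is hypothetical, and two details are assumptions of this sketch rather than documented behavior: it compares against sys.float_info.max, and it maps the negative extreme to negative infinity.

```python
import math
import sys


def replace_float_max_with_infinity(dat):
    """Return `dat` with float-max values swapped for infinity.

    Illustrative sketch; the library's traversal rules may differ.
    """
    if isinstance(dat, float):
        if dat == sys.float_info.max:
            return math.inf
        if dat == -sys.float_info.max:
            # Assumption: treat the negative extreme symmetrically.
            return -math.inf
        return dat
    if isinstance(dat, dict):
        return {key: replace_float_max_with_infinity(val) for key, val in dat.items()}
    if isinstance(dat, list):
        return [replace_float_max_with_infinity(item) for item in dat]
    # Any other type is passed through unchanged.
    return dat


print(replace_float_max_with_infinity({"bounds": [0.0, sys.float_info.max]}))
```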
- howso.utilities.replace_nan_with_none(dat)#
Replace NaN values with None values.
For use when feeding data to Howso from the scikit module to account for the different ways howso and sklearn represent missing values.
- Parameters:
dat – A 2d list of values.
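Going by the function name and its stated purpose (feeding sklearn-style data to Howso), the conversion runs from NaN to None. A minimal sketch over a 2d list, using a hypothetical helper name, might look like this:

```python
import math


def nan_to_none_2d(rows):
    """Return a copy of a 2d list with float NaN cells replaced by None.

    Illustrative only; not the library's implementation.
    """
    return [
        [None if isinstance(cell, float) and math.isnan(cell) else cell for cell in row]
        for row in rows
    ]


print(nan_to_none_2d([[1.0, float("nan")], ["a", 3]]))  # [[1.0, None], ['a', 3]]
```

The isinstance guard matters because math.isnan raises TypeError on non-numeric cells such as strings.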
- howso.utilities.replace_none_with_nan(dat)#
Replace None values with NaN values.
For use when retrieving data from Howso via the scikit module to conform to sklearn convention on missing values.
- Parameters:
dat (Mapping)
- Return type:
list[dict]
- howso.utilities.reshape_data(x, y)#
Reshapes X as a matrix and y as a vector.
- Parameters:
x (ndarray) – Feature values ndarray.
y (ndarray) – Target values ndarray.
- Return type:
tuple[ndarray, ndarray]
- Returns:
np.ndarray – X
np.ndarray – y
- howso.utilities.seconds_to_time(seconds, *, tzinfo=None)#
Convert seconds to a time object.
- Parameters:
seconds (int | float | None) – The seconds to convert to time.
tzinfo (tzinfo | None, default: None) – Time zone to use for resulting time object.
- Returns:
The time object.
- Return type:
time | None
- howso.utilities.serialize_cases(data, columns, features, *, warn=False)#
Serialize case data into list of lists.
- Parameters:
data (DataFrame | ndarray | Iterable[Any] | None) – The data to serialize, typically in a Pandas DataFrame, Numpy ndarray or Python Iterable such as a list.
columns (Iterable[str] | None) – The case column mapping. The order corresponds to the order of cases in output. columns must be provided for non-DataFrame Iterables.
features (Mapping) – The dictionary of feature name to feature attributes.
warn (bool, default: False) – If warnings should be raised by serializer.
- Return type:
list[list[Any]] | None
- Returns:
list of list or Any or None – The serialized data from DataFrame.
…
- Raises:
HowsoError – An np.ndarray or Iterable is provided and columns was left undefined, or the given columns do not match the columns defined within a given pd.DataFrame.
ValueError – The provided pd.DataFrame contains non-unique columns, or an unexpected datatype was received (should be either pd.DataFrame, np.ndarray or a non-str Python Iterable).
- howso.utilities.serialize_datetimes(cases, columns, features, *, warn=False)#
Serialize datetimes in the given list of cases, in-place.
Iterate over the passed in case values and serializes any datetime values according to the specified datetime format in feature attributes.
- Parameters:
cases (list[list]) – A 2d list of case values corresponding to the features of the cases.
columns (Iterable[str]) – A list of feature names.
features (Mapping) – Dictionary of feature attributes.
warn (bool, default: False) – If set to True, will warn the user when the specified datetime format doesn’t match the datetime strings.
- Return type:
None
- howso.utilities.time_to_seconds(time)#
Convert a time object to seconds since midnight.
- Parameters:
time (time | None) – The time to convert.
- Returns:
Seconds since midnight.
- Return type:
float | None
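The relationship between time_to_seconds and seconds_to_time can be sketched with the standard datetime module. Both helper names below are hypothetical. Treating the input of the reverse conversion as seconds since midnight and wrapping values past 24 hours are assumptions of this sketch; NaN inputs are not handled.

```python
from datetime import time

MICROS_PER_DAY = 86_400 * 1_000_000


def time_to_seconds_sketch(value):
    """Seconds since midnight for a datetime.time, or None."""
    if value is None:
        return None
    return (
        value.hour * 3600
        + value.minute * 60
        + value.second
        + value.microsecond / 1_000_000
    )


def seconds_to_time_sketch(seconds, *, tzinfo=None):
    """Build a datetime.time from seconds since midnight, or None."""
    if seconds is None:
        return None
    # Work in whole microseconds; wrap into one day (assumption).
    total_micros = round(seconds * 1_000_000) % MICROS_PER_DAY
    hours, remainder = divmod(total_micros, 3_600_000_000)
    minutes, remainder = divmod(remainder, 60_000_000)
    secs, micros = divmod(remainder, 1_000_000)
    return time(hours, minutes, secs, micros, tzinfo=tzinfo)


print(time_to_seconds_sketch(time(1, 2, 3)))  # 3723.0
print(seconds_to_time_sketch(3723.5))         # 01:02:03.500000
```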
- howso.utilities.validate_case_indices(case_indices, thorough=False)#
Validate the case_indices parameter to the react() method of a Howso client.
Raises a ValueError if case_indices has sequences that do not contain the expected data types of (str, int).
- Parameters:
case_indices (Sequence[Sequence[str | int]]) – The case_indices argument to validate.
thorough – Whether to verify the data types in all sequences or only some (for performance).
- Raises:
ValueError – If case_indices has sequences that do not contain the expected data types of str | int.
- Return type:
None
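The trade-off behind the thorough flag can be sketched as follows. The helper name and the sample size of 10 are hypothetical; the source does not say how many sequences are checked when thorough is False, nor exactly which checks are applied.

```python
from collections.abc import Sequence

SAMPLE_SIZE = 10  # Hypothetical; the real sampling rule is not documented here.


def check_case_indices(case_indices, thorough=False):
    """Raise ValueError if an inspected entry holds anything but str or int."""
    # Inspect everything when thorough, else only a leading sample.
    to_check = case_indices if thorough else case_indices[:SAMPLE_SIZE]
    for entry in to_check:
        is_sequence = isinstance(entry, Sequence) and not isinstance(entry, str)
        if not is_sequence or not all(isinstance(item, (str, int)) for item in entry):
            raise ValueError(
                "case_indices must be a sequence of sequences of str | int"
            )


check_case_indices([("session-a", 0), ("session-b", 3)])  # passes silently
```

With thorough=False, a malformed entry beyond the sampled prefix goes unnoticed in this sketch, which is the performance trade-off the parameter description refers to.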
- howso.utilities.validate_datetime_iso8061(datetime_value, feature)#
Check that the passed in datetime value adheres to the ISO 8601 format.
Warn the user if it doesn’t check out.
- Parameters:
datetime_value – The date value as a string
feature – Name of feature
- howso.utilities.validate_features(features, extended_feature_types=None)#
Validate the feature types in features.
- Parameters:
features (Mapping[str, Mapping]) –
The dict of feature name to feature attributes.
The valid feature types are:
"nominal"
"continuous"
"ordinal"
along with any passed-in extended_feature_types
extended_feature_types (Iterable[str] | None, default: None) – (Optional) If a list is passed in, the feature types specified in the list will also be considered valid.
- Return type:
None
- howso.utilities.validate_list_shape(values, dimensions, variable_name, var_types, allow_none=True)#
Validate the shape of a list.
- Parameters:
values (Collection | None) – A single or multidimensional list.
dimensions (int) – The number of dimensions the list should be.
variable_name (str) – The variable name for output.
var_types (str) – The expected type of the data.
allow_none (bool, default: True) – If None should be allowed.
- Raises:
ValueError – If variable_name is None.
- Return type:
None