Basic Workflow#
Objectives: what you will take away#
Definitions of terminology unique to Howso Engine and an understanding of its basic workflow.
How to import data, map features, train, analyze, and make inferences using the Howso Engine.
Prerequisites: before you begin#
Installation
You’ve successfully installed Howso Engine.
You’ve installed the supporting libraries used in this recipe: pandas and pmlb.
Data#
Our example dataset for this recipe is the well-known Adult dataset. It is accessible via the pmlb package installed earlier. We use the fetch_data() function to retrieve the dataset in Step 1 below.
Concepts & Terminology#
Howso Engine is a generalized Machine Learning (ML) and Artificial Intelligence platform that creates powerful decision-making models that are fully explainable, auditable, and editable. Howso Engine uses Instance-Based Machine Learning which stores instances, i.e., data points, in memory and makes predictions about new instances given their relationship to existing instances. This technology harnesses a fast spatial query system and information theory for performance and accuracy.
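To build intuition for instance-based learning, the sketch below implements a minimal nearest-neighbor predictor in plain Python: stored instances are ranked by distance to a query point and the closest ones vote on the prediction. This is illustrative only and is not Howso Engine's actual algorithm, which combines a spatial query system with information theory.

```python
# Illustrative sketch of instance-based prediction (NOT Howso Engine's
# actual algorithm): stored instances are queried by distance, and the
# nearest neighbors vote on the prediction.
import math
from collections import Counter

def predict(instances, query, k=3):
    """Predict a label for `query` from the k nearest stored instances."""
    # Rank stored instances by Euclidean distance to the query point.
    ranked = sorted(
        instances,
        key=lambda inst: math.dist(inst["point"], query),
    )
    # The k closest instances vote; the majority label wins.
    votes = Counter(inst["label"] for inst in ranked[:k])
    return votes.most_common(1)[0][0]

instances = [
    {"point": (1.0, 1.0), "label": "<=50K"},
    {"point": (1.2, 0.9), "label": "<=50K"},
    {"point": (5.0, 5.1), "label": ">50K"},
    {"point": (4.8, 5.3), "label": ">50K"},
    {"point": (5.2, 4.9), "label": ">50K"},
]

print(predict(instances, (5.0, 5.0)))  # → >50K
```

Because the instances themselves are retained in memory, predictions can always be traced back to the specific data points that produced them, which is what makes this family of techniques naturally explainable and editable.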
Notebook Recipe#
The following recipe demonstrates the capabilities covered in this guide, along with a few additional ones.
How-To Guide#
Here we will walk through the steps of a basic workflow using Howso Engine. First, we will load data into a pandas DataFrame for use with Howso Engine. Then, we will use the Howso Engine to map attributes of the features, train a Trainee, analyze, and react.
[1]:
import pandas as pd
from pmlb import fetch_data
from howso.engine import Trainee
from howso.utilities import infer_feature_attributes
Step 1 - Load Data and Infer Feature Attributes#
First, we load the adult
dataset from the PMLB repository. This dataset consists of 15 features, which will have their
attributes inferred by infer_feature_attributes()
. This will determine attributes about each feature
including bounds, allowed values, and feature type. Before the following steps, the inferred feature attributes should be
inspected to ensure their correctness.
[2]:
df = fetch_data('adult').sample(1_000)
features = infer_feature_attributes(df)
features.to_dataframe()
[2]:
feature | type | decimal_places | bounds: min | bounds: max | bounds: allow_null | bounds: observed_min | bounds: observed_max | data_type | original_type: data_type | original_type: size
---|---|---|---|---|---|---|---|---|---|---
age | continuous | 0 | 0.0 | 127.0 | True | 17.0 | 84.0 | number | numeric | 8
workclass | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
fnlwgt | continuous | 0 | 0.0 | 1174774.0 | True | 20057.0 | 720428.0 | number | numeric | 8
education | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
education-num | continuous | 0 | 0.0 | 25.0 | True | 2.0 | 16.0 | number | numeric | 8
marital-status | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
occupation | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
relationship | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
race | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
sex | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
capital-gain | continuous | 0 | 0.0 | 164870.0 | True | 0.0 | 99999.0 | number | numeric | 8
capital-loss | continuous | 0 | 0.0 | 7182.0 | True | 0.0 | 4356.0 | number | numeric | 8
hours-per-week | continuous | 0 | 0.0 | 163.0 | True | 1.0 | 99.0 | number | numeric | 8
native-country | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
target | nominal | 0 | NaN | NaN | False | NaN | NaN | number | integer | 8
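As noted above, the inferred attributes should be inspected before training, and corrected where domain knowledge says otherwise. The sketch below uses a simplified plain-dict stand-in for the object returned by infer_feature_attributes() (which is keyed by feature name in the same way) so that it runs standalone; the exact override shown is a hypothetical example.

```python
# Simplified stand-in for inferred feature attributes. A plain dict is
# used here so the sketch runs standalone; infer_feature_attributes()
# returns a dict-like object keyed by feature name in the same way.
features = {
    "age": {
        "type": "continuous",
        "bounds": {"min": 0.0, "max": 127.0, "allow_null": True},
    },
    "workclass": {"type": "nominal"},
}

# Inspect what was inferred for a feature before training.
print(features["age"]["bounds"])

# Hypothetical domain-knowledge override: if ages in this dataset should
# not exceed 100, tighten the inferred upper bound before passing the
# attributes to the Trainee.
features["age"]["bounds"]["max"] = 100.0
print(features["age"]["bounds"]["max"])  # → 100.0
```

Catching an implausible bound or a mistyped feature at this stage is much cheaper than diagnosing odd predictions after training.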
Step 2 - Create the Trainee, Train, and Analyze#
A Trainee is similar in function to a model in other machine learning paradigms, but is not locked to any particular use case or to predicting a particular feature. Note that both the data and the feature attributes are supplied at this time. Since the feature attributes are essentially a part of the training data, it is extremely important to ensure they are correct.
[3]:
trainee = Trainee(features=features)
trainee.train(df)
After the data are trained, we can call Trainee.analyze()
on the Trainee. This method determines the best
hyperparameters for the data and caches some important values that are used to ensure the highest model performance. By default,
Trainee.analyze()
will optimize the Trainee’s parameters for any possible target feature.
[4]:
trainee.analyze()
Step 3 - React#
Now that the Trainee has been prepared, it is ready for use. A common use-case, determining how well a model performs when predicting the dataset, can be done with a single call to the Trainee:
[5]:
prediction_stats = trainee.get_prediction_stats(
action_feature="target",
details={"prediction_stats": True},
)
prediction_stats
[5]:
metric | age | capital-loss | education-num | capital-gain | fnlwgt | hours-per-week | workclass | marital-status | target | native-country | occupation | relationship | race | education | sex
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
smape | 21.721781 | 106.754876 | 2.577898 | 120.616591 | 44.343579 | 22.272240 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
recall | NaN | NaN | NaN | NaN | NaN | NaN | 0.320286 | 0.381761 | 0.735564 | 0.052632 | 0.235500 | 0.507159 | 0.243019 | 1.000000 | 0.830992 |
rmse | 10.909193 | 435.073111 | 0.745392 | 6674.152734 | 106935.008766 | 12.205738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
missing_value_accuracy | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
spearman_coeff | 0.574761 | 0.414594 | 0.959636 | 0.536497 | 0.004877 | 0.460353 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
accuracy | NaN | NaN | NaN | NaN | NaN | NaN | 0.753754 | 0.832000 | 0.823000 | 0.908081 | 0.300300 | 0.744000 | 0.842000 | 1.000000 | 0.841000 |
adjusted_smape | 21.426289 | 105.986343 | 2.393858 | 112.689293 | 44.343441 | 21.940517 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
r2 | 0.357644 | -0.046150 | 0.915815 | 0.024432 | -0.038919 | 0.117681 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mae | 8.470486 | 169.505832 | 0.190363 | 1560.825093 | 80861.717265 | 8.177813 | 0.367540 | 0.217501 | 0.230220 | 0.155897 | 0.763507 | 0.308157 | 0.236043 | 0.000013 | 0.210756 |
mcc | NaN | NaN | NaN | NaN | NaN | NaN | 0.424543 | 0.742383 | 0.496273 | 0.000000 | 0.224628 | 0.638777 | 0.117551 | 1.000000 | 0.649872 |
precision | NaN | NaN | NaN | NaN | NaN | NaN | 0.353950 | 0.409072 | 0.761380 | 0.047794 | 0.237747 | 0.583463 | 0.319231 | 1.000000 | 0.818991 |
An action_feature is the same as a target feature or dependent variable. This call will compute a number of different statistics,
including accuracy, precision, recall, \(R^2\), and others. Rather than performing a train-test split, which is common with
other machine learning techniques, the Trainee uses leave-one-out to provide a more comprehensive understanding of the data.
More traditional approaches can still be used with the Trainee.react()
method.
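To illustrate the leave-one-out idea in isolation, the plain-Python sketch below (illustrative only, not Howso Engine's implementation) holds each case out in turn and predicts it from all the remaining cases, so every training case contributes to the evaluation rather than only a held-out test split.

```python
# Plain-Python sketch of leave-one-out evaluation (illustrative only;
# not Howso Engine's implementation). Each case is held out in turn and
# predicted from the remaining cases via a 1-nearest-neighbor rule.
import math

data = [
    ((1.0, 1.0), "<=50K"),
    ((1.1, 0.9), "<=50K"),
    ((5.0, 5.0), ">50K"),
    ((5.2, 4.8), ">50K"),
]

def nearest_label(train, point):
    # Predict using the single closest remaining instance.
    return min(train, key=lambda row: math.dist(row[0], point))[1]

correct = 0
for i, (point, label) in enumerate(data):
    held_out = data[:i] + data[i + 1:]  # leave one case out
    correct += nearest_label(held_out, point) == label

accuracy = correct / len(data)
print(accuracy)  # → 1.0
```

With n cases this produces n predictions instead of the single test-split evaluation, which is why the resulting statistics give a more comprehensive picture of how the data behave.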