Predictions#

Objectives: what you will take away#

  • Definitions & an understanding of basic regression, classification, continuous vs categorical/nominal Action Features, Trainee, react(), react_aggregate().

  • How to perform a basic regression or classification analysis using the Howso Engine to predict the Highway MPG or Fuel Type based on vehicle Context Features.

Prerequisites: before you begin#

Data#

Download the dataset of 23,606 vehicles from 1984 to 2022, including make, model, MPG, drive type, size, class and fuel type.

Concepts & Terminology#

Regression - describes the relationship between one or more Context Features and a continuous numeric Action Feature, as in this guide's prediction of the Highway MPG of a vehicle based on its physical characteristics and year manufactured.

Classification - describes the relationship between one or more Context Features and a categorical/nominal Action Feature, as in this guide's prediction of the Fuel Type of a vehicle based on its physical characteristics and year manufactured. For Howso Engine, the action feature may be left in string format and does not need to be converted to numeric format.

Trainee and React - In this simple example, we will create a Trainee that can React to new case data, such as a new car we might be looking to build.

Train and Analyze - To create a Trainee, we will first load the data, define the Feature Attributes of the data, and Train the Trainee. The Trainee can be used for many tasks, but because we know exactly what we want to do, we will Analyze to improve the performance of our Trainee for the specific set of Context Features we want to use to predict an Action Feature. The action feature in this example will be Highway MPG.

Evaluating the Trainee - To understand the accuracy of the Trainee for our tasks, we can use the built-in Trainee.react_aggregate() method. Since we are not using a train-test split in this example, react_aggregate() performs a react() on each of the cases trained into the model using a leave-one-out approach: each case is held out in turn and predicted from the remaining cases.
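The leave-one-out idea can be sketched in plain Python. This is not how Howso implements react_aggregate(); the nearest-neighbor predictor is a deliberately simple stand-in, and the function names and toy data are invented for illustration.

```python
def leave_one_out_predictions(cases, predict):
    """Predict each case's target using every case except itself."""
    predictions = []
    for i, (context, _target) in enumerate(cases):
        remaining = cases[:i] + cases[i + 1:]  # hold out case i
        predictions.append(predict(remaining, context))
    return predictions


def nearest_neighbor(train_cases, context):
    """Hypothetical stand-in predictor: target of the closest training case."""
    closest = min(train_cases, key=lambda case: abs(case[0] - context))
    return closest[1]


# Toy (CityMPG, HighwayMPG) pairs, invented for illustration only.
cases = [(15, 21), (18, 25), (22, 30), (30, 38)]
loo_predictions = leave_one_out_predictions(cases, nearest_neighbor)
print(loo_predictions)
```

Comparing these held-out predictions with the actual targets is what produces the accuracy statistics, without needing a separate test set.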

The prediction stats returned by react_aggregate() include regression accuracy statistics such as:

  • R-Squared - \(R^2\) represents how well the predictions fit the data; the closer to 1.0, the better the fit.

  • Mean Absolute Error (MAE) - the average absolute difference between actual and predicted values over the whole dataset, expressed in the units of the feature being measured.

  • Root Mean Square Error (RMSE) - the square root of the mean of the squared errors over the whole dataset; like MAE it is expressed in the units of the feature being measured, but it penalizes large errors more heavily.
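As a rough illustration of these three statistics, the sketch below computes them from their standard textbook definitions in plain Python. The helper name and the toy values are made up for this example, and Howso's own computation may differ in detail.

```python
import math


def regression_metrics(actual, predicted):
    """Return R^2, MAE and RMSE using the usual textbook formulas."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sum_squared_errors = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    total_sum_squares = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1.0 - sum_squared_errors / total_sum_squares,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum_squared_errors / n),
    }


# Toy Highway MPG values, invented for illustration only.
actual_mpg = [20.0, 25.0, 30.0, 35.0]
predicted_mpg = [21.0, 24.0, 32.0, 35.0]
regression_stats = regression_metrics(actual_mpg, predicted_mpg)
print(regression_stats)
```

Here the single 2 MPG miss pulls RMSE (about 1.22) above MAE (1.0), which shows how RMSE weights larger errors more.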

Or classification metrics including those derived from the true positive (TP), true negative (TN), false positive (FP), false negative (FN) metrics:

  • Accuracy - Describes the model performance across all classes as the ratio of correct predictions to the total number of predictions. - (TP+TN)/(TP+FP+FN+TN).

  • Precision - Describes what proportion of positive predictions were correct. - TP/(TP+FP).

  • Recall - Describes what proportion of actual positives were predicted correctly. - TP/(TP+FN).

  • Mean Absolute Error (MAE) - The average absolute error between actual and predicted Categorical Action Probabilities (CAP) over the whole dataset. CAP is the prediction probability for each class of the action feature.
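To make the confusion-matrix metrics concrete, here is a minimal sketch that computes accuracy, precision and recall from raw TP, TN, FP and FN counts for a single positive class. The helper name and the counts are hypothetical; for a multi-class feature such as Fuel Type these values would be computed per class and then combined.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall for one positive class."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions / all predictions
        "precision": tp / (tp + fp),     # correct positives / predicted positives
        "recall": tp / (tp + fn),        # correct positives / actual positives
    }


# Toy counts, invented for illustration only.
classification_stats = classification_metrics(tp=40, tn=45, fp=10, fn=5)
print(classification_stats)
```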

React to New Cases - Lastly, we will simply request the Trainee to react() to new cases we present to it, giving us predictions of what the Highway MPG would be.

How-To Guide#

We want to predict the Highway MPG and the Fuel Type of a new vehicle based on a Trainee we create from the vehicles dataset. In this guide, we will directly show the code for Highway MPG prediction while including the code for Fuel Type as comments wherever the code differs.

Step 1 - Load Libraries#

import pandas as pd
import matplotlib.pyplot as plt

from howso.engine import Trainee
from howso.utilities import infer_feature_attributes

Step 2 - Load Data#

Using a pandas DataFrame, load the vehicles dataset from the CSV file. We drop the Make and Model features because predicting from them would be close to cheating. Then take a quick look at some of the data and use describe() to make sure it has the shape you'd expect.

df = pd.read_csv("./data/vehicle_predict.csv")
df = df.drop(['Make', 'Model'], axis=1)
df.describe()

Step 3 - Define Features#

Howso can auto-detect feature attributes from the data using infer_feature_attributes(), but it is best practice to review and adjust the result. In this tutorial, we proceed as if the features were not detected the way we want, so we make the necessary adjustments.

Note

Howso automatically determines whether to perform a regression or classification task from the feature attributes of the action feature you are trying to predict, specifically the feature type as shown below. It is therefore very important to make sure that the feature types are correct.

# Auto detect features
features = infer_feature_attributes(df)

# For Regression, we will set `HighwayMPG` feature type to continuous
features['HighwayMPG']['type'] = 'continuous'

# For Classification, we will set `FuelType` feature type to nominal
features['FuelType']['type'] = 'nominal'

# We will also set these context features to continuous
features['CityMPG']['type'] = 'continuous'
features['Year']['type'] = 'continuous'
features['PassengerVolume']['type'] = 'continuous'
features['LuggageVolume']['type'] = 'continuous'
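Because the feature type decides the task, a quick check before training can catch a mis-typed action feature. The sketch below uses a plain dictionary as a stand-in for the inferred feature attributes; the helper task_for_action_feature is hypothetical, is not part of Howso, and only covers the two types used in this guide.

```python
def task_for_action_feature(features, action_feature):
    """Map the action feature's type to the task this guide expects."""
    feature_type = features[action_feature]["type"]
    if feature_type == "continuous":
        return "regression"
    if feature_type == "nominal":
        return "classification"
    raise ValueError(
        f"This sketch does not handle type {feature_type!r} "
        f"for feature {action_feature!r}"
    )


# Plain-dict stand-in for the feature attributes configured above.
example_features = {
    "HighwayMPG": {"type": "continuous"},
    "FuelType": {"type": "nominal"},
}

print(task_for_action_feature(example_features, "HighwayMPG"))
print(task_for_action_feature(example_features, "FuelType"))
```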

Step 4 - Create a Trainee and Train#

Next we will create a Trainee and train() it on the data we loaded into the DataFrame from vehicle_predict.csv.

# Create a new Trainee, specify features
trainee = Trainee(features=features)

# Train trainee
trainee.train(df)

Step 5 - Analyze Trainee, Set Context & Action Features#

We know a specific task we want our Trainee to react() to, that is, to predict Highway MPG (the action feature) - using the context features: Year, DriveType, FuelType, CityMPG, PassengerVolume, LuggageVolume, and VehicleClass. We can use analyze() to improve performance of our model by analyzing for this specific target.

action_features = ['HighwayMPG']
# Code for `FuelType` prediction
# action_features = ['FuelType']
context_features = features.get_names(without=action_features)

trainee.analyze(context_features=context_features, action_features=action_features)

Step 6 - Generate Accuracy Metrics#

Review the accuracy of the Trainee by using the built-in react_aggregate() method, which performs a react() on each of the cases that is trained into the model. Then we can evaluate accuracy with the returned R-Squared (\(R^2\)), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) metrics since this is a regression task.

# Recommended metrics
stats = trainee.react_aggregate(
   action_feature=action_features[0],
   details={
      'prediction_stats': True,
      'selected_prediction_stats': ['rmse', 'spearman_coeff', 'r2', 'mae']
   }
)
stats
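The call above requests regression statistics. For the FuelType classification variant, the selected statistics would change accordingly. The following sketch builds the details dictionary for either case. The helper name is hypothetical, and the classification stat names 'accuracy', 'precision' and 'recall' are assumptions that mirror the metrics listed under Concepts & Terminology, so check them against the Howso API reference before use.

```python
def prediction_stats_details(feature_type):
    """Build a react_aggregate() details dict for the given action feature type."""
    if feature_type == "continuous":
        # Regression stats, as used in this guide.
        selected = ["rmse", "spearman_coeff", "r2", "mae"]
    else:
        # Assumed classification stat names; verify against the API reference.
        selected = ["accuracy", "precision", "recall", "mae"]
    return {
        "prediction_stats": True,
        "selected_prediction_stats": selected,
    }


regression_details = prediction_stats_details("continuous")
classification_details = prediction_stats_details("nominal")
print(regression_details)
print(classification_details)
```

The resulting dictionary could then be passed as the details argument in place of the literal shown in this step.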

Step 7 - Review Accuracy Metrics#

We see the Trainee has a very good fit for predicting Highway MPG with an \(R^2\) of 0.99, which shows the Trainee should be effective at predicting new cases of Highway MPG.

rmse              1.20
spearman_coeff    0.96
r2                0.99
mae               0.72
Name: HighwayMPG, dtype: float64

Step 8 - React to New Case#

We have a new vehicle we want to predict Highway MPG for. The test case is a 2022 All-Wheel Drive midsize car that uses Premium fuel, has a PassengerVolume of 95 and a LuggageVolume of 23, and gets a City MPG of 21.

The Trainee can react() to this new case and make a prediction.

data = {
    'Year': [2022],
    'DriveType': ['All-Wheel Drive'],
    'FuelType' : ['Premium'],
    'VehicleClass': ['Midsize Cars'],
    'CityMPG': [21],
    'PassengerVolume': [95],
    'LuggageVolume': [23]
}

test_case = pd.DataFrame(data)

result = trainee.react(
    test_case,
    action_features=action_features,
    context_features=context_features
)

Note

The method Trainee.predict() can also be used for predictions instead of Trainee.react(). Trainee.predict() serves as a convenience function that eliminates the extra output if all you want is the prediction.

Step 9 - Review Prediction#

Reviewing the prediction shows HighwayMPG of 29.

result['action']

HighwayMPG
29

Combined Code#

import pandas as pd
import matplotlib.pyplot as plt

from howso.engine import Trainee
from howso.utilities import infer_feature_attributes

df = pd.read_csv("./data/vehicle_predict.csv")
df = df.drop(['Make', 'Model'], axis=1)

# Auto detect features
features = infer_feature_attributes(df)

# For Regression, we will set `HighwayMPG` feature type to continuous
features['HighwayMPG']['type'] = 'continuous'

# For Classification, we will set `FuelType` feature type to nominal
features['FuelType']['type'] = 'nominal'

# We will also set these context features to continuous
features['CityMPG']['type'] = 'continuous'
features['Year']['type'] = 'continuous'
features['PassengerVolume']['type'] = 'continuous'
features['LuggageVolume']['type'] = 'continuous'

# Create a new Trainee, specify features
trainee = Trainee(features=features)

# Train trainee
trainee.train(df)

action_features = ['HighwayMPG']
# Code for `FuelType` prediction
# action_features = ['FuelType']
context_features = features.get_names(without=action_features)

trainee.analyze(context_features=context_features, action_features=action_features)

# Recommended metrics
stats = trainee.react_aggregate(
   action_feature=action_features[0],
   details={
      'prediction_stats': True,
      'selected_prediction_stats': ['rmse', 'spearman_coeff', 'r2', 'mae']
   }
)

stats

data = {
    'Year': [2022],
    'DriveType': ['All-Wheel Drive'],
    'FuelType' : ['Premium'],
    'VehicleClass': ['Midsize Cars'],
    'CityMPG': [21],
    'PassengerVolume': [95],
    'LuggageVolume': [23]
}

test_case = pd.DataFrame(data)

result = trainee.react(
    test_case,
    action_features=action_features,
    context_features=context_features
)

API References#