unique identifier for this workflow model
params that were used during model training
Efficiently applies all fitted stages, grouping them by level in the DAG where possible
Efficiently applies all fitted stages, grouping them by level in the DAG where possible
data to transform
computation graph
how often to break up the computation by persisting the data
spark session
transformed dataframe
Check that readers and features are set and that params match them
Check that readers and features are set and that params match them
Determine if any of the raw features do not have a matching reader
Determine if any of the raw features do not have a matching reader
Returns a dataframe containing all the columns generated up to and including the feature input
Returns a dataframe containing all the columns generated up to and including the feature input
input feature to compute up to
persist data in transforms every k stages for performance improvement
Dataframe containing columns corresponding to all of the features generated up to the feature given
IllegalArgumentException
if a feature is not part of this workflow
Computes a dataframe containing all the columns generated up to the feature input and saves it to the specified path in avro format
Computes a dataframe containing all the columns generated up to the feature input and saves it to the specified path in avro format
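A minimal sketch of both computeDataUpTo variants, assuming a fitted model: OpWorkflowModel, an implicit SparkSession in scope, and a feature reference checkedFeatures taken from the workflow definition (all names and the output path are illustrative):

  import org.apache.spark.sql.DataFrame

  // In-memory variant: returns every column generated up to and including the given feature
  val partialData: DataFrame = model.computeDataUpTo(checkedFeatures)

  // Saving variant: writes the same columns out to the given path in avro format
  model.computeDataUpTo(checkedFeatures, "/tmp/partial-data")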
Creates a copy of this OpWorkflowModel instance
Creates a copy of this OpWorkflowModel instance
copy of this OpWorkflowModel instance
Load up the data using the reader, transform it, and then evaluate it
Load up the data using the reader, transform it, and then evaluate it
OP Evaluator
path to write out the metrics
spark session
evaluation metrics
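A hedged sketch of evaluate for a binary classification problem, assuming label and prediction features named survived and prediction from the workflow definition and an implicit SparkSession in scope; the metricsPath parameter name mirrors the description above and is an assumption:

  import com.salesforce.op.evaluators.Evaluators

  val evaluator = Evaluators.BinaryClassification()
    .setLabelCol(survived)
    .setPredictionCol(prediction)

  // Reads the data with the reader, transforms it, then computes and writes out the metrics
  val metrics = model.evaluate(evaluator = evaluator, metricsPath = Some("/tmp/metrics"))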
Looks at model parents to match the parent stage for features (since features are created by the estimator, not the fitted transformer)
Looks at model parents to match the parent stage for features (since features are created by the estimator, not the fitted transformer)
feature to find the origin stage for
index of the parent stage
Used to generate dataframe from reader and raw features list
Used to generate dataframe from reader and raw features list
Dataframe with all the features generated + persisted
Get all the features that potentially are generated by the workflow: raw, intermediate and result features
Get all the features that potentially are generated by the workflow: raw, intermediate and result features
all the features that potentially are generated by the workflow: raw, intermediate and result features
Get the list of raw features which have been blocklisted
Get the list of raw features which have been blocklisted
blocklisted features
Get the list of Map Keys which have been blocklisted
Get the list of Map Keys which have been blocklisted
blocklisted map keys
Get the metadata associated with the features
Get the metadata associated with the features
features to get metadata for
metadata associated with the features
IllegalArgumentException
if a feature is not part of this workflow
Gets the fitted stage that generates the input feature
Gets the fitted stage that generates the input feature
Type of feature
feature to get the origin stage for
Fitted origin stage for feature
IllegalArgumentException
if a feature is not part of this workflow
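For example, assuming a result feature named prediction produced by this workflow, the fitted stage behind it can be looked up as follows (a sketch; the feature name is illustrative):

  // Throws IllegalArgumentException if the feature was not produced by this workflow
  val originStage = model.getOriginStageOf(prediction)
  println(originStage.uid)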
Get the parameter settings passed into the workflow
Get the parameter settings passed into the workflow
OpWorkflowParams set for this workflow
Get raw feature distribution information computed on training and scoring data during raw feature filter
Get raw feature distribution information computed on training and scoring data during raw feature filter
sequence of feature distribution information
Get raw feature filter results (filter configuration, feature distributions, and feature exclusion reasons)
Get raw feature filter results (filter configuration, feature distributions, and feature exclusion reasons)
raw feature filter results
Get the raw features generated by the workflow
Get the raw features generated by the workflow
raw features for workflow
Get raw feature distribution information computed on scoring data during raw feature filter
Get raw feature distribution information computed on scoring data during raw feature filter
sequence of feature distribution information
Get raw feature distribution information computed on training data during raw feature filter
Get raw feature distribution information computed on training data during raw feature filter
sequence of feature distribution information
Get data reader that will be used to generate data frame for stages
Get data reader that will be used to generate data frame for stages
reader for workflow
Get the final features generated by the workflow
Get the final features generated by the workflow
result features for workflow
Get the stages used in this workflow
Get the stages used in this workflow
stages in the workflow
Gets the updated version of a feature when the DAG has been modified with a raw feature filter
Gets the updated version of a feature when the DAG has been modified with a raw feature filter
feature to get the updated history for
Updated instance of feature
IllegalArgumentException
if a feature is not part of this workflow
Whether the cross-validation/train-validation-split will be done at workflow level
Whether the cross-validation/train-validation-split will be done at workflow level
true if the cross-validation will be done at workflow level, false otherwise
Get model insights for the model used to create the input feature.
Get model insights for the model used to create the input feature. Will traverse the DAG to find the LAST model selector and sanity checker used in the creation of the selected feature
feature to find model info for
Model insights class containing summary of modeling and sanity checking
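A brief sketch of pulling insights for a result feature, assuming a feature named prediction produced by a model selector in this workflow; the featureName field access is an assumption about the insights structure:

  val insights = model.modelInsights(prediction)
  // Summary of the last model selector and sanity checker found while traversing the DAG
  val contributingFeatures = insights.features.map(_.featureName)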
Save this model to a path
Save this model to a path
path to save the model
should overwrite if the path exists
local folder to copy and unpack stored model to for loading
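A minimal save/load sketch; the path is illustrative, the overwrite parameter name mirrors the description above, and loading back assumes access to the original workflow object that defined the model:

  // Persist the fitted model, overwriting anything already at the path
  model.save("/tmp/op-model", overwrite = true)

  // Later: reload the fitted model through the workflow that defined it
  val loadedModel: OpWorkflowModel = workflow.loadModel("/tmp/op-model")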
Load up the data as specified by the data reader, then transform that data using the transformers specified in this workflow.
Load up the data as specified by the data reader, then transform that data using the transformers specified in this workflow. We will always keep the key and result features in the returned dataframe, but there are options to keep the other raw & intermediate features.
This method optimizes scoring by grouping OpTransformer stages and applying them in bulk at each step. The rest of the stages are applied sequentially (as org.apache.spark.ml.Pipeline does)
optional path to write out the scores to a file
flag to enable keeping raw features in the output DataFrame as well
flag to enable keeping intermediate features in the output DataFrame as well
how often to break up Catalyst by persisting the data (applies to non-OpTransformer stages only); to turn this off, set to Int.MaxValue (not recommended)
should persist the final scores dataframe
Dataframe that contains all the columns generated by the transformers in this workflow model as well as the key and result features, along with other features if the above flags are set to true.
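A sketch of scoring with the options described above, assuming a fitted model with a scoring reader already set and an implicit SparkSession in scope; the named parameters mirror the descriptions above and are assumptions:

  import org.apache.spark.sql.DataFrame

  // Reads the data, applies all fitted transformers and optionally writes the scores out
  val scores: DataFrame = model.score(
    path = Some("/tmp/scores"),
    keepRawFeatures = false,
    keepIntermediateFeatures = false
  )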
Load up the data as specified by the data reader, then transform that data using the transformers specified in this workflow.
Load up the data as specified by the data reader, then transform that data using the transformers specified in this workflow. We will always keep the key and result features in the returned dataframe, but there are options to keep the other raw & intermediate features.
This method optimizes scoring by grouping OpTransformer stages and applying them in bulk at each step. The rest of the stages are applied sequentially (as org.apache.spark.ml.Pipeline does)
evaluator to use for metrics generation
optional path to write out the scores to a file
flag to enable keeping raw features in the output DataFrame as well
flag to enable keeping intermediate features in the output DataFrame as well
how often to break up Catalyst by persisting the data (applies to non-OpTransformer stages only); to turn this off, set to Int.MaxValue (not recommended)
should persist the final scores dataframe
optional path to write out the metrics to a file
Dataframe that contains all the columns generated by the transformers in this workflow model as well as the key and result features, along with other features if the above flags are set to true. Also returns the metrics computed with the evaluator.
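A sketch of scoring and evaluating in one pass, reusing the evaluator from the evaluate example above (names are illustrative):

  // Returns both the scored dataframe and the metrics computed by the evaluator
  val (scores, metrics) = model.scoreAndEvaluate(evaluator = evaluator)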
Set the input dataset, which contains columns corresponding to the raw features used in the workflow. The type of the dataset (Dataset[T]) must match the type of the FeatureBuilders[T] used to generate the raw features
Set the input dataset, which contains columns corresponding to the raw features used in the workflow. The type of the dataset (Dataset[T]) must match the type of the FeatureBuilders[T] used to generate the raw features
input dataset for workflow
key extraction function
this workflow
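A minimal sketch, assuming a SparkSession named spark and a Passenger case class that matches the FeatureBuilder[Passenger] definitions used for the raw features and is defined alongside them; the key parameter name mirrors the description above and is an assumption:

  import org.apache.spark.sql.Dataset
  import spark.implicits._

  // Passenger and its fields are assumed from the workflow's raw feature definitions
  val passengers: Dataset[Passenger] = Seq(Passenger(id = 1L, age = Some(22.0), survived = Some(1.0))).toDS()
  val withData = model.setInputDataset(passengers, key = (p: Passenger) => p.id.toString)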
Set the input RDD, which contains columns corresponding to the raw features used in the workflow. The type of the RDD (RDD[T]) must match the type of the FeatureBuilders[T] used to generate the raw features
Set the input RDD, which contains columns corresponding to the raw features used in the workflow. The type of the RDD (RDD[T]) must match the type of the FeatureBuilders[T] used to generate the raw features
input rdd for workflow
key extraction function
this workflow
Set reader parameters from the OpWorkflowParams object for this run (stage parameters passed in will have no effect)
Set reader parameters from the OpWorkflowParams object for this run (stage parameters passed in will have no effect)
new parameter values
Set data reader that will be used to generate data frame for stages
Set data reader that will be used to generate data frame for stages
reader for workflow
this workflow
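A sketch of swapping in a reader before scoring, assuming the same Passenger case class as above and an illustrative csv path; DataReaders.Simple.csvCase is the reader factory typically used for case-class backed csv input:

  import com.salesforce.op.readers.DataReaders

  val scoringReader = DataReaders.Simple.csvCase[Passenger](
    path = Option("/tmp/passengers-to-score.csv"),
    key = _.id.toString
  )

  // Returns this workflow model with the new reader set, ready for score()
  val scoringModel = model.setReader(scoringReader)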
Extracts all summary metadata from transformers in JSON format
Extracts all summary metadata from transformers in JSON format
json string summary
Extracts all summary metadata from transformers in JSON format
Extracts all summary metadata from transformers in JSON format
json summary
Generates a high level model summary in a compact print friendly format containing: selected model info, model evaluation results and feature correlations/contributions/cramersV values.
Generates a high level model summary in a compact print friendly format containing: selected model info, model evaluation results and feature correlations/contributions/cramersV values.
model insights to compute the summary against
top K of feature correlations/contributions/cramersV values to print
high level model summary in a compact print friendly format
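A sketch of the summary accessors on a trained model; whether summaryPretty exposes the top-K limit as a parameter is an assumption, so it is only noted in a comment here:

  // Full summary metadata extracted from the fitted transformers, as a JSON string
  val jsonSummary: String = model.summary()

  // Compact, print friendly summary (optionally limited to the top K feature contributions)
  println(model.summaryPretty())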
params that were used during model training
unique identifier for this workflow model
unique identifier for this workflow model
:: Experimental :: Decides whether the cross-validation/train-validation-split will be done at the workflow level. This removes issues with data leakage; however, it will impact the runtime
:: Experimental :: Decides whether the cross-validation/train-validation-split will be done at the workflow level. This removes issues with data leakage; however, it will impact the runtime
this workflow, which will train part of the DAG within the cross-validation/train-validation split
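A sketch of enabling workflow-level cross-validation when the workflow is defined, before training; the prediction feature and trainReader names are illustrative:

  import com.salesforce.op.OpWorkflow

  val fittedModel = new OpWorkflow()
    .setResultFeatures(prediction)
    .setReader(trainReader)
    .withWorkflowCV
    .train()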
A workflow model is a container and executor for the sequence of transformations that have been fit to the data to produce the desired output features