unique id for the workflow
Efficiently applies all fitted stages, grouping by level in the DAG where possible
data to transform
computation graph
points in the computation at which to persist intermediate results
spark session
transformed dataframe
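A minimal conceptual sketch of the level-wise application described above, not the library's internal implementation; the method and variable names are illustrative.
{{{
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.DataFrame

// Illustrative only: walk the DAG one level at a time, applying every fitted
// transformer in a level, and persist the running DataFrame every k stages to
// truncate the growing Spark plan.
def applyByLevel(
  data: DataFrame,
  dagLevels: Seq[Seq[Transformer]],
  persistEveryKStages: Int = 5
): DataFrame = {
  var df = data
  var applied = 0
  for (level <- dagLevels; stage <- level) {
    df = stage.transform(df)
    applied += 1
    if (applied % persistEveryKStages == 0) df = df.persist()
  }
  df
}
}}}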
Check that readers and features are set and that params match them
Determine if any of the raw features do not have a matching reader
Returns a dataframe containing all the columns generated up to and including the input feature
input feature to compute up to
persist data in transforms every k stages for performance improvement
Dataframe containing columns corresponding to all of the features generated up to the feature given
Computes a dataframe containing all the columns generated up to the input feature and saves it to the specified path in Avro format
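A hedged usage sketch of the two variants described above; model is a fitted OpWorkflowModel, normedAge stands in for any feature in its DAG, and the output path is purely illustrative.
{{{
// Materialize every column generated up to and including the given feature
val partialData = model.computeDataUpTo(normedAge)

// Same computation, but written out to the given path (Avro format) instead of returned
model.computeDataUpTo(normedAge, "/tmp/columns-up-to-normedAge")
}}}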
Looks at model parents to find the parent stage for features (since features are created from the estimator, not the fitted transformer)
feature to find the origin stage for
index of the parent stage
Fit the estimators to return a sequence of only transformers. Modified version of the Spark 2.x Pipeline.
dataframe to fit on
stages that need to be converted to transformers
persist data in transforms every k stages for performance improvement
fitted transformers
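A rough sketch of the Spark-Pipeline-style fit loop this describes, not the library's actual implementation: estimators are fit against the running dataframe and replaced by the transformers they produce, so downstream stages see the columns they need (the persist-every-k-stages optimization is omitted for brevity).
{{{
import org.apache.spark.ml.{Estimator, PipelineStage, Transformer}
import org.apache.spark.sql.DataFrame

// Illustrative only: fit estimators in topological order, returning transformers.
def fitStages(data: DataFrame, stages: Array[PipelineStage]): Array[Transformer] = {
  var df = data
  stages.map {
    case estimator: Estimator[_] =>
      val fitted = estimator.fit(df).asInstanceOf[Transformer]
      df = fitted.transform(df)
      fitted
    case transformer: Transformer =>
      df = transformer.transform(df)
      transformer
  }
}
}}}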
Used to generate dataframe from reader and raw features list
Dataframe with all the features generated and persisted
Get all the features that are potentially generated by the workflow: raw, intermediate and result features
all the features that are potentially generated by the workflow: raw, intermediate and result features
Get the list of raw features which have been blocklisted
blocklisted features
Get the list of Map Keys which have been blocklisted
blocklisted map keys
Get the parameter settings passed into the workflow
OpWorkflowParams set for this workflow
Get raw feature distribution information computed on training and scoring data during raw feature filter
sequence of feature distribution information
Get raw feature filter results (filter configuration, feature distributions, and feature exclusion reasons)
raw feature filter results
Get the raw features generated by the workflow
raw features for workflow
Get raw feature distribution information computed on scoring data during raw feature filter
sequence of feature distribution information
Get raw feature distribution information computed on training data during raw feature filter
sequence of feature distribution information
Get the data reader that will be used to generate the dataframe for the stages
reader for workflow
Get the final features generated by the workflow
result features for workflow
Get the stages used in this workflow
stages in the workflow
Whether the cross-validation/train-validation-split will be done at workflow level
true if the cross-validation will be done at workflow level, false otherwise
Load a previously trained workflow model from a path
path to the trained workflow model
whether to load the transformers as Spark-native or MLeap transformers, along with the TransmogrifAI transformers
local folder to copy and unpack the stored model into for loading
workflow model
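A hedged usage sketch: the path is illustrative, and workflow is assumed to be an OpWorkflow built with the same features and stages as the one that produced the saved model.
{{{
import com.salesforce.op.OpWorkflowModel

val model: OpWorkflowModel = workflow.loadModel("/my/saved/model/path")
}}}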
Set the input dataset, which contains columns corresponding to the raw features used in the workflow. The type of the dataset (Dataset[T]) must match the type of the FeatureBuilders[T] used to generate the raw features.
input dataset for workflow
key extract function
this workflow
Set the input RDD, which contains columns corresponding to the raw features used in the workflow. The type of the RDD (RDD[T]) must match the type of the FeatureBuilders[T] used to generate the raw features.
input rdd for workflow
key extract function
this workflow
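A hedged sketch of both setters; Passenger, passengerDS and passengerRDD are illustrative names, and the element type must match the type parameter of the FeatureBuilders that defined the raw features.
{{{
// Dataset[Passenger] input with a key-extraction function
workflow.setInputDataset(passengerDS, (p: Passenger) => p.id.toString)

// RDD[Passenger] input with the same key-extraction idea
workflow.setInputRDD(passengerRDD, (p: Passenger) => p.id.toString)
}}}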
Set stage and reader parameters from an OpWorkflowParams object for the run
new parameter values
this workflow
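A hedged sketch, assuming the parameter container is the library's OpParams class (the description above calls it OpWorkflowParams); in practice the parameters are usually loaded from a configuration file rather than constructed inline.
{{{
import com.salesforce.op.OpParams

val params = new OpParams()  // stand-in for a configuration loaded from file
workflow.setParameters(params)
}}}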
Set the data reader that will be used to generate the dataframe for the stages
reader for workflow
this workflow
This is used to set the stages of the workflow.
By setting the final features, the stages used to generate them can be traced back through the parent features and origin stages. The input is a tuple of features to support leaf feature generation (multiple endpoints in feature generation).
Final features generated by the workflow
Fit all of the estimators in the pipeline and return a pipeline model of only transformers. Uses data loaded as specified by the data reader to generate the initial data set.
persist data in transforms every k stages for performance improvement
a fitted pipeline model
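A hedged usage sketch; an implicit SparkSession is assumed to be in scope, and the workflow is assumed to already have its reader (or input dataset) and result features set. The optional persist-every-k-stages argument described above is left at its default.
{{{
val fittedModel = workflow.train()
}}}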
unique id for the workflow
Replaces any estimators in this workflow with their corresponding fitted models from the OpWorkflowModel passed in. Note that the stage UIDs must correspond EXACTLY in order to be replaced, so the same features and stages must be used in both the fitted OpWorkflowModel and this OpWorkflow. Any estimators that are not part of the OpWorkflowModel passed in will be trained when .train() is called on this OpWorkflow.
model containing fitted stages to be used in this workflow
an OpWorkflow containing all of the stages from this model plus any new stages needed to generate the features not included in the fitted model
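A hedged sketch of reusing previously fitted stages; the method name withModelStages is an assumption based on current versions of the library, and previouslyFittedModel is assumed to have been trained on the same features and stages.
{{{
// Reuse fitted stages where UIDs match; anything new is trained by the later .train() call
val updatedWorkflow = workflow.withModelStages(previouslyFittedModel)
val newModel = updatedWorkflow.train()
}}}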
Add a raw features filter to the workflow to look at fill rates and distributions of raw features and exclude features that do not meet specifications from the modeling DAG (a usage sketch follows the parameter descriptions below)
Type of the data read in
training reader to use in the filter; if not supplied, will fall back to the reader specified for the workflow (note that this reader takes precedence over readers set directly on the workflow if both are supplied)
scoring reader to use in the filter; if not supplied, only the checks possible with training data alone will be done
number of bins to use in estimating feature distributions
minimum non-null fraction of instances that a feature should contain
maximum absolute difference in fill rate between scoring and training data for a feature
maximum difference in fill ratio (symmetric) between scoring and training data for a feature
maximum Jensen-Shannon divergence between the training and scoring distributions for a feature
list of features that should never be removed (features that are used to create them will also be protected)
features that are protected from removal by JS divergence check
formula to compute the text features bin size. Input arguments are Summary and number of bins to use in computing feature distributions (histograms for numerics, hashes for strings). Output is the bins for the text features.
Time period used to apply the circular date transformation for date features; if not specified, the numeric feature transformation will be used
Minimum row threshold for scoring set comparisons to be used in checks. If the scoring set size is below this threshold, then only training data checks will be used
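A hedged usage sketch of the raw feature filter; prediction, trainReader, and scoreReader are assumed to exist and to read the same data type, and the threshold arguments listed above are left at their defaults here.
{{{
val workflowWithFilter = new OpWorkflow()
  .setResultFeatures(prediction)
  .withRawFeatureFilter(Option(trainReader), Option(scoreReader))
}}}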
:: Experimental :: Decides whether the cross-validation/train-validation-split will be done at workflow level. This will remove issues with data leakage, however it will impact the runtime.
this workflow, set to train part of the DAG within the cross-validation/train-validation split
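A hedged sketch; the no-argument withWorkflowCV call is my assumption for how this experimental switch is enabled.
{{{
val cvAtWorkflowLevel = workflow.withWorkflowCV
}}}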
Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features' lineage. It then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.
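A hedged end-to-end sketch of the standard usage pattern this paragraph describes; prediction (a result feature) and trainDataReader are assumed to be defined elsewhere, and an implicit SparkSession is assumed to be in scope.
{{{
import com.salesforce.op.OpWorkflow

val workflow = new OpWorkflow()
  .setResultFeatures(prediction)
  .setReader(trainDataReader)

val model = workflow.train()   // fits every estimator in the DAG
val scores = model.score()     // applies the resulting transformers to the data
}}}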