reader to get the training data
reader to get the scoring data for comparison (optional; if not provided, exclusions will be based on training data checks only)
number of bins to use in computing feature distributions (histograms for numerics, hashes for strings)
minimum fill rate a feature must have in the training dataset and scoring dataset to be kept
maximum acceptable fill rate difference between training and scoring data for a feature to be kept
maximum acceptable fill ratio between training and scoring (larger / smaller)
maximum Jensen-Shannon divergence between training and scoring distributions for a feature to be kept
maximum absolute correlation allowed between raw predictor null indicator and label
type of correlation metric to use
features that are protected from removal by the JS divergence check
features that are protected from removal
formula to compute the text feature bin size. Input arguments are the feature Summary and the number of bins used in computing feature distributions (histograms for numerics, hashes for strings); the output is the number of bins to use for the text features.
Time period used to apply the circular date transformation for date features; if not specified, the regular numeric feature transformation will be used
Minimum row threshold for scoring set comparisons to be used in checks. If the scoring set size is below this threshold, then only training data checks will be used
number of bins to use in computing feature distributions (histograms for numerics, hashes for strings)
type of correlation metric to use
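In practice these settings are supplied when attaching the filter to a workflow. Below is a minimal sketch of that wiring, assuming the TransmogrifAI `OpWorkflow.withRawFeatureFilter` API; the parameter names, the example threshold values, and the `trainReader`/`scoreReader`/`prediction`/`age`/`gender` identifiers are illustrative assumptions, not prescribed values.

```scala
import com.salesforce.op._

// trainReader, scoreReader, prediction, age and gender are placeholders assumed
// to be defined elsewhere (data readers and OPFeatures built for the DAG).
val workflow = new OpWorkflow()
  .setResultFeatures(prediction)
  .withRawFeatureFilter(
    trainingReader = Option(trainReader),   // reader to get the training data
    scoringReader = Option(scoreReader),    // optional reader to get the scoring data
    bins = 100,                             // bins for feature distributions
    minFillRate = 0.001,                    // minimum fill rate to keep a feature
    maxFillDifference = 0.90,               // max train/score fill rate difference
    maxFillRatioDiff = 20.0,                // max train/score fill ratio (larger / smaller)
    maxJSDivergence = 0.90,                 // max Jensen-Shannon divergence
    maxCorrelation = 0.95,                  // max |correlation| of null indicator with label
    protectedFeatures = Array(age, gender)  // features protected from removal
  )
```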
Function that takes the raw features and parameters used in the workflow and uses this information, along with the readers for this stage, to determine which features should be dropped from the workflow
raw features used in the workflow
parameters used in the workflow
spark instance
dataframe with bad features and bad map keys removed, together with a list of all features that should be dropped from the DAG
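The shape of that call is roughly as follows. This is a sketch under assumptions about the TransmogrifAI `RawFeatureFilter.generateFilteredRaw` method and its result fields (`cleanedData`, `featuresToDrop`), which may differ by version.

```scala
import org.apache.spark.sql.SparkSession

// rawFeatureFilter, rawFeatures and params are assumed to be defined elsewhere:
// the configured filter, the Array[OPFeature] used in the workflow, and the OpParams.
implicit val spark: SparkSession = SparkSession.builder().getOrCreate()

val filtered = rawFeatureFilter.generateFilteredRaw(rawFeatures, params)

val cleanedData = filtered.cleanedData       // dataframe with bad features / bad map keys removed
val featuresToDrop = filtered.featuresToDrop // raw features to exclude from the DAG
```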
features that are protected from removal by the JS divergence check
maximum absolute correlation allowed between raw predictor null indicator and label
maximum acceptable fill rate difference between training and scoring data for a feature to be kept
maximum acceptable fill ratio between training and scoring (larger / smaller)
maximum Jensen-Shannon divergence between training and scoring distributions for a feature to be kept
minimum fill rate a feature must have in the training dataset and scoring dataset to be kept
Minimum row threshold for scoring set comparisons to be used in checks. If the scoring set size is below this threshold, then only training data checks will be used
features that are protected from removal
reader to get the scoring data for comparison (optional; if not provided, exclusions will be based on training data checks only)
formula to compute the text feature bin size. Input arguments are the feature Summary and the number of bins used in computing feature distributions (histograms for numerics, hashes for strings); the output is the number of bins to use for the text features.
Time period used to apply the circular date transformation for date features; if not specified, the regular numeric feature transformation will be used
reader to get the training data
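The text bins formula is a plain function from a feature `Summary` and the global bin count to the number of bins to use for text features. Below is a minimal sketch of a custom formula; the `Summary` import path and its `max` field are assumptions about the library's types.

```scala
import com.salesforce.op.filters.Summary  // assumed location of the Summary type

// Cap the number of text bins at the maximum value recorded in the Summary,
// so low-cardinality (likely categorical) text gets fewer bins.
def cappedTextBins(summary: Summary, bins: Int): Int =
  math.min(summary.max, bins.toDouble).toInt

// Supplied in place of the default, e.g. textBinsFormula = cappedTextBins _
```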
Specialized stage that will load up data and compute distributions and empty counts on raw features. This information is then used to determine which raw features should be excluded from the workflow DAG. Note: currently, raw features that aren't explicitly blocklisted, but are unused because they are inputs to explicitly blocklisted features, are not present as raw features in the model, nor in ModelInsights. However, they are accessible from an OpWorkflowModel via getRawFeatureFilterResults().
datatype of the reader
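As noted above, the filter's decisions survive on the fitted model even for features that never appear in the DAG. A minimal sketch of inspecting them follows; `getRawFeatureFilterResults()` is taken from the description above, while the `exclusionReasons` field and the `workflow.train()` usage are assumptions.

```scala
// Train the workflow (the filter runs as part of fitting), then inspect results.
val model: OpWorkflowModel = workflow.train()

val rffResults = model.getRawFeatureFilterResults()
// e.g. list which raw features were excluded and the checks that excluded them
rffResults.exclusionReasons.foreach(println)
```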