Class

com.salesforce.op

OpWorkflow

class OpWorkflow extends OpWorkflowCore

Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features' lineage. It then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.

Linear Supertypes
OpWorkflowCore, AnyRef, Any

Instance Constructors

  1. new OpWorkflow(uid: String = UID[OpWorkflow])

    uid

    unique id for the workflow

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def applyTransformationsDAG(rawData: DataFrame, dag: StagesDAG, persistEveryKStages: Int)(implicit spark: SparkSession): DataFrame

    Efficiently applies all fitted stages grouping by level in the DAG where possible

    rawData

    data to transform

    dag

    computation graph

    persistEveryKStages

    breaks in computation to persist

    spark

    spark session

    returns

    transformed dataframe

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
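
The level-by-level application described for applyTransformationsDAG can be illustrated without Spark. The following is a minimal sketch only, not the library's implementation: `Row`, `Stage`, and `applyLayers` are hypothetical stand-ins, rows are plain maps, and "persisting" is reduced to counting the points at which a real job would persist its DataFrame. How the library actually chooses those break points is an assumption here.

```scala
// Hypothetical stand-ins: a row maps column names to values,
// and a stage adds one output column computed from existing columns.
type Row = Map[String, Double]
final case class Stage(output: String, fn: Row => Double)

// Apply the DAG layer by layer. Stages in a layer read only columns produced
// by earlier layers, so a whole layer can be applied in a single pass.
// Returns the transformed rows and the number of checkpoints taken.
def applyLayers(data: Seq[Row], layers: Seq[Seq[Stage]], persistEveryK: Int): (Seq[Row], Int) = {
  require(persistEveryK > 0, "persistEveryK must be positive")
  var stagesSinceCheckpoint = 0
  var checkpointCount = 0
  val result = layers.foldLeft(data) { (current, layer) =>
    // One pass over the data adds every column of this layer.
    val next = current.map(row => row ++ layer.map(s => s.output -> s.fn(row)))
    stagesSinceCheckpoint += layer.size
    if (stagesSinceCheckpoint >= persistEveryK) {
      checkpointCount += 1 // a Spark job would persist the DataFrame here
      stagesSinceCheckpoint = 0
    }
    next
  }
  (result, checkpointCount)
}

val layers = Seq(
  Seq(Stage("a2", r => r("a") * 2), Stage("b1", r => r("b") + 1)),
  Seq(Stage("sum", r => r("a2") + r("b1")))
)
val (out, numCheckpoints) =
  applyLayers(Seq(Map("a" -> 1.0, "b" -> 2.0)), layers, persistEveryK = 2)
```

Grouping by layer means the data is traversed once per level rather than once per stage, which is the saving the description refers to.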
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. var blocklistedFeatures: Array[OPFeature]

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  7. var blocklistedMapKeys: Map[String, Set[String]]

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  8. def checkReadersAndFeatures(): Unit

    Check that readers and features are set and that params match them

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  9. def checkUnmatchedFeatures(): Unit

    Determine if any of the raw features do not have a matching reader

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  10. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  11. def computeDataUpTo(feature: OPFeature, persistEveryKStages: Int = OpWorkflowModel.PersistEveryKStages)(implicit spark: SparkSession): DataFrame

    Returns a dataframe containing all the columns generated up to and including the feature input

    feature

    input feature to compute up to

    persistEveryKStages

    persist data in transforms every k stages for performance improvement

    returns

    Dataframe containing columns corresponding to all of the features generated up to the feature given

    Definition Classes
    OpWorkflow → OpWorkflowCore
  12. def computeDataUpTo(feature: OPFeature, path: String)(implicit spark: SparkSession): Unit

    Computes a dataframe containing all the columns generated up to the feature input and saves it to the specified path in Avro format

    Definition Classes
    OpWorkflowCore
  13. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  14. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  15. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  16. def findOriginStageId(feature: OPFeature): Option[Int]

    Looks at model parents to match the parent stage for features (since features are created from the estimator, not the fitted transformer)

    feature

    feature to find the origin stage for

    returns

    index of the parent stage

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  17. def fitStages(data: DataFrame, stagesToFit: Array[OPStage], persistEveryKStages: Int)(implicit spark: SparkSession): Array[OPStage]

    Fit the estimators to return a sequence of only transformers. Modified version of Spark 2.x Pipeline.

    data

    dataframe to fit on

    stagesToFit

    stages that need to be converted to transformers

    persistEveryKStages

    persist data in transforms every k stages for performance improvement

    returns

    fitted transformers

    Attributes
    protected
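
The pipeline-style fitting that fitStages describes can be sketched in plain Scala. This is an illustration under stated assumptions rather than the actual method: `Stage`, `Transformer`, `Estimator`, and `fitAll` are hypothetical types defined here, the "data" is a `Seq[Double]`, and persistence is left out entirely.

```scala
// Hypothetical stand-ins for the Spark ML notions of transformer and estimator.
sealed trait Stage
final case class Transformer(name: String, f: Double => Double) extends Stage
final case class Estimator(name: String, fit: Seq[Double] => Transformer) extends Stage

// Walk the stages in order. Each estimator is fitted on the data as transformed
// by all preceding stages and is replaced by the transformer it produces.
def fitAll(data: Seq[Double], stages: Seq[Stage]): Seq[Transformer] = {
  val (_, fitted) = stages.foldLeft((data, Vector.empty[Transformer])) {
    case ((current, acc), stage) =>
      val transformer = stage match {
        case t: Transformer => t
        case e: Estimator   => e.fit(current)
      }
      (current.map(transformer.f), acc :+ transformer)
  }
  fitted
}

// An estimator that learns the mean of its input and subtracts it.
val center = Estimator("center", xs => {
  val mean = xs.sum / xs.size
  Transformer("centerModel", _ - mean)
})
val doubler = Transformer("double", _ * 2)

val input = Seq(1.0, 2.0, 3.0)
val fitted = fitAll(input, Seq(doubler, center))
val transformed = fitted.foldLeft(input)((xs, t) => xs.map(t.f))
```

The result contains only transformers, so applying it to new data needs no further fitting.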
  18. def generateRawData()(implicit spark: SparkSession): DataFrame

    Used to generate dataframe from reader and raw features list

    returns

    Dataframe with all the features generated + persisted

    Attributes
    protected
    Definition Classes
    OpWorkflow → OpWorkflowCore
  19. final def getAllFeatures(): Array[OPFeature]

    Get all the features that potentially are generated by the workflow: raw, intermediate and result features

    returns

    all the features that potentially are generated by the workflow: raw, intermediate and result features

    Definition Classes
    OpWorkflowCore
  20. final def getBlocklist(): Array[OPFeature]

    Get the list of raw features which have been blocklisted

    returns

    blocklisted features

    Definition Classes
    OpWorkflowCore
  21. final def getBlocklistMapKeys(): Map[String, Set[String]]

    Get the list of Map Keys which have been blocklisted

    returns

    blocklisted map keys

    Definition Classes
    OpWorkflowCore
  22. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  23. final def getParameters(): OpParams

    Get the parameter settings passed into the workflow

    returns

    OpParams set for this workflow

    Definition Classes
    OpWorkflowCore
  24. final def getRawFeatureDistributions(): Seq[FeatureDistribution]

    Get raw feature distribution information computed on training and scoring data during raw feature filter

    returns

    sequence of feature distribution information

    Definition Classes
    OpWorkflowCore
  25. final def getRawFeatureFilterResults(): RawFeatureFilterResults

    Get raw feature filter results (filter configuration, feature distributions, and feature exclusion reasons)

    returns

    raw feature filter results

    Definition Classes
    OpWorkflowCore
  26. final def getRawFeatures(): Array[OPFeature]

    Get the raw features generated by the workflow

    returns

    raw features for workflow

    Definition Classes
    OpWorkflowCore
  27. final def getRawScoringFeatureDistributions(): Seq[FeatureDistribution]

    Get raw feature distribution information computed on scoring data during raw feature filter

    returns

    sequence of feature distribution information

    Definition Classes
    OpWorkflowCore
  28. final def getRawTrainingFeatureDistributions(): Seq[FeatureDistribution]

    Get raw feature distribution information computed on training data during raw feature filter

    returns

    sequence of feature distribution information

    Definition Classes
    OpWorkflowCore
  29. final def getReader(): Reader[_]

    Get data reader that will be used to generate data frame for stages

    returns

    reader for workflow

    Definition Classes
    OpWorkflowCore
  30. final def getResultFeatures(): Array[OPFeature]

    Get the final features generated by the workflow

    returns

    result features for workflow

    Definition Classes
    OpWorkflowCore
  31. final def getStages(): Array[OPStage]

    Get the stages used in this workflow

    returns

    stages in the workflow

    Definition Classes
    OpWorkflowCore
  32. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  33. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  34. final def isWorkflowCV: Boolean

    Whether the cross-validation/train-validation-split will be done at workflow level

    returns

    true if the cross-validation will be done at workflow level, false otherwise

    Definition Classes
    OpWorkflowCore
  35. var isWorkflowCVEnabled: Boolean

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  36. def loadModel(path: String, asSpark: Boolean = true, modelStagingDir: String = WorkflowFileReader.modelStagingDir): OpWorkflowModel

    Load a previously trained workflow model from path

    path

    to the trained workflow model

    asSpark

    load the transformers as spark native or mleap transformers and tmog transformers

    modelStagingDir

    local folder to copy and unpack stored model to for loading

    returns

    workflow model

  37. lazy val log: Logger

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  38. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  39. final def notify(): Unit

    Definition Classes
    AnyRef
  40. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  41. var parameters: OpParams

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  42. var rawFeatureFilterResults: RawFeatureFilterResults

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  43. var rawFeatures: Array[OPFeature]

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  44. var reader: Option[Reader[_]]

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  45. var resultFeatures: Array[OPFeature]

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  46. def setBlocklistMapKeys(mapKeys: Map[String, Set[String]]): Unit

    Attributes
    protected[com.salesforce.op]
  47. final def setInputDataset[T](ds: Dataset[T], key: (T) ⇒ String = ReaderKey.randomKey)(implicit arg0: scala.reflect.api.JavaUniverse.WeakTypeTag[T]): OpWorkflow.this.type

    Set the input dataset, which contains columns corresponding to the raw features used in the workflow. The type of the dataset (Dataset[T]) must match the type of the FeatureBuilders[T] used to generate the raw features.

    ds

    input dataset for workflow

    key

    key extract function

    returns

    this workflow

    Definition Classes
    OpWorkflowCore
  48. final def setInputRDD[T](rdd: RDD[T], key: (T) ⇒ String = ReaderKey.randomKey)(implicit arg0: scala.reflect.api.JavaUniverse.WeakTypeTag[T]): OpWorkflow.this.type

    Set the input RDD, which contains columns corresponding to the raw features used in the workflow. The type of the RDD (RDD[T]) must match the type of the FeatureBuilders[T] used to generate the raw features.

    rdd

    input rdd for workflow

    key

    key extract function

    returns

    this workflow

    Definition Classes
    OpWorkflowCore
  49. final def setParameters(newParams: OpParams): OpWorkflow.this.type

    Set stage and reader parameters from an OpParams object for the run

    newParams

    new parameter values

    returns

    this workflow

  50. final def setReader(r: Reader[_]): OpWorkflow.this.type

    Set data reader that will be used to generate data frame for stages

    r

    reader for workflow

    returns

    this workflow

    Definition Classes
    OpWorkflowCore
  51. def setResultFeatures(features: OPFeature*): OpWorkflow.this.type

    This is used to set the stages of the workflow.

    By setting the final features the stages used to generate them can be traced back through the parent features and origin stages. The input is a tuple of features to support leaf feature generation (multiple endpoints in feature generation).

    features

    Final features generated by the workflow
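
Tracing stages back from the final features through parent features and origin stages can be sketched as follows. This is a simplified illustration, not the library's code: `Feature` and `stagesFor` are hypothetical, stages are represented by their names only, and the traversal order is an assumption made for the example.

```scala
// Hypothetical model: a feature knows the stage that produced it and its parents.
// Raw features have no origin stage and no parents.
final case class Feature(name: String, originStage: Option[String], parents: Seq[Feature])

// Collect every stage needed for the result features by walking parents depth-first.
// Parents are visited before the feature itself, so each stage appears after
// the stages it depends on, and shared stages are listed only once.
def stagesFor(resultFeatures: Seq[Feature]): Vector[String] = {
  def visit(acc: Vector[String], f: Feature): Vector[String] = {
    val withParents = f.parents.foldLeft(acc)(visit)
    f.originStage match {
      case Some(s) if !withParents.contains(s) => withParents :+ s
      case _                                   => withParents
    }
  }
  resultFeatures.foldLeft(Vector.empty[String])(visit)
}

val age = Feature("age", None, Nil)
val fare = Feature("fare", None, Nil)
val ageFilled = Feature("ageFilled", Some("fillMissing"), Seq(age))
val vector = Feature("vector", Some("vectorize"), Seq(ageFilled, fare))
val prediction = Feature("prediction", Some("model"), Seq(vector))
val agePlusFare = Feature("agePlusFare", Some("add"), Seq(ageFilled, fare))

// Two endpoints that share part of their lineage.
val stages = stagesFor(Seq(prediction, agePlusFare))
```

Passing several result features covers the multiple-endpoint case mentioned above: the shared `fillMissing` stage is collected once even though both endpoints depend on it.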

  52. var stages: Array[OPStage]

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  53. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  54. def toString(): String

    Definition Classes
    AnyRef → Any
  55. def train(persistEveryKStages: Int = OpWorkflowModel.PersistEveryKStages)(implicit spark: SparkSession): OpWorkflowModel

    Fit all of the estimators in the pipeline and return a pipeline model of only transformers. Uses data loaded as specified by the data reader to generate the initial data set.

    persistEveryKStages

    persist data in transforms every k stages for performance improvement

    returns

    a fitted pipeline model

  56. val uid: String

    unique id for the workflow

    Definition Classes
    OpWorkflow → OpWorkflowCore
  57. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  58. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  59. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  60. def withModelStages(model: OpWorkflowModel): OpWorkflow.this.type

    Replaces any estimators in this workflow with their corresponding fitted models from the OpWorkflowModel passed in. Note that the stage UIDs must correspond EXACTLY in order to be replaced, so the same features and stages must be used in both the fitted OpWorkflowModel and this OpWorkflow. Any estimators that are not part of the OpWorkflowModel passed in will be trained when .train() is called on this OpWorkflow.

    model

    model containing fitted stages to be used in this workflow

    returns

    an OpWorkflow containing all of the stages from this model plus any new stages needed to generate the features not included in the fitted model
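
The exact-UID matching rule can be illustrated with a small sketch. It assumes a simplified model in which a fitted stage carries the same uid as the estimator it came from; `Stage`, `Estimator`, `FittedModel`, and `withFitted` are hypothetical names, not part of the library.

```scala
// Hypothetical stand-ins: every stage has a uid, and a fitted model carries
// the uid of the estimator it was fitted from.
sealed trait Stage { def uid: String }
final case class Estimator(uid: String) extends Stage
final case class FittedModel(uid: String) extends Stage

// Replace each estimator whose uid exactly matches a fitted stage. Estimators
// without a match stay as they are, so a later training run still fits them.
def withFitted(workflowStages: Seq[Stage], fittedStages: Seq[FittedModel]): Seq[Stage] = {
  val byUid = fittedStages.map(m => m.uid -> m).toMap
  workflowStages.map {
    case e: Estimator => byUid.getOrElse(e.uid, e)
    case other        => other
  }
}

val merged = withFitted(
  Seq(Estimator("scaler_001"), Estimator("logreg_002")),
  Seq(FittedModel("scaler_001"))
)
```

Here only `scaler_001` is replaced; `logreg_002` has no fitted counterpart and remains an estimator to be trained.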

  61. def withRawFeatureFilter[T](trainingReader: Option[Reader[T]], scoringReader: Option[Reader[T]], bins: Int = 100, minFillRate: Double = 0.001, maxFillDifference: Double = 0.90, maxFillRatioDiff: Double = 20.0, maxJSDivergence: Double = 0.90, maxCorrelation: Double = 0.95, correlationType: CorrelationType = CorrelationType.Pearson, protectedFeatures: Array[OPFeature] = Array.empty, protectedJSFeatures: Array[OPFeature] = Array.empty, textBinsFormula: (Summary, Int) ⇒ Int = RawFeatureFilter.textBinsFormula, timePeriod: Option[TimePeriod] = None, minScoringRows: Int = ..., resultFeatureRetentionPolicy: ResultFeatureRetention = ResultFeatureRetention.Strict): OpWorkflow.this.type

    Add a raw features filter to the workflow to look at fill rates and distributions of raw features and exclude features that do not meet specifications from modeling DAG

    T

    Type of the data read in

    trainingReader

    training reader to use in the filter; if not supplied, falls back to the reader specified for the workflow (note that this reader will take precedence over readers directly input to the workflow if both are supplied)

    scoringReader

    scoring reader to use in the filter; if not supplied, only the checks possible with training data alone are performed

    bins

    number of bins to use in estimating feature distributions

    minFillRate

    minimum non-null fraction of instances that a feature should contain

    maxFillDifference

    maximum absolute difference in fill rate between scoring and training data for a feature

    maxFillRatioDiff

    maximum difference in fill ratio (symmetric) between scoring and training data for a feature

    maxJSDivergence

    maximum Jensen-Shannon divergence between the training and scoring distributions for a feature

    protectedFeatures

    list of features that should never be removed (features that are used to create them will also be protected)

    protectedJSFeatures

    features that are protected from removal by JS divergence check

    textBinsFormula

    formula to compute the text features bin size. Input arguments are Summary and number of bins to use in computing feature distributions (histograms for numerics, hashes for strings). Output is the bins for the text features.

    timePeriod

    Time period used to apply a circular date transformation to date features; if not specified, the numeric feature transformation is used

    minScoringRows

    Minimum row threshold for scoring set comparisons to be used in checks. If the scoring set size is below this threshold, then only training data checks will be used

    Annotations
    @Experimental()
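
Three of the checks named in the parameters (minimum fill rate, absolute fill rate difference, and Jensen-Shannon divergence) can be sketched in plain Scala. This is an illustration only, not the filter's implementation: all function names are hypothetical, the histograms are assumed to share the same bins, and base-2 logarithms are an assumption made so that the divergence lies in [0, 1]. The default thresholds are taken from the signature above.

```scala
// Base-2 logarithm (an assumption of this sketch).
def log2(x: Double): Double = math.log(x) / math.log(2.0)

// Kullback-Leibler divergence, skipping bins where p is zero.
def kl(p: Seq[Double], q: Seq[Double]): Double =
  p.zip(q).collect { case (pi, qi) if pi > 0.0 => pi * log2(pi / qi) }.sum

// Jensen-Shannon divergence between two histograms over the same bins.
def jsDivergence(trainCounts: Seq[Double], scoreCounts: Seq[Double]): Double = {
  val p = trainCounts.map(_ / trainCounts.sum)
  val q = scoreCounts.map(_ / scoreCounts.sum)
  val m = p.zip(q).map { case (pi, qi) => (pi + qi) / 2.0 }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

// Fraction of non-null values.
def fillRate(values: Seq[Option[Double]]): Double =
  if (values.isEmpty) 0.0 else values.count(_.isDefined).toDouble / values.size

// Reasons a feature would be excluded under the three checks covered here.
def exclusionReasons(
  train: Seq[Option[Double]], score: Seq[Option[Double]],
  trainHist: Seq[Double], scoreHist: Seq[Double],
  minFillRate: Double = 0.001, maxFillDifference: Double = 0.90, maxJSDivergence: Double = 0.90
): Seq[String] = {
  val (trainFill, scoreFill) = (fillRate(train), fillRate(score))
  Seq(
    (trainFill < minFillRate) -> "fill rate too low",
    (math.abs(trainFill - scoreFill) > maxFillDifference) -> "fill rate difference too high",
    (jsDivergence(trainHist, scoreHist) > maxJSDivergence) -> "JS divergence too high"
  ).collect { case (true, reason) => reason }
}

// Training and scoring values fall into disjoint bins, so only the JS check fails.
val reasons = exclusionReasons(
  train = Seq(Some(1.0), Some(2.0), None, Some(4.0)),
  score = Seq(Some(1.0), None, None, None),
  trainHist = Seq(10.0, 0.0),
  scoreHist = Seq(0.0, 10.0)
)
```

The fill ratio, correlation, and minimum scoring row checks are omitted from this sketch.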
  62. final def withWorkflowCV: OpWorkflow.this.type

    :: Experimental :: Decides whether the cross-validation/train-validation-split will be done at workflow level. This will remove issues with data leakage; however, it will impact the runtime.

    returns

    this workflow that will train part of the DAG in the cross-validation/train validation split

    Definition Classes
    OpWorkflowCore
    Annotations
    @Experimental()
