Class

com.salesforce.op

OpWorkflow

class OpWorkflow extends OpWorkflowCore

Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features' lineage. It then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model.
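
A minimal sketch of this flow, modeled on the TransmogrifAI README. The `Passenger` case class, its fields, the feature definitions and the CSV path are hypothetical placeholders, and a binary classification model selector is assumed as the final stage; any stage that yields a result feature could take its place.

```scala
import com.salesforce.op._
import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.readers.DataReaders
import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector
import org.apache.spark.sql.SparkSession

// Hypothetical record type; not part of the library.
case class Passenger(id: Int, survived: Int, age: Option[Double], fare: Option[Double])

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession.builder().appName("workflow-sketch").master("local[*]").getOrCreate()
    import spark.implicits._ // encoders for the Passenger case class

    // Raw features: one response and two predictors.
    val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse
    val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor
    val fare = FeatureBuilder.Real[Passenger].extract(_.fare.toReal).asPredictor

    // Derived features: a feature vector and a prediction (an estimator stage).
    val featureVector = Seq(age, fare).transmogrify()
    val prediction = BinaryClassificationModelSelector
      .withCrossValidation()
      .setInput(survived, featureVector)
      .getOutput()

    val reader = DataReaders.Simple.csvCase[Passenger](
      path = Option("/path/to/passengers.csv"), // placeholder path
      key = _.id.toString
    )

    // The workflow traces the DAG back from the result feature and fits its estimators.
    val model = new OpWorkflow()
      .setResultFeatures(prediction)
      .setReader(reader)
      .train()

    spark.stop()
  }
}
```

Only the result feature is handed to the workflow; the raw features and intermediate stages are recovered from its lineage.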

Linear Supertypes
OpWorkflowCore, AnyRef, Any

Instance Constructors

  1. new OpWorkflow(uid: String = UID[OpWorkflow])

    uid

    unique id for the workflow

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def applyTransformationsDAG(rawData: DataFrame, dag: StagesDAG, persistEveryKStages: Int)(implicit spark: SparkSession): DataFrame

    Efficiently applies all fitted stages grouping by level in the DAG where possible

    rawData

    data to transform

    dag

    computation graph

    persistEveryKStages

    number of stages between breaks in the computation at which the data is persisted

    spark

    spark session

    returns

    transformed dataframe

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
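
    Grouping stages by level can be pictured with the following self-contained sketch. It is not the library's implementation: the `Stage` case class and the `levels` function are hypothetical, and the graph is assumed to be acyclic. Stages without dependencies form level 0, and every other stage sits one level above its deepest dependency, so all stages within a level can be applied together.

    ```scala
    object DagLevels {
      // Hypothetical stage description: a name plus the names of the stages it depends on.
      final case class Stage(name: String, dependsOn: Set[String])

      // Groups stage names by level, lowest level first. Assumes an acyclic graph.
      def levels(stages: Seq[Stage]): Seq[Seq[String]] = {
        val byName = stages.map(s => s.name -> s).toMap
        val memo = scala.collection.mutable.Map.empty[String, Int]

        // Depth of a stage: 0 for sources, otherwise 1 + the deepest dependency.
        def depth(name: String): Int = memo.get(name) match {
          case Some(d) => d
          case None =>
            val deps = byName(name).dependsOn
            val d = if (deps.isEmpty) 0 else deps.map(depth).max + 1
            memo(name) = d
            d
        }

        stages.groupBy(s => depth(s.name)).toSeq.sortBy(_._1).map(_._2.map(_.name))
      }
    }
    ```

    For stages `a` and `b` with no dependencies, `c` depending on both, and `d` depending on `c`, the levels are `a, b`, then `c`, then `d`.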
  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. def checkReadersAndFeatures(): Unit

    Check that readers and features are set and that params match them

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  7. def checkUnmatchedFeatures(): Unit

    Determine if any of the raw features do not have a matching reader

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. def computeDataUpTo(feature: OPFeature, persistEveryKStages: Int = OpWorkflowModel.PersistEveryKStages)(implicit spark: SparkSession): DataFrame

    Returns a dataframe containing all the columns generated up to and including the feature input

    feature

    input feature to compute up to

    persistEveryKStages

    persist data in transforms every k stages for performance improvement

    returns

    Dataframe containing columns corresponding to all of the features generated up to the feature given

    Definition Classes
    OpWorkflow → OpWorkflowCore
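
    A hedged usage sketch. The helper name `inspectUpTo` is hypothetical, and the workflow is assumed to already have its reader and result features set. With a single argument the call resolves to this overload and uses the default for persistEveryKStages.

    ```scala
    import com.salesforce.op.OpWorkflow
    import com.salesforce.op.features.OPFeature
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ComputeUpToSketch {
      // Hypothetical helper: materializes every column needed to produce `feature`
      // and prints the resulting schema for inspection.
      def inspectUpTo(workflow: OpWorkflow, feature: OPFeature)
                     (implicit spark: SparkSession): DataFrame = {
        val df = workflow.computeDataUpTo(feature)
        df.printSchema()
        df
      }
    }
    ```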
  10. def computeDataUpTo(feature: OPFeature, path: String)(implicit spark: SparkSession): Unit

    Computes a dataframe containing all the columns generated up to the feature input and saves it to the specified path in avro format

    Definition Classes
    OpWorkflowCore
  11. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  12. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  13. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  14. def findOriginStageId(feature: OPFeature): Option[Int]

    Looks at model parents to match parent stage for features (since features are created from the estimator not the fitted transformer)

    feature

    feature to find the origin stage for

    returns

    index of the parent stage

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  15. def fitStages(data: DataFrame, stagesToFit: Array[OPStage], persistEveryKStages: Int)(implicit spark: SparkSession): Array[OPStage]

    Fits the estimators to return a sequence of only transformers. Modified version of the Spark 2.x Pipeline.

    data

    dataframe to fit on

    stagesToFit

    stages that need to be converted to transformers

    persistEveryKStages

    persist data in transforms every k stages for performance improvement

    returns

    fitted transformers

    Attributes
    protected
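
    The fitting pattern borrowed from the Spark Pipeline can be sketched without Spark as follows. This is an illustration under simplifying assumptions, not the library's code: `Data`, `Transformer`, `Estimator` and this toy `fitStages` are hypothetical stand-ins, and persisting is reduced to counting the points at which data would be persisted. Each estimator is fitted on the data produced by the stages before it, and the resulting transformer is applied before moving on.

    ```scala
    object FitSketch {
      type Data = Vector[Double]

      sealed trait Stage
      // A transformer maps data to data.
      final case class Transformer(name: String, f: Data => Data) extends Stage
      // An estimator learns a transformer from data.
      final case class Estimator(name: String, fit: Data => Transformer) extends Stage

      // Returns the fitted transformers, the fully transformed data, and how many
      // times the data would have been persisted (after every k-th stage).
      def fitStages(data: Data, stages: Seq[Stage], persistEveryK: Int): (Seq[Transformer], Data, Int) =
        stages.zipWithIndex.foldLeft((Vector.empty[Transformer], data, 0)) {
          case ((fitted, current, persists), (stage, i)) =>
            val transformer = stage match {
              case e: Estimator => e.fit(current) // fit on the data seen so far
              case t: Transformer => t            // already a transformer
            }
            val next = transformer.f(current)
            val persistNow = (i + 1) % persistEveryK == 0
            (fitted :+ transformer, next, if (persistNow) persists + 1 else persists)
        }
    }
    ```

    With a doubling transformer followed by a mean-centering estimator, the input `Vector(1.0, 2.0, 3.0)` becomes `Vector(2.0, 4.0, 6.0)` and then `Vector(-2.0, 0.0, 2.0)`, because the estimator learns the mean from the already doubled data.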
  16. def generateRawData()(implicit spark: SparkSession): DataFrame

    Used to generate dataframe from reader and raw features list

    returns

    Dataframe with all the features generated and persisted

    Attributes
    protected
    Definition Classes
    OpWorkflow → OpWorkflowCore
  17. final def getBlacklist(): Array[OPFeature]

    Get the list of raw features which have been blacklisted

    returns

    blacklisted features

    Definition Classes
    OpWorkflowCore
  18. final def getBlacklistMapKeys(): Map[String, Set[String]]

    Get the list of Map Keys which have been blacklisted

    returns

    blacklisted map keys

    Definition Classes
    OpWorkflowCore
  19. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  20. final def getParameters(): OpParams

    Get the parameter settings passed into the workflow

    returns

    OpParams set for this workflow

    Definition Classes
    OpWorkflowCore
  21. final def getRawFeatureDistributions(): Array[FeatureDistribution]

    Get raw feature distribution information computed on training and scoring data during raw feature filter

    returns

    sequence of feature distribution information

    Definition Classes
    OpWorkflowCore
  22. final def getRawScoringFeatureDistributions(): Array[FeatureDistribution]

    Get raw feature distribution information computed on scoring data during raw feature filter

    returns

    sequence of feature distribution information

    Definition Classes
    OpWorkflowCore
  23. final def getRawTrainingFeatureDistributions(): Array[FeatureDistribution]

    Get raw feature distribution information computed on training data during raw feature filter

    returns

    sequence of feature distribution information

    Definition Classes
    OpWorkflowCore
  24. final def getResultFeatures(): Array[OPFeature]

    Get the final features generated by the workflow

    returns

    result features for workflow

    Definition Classes
    OpWorkflowCore
  25. final def getStages(): Array[OPStage]

    Get the stages used in this workflow

    returns

    stages in the workflow

    Definition Classes
    OpWorkflowCore
  26. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  27. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  28. def loadModel(path: String): OpWorkflowModel

    Load a previously trained workflow model from path

    path

    path to the trained workflow model

    returns

    workflow model
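
    A hedged usage sketch. The helper name `scoreWithSavedModel` is hypothetical. It assumes the workflow is rebuilt with the same result features that were used when the model was trained, and that the returned OpWorkflowModel exposes `setReader` and `score()`; those belong to the model's own API and are not documented on this page.

    ```scala
    import com.salesforce.op.{OpWorkflow, OpWorkflowModel}
    import com.salesforce.op.features.OPFeature
    import com.salesforce.op.readers.Reader
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object LoadModelSketch {
      // Hypothetical helper: loads a saved model and scores the data from `scoringReader`.
      def scoreWithSavedModel(resultFeature: OPFeature, modelPath: String, scoringReader: Reader[_])
                             (implicit spark: SparkSession): DataFrame = {
        // Assumption: the workflow definition matches the one that produced the saved model.
        val workflow = new OpWorkflow().setResultFeatures(resultFeature)
        val model: OpWorkflowModel = workflow.loadModel(modelPath)
        model.setReader(scoringReader).score()
      }
    }
    ```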

  29. lazy val log: Logger

    Attributes
    protected
    Definition Classes
    OpWorkflowCore
  30. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  31. final def notify(): Unit

    Definition Classes
    AnyRef
  32. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  33. def setBlacklistMapKeys(mapKeys: Map[String, Set[String]]): Unit

    Attributes
    protected[com.salesforce.op]
  34. final def setInputDataset[T](ds: Dataset[T], key: (T) ⇒ String = ReaderKey.randomKey)(implicit arg0: scala.reflect.api.JavaUniverse.WeakTypeTag[T]): OpWorkflow.this.type

    Sets the input dataset, which contains columns corresponding to the raw features used in the workflow. The type of the dataset (Dataset[T]) must match the type of the FeatureBuilders[T] used to generate the raw features.

    ds

    input dataset for workflow

    key

    key extract function

    returns

    this workflow

    Definition Classes
    OpWorkflowCore
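
    A hedged usage sketch. The `Passenger` case class and the helper name `workflowOver` are hypothetical; the raw features behind `resultFeature` are assumed to have been built with FeatureBuilders over the same `Passenger` type. The key function here uses the record id instead of the default random key.

    ```scala
    import com.salesforce.op.OpWorkflow
    import com.salesforce.op.features.OPFeature
    import org.apache.spark.sql.Dataset

    // Hypothetical record type; it must match the type used by the FeatureBuilders.
    case class Passenger(id: Int, survived: Int, age: Option[Double])

    object InputDatasetSketch {
      // Hypothetical helper: wires an in-memory dataset into a workflow instead of a reader.
      def workflowOver(ds: Dataset[Passenger], resultFeature: OPFeature): OpWorkflow =
        new OpWorkflow()
          .setResultFeatures(resultFeature)
          .setInputDataset(ds, key = (p: Passenger) => p.id.toString)
    }
    ```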
  35. final def setInputRDD[T](rdd: RDD[T], key: (T) ⇒ String = ReaderKey.randomKey)(implicit arg0: scala.reflect.api.JavaUniverse.WeakTypeTag[T]): OpWorkflow.this.type

    Sets the input RDD, which contains columns corresponding to the raw features used in the workflow. The type of the RDD (RDD[T]) must match the type of the FeatureBuilders[T] used to generate the raw features.

    rdd

    input rdd for workflow

    key

    key extract function

    returns

    this workflow

    Definition Classes
    OpWorkflowCore
  36. final def setParameters(newParams: OpParams): OpWorkflow.this.type

    Sets stage and reader parameters for the run from an OpParams object.

    newParams

    new parameter values

    returns

    this workflow

  37. final def setReader(r: Reader[_]): OpWorkflow.this.type

    Set data reader that will be used to generate data frame for stages

    r

    reader for workflow

    returns

    this workflow

    Definition Classes
    OpWorkflowCore
  38. def setResultFeatures(features: OPFeature*): OpWorkflow.this.type

    Sets the stages of the workflow.

    By setting the final features, the stages used to generate them can be traced back through the parent features and origin stages. The input is a sequence of features (varargs) to support leaf feature generation (multiple endpoints in feature generation).

    features

    Final features generated by the workflow

  39. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  40. def toString(): String

    Definition Classes
    AnyRef → Any
  41. def train(persistEveryKStages: Int = OpWorkflowModel.PersistEveryKStages)(implicit spark: SparkSession): OpWorkflowModel

    Fit all of the estimators in the pipeline and return a pipeline model of only transformers. Uses data loaded as specified by the data reader to generate the initial data set.

    persistEveryKStages

    persist data in transforms every k stages for performance improvement

    returns

    a fitted pipeline model

  42. val uid: String

    unique id for the workflow

    Definition Classes
    OpWorkflow → OpWorkflowCore
  43. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  44. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  45. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  46. def withModelStages(model: OpWorkflowModel): OpWorkflow.this.type

    Replaces any estimators in this workflow with their corresponding fit models from the OpWorkflowModel passed in. Note that the Stages UIDs must EXACTLY correspond in order to be replaced so the same features and stages must be used in both the fitted OpWorkflowModel and this OpWorkflow. Any estimators that are not part of the OpWorkflowModel passed in will be trained when .train() is called on this OpWorkflow.

    model

    model containing fitted stages to be used in this workflow

    returns

    an OpWorkflow containing all of the stages from this model plus any new stages needed to generate the features not included in the fitted model
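
    A hedged usage sketch. The helper name `retrainReusingFitted` is hypothetical. The new workflow is assumed to be built from the same feature definitions as the fitted model (so that stage UIDs match), with `newResult` adding stages on top. The result features are set before `withModelStages` so that there are stages to replace.

    ```scala
    import com.salesforce.op.{OpWorkflow, OpWorkflowModel}
    import com.salesforce.op.features.OPFeature
    import com.salesforce.op.readers.Reader
    import org.apache.spark.sql.SparkSession

    object ModelStagesSketch {
      // Hypothetical helper: reuses already fitted stages and trains only the new ones.
      def retrainReusingFitted(fitted: OpWorkflowModel, newResult: OPFeature, reader: Reader[_])
                              (implicit spark: SparkSession): OpWorkflowModel =
        new OpWorkflow()
          .setResultFeatures(newResult)
          .setReader(reader)
          .withModelStages(fitted)
          .train()
    }
    ```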

  47. def withRawFeatureFilter[T](trainingReader: Option[Reader[T]], scoringReader: Option[Reader[T]], bins: Int = 100, minFillRate: Double = 0.001, maxFillDifference: Double = 0.90, maxFillRatioDiff: Double = 20.0, maxJSDivergence: Double = 0.90, maxCorrelation: Double = 0.95, correlationType: CorrelationType = CorrelationType.Pearson, protectedFeatures: Array[OPFeature] = Array.empty, protectedJSFeatures: Array[OPFeature] = Array.empty, textBinsFormula: (Summary, Int) ⇒ Int = RawFeatureFilter.textBinsFormula, timePeriod: Option[TimePeriod] = None): OpWorkflow.this.type

    Add a raw features filter to the workflow to look at fill rates and distributions of raw features and exclude features that do not meet specifications from modeling DAG

    T

    Type of the data read in

    trainingReader

    training reader to use in the filter; if not supplied, falls back to the reader specified for the workflow (note that this reader will take precedence over readers directly input to the workflow if both are supplied)

    scoringReader

    scoring reader to use in the filter; if not supplied, only the checks possible with training data alone are performed

    bins

    number of bins to use in estimating feature distributions

    minFillRate

    minimum non-null fraction of instances that a feature should contain

    maxFillDifference

    maximum absolute difference in fill rate between scoring and training data for a feature

    maxFillRatioDiff

    maximum difference in fill ratio (symmetric) between scoring and training data for a feature

    maxJSDivergence

    maximum Jensen-Shannon divergence between the training and scoring distributions for a feature

    protectedFeatures

    list of features that should never be removed (features that are used to create them will also be protected)

    protectedJSFeatures

    features that are protected from removal by JS divergence check

    textBinsFormula

    formula to compute the text features bin size. Input arguments are Summary and number of bins to use in computing feature distributions (histograms for numerics, hashes for strings). Output is the bins for the text features.

    timePeriod

    Time period used to apply the circular date transformation for date features; if not specified, the numeric feature transformation is used

    Annotations
    @Experimental()
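
    The fill-rate and distribution thresholds can be illustrated with a small, self-contained sketch. It is not the library's implementation: the object and function names are hypothetical, the way the checks are combined in `passes` is an assumption, and the Jensen-Shannon divergence is computed with a base-2 logarithm so that it lies in [0, 1]. Default values are copied from the signature above.

    ```scala
    object RawFeatureChecks {
      private def log2(x: Double): Double = math.log(x) / math.log(2.0)

      // Kullback-Leibler divergence in bits; bins with zero mass in p contribute nothing.
      private def kl(p: Seq[Double], q: Seq[Double]): Double =
        p.zip(q).collect { case (pi, qi) if pi > 0 => pi * log2(pi / qi) }.sum

      // Jensen-Shannon divergence between two normalized histograms over the same bins.
      def jsDivergence(p: Seq[Double], q: Seq[Double]): Double = {
        val m = p.zip(q).map { case (pi, qi) => (pi + qi) / 2 }
        0.5 * kl(p, m) + 0.5 * kl(q, m)
      }

      // Absolute difference in fill rate between training and scoring data.
      def fillDifference(trainFill: Double, scoreFill: Double): Double =
        math.abs(trainFill - scoreFill)

      // Symmetric fill ratio: the larger fill rate divided by the smaller one.
      def fillRatio(trainFill: Double, scoreFill: Double): Double = {
        val lo = math.min(trainFill, scoreFill)
        val hi = math.max(trainFill, scoreFill)
        if (hi == 0.0) 1.0 else if (lo == 0.0) Double.PositiveInfinity else hi / lo
      }

      // Assumed combination of the checks: a feature is kept only if it passes all of them.
      def passes(trainFill: Double, scoreFill: Double, trainHist: Seq[Double], scoreHist: Seq[Double],
                 minFillRate: Double = 0.001, maxFillDifference: Double = 0.90,
                 maxFillRatioDiff: Double = 20.0, maxJSDivergence: Double = 0.90): Boolean =
        trainFill >= minFillRate &&
          fillDifference(trainFill, scoreFill) <= maxFillDifference &&
          fillRatio(trainFill, scoreFill) <= maxFillRatioDiff &&
          jsDivergence(trainHist, scoreHist) <= maxJSDivergence
    }
    ```

    Two histograms with no overlapping mass have a divergence of 1.0 under this definition and would fail the default maxJSDivergence of 0.90, while identical histograms have a divergence of 0.0.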
  48. final def withWorkflowCV: OpWorkflow.this.type

    :: Experimental :: Decides whether the cross-validation/train-validation split will be done at the workflow level. This will remove issues with data leakage; however, it will impact the runtime.

    returns

    this workflow that will train part of the DAG in the cross-validation/train validation split

    Definition Classes
    OpWorkflowCore
    Annotations
    @Experimental()
