filters

Type Members

case class ExclusionReasons(name: String, key: Option[String], trainingUnfilledState: Boolean, trainingNullLabelLeaker: Boolean, scoringUnfilledState: Boolean, jsDivergenceMismatch: Boolean, fillRateDiffMismatch: Boolean, fillRatioDiffMismatch: Boolean, excluded: Boolean) extends Product with Serializable

Contains results of Raw Feature Filter tests for a given feature

name

feature name

key

map key associated with distribution (when the feature is a map)

trainingUnfilledState

training fill rate did not meet min required

trainingNullLabelLeaker

null indicator correlation (absolute) exceeded max allowed

scoringUnfilledState

scoring fill rate did not meet min required

jsDivergenceMismatch

distribution mismatch: JS Divergence exceeded max allowed

fillRateDiffMismatch

distribution mismatch: fill rate difference exceeded max allowed

fillRatioDiffMismatch

distribution mismatch: fill ratio difference exceeded max allowed

excluded

feature excluded after failing one or more tests
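
The flags above can be read back as a list of failed tests. The following is a minimal sketch, not library code: the case class is a local stand-in that copies the documented signature so the snippet runs on its own, and `failedTests` and the sample values are hypothetical.

```scala
// Stand-in that mirrors the documented signature, so the sketch runs
// without TransmogrifAI on the classpath.
case class ExclusionReasons(
  name: String,
  key: Option[String],
  trainingUnfilledState: Boolean,
  trainingNullLabelLeaker: Boolean,
  scoringUnfilledState: Boolean,
  jsDivergenceMismatch: Boolean,
  fillRateDiffMismatch: Boolean,
  fillRatioDiffMismatch: Boolean,
  excluded: Boolean
)

// Hypothetical helper (not part of the library): names of the tests that failed.
def failedTests(r: ExclusionReasons): Seq[String] = Seq(
  "trainingUnfilledState" -> r.trainingUnfilledState,
  "trainingNullLabelLeaker" -> r.trainingNullLabelLeaker,
  "scoringUnfilledState" -> r.scoringUnfilledState,
  "jsDivergenceMismatch" -> r.jsDivergenceMismatch,
  "fillRateDiffMismatch" -> r.fillRateDiffMismatch,
  "fillRatioDiffMismatch" -> r.fillRatioDiffMismatch
).collect { case (test, true) => test }

// Made-up example values for a non-map feature called "age".
val ageReasons = ExclusionReasons(
  name = "age", key = None,
  trainingUnfilledState = false, trainingNullLabelLeaker = true,
  scoringUnfilledState = false, jsDivergenceMismatch = true,
  fillRateDiffMismatch = false, fillRatioDiffMismatch = false,
  excluded = true
)

println(s"${ageReasons.name}: ${failedTests(ageReasons).mkString(", ")}")
```
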
case class FeatureDistribution(name: String, key: Option[String], count: Long, nulls: Long, distribution: Array[Double], summaryInfo: Array[Double], moments: Option[Moments] = None, cardEstimate: Option[TextStats] = None, `type`: FeatureDistributionType = FeatureDistributionType.Training) extends FeatureDistributionLike with Product with Serializable

Class containing summary information for a feature

name

name of the feature

key

map key associated with distribution (when the feature is a map)

count

total count of feature seen

nulls

number of empties seen in feature

distribution

binned counts of feature values (hashed for strings, evenly spaced bins for numerics)

summaryInfo

either min and max number of tokens for text data, or splits used for bins for numeric data
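
To make the relationship between `count`, `nulls`, and `distribution` concrete, here is a small sketch. It assumes that the fill rate is the fraction of non-empty values, `(count - nulls) / count`, and it uses a reduced stand-in class (`FeatureDistributionSketch`) rather than the real one; `fillRate` and `binProbabilities` are hypothetical helper names.

```scala
// Reduced stand-in: only the fields needed for this illustration.
case class FeatureDistributionSketch(
  name: String,
  key: Option[String],
  count: Long,                // total number of values seen
  nulls: Long,                // number of empty values seen
  distribution: Array[Double] // binned counts of the non-empty values
)

// Assumed definition: share of values that are filled (non-empty).
def fillRate(fd: FeatureDistributionSketch): Double =
  if (fd.count == 0L) 0.0 else (fd.count - fd.nulls) / fd.count.toDouble

// Turns binned counts into probabilities so two distributions can be compared.
def binProbabilities(fd: FeatureDistributionSketch): Array[Double] = {
  val total = fd.distribution.sum
  if (total == 0.0) fd.distribution else fd.distribution.map(_ / total)
}

// Made-up numbers: 200 rows, 50 of them empty, 150 values spread over 3 bins.
val ageTraining = FeatureDistributionSketch("age", None, 200L, 50L, Array(75.0, 45.0, 30.0))

println(fillRate(ageTraining))
println(binProbabilities(ageTraining).mkString(", "))
```
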
case class FilteredRawData(cleanedData: DataFrame, featuresToDrop: Array[OPFeature], mapKeysToDrop: Map[String, Set[String]], rawFeatureFilterResults: RawFeatureFilterResults) extends Product with Serializable

Contains RFF filtered data, features to drop, and results from RFF

cleanedData

RFF cleaned data

featuresToDrop

raw features dropped by RFF

mapKeysToDrop

keys in map features dropped by RFF

rawFeatureFilterResults

feature information calculated from the training data
class RawFeatureFilter[T] extends Serializable

Specialized stage that loads data and computes distributions and empty counts on raw features. This information is then used to determine which raw features should be excluded from the workflow DAG.

Note: Currently, raw features that are not explicitly blocklisted, but go unused because they are inputs to explicitly blocklisted features, are not present as raw features in the model or in ModelInsights. However, they are accessible from an OpWorkflowModel via getRawFeatureFilterResults().

T

datatype of the reader
case class RawFeatureFilterConfig(minFill: Double = 0.0, maxFillDifference: Double = Double.PositiveInfinity, maxFillRatioDiff: Double = Double.PositiveInfinity, maxJSDivergence: Double = 1.0, maxCorrelation: Double = 1.0, correlationType: CorrelationType = CorrelationType.Pearson, jsDivergenceProtectedFeatures: Seq[String] = Seq.empty, protectedFeatures: Seq[String] = Seq.empty) extends Product with Serializable

Contains configuration settings for Raw Feature Filter
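
Read against the defaults in the signature above, a default configuration appears to exclude nothing: no fill rate is below 0.0 and no divergence or correlation exceeds 1.0. The sketch below illustrates this with a stand-in class (`ConfigSketch`) that omits `correlationType` to stay self-contained. The stricter threshold values, the strict `<` comparison, and the helper `trainingUnfilled` are assumptions for illustration only.

```scala
// Stand-in copying the documented defaults (correlationType omitted).
case class ConfigSketch(
  minFill: Double = 0.0,
  maxFillDifference: Double = Double.PositiveInfinity,
  maxFillRatioDiff: Double = Double.PositiveInfinity,
  maxJSDivergence: Double = 1.0,
  maxCorrelation: Double = 1.0,
  jsDivergenceProtectedFeatures: Seq[String] = Seq.empty,
  protectedFeatures: Seq[String] = Seq.empty
)

// Defaults: thresholds that no feature can violate.
val permissive = ConfigSketch()

// Illustrative stricter settings; the numbers are made up for this example.
val strict = permissive.copy(
  minFill = 0.001,
  maxFillDifference = 0.9,
  maxFillRatioDiff = 20.0,
  maxJSDivergence = 0.9,
  maxCorrelation = 0.95,
  protectedFeatures = Seq("age")
)

// Assumed form of the minimum-fill test: fail when the fill rate is below minFill.
def trainingUnfilled(fillRate: Double, cfg: ConfigSketch): Boolean =
  fillRate < cfg.minFill

println(trainingUnfilled(0.0005, permissive))
println(trainingUnfilled(0.0005, strict))
```
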
trait RawFeatureFilterFormats extends AnyRef
case class RawFeatureFilterMetrics(name: String, key: Option[String], trainingFillRate: Double, trainingNullLabelAbsoluteCorr: Option[Double], scoringFillRate: Option[Double], jsDivergence: Option[Double], fillRateDiff: Option[Double], fillRatioDiff: Option[Double]) extends Product with Serializable

Contains raw feature metrics computed in Raw Feature Filter

name

feature name

key

map key associated with distribution (when the feature is a map)

trainingFillRate

proportion of values that are non-null (filled) in the training distribution

trainingNullLabelAbsoluteCorr

correlation between null indicator and the label in the training distribution

scoringFillRate

proportion of values that are non-null (filled) in the scoring distribution

jsDivergence

Jensen-Shannon (JS) divergence between the training and scoring distributions

fillRateDiff

absolute difference in fill rates between the training and scoring distributions

fillRatioDiff

ratio between the fill rates of the training and scoring distributions (larger over smaller)
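
The comparison metrics above can be sketched from first principles. This is an illustration under stated assumptions, not the library's implementation: JS divergence is computed with base-2 logarithms (so it lies in [0, 1], which matches the default `maxJSDivergence` of 1.0), the fill ratio is taken as the larger fill rate divided by the smaller, and all function names are hypothetical.

```scala
// Normalizes binned counts to probabilities.
def normalize(counts: Array[Double]): Array[Double] = {
  val total = counts.sum
  if (total == 0.0) counts else counts.map(_ / total)
}

def log2(x: Double): Double = math.log(x) / math.log(2.0)

// Kullback-Leibler divergence in bits; bins with p = 0 contribute nothing.
def kl(p: Array[Double], q: Array[Double]): Double =
  p.zip(q).collect { case (pi, qi) if pi > 0.0 => pi * log2(pi / qi) }.sum

// Jensen-Shannon divergence between two binned count vectors of equal length.
def jsDivergence(a: Array[Double], b: Array[Double]): Double = {
  val (p, q) = (normalize(a), normalize(b))
  val m = p.zip(q).map { case (x, y) => (x + y) / 2.0 } // mixture distribution
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

// Absolute difference in fill rates.
def fillRateDiff(train: Double, score: Double): Double = math.abs(train - score)

// Assumed definition: larger fill rate over smaller, infinite if the smaller is 0.
def fillRatioDiff(train: Double, score: Double): Double = {
  val (small, large) = (math.min(train, score), math.max(train, score))
  if (small == 0.0) Double.PositiveInfinity else large / small
}

println(jsDivergence(Array(10.0, 20.0, 30.0), Array(10.0, 20.0, 30.0))) // identical bins
println(jsDivergence(Array(5.0, 0.0), Array(0.0, 5.0)))                 // disjoint bins
println(fillRateDiff(0.8, 0.4))
println(fillRatioDiff(0.8, 0.4))
```

Identical distributions give a divergence of 0 and fully disjoint ones give 1, which is why a threshold of 1.0 can never be exceeded under this definition.
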
case class RawFeatureFilterResults(rawFeatureFilterConfig: RawFeatureFilterConfig = RawFeatureFilterConfig(), rawFeatureDistributions: Seq[FeatureDistribution] = Seq.empty, rawFeatureFilterMetrics: Seq[RawFeatureFilterMetrics] = Seq.empty, exclusionReasons: Seq[ExclusionReasons] = Seq.empty) extends Product with Serializable

Contains configuration and results from RawFeatureFilter

rawFeatureFilterConfig

configuration settings for RawFeatureFilter

rawFeatureDistributions

feature distributions calculated from training data

rawFeatureFilterMetrics

feature metrics calculated by RawFeatureFilter

exclusionReasons

results of RawFeatureFilter tests (reasons why feature is dropped or not)
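
One way to consume `exclusionReasons` is to split the excluded entries into whole features and individual map keys, which is the shape used by `mapKeysToDrop` in `FilteredRawData`. The sketch below assumes that an entry with `key = None` refers to a whole feature and an entry with `key = Some(k)` refers to one key of a map feature. `ReasonSketch` is a reduced stand-in, and both helpers and the sample data are hypothetical.

```scala
// Reduced stand-in for ExclusionReasons: only the fields used here.
case class ReasonSketch(name: String, key: Option[String], excluded: Boolean)

// Excluded entries without a map key: whole features.
def droppedFeatures(reasons: Seq[ReasonSketch]): Set[String] =
  reasons.collect { case ReasonSketch(name, None, true) => name }.toSet

// Excluded entries with a map key, grouped by feature name.
def droppedMapKeys(reasons: Seq[ReasonSketch]): Map[String, Set[String]] =
  reasons
    .collect { case ReasonSketch(name, Some(k), true) => name -> k }
    .groupBy(_._1)
    .map { case (name, pairs) => name -> pairs.map(_._2).toSet }

// Made-up results for two plain features and one map feature.
val sampleReasons = Seq(
  ReasonSketch("age", None, excluded = true),
  ReasonSketch("height", None, excluded = false),
  ReasonSketch("contactMap", Some("fax"), excluded = true),
  ReasonSketch("contactMap", Some("pager"), excluded = true),
  ReasonSketch("contactMap", Some("email"), excluded = false)
)

println(droppedFeatures(sampleReasons))
println(droppedMapKeys(sampleReasons))
```
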
case class Summary(min: Double, max: Double, sum: Double, count: Double) extends Product with Serializable

Class used to get summaries of prepared features to determine distribution binning strategy

min

minimum value seen for double, minimum number of tokens in one text for text

max

maximum value seen for double, maximum number of tokens in one text for text

sum

sum of values for double, total number of tokens for text

count

number of doubles for double, number of texts for text
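
Because each field is a min, a max, or a running total, two summaries can be merged field by field, and the merged range can then drive evenly spaced bins for numeric features. The following is a rough sketch of that idea using a stand-in class; `merge`, `fromValues`, and `evenSplits` are hypothetical names and are not taken from the library.

```scala
// Stand-in copying the documented fields.
case class SummarySketch(min: Double, max: Double, sum: Double, count: Double)

// Field-by-field merge of two partial summaries.
def merge(a: SummarySketch, b: SummarySketch): SummarySketch =
  SummarySketch(math.min(a.min, b.min), math.max(a.max, b.max), a.sum + b.sum, a.count + b.count)

// Summary of a non-empty sequence of numeric values.
def fromValues(xs: Seq[Double]): SummarySketch =
  xs.map(x => SummarySketch(x, x, x, 1.0)).reduce(merge)

// Evenly spaced split points covering [min, max]; bins must be positive.
def evenSplits(s: SummarySketch, bins: Int): Array[Double] = {
  val step = (s.max - s.min) / bins
  Array.tabulate(bins + 1)(i => s.min + i * step)
}

val valueSummary = fromValues(Seq(0.0, 2.0, 4.0, 10.0))
println(valueSummary)
println(evenSplits(valueSummary, 5).mkString(", "))
```
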

Value Members

object FeatureDistribution extends Serializable
object RawFeatureFilter extends Serializable
object RawFeatureFilterConfig extends RawFeatureFilterFormats with Serializable
object RawFeatureFilterResults extends RawFeatureFilterFormats with Serializable
object Summary extends Product with Serializable
