com.salesforce.op.stages.impl.feature
uid for instance
Input Features type
Indicates whether to attempt language detection.
Language detection threshold. If none of the detected languages has a confidence greater than the threshold, defaultLanguage is used.
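The fallback rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the `chooseLanguage` helper, the `threshold` parameter name, and the representation of detection results as a map from language code to confidence are all assumptions made for the example.

```scala
object LanguageFallbackSketch {
  // Hypothetical helper: `detected` maps a language code to a detection confidence.
  def chooseLanguage(
    detected: Map[String, Double],
    autoDetectLanguage: Boolean,
    threshold: Double,
    defaultLanguage: String
  ): String = {
    if (!autoDetectLanguage) defaultLanguage
    else {
      // Keep only languages whose confidence is strictly greater than the threshold
      val confident = detected.filter { case (_, confidence) => confidence > threshold }
      if (confident.isEmpty) defaultLanguage
      else confident.maxBy { case (_, confidence) => confidence }._1
    }
  }
}
```

With a threshold of 0.8, a detection result of `en -> 0.6, fr -> 0.7` would fall back to the default language, since no candidate clears the threshold.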
Checks the input length
input features
true if the input size is as expected, false otherwise
Check if the stage is serializable
Failure if not serializable
This method is used by several methods in Spark internals to make a copy of the instance with new parameters. The default implementation finds the constructor and makes a copy for any class, as long as all constructor params are vals; this is why type tags are written as implicit vals in base classes.
Note: the convention in Spark is for the uid to be a constructor argument, so that copies share a uid with the original (developers should follow this convention).
new parameters to add to the instance
a new instance with the same uid
Count unique values of each of the sequence and map key components in the dataset using HyperLogLog (HLL)
value type
dataset to count unique values
size of each sequence component
number of bits for HyperLogLog (HLL)
kryo serializer to serialize V value into array of bytes
class tag of V - needed by kryo
HyperLogLog (HLL) of unique values count for each of the sequence components and total rows count
Count unique values of each of the sequence components in the dataset using HyperLogLog (HLL)
value type
dataset to count unique values
size of each sequence component
number of bits for HyperLogLog (HLL)
kryo serializer to serialize V value into array of bytes
class tag of V - needed by kryo
HyperLogLog (HLL) of unique values count for each of the sequence components and total rows count
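To make the idea concrete, here is a self-contained sketch of counting distinct values per sequence position with one small HyperLogLog per position, returning the sketches together with the total row count. It is an illustration under assumptions: the `Hll` class and `countUniques` function are hypothetical, values are plain strings hashed with the standard library's `MurmurHash3` rather than serialized with kryo, and the large-range correction for 32-bit hashes is omitted.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  /** A tiny HyperLogLog with 2^bits registers, for illustration only. */
  final class Hll(bits: Int) {
    require(bits >= 4 && bits <= 16, "bits must be in [4, 16]")
    private val m = 1 << bits
    private val registers = new Array[Byte](m)

    def add(value: String): Unit = {
      val h = MurmurHash3.stringHash(value)
      val idx = h >>> (32 - bits) // top `bits` bits select the register
      val rest = h << bits        // remaining bits, left-aligned
      // rank = position of the first 1 bit within the remaining (32 - bits) bits
      val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - bits) + 1
      if (rank > registers(idx)) registers(idx) = rank.toByte
    }

    def estimate: Double = {
      val alpha = m match {
        case 16 => 0.673
        case 32 => 0.697
        case 64 => 0.709
        case _ => 0.7213 / (1.0 + 1.079 / m)
      }
      val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r.toDouble)).sum
      val zeros = registers.count(_ == 0)
      // Small-range correction: use linear counting while some registers are still empty
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    }
  }

  /** One HLL per sequence position, plus the total number of rows seen. */
  def countUniques(rows: Seq[Seq[String]], size: Int, bits: Int): (Seq[Hll], Long) = {
    val hlls = Seq.fill(size)(new Hll(bits))
    var total = 0L
    rows.foreach { row =>
      row.zip(hlls).foreach { case (value, hll) => hll.add(value) }
      total += 1
    }
    (hlls, total)
  }
}
```

More bits means more registers and a lower relative error (roughly 1.04 / sqrt(2^bits)) at the cost of memory, which is the trade-off the number-of-bits parameter controls.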
Default language to assume when autoDetectLanguage is disabled or fails to make a sufficiently confident prediction.
Spark operation on the dataset that produces the Dataset for the constructor fit function and then turns the output function into a Model
input data for this stage
a fitted model that will perform the transformation specified by the function defined in constructor fit
Function that fits the sequence model
Gets an input feature. Note: this method IS NOT safe to use outside the driver; please use the getTransientFeature method instead.
array of features
NoSuchElementException
if the features are not set
RuntimeException
in case one of the features is null
Gets the input features. Note: this method IS NOT safe to use outside the driver; please use the getTransientFeatures method instead.
array of features
NoSuchElementException
if the features are not set
RuntimeException
in case one of the features is null
Output features that will be created by this stage
feature of type OutputFeatures
Name of output feature (i.e. column created by this stage)
Gets an input feature at index i
input index
maybe an input feature
Gets the input Features
Hashes input sequence of values into OPVector using the supplied hashing params
HashingTF instance
Function to convert InputFeatures to an Array of FeatureLike
an Array of FeatureLike
Determine if the transformer should use a shared hash space for all features or not
true if the shared hash space is to be used, false otherwise
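The difference between a shared and a per-feature hash space can be sketched with the hashing trick. This is a hedged illustration only: `hashFeatures`, `bucket`, and the choice of giving each feature its own block of `numFeatures` slots in the non-shared case are assumptions for the example, not the stage's actual layout.

```scala
import scala.util.hashing.MurmurHash3

object HashSpaceSketch {
  // Non-negative bucket index for a token
  private def bucket(token: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(token) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  /** Hashes the tokens of each feature into term counts. */
  def hashFeatures(features: Seq[Seq[String]], numFeatures: Int, shared: Boolean): Array[Double] = {
    // Shared: all features write into one block. Separate: one block per feature.
    val size = if (shared) numFeatures else numFeatures * features.size
    val out = new Array[Double](size)
    features.zipWithIndex.foreach { case (tokens, i) =>
      val offset = if (shared) 0 else i * numFeatures
      tokens.foreach(t => out(offset + bucket(t, numFeatures)) += 1.0)
    }
    out
  }
}
```

A shared space keeps the output vector small but lets the same token from different features land in the same slot; separate spaces keep features apart at the cost of a vector that grows with the number of features.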
Minimum token length, >= 1.
Function to be called on getMetadata
Function to be called on setInput
unique name of the operation this stage performs
Function to convert OutputFeatures to an Array of FeatureLike
an Array of FeatureLike
Should output feature be a response? Yes, if any of the input features are.
true if the output feature should be a response
Function that prepares the input columns to be hashed. Note that the Murmur3 hashing algorithm is only defined for primitive types, so tuples need to be converted to strings. MultiPickList sets are hashed as is, since there is no meaningful order in the selected choices. Lists and vectors can be hashed with or without their indices, since order may be important. Maps are hashed as (key,value) strings.
element we are hashing (e.g. an OPList, OPMap, etc.)
an Iterable object corresponding to the hashed element
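The preparation rules described for this function can be sketched on plain Scala collections. The `prepare` function, the `hashWithIndex` flag, and the exact string format used for indexed elements are assumptions for illustration; the real stage operates on feature types such as OPList and OPMap.

```scala
object PrepareForHashSketch {
  /** Turns an element into the strings that would be fed to the hasher. */
  def prepare(element: Any, hashWithIndex: Boolean): Iterable[String] = element match {
    // Sets (e.g. selected pick list choices) have no meaningful order, so hash them as is
    case s: Set[_] => s.toSeq.map(_.toString)
    // Maps become "(key,value)" strings
    case m: Map[_, _] => m.toSeq.map { case (k, v) => s"($k,$v)" }
    // Ordered collections can keep their position as part of the hashed string
    case xs: Seq[_] =>
      if (hashWithIndex) xs.zipWithIndex.map { case (x, i) => s"($x,$i)" }
      else xs.map(_.toString)
    case other => Seq(other.toString)
  }
}
```

Including the index means the same value at two different positions hashes to different slots, which preserves order information at the cost of spreading counts more thinly.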
Input features that will be used by the stage
feature of type InputFeatures
Sets input features
feature like type
array of input features
this stage
Option to keep track of values that were missing
Option to keep track of text lengths
Stage unique name consisting of the stage operation name and uid
stage name
Indicates whether to convert all characters to lowercase before string operation.
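As a rough illustration of how lowercasing interacts with the minimum token length described earlier, the sketch below normalizes the text first and then drops short tokens. The `tokenize` function and the whitespace split are stand-ins chosen for the example; the stage's actual text analysis is not shown in this reference and may differ.

```scala
object TokenizeSketch {
  /** Lowercases (optionally), splits on whitespace, and drops tokens shorter than the minimum. */
  def tokenize(text: String, toLowercase: Boolean, minTokenLength: Int): Seq[String] = {
    require(minTokenLength >= 1, "minTokenLength must be >= 1")
    val normalized = if (toLowercase) text.toLowerCase else text
    // Empty strings produced by leading whitespace are removed by the length filter
    normalized.split("\\s+").toSeq.filter(_.length >= minTokenLength)
  }
}
```

Lowercasing before hashing makes "Red" and "red" land in the same slot, which is usually what is wanted for free text.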
This function translates the input and output features into Spark schema checks and the changes that will occur on the underlying data frame
schema of the input data frame
a new schema with the output features added
type tag for input
type tag for input value
type tag for output
type tag for output value
uid for instance
Convert a sequence of text features into a vector by detecting categoricals that are disguised as text. A categorical will be represented as a vector consisting of occurrences of the top K most common values of that feature, plus occurrences of non-top-K values and a null indicator (if enabled). Non-categoricals will be converted into a vector using the hashing trick. In addition, a null indicator is created for each non-categorical (if enabled).
Detection and removal of names in the input columns can be enabled with the sensitiveFeatureMode param.
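The categorical branch of this description can be sketched as follows. Everything here is a simplified assumption for illustration: the `maxCardinality` rule for deciding that a column is categorical, the `trackNulls` flag, and the vector layout (top values, then "other", then the null indicator) are hypothetical and not taken from the stage itself.

```scala
object SmartTextSketch {
  /** Treats a text column as categorical when it has at most `maxCardinality` distinct values. */
  def isCategorical(values: Seq[Option[String]], maxCardinality: Int): Boolean =
    values.flatten.distinct.size <= maxCardinality

  /** The `topK` most frequent non-null values (ties broken alphabetically). */
  def topValues(values: Seq[Option[String]], topK: Int): Seq[String] =
    values.flatten.groupBy(identity).toSeq
      .map { case (v, occurrences) => v -> occurrences.size }
      .sortBy { case (v, count) => (-count, v) }
      .take(topK)
      .map(_._1)

  /** Layout: one slot per top value, then "other", then the null indicator if tracked. */
  def pivot(value: Option[String], top: Seq[String], trackNulls: Boolean): Array[Double] = {
    val out = new Array[Double](top.size + 1 + (if (trackNulls) 1 else 0))
    value match {
      case Some(v) =>
        val i = top.indexOf(v)
        out(if (i >= 0) i else top.size) = 1.0 // unseen values go to the "other" slot
      case None =>
        if (trackNulls) out(top.size + 1) = 1.0
    }
    out
  }
}
```

A column that fails the categorical check would instead go through the hashing trick, with its own null indicator appended when that option is enabled.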