Input Features type
Checks the input length
input features
true if the input size is as expected, false otherwise
Check if the stage is serializable
Failure if not serializable
This method is used to make a copy of the instance with new parameters in several methods in Spark internals. The default implementation finds the constructor and makes a copy of any class, as long as all constructor params are vals (this is why type tags are written as implicit vals in base classes).
Note: the convention in Spark is to have the uid be a constructor argument, so that copies will share a uid with the original (developers should follow this convention).
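A minimal sketch of the copy convention described above, using hypothetical stand-in types (not the real Spark `Params`/`ParamMap` API): the uid is a constructor argument, so a copy made through the constructor shares the original's uid, and all constructor params are vals so a reflective default copy can read them back.

```scala
// Hypothetical stand-in for Spark's ParamMap.
case class ParamMap(pairs: Map[String, Any] = Map.empty)

// All constructor params are vals, matching the convention above.
class SimpleStage(val uid: String, val k: Int) {
  // Copy with extra params: a new instance that shares the original's uid.
  def copy(extra: ParamMap): SimpleStage = {
    val newK = extra.pairs.get("k").map(_.asInstanceOf[Int]).getOrElse(k)
    new SimpleStage(uid, newK)
  }
}

val stage  = new SimpleStage("lda_0001", k = 10)
val copied = stage.copy(ParamMap(Map("k" -> 20)))
// copied shares the uid "lda_0001" but carries the new k
```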
new parameters to add to the instance
a new instance with the same uid
the estimator to wrap
Gets names of parameters that control input columns for Spark stage
Gets an input feature Note: this method IS NOT safe to use outside the driver, please use getTransientFeature method instead
array of features
NoSuchElementException
if the features are not set
RuntimeException
in case one of the features is null
Gets the input features Note: this method IS NOT safe to use outside the driver, please use getTransientFeatures method instead
array of features
NoSuchElementException
if the features are not set
RuntimeException
in case one of the features is null
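The access rules above can be sketched in plain Scala with a hypothetical holder class (feature types simplified to `String`): the strict getter throws `NoSuchElementException` when features are not set and `RuntimeException` when one of them is null, while an index-based accessor returns an `Option` instead of throwing.

```scala
// Hypothetical holder illustrating the documented failure modes.
class FeatureHolder(private var features: Option[Array[String]] = None) {
  def setInput(fs: Array[String]): this.type = { features = Some(fs); this }

  // Strict access: throws if unset or if any feature is null.
  def getInputFeatures(): Array[String] = {
    val fs = features.getOrElse(
      throw new NoSuchElementException("Input features are not set"))
    if (fs.exists(_ == null))
      throw new RuntimeException("Input features cannot be null")
    fs
  }

  // Safe, Option-based access by index (no exceptions).
  def getInputFeature(i: Int): Option[String] =
    features.flatMap(fs => if (i >= 0 && i < fs.length) Option(fs(i)) else None)
}
```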
Method to access the local version of stage being wrapped
Option of the MLeap runtime version of the Spark stage after reloading as local
Output features that will be created by this stage
feature of type OutputFeatures
Gets names of parameters that control output columns for Spark stage
Name of output feature (i.e. column created by this stage)
Method to access the spark stage being wrapped
Option of spark ml stage
Gets a save path for wrapped spark stage
Gets an input feature at index i
input index
maybe an input feature
Gets the input Features
Function to convert InputFeatures to an Array of FeatureLike
an Array of FeatureLike
Function to be called on getMetadata
Function to be called on setInput
unique name of the operation this stage performs
Function to convert OutputFeatures to an Array of FeatureLike
an Array of FeatureLike
Should output feature be a response? Yes, if any of the input features are.
true if the output feature should be a response
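The rule above reduces to a one-line check; a sketch with a hypothetical simplified feature type:

```scala
// Hypothetical stand-in for a feature with a response flag.
case class Feat(name: String, isResponse: Boolean)

// The output is a response if any of the input features is one.
def outputIsResponse(inputs: Seq[Feat]): Boolean = inputs.exists(_.isResponse)
```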
Set param for checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations.
Set param for concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta").
This is the parameter to a Dirichlet distribution, where larger values mean more smoothing (more regularization).
If not set by the user, then docConcentration is set automatically. If set to singleton vector [alpha], then alpha is replicated to a vector of length k in fitting. Otherwise, the docConcentration vector must be length k. (default = automatic)
Optimizer-specific parameter settings:
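The docConcentration resolution rule above (singleton [alpha] replicated to length k, otherwise the vector must already be length k) can be sketched as a plain function; the name `resolveDocConcentration` is hypothetical, not Spark API:

```scala
// Hypothetical helper mirroring the documented docConcentration rule.
def resolveDocConcentration(alpha: Array[Double], k: Int): Array[Double] =
  alpha match {
    case Array(single)          => Array.fill(k)(single) // replicate [alpha] to length k
    case v if v.length == k     => v                     // already length k
    case v => throw new IllegalArgumentException(
      s"docConcentration must have length 1 or $k, got ${v.length}")
  }
```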
Input features that will be used by the stage
feature of type InputFeatures
Sets input features
feature like type
array of input features
this stage
Set param for number of topics (clusters) to infer. Must be > 1. Default: 10.
Set param for maximum number of iterations (>= 0). Default: 20.
Set param for optimizer or inference algorithm used to estimate the LDA model.
Currently supported (case-insensitive):
For details, see the following papers:
Set param for random seed.
Sets a save path for wrapped spark stage
For Online optimizer only: optimizer = "online".
Set param for fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
Note that this should be adjusted in sync with LDA.maxIter so the entire corpus is used. Specifically, set both so that maxIterations * miniBatchFraction >= 1.
Note: This is the same as the miniBatchFraction parameter in org.apache.spark.mllib.clustering.OnlineLDAOptimizer.
Default: 0.05, i.e., 5% of total documents.
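The constraint above is simple arithmetic; a quick sanity check (the helper name is ours, not part of the API):

```scala
// Does maxIter * miniBatchFraction cover the entire corpus at least once?
def coversCorpus(maxIter: Int, miniBatchFraction: Double): Boolean =
  maxIter * miniBatchFraction >= 1.0
```

With the default fraction of 0.05, at least 20 iterations are needed: `coversCorpus(20, 0.05)` holds while `coversCorpus(10, 0.05)` does not.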
Set param for concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms.
This is the parameter to a symmetric Dirichlet distribution.
Note: The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
If not set by the user, then topicConcentration is set automatically. (default = automatic)
Optimizer-specific parameter settings:
Stage unique name consisting of the stage operation name and uid
stage name
This function translates the input and output features into spark schema checks and changes that will occur on the underlying data frame
schema of the input data frame
a new schema with the output features added
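A simplified model of that schema translation, with hypothetical types standing in for Spark's `StructType`/`StructField`: the input features are checked against the incoming schema, then the output feature's column is appended.

```scala
// Hypothetical simplified schema types (stand-ins for StructField/StructType).
case class Field(name: String, dataType: String)
case class Schema(fields: Seq[Field])

// Check inputs exist in the schema, then append the output column.
def transformSchema(schema: Schema, inputs: Seq[String], output: Field): Schema = {
  val missing = inputs.filterNot(n => schema.fields.exists(_.name == n))
  require(missing.isEmpty, s"Input features missing from schema: $missing")
  Schema(schema.fields :+ output)
}
```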
type tag for input
type tag for output
type tag for output value
stage uid
Wrapper around spark ml LDA (Latent Dirichlet Allocation) for use with OP pipelines
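A hedged configuration sketch of how such a wrapper might be used in a pipeline, pulling together the params documented above. The class name, package path, input feature, and setter names are assumptions based on the conventions this doc describes, not verified API; this fragment needs a full Spark/OP environment to run.

```scala
// Assumed package path and class name for the wrapper (not verified).
import com.salesforce.op.stages.impl.feature.OpLDA

val lda = new OpLDA()
  .setInput(textVector)      // assumed: a vectorized text feature
  .setK(10)                  // number of topics, must be > 1
  .setMaxIter(20)
  .setOptimizer("online")
  .setSubsamplingRate(0.05)  // with maxIter = 20, covers the corpus once

// The output feature (topic distribution) to wire into downstream stages.
val topicDistribution = lda.getOutput()
```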