com.salesforce.op.stages.impl.classification
Input Features type
Checks the input length
input features
true if the input size is as expected, false otherwise
Check if the stage is serializable
Failure if not serializable
This method is used to make a copy of the instance with new parameters in several methods in Spark internals. By default it will find the constructor and make a copy of any class (AS LONG AS ALL CONSTRUCTOR PARAMS ARE VALS, this is why type tags are written as implicit vals in base classes).
Note that the convention in Spark is to have the uid be a constructor argument, so that copies will share a uid with the original (developers should follow this convention).
new parameters to add to the instance
a new instance with the same uid
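A minimal sketch of that contract (not from the source; it simply exercises Spark's standard copy mechanism):

import com.salesforce.op.stages.impl.classification.OpXGBoostClassifier
import org.apache.spark.ml.param.ParamMap

// Copying with no extra params yields a new instance that shares the original's uid.
val original = new OpXGBoostClassifier()
val copied = original.copy(ParamMap.empty)
assert(copied.uid == original.uid)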
Function that fits the binary model
Gets names of parameters that control input columns for Spark stage
Gets an input feature. Note: this method IS NOT safe to use outside the driver, please use the getTransientFeature method instead
array of features
NoSuchElementException
if the features are not set
RuntimeException
in case one of the features is null
Gets the input features. Note: this method IS NOT safe to use outside the driver, please use the getTransientFeatures method instead
array of features
NoSuchElementException
if the features are not set
RuntimeException
in case one of the features is null
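Since the array accessors above are only safe on the driver, the transient accessors are the recommended alternative inside executors. A hedged sketch under that assumption (the record type and feature names are illustrative, not from the source):

import com.salesforce.op.features.FeatureBuilder
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification.OpXGBoostClassifier
import org.apache.spark.ml.linalg.Vectors

// Illustrative input record and features
case class Record(label: Double, values: Seq[Double])
val label = FeatureBuilder.RealNN[Record].extract(_.label.toRealNN).asResponse
val featVec = FeatureBuilder.OPVector[Record]
  .extract(r => Vectors.dense(r.values.toArray).toOPVector).asPredictor

val stage = new OpXGBoostClassifier().setInput(label, featVec)
stage.getInputFeatures      // driver-only access to the raw input features
stage.getTransientFeatures  // TransientFeature wrappers, safe to use off the driver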
Method to access the local version of the stage being wrapped
Option of the MLeap runtime version of the Spark stage after reloading as local
Output features that will be created by this stage
feature of type OutputFeatures
Gets names of parameters that control output columns for Spark stage
Name of output feature (i.e. column created by this stage)
Method to access the spark stage being wrapped
Option of spark ml stage
Gets a save path for wrapped spark stage
Gets an input feature at index i
input index
maybe an input feature
Gets the input Features
Function to convert InputFeatures to an Array of FeatureLike
an Array of FeatureLike
Function to be called on getMetadata
Function to be called on setInput
Short unique name of the operation this stage performs
operation name
Function to convert OutputFeatures to an Array of FeatureLike
an Array of FeatureLike
Should output feature be a response? Yes, if any of the input features are.
true if the output feature should be a response
the predictor to wrap
L1 regularization term on weights; increasing this value will make the model more conservative. [default=0]
Initial prediction (aka base margin) column name.
Specify the learning task and the corresponding learning objective. options: reg:linear, reg:logistic, binary:logistic, binary:logitraw, count:poisson, multi:softmax, multi:softprob, rank:pairwise, reg:gamma. default: reg:linear
Checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the trained model will get checkpointed every 10 iterations. Note: checkpoint_path must also be set if the checkpoint interval is greater than 0.
The hdfs folder to load and save checkpoint boosters. default: empty_string
Subsample ratio of columns for each split, in each level. [default=1] range: (0,1]
Subsample ratio of columns when constructing each tree. [default=1] range: (0,1]
Customized evaluation function provided by user. default: null
Customized objective function provided by user. default: null
Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. [default=0.3] range: [0,1]
Evaluation metrics for validation data; a default metric will be assigned according to the objective (rmse for regression, error for classification, and mean average precision for ranking). options: rmse, mae, logloss, error, merror, mlogloss, auc, aucpr, ndcg, map, gamma-deviance
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be. [default=0] range: [0, Double.MaxValue]
Growth policy for fast histogram algorithm
Input features that will be used by the stage
feature of type InputFeatures
Sets input features
feature like type
array of input features
this stage
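For this particular classifier the inputs are the response feature and the feature vector, and setInput returns the stage itself so calls can be chained. A sketch under that assumption, reusing the illustrative label and featVec features from the earlier sketch:

// setInput returns this stage; the configured stage can then be queried for its output feature.
val classifier = new OpXGBoostClassifier().setInput(label, featVec)
val prediction = classifier.getOutput()  // the Prediction feature produced by this stage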
L2 regularization term on weights; increasing this value will make the model more conservative. [default=1]
Parameter of linear booster. L2 regularization term on bias, default 0 (no L1 reg on bias because it is not important)
Maximum number of bins in histogram
Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update. [default=0] range: [0, Double.MaxValue]
Maximum depth of a tree; increasing this value will make the model more complex and more likely to overfit. [default=6] range: [1, Int.MaxValue]
Maximum number of nodes to be added. Only relevant when grow_policy=lossguide is set.
Defines the expected optimization of the evaluation metrics: true to maximize, otherwise minimize
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. [default=1] range: [0, Double.MaxValue]
The value treated as missing
Parameter of Dart booster. type of normalization algorithm, options: {'tree', 'forest'}. [default="tree"]
Number of threads used per worker. default 1
Number of classes
If non-zero, the training will be stopped after a specified number of consecutive increases in any evaluation metric.
The number of rounds for boosting
Number of workers used to train xgboost model. default: 1
Specify the learning task and the corresponding learning objective. options: reg:squarederror, reg:logistic, binary:logistic, binary:logitraw, count:poisson, multi:softmax, multi:softprob, rank:pairwise, reg:gamma. default: reg:squarederror
Objective type used for training. For options see ml.dmlc.xgboost4j.scala.spark.params.LearningTaskParams
Parameter of Dart booster. dropout rate. [default=0.0] range: [0.0, 1.0]
Parameter for Dart booster. Type of sampling algorithm. "uniform": dropped trees are selected uniformly. "weighted": dropped trees are selected in proportion to weight. [default="uniform"]
Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). [default=1]
Random seed for the C++ part of XGBoost and train/test splitting.
0 means printing running messages, 1 means silent mode. default: 0
This is only used for the approximate greedy algorithm. This roughly translates into O(1 / sketch_eps) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee on sketch accuracy. [default=0.03] range: (0, 1)
Parameter of Dart booster. probability of skip dropout. If a dropout is skipped, new trees are added in the same manner as gbtree. [default=0.0] range: [0.0, 1.0]
Sets a save path for wrapped spark stage
Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees, and this will prevent overfitting. [default=1] range: (0,1]
The maximum time to wait for the job requesting new workers. default: 30 minutes
Rabit tracker configurations. The parameter must be provided as an instance of the TrackerConf class, which has the following definition:
case class TrackerConf(workerConnectionTimeout: Duration, trainingTimeout: Duration, trackerImpl: String)
See below for detailed explanations.
Choice between "python" or "scala". The former utilizes the Java wrapper of the Python Rabit tracker (in dmlc_core), and does not support timeout settings. The "scala" version removes Python components, and fully supports timeout settings.
The timeout value should take the time of data loading and pre-processing into account, due to the lazy execution of Spark's operations. Alternatively, you may force Spark to perform data transformation before calling XGBoost.train(), so that this timeout truly reflects the connection delay. Set a reasonable timeout value to prevent model training/testing from hanging indefinitely, possibly due to network issues. Note that a zero timeout value means to wait indefinitely (equivalent to Duration.Inf). Ignored if the tracker implementation is "python".
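Assuming the TrackerConf definition quoted above, a Scala-tracker configuration with explicit timeouts might look like the following sketch (field names follow the definition above; the values are illustrative):

import scala.concurrent.duration._
import ml.dmlc.xgboost4j.scala.spark.TrackerConf

// Scala tracker implementation with explicit connection and training timeouts
val trackerConf = TrackerConf(
  workerConnectionTimeout = 60.seconds,
  trainingTimeout = 30.minutes,
  trackerImpl = "scala"
)
// trackerConf would then be supplied through the stage's tracker-conf parameter.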
Fraction of training points to use for testing.
The tree construction algorithm used in XGBoost. options: {'auto', 'exact', 'approx'} [default='auto']
Whether to use external memory as cache. default: false
Weight column name. If this is not set or empty, we treat all instance weights as 1.0.
Stage unique name consisting of the stage operation name and uid
stage name
This function translates the input and output features into spark schema checks and changes that will occur on the underlying data frame
schema of the input data frame
a new schema with the output features added
Type tag of the output
Type tag of the output value
stage uid
Wrapper around the XGBoost classifier XGBoostClassifier
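Putting the pieces together, a typical configuration of the wrapper might look like the sketch below. The setter names are assumed to mirror the underlying XGBoost parameters documented above, and label and featVec are the illustrative features from the earlier sketch:

val xgbPrediction = new OpXGBoostClassifier()
  .setInput(label, featVec)
  .setEta(0.1)        // step size shrinkage
  .setMaxDepth(6)     // maximum tree depth
  .setNumRound(100)   // number of boosting rounds
  .getOutput()        // the Prediction feature created by this stage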