Case class for holding results of the Chi-squared statistical test we use for calculating Cramer's V
Cramer's V value
Actual Chi-squared statistic
P-value
Container for association rule confidence and supports
Array of maximum confidence values, one per contingency matrix row
Array of support values for each categorical value, one per contingency matrix row
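As a rough illustration of the two arrays above, the following sketch derives them from a contingency matrix of counts. The names `AssociationRuleSketch`, `Rules`, and `fromContingency` are hypothetical. The definitions used (confidence = count divided by row total, support = row total divided by grand total) are the usual association-rule definitions and are assumed here rather than taken from the library source.

```scala
object AssociationRuleSketch {
  // Hypothetical container mirroring the two fields described above.
  final case class Rules(maxConfidences: Array[Double], supports: Array[Double])

  // counts(i)(j) = number of co-occurrences of feature choice i with label value j.
  def fromContingency(counts: Array[Array[Double]]): Rules = {
    val rowTotals = counts.map(_.sum)
    val total = rowTotals.sum
    // Assumed definition: confidence of "choice i => label j" is counts(i)(j) / rowTotal(i);
    // we keep the maximum over all label values for each row.
    val maxConfidences = counts.zip(rowTotals).map { case (row, rowTotal) =>
      if (rowTotal > 0.0) row.max / rowTotal else 0.0
    }
    // Assumed definition: support of choice i is the fraction of all counts in row i.
    val supports = rowTotals.map(rowTotal => if (total > 0.0) rowTotal / total else 0.0)
    Rules(maxConfidences, supports)
  }
}
```

For example, `AssociationRuleSketch.fromContingency(Array(Array(8.0, 2.0), Array(5.0, 5.0)))` yields one maximum confidence and one support value per row of the matrix.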
Container class for statistics calculated from contingency matrices constructed from categorical variables
Chi-squared test results for the given contingency matrix
Map from each feature name in the feature vector to a map of pointwise mutual information values between that feature and every value the label can take
Actual (unfiltered) contingency matrix that the rest of the results are calculated from
Map from each feature name in the feature vector to its mutual information with the label
Association rule details (confidences + supports)
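The pointwise mutual information and mutual information fields above can be illustrated with a small sketch over a contingency matrix of counts. The names `MutualInfoSketch` and `mutualInfo` are hypothetical, the logarithm base (2) is an assumption, and the result is returned as plain arrays rather than the maps described above.

```scala
object MutualInfoSketch {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // counts(i)(j) = co-occurrences of feature choice i with label value j.
  // Returns (pmi, mi), where pmi(i)(j) = log2( p(i, j) / (p(i) * p(j)) )
  // and mi is the sum over all cells of p(i, j) * pmi(i)(j).
  def mutualInfo(counts: Array[Array[Double]]): (Array[Array[Double]], Double) = {
    val total = counts.map(_.sum).sum
    val rowP = counts.map(_.sum / total)           // marginal of each feature choice
    val colP = counts.transpose.map(_.sum / total) // marginal of each label value
    val pmi = counts.indices.map { i =>
      counts(i).indices.map { j =>
        val joint = counts(i)(j) / total
        // A pair that never co-occurs has PMI of negative infinity.
        if (joint > 0.0) log2(joint / (rowP(i) * colP(j))) else Double.NegativeInfinity
      }.toArray
    }.toArray
    // Cells with zero joint probability contribute nothing to the mutual information.
    val mi = (for {
      i <- counts.indices
      j <- counts(i).indices
      joint = counts(i)(j) / total
      if joint > 0.0
    } yield joint * pmi(i)(j)).sum
    (pmi, mi)
  }
}
```

Under these assumptions, a matrix in which each feature choice determines the label exactly gives positive PMI on the matching cells, while a matrix with identical rows gives PMI and mutual information of zero.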
Two-element result tuple containing a map from labels to values, used for, e.g., pointwise mutual information or the contingency matrix itself.
Assumes that a MultivariateStatisticalSummary has already been computed on the RDD, so that its information can be reused here. Defines an RDD aggregation that calculates the correlation of every feature with the label. Data is assumed to be laid out in an RDD[org.apache.spark.mllib.linalg.Vector] where the label is the last element of each vector.
Input RDD in which each row is a single vector containing the features, with the label as the last element
Array of correlations of each feature vector element with the label
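The data layout and the returned array can be sketched on a local collection, without Spark. This is only an illustration: `LabelCorrelationSketch` and `correlationsWithLabel` are hypothetical names, a `Seq[Array[Double]]` stands in for the RDD of vectors, the column means computed inline stand in for the precomputed summary, and Pearson correlation is assumed.

```scala
object LabelCorrelationSketch {
  // Each row holds the feature values followed by the label as its last element.
  // Returns one Pearson correlation per feature column (the label column is excluded).
  def correlationsWithLabel(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.size.toDouble
    val dim = rows.head.length
    val labelIdx = dim - 1
    // In the setting described above these means would come from the precomputed summary.
    val means = Array.tabulate(dim)(k => rows.map(_(k)).sum / n)
    val labelSumSq = rows.map(r => math.pow(r(labelIdx) - means(labelIdx), 2)).sum
    Array.tabulate(dim - 1) { k =>
      val crossSum = rows.map(r => (r(k) - means(k)) * (r(labelIdx) - means(labelIdx))).sum
      val featureSumSq = rows.map(r => math.pow(r(k) - means(k), 2)).sum
      // A constant feature or constant label makes the denominator zero and yields NaN.
      crossSum / math.sqrt(featureSumSq * labelSumSq)
    }
  }
}
```

A distributed version would accumulate the cross and squared sums per partition and merge them, which is what makes having the means available beforehand convenient.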
Calculates all of the statistics we use that come from contingency matrices between categorical features and categorical labels and stores them in a ContingencyStats case class.
Matrix of co-occurrences of feature values with label values. Each row represents a different feature choice, while each column represents a different label value.
ContingencyStats object containing all the statistics we calculate from contingency matrices
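A minimal sketch of the chi-squared statistic and Cramer's V for such a matrix follows. The names `CramersVSketch`, `ChiSq`, and `chiSquared` are hypothetical, dropping empty rows and columns before the test is an assumption, and the p-value is omitted because it needs a chi-squared distribution function that plain Scala does not provide.

```scala
object CramersVSketch {
  // Hypothetical result holder; the p-value is deliberately left out of this sketch.
  final case class ChiSq(cramersV: Double, chiSquaredStat: Double)

  // counts(i)(j) = co-occurrences of feature choice i (row) with label value j (column).
  def chiSquared(counts: Array[Array[Double]]): ChiSq = {
    // Assumption: empty rows and columns are dropped, since they give zero expected counts.
    val rows = counts.filter(_.sum > 0.0)
    val keepCols: Seq[Int] =
      if (rows.isEmpty) Seq.empty[Int]
      else rows.head.indices.filter(j => rows.map(_(j)).sum > 0.0)
    val m = rows.map(r => keepCols.map(r(_)).toArray)
    val dof = math.min(m.length - 1, keepCols.size - 1)
    if (dof < 1) ChiSq(Double.NaN, Double.NaN) // the test is undefined for a single row or column
    else {
      val n = m.map(_.sum).sum
      val rowTotals = m.map(_.sum)
      val colTotals = m.transpose.map(_.sum)
      // Pearson chi-squared: sum over cells of (observed - expected)^2 / expected.
      val chi2 = (for {
        i <- rowTotals.indices
        j <- colTotals.indices
      } yield {
        val expected = rowTotals(i) * colTotals(j) / n
        math.pow(m(i)(j) - expected, 2) / expected
      }).sum
      // Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))).
      ChiSq(math.sqrt(chi2 / (n * dof)), chi2)
    }
  }
}
```

With this definition, Cramer's V lies between 0 (rows have identical label distributions) and 1 (each row maps to exactly one label value).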
Same as the contingencyStats method, but specialized to MultiPickLists. The standard contingency table stats are not technically valid for MultiPickLists because the choices are not independent of each other (MultiPickLists are multi-hot encoded instead of one-hot encoded).
There are several strategies for handling this and computing statistics similar to Cramer's V. We take inspiration from https://cran.r-project.org/web/packages/MRCV/vignettes/MRCV-vignette.pdf but use a slightly different scheme: we compute stats from a separate 2 x numLabels contingency matrix for each choice, and take the maximum of the resulting Cramer's V values (one per choice) as the Cramer's V value for the entire MultiPickList. See BadFeatureZooTest for tests of how this performs on different types of relations between MultiPickLists and the label.
Matrix of co-occurrences of feature values with label values. Each row represents a different feature choice, while each column represents a different label value.
Array of counts of each label, used to construct the 2 x numLabels contingency matrices for each choice
ContingencyStats object containing all the statistics we calculate from contingency matrices
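The per-choice scheme described above can be sketched as follows. This is an interpretation rather than the library's code: `MultiPickListSketch`, `cramersV`, and `maxCramersV` are hypothetical names, the second row of each 2 x numLabels matrix is assumed to be the label counts minus the choice's row (records where the choice is absent), and an undefined Cramer's V is treated as 0.

```scala
object MultiPickListSketch {
  // Cramer's V of a small dense matrix; returns 0.0 when the test is undefined (assumption).
  def cramersV(m: Array[Array[Double]]): Double = {
    val n = m.map(_.sum).sum
    val rowTotals = m.map(_.sum)
    val colTotals = m.transpose.map(_.sum)
    // Degrees of freedom count only non-empty rows and columns.
    val dof = math.min(rowTotals.count(_ > 0.0) - 1, colTotals.count(_ > 0.0) - 1)
    if (dof < 1) 0.0
    else {
      val chi2 = (for {
        i <- rowTotals.indices
        j <- colTotals.indices
        expected = rowTotals(i) * colTotals(j) / n
        if expected > 0.0 // skip cells in empty rows or columns
      } yield math.pow(m(i)(j) - expected, 2) / expected).sum
      math.sqrt(chi2 / (n * dof))
    }
  }

  // counts(i)(j): records where choice i is selected and the label takes value j.
  // labelCounts(j): total number of records with label value j.
  def maxCramersV(counts: Array[Array[Double]], labelCounts: Array[Double]): Double = {
    require(counts.nonEmpty, "need at least one choice")
    counts.map { present =>
      // Assumed second row: records with each label value where this choice is absent.
      val absent = labelCounts.zip(present).map { case (total, p) => total - p }
      cramersV(Array(present, absent))
    }.max
  }
}
```

Treating each choice as its own present/absent variable sidesteps the dependence between choices, and taking the maximum means a single strongly predictive choice is enough to flag the whole MultiPickList.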