
These "near-zero-variance" predictors may need to be identified and eliminated prior to modeling. For example, tabulating the nR11 descriptor in the MDRR data shows only a handful of unique, highly unbalanced values:

data.frame(table(mdrrDescr$nR11))
#   Var1 Freq

The concern here is that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples, or that a few samples may have an undue influence on the model. To identify these types of predictors, the following two metrics can be calculated:

* the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data, and
* the "percent of unique values", the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.

If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance. We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution; requiring both criteria prevents such predictors from being flagged.
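As a rough sketch, the two metrics can be computed by hand for a single predictor, or obtained from caret's nearZeroVar() function (the saveMetrics usage below assumes the current argument names; the helper name nzv_metrics is purely illustrative):

```r
library(caret)

## Illustrative helper: compute the two metrics for a single predictor.
## (Only a sketch; caret's nearZeroVar() reports the same quantities.)
nzv_metrics <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) unname(tab[1] / tab[2]) else 0
  pct_unique <- 100 * length(unique(x)) / length(x)
  c(freqRatio = freq_ratio, percentUnique = pct_unique)
}

data(mdrr)                        # loads mdrrDescr and mdrrClass
nzv_metrics(mdrrDescr$nR11)       # very large freqRatio, tiny percentUnique

## The same check with caret itself (default cutoffs assumed to be
## freqCut = 95/5 and uniqueCut = 10)
metrics <- nearZeroVar(mdrrDescr, saveMetrics = TRUE)
head(metrics)
filtered <- mdrrDescr[, !metrics$nzv]   # drop the flagged predictors
```

Predictors flagged in the nzv column can then be removed before the data are split, which addresses the concern about zero-variance predictors appearing in resampling sub-samples.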


maxDissim is used to create sub-samples using a maximum dissimilarity approach. This is particularly useful for unsupervised learning where there are no response variables. Suppose a data set A with m samples and a larger data set B with n samples. We may want to create a sub-sample from B that is diverse when compared to A. For each sample in B, the function calculates the m dissimilarities between that point and each point in A. The most dissimilar point in B is added to A and the process continues. There are many methods in R to calculate dissimilarity. caret includes two scoring functions, minDiss and sumDiss, that can be used to maximize the minimum and total dissimilarities, respectively.

As an example, the figure below shows a scatter plot of two chemical descriptors for the Cox2 data. Using an initial random sample of 5 compounds, we can select 20 more compounds from the data so that the new compounds are most dissimilar from the initial 5 that were specified. The panels in the figure show the results using several combinations of distance metrics and scoring functions. For these data, the distance measure has less of an impact than the scoring method for determining which compounds are most dissimilar.
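A minimal sketch of this workflow on the Cox2 data follows, assuming caret's data(cox2) provides the cox2Descr descriptor table; the choice of the first two descriptor columns, the random seed, and the scaling step are illustrative only:

```r
library(caret)

data(cox2)                         # loads cox2Descr, cox2IC50, cox2Class
## Two descriptor columns for a two-dimensional illustration;
## which columns to use is an arbitrary choice here.
descr <- scale(cox2Descr[, 1:2])

set.seed(1)
start <- sample(seq_len(nrow(descr)), 5)   # initial random sample of 5 compounds
pool  <- descr[-start, ]

## Pick 20 more compounds that are most dissimilar from the starting set.
## obj = minDiss scores each candidate by its minimum dissimilarity to the
## points already selected; obj = sumDiss would use the total dissimilarity.
new_rows <- maxDissim(descr[start, ], pool, n = 20, obj = minDiss)

## new_rows indexes the rows of 'pool'
selected <- pool[new_rows, , drop = FALSE]
```

Swapping in obj = sumDiss changes only how candidate compounds are scored, which is the comparison shown across the figure's panels.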
