The Smith Method (TSM)

TSM (The Smith Method) selects a desired number of features (4 by default) by deliberately dropping highly correlated features, i.e., it picks a set of representative features that best explain a binary outcome. In plain language, it works as follows. The first representative feature is the one with the highest AUC (Area Under the ROC Curve) among all features. The next representative feature is the one with the highest AUC among the features that remain after dropping those highly correlated with the first representative feature. The third, fourth, and subsequent representative features are picked in the same way as the second.
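The selection procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the TSM package source: the helper names feature_auc and greedy_select are hypothetical, the AUC is computed with the rank-sum identity and treated as direction-agnostic, "highly correlated" is taken to mean an absolute correlation above a single threshold, and the loop simply stops once k features are chosen.

```r
# Illustrative sketch only; NOT the TSM package implementation.

# AUC of one numeric feature against a 0/1 outcome (rank-sum identity).
feature_auc <- function(score, y) {
  r <- rank(score)                      # average ranks handle ties
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  auc <- (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)                     # assumption: direction-agnostic AUC
}

# Hypothetical helper: greedily pick up to k representative features.
greedy_select <- function(X, y, threshold = 0.4, k = 4, method = "spearman") {
  aucs <- vapply(X, feature_auc, numeric(1), y = y)
  cors <- abs(cor(X, method = method))  # assumption: absolute correlation
  remaining <- names(sort(aucs, decreasing = TRUE))
  selected <- character(0)
  while (length(remaining) > 0 && length(selected) < k) {
    best <- remaining[1]                # highest AUC among the survivors
    selected <- c(selected, best)
    # drop the pick itself and every feature highly correlated with it
    remaining <- remaining[cors[best, remaining] <= threshold]
  }
  selected
}

# Small deterministic toy data: B nearly duplicates A, C is less correlated.
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
toy <- data.frame(
  A = c(1, 2, 3, 4, 5, 6, 7, 8),
  B = c(1, 2, 3, 5, 4, 6, 7, 8),
  C = c(1, 5, 2, 6, 3, 7, 4, 8)
)
result <- greedy_select(toy, y, threshold = 0.7, k = 2)
print(result)  # "A" "C": B is dropped because it is highly correlated with A
```

In this toy case A has the highest AUC and is picked first; B is then discarded for being almost identical to A, so C becomes the second representative even though its own AUC is lower than B's.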

Usage

TSM(x, method = "spearman", corr = seq(0.1, 0.7, by = 0.1), k = 4, verbose = F)

Arguments

x: The input data. The examples below pass a data frame read with read.csv() rather than a file path.
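method: A character string, either "spearman" (the default shown in Usage) or "pearson"; it is passed on as the method argument of cor().
corr: A numeric vector of correlation-coefficient thresholds (default: seq(0.1, 0.7, by = 0.1)).
k: The number of desired features (default: 4).
verbose: Logical (default: FALSE). If TRUE, a list of data.tables is returned instead of a single data.table.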

Value

A data.table (default), or a list of data.tables when verbose = TRUE.

Examples

input=read.csv(system.file("extdata","demo_input.csv",package="TSM")) # read the example input 
TSM(x=input) # run TSM with default parameters
#> calculating AUC for each features...
#> cor0.1
#> cor0.2
#> cor0.3
#> cor0.4
#> cor0.5
#> cor0.6
#> cor0.7
#>    Cor Num features                                                 Features
#> 1: 0.4            4                                             F1,F7,F8,F14
#> 2: 0.5            7                                  F1,F7,F13,F8,F9,F14,F15
#> 3: 0.7           16 F1,F4,F6,F5,F7,F13,F8,F9,F12,F11,F14,F15,F21,F17,F18,F20
#> 4: 0.6           11                   F1,F5,F7,F13,F8,F9,F14,F15,F21,F17,F18
#> 5: 0.1            1                                                       F1
#> 6: 0.2            1                                                       F1
#> 7: 0.3            1                                                       F1
#>    Best features      AIC      BIC       AUC AUC(LPOCV)
#> 1:  F1,F7,F8,F14 106.9498 120.8872 0.8815629  0.8659951
#> 2:  F1,F7,F13,F8 108.8605 122.7980 0.8806471  0.8580586
#> 3:   F1,F4,F6,F5 111.8757 125.8131 0.8684371  0.8489011
#> 4:  F1,F5,F7,F13 111.5937 125.5312 0.8666056  0.8476801
#> 5:            F1 113.1852 118.7602 0.8446276  0.8446276
#> 6:            F1 113.1852 118.7602 0.8446276  0.8446276
#> 7:            F1 113.1852 118.7602 0.8446276  0.8446276

TSM(x=input, corr=c(0.4, 0.5)) # two correlation coefficients only 
#> calculating AUC for each features...
#> cor0.4
#> cor0.5
#>    Cor Num features                Features Best features      AIC      BIC
#> 1: 0.4            4            F1,F7,F8,F14  F1,F7,F8,F14 106.9498 120.8872
#> 2: 0.5            7 F1,F7,F13,F8,F9,F14,F15  F1,F7,F13,F8 108.8605 122.7980
#>          AUC AUC(LPOCV)
#> 1: 0.8815629  0.8659951
#> 2: 0.8806471  0.8580586

TSM(x=input, method="pearson") # pearson method for cor()
#> calculating AUC for each features...
#> cor0.1
#> cor0.2
#> cor0.3
#> cor0.4
#> cor0.5
#> cor0.6
#> cor0.7
#>    Cor Num features                                           Features
#> 1: 0.6            9                    F1,F7,F13,F8,F9,F16,F14,F15,F17
#> 2: 0.5            4                                     F1,F13,F15,F18
#> 3: 0.4            3                                         F1,F11,F14
#> 4: 0.7           14 F1,F5,F7,F13,F8,F9,F12,F11,F14,F15,F21,F17,F18,F20
#> 5: 0.1            1                                                 F1
#> 6: 0.2            1                                                 F1
#> 7: 0.3            1                                                 F1
#>     Best features      AIC      BIC       AUC AUC(LPOCV)
#> 1:   F1,F7,F13,F8 108.8605 122.7980 0.8806471  0.8580586
#> 2: F1,F13,F15,F18 108.2224 122.1599 0.8717949  0.8522589
#> 3:     F1,F11,F14 109.9812 121.1312 0.8672161  0.8489011
#> 4:   F1,F5,F7,F13 111.5937 125.5312 0.8666056  0.8476801
#> 5:             F1 113.1852 118.7602 0.8446276  0.8446276
#> 6:             F1 113.1852 118.7602 0.8446276  0.8446276
#> 7:             F1 113.1852 118.7602 0.8446276  0.8446276