TPSClus: Clustering Multivariate Longitudinal Data using Tensor-Product Splines

TPSClus is an R package for fuzzy clustering of multivariate longitudinal data using tensor-product smoothing splines. The TPSfit() function fits splines to three or more variables measured repeatedly over time. The cluster.fitted() or cluster.coefs() function can then be used to identify fuzzy clusters in the data. Finally, the FKM.glm() function fits a generalized linear model for a distal outcome, using the fuzzy clusters and other covariates as predictors. Several plot options are available for visualizing the data and clusters.

Installation

You can install the development version of TPSClus from GitHub with devtools:

devtools::install_github("silivingston/TPSClus", dependencies = TRUE, build_vignettes = TRUE)

Example

A basic example of the workflow, using the built-in simulated dataset TS.sim:

Fit splines to multivariate data:

library(TPSClus)
data(TS.sim)

fitsplines2 <- TPSfit(TS.sim, vars=c("Var1", "Var2", "Var3"),
                      time="Time", ID="SubjectID",
                      knots_time=c(0, 91, 182, 273, 365),
                      fit_times=c(46, 91, 137, 182, 228, 273, 319))
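
To make the roles of knots_time and fit_times more concrete, the following sketch uses the base R splines package, not TPSClus itself, to build a cubic B-spline basis over time with the same knot locations and evaluate it at the same fit times. It assumes the first and last values of knots_time act as boundary knots and the remaining values as interior knots; how TPSfit() actually constructs its tensor-product basis may differ.

```r
library(splines)  # ships with base R

knots_time <- c(0, 91, 182, 273, 365)
fit_times  <- c(46, 91, 137, 182, 228, 273, 319)

# Assumption: outer values are boundary knots, inner values are interior knots.
boundary <- range(knots_time)
interior <- knots_time[knots_time > boundary[1] & knots_time < boundary[2]]

# Cubic B-spline basis evaluated at the fit times (one row per time point).
B <- bs(fit_times, knots = interior, Boundary.knots = boundary,
        degree = 3, intercept = TRUE)

dim(B)      # 7 time points by 7 basis functions
rowSums(B)  # each row sums to 1 (partition of unity)
```

A fitted curve is a weighted sum of these basis columns, so evaluating at fit_times yields one smoothed value per time point for each subject and variable.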

Run the fuzzy clustering algorithm on the smoothed data using 3 clusters plus a noise cluster:

clusters1 <- cluster.fitted(fitsplines2, k=3, m=1.3, seed=12345, RS=5, noise=TRUE)
summary(clusters1)
#> cluster.fitted(TPSdata = fitsplines2, k = 3, m = 1.3, noise = TRUE, 
#>     seed = 12345, RS = 5)
#> 
#> 150 subjects clustered into 3 clusters + noise cluster using m = 1.3 
#> Clusters based on fitted values at times 46 91 137 182 228 273 319 
#> and slopes between points
#> 
#> Cluster summary:
#>       Cl.size Min.degree Max.degree Mean.degree
#> Clus1      47      0.489          1       0.983
#> Clus2      50      0.483          1       0.980
#> Clus3      46      0.862          1       0.991
#> Noise       7      0.933          1       0.980
#> 
#> Component 'FKM_TPS' contains the fuzzy clustering details from package 'fclust'.
#> Components of 'FKM_TPS':
#>  [1] "U"         "H"         "F"         "clus"      "medoid"    "value"    
#>  [7] "criterion" "iter"      "k"         "m"         "ent"       "b"        
#> [13] "vp"        "delta"     "stand"     "Xca"       "X"         "D"        
#> [19] "call"     
#> 
#> Cluster validity indices:
#>         PC         PE        MPC        SIL      SIL.F         XB 
#> 0.92979979 0.04137373 0.89469968 0.46242525 0.48364829 0.24117979 
#> 
#> Output dataset 'FKM_TPS_U' contains degree of cluster membership and modal
#>       cluster assignment for each object.
#> Head of dataset 'FKM_TPS_U':
#>   Id2 SubjectID        Clus1        Clus2        Clus3        Noise ClusModal
#> 1   1        17 1.309679e-05 0.9999862733 8.929241e-08 5.406552e-07         2
#> 2   2        19 9.665821e-04 0.0004435711 9.972477e-01 1.342137e-03         3
#> 3   3        34 4.719129e-06 0.9999952117 1.255723e-08 5.657849e-08         2
#> 4   4        52 1.188519e-05 0.9999878294 4.352554e-08 2.418527e-07         2
#> 5   5        55 6.427492e-04 0.0003088823 9.982691e-01 7.792961e-04         3
#> 6   6        83 3.651509e-03 0.0012300893 9.802217e-01 1.489672e-02         3
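
The ClusModal column appears to be the cluster with the highest membership degree in each row. The sketch below reproduces it in base R from the six rows shown above. The data frame is retyped from that output, and coding the noise cluster as 0 is an assumption inferred from the predict() output later in this example, where a subject whose largest membership is in Noise has ClusModal 0.

```r
# Membership degrees retyped from head(FKM_TPS_U) above.
U <- data.frame(
  SubjectID = c(17, 19, 34, 52, 55, 83),
  Clus1 = c(1.309679e-05, 9.665821e-04, 4.719129e-06,
            1.188519e-05, 6.427492e-04, 3.651509e-03),
  Clus2 = c(0.9999862733, 0.0004435711, 0.9999952117,
            0.9999878294, 0.0003088823, 0.0012300893),
  Clus3 = c(8.929241e-08, 9.972477e-01, 1.255723e-08,
            4.352554e-08, 9.982691e-01, 9.802217e-01),
  Noise = c(5.406552e-07, 1.342137e-03, 5.657849e-08,
            2.418527e-07, 7.792961e-04, 1.489672e-02)
)

degrees <- as.matrix(U[, c("Clus1", "Clus2", "Clus3", "Noise")])

# Memberships across the clusters and the noise cluster sum to 1 per subject.
stopifnot(all(abs(rowSums(degrees) - 1) < 1e-6))

# Modal cluster = column with the largest degree; Noise (column 4) is coded
# as 0 here (assumed coding, inferred from the predict() output).
idx   <- apply(degrees, 1, which.max)
modal <- ifelse(idx == 4, 0L, idx)
unname(modal)  # 2 3 2 2 3 3, matching ClusModal above
```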

Plot the mean cluster trajectories:

plot(clusters1, legend_label_size=10)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Fit a GLM using the clusters and other covariates as predictors:

model <- FKM.glm(clusters1, TS.sim, y="outcome", covariates=c("x1", "x2"),
                 family="binomial")
summary(model)
#> Full model:
#> Formula (f1):  outcome ~ Clus2 + Clus3 + Noise + x1 + x2 
#> Family: binomial 
#> 
#> Call:
#> glm(formula = f1, family = family, data = data3)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -1.97164  -0.22557  -0.02179   0.13942   2.11563  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) 10.42476    2.11640   4.926 8.41e-07 ***
#> Clus2        1.15006    0.86491   1.330   0.1836    
#> Clus3        2.37069    1.00594   2.357   0.0184 *  
#> Noise        3.52023    1.73903   2.024   0.0429 *  
#> x1           0.73342    0.68421   1.072   0.2838    
#> x2          -0.27902    0.05381  -5.185 2.16e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 201.065  on 149  degrees of freedom
#> Residual deviance:  63.838  on 144  degrees of freedom
#> AIC: 75.838
#> 
#> Number of Fisher Scoring iterations: 7
#> 
#> 
#> ANOVA chi-square p-value for significance of clusters in model:
#> 0.03360065

Predict the outcome for a new dataset:

data(TS.sim.new)
predicted <- predict(model, TS.sim.new)
head(predicted)
#>   SubjectID x1   x2 Id2        Clus1       Clus2        Clus3        Noise
#> 1      1626  0 23.5   1 3.263066e-06 0.999996610 1.915778e-08 1.075745e-07
#> 2      1627  0 87.2   2 1.091953e-02 0.004496758 9.803848e-01 4.198949e-03
#> 3      1629  0 59.3   3 6.915108e-02 0.045432056 8.668493e-01 1.856759e-02
#> 4      1634  0 36.3   4 2.411935e-07 0.999999754 8.120729e-10 3.633010e-09
#> 5      1641  1 49.6   5 2.365786e-01 0.119455660 1.033256e-02 6.336331e-01
#> 6      1650  0 34.1   6 4.438164e-09 0.999999995 1.583192e-11 6.998691e-11
#>   ClusModal    predicted
#> 1         2 9.934247e-01
#> 2         3 9.524077e-06
#> 3         3 1.891861e-02
#> 4         2 8.094402e-01
#> 5         0 4.282913e-01
#> 6         2 8.869764e-01
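
For a binomial model, the predicted column can be checked by hand: the linear predictor is built from the coefficients in the model summary and passed through the inverse logit. The sketch below does this in base R for the first two rows of the output, with coefficients and membership degrees retyped from the displayed (rounded) values. It assumes predict() returns probabilities on the response scale, which is consistent with the values shown. Clus1 does not appear in the model formula, so its membership degree does not enter the calculation.

```r
# Coefficients retyped from summary(model) (rounded as displayed).
beta <- c("(Intercept)" = 10.42476, Clus2 = 1.15006, Clus3 = 2.37069,
          Noise = 3.52023, x1 = 0.73342, x2 = -0.27902)

# Rows 1 and 2 of head(predicted): membership degrees plus covariates.
newdata <- data.frame(
  Clus2 = c(0.999996610, 0.004496758),
  Clus3 = c(1.915778e-08, 9.803848e-01),
  Noise = c(1.075745e-07, 4.198949e-03),
  x1    = c(0, 0),
  x2    = c(23.5, 87.2)
)

# Linear predictor: intercept plus the coefficient-weighted predictors.
X   <- cbind("(Intercept)" = 1, as.matrix(newdata))
eta <- drop(X %*% beta[colnames(X)])

# Inverse logit maps the linear predictor to a probability.
p <- plogis(eta)
p  # approximately 0.9934 and 9.52e-06, matching the predicted column
```

Because the predictors are membership degrees rather than hard cluster labels, a subject split between clusters contributes a weighted mix of the cluster coefficients.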