Illustration of Known-Groups Validity

TODO

  1. Need to generate some categorical scores and show the functions.
  2. How to pass more than 1 footnote to the flextable output (a possible workaround is sketched after the first known-groups table below).

Simpler is better

My approach to known-groups validity is to use simpler, less complex methods. The FDA has been requesting approaches more along the lines of descriptive statistics.

Question: When should we push for more complex methods?

Answer: When there’s a serious problem to address that requires complex methods.

Methods to address problems

Problem: We don’t actually have a problem generating evidence of reliability and validity using cross-sectional (baseline or screening) data. The problems occur with longitudinal data and analyses such as meaningful within-patient change. This happens for several reasons, chief among which, we believe, are issues with missing data.

Missing Data: This is the foremost issue that needs to be addressed in COA studies. We are addressing it by applying the MMRM to estimate thresholds for meaningful change.

Summary

I am attempting to move forward with methods that are as simple as possible. That means scatterplots instead of eCDFs. Sometimes, more complexity is the way to go. That means using the MMRM to estimate thresholds for meaningful change rather than simple descriptive statistics.

Generate data


  set.seed(12162021)

  # Simulate longitudinal PRO data: 1000 subjects, 3 timepoints, 5 anchor groups
  sim.out <- COA34::sim_pro_dat_v2(N = 1000,
                                   number.timepoints = 3,
                                   Beta.PRO = NULL,
                                   number.of.anchor.groups = 5,
                                   polychor.value = 0.7,
                                   corr = 'ar1',
                                   cor.value = 0.8,
                                   # var.values = c(7.136)
                                   var.values = c(2))
  dat <- sim.out$dat

Implement drop-out


# Apply MCAR, MAR, and MNAR drop-out to the simulated scores
dat <- COA34::dropout(dat = dat,
                      type_dropout = c('mcar', 'mar', 'mnar'),
                      prop.miss = 0.5,
                      stochastic.component = 0.2)
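
The drop-out function stores the scores with missingness as Y_mcar, Y_mar, and Y_mnar (see the str() output in the next section). A quick base R check, not a COA34 function, shows how much missingness was introduced at each timepoint:

# Proportion of missing values per score, by timepoint
aggregate(cbind(Y_mcar, Y_mar, Y_mnar) ~ Time,
          data = dat,
          FUN = function(x) mean(is.na(x)),
          na.action = na.pass)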

Simulate Items


 # Simulate IRT items
 # use the scores generated as the theta in the IRT model 
  sim.out2 <- COA34::sim_irt_item(dat = dat, J = 5, K = 4, latent.variable = 'Y_comp')
    str(sim.out2)
#> List of 3
#>  $ dat           :'data.frame':  3000 obs. of  22 variables:
#>   ..$ USUBJID   : chr [1:3000] "Subject_0001" "Subject_0001" "Subject_0001" "Subject_0002" ...
#>   ..$ Time      : Factor w/ 3 levels "Time_1","Time_2",..: 1 2 3 1 2 3 1 2 3 1 ...
#>   ..$ PGIS      : num [1:3000] 3 4 3 0 0 0 0 0 1 1 ...
#>   ..$ PGIS_bl   : num [1:3000] 3 3 3 0 0 0 0 0 0 1 ...
#>   ..$ PGIS_delta: num [1:3000] 0 1 0 0 0 0 0 0 1 0 ...
#>   ..$ ag        : num [1:3000] 0 1 0 0 0 0 0 0 1 0 ...
#>   ..$ XB        : num [1:3000] 4 4.25 4 1 1 1 1 1 2 2 ...
#>   ..$ error     : num [1:3000] -0.336 -0.509 0.122 0.825 1.212 ...
#>   ..$ Y_comp    : num [1:3000] 3.66 3.74 4.12 1.82 2.21 ...
#>   ..$ Val_1     : int [1:3000] 2 NA NA 0 NA NA 0 NA NA 0 ...
#>   ..$ Val_2     : num [1:3000] 6.38 NA NA 3.87 NA ...
#>   ..$ Val_3     : num [1:3000] 3.73 NA NA 4.57 NA ...
#>   ..$ Val_4     : num [1:3000] 0.555 NA NA 1.558 NA ...
#>   ..$ Val_5     : int [1:3000] 4 NA NA 4 NA NA 0 NA NA 2 ...
#>   ..$ Y_mcar    : num [1:3000] 3.66 3.74 NA 1.82 2.21 ...
#>   ..$ Y_mar     : num [1:3000] 3.66 3.74 NA 1.82 2.21 ...
#>   ..$ Y_mnar    : num [1:3000] 3.66 3.74 NA 1.82 2.21 ...
#>   ..$ Item_1    : int [1:3000] 2 2 2 1 1 1 0 1 0 1 ...
#>   ..$ Item_2    : int [1:3000] 3 1 3 0 2 3 0 0 0 0 ...
#>   ..$ Item_3    : int [1:3000] 2 2 2 1 1 3 0 0 0 0 ...
#>   ..$ Item_4    : int [1:3000] 3 3 1 0 0 2 2 0 1 0 ...
#>   ..$ Item_5    : int [1:3000] 2 2 2 1 0 1 0 0 1 1 ...
#>  $ item.param    : num [1:5, 1:4] 2 2 2 2 2 -2 -2 -2 -2 -2 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "Item_1" "Item_2" "Item_3" "Item_4" ...
#>   .. ..$ : chr [1:4] "slope" "intercept_1" "intercept_2" "intercept_3"
#>  $ item.responses:'data.frame':  3000 obs. of  5 variables:
#>   ..$ Item_1: int [1:3000] 2 2 2 1 1 1 0 1 0 1 ...
#>   ..$ Item_2: int [1:3000] 3 1 3 0 2 3 0 0 0 0 ...
#>   ..$ Item_3: int [1:3000] 2 2 2 1 1 3 0 0 0 0 ...
#>   ..$ Item_4: int [1:3000] 3 3 1 0 0 2 2 0 1 0 ...
#>   ..$ Item_5: int [1:3000] 2 2 2 1 0 1 0 0 1 1 ...
    dat <- sim.out2$dat

# Score the PRO - just take a simple sum score here:
  dat$PRO.score <- apply(dat[, grep('Item', colnames(dat))], 1, sum)
  
# Create the same PRO score, but with MAR drop-out:
  dat$PRO.score_mar <- dat$PRO.score
  dat$PRO.score_mar[is.na(dat$Y_mar)] <- NA
# Note that you've just set the PRO score to missing wherever the Y_mar variable is missing
  
# Now for the other missing types:
    dat$PRO.score_mcar <- dat$PRO.score_mnar <- dat$PRO.score
    dat$PRO.score_mcar[is.na(dat$Y_mcar)] <- NA
    dat$PRO.score_mnar[is.na(dat$Y_mnar)] <- NA

  # Compare mean PRO scores over time under each missingness mechanism:
  aggregate(cbind(PRO.score, PRO.score_mcar, PRO.score_mar, PRO.score_mnar) ~ Time,
            data = dat,
            FUN = function(x) mean(x, na.rm = TRUE),
            na.action = na.pass)
#>     Time PRO.score PRO.score_mcar PRO.score_mar PRO.score_mnar
#> 1 Time_1     7.445       7.445000         7.445          7.445
#> 2 Time_2     7.492       7.486667         5.912          5.748
#> 3 Time_3     7.520       7.736000         5.210          4.204

Managing Variables

I can’t overemphasize how important it is to keep an eye on your variables. The functions will do all the work; you just have to manage your data carefully.
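
For example, a quick base R look at the storage classes (again, not a COA34 function) shows which variables are stored as numeric or integer and therefore need to be converted before the analysis:

# Check how the grouping variables are currently stored
sapply(dat[, c('PGIS', paste0('Val_', 1:5))], class)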

# PGIS is ordinal; convert it to a factor so it is treated as categorical
# Leave the copy in dat numeric to compute the PGIS_delta change scores later
dat.cor <- dat
dat.cor$PGIS <-
  factor(dat.cor$PGIS,
         levels = c(0, 1, 2, 3, 4),
         labels = c('None', 'Mild', 'Moderate', 'Severe', 'Very Severe'))
table(dat.cor$PGIS, useNA = 'always')
#> 
#>        None        Mild    Moderate      Severe Very Severe        <NA> 
#>         683         604         621         526         566           0

# Validator Variables 1 and 5 are supposed to be
# ordered categorical variables
dat.cor$Val_1 <- factor(dat.cor$Val_1)
dat.cor$Val_5 <- factor(dat.cor$Val_5)
table(dat.cor$Val_1, useNA = 'always')
#> 
#>    0    1    2 <NA> 
#>  334  333  333 2000
table(dat.cor$Val_5, useNA = 'always')
#> 
#>    0    1    2    3    4 <NA> 
#>  200  200  200  200  200 2000
# Only collected at baseline, hence the NAs

Afterwards, we are left with 5 validator variables (Val_1, Val_2, …, Val_5) and 4 PRO scores (Y_comp, Y_mcar, Y_mar, Y_mnar), in addition to the item-based sum scores created above (PRO.score, PRO.score_mcar, PRO.score_mar, PRO.score_mnar).

Known-Groups Validity

The function takes one PRO score and one validator variable (reference measure) at a time. There are several outputs:

out <- COA34::compute_known_groups_validity(dat = dat.cor,
                                            PRO.score = 'PRO.score',
                                            val.var = 'Val_5',
                                            time.var = 'Time')
out
#> $fn1.nh
#> [1] "Null hypothesis: mean of Known-Health Group not different from Reference Group."
#> 
#> $fn2.nh
#> [1] "P<0.05 indicates known-health group is significantly different from reference group"
#> 
#> $mod.r2
#> [1] "R-squared value: 0.08"
#> 
#> $out.descr
#>   PRO Score Variable Known-Health Group   N  Mean Median
#> 1 PRO.score    Val_5                  0 200 5.640      5
#> 2 PRO.score    Val_5                  1 200 6.805      6
#> 3 PRO.score    Val_5                  2 200 7.510      8
#> 4 PRO.score    Val_5                  3 200 7.990      8
#> 5 PRO.score    Val_5                  4 200 9.280     10
#> 
#> $out.mod
#>   PRO Score Variable  Known-Health Group   N  Mean      P-value
#> 1 PRO.score    Val_5 0 - Reference Group 200 5.640           NA
#> 2 PRO.score    Val_5                   1 200 6.805 4.210953e-03
#> 3 PRO.score    Val_5                   2 200 7.510 4.671351e-06
#> 4 PRO.score    Val_5                   3 200 7.990 9.633092e-09
#> 5 PRO.score    Val_5                   4 200 9.280 1.535433e-18
#> 
#> $PRO.score
#> [1] "PRO.score"
#> 
#> $val.var
#> [1] "Val_5"
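
Because the function handles one PRO score and one validator variable at a time, one option is to loop over the categorical grouping variables and collect the results in a list. This is only a sketch: the object names (group.vars, kg.list) are arbitrary, and it assumes PGIS and Val_1 can be passed as val.var in the same way as Val_5.

# Hypothetical loop over the categorical grouping variables
group.vars <- c('PGIS', 'Val_1', 'Val_5')
kg.list <- lapply(group.vars, function(vv) {
  COA34::compute_known_groups_validity(dat = dat.cor,
                                       PRO.score = 'PRO.score',
                                       val.var = vv,
                                       time.var = 'Time')
})
names(kg.list) <- group.vars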

The standard output is the table of known-group means along with the t-tests of whether each group differs from the reference group.

library(R2Word)

# Output directory for the Word files; define it here if not set earlier, e.g.:
print.dir <- tempdir()

R2Word::dump_df_mat_to_file(out$out.mod,
                            table.title = 'Known Groups, Validator Variable 5',
                            NA.string = '-',
                            decimals = c(0, 2, 3),
                            file.name = 'kg1', 
                            print.dir = print.dir,
                            table.footnote = out$fn2.nh)
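
The second TODO item at the top asks how to pass more than one footnote. R2Word::dump_df_mat_to_file() takes a single table.footnote above, so one possible workaround, sketched here with the flextable package directly rather than through R2Word or COA34, is to build the table yourself and add several footer lines (the output file name is arbitrary):

library(flextable)

# Build the known-groups table directly and attach both footnotes
ft <- flextable::flextable(out$out.mod)
ft <- flextable::add_footer_lines(ft, values = c(out$fn1.nh, out$fn2.nh))
flextable::save_as_docx(ft, path = file.path(print.dir, 'kg1_footnotes.docx'))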

The descriptive table containing the median is included due to a recent FDA request.

library(R2Word)

R2Word::dump_df_mat_to_file(out$out.descr,
                            table.title = 'Known Groups, Validator Variable 5',
                            NA.string = '-',
                            decimals = c(0, 2, 3),
                            file.name = 'kg2', 
                            print.dir = print.dir)

And again, you can use a shell table if you’re preparing an analysis plan and need or want to include some tables:

library(R2Word)

st <- R2Word::shell_table(out = out$out.mod,
                          cols = 4:6,
                          decimals = c(0, 2, 3),
                          NA.string = '-')

R2Word::dump_df_mat_to_file(out = st,
                            table.title = 'Known Groups, Validator Variable 5',
                            NA.string = '-',
                            decimals = c(0, 2, 3),
                            file.name = 'kg3', 
                            print.dir = print.dir,
                            table.footnote = out$fn2.nh)
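
If you looped over several grouping variables as in the kg.list sketch above, the same dump call can be repeated for each one; again, this is just a sketch built on that hypothetical kg.list object:

# Dump one known-groups table per grouping variable
for (vv in names(kg.list)) {
  R2Word::dump_df_mat_to_file(kg.list[[vv]]$out.mod,
                              table.title = paste('Known Groups,', vv),
                              NA.string = '-',
                              decimals = c(0, 2, 3),
                              file.name = paste0('kg_', vv),
                              print.dir = print.dir,
                              table.footnote = kg.list[[vv]]$fn2.nh)
}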