a05_Vignette_Known_Groups.Rmd
My approach to known-groups validity is to use simpler, less complex methods; the FDA has been requesting approaches more along the lines of descriptive statistics.
Question: When should we push for more complex methods?
Answer: When there is a serious problem to address that requires complex methods.
Problem: We do not actually have a problem generating evidence of reliability and validity from cross-sectional (baseline or screening) data. The problems arise with longitudinal data, such as estimating meaningful within-patient change, and they arise for several reasons, chief of which we believe to be missing data.
Missing Data: This is the foremost issue to address in COA studies, and we are addressing it by applying the MMRM to the estimation of thresholds for meaningful change.
In short, I try to move forward with methods that are as simple as possible; that means scatterplots instead of eCDFs. Sometimes more complexity is the way to go; that means using the MMRM to estimate thresholds for meaningful change rather than simple descriptive statistics.
set.seed(12162021)
sim.out <- COA34::sim_pro_dat_v2(N = 1000,
                                 number.timepoints = 3,
                                 Beta.PRO = NULL,
                                 number.of.anchor.groups = 5,
                                 polychor.value = 0.7,
                                 corr = 'ar1',
                                 cor.value = 0.8,
                                 # var.values = c(7.136)
                                 var.values = c(2))
dat <- sim.out$dat
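# Optional sanity check (not in the original vignette): confirm one row per
# subject per timepoint and inspect the simulated anchor-group variable
table(dat$Time)
table(dat$ag, useNA = 'always')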
# Simulate IRT items
# use the scores generated as the theta in the IRT model
sim.out2 <- COA34::sim_irt_item(dat = dat, J = 5, K = 4, latent.variable = 'Y_comp')
str(sim.out2)
#> List of 3
#> $ dat :'data.frame': 3000 obs. of 22 variables:
#> ..$ USUBJID : chr [1:3000] "Subject_0001" "Subject_0001" "Subject_0001" "Subject_0002" ...
#> ..$ Time : Factor w/ 3 levels "Time_1","Time_2",..: 1 2 3 1 2 3 1 2 3 1 ...
#> ..$ PGIS : num [1:3000] 3 4 3 0 0 0 0 0 1 1 ...
#> ..$ PGIS_bl : num [1:3000] 3 3 3 0 0 0 0 0 0 1 ...
#> ..$ PGIS_delta: num [1:3000] 0 1 0 0 0 0 0 0 1 0 ...
#> ..$ ag : num [1:3000] 0 1 0 0 0 0 0 0 1 0 ...
#> ..$ XB : num [1:3000] 4 4.25 4 1 1 1 1 1 2 2 ...
#> ..$ error : num [1:3000] -0.336 -0.509 0.122 0.825 1.212 ...
#> ..$ Y_comp : num [1:3000] 3.66 3.74 4.12 1.82 2.21 ...
#> ..$ Val_1 : int [1:3000] 2 NA NA 0 NA NA 0 NA NA 0 ...
#> ..$ Val_2 : num [1:3000] 6.38 NA NA 3.87 NA ...
#> ..$ Val_3 : num [1:3000] 3.73 NA NA 4.57 NA ...
#> ..$ Val_4 : num [1:3000] 0.555 NA NA 1.558 NA ...
#> ..$ Val_5 : int [1:3000] 4 NA NA 4 NA NA 0 NA NA 2 ...
#> ..$ Y_mcar : num [1:3000] 3.66 3.74 NA 1.82 2.21 ...
#> ..$ Y_mar : num [1:3000] 3.66 3.74 NA 1.82 2.21 ...
#> ..$ Y_mnar : num [1:3000] 3.66 3.74 NA 1.82 2.21 ...
#> ..$ Item_1 : int [1:3000] 2 2 2 1 1 1 0 1 0 1 ...
#> ..$ Item_2 : int [1:3000] 3 1 3 0 2 3 0 0 0 0 ...
#> ..$ Item_3 : int [1:3000] 2 2 2 1 1 3 0 0 0 0 ...
#> ..$ Item_4 : int [1:3000] 3 3 1 0 0 2 2 0 1 0 ...
#> ..$ Item_5 : int [1:3000] 2 2 2 1 0 1 0 0 1 1 ...
#> $ item.param : num [1:5, 1:4] 2 2 2 2 2 -2 -2 -2 -2 -2 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:5] "Item_1" "Item_2" "Item_3" "Item_4" ...
#> .. ..$ : chr [1:4] "slope" "intercept_1" "intercept_2" "intercept_3"
#> $ item.responses:'data.frame': 3000 obs. of 5 variables:
#> ..$ Item_1: int [1:3000] 2 2 2 1 1 1 0 1 0 1 ...
#> ..$ Item_2: int [1:3000] 3 1 3 0 2 3 0 0 0 0 ...
#> ..$ Item_3: int [1:3000] 2 2 2 1 1 3 0 0 0 0 ...
#> ..$ Item_4: int [1:3000] 3 3 1 0 0 2 2 0 1 0 ...
#> ..$ Item_5: int [1:3000] 2 2 2 1 0 1 0 0 1 1 ...
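# Optional check (not in the original vignette): each of the J = 5 simulated
# items should use K = 4 response categories (0-3)
sapply(sim.out2$item.responses, table)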
dat <- sim.out2$dat
# Score the PRO - just take a simple sum score here:
dat$PRO.score <- apply(dat[, grep('Item', colnames(dat))], 1, sum)
# Create the same PRO score, but with MAR drop-out:
dat$PRO.score_mar <- dat$PRO.score
dat$PRO.score_mar[is.na(dat$Y_mar)] <- NA
# Note that you've just set the PRO score to missing wherever the Y_mar variable is missing
# Now for the other missing types:
dat$PRO.score_mcar <- dat$PRO.score_mnar <- dat$PRO.score
dat$PRO.score_mcar[is.na(dat$Y_mcar)] <- NA
dat$PRO.score_mnar[is.na(dat$Y_mnar)] <- NA
aggregate(cbind(PRO.score, PRO.score_mcar, PRO.score_mar, PRO.score_mnar) ~ Time,
          FUN = function(x) mean(x, na.rm = TRUE),
          data = dat,
          na.action = na.pass)
#> Time PRO.score PRO.score_mcar PRO.score_mar PRO.score_mnar
#> 1 Time_1 7.445 7.445000 7.445 7.445
#> 2 Time_2 7.492 7.486667 5.912 5.748
#> 3 Time_3 7.520 7.736000 5.210 4.204
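You can also tabulate the proportion of missing scores at each timepoint to see how much drop-out each mechanism introduces; this optional check (not in the original vignette) reuses the same aggregate() pattern:
# Proportion of missing PRO scores by timepoint under each missingness mechanism
aggregate(cbind(PRO.score_mcar, PRO.score_mar, PRO.score_mnar) ~ Time,
          FUN = function(x) mean(is.na(x)),
          data = dat,
          na.action = na.pass)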
I can't overemphasize how important it is to keep an eye on your variables. The functions will do all the work; you just have to manage your data carefully.
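A quick way to do that is to check the storage class of each analysis variable before converting anything; this snippet is an optional check, not part of the original vignette:
# Classes of the key analysis variables prior to any factor conversions
sapply(dat[, c('PGIS', 'Val_1', 'Val_5', 'PRO.score')], class)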
# PGIS is ordinal; convert it to a factor so it is treated correctly in the analyses
# (the copy in dat stays numeric so the PGIS_delta change scores can be computed later)
dat.cor <- dat
dat.cor$PGIS <- factor(dat.cor$PGIS,
                       levels = c(0, 1, 2, 3, 4),
                       labels = c('None', 'Mild', 'Moderate', 'Severe', 'Very Severe'))
table(dat.cor$PGIS, useNA = 'always')
#>
#> None Mild Moderate Severe Very Severe <NA>
#> 683 604 621 526 566 0
# Validator Variables 1 and 5 are supposed to be
# ordered categorical variables
dat.cor$Val_1 <- factor(dat.cor$Val_1)
dat.cor$Val_5 <- factor(dat.cor$Val_5)
table(dat.cor$Val_1, useNA = 'always')
#>
#> 0 1 2 <NA>
#> 334 333 333 2000
table(dat.cor$Val_5, useNA = 'always')
#>
#> 0 1 2 3 4 <NA>
#> 200 200 200 200 200 2000
# Only collected at baseline, hence the NAs
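You can confirm the baseline-only collection pattern directly; this is an optional check, not part of the original vignette:
# Cross-tabulate timepoint against missingness in Val_5
with(dat.cor, table(Time, missing_Val_5 = is.na(Val_5)))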
After this preparation we are left with 5 validator variables (Val_1, Val_2, ..., Val_5) and 4 simulated PRO scores (Y_comp, Y_mcar, Y_mar, Y_mnar), along with the item-based sum scores (PRO.score and its MCAR/MAR/MNAR counterparts) created above.
The function takes one PRO score and one validator variable (reference measure) at a time and returns several outputs; a sketch for looping over several validator variables follows the output below:
out <- COA34::compute_known_groups_validity(dat = dat.cor,
                                            PRO.score = 'PRO.score',
                                            val.var = 'Val_5',
                                            time.var = 'Time')
out
#> $fn1.nh
#> [1] "Null hypothesis: mean of Known-Health Group not different from Reference Group."
#>
#> $fn2.nh
#> [1] "P<0.05 indicates known-health group is significantly different from reference group"
#>
#> $mod.r2
#> [1] "R-squared value: 0.08"
#>
#> $out.descr
#> PRO Score Variable Known-Health Group N Mean Median
#> 1 PRO.score Val_5 0 200 5.640 5
#> 2 PRO.score Val_5 1 200 6.805 6
#> 3 PRO.score Val_5 2 200 7.510 8
#> 4 PRO.score Val_5 3 200 7.990 8
#> 5 PRO.score Val_5 4 200 9.280 10
#>
#> $out.mod
#> PRO Score Variable Known-Health Group N Mean P-value
#> 1 PRO.score Val_5 0 - Reference Group 200 5.640 NA
#> 2 PRO.score Val_5 1 200 6.805 4.210953e-03
#> 3 PRO.score Val_5 2 200 7.510 4.671351e-06
#> 4 PRO.score Val_5 3 200 7.990 9.633092e-09
#> 5 PRO.score Val_5 4 200 9.280 1.535433e-18
#>
#> $PRO.score
#> [1] "PRO.score"
#>
#> $val.var
#> [1] "Val_5"
The standard output is the table of known-group means together with the t-tests of whether each group differs from the reference group.
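Note that print.dir is never defined in the code above; it simply needs to point to the directory where the Word files will be written. For example (an assumption, not part of the original vignette):
# Output directory for the Word tables (any writable directory works)
print.dir <- tempdir()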
library(R2Word)
R2Word::dump_df_mat_to_file(out$out.mod,
                            table.title = 'Known Groups, Validator Variable 5',
                            NA.string = '-',
                            decimals = c(0, 2, 3),
                            file.name = 'kg1',
                            print.dir = print.dir,
                            table.footnote = out$fn2.nh)
Known Groups, Validator Variable 5

  PRO Score  Variable   Known-Health Group     N  Mean  P-value
  PRO.score  Val_5      0 - Reference Group  200  5.64        -
  PRO.score  Val_5      1                    200  6.80    0.004
  PRO.score  Val_5      2                    200  7.51    0.000
  PRO.score  Val_5      3                    200  7.99    0.000
  PRO.score  Val_5      4                    200  9.28    0.000

P<0.05 indicates known-health group is significantly different from reference group
The descriptive table, which includes the median, is provided in response to a recent FDA request.
library(R2Word)
R2Word::dump_df_mat_to_file(out$out.descr,
                            table.title = 'Known Groups, Validator Variable 5',
                            NA.string = '-',
                            decimals = c(0, 2, 3),
                            file.name = 'kg2',
                            print.dir = print.dir)
Known Groups, Validator Variable 5

  PRO Score  Variable   Known-Health Group    N  Mean  Median
  PRO.score  Val_5      0                   200  5.64   5.000
  PRO.score  Val_5      1                   200  6.80   6.000
  PRO.score  Val_5      2                   200  7.51   8.000
  PRO.score  Val_5      3                   200  7.99   8.000
  PRO.score  Val_5      4                   200  9.28  10.000
And again, you can use a shell table if you're preparing an analysis plan and need or want to include some tables:
library(R2Word)
st <- R2Word::shell_table(out = out$out.mod, cols = 4:6, decimals = c(0,2,3), NA.string = '-')
R2Word::dump_df_mat_to_file(out = st,
                            table.title = 'Known Groups, Validator Variable 5',
                            NA.string = '-',
                            decimals = c(0, 2, 3),
                            file.name = 'kg3',
                            print.dir = print.dir,
                            table.footnote = out$fn2.nh)
Known Groups, Validator Variable 5

  PRO Score  Variable   Known-Health Group     N  Mean  P-value
  PRO.score  Val_5      0 - Reference Group  xxx  x.xx        -
  PRO.score  Val_5      1                    xxx  x.xx    x.xxx
  PRO.score  Val_5      2                    xxx  x.xx    x.xxx
  PRO.score  Val_5      3                    xxx  x.xx    x.xxx
  PRO.score  Val_5      4                    xxx  x.xx    x.xxx

P<0.05 indicates known-health group is significantly different from reference group
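Finally, if you collected tables for several validators (as in the kg.tables sketch above), the same dump call can be wrapped in a loop. Again, this is an illustrative sketch rather than part of the original workflow:
# Write one Word table per validator variable (illustrative sketch)
for (v in names(kg.tables)) {
  R2Word::dump_df_mat_to_file(kg.tables[[v]],
                              table.title = paste('Known Groups,', v),
                              NA.string = '-',
                              decimals = c(0, 2, 3),
                              file.name = paste0('kg_', v),
                              print.dir = print.dir,
                              table.footnote = out$fn2.nh)
}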