Fundamentals

Software Project Setup

Name the project folder with the sponsor name, a short descriptor to help you remember it, and the project code:
- Example: ABBV_CONSISTENCY_20222
Within this folder:
1. R project with project number
2. README file for notes to yourself, with dates and topics. Anything you need to jot down goes here.
  - Keep a running record of everything
3. Folders: DATA_20222, CODE_20222, TABLES_20222, FIGURES_20222, DOCUMENTS_20222
  - “DEPRECATED” subfolder to hold old, archived versions
  - “SANDBOX” subfolder to dump any code where you were experimenting with something but it’s not for use, e.g. “mmrm_testing_df_for_meeting_check”
Create a browser bookmarks folder for links related to the project
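
The folder layout above could be scripted so that every project starts the same way. This is only a minimal sketch in base R: the helper name create_project_dirs() and the README.txt file name are hypothetical, and placing DEPRECATED and SANDBOX inside the code folder is an assumption, since the notes do not say where those subfolders live.

```r
# Hypothetical helper: build the standard folder skeleton for one project.
create_project_dirs <- function(root, project_name, project_code) {
  project_dir <- file.path(root, project_name)
  top_level <- paste0(c("DATA", "CODE", "TABLES", "FIGURES", "DOCUMENTS"),
                      "_", project_code)
  # Assumption: DEPRECATED and SANDBOX sit inside the code folder.
  sub_dirs <- file.path(paste0("CODE_", project_code),
                        c("DEPRECATED", "SANDBOX"))
  for (d in file.path(project_dir, c(top_level, sub_dirs))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # Start the running README if one does not exist yet.
  readme <- file.path(project_dir, "README.txt")
  if (!file.exists(readme)) {
    writeLines(paste(Sys.Date(), "- project folder created"), readme)
  }
  invisible(project_dir)
}

# Demo in a temporary directory so nothing on disk is disturbed.
project_dir <- create_project_dirs(tempdir(), "ABBV_CONSISTENCY_20222", "20222")
list.files(project_dir, recursive = TRUE, include.dirs = TRUE)
```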

Learn the SDTM/ADaM file formats. This is not the place for it, but there are extensive resources online: https://www.fda.gov/media/143550/download
The key here is that clinical data is aggressively standardized, which matters a great deal when you are trying to manage data. Once you get familiar with the naming conventions, you don’t need much in the way of a data dictionary or data specs. This is good, because you’ll be lucky to get a “good luck” accompanying the data dump.

First, confirm that your data reproduce the endpoints reported in the clinical study report (CSR).
- At the very least, match the primary endpoints.
If this isn’t possible, match on sample size, or any available official outcomes.
The details of the data matching should be specified in the statistical analysis plan (SAP).
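
One way to make the sample-size check concrete is to count subjects per arm and compare against the numbers reported in the CSR. The sketch below uses made-up data and made-up CSR counts purely for illustration; the object names adsl and csr_n and the arm labels are hypothetical.

```r
# Toy subject-level data (hypothetical; real data would come from the sponsor).
adsl <- data.frame(
  USUBJID = sprintf("S%03d", 1:7),
  ARM     = c("Placebo", "Active", "Active", "Placebo",
              "Active", "Placebo", "Active"),
  stringsAsFactors = FALSE
)

# Per-arm sample sizes as they would be transcribed from the CSR (made up here).
csr_n <- c(Active = 4, Placebo = 3)

# Count unique subjects per arm in the data we received.
observed_n <- tapply(adsl$USUBJID, adsl$ARM, function(id) length(unique(id)))

# Line the two sources up by arm name and flag any mismatch.
comparison <- data.frame(
  arm      = names(csr_n),
  csr      = as.integer(csr_n),
  observed = as.integer(observed_n[names(csr_n)])
)
comparison$match <- comparison$csr == comparison$observed
comparison

# Stop early if the data do not reproduce the reported counts.
stopifnot(all(comparison$match))
```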

Long data (one row per observation) is the correct way to manage data for analysis. All of our internal trainings assume that you have long data.
- If you DON’T have long data, use the pivot_longer() function from the tidyr package (part of the tidyverse) to obtain it.
- Note that other R packages, like ggplot2, are most compatible with long data.
- If you’re unsure what this should look like, use one of the functions to simulate data, then run View(dat) and str(dat) to get a feel for how your data should look.
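
As a rough illustration of the wide-to-long step, the sketch below reshapes a small made-up data frame with tidyr::pivot_longer(). The column names (score_wk0, score_wk4, score_wk8) and the values are invented for the example; it assumes tidyr 1.1.0 or later for the names_transform argument.

```r
library(tidyr)

# Made-up wide data: one row per subject, one column per visit.
wide <- data.frame(
  USUBJID   = c("S001", "S002", "S003"),
  score_wk0 = c(10, 12, 9),
  score_wk4 = c(8, 11, NA),
  score_wk8 = c(7, 9, 6),
  stringsAsFactors = FALSE
)

# Long data: one row per subject per visit.
long <- pivot_longer(
  wide,
  cols            = starts_with("score_wk"),
  names_to        = "week",
  names_prefix    = "score_wk",           # strip the prefix, leaving 0, 4, 8
  names_transform = list(week = as.integer),
  values_to       = "score"
)

# Inspect the structure, as suggested above.
str(long)
dim(long)
```

Three subjects with three visits each should yield nine rows and three columns; the missing week 4 score is kept as NA rather than dropped.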

All of your variables that are factors should be carefully ordered and labeled.
This is done using the factor() function, specifically the “levels” and “labels” arguments.
- Do not do any analysis until you are confident in how you’re implementing this.
“Should this variable be a character or a factor?”
- Usually I make my variables either factor or numeric. Only subject IDs are left as character variables.
- I do this because I am not interested in the subject IDs themselves, and there are so many that a factor variable would get unwieldy.
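
A small sketch of the levels/labels pattern, using made-up treatment codes. The codes (PBO, LO, HI) and display labels are hypothetical; the point is that “levels” fixes the order and “labels” fixes the names shown in output.

```r
dat <- data.frame(
  USUBJID = c("S001", "S002", "S003", "S004"),
  ARMCD   = c("PBO", "HI", "LO", "PBO"),
  stringsAsFactors = FALSE
)

# levels: the raw values, in the order you want them to appear.
# labels: the display name for each level, in the same order.
dat$ARM <- factor(
  dat$ARMCD,
  levels = c("PBO", "LO", "HI"),
  labels = c("Placebo", "Low dose", "High dose")
)

# Any raw value not listed in `levels` silently becomes NA, so check for that.
stopifnot(!anyNA(dat$ARM))

# Cross-tabulate raw codes against the factor to confirm the mapping.
table(dat$ARMCD, dat$ARM)

# Subject IDs stay as character.
str(dat)
```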

Create a function for data management, saved in its own file
Source that file, then run the function to yield managed data
Do this each time you run analysis code; this ensures all results are fully replicable
- Any changes to data management should be made in that one single file
Doing the data management inside a function also prevents cluttering your Global environment with additional objects
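
A minimal sketch of this pattern follows. The function name manage_data(), the file paths in the comments, and the toy variables are all hypothetical; a real version would contain whatever cleaning steps the project needs.

```r
# Contents of a hypothetical file such as CODE_20222/manage_data.R:
manage_data <- function(raw) {
  stopifnot(is.data.frame(raw))
  dat <- raw
  # Intermediate objects created here stay inside the function.
  arm_levels <- c("PBO", "ACT")
  dat$ARM <- factor(dat$ARMCD, levels = arm_levels,
                    labels = c("Placebo", "Active"))
  dat$USUBJID <- as.character(dat$USUBJID)
  # Drop records with a missing analysis value (an example rule only).
  dat <- dat[!is.na(dat$AVAL), ]
  dat
}

# Each analysis script would then start with something like:
#   source("CODE_20222/manage_data.R")
#   dat <- manage_data(read.csv("DATA_20222/raw.csv"))
# Here a toy data frame stands in for the raw file.
raw <- data.frame(
  USUBJID = c("S001", "S002", "S003", "S004"),
  ARMCD   = c("PBO", "ACT", "ACT", "PBO"),
  AVAL    = c(10, 7, NA, 12),
  stringsAsFactors = FALSE
)
dat <- manage_data(raw)
str(dat)
```

Only raw, dat, and the function itself end up in the Global environment; arm_levels exists only while the function runs.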

Descriptive statistics are a routine part of every analysis
- Example: in psychometric analysis, you’re expected to compute item response distributions
They should be computed and presented in tables and figures
gtsummary R package
- Use this collection of tools to streamline the process
- http://www.danieldsjoberg.com/gtsummary/index.html
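
As a starting point, a descriptive table by treatment arm might look like the sketch below. It uses the trial example dataset that ships with gtsummary, so the variable names (trt, age, grade, response) come from that dataset, not from any project data.

```r
library(gtsummary)

# `trial` is an example dataset included in the gtsummary package.
tbl <- tbl_summary(
  trial,
  by      = trt,                      # one column per treatment arm
  include = c(age, grade, response)   # variables to summarize
)

# Printing the object renders the formatted table.
tbl
```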

Getting started: Outline your code on paper
- Especially important for bigger projects with multiple functions
- Which objects get passed to which functions?
Functions
- Sandbox: Wrapping code in functions can make it hard to troubleshoot what you’re doing while you’re just experimenting. If this is going to be a problem, write it all out hard-coded and confirm it works.
- Make it a function: once you know it works, yes, you should make it a function.
Writing Functions
- Build functions sequentially
  1. First write the procedure, hard code all variables
  2. Test it! Make sure it works!
  3. One at a time, change the variables from hard-coded to things you can pass
  4. Test it, make sure it still works
  5. Proceed to change next variable
  6. After it has ALL been generalized, then wrap it in a function
  7. Pass all of the variables to the function
    - Make it throw an error if any of the variables are NOT passed to the function
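
The sequence above can be sketched on a toy example: a mean by group, first hard-coded, then generalized one variable at a time, then wrapped in a function. The data and the function name mean_by_group() are made up for illustration.

```r
dat <- data.frame(
  ARM  = c("Placebo", "Placebo", "Active", "Active"),
  AVAL = c(10, 12, 7, NA),
  stringsAsFactors = FALSE
)

# Steps 1-2: write the procedure with everything hard-coded, then test it.
hard_coded <- tapply(dat$AVAL, dat$ARM, mean, na.rm = TRUE)

# Steps 3-5: swap hard-coded names for variables one at a time, re-testing.
value_var <- "AVAL"
group_var <- "ARM"
generalized <- tapply(dat[[value_var]], dat[[group_var]], mean, na.rm = TRUE)
stopifnot(isTRUE(all.equal(hard_coded, generalized)))

# Steps 6-7: wrap the generalized code in a function and pass everything in.
mean_by_group <- function(data, value_var, group_var) {
  # Throw an informative error if any argument was not passed.
  if (missing(data) || missing(value_var) || missing(group_var)) {
    stop("`data`, `value_var`, and `group_var` must all be supplied.")
  }
  tapply(data[[value_var]], data[[group_var]], mean, na.rm = TRUE)
}

from_function <- mean_by_group(dat, value_var = "AVAL", group_var = "ARM")
stopifnot(isTRUE(all.equal(hard_coded, from_function)))
from_function
```

Comparing each stage against the hard-coded result is what makes it safe to generalize: if a change breaks something, you know exactly which step did it.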

Code guidelines: https://style.tidyverse.org/
- Most important - document heavily; another programmer should be able to step in, review your code, and figure out what you’re doing and why without too much difficulty.

Code very slowly in order to code very quickly

Write a line of code
Run line of code
Evaluate output
Check length/dimensions
View output – does it align with what you anticipated?
Test it – if you change something, does it break or output something undesired?
Missing value handling – what happens if there’s an NA or NaN in the input?
- Any other way it could break? Make a note.
Each time you test that line of code, re-run all the preceding code
- Ensures you are not overwriting anything
- Your output won’t depend on your previous test
After you’re confident that you don’t have any errors in that line of code, proceed to next line of code
TO MAXIMIZE SPEED, MINIMIZE ERRORS!
- Microsoft’s approach, as described in The Joel Test: https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
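
For a sense of what checking a single line looks like in practice, the sketch below probes one mean() call for length and missing-value behavior. The input vector is made up.

```r
x <- c(4, 8, NA, NaN, 15)

# Check the length of the input.
length(x)

# Without na.rm, a single missing value makes the result missing.
is.na(mean(x))

# is.na() flags both NA and NaN; is.nan() flags only NaN.
sum(is.na(x))
sum(is.nan(x))

# na.rm = TRUE drops both kinds of missing value.
clean_mean <- mean(x, na.rm = TRUE)
clean_mean

# Another way it could break: an all-missing input gives NaN, not an error.
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
is.nan(all_missing)
```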

We all need reminders for ourselves, and references to point clients to
Common statistical tests are linear models (or: how to do basic stats)
- https://lindeloev.github.io/tests-as-linear/
Missing Data in clinical trials (and analysis in RCT more broadly)
- https://www.ncbi.nlm.nih.gov/books/NBK209904/
Code Guidelines
- https://style.tidyverse.org/
Jargon
- NRC stat report had a section on jargon: https://www.ncbi.nlm.nih.gov/books/NBK209903/
Bug report
- How do we report issues/problems to developers, the Stack Overflow community, or colleagues?
- https://github.com/rstudio/rstudio/wiki/Writing-Good-Bug-Reports
- Summary: send fully reproducible code and output from “sessionInfo()”
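
A bug report along these lines might contain something like the sketch below: a few lines of made-up data built inside the script, the one call that shows the problem, and the session details. The data and the call are placeholders.

```r
# Small made-up data built in the script itself, so anyone can run it.
dat <- data.frame(group = c("a", "a", "b"), value = c(1, NA, 3))

# The exact call that shows the problem, with nothing extra around it.
result <- tapply(dat$value, dat$group, mean)
result

# R version, platform, and loaded package versions to paste into the report.
info <- sessionInfo()
info
```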