Software Project Set up

  • Sponsor name, some descriptive for you to remember it, and project code:
    • Example: ABBV_CONSISTENCY_20222
  • Within this folder:
    1. R project with project number
    2. README file to note things to yourself with dates and topics, anything you need to jot down, do it here.
      • Keep a running record of everything
    3. Folders: DATA_20222, CODE_20222, TABLES_20222, FIGURES_20222, DOCUMENTS_2022; “DEPRECATED” subfolder to hold old, archived versions; “SANDBOX” subfolder to dump any code where you were messing around with something but it’s not for use - eg: “mmrm_testing_df_for_meeting_check”
  • Create browser folder containing bookmarks related to a project

Data Management

SDTM/ADaM file formats

  • Learn the SDTM/ADaM file formats. This is not the place for it, but there are extensive resources online: https://www.fda.gov/media/143550/download

  • The key here is that clinical data is aggressively standardized - this is super important when trying to manage data; once you get familiar with the naming processes, you don’t need much in the way of a data dictionary or data specs to manage the data. This is good, because you’ll be lucky to get a “good luck” accompanying the data dump.

.xpt file format:

  • Use the haven R package to read in file formats
  • .xpt preferred - safest choice
  • We can use the haven R package to read in a variety of data formats: https://haven.tidyverse.org/
    • However, I have had some issues reading in sas7bdat, rare but i hold my breath when reading in sas7bdat files.
    • Also, R cannot write out sas7bdat files (https://github.com/tidyverse/haven/issues/357#issuecomment-376313103)
    • If you need to provide a sas7bdat file, ask internal colleague to convert .csv or .dat file to a .sas7bdat file.
    • Long story short, the xpt file format is mandated by the FDA to be used, so they should have it. In my experience it requires more management but that’s okay, it will work.
    • https://www.fda.gov/media/132457/download

Match on CSR Tables

  • First, match the endpoints reported in the CSR.
    • At the very least, match the primary endpoints.
  • If this isn’t possible, match on sample size, or any available official outcomes.
  • The details of the data matching should be specified in the SAP

Long Data format

  • This is the correct way to manage data for analysis. This is the assumption in all of our internal trainings, that you have long data.
    • If you DON’T have long data, it is recommended you use the pivot_long function from the tidyverse R packages to obtain long data.
    • Note that other R packages, like ggplot2, are most compatible with long data.
    • If you’re unsure what this should look like, use one of the functions to simulate data, use View(dat) and str(dat) it to get a feel for how your data should look.

Factor variables

  • All of your variables that are factors should be carefully ordered and labeled.
  • This is done using the factor() function, specifically the “levels” and “labels” calls.
    • Do not do any analysis until you are confident in how you’re implementing this.
  • “Should this variable be a character or a factor?”
    • Usually I make my variables as a factor or numeric. Only Subject IDs are left as character variables.
    • I do this because I am not interested in the subject IDs and there’s so many that a factor variable will get unwieldy.

Data Management Function - data_mgmt()

  • Create a function for data management
  • Source that file, run that function to yield managed data
  • Do this each time you run analysis code, this will ensure all results are fully replicable
    • Any changes to data management should be done to that one single file
  • This prevents cluttering your Global environment with additional objects

Descriptive Statistics

  • Routine part of every analysis
    • Example: in psychometric analysis, you’re expected to compute item response distributions
  • Descriptive statistics should computed and presented in tables and figures
  • gtsummary R package

Coding Best Practices

  • Getting started: Outline your code on paper
    • Especially important for bigger projects with multiple functions
    • Pass what objects to what functions?
  • Functions
    • Sandbox: Sometimes creating functions makes it hard to trouble shoot what you’re doing if you’re just experimenting - If this is going to be a problem, just write it all out hard coded, confirm it works
    • Make it a function: once you know it works, yes, you should make it a function.
  • Writing Functions
    • Build functions sequentially
      1. First write the procedure, hard code all variables
      2. Test it! Make sure it works!
      3. One at a time, change the variables from hard-coded to things you can pass
      4. Test it, make sure it still works
      5. Proceed to change next variable
      6. After it has ALL been generalized, then wrap it in a function
      7. Pass all of the variables to the function
        • Make it throw an error if any of the variables are NOT passed to the function

Code Guidelines

  • Code guidelines: https://style.tidyverse.org/

    • Most important - document heavily; another programmer should be able to step in, review your code, and figure out what you’re doing and why without too much difficulty.

Testing Code

Code very slowly in order to code very quickly

  • Write a line of code
  • Run line of code
  • Evaluate output
  • Check length/dimensions
  • View output – does it align with what you anticipated?
  • Test it – if you change something, does it break or output something undesired?
  • Missing value handling – what happens if there’s an NA or NAN in the input?
    • Any other way it could break? Make a note.
  • Each time you test that line of code, re-run all the preceding code
    • Ensures you are not overwriting anything
    • Your output won’t depend on your previous test
  • After you’re confident that you don’t have any errors in that line of code, proceed to next line of code
  • TO MAXIMIZE SPEED, MINIMIZE ERRORS!

Statistics Resources & References