Motivation

Our motivation for this project is to explore the fragility index (FI) beyond how it has been used so far by those involved in clinical trials. The FI has been suggested as an easy-to-understand metric that bridges the intent of researchers and the needs of clinicians. Interest in it has been greatest in the oncology community, but we want to explore its utility beyond that scope. You can go here to learn more about our motivation.

Note: You can expand our code chunks by clicking the Code pill-button on the right.

Primary goals

  • Learn more about phase III clinical trials and the fragility index
  • Explore the general trend of FI in recent phase III clinical trials
  • Discover potential associations between FI and factors such as disease type, treatment type, and trial location

Secondary goals

  • Explore other variables
  • Use the available data as best we can to synthesize information about clinical trials
  • Construct a web-based FI calculator that allows users to input their own dataset

Initial Questions and Goal

As our initial goal, we wanted to examine the relationship between the fragility index and disease type in phase III RCTs. More specifically, we wanted to see whether there is a significant difference in FI between trials targeting cancers and those targeting non-oncology immunologic disorders. We also wanted to discover any other factors that may impact the FI of RCT studies. However, during the data collection process, we found that all the criteria we had applied resulted in a sample size too small for further analysis. Accordingly, we broadened our filtering criteria to US-based phase III clinical trials.

What We Actually Did for FI

What we ended up doing differed from our initial hypothesis described in the paragraph above. After looking at the data we gathered, the questions we initially posed were not answerable. While we managed to scrape usable clinical trial data to calculate the fragility index, the number of trials was not enough to make adequate visualizations, or really anything substantial, from our initial query, which was filtered by drug type and disease type.

So we pivoted and instead scraped data from 10,000 completed phase 3 clinical trials prior to 2017. This took a lot of trial and error, and the various methods we attempted can be found in the Obtaining Data for FI section below. We ended up with 39 trials that matched our specific criteria (the primary outcome is a 2x2 table of patient counts, with placebo as one of the groups), allowing us to calculate a fragility index.

Questions we could answer from this dataset include:

  1. What is the median fragility index of our data obtained through our “MacGyvered” scraping method?
  2. What sort of trends, if any, do we see from our basic information describing the trials?

Obtaining Data for FI

Our goal is to obtain data from clinical trials from which a fragility index can be calculated, in addition to any other information we can easily gather. We attempted multiple methods to find such data.

Scraping from Google Scholar

Given the ease of access to Google Scholar (GS), we first attempted to scrape our sources from the site. We started by running a manual search to spot the pattern in the URL where the "page number" is inserted. We found that we needed to split Google's search address into two parts so that we could insert the search offsets.

# packages used throughout this document
library(tidyverse)   # stringr, purrr, dplyr, tibble
library(rvest)       # html scraping

# have to split the link into 2 parts...
gs_url_base1 <- "https://scholar.google.com/scholar?start="
gs_url_base2 <- "&q=monoclonal+antibody+phase+3&hl=en&as_sdt=0,33&as_ylo=2007&as_yhi=2017&lookup=0"

# adds the search term by 10s (obtained by evaluating scholar's http address)
gs_vec_url <- str_c(gs_url_base1, seq(0, 100, 10), gs_url_base2)

Once we figured out the web address components, we incorporated them into a scraping function using purrr::map and searched for the term "monoclonal antibody phase 3". Using SelectorGadget, we obtained the CSS selector for the title, .gs_rt a, and used this to scrape our articles/journals.

# GS scraper fn
read_page <- function(url) {
  
  h = read_html(url)               # reads url input
  
  title = h %>%
    html_nodes(".gs_rt a") %>%     # pulls the specific html tag (for titles)
    html_text()
  
  tibble(title)                    # turns the scraped titles into a one-column tibble
}

# map read test
gs_test <- map(gs_vec_url, read_page)

# unnested df test (success)
unnested_gs <- gs_test %>% 
  tibble::enframe(name = NULL) %>% 
  unnest()

# peek into the first 3 results
head(unnested_gs, n = 3L)

The code we wrote successfully scraped the titles from GS. However, an apparent issue is that we are unable to get the links to the actual papers. Furthermore, the articles are hosted across many different publisher sites, which complicates the generalizability of our functions. Additionally, we are unable to scrape just the year of publication, which further limits this method.

Given the apparent effort required to scrape GS at scale, we decided to stop using GS and tried scraping PubMed instead.

Scraping from PubMed

Attempting to scrape PubMed results for our search quickly came to a halt: we noticed that as we paged through results, the URL does NOT change or provide a page number. We quickly scrapped this as a viable method to obtain our data.

Scraping from clinicaltrials.gov

rclinicaltrials library

rclinicaltrials is a library meant to serve as an R interface to clinicaltrials.gov! It allows you to perform basic and advanced searches, query the database, and then download study information in a "useful" format. The author even included a vignette that served as a helpful example during initial testing.

Unfortunately, it ended up not being useful for our purposes. The package's last commit was in early 2017, and it looks like it is no longer fully supported. There are still uses for this package, as it can quickly grab information from clinicaltrials.gov and put it into an R object, but it was not immediately apparent how it could be used in our situation.

The data from the clinicaltrials_download function came back as a list-column with plenty of descriptive information. Unfortunately, the outcomes section did not include the actual n in each participant arm, and the structure of the data was inconsistent between trials. It ended up being more of a hassle to use this package than the other methods we learned in class.

clinicaltrials.gov trial and error example code:

# install_github("sachsmc/rclinicaltrials")
library(devtools)
library(rclinicaltrials)

test <- clinicaltrials_download(query = 'asthma', count = 10, include_results = TRUE)
test$study_information$outcomes
myoutcomes <- test$study_information$outcomes
head(myoutcomes)
# test$study_information
# doing a query of multiple clinical trials returns a list-column

test2 <- clinicaltrials_download(query = 'NCT01123083', count = 1, include_results = TRUE)
test2$study_results$outcome_data$measure
outcomes2 <- test2$study_results$outcome_data
head(outcomes2)
# information is not exactly in a great format

test3 <- clinicaltrials_download(query = 'NCT00195702', count = 1, include_results = TRUE)
outcomes3 <- test3$study_results$outcome_data
baseline3 <- test3$study_results$baseline_data
# information is inconsistent between trials



Scraping via clinical trial ID

Our last bastion, clinicaltrials.gov, was successful (in some ways). We ended up using an advanced search to obtain specific trial IDs and then used those IDs to scrape the relevant content from clinicaltrials.gov.

We first downloaded a .csv file from clinicaltrials.gov that reflected our advanced search options:

  • phase III trials
  • completed trials
  • completed in 2017 or earlier
  • "monoclonal antibodies" as the search term

After obtaining this file, we read it into R and cleaned up the relevant parts. In particular, we had to clean the provided URL, as we are only interested in the trial ID. Each clinical trial on the website has a unique ClinicalTrials.gov Identifier, which is used to distinguish trials; we use this identifier going forward to scrape data. We further filtered the trials to those that have a "Placebo" arm.

ctgov_scrape_test <- read_csv("./data/ctgov_test_API.csv") %>%      # read csv
  janitor::clean_names() %>% 
  mutate(
    url = str_replace(url, "https://ClinicalTrials.gov/show/", "")    # keep trial ID only
  ) %>% 
  rename("nct_id" = url) %>% 
  select(rank, nct_id, title, conditions, interventions) %>%          # remove location; irrelevant
  filter(str_detect(interventions, "Placebo"))


## disabled code to check distinct categories of potential interest:
#  ctgov_scrape_test %>% 
#  distinct(conditions) %>% 
#  view()

Now our "database" of relevant clinical trials is ready; the raw data contains 60 trials. Before we continue, we need to obtain the html tags from clinicaltrials.gov that contain our variables of interest. With the help of SelectorGadget, we managed to isolate the following tags:

  • #EXPAND-outcome-data-1 .labelSubtle : the label within the table that should read "Unit of Measure: Participants"
  • #EXPAND-outcome-data-1 td.de-outcomeLabelCell : the arm descriptions for the primary outcome
  • #EXPAND-outcome-data-1 tbody:nth-child(2) th : the description rows of the table

These tags will be used to identify certain cells in the messy table within clinicaltrials.gov.

An example trial from clinicaltrials.gov is shown below:

Testing Data Scraping and Transformation to Dataframe

The goal of this portion is to iteratively grab primary outcome data from clinicaltrials.gov, determine whether the type of data allows us to calculate a fragility index, then tidy the data into a tibble so that we can actually calculate it.

We started by testing our code to create a dataframe from a single source. It took quite a bit of trial and error to obtain the right form.

First, we attempt to find an identifier relevant to our project that may help us automate determining which trials have data eligible for calculating FI.

# test to obtain the particular unit of measure; "participants"
test_url_ctgov <- read_html("https://ClinicalTrials.gov/show/results/NCT00195702") %>%
  html_nodes("#EXPAND-outcome-data-1 .labelSubtle") %>% 
  html_text() %>% 
  str_replace("Unit of Measure: ", "")

This returns the "unit of measure" of the primary outcome of a given clinical trial. We are particularly looking for outcomes measured in "participants". We do this because we want trials whose primary endpoint involves a standard 2x2 table, on which we can run Fisher's exact test, the test underlying the fragility index calculation (a minimal sketch of that calculation follows below). Through trial and error, we found "unit of measure" to be the best proxy for filtering the data.
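For intuition, here is a minimal sketch of the fragility index idea for a single 2x2 table. This is our own illustration of one common convention (converting placebo non-events to events until significance is lost), not the fragilityindex package implementation we use later:

# minimal FI sketch: count how many event-status flips erase significance
fi_sketch <- function(event_trt, n_trt, event_plc, n_plc, alpha = 0.05) {
  # Fisher's exact p-value for the current 2x2 table
  p_value = function(e_t, e_p) {
    fisher.test(matrix(c(e_t, n_trt - e_t, e_p, n_plc - e_p),
                       nrow = 2, byrow = TRUE))$p.value
  }
  flips = 0
  # flip placebo non-events to events, one patient at a time
  while (p_value(event_trt, event_plc) < alpha && event_plc < n_plc) {
    event_plc = event_plc + 1
    flips = flips + 1
  }
  flips   # 0 if the result was non-significant to begin with
}

# e.g. the first treatment arm vs placebo from the NCT00195702 table scraped below:
# fi_sketch(event_trt = 129, n_trt = 212, event_plc = 59, n_plc = 200)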

Having found a way to detect this, we continued to build our function under the assumption that we have “participants” as our unit of measure.

We obtained the relevant column names:

# pulling the relevant column names that corresponds to the eventual numbers we pulled.
# this always has the first term be "\n    Arm/Group Title \n  ", which we will ignore
col_names_test <- read_html("https://ClinicalTrials.gov/show/results/NCT00195702") %>% 
  html_nodes("#EXPAND-outcome-data-1 tbody:nth-child(2) tr:nth-child(1) .de-outcomeLabelCell") %>%
  html_text()

col_names_test
## [1] "\n    Arm/Group Title \n  "         
## [2] "\n      DB Adalimumab 20 mg ew    " 
## [3] "\n      DB Adalimumab 40 mg Eow    "
## [4] "\n      DB Placebo ew    "

The row names, in case we need them as well:

# obtains relevant row names
# ideally this will be participants and the specified units of measure
row_names_test <- read_html("https://ClinicalTrials.gov/show/results/NCT00195702") %>% 
  html_nodes("#EXPAND-outcome-data-1 .labelSubtle , #EXPAND-outcome-data-1 tbody:nth-child(2) tr:nth-child(4) .de-outcomeLabelCell") %>%
  html_text()

row_names_test
## [1] "\n    Overall Number of Participants Analyzed   "
## [2] "Unit of Measure: Participants"

And finally, our most important component: the number of participants in each group. The code chunks above and below, combined, detect the number of rows and columns on any clinicaltrials.gov results page, grab the available data in a vector, then recreate the table in a tidy fashion within R!

# scrape the actual data, with variable table length
content_test <- read_html("https://ClinicalTrials.gov/show/results/NCT00195702") %>% 
  html_nodes("#EXPAND-outcome-data-1 .de-numValue_outcomeDataCell") %>%
  html_text() %>% 
  as.numeric() %>%                           # cleans our data to contain just the numbers
  matrix(ncol = length(col_names_test[-1]),  # make into a matrix of correct size 
         nrow = length(row_names_test), 
         byrow = TRUE) %>% 
  as_tibble()                                # convert the matrix into a tibble

# add column names to our df
names(content_test) <- col_names_test[-1]

# adds rownames to our tibbles
# a tibble doesn't show rownames when you call it, but they are saved in R
rownames(content_test) <- row_names_test

content_test
## # A tibble: 2 x 3
##   `\n      DB Adalimumab 2… `\n      DB Adalimumab 40… `\n      DB Placebo…
## *                     <dbl>                      <dbl>                <dbl>
## 1                       212                        207                  200
## 2                       129                        131                   59

Though it took a while, our code managed to pull the relevant info that we need in order to evaluate our fragility indexes. Our next step, then, is to map this and create a list-column within our original dataframe.

Building the function with the proper conditions and mapping

Similar to how we approached the single-url code, it took various rounds of trial and error to arrive at the function. We start with a simple object that stores our base url and the ClinicalTrials.gov Identifier. We iteratively go to each website, extract the "unit of measure" of the trial's primary outcome, and evaluate whether it contains "participants" or "patients"; we scrape if TRUE and return NA if FALSE.

A slight difference from before is that we split the content scraping and the dataframe transformation into two separate steps, to clarify the segmentation.

# function to scrape from url
outcome_extractor <- function(data) {
  
  # stores our list of urls
  site = read_html(str_c("https://ClinicalTrials.gov/show/results/", data))
  
  # pulls our "measure" detector
  measure = site %>%
    html_nodes("#EXPAND-outcome-data-1 .labelSubtle") %>% 
    html_text() %>% 
    str_replace("Unit of Measure: ", "")
  
  # if/else function that skips to the next url if our measure does not contain "participants" or "patients"
  if (length(measure) == 0) {      # added after bug found; accounts for length = 0
    NA
  } else if (!str_detect(measure, "[pP]articipants|[pP]atients")) {
    NA            # returns NA if it doesn't exist
  } else {
    
    # pulls relevant column names
    col_names = site %>%
      html_nodes("#EXPAND-outcome-data-1 tbody:nth-child(2) tr:nth-child(1) .de-outcomeLabelCell") %>%
      html_text()
    
    # pulls row names (might be unnecessary)
    row_names = site %>% 
      html_nodes("#EXPAND-outcome-data-1 .labelSubtle , #EXPAND-outcome-data-1 tbody:nth-child(2) tr:nth-child(4) .de-outcomeLabelCell") %>% 
      html_text()
    
    # pulls the actual numbers (content)
    content = site %>% 
      html_nodes(".de-numValue_outcomeDataCell") %>% 
      html_text() %>% 
      as.numeric()
    
    # makes a dataframe using matrix
    df = matrix(content, 
                ncol = length(col_names[-1]), 
                nrow = length(row_names), 
                byrow = TRUE) %>% 
      as_tibble() 
    
    # adds column names
    names(df) <- col_names[-1]
    rownames(df) <- row_names
    
    # returns the df output
    df
  }
}

We then test this by taking our .csv file, ctgov_test_API.csv, located inside our data folder (loaded earlier), and mapping our newly made function over it. A critical step we discovered after testing is that we need to filter out the NAs for our list-column to appear properly.

# testing our function using purrr::map function.
mapped_ctgov_scrape_test <- ctgov_scrape_test %>% 
  mutate(
    scrape_data = map(nct_id, outcome_extractor)
  ) %>% 
  filter(!is.na(scrape_data))               # necessary step to clear out the NA rows.

We then checked which trials from our original file (with the stricter search criteria) actually contain a 2x2 table, but we only found 6 (uh-oh); a rough version of that check is sketched below. After realizing the situation we were in (yikes) and how little data we were likely to get under the initial hypothesis (not enough), we expanded our search to any completed phase 3 clinical trial.
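This is a rough reconstruction of that check, not necessarily the exact code we ran at the time; the same row/column filter is applied to the larger dataset further down:

# rough sketch of the 2x2 check on the small test set
mapped_ctgov_scrape_test %>% 
  mutate(
    nrow = map_dbl(scrape_data, nrow),        # rows in each scraped table
    ncol = map_dbl(scrape_data, ncol)         # columns in each scraped table
  ) %>% 
  filter(nrow == 2, ncol == 2) %>% 
  nrow()                                      # number of trials with a clean 2x2 table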

Mapping our 10,000-trial dataset

Now that we have expanded our scope, we use our working function on our new data source, ctgov_10k_API. This, however, takes about 30-40 minutes to run.

Note: During one of our tries, we realized that there is an extra case in the "outcome measures" that we had failed to take into account: character(0). This is different from NA, and we had to debug for a while before the function worked. The fix was to add another condition, if (length(measure) == 0), which returns NA if TRUE.
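To make the failure mode concrete, here is a small illustration (not part of the pipeline itself):

# nodes that match nothing come back as character(0), not NA
empty <- character(0)
is.na(empty)                        # logical(0): an NA check alone never triggers
str_detect(empty, "participants")   # also logical(0)...
# ...and if() on a zero-length condition errors with "argument is of length zero",
# hence the added guard if (length(measure) == 0) in the function above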

ctgov_10k_df <- read_csv("./data/ctgov_10k_API.csv") %>%              # read csv
  janitor::clean_names() %>%
  mutate(
    url = str_replace(url, "https://ClinicalTrials.gov/show/", "")    # keep trial ID only
  ) %>% 
  rename("nct_id" = url) %>% 
  select(rank, nct_id, title, conditions, interventions) %>%     
  filter(str_detect(interventions, "Placebo"))

# testing our function using purrr::map function.
ctgov_10k_scraped <- ctgov_10k_df %>% 
  mutate(
    scrape_data = map(nct_id, outcome_extractor)
  ) %>% 
  filter(!is.na(scrape_data))               # necessary step to clear out the NA rows.

After all this, our resulting viable dataset contains 620 trials, each with a scraped primary outcome of interest. There are roughly 394 unique conditions, found using the distinct() function; a quick version of that check is shown below.
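This is our own reconstruction of that check, not necessarily the exact code we ran at the time:

# number of distinct condition strings among the scraped trials
ctgov_10k_scraped %>% 
  distinct(conditions) %>% 
  nrow()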

Cleaning the Scraped Dataset

After scraping the hard numbers into the scrape_data column, we found that, because of inconsistencies in the clinicaltrials.gov output, the html node does not always extract the information we want. Therefore, we had no choice but to filter out of our dataset the trials that did not capture this info (i.e., contain NA).

# flag scraped tables that still contain NA values
ctgov_10k_scraped <- ctgov_10k_scraped %>% 
  mutate(
    nainside = map_chr(scrape_data, anyNA))     # map_chr coerces the logical flag to "TRUE"/"FALSE"

# show the count of those with NA vs without
ctgov_10k_scraped %>% 
  count(nainside)
## # A tibble: 2 x 2
##   nainside     n
##   <chr>    <int>
## 1 FALSE      274
## 2 TRUE       346

As shown above, we have 274 intact datasets for FI analysis. We further filter these by selecting only the datasets that have a number of participants as the unit of measure.

# select those doesn't contain NAs
clean_ctgov_df <- ctgov_10k_scraped %>% 
  filter(nainside == FALSE)

# function to obtain unit of measure
unit_detector = function(data) {
  
  site = read_html(str_c("https://ClinicalTrials.gov/show/results/", data))  # stores url
  
  measure = site %>%                                                         # grabs unit of measure
    html_nodes("#EXPAND-outcome-data-1 .labelSubtle") %>% 
    html_text() %>% 
    str_replace("Unit of Measure:", "")
  
  return(measure)                                                            # returns as character
}

# adding unit of measure to the dataset
clean_ctgov_df <- clean_ctgov_df %>% 
  mutate(
    units = map_chr(nct_id, unit_detector)
  )

# remove those containing percent/proportions in their unit of measure
# as well as those that are not 2x2 tables
new_df <- clean_ctgov_df %>% 
  filter(!str_detect(units, "[pP]ercent|[pP]ercentage|[pP]roportion")) %>% 
  mutate(
    nrow = map_dbl(scrape_data, nrow),                  # counts rows in each of our scraped data
    ncol = map_dbl(scrape_data, ncol)                   # counts columns in each of our scraped data
  ) %>% 
  filter(ncol == 2 & nrow == 2)

After all of this, we now have trials with number of patients as the unit of measure, and we have further filtered out those with complicated results (more than 2 arms). The resulting dataset has 39 trials. A quick distinct() suggests each trial addresses a unique condition/disease.

We then tried to extract, modify, and access the dataframe we made for each trial, but despite numerous attempts, nothing worked. This is likely due to how purrr::map behaves with lists versus tibbles. Therefore, we had no choice but to manually type out the 39 studies and save them as a new csv file. Since we made the file by hand, we made sure to split the trial arms into separate columns; the expected layout is sketched below.
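For reference, the hand-entered file follows the layout sketched here. The column names match the variable list later in this section; the single row shown is purely hypothetical and is included only to illustrate the format:

# assumed layout of FI2by2_manual.csv (illustration only; values are made up)
tribble(
  ~nct_id,       ~total_number_trt, ~total_number_placebo, ~effective_trt, ~effective_placebo,
  "NCTxxxxxxxx", 200,               200,                   120,            60
)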

# some miracle happened!!!
manual <- read_csv("./data/FI2by2_manual.csv")

# rejoin the results back to the original data
final_df <- left_join(new_df, manual, by = "nct_id") %>% 
  select(everything(), -nrow, -ncol, -nainside, -scrape_data)

We thought we were safe, but more issues awaited. Mapping the function fragilityindex::fragility.index (R documentation here) appears to use a lot of "workspace" in RStudio, thereby preventing us from mapping fragility.index over each row. As a result, we had to write a for loop.

library(fragilityindex)          # provides fragility.index()

# stores nct_id
nct = c()

# stores the result of the fragility.index function
FI = c()

# the loop function
for (i in 1:nrow(final_df)) {
  # stores nct_id of each
  nct[i] = final_df$nct_id[i]
  
  # stores the FI result
  FI[i] = fragility.index(intervention_event = final_df$effective_trt[i], 
                          control_event = final_df$effective_placebo[i], 
                          intervention_n = final_df$total_number_trt[i],  
                          control_n = final_df$total_number_placebo[i])
} 

# temporary dataframe to turn our loop result into a df 
temp_df <- tibble(
  nct_id = nct,
  FI = as.numeric(FI)
)

# left join the final results
clean_final_df <- left_join(final_df, temp_df, by = "nct_id") 

After all of the above, we finally have a dataframe with just enough data to take a preliminary look at the fragility index. Our final dataset contains 39 trials and 10 relevant variables:

  • nct_id: trial ID from clinicaltrials.gov
  • title: title for the trial submitted to clinicaltrials.gov
  • conditions: trial’s condition/disease of interest
  • interventions: interventions done in the trial
  • units: units of measure (filtered for number of participants)
  • total_number_trt: sample size in the treatment group
  • total_number_placebo: sample size in the control group
  • effective_trt: the number of participants who developed a response in the treatment group
  • effective_placebo: the number of participants who developed a response in the control group
  • FI: fragility index calculated using fragilityindex::fragility.index

EDA for Fragility Index

summary(clean_final_df$FI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   6.949   1.000  67.000

A quick summary() reveals another, unexpected issue: 28 of our 39 trials have an FI of 0, which is about 71.79% of our data. This is likely because these trials already had statistically non-significant results in the first place. The count behind this figure is sketched below.
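This is a reconstruction of the count along the lines we used, not necessarily the exact code:

# number and percentage of trials with FI == 0
clean_final_df %>% 
  summarize(
    n_zero = sum(FI == 0),
    pct_zero = mean(FI == 0) * 100
  )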

Given the limited data we obtained, we could only create a general descriptive table of FI by disease condition.

# quick review of available FI data > 0
clean_final_df %>% 
  select(title, conditions, interventions, FI) %>% 
  filter(FI != 0) %>% 
  arrange(desc(FI)) %>% 
  knitr::kable("html",  
         caption = "Clinical Trials and their Fragility Index")
Clinical Trials and their Fragility Index

  • Title: Effects of Ivabradine on Cardiovascular Events in Patients With Moderate to Severe Chronic Heart Failure and Left Ventricular Systolic Dysfunction. A Three-year International Multicentre Study; Conditions: Chronic Heart Failure; Interventions: Drug: Ivabradine|Drug: Placebo; FI: 67
  • Title: Study of Secukinumab With 2 mL Pre-filled Syringes; Conditions: Psoriasis; Interventions: Drug: Placebo|Drug: Secukinumab 2 mL form|Drug: Secukinumab 1 mL form; FI: 54
  • Title: Efficacy, Safety, and Immunogenicity of V260 in Healthy Chinese Infants (V260-024); Conditions: Rotavirus Gastroenteritis; Interventions: Biological: V260|Biological: Placebo to V260|Biological: OPV|Biological: DTaP; FI: 48
  • Title: A Study of Ixekizumab (LY2439821) in Participants With Moderate-to-Severe Genital Psoriasis; Conditions: Genital Psoriasis|Psoriasis; Interventions: Drug: Ixekizumab|Drug: Placebo; FI: 37
  • Title: Does Intraoperative Clonidine Reduce Post Operative Agitation in Children?; Conditions: Psychomotor Agitation; Interventions: Drug: Clonidine|Drug: Placebo; FI: 23
  • Title: Prevention of Tuberculosis in Prisons; Conditions: Tuberculosis, Pulmonary|Antibiotic Prophylaxis; Interventions: Drug: Isoniazid 900 milligrams|Drug: Placebo; FI: 18
  • Title: A Study of Eltrombopag or Placebo in Combination With Azacitidine in Subjects With International Prognostic Scoring System (IPSS) Intermediate-1, Intermediate-2 or High-risk Myelodysplastic Syndromes (MDS); Conditions: Thrombocytopaenia; Interventions: Drug: Eltrombopag|Drug: Azacitidine|Drug: Placebo; FI: 11
  • Title: Safety, Pharmacokinetics, and Preliminary Efficacy Study of CDZ173 in Patients With Primary Sjögren’s Syndrome; Conditions: Primary Sjögren’s Syndrome; Interventions: Drug: CDZ173|Drug: Placebo; FI: 8
  • Title: Safety and Efficacy of Vanoxerine for the Conversion of Subjects With Recent Onset Atrial Fibrillation or Flutter to Normal Sinus Rhythm; Conditions: Atrial Fibrillation or Flutter; Interventions: Drug: Vanoxerine HCl|Drug: Placebo; FI: 3
  • Title: The Effect of Vitamin D Supplementation Among Overweight Jordanian Women With Polycystic Ovary Syndrome (PCOS); Conditions: Polycystic Ovary Syndrome|Hypovitaminosis D; Interventions: Drug: 50,000 IU vitamin D3|Drug: Placebo; FI: 1
  • Title: Phase 2 Study of XAF5 (XOPH5) Ointment for Reduction of Excess Eyelid Fat (Steatoblepharon); Conditions: Lower Eyelid Steatoblepharon (Excess Eyelid Fat); Interventions: Drug: XOPH5 Ointment|Drug: Placebo; FI: 1

A visualization of this “trend” in FI is shown below:

# barplot according to corresponding disease/ treatment
clean_final_df %>% 
  filter(FI != 0) %>% 
  arrange(desc(FI)) %>% 
  ggplot(aes(x = reorder(conditions, FI), y = FI)) +
  geom_bar(stat = "identity", color = "steelblue2", fill = "steelblue2") +
  coord_flip() +                                              # flip x-y to better "stratify" categories
  theme(legend.position = "none",                             # remove legends
        axis.text.y = 
          element_text(hjust = 1, vjust = 0.5,                # adjusts the hor/ver alignment of y-variables
                       size = 8,                              # shrink the label text size
                       margin = margin(0, -10, 0, 0)),        # reduce margin to close gap between label/graph
        panel.grid.major.y = element_blank()                  # remove horizontal lines to improve readability
        ) + 
  labs(x = "Disease", y = "Fragility Index",
       caption = "Fragility index by disease")



Quest for More: All Data from clinicaltrials.gov

The dataset we collected above was light in the sense that our scraping method did not gather other descriptive information, and we needed more data to graph the information required for this project. Here we pivoted, again, to look at phase III clinical trials more generally. Our group members are interested in clinical trials, which, despite their importance to US healthcare, seem shrouded in mystery. The fragility index was a very pointed question about an area of our healthcare system we do not know much about, and there is a lot of general information about trials, unknown to our group, that might be interesting to look at.

So, to supplement our scraping of clinical trial data, we decided to download every clinical trial from clinicaltrials.gov via its XML records. This resulted in an 800 MB csv of descriptive information about each trial. The dataset was quite large, and crashed multiple members' computers multiple times.

There were many questions we could answer with this data, although none of them concern the fragility index; instead, this data gave us a glimpse into the field as a whole.
Questions we looked at include:

  1. Where are clinical trials being done in the United States?
  2. How many trials are done per year? Has it changed much year to year?
  3. Who sponsors clinical trials?
  4. Other interesting tidbits

We filtered our scraped XML for all US-based studies. The data extraction and processing code is shown below; it is not evaluated here, as the file is WAY TOO BIG.

######### step 1

library(XML)    # provides xmlParse, xmlRoot, and xmlValue used in step 3

# get all names of the downloaded xml files
get_file_names = function(){
  file_list = tibble(
    file_name = list.files("/Users/adobel/Desktop/Columbia/data science/homework/rclinicaltrials/AllPublicXML", recursive = T)
  ) 
} 
# the xml files are too large to be submitted to github, so we kept the directory local

############ step 2

# build a dataframe for future mapping
all_files = get_file_names() %>%
  separate(file_name,c("1","file_name")) %>%
  select('file_name') %>%
  .[-1,]

### step 3
### build a function that automatically reads and parses each xml file, and organizes the information into a dataframe

xml_reader = function(file_name){
  file = xmlParse(str_c("/Users/adobel/Desktop/Columbia/data science/homework/rclinicaltrials/AllPublicXML/",
                        str_sub(file_name, start = 1, end = 7), "xxxx/",
                        file_name, ".xml"))  # directory of each file
#########
  xmltop = xmlRoot(file) # get nodes
######### get interested variables
  
  new_tbl = tibble(
    overall_status = xmlValue(xmltop[['overall_status']]),
    phase = xmlValue(xmltop[['phase']]),
    study_type = xmlValue(xmltop[['study_type']]),
    masking = xmlValue(xmltop[['study_design_info']][['masking']]),
    primary_outcome = xmlValue(xmltop[['primary_outcome']]),
    sponsors =  xmlValue(xmltop[['sponsors']]),
    start_date = xmlValue(xmltop[['start_date']]),
    completion_date = xmlValue(xmltop[['completion_date']]),
    primary_completion_date = xmlValue(xmltop[['primary_completion_date']]),
    observational_model = xmlValue(xmltop[['study_design_info']][['observational_model']]),
    time_perspective = xmlValue(xmltop[['study_design_info']][['time_perspective']]),
    measure = xmlValue(xmltop[['primary_outcome']][['measure']]),
    time_frame = xmlValue(xmltop[['primary_outcome']][['time_frame']]),
    enrollment = xmlValue(xmltop[['enrollment']]),
    study_pop = xmlValue(xmltop[['eligibility']][['study_pop']]),
    sampling_method = xmlValue(xmltop[['eligibility']][['sampling_method']]),
    criteria = xmlValue(xmltop[['eligibility']][['criteria']]),
    gender = xmlValue(xmltop[['eligibility']][['gender']]),
    minimum_age = xmlValue(xmltop[['eligibility']][['minimum_age']]),
    maximum_age = xmlValue(xmltop[['eligibility']][['maximum_age']]),
    healthy_volunteers = xmlValue(xmltop[['eligibility']][['healthy_volunteers']]),
    study_first_submitted = xmlValue(xmltop[['study_first_submitted']]),
    study_first_posted = xmlValue(xmltop[['study_first_posted']]),
    location_city = xmlValue(xmltop[['location']][['facility']][['address']][['city']]),
    location_state = xmlValue(xmltop[['location']][['facility']][['address']][['state']]),
    location_country = xmlValue(xmltop[['location']][['facility']][['address']][['country']]),
    agency_class = xmlValue(xmltop[['sponsors']][['lead_sponsor']][['agency_class']])
    )
}

# build dataset
data_from_xml = map_df(all_files$file_name,xml_reader)
data_from_xml = bind_cols(all_files,data_from_xml)
data_from_xml_cleaned = data_from_xml %>% 
  drop_na() %>%
  mutate(file_group = str_c( str_sub(file_name,start = 1, end = 7), "xxxx"))

# save files to local pathway
write_csv(data_from_xml,"./data_from_xml.csv")
write_csv(data_from_xml_cleaned ,"./data_from_xml_cleaned.csv")

# get data only for US studies
test_us =
  data_from_xml %>%
  filter(location_country == "United States") %>%
  mutate(location_state = recode(location_state,
                                 "Missouri" = "Missouri State"))

This dataset includes tidied descriptive information about every US-based clinical trial.

Data Summary and Results

To summarize, we have 2 datasets that we used for further analysis and exploration:

  1. A dataset created by scraping the html of clinicaltrials.gov, which gave us trials eligible for calculating the fragility index, along with supporting information about each trial.
    • This dataset started with 10,000 phase 3 clinical trials completed before 2017.
    • 1,950 of these included a placebo arm; of those, 620 had “Participant” as the unit of measure; of those, 274 had complete data; of those, 120 actually counted participants; and of those, 39 had a 2x2 table.
    • 11 of the 39 had a non-zero fragility index.
  2. A separate dataset created by extracting information from the XML records to explore general trends in clinical trials.

The main product of our project consists of four parts:

  1. A custom-made dataset generated from clinicaltrials.gov, including the respective FIs
  2. A collection of all available clinical trial data from clinicaltrials.gov
    • Containing general characteristics of clinical trials
  3. Interactive plots utilizing the data above
  4. A functional FI calculator for meta-analysis

Interactive Plot of Clinical Trials Characteristics

Given the collection of all available clinical trial data, we were able to integrate a set of interactive plots.

On the first tab, Spatial Distribution, we can see the distribution of the number of clinical trials across the country; clicking on the circles drills down to more specific geographic locations. The checkboxes on the left let us filter according to type of study: interventional, observational, patient registry, and expanded access.

On the right side, by selecting different agency classes, we can filter by the study sponsor: US Federal, NIH, Industry, or Other. The histogram on the left interacts with the selection on the right and ranks sponsors by the number of trials in the US, from highest to lowest.

The time trend plot below them lets us compare the number of completed/ongoing clinical trials sponsored by the different agency classes.

On the second tab, Sponsor Information, we've collected information on every sponsor involved in clinical trials: their registered name, the number of studies they have sponsored, the total number of participants enrolled in their studies, the average number of participants per trial, and the minimum and maximum number of participants enrolled in a single study.

Functional FI calculator for meta-analysis

We initially wanted to create a custom FI calculator. Fortunately, this already exists in an R package, fragilityindex. Additionally, we found a custom meta-analysis fragility index function on GitHub, which can be found here. We incorporated this into our website and added interactivity so that users can input their own dataset and different parameters.

As shown above, the interactive parameters of the function include:

  1. Upload meta-analysis data: takes csv files in a specific format
  2. Pooling method for the meta-analysis: options include Mantel-Haenszel, Inverse Variance, and Peto
  3. Measure: the statistic used to assess the results and its corresponding p-value. Options include Odds Ratio, Risk Difference, and Risk Ratio
  4. Effects model: Fixed or Random effects

All of these parameters are handled by the function we found. Although we have not learned about all of these methods yet, they are interactive! As a rough illustration, a minimal sketch of what the Mantel-Haenszel pooling option does is shown below.
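This is not the calculator's actual code, only an illustration of how Mantel-Haenszel pooling combines the 2x2 tables of individual studies into a single odds ratio; the argument names here are hypothetical:

# minimal sketch of a Mantel-Haenszel pooled odds ratio across studies
mh_odds_ratio <- function(event_trt, n_trt, event_ctrl, n_ctrl) {
  ai = event_trt                 # events, treatment arm
  bi = n_trt - event_trt         # non-events, treatment arm
  ci = event_ctrl                # events, control arm
  di = n_ctrl - event_ctrl       # non-events, control arm
  ni = n_trt + n_ctrl            # study sample sizes
  sum(ai * di / ni) / sum(bi * ci / ni)
}

# toy example with two made-up studies
mh_odds_ratio(event_trt = c(20, 35), n_trt = c(100, 120),
              event_ctrl = c(10, 20), n_ctrl = c(100, 115))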

Discussion

Overall, we were not able to answer any of the questions we posed in our initial proposal; we simply could not gather data capable of answering them. Thankfully, there are many insights we were able to draw from the data we did collect!

Main Data: Fragility Index

This project showed how difficult scraping data from a website can be when the data has a varied structure. It was much, much more difficult to scrape and tidy the data than we originally anticipated, even with the tools we had available.

We learned a lot about the data structure of clinicaltrials.gov and about phase 3 clinical trials in general. We were surprised by how relatively messy the data structure of clinicaltrials.gov is. The main primary outcome table was inconsistent in its use of white space in the html, which caused us an immense number of issues. Thankfully, clinicaltrials.gov is updating its API, which should improve the structure of this data. We learned the hard way the difference between publicly available data and tidy, usable publicly available data. Tidying complex data is itself incredibly complex.

From our data and project, we came away with a few insights. First, there is a vast amount of variation and complexity across clinical trials. Judging by our data, a real-life phase 3 trial and its primary outcome are rarely simple and straightforward, and attempting to impose any general structure currently seems nearly impossible. This is humbling and motivating.

The fragility index is not the most quantitatively complex or descriptive metric. However, it is not meant to be. The literature suggests that the fragility index is an easy-to-synthesize metric that could be reported alongside a p-value to improve biostatistical literacy in clinical and research settings. The target audience for the FI, if it is accepted by the community, is not biostatisticians but clinicians. This brings up a broader conversation discussed in our M.S. program here at Columbia: as we enter our careers, we will likely work with individuals who have expertise in other fields and rely on our knowledge of biostatistics, and it is part of our job to disseminate biostatistical information in a way that is most easily absorbed by our team members.

Having said that, our limited data appear to show that well-explored chronic diseases have more stable results, as reflected in their higher FIs. For instance, the trial on CHF has the highest FI among our findings. On the other hand, less-explored chronic diseases, and even acute diseases, have lower FIs. Also note that the diseases in our data are all non-cancers, which suggests the potential for FI to be used outside of oncology.

Secondary Data: Clinical Trials Dataset

Surprisingly, there were far fewer issues downloading every clinical trial and graphing different characteristics. At this more "bird's-eye" level of describing a study, the data are much more consistent and tidy, allowing for easier manipulation and usage.

Seeing the distribution of trials across the US shows what one might guess: there are generally more trials in metropolitan areas with large cities and relevant academic institutions. As the NIH is located in Maryland, many NIH-sponsored trials are associated with a location in Maryland. California is the state with by far the most trials overall. Surprisingly, Alabama has the 2nd-highest number of studies according to our data; none of the group members would have guessed Alabama would be in the top 2 states, let alone above average. Another interesting finding is that observational studies with patient registries are highly clustered in major cities. Looking at the time trend, we can also see that industry has had many more completed/ongoing clinical trials than the NIH since 2000.

When we order all the sponsors by the number of studies they have supported, we see a healthy mix of governmental agencies (e.g., NCI), industry (e.g., Pfizer), and academic institutions (e.g., Stanford University). Memorial Sloan Kettering Cancer Center, one of our affiliated institutions, has the 20th-highest number of studies recorded, at 530. Columbia University itself ranks 62nd, with 249 studies recorded.

 

A work by Bryan Bunning, Yuanzhi Yu, Zongchao Liu, Gavin Ko, and Kevin S.W.

© Copyright 2019 The Fragility Index Project Team, All Rights Reserved | Powered by Github.io