Winning the Citadel C1 PhD Datathon 2024

Overview

  • In Nov 2024, I signed up to participate in the Citadel Correlation One Global PhD Datathon. The topic for the week-long event was “Investigating a Decade of Clinical Trials within the US”.
  • I conducted a large-scale statistical analysis of the data integrity of 10 years of US clinical trials published on clinicaltrials.gov.
  • My submission won 3rd place and was praised for its highly practical problem formulation and rigorous analysis.

What is the competition and why did I sign up?

  • The Citadel/C1 Global PhD Datathon is an annual data competition which brings together ~30 PhD students from universities around the world to tackle real-world social challenges in areas such as urban traffic, renewable energy, and education.
  • The competition was one of many postings I had fired off applications to without expectation. When I received an invitation to participate, I had mixed feelings about whether and how to make time for the event within my demanding research schedule.
  • Still, I had been looking for opportunities to experiment with real data and thought it could be a good chance to expand my comfort zone.

The prompt

  • The competition topic, “Investigating a Decade of Clinical Trials within the US”, invited participants to probe US clinical trials from any perspective of their choosing.
  • We were provided a dataset containing nearly all website data available on clinicaltrials.gov: comprehensive public records of US clinical trials conducted from 2014 to 2024.
  • We were also given auxiliary datasets of 10 years of healthcare stock prices and IPEDS data on U.S. universities, and were encouraged to find and use our own custom data.

Initial strategy

  • I started by looking at all the data available to me. I found it fascinating to page through various clinical trial proposals, with detailed measurements of control groups and project outcomes, and explore demographic details of US higher ed. I even wondered if I should seek out “alternative” data as the prompt encouraged.
  • Still, I was ultimately lost on what to do with any of it: what about clinical trials needed “investigation”? I mused about the point of clinical trials, which parts are broken, and what might be improvable.
  • I started reading surveys on challenges in clinical trials. I looked into my PhD advisor’s survey on challenges in peer review and found a thread of papers identifying problems in science evaluation. I wondered if I could tackle problems such as review fraud, economic incentives, or evaluating proposals with uncertain reward.

Proof of concept

  • Along this thread, I found a 2005 paper, “Are these data real?”, which examined a clinical trial suspected of data fabrication. It ran statistical analyses on the trial’s measurements to ask:
    • If participants are randomly assigned to control and treatment groups, their demographic characteristics should look statistically similar. Do they?
  • Not only did I find this paper’s problem interesting, I happily recalled that the NCT dataset reported extensive demographic details about test participants (starting with height, weight, and age, and extending to much more niche trial-specific features) for nearly all of its 22,000 trials.
  • I excitedly tried manually running some of the paper’s statistical tests on a few trials from my dataset: specifically, computing p-values for whether participants had truly been randomly assigned to control and treatment groups. I found that the categorical tests produced meaningful numbers, though the continuous features in my dataset were too variable for their approach to generalize.
  • I decided that I would expand this experiment into a large-scale data integrity analysis for my submission.
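The categorical randomization check described above can be sketched in a few lines. The baseline counts below are made-up illustrative numbers (not actual trial data), and scipy’s `chi2_contingency` stands in for the paper’s categorical tests:

```python
# Sketch of a categorical randomization check: if assignment to arms is
# truly random, baseline characteristics should be statistically similar
# across arms. The counts here are hypothetical, not real NCT records.
from scipy.stats import chi2_contingency

# Counts of one baseline categorical feature (e.g. sex) per study arm:
# rows = categories, columns = (control, treatment).
observed = [
    [48, 52],   # female
    [52, 48],   # male
]

chi2, p, dof, expected = chi2_contingency(observed)
# Under true randomization, p-values across many trials should be roughly
# uniform; a suspiciously small p flags an imbalanced baseline.
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

Run over thousands of trials, the interesting signal is not any single p-value but the distribution of p-values, which should be close to uniform under honest randomization.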

Refining and report draft

  • I set about scaling the chi-squared tests up to run on the full dataset. Even with my handy 2024 chatbot whipping out pandas commands I didn’t care to follow, the data cleaning and error handling cost me a sleepless Saturday night.
  • It was also not obvious whether my numbers meant anything, so I decided to come up with my own second validation test. After trying a few approaches, I settled on a Benford’s Law test.
    • Benford’s Law is the observation that in many real-life sets of numerical data, the leading digit is likely to be small; in fabricated data, leading digits tend to look more uniform. It is often used to detect accounting fraud. I used it to test whether the trials’ outcome measurements were plausibly generated by real measurements.
  • I compared the results of my two tests and explored whether trials from certain authors or institutions were more likely to be flagged by either test. I found that some institutions’ trials were flagged at rates of 30-60%, against an overall flag rate of 4-14%.
  • Results in hand, I whipped together a final report hours before the deadline.
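A minimal version of the Benford check above might look like the following. The function names and sample data are illustrative, not my actual submission code; it compares first-digit frequencies against the Benford distribution with a chi-squared goodness-of-fit test:

```python
# Sketch of a Benford's Law first-digit test, assuming outcome measurements
# have been collected into a flat list of nonzero numbers.
import math
from scipy.stats import chisquare

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number."""
    s = f"{abs(x):.15e}"          # scientific notation, e.g. '2.340...e-03'
    return int(s[0])

def benford_test(values):
    """Chi-squared test of first digits against the Benford distribution."""
    digits = [leading_digit(v) for v in values if v != 0]
    observed = [digits.count(d) for d in range(1, 10)]
    n = len(digits)
    # Benford expected frequency of leading digit d is log10(1 + 1/d).
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    return chisquare(observed, f_exp=expected)

# Illustrative sample: geometric growth tends to follow Benford closely,
# so this should not be flagged (large p). Fabricated, uniform-ish digits
# would drive the statistic up and the p-value down.
sample = [1.7 ** k for k in range(1, 200)]
stat, p = benford_test(sample)
print(f"chi2={stat:.2f}, p={p:.3f}")
```

A very small p-value here means the digit distribution deviates from Benford, which is a flag worth investigating, not proof of fabrication.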

Competition Results

  • I was pleasantly surprised to find that my project won 3rd place in the competition. It was praised for its highly practical problem formulation and rigorous analysis at the awards presentation. I received a $2500 cash prize for my labors.

Reflecting

  • I’m glad I participated. I learned a lot and gained confidence in my skills in exchange for a small time commitment. It also helped highlight my work to potential employers and collaborators.
  • While I don’t stand by the fraud allegations suggested by my hastily compiled report, I do think a statistical “plagiarism checker” for data integrity in clinical trials is an idea with real potential, and something that could potentially be standardized.
  • Clinical research sometimes faces allegations of being rife with faulty trials: paper mills and “big pharma” manipulation (interestingly, my report found no higher flag rate for industry trials than for academic ones!). Similar statistical checkers have been implemented in other fields (e.g. statcheck). I think my project suggests this kind of work is feasible for clinical trials as well.