At the moment, DPUK is collaborating with 47 cohorts, which together hold over two million participant records and over 16,000 variables. Many researchers have only ever been able to access smaller datasets with limited variables, so helping researchers get access to data of this richness and scale is very exciting. However, with 47 cohorts come (almost) 47 different ways of naming variables, different types of tests and scales, and different data collection methods. When comparing data from different cohorts, it can seem a bit of a mess! Researchers need to have the option to compare variables from different datasets when conducting their research, and it’s my job to make sure the data is accurate and usable for dementia research. Essentially, what I’m doing is a bit like knitting the variables together: it’s coding the variables in the same way so that researchers can easily and accurately use them for cross-cohort analysis.
To think that hidden somewhere in these reams of numbers that I work with every day could be the clues to finally cracking dementia is incredibly exciting.
- Josh Bauermeister
Knitting the variables together
I’m re-coding cohorts’ data in accordance with the DPUK ontology, a naming system for variables based on a scientific rationale developed by senior researchers within DPUK. At the top level there are 22 broad categories of data, and my job takes me deep into the detail of each cohort’s data to categorise it. I’m looking at all the variables in the data that each cohort collects. These can be as specific as ‘age asthma diagnosed’ and ‘sandwiches eaten per week’. It’s fascinating to think that somewhere hidden in all this detail is the potential cure for Alzheimer's disease. Taking each cohort’s individual dataset, I then implement the ontology using a STATA script that we have written. STATA is a powerful, industry-standard statistical software package that can be scripted to rename large numbers of variables at the same time.
Standardisation not harmonisation
The scripts we write rename the variables and labels across the full cohort dataset, adding the DPUK-assigned unique cohort identifier, category number and recoded variable names. As the script runs, it then separates out the number of levels in each variable and the data collection time point. Renaming all of the variables with the same information in the same way, according to a consistent naming structure, means that none of the detail of the original cohort’s data is lost. We’re standardising, not harmonising, cohort data. In curating the cohort data in this way, we are enabling better cross-cohort research by maintaining the richness and detail that research in dementia requires.
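To give a flavour of what a renaming step like this involves, here is a minimal sketch in Python rather than STATA. It is not the DPUK script. The name layout (cohort identifier, category number, recoded name, number of levels, time point, joined with underscores), the `VariableSpec` and `standardised_name` names, and the example cohort identifier and category numbers are all assumptions made up for illustration. Only the ingredients themselves and the 22 top-level categories come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    """Metadata for one source variable (every field here is illustrative)."""
    cohort_id: str     # hypothetical cohort identifier, e.g. "C01"
    category: int      # ontology category number, 1 to 22
    recoded_name: str  # short recoded variable name
    levels: int        # number of levels (0 for continuous, by assumption)
    timepoint: int     # data collection time point


def standardised_name(spec: VariableSpec) -> str:
    """Build a name in an assumed layout: cohort_category_name_levels_timepoint."""
    if not 1 <= spec.category <= 22:
        raise ValueError(f"category must be between 1 and 22, got {spec.category}")
    return (
        f"{spec.cohort_id.lower()}_{spec.category:02d}_{spec.recoded_name.lower()}"
        f"_l{spec.levels}_t{spec.timepoint}"
    )


def rename_columns(row: dict, mapping: dict) -> dict:
    """Return a copy of row with mapped keys renamed; values are never changed."""
    return {
        standardised_name(mapping[key]) if key in mapping else key: value
        for key, value in row.items()
    }


# Hypothetical source variable names and category numbers.
mapping = {
    "AGE_ASTHMA": VariableSpec("C01", 5, "asthma_age", 0, 1),
    "sandw_wk": VariableSpec("C01", 12, "sandwiches_week", 0, 2),
}
row = {"AGE_ASTHMA": 7, "sandw_wk": 4, "pid": "0001"}
renamed = rename_columns(row, mapping)
```

The point the sketch tries to capture is that only the names change. The values pass through untouched, which is what separates standardising from harmonising.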
For more information on the Data Portal, go to the Data Portal website.