Organizing tabular data¶
The question of managing tabular (e.g. clinical or behavioural) data is a complex one. The main challenges include:
- Lack of a standardized vocabulary for naming variables
- Asynchrony between tabular and imaging data collection workflows
- Difficulty in defining criteria for data validation
BIDS extension proposal (BEP) 36 is one of the major efforts in this space, aimed at providing guidelines on standardized naming and organization of tabular data. Nipoppy will support and promote this standard once it is merged, and therefore we do not have any strong rules or validations for tabular data management at the moment. Having said that, we do have a few recommendations that align with the BIDS direction and can be helpful in general.
Source (i.e. acquired) data¶
Similar to imaging data, it is good to separate “data collection” from “data curation” tasks for tabular data as well. This way we never modify the acquired source data and only create “clean” curated copies. This is especially useful when your study uses different naming conventions for `participant_id`s and/or `visit_id`s in the tabular vs imaging data. The recommended location for the “collected/acquired” data dump is the `<NIPOPPY_PROJECT_ROOT>/sourcedata/tabular` directory.
Note
If you do have different naming conventions for the clinical visits vs imaging sessions, then you can establish the correct mapping between those (e.g. `V01` <-> `ses-BL`) in the manifest file.
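The separation between collection and curation described above can be sketched as a small script that only reads from the source dump and writes a cleaned copy elsewhere. This is a minimal illustration, not a Nipoppy feature: the function name `curate_source_file`, the source column name `subject`, and the file names are all hypothetical.

```python
import csv
from pathlib import Path


def curate_source_file(source_path, curated_path, id_column="subject"):
    """Write a curated TSV copy of a source CSV file.

    The source file is opened read-only and never modified. The
    hypothetical ID column is renamed to participant_id and
    surrounding whitespace is stripped from every value.
    """
    with open(source_path, newline="") as src:
        reader = csv.DictReader(src)
        rows = []
        for row in reader:
            clean = {"participant_id": (row[id_column] or "").strip()}
            for key, value in row.items():
                if key != id_column:
                    clean[key] = (value or "").strip()
            rows.append(clean)
        other_columns = [f for f in (reader.fieldnames or []) if f != id_column]

    curated_path = Path(curated_path)
    curated_path.parent.mkdir(parents=True, exist_ok=True)
    with open(curated_path, "w", newline="") as dst:
        writer = csv.DictWriter(
            dst, fieldnames=["participant_id"] + other_columns, delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call might look like `curate_source_file(root / "sourcedata" / "tabular" / "dump.csv", root / "tabular" / "demographics.tsv")`, where `root` stands for `<NIPOPPY_PROJECT_ROOT>` and `dump.csv` is a placeholder name.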
Demographic variables¶
For data curation, we begin by writing custom scripts to generate “clean” data files from the source data dump. These files go in the `<NIPOPPY_PROJECT_ROOT>/tabular` directory. Here we usually recommend first creating a `demographics.tsv` file that includes typical demographic variables collected by most studies, such as `date of birth`/`age`, `sex`, `recruitment cohort`, `screening date`, etc. One can also think of this file as the basic participant information recorded at a recruitment / screening visit that is static and does not change over time. However, since Nipoppy does not validate any tabular files, you can include multiple visits per participant here if preferred.
Example demographics TSV file¶
| participant_id | age_at_recruitment | sex | recruitment_cohort | date_of_recruitment |
|---|---|---|---|---|
| 001 | 29 | M | control | 2023-01-15 |
| 002 | 34 | F | control | 2023-01-16 |
| 003 | 28 | M | patient | 2023-01-27 |
| 004 | 45 | F | control | 2023-02-08 |
| 005 | 31 | M | patient | 2023-02-19 |
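If the source data records a date of birth rather than an age, a curation script could derive a column such as `age_at_recruitment` from it. The helper below is only a sketch of that arithmetic; the function name `age_at` is hypothetical, and it assumes age is reported in whole completed years.

```python
from datetime import date


def age_at(date_of_birth: date, reference: date) -> int:
    """Return the number of whole years between date_of_birth and reference."""
    years = reference.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years
```

Storing the derived age instead of the raw date of birth in the curated file is a design choice each study needs to make for itself.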
Behavioural and clinical data¶
For the behavioural or clinical assessments, we create a `<NIPOPPY_PROJECT_ROOT>/tabular/assessments` directory and then generate a single TSV file (e.g. `assessment_A.tsv`) per assessment/instrument. This file contains a separate row per `participant_id` and `visit_id` (or `session_id` if identical). This modularity at the level of assessment is helpful for quality checks and for making corrections or updates. This file organization is also not validated by Nipoppy, so one can come up with alternative ways to organize / split clinical assessment information into separate files as preferred.
Example assessment TSV file¶
| participant_id | visit_id | visit_date | subscore_1 | subscore_2 | subscore_3 | total_score |
|---|---|---|---|---|---|---|
| 001 | bl | 2023-01-20 | 5 | 3 | 4 | 12 |
| 002 | bl | 2023-02-10 | 6 | 2 | 5 | 13 |
| 003 | bl | 2023-02-22 | 4 | 4 | 3 | 11 |
| 004 | bl | 2023-03-03 | 7 | 1 | 6 | 14 |
| 005 | bl | 2023-04-24 | 3 | 5 | 2 | 10 |
| 001 | m12 | 2024-01-27 | 5 | 3 | 4 | 12 |
| 002 | m12 | 2024-02-14 | 6 | 2 | 5 | 13 |
| 003 | m12 | 2024-02-12 | 4 | 4 | 3 | 11 |
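Since Nipoppy does not validate these files, a study may want its own lightweight quality checks. The sketch below shows two possible checks on a per-assessment file: that each `participant_id` and `visit_id` pair appears only once, and that the total equals the sum of the subscores. The second rule is an assumption that happens to hold for the example table and will not apply to every instrument. All function names are hypothetical.

```python
import csv
from collections import Counter


def read_tsv(path):
    """Load a TSV file as a list of dicts keyed by column name."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def find_duplicate_visits(rows):
    """Return (participant_id, visit_id) pairs that occur in more than one row."""
    counts = Counter((r["participant_id"], r["visit_id"]) for r in rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


def find_total_mismatches(rows, subscore_columns, total_column="total_score"):
    """Return (participant_id, visit_id) pairs whose total is not the subscore sum.

    Assumes integer scores and a total defined as a plain sum of subscores.
    """
    mismatches = []
    for r in rows:
        expected = sum(int(r[c]) for c in subscore_columns)
        if expected != int(r[total_column]):
            mismatches.append((r["participant_id"], r["visit_id"]))
    return mismatches
```

Both functions return an empty list when nothing is wrong, so a curation script can simply fail or warn when either result is non-empty.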
Data dictionaries¶
For each file generated, it is recommended to create a data dictionary (i.e. `demographics.json`, `assessment_A.json`, etc.) next to the data file itself (see examples in BIDS docs). Neither Nipoppy nor BIDS itself helps with creating “standardized” data dictionaries, but a related project, Neurobagel, can help you with that. Neurobagel provides a simple annotation tool to help generate data dictionaries with standardized vocabulary (when available) for your variable names. This allows you to harmonize variable naming across different datasets (e.g. all demographic variables will be mapped to the same term), which can be very helpful when you want to combine, compare or analyze multiple datasets.
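As a rough illustration of a hand-written data dictionary, the sketch below writes a JSON file next to a TSV file using the `Description`, `Units` and `Levels` keys from BIDS-style sidecar files. The descriptions and the meanings given to the `M`/`F` and cohort values are assumptions made for this example, and the helper name `write_data_dictionary` is hypothetical. It does not reproduce the output of the Neurobagel annotation tool.

```python
import json
from pathlib import Path

# Illustrative entries for the example demographics file; wording is assumed.
DEMOGRAPHICS_DICTIONARY = {
    "participant_id": {"Description": "Unique participant identifier"},
    "age_at_recruitment": {
        "Description": "Age of the participant at recruitment",
        "Units": "years",
    },
    "sex": {
        "Description": "Sex of the participant",
        "Levels": {"M": "male", "F": "female"},
    },
    "recruitment_cohort": {
        "Description": "Cohort the participant was recruited into",
        "Levels": {"control": "control group", "patient": "patient group"},
    },
    "date_of_recruitment": {"Description": "Recruitment date (YYYY-MM-DD)"},
}


def write_data_dictionary(tsv_path, dictionary):
    """Write dictionary as JSON next to tsv_path, sharing its base name."""
    json_path = Path(tsv_path).with_suffix(".json")
    json_path.write_text(json.dumps(dictionary, indent=2) + "\n")
    return json_path
```

Calling `write_data_dictionary("tabular/demographics.tsv", DEMOGRAPHICS_DICTIONARY)` would produce `tabular/demographics.json`, provided the `tabular` directory already exists.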
Note
The recommendations provided here are a work in progress and only meant to help one get started. Nipoppy is contributing to BEP36 and plans to support it going forward.