This tutorial should take about 1 hour to complete, not including download times.
Primary Purpose: Get data from the NDA using (mostly) command line tools.
Secondary Purpose: Extend your exposure to AWS cloud computing from your desktop machine while gaining exposure to the Python virtual environment management tools that make it possible to translate what you may be used to doing locally to more scalable, reproducible, and meta-analyzable compute infrastructure. Knowing the environment tools will make using that infrastructure more intuitive.
- An account at the NDA.
- Access to data from your account at the NDA. See Lifespan 2.0 Data Access & Download Instructions for full instructions on getting access. The following steps should allow you to download any package to which your data use certification (DUC) grants you access.
- A data package that you want to download from the NDA, i.e. a 'shared' package, a package we created that you added to your account, or a package you created yourself. In particular, you need one that contains a 'datastructure_manifest.txt' file, which is created as part of the package creation process. Note that the NDA appears to be in the process of rewriting the rules on datastructure_manifest.txt file usage and sharing in general, so the defaults for package generation might change (if this happens we will update these instructions). Using the downloadcmd tool to get subsets of a particular package of data currently depends on your ability to locate and parse the datastructure_manifest.txt file for S3 links. Step 7 of this tutorial will help you to locate and parse the datastructure_manifest.txt file.
- Willingness to locate or create a terminal to a Linux/MacOS operating system that affords you permission to install software and has access to a filesystem location with space for a download (downloads including imaging data for all HCA or HCD subjects can be hundreds of GB to 20TB, depending on which data you select).
These instructions were tested in a clean Ubuntu 20.04 instance 'owned' by an account with pre-loaded computational credits for CCF users at the NDA. Having a controlled environment, e.g. by way of containers, AMIs, or virtual environments, becomes ever more important as dependencies (think FreeSurfer versions and Jupyter notebook Python library requirements) interfere with one another. ReproNim investigators and others have extended the practical reasons for environment control and format standardization into a compelling case that indeed EVERYTHING MATTERS to downstream scientific analysis and meaning construction. However, there is no reason in theory that this particular tutorial shouldn't work, with minimal help from Google, on your local Ubuntu-like machine.
Overview of steps
Step 1: Identify the package number you wish to download from the NDA. If your purpose is to download a subset of data from the NDA (individual files you specify at the command line), then this package should NOT actually contain the image data (e.g. no associated files). Rather, it will contain metadata about the available image data files, i.e. where they are located in S3 and how they are organized.
Step 2: Locate or create a command line environment that has access to adequate storage space AND permission to install software, and then navigate to this environment (for example, with AWS in this tutorial, using computational credits for CCF users at the NDA).
Step 3: Confirm that this environment has all the requirements.
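Step 4: Install the tools (nda-tools, which provides 'downloadcmd'), preferably in a python3 virtual environment.
Step 5: Use 'downloadcmd' to download the package you identified in step 1.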
Step 6: Confirm that the contents of your download are as expected (in terms of size, file count, etc).
Step 7: Subset the datastructure_manifest.txt file therein for the associated data files you wish to download.
Step 8: Download the associated image data.
Step 9: Turn off AWS machines and/or delete devices, if applicable.
Identify the package number you wish to download via information available at https://nda.nih.gov/user/dashboard/packages.html, after logging into your NDA account (you have succeeded if the upper right-hand corner of the page contains your username and the ability to log out). You are now on the Data Packages tab (red ellipse on right) of your account profile. Record the number of the package you want to grab from the command line (red ellipse on left). Note that the number you see in your packages may not be the same as the one in this screenshot from my account. If you have CCF/ABCD data permissions, you will have access to the HCP shared packages below, provided you toggle the drop-down menu to 'Shared Data Package' (red arrow). At this time, you do NOT need to associate these packages with your account by selecting "Add to My Data Packages" from the 'Actions' dropdown.
Find a terminal to a machine with a Linux/MacOS operating system that affords you permission to install software and has access to a filesystem location with storage space for a download.
a. Find any terminal on your computer.
If you need a terminal refresher, you can start here: https://www.youtube.com/watch?v=MBBWVgE0ewk (Windows), https://macpaw.com/how-to/use-terminal-on-mac (Mac), or https://tutorials.ubuntu.com/tutorial/command-line-for-beginners#0 (ubuntu/linux/unix).
b. Determine if this terminal is capable of giving you access to a machine with the appropriate requirements (Linux/MacOS, permissions, and space).
- If you're a Windows user and you want to access a machine that someone else owns and manages (e.g. on a cluster at your university or on AWS because that's where the 45TB of space resides), you'll need to install something that can handle an ssh command to this environment, such as Ubuntu for Windows, GitBash, or https://www.putty.org/.
d. ssh to this machine, if ssh'ing is in order:
For example, if you created an Ubuntu Instance in the cloud, as in this tutorial, you might ssh in like this from your terminal:
If you were given permission to use a university cluster, you might be instructed to turn on your university VPN and ssh with a password in like this:
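The two login styles described above might look like the following sketch. The key file name, user names, and host names are placeholders (assumptions, not values from this tutorial); substitute the ones your cloud console or cluster administrator gave you. The snippet only prints the commands so that you can inspect them before running them yourself.

```shell
# All values below are hypothetical placeholders; replace them with your own.
KEY_FILE="$HOME/.ssh/my-aws-key.pem"        # private key saved when the instance was created
CLOUD_HOST="ubuntu@ec2-host.example.com"    # 'ubuntu' is the usual default user on Ubuntu AMIs
CLUSTER_HOST="myuser@cluster.example.edu"   # university cluster login (a password prompt follows)

# Key-based login to a cloud instance.
cloud_cmd="ssh -i $KEY_FILE $CLOUD_HOST"

# Password-based login to a cluster (connect to your VPN first, if required).
cluster_cmd="ssh $CLUSTER_HOST"

# Show both commands; copy the one that applies and run it in your terminal.
printf '%s\n' "$cloud_cmd" "$cluster_cmd"
```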
Confirm that your workspace has the necessary requirements.
1. Ask the person who gave you permission to use their cluster if it's okay to install https://github.com/NDAR/nda-tools and download large quantities of data.
2. Double check that you have the space for the download, for example, by getting a list of the space in filesystems mounted to your machine as follows:
This is what 2000G of gp3 space looks like in the Ubuntu 20.04 machine from this tutorial:
3. Figure out more about your specific flavor of Linux or MacOS operating system by typing:
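A minimal sketch of checks 2 and 3. `df -h` and `cat /etc/*-release` are the commands you would type interactively; the `free_gb` helper is a hypothetical convenience (not part of nda-tools) that reports free space for one directory as a whole number of GB so that it can be compared with a download estimate.

```shell
# Free space on the filesystem that holds the current directory
# (plain `df -h` lists every mounted filesystem).
df -h .

# Operating system details; fall back to uname when no release file exists (e.g. on macOS).
cat /etc/*-release 2>/dev/null || uname -a

# free_gb DIR: print the space available under DIR in whole GB.
# -P forces one line per filesystem, -k reports 1024-byte blocks; column 4 is "Available".
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

free_gb .
```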
Install the tools. Follow instructions here: https://github.com/NDAR/nda-tools, or see if you can get away error-free with the three red bolded command lines below.
BETTER YET (and possibly necessary, depending on how comfortable you are messing up someone else's dependencies on a shared cluster), run all of the following commands to put nda-tools into a python3 virtual environment.
- If you're not yet familiar with virtual environments in python, creating virtual environments is a good habit to get into if you ever want to install the list of packages in a 'requirements.txt' file you found on a repo somewhere (see Step 5 detour). Python virtual environments come with a little bit of a learning curve, though. It's helpful to know that Python2, Python3, and Anaconda (Miniconda) have different tools/syntax for creating and managing virtual environments. It's also helpful to know that installing these tools depends on your particular operating system. Begin by typing > cat /etc/*-release in your command line to find out more about your particular operating system and whether your particular flavor of Linux needs a 'yum-ish,' 'apt-ish', or 'dnf-ish' command to download and install packages outside of 'pip' control.
- Docker/Singularity containers can only render the python virtual environment conversation moot to the extent that the tools within them lack conflicting dependencies (translation: you might as well start to learn about python virtual environments if you're not there yet).
- Your environment may already have a virtual environment maker installed, but it's important to know WHICH version of python and WHICH version of the virtual environment maker you're using. The commands below will install a python3 virtual environment maker.
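To make the 'yum-ish', 'apt-ish', 'dnf-ish' question concrete, here is a small sketch that reports which system package manager is on your PATH and which python3 a new virtual environment would be built from. The function name `detect_pkg_manager` is made up for this example, and the list of managers it probes is an assumption that covers only the ones named in this tutorial.

```shell
# Print the first system package manager found on PATH, or "none".
detect_pkg_manager() {
  for candidate in apt-get dnf yum brew; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  printf 'none\n'
}

pm=$(detect_pkg_manager)
echo "package manager: $pm"

# Location and version of the interpreter behind the 'python3' command, if any.
command -v python3 && python3 --version || echo "python3 not found"
```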
Skip to the commands in red, if you want to save virtual environments for another day and don't really care what version of python you're using provided it works.
Find out more about your particular operating system and whether your particular environment needs a 'yum-ish', 'apt-ish', 'dnf-ish', or 'brew-ish' command to download and install packages outside of Python's 'pip' control. The virtual environment maker is, in a lot of cases, distributed separately from Python itself, so you'll need to know what package management service you have installed. Note to Mac users who have never used their command line: it's possible you'll need to explicitly install Homebrew.
|sudo apt update||Get a status update for the packages in your particular flavor of Linux. Problems? The syntax is different depending on your operating system. Google is your friend. Ask Google what the 'sudo apt update' equivalent is in MacOS, for example.|
|python --version||This will usually spit out the python 2 version, if you have it installed and aren't already in an activated virtual environment. There is a slight chance that your network admin mapped 'python' to 'python3', though, so it's always good to check. If you really don't want to work with virtual environments, skip to the pip install nda-tools line bolded below. If you want to work with a python 2 virtual environment, you're on your own.|
|which python3||outside of the virtual environment, 'which' tells me that the python3 installation on my machine is /usr/bin/python3|
|python3 --version||my machine has the 'python3' command mapped to the installed Python 3.8.5 distribution (no python 2 on the Ubuntu 20.04 LTS AMI on AWS)|
|sudo apt-get install python3-venv||'yum, apt, dnf or brew' install the python 3 virtual environment management package. Miniconda starts in a 'base' virtual environment, i.e. you already have an environment-making environment, so you can skip this step.|
|python3 -m venv nda||Create a virtual environment named 'nda'. This command line is analogous to 'conda create --name nda' in Anaconda or Miniconda.|
|source nda/bin/activate||Activate this new virtual environment. This command is analogous to 'conda activate nda'. You'll know you're in the activated virtual environment because (nda) will precede your terminal prompt.|
|which python3||inside the virtual environment, 'which' tells me my python3 installation is /home/ubuntu/nda/bin/python3. Note that this is different than outside the virtual environment.|
|pip install nda-tools||install nda-tools (this will also install vtcmd, which can be used for uploading data to the NDA via the command line). If you run into problems, see https://github.com/NDAR/nda-tools, and/or go back to the beginning of this section and DON'T skip to the red bolded text.|
|downloadcmd --help||show all the options available for the downloadcmd. If you've gotten to this point without errors, then you can proceed to Step 5, with or without a virtual environment.|
|df -h||figure out how much space you have for downloading, and where you want to point the download|
|peek at the settings.cfg file, so you know where to change the debug log destination, if you ever so choose. The full path to your config file will be different, of course, and depend on where YOU actually created your virtual environment, called 'nda'|
In other words:
Refer to the screenshot in Step 1. Download package #1185256 using 8 threads (for comparison, on a t3.2xlarge it took 5.5 hours to download 1300G using 4 threads vs 3.5 hours on 8 threads using the machine I created in the Computational Credits account at the NDA, per this tutorial). Name the directory for download 'HCPDevImgManifestBeh' so that it matches the name in the NDA, even though you don't have to. You may be asked to hit 'enter' a couple times if you don't have a token (this method doesn't require a token).
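The download described above might be assembled as in the sketch below. The `-d` (directory) and `-v` (verbose) flags are the ones used elsewhere in this tutorial; `-dp` (download a whole package by ID) and `-wt` (worker threads) are assumptions based on nda-tools releases from the period of this tutorial, so confirm them against `downloadcmd --help` for your installed version. The snippet prints the command rather than running it.

```shell
PACKAGE_ID=1185256               # package number recorded in Step 1
DEST_DIR=HCPDevImgManifestBeh    # download directory, named to match the package
THREADS=8                        # number of download threads

# Assumed flags: -dp (argument is a package ID) and -wt (worker threads).
# Check `downloadcmd --help` before relying on them.
cmd="downloadcmd -dp $PACKAGE_ID -d $DEST_DIR -wt $THREADS -v"

# Dry run: show the command. Paste it into your terminal to start the download;
# expect to be prompted for your NDA username and password.
echo "$cmd"
```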
It usually takes a minute to get rolling, but since you added the -v (verbose) flag, you should start seeing messages about things being downloaded. If you don't, it's likely that the NDA has changed something. Contact their helpdesk: firstname.lastname@example.org for updates. Look for updates on the NDAR repo pages (https://github.com/NDAR). The CCF will try to keep up with their breaking changes, too, and will update our instructions accordingly. If you get errors, first check for typos in the package number, your credentials, or the options used. Then send your question to email@example.com with the command you typed (sans password) and the error you're seeing. Note that 2 out of 2 people who tested this particular tutorial from start to finish encountered issues related to account permissions managed by the NDA and needed to open helpdesk tickets. Check for typos before you send them a help desk ticket, but don't be shy.
Step 5 detour: Let's pretend you wanted to clone the abcd downloader and install those requirements. Open another terminal window or deactivate the nda virtual environment.
Confirm that the downloaded directory is the size you're expecting. If it is not the right size, confirm that you haven't maxed your download space, and then try again. If it is still not the right size, contact NDA's helpdesk: firstname.lastname@example.org and include the (likely uninformative) contents of the debug log, along with the exact downloadcmd you used (sans passwords).
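One way to run this check is sketched below. So that the commands can be tried anywhere, the snippet builds a throwaway stand-in directory with two placeholder files; in practice you would set `download_dir` to the directory you passed to downloadcmd (HCPDevImgManifestBeh in this tutorial).

```shell
# Stand-in for the real download directory (placeholder files, not real NDA content).
download_dir=$(mktemp -d)
printf 'placeholder\n' > "$download_dir/datastructure_manifest.txt"
printf 'placeholder\n' > "$download_dir/ndar_subject01.txt"

# Total size of the download, human readable.
du -sh "$download_dir"

# Number of regular files, counted recursively.
file_count=$(find "$download_dir" -type f | wc -l | tr -d ' ')
echo "files: $file_count"

# Free space left on the filesystem that holds the download,
# to rule out a download cut short by a full disk.
df -h "$download_dir"
```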
Assuming you have had no issues, and your HCPDevImgManifestBeh folder is the expected size, look at the datastructure_manifest.txt file. This file contains the S3 links for associated imaging data files. Note: If you happened to download a package that was created WITH associated imaging files, then this step is irrelevant and you are done. You likely wish to download only a subset of the files listed in the datastructure_manifest.txt file (if you downloaded all the imaging files for HCD, you would need 20 TB+ of space and many of the files would not be necessary for your analyses). The following command examples will help you create your desired subset of S3 links to pass to downloadcmd in 'round 2' of the download process, as described in Step 8.
NOTE: the datastructure_manifest.txt file is the closest thing you have to a 'filesystem' view of the data at the NDA as of 2/23/2021. There will likely be much confusion surrounding the use of datastructure_manifest.txt files once the NDA retires the 'DataManager' Service which supports a lot of tools that use tokens to grab data at the end of S3 links (including downloadcmd).
#Take a peek.
# Look even closer: Extract the 6th column of this file and pipe the output to 'less' for viewing
# Even closer: From the 6th column of the datastructure_manifest.txt file, treat '/' as the delimiter to get all the so-defined 'columns' after the 4th '/' (and then remove the trailing " with sed)
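The three viewing steps above could look like the following sketch. It assumes the manifest is tab-delimited with double-quoted fields and the S3 link in the 6th column, as described here. The stand-in manifest, its column names, and the second file path are made up so that the commands can be run without a real download; point `manifest` at your own datastructure_manifest.txt instead.

```shell
# Tiny stand-in manifest (hypothetical header and rows; only the first S3 link
# is taken from this tutorial).
manifest=$(mktemp)
{
  printf '"col1"\t"col2"\t"col3"\t"package"\t"col5"\t"s3_link"\n'
  printf '"1"\t"2"\t"3"\t"PreprocStrucRecommended"\t"5"\t"s3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz"\n'
  printf '"1"\t"2"\t"3"\t"PreprocStrucRecommended"\t"5"\t"s3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T2w_acpc_dc_restore.nii.gz"\n'
} > "$manifest"

# Take a peek at the first few lines.
head -n 3 "$manifest"

# Look closer: the 6th column holds the S3 links
# (pipe to 'less' instead of 'head' when working interactively).
cut -f6 "$manifest" | head

# Closer still: keep everything after the 4th '/' and strip the trailing quote,
# which leaves a filesystem-like relative path for each file.
rel_paths=$(cut -f6 "$manifest" | grep 's3://' | cut -d'/' -f5- | sed 's/"$//')
printf '%s\n' "$rel_paths"
```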
# Instead of manipulating the datastructure_manifest.txt file for a filesystem view, let's get it to pipe the complete S3 list (sans extraneous quotations) to a new file for extraction of subsets
# Observe that the fourth column of the datastructure manifest contains the HCD-specific (Lifespan 2.0) subset of our HCP-style package shortnames. This is where our predefined "HCP packages" (available dataset filters on https://nda.nih.gov/general-query.html?q=query=featured-datasets:HCP%20Aging%20and%20Development) overlap with naming conventions in the datastructure_manifest.txt file.
# Subset to just the S3 links of a golden HCD subject (an HCP-D subject that has complete data for the Lifespan 2.0 Release):
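A sketch of the last three steps (full S3 list, package shortnames, single-subject subset), under the same assumptions about the manifest layout: tab-delimited, quoted fields, shortname in column 4, S3 link in column 6. The stand-in rows, the 'OtherPackage' name, the subject HCD9999999, and the output file names are hypothetical.

```shell
workdir=$(mktemp -d)
manifest="$workdir/datastructure_manifest.txt"

# row SHORTNAME RELATIVE_PATH: emit one hypothetical manifest line.
row() {
  printf '"1"\t"2"\t"3"\t"%s"\t"5"\t"s3://NDAR_Central_3/submission_33230/%s"\n' "$1" "$2"
}
{
  printf '"col1"\t"col2"\t"col3"\t"package"\t"col5"\t"s3_link"\n'
  row PreprocStrucRecommended HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz
  row OtherPackage            HCD0008117_V1_MR/other/file.nii.gz
  row PreprocStrucRecommended HCD9999999_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz
} > "$manifest"

# Complete S3 list, without header lines or quotation marks, saved to a new file.
cut -f6 "$manifest" | grep 's3://' | sed 's/"//g' > "$workdir/all_s3links.txt"

# Distinct values in the 4th column (skipping the single header row of the stand-in).
cut -f4 "$manifest" | tail -n +2 | sed 's/"//g' | sort -u

# Only the links that belong to one subject.
grep 'HCD0008117' "$workdir/all_s3links.txt" > "$workdir/HCD0008117_s3links.txt"
cat "$workdir/HCD0008117_s3links.txt"
```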
# Create a list of subjects for whom you'd like to grab a particular subset HCP package of imaging data - do it your way (e.g. by creating a list from the ndar_subject01.txt Subject Inventory file) or do this for proof of concept:
# String these commands together to create a file containing the S3 links for all of the PreprocStrucRecommended HCP package data for the two subjects in your subjectlist.txt file
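Strung together, the subject list and package filter might look like this sketch. As before, the manifest layout (tab-delimited, shortname in column 4, S3 link in column 6) follows the description above, while the stand-in rows, the subject IDs other than HCD0008117, and the output file name are hypothetical.

```shell
workdir=$(mktemp -d)
manifest="$workdir/datastructure_manifest.txt"

# row SHORTNAME RELATIVE_PATH: emit one hypothetical manifest line.
row() {
  printf '"1"\t"2"\t"3"\t"%s"\t"5"\t"s3://NDAR_Central_3/submission_33230/%s"\n' "$1" "$2"
}
{
  printf '"col1"\t"col2"\t"col3"\t"package"\t"col5"\t"s3_link"\n'
  row PreprocStrucRecommended HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz
  row OtherPackage            HCD0008117_V1_MR/other/file.nii.gz
  row PreprocStrucRecommended HCD9999999_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz
  row PreprocStrucRecommended HCD8888888_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz
} > "$manifest"

# Proof-of-concept subject list: one ID per line, no blank lines
# (a blank line would make grep -f match every row).
printf 'HCD0008117\nHCD9999999\n' > "$workdir/subjectlist.txt"

# Keep rows for the wanted package, then rows for the listed subjects,
# then extract the S3 link column and drop the quotation marks.
grep 'PreprocStrucRecommended' "$manifest" \
  | grep -F -f "$workdir/subjectlist.txt" \
  | cut -f6 \
  | sed 's/"//g' > "$workdir/PreprocStrucRecommended_s3links.txt"

cat "$workdir/PreprocStrucRecommended_s3links.txt"
```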
Having identified the subjects and packages you want to download and created a list of S3 links from the manifest, send this list to the downloadcmd for downloading.
Note: Don't forget to make sure your python virtual environment is activated, if applicable. Remember to check downloadcmd --help for updates. Also double check your storage space. Red arrows in this screenshot point to the commands that can perform these functions, if you don't want to go back to previous steps in this tutorial to copy and paste them again.
Remember that the size of your download will be larger than that of the original package (e.g. 102KB for HCPDevImgManifestBeh in the table below) which didn't include associated files. You are now about to download the associated image files (big data), but there is no good way to know how large this download will be using the downloadcmd --help options described in the screenshot above. You will have to estimate the space you'll need based on the shared package sizes and the sizes of the HCP-style packages for all subjects below. (Eventually we will add updated tables for HCP package sizes for one complete subject).
For reference, the list of links you created in Step 7 represents two subjects' Preproc Structural Recommended Data; based on the tables below, this means that AT MOST it will contain in the neighborhood of 62 GB (31*2) of data. Given that the recommended data is roughly 7% (1669/22789) of all data, you may be able to reduce this estimate: 7% of 62 GB is about 4.5 GB. Use df -h (screenshot above) to see if you have that much space at your disposal.
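The arithmetic above, written out as a sketch. The four numbers are the ones quoted in the paragraph; the comparison against free space uses the current directory as a stand-in for your download location.

```shell
subjects=2
per_subject_gb=31     # per-subject figure quoted above
recommended=1669      # numerator of the quoted ratio
all_data=22789        # denominator of the quoted ratio

# Upper bound: per-subject figure times the number of subjects.
upper_gb=$((subjects * per_subject_gb))

# Scaled estimate: upper bound times the recommended/all ratio, to one decimal place.
estimate_gb=$(awk -v u="$upper_gb" -v r="$recommended" -v a="$all_data" \
  'BEGIN { printf "%.1f", u * r / a }')
echo "upper bound: ${upper_gb} GB, scaled estimate: ${estimate_gb} GB"

# Whole GB available on the filesystem holding the current directory.
avail_gb=$(df -Pk . | awk 'NR == 2 { printf "%d", $4 / 1024 / 1024 }')
if [ "$avail_gb" -ge "$upper_gb" ]; then
  echo "enough room even for the upper bound (${avail_gb} GB free)"
else
  echo "only ${avail_gb} GB free; free up space or choose another location"
fi
```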
|Shared Package ID||Shared Package Name|
Datasets (HCP-style Packages) available in OPTION 2:
|HCP-Style Package shortname||HCP-Style Package OPTION 2 Filter Name||Size All Subjects||HCP Package Contents|
| || ||194 GB 203 GB||multi-echo MPRAGE (T1 weighted) and T2-SPACE (T2 weighted) scans (in NIFTI format)|
| ||HiResHp Structural Unprocessed|| ||turbo-spin-echo high spatial resolution hippocampal structural scan (in NIFTI format)|
| ||Resting State rfMRI Unprocessed||1.3 TB 1.3 TB||both pairs of resting state fMRI scans (in NIFTI format)|
| ||tfMRI CARIT Unprocessed||237 GB 417 GB||fMRI scans for the CARIT task (in NIFTI format; Go/NoGo Conditioned Approach Response Inhibition Task)|
| ||tfMRI FACENAME Unprocessed|| ||fMRI scan for the FACENAME task (in NIFTI format; paired-associative memory task)|
| ||tfMRI VISMOTOR Unprocessed|| ||fMRI scan for the VISMOTOR task (in NIFTI format; simultaneous motor and visual activation task)|
| ||tfMRI EMOTION Unprocessed|| ||fMRI scan for the EMOTION task (in NIFTI format; emotion and face-processing task)|
| ||tfMRI GUESSING Unprocessed|| ||fMRI scans for the GUESSING task (in NIFTI format; reward, punishment, anticipatory reactivity task)|
| || ||528 GB 566 GB||dMRI scans (in NIFTI format), bval, and bvec files for the two sets of diffusion sensitizing directions ('dir98' and 'dir99')|
| || ||27 GB 27 GB||mbPCASLhr scan (in NIFTI format; multiband 2D EPI pseudo-continuous arterial spin labeling with high spatial resolution)|
| ||Structural Preprocessed Recommended||521 GB 476 GB||recommended starting point for structural analyses and contains files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||Structural Preprocessed Legacy||526 GB 481 GB||structural files coarsely aligned across subjects using the MSMSulc folding surface registration|
| ||Structural Preprocessed FreeSurfer||585 GB 541 GB||actual outputs from the FreeSurferPipeline stage of the HCP Structural Preprocessing, in FreeSurfer's native file formats and directory structure|
| ||Structural Preprocessed Extended||160 GB 147 GB||additional files related to QC on structural preprocessing outputs and other extra files that may be useful to select users|
| ||rfMRI Preprocessed Recommended||991 GB 883 GB||recommended starting point for rfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||rfMRI Preprocessed Legacy Surface||989 GB 881 GB||cleaned files coarsely aligned across subjects using the MSMSulc folding surface registration, and hcp_fix_multi_run|
| ||rfMRI Preprocessed Legacy Volume||2.1 TB 2.0 TB||cleaned rfMRI files poorly aligned across subjects using nonlinear volume registration|
| ||rfMRI Preprocessed Uncleaned||3.5 TB 2.4 TB||uncleaned resting state data of all registration types for use in testing alternative data cleanup strategies|
| ||rfMRI Preprocessed Extended||4.9 TB 5.1 TB||additional files related to rfMRI data cleanup and other extra files that may be useful to select users|
| ||tfMRI CARIT Preprocessed Recommended||82 GB 154 GB||recommended starting point for CARIT tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||tfMRI CARIT Preprocessed Legacy Surface||82 GB 154 GB||cleaned CARIT tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration|
| ||tfMRI CARIT Preprocessed Legacy Volume||166 GB 328 GB||cleaned CARIT tfMRI files poorly aligned across subjects using nonlinear volume registration|
| ||tfMRI CARIT Preprocessed Uncleaned||540 GB 1.0 TB||uncleaned tfMRI CARIT data of all registration types for use in testing alternative data cleanup strategies|
| ||tfMRI CARIT Preprocessed Extended||50 GB 93 GB||additional CARIT tfMRI files related to data cleanup and other extra files that may be useful to select users|
| ||tfMRI FACENAME Preprocessed Recommended|| ||recommended starting point for FACENAME tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||tfMRI FACENAME Preprocessed Legacy Surface|| ||cleaned FACENAME tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration|
| ||tfMRI FACENAME Preprocessed Legacy Volume|| ||cleaned FACENAME tfMRI files poorly aligned across subjects using nonlinear volume registration|
| ||tfMRI FACENAME Preprocessed Uncleaned|| ||uncleaned tfMRI FACENAME data of all registration types for use in testing alternative data cleanup strategies|
| ||tfMRI FACENAME Preprocessed Extended|| ||additional FACENAME tfMRI files related to data cleanup and other extra files that may be useful to select users|
| ||tfMRI VISMOTOR Preprocessed Recommended|| ||recommended starting point for VISMOTOR tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||tfMRI VISMOTOR Preprocessed Legacy Surface|| ||cleaned VISMOTOR tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration|
| ||tfMRI VISMOTOR Preprocessed Legacy Volume|| ||cleaned VISMOTOR tfMRI files poorly aligned across subjects using nonlinear volume registration|
| ||tfMRI VISMOTOR Preprocessed Uncleaned|| ||uncleaned tfMRI VISMOTOR data of all registration types for use in testing alternative data cleanup strategies|
| ||tfMRI VISMOTOR Preprocessed Extended|| ||additional VISMOTOR tfMRI files related to data cleanup and other extra files that may be useful to select users|
| ||tfMRI EMOTION Preprocessed Recommended|| ||recommended starting point for EMOTION tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||tfMRI EMOTION Preprocessed Legacy Surface|| ||cleaned EMOTION tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration|
| ||tfMRI EMOTION Preprocessed Legacy Volume|| ||cleaned EMOTION tfMRI files poorly aligned across subjects using nonlinear volume registration|
| ||tfMRI EMOTION Preprocessed Uncleaned|| ||uncleaned tfMRI EMOTION data of all registration types for use in testing alternative data cleanup strategies|
| ||tfMRI EMOTION Preprocessed Extended|| ||additional EMOTION tfMRI files related to data cleanup and other extra files that may be useful to select users|
| ||tfMRI GUESSING Preprocessed Recommended|| ||recommended starting point for GUESSING tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration|
| ||tfMRI GUESSING Preprocessed Legacy Surface|| ||cleaned GUESSING tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration|
| ||tfMRI GUESSING Preprocessed Legacy Volume|| ||cleaned GUESSING tfMRI files poorly aligned across subjects using nonlinear volume registration|
| ||tfMRI GUESSING Preprocessed Uncleaned|| ||uncleaned tfMRI GUESSING data of all registration types for use in testing alternative data cleanup strategies|
| ||tfMRI GUESSING Preprocessed Extended|| ||additional GUESSING tfMRI files related to data cleanup and other extra files that may be useful to select users|
# Having identified the subjects and packages you want to download and created a list of S3 links from the manifest, send this list to the downloadcmd for downloading.
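Before handing a link list to downloadcmd, it can help to sanity-check it. The sketch below counts the links, flags any line that does not start with s3://, and then prints a candidate command. The `-t` flag (read S3 links from a text file) is an assumption based on nda-tools releases from the period of this tutorial, so confirm it with `downloadcmd --help`; the stand-in list, the second link, and the destination directory name are hypothetical.

```shell
# Stand-in for the S3 link list created in Step 7 (the second link is hypothetical).
links_file=$(mktemp)
printf '%s\n' \
  's3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz' \
  's3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T2w_acpc_dc_restore.nii.gz' \
  > "$links_file"

# Count non-empty lines, and lines that do not look like S3 links.
total=$(grep -c . "$links_file" || true)
bad=$(grep -v -c '^s3://' "$links_file" || true)
echo "links: $total, malformed lines: $bad"

# Assumed flag: -t (the argument is a text file of S3 links).
# -d and -v are the flags used elsewhere in this tutorial.
DEST_DIR=HCPDevPreprocStrucRecommended   # hypothetical directory name
cmd="downloadcmd -t $links_file -d $DEST_DIR -v"

# Dry run: show the command; run it yourself once the counts look right.
echo "$cmd"
```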
Note: you can also just download a single file as follows:
Once again, check that your download contains the data expected, for example with du and tree commands.
Helpful tip: If you're on a VPN and the VPN times out before the download completes, you may wish to preface your command with nohup (no hangup) to keep the command running in the background. You can also time how long it takes to download.
E.g., the following will download a specific S3 link and send the verbose output AND the time it took to download everything to jobtimer.txt:
> nohup time downloadcmd s3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz -u <your username> -p <your password> -d HCPDevImgManifestBehSingleLink -v > jobtimer.txt 2>&1 &
ALL CAPS EXTRA SPECIAL NOTE:
downloadcmd relies on the 'DataManager', which is scheduled for retirement. We're told this will mean that in the future you will have to put BOTH a package number AND a list of S3 links in the command to download subsets via the command line.
BUT HERE IS THE CATCH: the number of the package has to be associated with a 'fully loaded' and shared package (one containing all imaging data files, e.g. HCPDevAllFiles, that you don't want to download unless you have 23 TB handy and endless free time to babysit the process). Currently, the only way to obtain the datastructure_manifest.txt without downloading the entire release is through the HCPDevImgManifestBeh package. Confused? WE ARE TOO!!! Contact NDA's helpdesk: email@example.com for updates. Look for updates on the NDAR repo pages (https://github.com/NDAR). Browse for updates to access instructions at wiki.humanconnectome.org