How to get data from the NDA using command line tools

[!tip] This tutorial should take about 1 hour to complete, not including download times.

Purpose

Primary Purpose:  Get data from the NDA using (mostly) command line tools.

Secondary Purpose:  Extend your exposure to AWS cloud computing from your desktop machine, while gaining experience with the Python virtual environment management tools that make it possible to translate what you may be used to doing locally onto more scalable, reproducible, and meta-analyzable compute infrastructure.  Knowing these tools will make that infrastructure more intuitive to use.

Prerequisites:

  • An account at the NDA.
  • Access to data from your account at the NDA. See Lifespan 2.0 Data Access & Download Instructions for full instructions on getting access. The following steps should allow you to download any package to which your data use certification (DUC) grants you access.
  • A data package that you want to download from the NDA, i.e. a ‘shared’ package, a package we created that you added to your account, or a package you created yourself.  In particular, one that contains a ‘datastructure_manifest.txt’ file, which is created as part of the package creation process.  Note that the NDA appears to be in the process of rewriting the rules on datastructure_manifest.txt file usage and sharing in general, so the defaults for package generation might change (if this happens, we will update these instructions).  Using the downloadcmd tool to get subsets of a particular package of data currently depends on your ability to locate and parse the datastructure_manifest.txt file for S3 links; Step 7 of this tutorial will help you locate and parse it.
  • Willingness to locate or create a terminal to a Linux/MacOS operating system that affords you permission to install software and has access to a filesystem location with space for a download (downloads including imaging data for all HCA or HCD subjects can range from hundreds of GB to more than 20 TB, depending on which data you select).

[!note] These instructions were tested in a clean Ubuntu 20.04 instance ‘owned’ by an account with pre-loaded computational credits for CCF users at the NDA.  Having a controlled environment, e.g. by way of containers, AMIs, or virtual environments, becomes ever more important as dependencies (think FreeSurfer versions, and Jupyter notebook Python library requirements) interfere with one another.  ReproNim Investigators and others have extended the practical reasons for environment control and format standardization in a compelling case that indeed EVERYTHING MATTERS to downstream scientific analysis and meaning construction.  However, there is no reason, in theory, that this particular tutorial shouldn’t work with minimal help from Google on your Ubuntu-esque local machine.

Overview of steps

Step 1: Identify the package number you wish to download from the NDA.  If your purpose is to download a subset of data from the NDA (individual files you specify at the command line), then this package should NOT actually contain the image data (e.g. no associated files).  Rather, it will contain metadata about the available image data files, i.e. where they are located in S3 and how they are organized.

Step 2: Locate or create a command line environment that has access to adequate storage space AND permission to install software, and then navigate to this environment (for example, with AWS in this tutorial, using computational credits for CCF users at the NDA).

Step 3: Confirm that this environment has all the requirements.

Step 4: Install software from https://github.com/NDAR/nda-tools.  Skip to the bolded commands if you would rather not deal with Python virtual environments.

Step 5: Use ‘downloadcmd’ to download the package you identified in step 1.

Step 6: Confirm that the contents of your download are as expected (in terms of size, file count, etc).

Step 7: Subset the datastructure_manifest.txt file therein for the associated data files you wish to download.  

Step 8: Download the associated image data.  

Step 9:  Turn off AWS machines and/or delete devices,  if applicable.  

Step 1:

Identify the package number you wish to download via information available at https://nda.nih.gov/user/dashboard/packages.html, after logging into your NDA account (you have succeeded if the upper right hand corner of the page contains your username and the ability to log out).  You are now on the Data Packages tab (red ellipse on right) of your account profile.  Record the number of the package you want to grab from the command line (red ellipse on left).  Note that the numbers you see in your packages may not match this screenshot from my account.  If you have CCF/ABCD data permissions, you will have access to the HCP shared packages below, provided you toggle the drop-down menu to ‘Shared Data Package’ (red arrow).  At this time, you do NOT need to associate these packages to your account by selecting “Add to My Data Packages” from the ‘Actions’ dropdown.

Step 2:

Find a terminal to a machine with a Linux/MacOS operating system that affords you permission to install software and has access to a filesystem location with storage space for a download.   

a.  Find any terminal on your computer.    

If you need a terminal refresher, you can start here: https://www.youtube.com/watch?v=MBBWVgE0ewk (Windows), https://macpaw.com/how-to/use-terminal-on-mac (Mac), or https://tutorials.ubuntu.com/tutorial/command-line-for-beginners#0 (ubuntu/linux/unix).   

b.  Determine if this terminal is capable of giving you access to a machine with the appropriate requirements (Linux/MacOS, permissions, and space).  

    • The Windows Command Prompt is not going to be useful for this tutorial.
      • If you have a Windows machine with enough space, but you just don’t have the Linux terminal, you can install Ubuntu for Windows or explore VirtualBox to create a virtual Ubuntu machine on your Windows host.
    • If you’re a Windows user and you want to access a machine that someone else owns and manages (e.g. on a cluster at your university, or on AWS because that’s where the 45 TB of space resides), you’ll need to install something that can handle an ssh connection to that environment, such as Ubuntu for Windows, Git Bash, or https://www.putty.org/

c.  Consider following this tutorial to create such a workspace for yourself in AWS, using computational credits for CCF users at the NDA.

d.  ssh to this machine, if ssh’ing is in order:

For example, if you created an Ubuntu Instance in the cloud, as in this tutorial, you might ssh in like this from your terminal:

> ssh -i "key2ccfsetup2020.pem" ubuntu@ec2-18-223-32-40.us-east-2.compute.amazonaws.com
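
If ssh rejects the key with an ‘unprotected private key file’ warning (a common first hurdle with AWS .pem files), you may need to tighten the key’s permissions first; the filename here is just the one from the example above:

> chmod 400 key2ccfsetup2020.pem     # make the key readable only by you, as ssh requires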

If you were given permission to use a university cluster, you might be instructed to turn on your university VPN and ssh in with a password like this:

> ssh petra@123.456.78.90

Step 3:

Confirm that your workspace has the necessary requirements. 

  1. Ask the person who gave you permission to use their cluster if it’s okay to install https://github.com/NDAR/nda-tools and download large quantities of data.

  2. Double check that you have the space for the download, for example, by getting a list of the space in filesystems mounted to your machine as follows:

> df -h

This is what 2000G of gp3 space looks like in the Ubuntu 20.04 machine from this tutorial:

  3. Figure out more about your specific flavor of Linux or MacOS operating system by typing:
> lsb_release -a

or

> cat /etc/*-release

or

> sw_vers

Step 4:

Install the tools.  Follow the instructions at https://github.com/NDAR/nda-tools, or see if you can get away error-free with just the three bolded command lines below.

BETTER YET (and possibly necessary, depending on how comfortable you are with the risk of messing up someone else’s dependencies on a shared cluster), run all of the following commands to put nda-tools into a Python 3 virtual environment.

  • If you’re not yet familiar with virtual environments in Python, creating them is a good habit to get into, especially if you ever want to install the list of packages in a ‘requirements.txt’ file you found on a repo somewhere (see the Step 5 detour).  Python virtual environments come with a bit of a learning curve, though.  It’s helpful to know that Python 2, Python 3, and Anaconda (Miniconda) have different tools/syntax for creating and managing virtual environments, and that installing these tools depends on your particular operating system.  Begin by typing > cat /etc/*-release at your command line to find out more about your particular operating system and whether your particular flavor of Linux needs a ‘yum-ish,’ ‘apt-ish’, or ‘dnf-ish’ command to download and install packages outside of ‘pip’ control.
  • Docker/Singularity containers can only render the python virtual environment conversation moot to the extent that the tools within them lack conflicting dependencies (translation:  you might as well start to learn about python virtual environments if you’re not there yet).
  • Your environment may already have a virtual environment maker installed, but it’s important to know WHICH version of Python and WHICH version of the virtual environment maker you’re using.  The commands below will install a Python 3 virtual environment maker.

Skip to the commands in bold if you want to save virtual environments for another day and don’t really care what version of Python you’re using, provided it works.

| command | comments |
| ----- | --- |
| cat /etc/\*-release  or  lsb\_release -a  or  sw\_vers | Find out more about your particular operating system and whether your particular environment needs a ‘yum-ish,’ ‘apt-ish’, ‘dnf-ish’, or ‘brew-ish’ command to download and install packages outside of Python’s ‘pip’ control.  The virtual environment maker is distributed separately from Python itself in a lot of cases, so you’ll need to know which package management service you have installed.  Note to Mac users who have never used their command line: it’s possible you’ll need to explicitly install Homebrew. |
| sudo apt update | Get a status update for the packages in your particular flavor of Linux.  Problems?  The syntax is different depending on your operating system; Google is your friend.  Ask Google what the ‘sudo apt update’ equivalent is in MacOS, for example. |
| python --version | This will usually spit out the Python 2 version, if you have it installed and aren’t already in an activated virtual environment.  There is a slight chance that your network admin mapped ‘python’ to ‘python3,’ though, so it’s always good to check.  If you really don’t want to work with virtual environments, skip to the pip install nda-tools line bolded below.  If you want to work with a Python 2 virtual environment, you’re on your own. |
| which python3 | Outside of the virtual environment, ‘which’ tells me that the python3 installation on my machine is /usr/bin/python3. |
| python3 --version | My machine has the ‘python3’ command mapped to the installed Python 3.8.5 distribution (no Python 2 on the Ubuntu 20.04 LTS AMI on AWS). |
| sudo apt-get install python3-venv | ‘yum, apt, dnf or brew’ install the Python 3 virtual environment management package.  Miniconda starts in a ‘base’ virtual environment, i.e. you already have an environment-making environment, so you can skip this step. |
| python3 -m venv nda | Create a virtual environment named ‘nda.’  This command line is analogous to ‘conda create nda’ in Anaconda or Miniconda. |
| source nda/bin/activate | Activate this new virtual environment.  This command is analogous to ‘conda activate nda.’  You’ll know you’re in the activated virtual environment because (nda) will precede your terminal prompt. |
| which python3 | Inside the virtual environment, ‘which’ tells me my python3 installation is /home/ubuntu/nda/bin/python3.  Note that this is different than outside the virtual environment. |
| pip install nda-tools | Install nda-tools (this will also install vtcmd, which can be used for uploading data to the NDA via the command line).  If you run into problems, see https://github.com/NDAR/nda-tools, and/or go back to the beginning of this section and DON’T skip to the bolded text. |
| downloadcmd --help | Show all the options available for downloadcmd.  If you’ve gotten to this point without errors, then you can proceed to Step 5, with or without a virtual environment. |
| df -h | Figure out how much space you have for downloading, and where you want to point the download. |
| less /home/ubuntu/nda/config/settings.cfg | Peek at the settings.cfg file, so you know where to change the debug log destination, if you ever so choose.  The full path to your config file will be different, of course, and depend on where YOU actually created your virtual environment, called ‘nda.’ |

In other words:

> cat /etc/*-release
> sudo apt update
> python --version
> which python3
> python3 --version
> sudo apt-get install python3-venv
> python3 -m venv nda
> source nda/bin/activate
> which python3
> **pip install nda-tools**
> **downloadcmd --help**
> **df -h**
> less /home/ubuntu/nda/config/settings.cfg
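
As a quick sanity check (not required by nda-tools, just a good habit), confirm that downloadcmd resolved to the copy inside your ‘nda’ virtual environment, and practice leaving and re-entering that environment:

> which downloadcmd           # should point inside the virtual environment, e.g. .../nda/bin/downloadcmd
> deactivate                  # leave the virtual environment; the (nda) prefix disappears from your prompt
> source nda/bin/activate     # re-activate it before moving on to Step 5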

Step 5:

Refer to the screenshot in Step 1.  Download package #1185256 using 8 threads (for comparison, on a t3.2xlarge created in the Computational Credits account at the NDA per this tutorial, it took 5.5 hours to download 1300G using 4 threads vs. 3.5 hours using 8 threads).  Name the download directory ‘HCPDevImgManifestBeh’ so that it matches the name in the NDA (you don’t have to, but it keeps things tidy).  You may be asked to hit ‘enter’ a couple of times if you don’t have a token (this method doesn’t require a token).

> downloadcmd -dp 1185256 -u <your NDA username> -d  HCPDevImgManifestBeh -wt 8 

If you get a password error, enter this:

> export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring

Then enter the downloadcmd command again. You will be asked for your NDA password.

It usually takes a minute to get rolling, but you should start seeing messages about files being downloaded.  If you don’t see evidence that the package is downloading, it’s likely that the NDA has changed something.  Contact their helpdesk (ndahelp@mail.nih.gov) for updates, and look for updates on the NDAR repo pages (https://github.com/NDAR).  The CCF will try to keep up with their breaking changes, too, and will update our instructions accordingly.  If you get errors, first check for typos in the package number, your credentials, or the options used; then send your question to ndahelp@mail.nih.gov with the command you typed and the error you’re seeing.  Note that 2 out of 2 people who tested this particular tutorial from start to finish encountered issues related to account permissions managed by the NDA and needed to open helpdesk tickets.  Check for typos before you send them a help desk ticket, but don’t be shy.
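
If you want a rough progress indicator while the package downloads, one option (ordinary Linux tooling, nothing specific to downloadcmd) is to watch the download directory grow from a second terminal:

> watch -n 60 du -sh HCPDevImgManifestBeh     # re-check the directory size every 60 seconds; Ctrl-C to stop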

Step 5 detour: Let’s pretend you wanted to clone the ABCD downloader and install its requirements.  Open another terminal window or deactivate the nda virtual environment.

>  deactivate                           # deactivates the nda virtual environment (or whatever environment was currently active)
>  python3 -m venv abcd3.5.1         	# create a new python 3 virtual environment for the abcd download tool
>  source abcd3.5.1/bin/activate     	# activate this new environment
>  git clone https://github.com/ABCD-STUDY/nda-abcd-s3-downloader.git   #clone (download/copy) the repository
>  cd nda-abcd-s3-downloader/    		# navigate into the cloned directory and locate the requirements.txt file
>  pip install -r requirements.txt 		# install requirements and then figure out what to do with the cloned programs, based on instructions in the repo's README
>  cd ..                                # move back to the parent directory

Step 6:

Confirm that the downloaded directory is the size you’re expecting.  If it is not the right size, confirm that you haven’t maxed your download space, and then try again.  If it is still not the right size, contact NDA’s helpdesk: ndahelp@mail.nih.gov and include the (likely uninformative) contents of the debug log, along with the exact downloadcmd you used.  

> du -h HCPDevImgManifestBeh      # check whether it is the size you're expecting (e.g. 102 KB per the screenshot in Step 1)

> sudo apt install tree           # this is a handy little tool that will show you the directory structure of your download and also output a count of files and folders therein.

> tree HCPDevImgManifestBeh       # make note of the number of folders and files in the directory...is this what you're expecting?

> df -h                           # see if you still have space

> less NDAValidationResults/debug_log_<latest datetime stamp>.txt      # copy the contents into your ticket to the help desk, if needed
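
If you can’t install tree (for example, you don’t have sudo on a shared cluster), a rough equivalent with standard tools is to count the files and directories yourself:

> find HCPDevImgManifestBeh -type f | wc -l     # number of files in the download
> find HCPDevImgManifestBeh -type d | wc -l     # number of directories in the download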

Step 7:

Assuming you have had no issues, and your HCPDevImgManifestBeh folder is the expected size, look at the datastructure_manifest.txt file.  This file contains the S3 links for associated imaging data files. Note: If you happened to download a package that was created WITH associated imaging files, then this step is irrelevant - you are done. You likely wish to download only a subset of the files listed in the datastructure_manifest.txt file (if you downloaded all the imaging files for HCD, you would need 20 TB+ of space and many of the files would not be necessary for your analyses). The following command examples will help you create your desired subset of S3 links to pass to downloadcmd in ‘round 2’ of the download process, as described in Step 8.

NOTE:  the datastructure_manifest.txt file is the closest thing you have to a ‘filesystem’ view of the data at the NDA as of 2/23/2021.  There will likely be much confusion surrounding the use of datastructure_manifest.txt files once the NDA retires the ‘DataManager’ service, which supports a lot of tools that use tokens to grab data at the end of S3 links (including downloadcmd).

# Take a peek.

> cd HCPDevImgManifestBeh 
> ls -al 
> less datastructure_manifest.txt

# Look even closer:  Extract the 6th column of this file and pipe the output to ‘less’ for viewing

> cut -f6 datastructure_manifest.txt | less

#  Even closer:  From the 6th column of the datastructure_manifest.txt file, treat ‘/’ as the delimiter to get all the so-defined ‘columns’ after the 4th ‘/’ (and then remove the quotation marks with sed)

> cut -f6 datastructure_manifest.txt | cut -f5- -d'/' | sed 's/"//g' | less

#  Instead of manipulating the datastructure_manifest.txt file for a filesystem view, let’s pipe the complete S3 list (sans extraneous quotation marks) to a new file, from which we can extract subsets

> cut -f6 datastructure_manifest.txt | sed 's/"//g' > HCPDevImgManifestBeh_S3links

#  Observe that the fourth column of the datastructure manifest contains the HCD-specific (Lifespan 2.0) subset of our HCP-style package shortnames.  This is where our predefined “HCP packages” (available dataset filters on https://nda.nih.gov/general-query.html?q=query=featured-datasets:HCP%20Aging%20and%20Development) overlap with naming conventions in the datastructure_manifest.txt file.

> cut -f4 datastructure_manifest.txt  | cut -f4 -d'_' | sort -u

#  Subset to just the S3 links of a golden HCD subject (an HCP-D subject that has complete data for the Lifespan 2.0 Release):  

> grep HCD0001305 HCPDevImgManifestBeh_S3links > GoldenHCD_S3links

# Create a list of subjects for whom you’d like to grab a particular HCP package subset of imaging data.  Do it your way (e.g. by creating a list from the ndar_subject01.txt Subject Inventory file; a sketch for that appears at the end of this step) or do this for proof of concept:

> echo HCD0001305 > subjectlist.txt   
> echo HCD0008117 >> subjectlist.txt   		# append to this list

# String these commands together to create a file containing the S3 links for all of the PreprocStrucRecommended HCP package data for the two subjects in your subjectlist.txt file

> grep -f subjectlist.txt datastructure_manifest.txt | grep PreprocStrucRecommended | cut -f6 | sed 's/"//g' > PreprocStrucRecommendedSubjectSubsetS3s
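
# Optional sketch: if you’d rather build subjectlist.txt from the ndar_subject01.txt Subject Inventory file, one approach (assuming HCD subject IDs follow the ‘HCD’ plus seven digits pattern seen above) is:

> grep -o 'HCD[0-9]\{7\}' ndar_subject01.txt | sort -u > subjectlist.txt     # overwrite subjectlist.txt with every unique HCD ID in the inventory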

Step 8:

Having identified the subjects and packages you want to download and created a list of S3 links from the manifest, send this list to the downloadcmd for downloading.

Note:  Don’t forget to make sure your Python virtual environment is activated, if applicable.  Remember to check downloadcmd --help for updates.  Also double check your storage space.  Red arrows in this screenshot point to the commands that can perform these functions, if you don’t want to go back to previous steps in this tutorial to copy and paste them again.
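
In case the screenshot isn’t handy, those checks are just commands you have already used in earlier steps:

> source nda/bin/activate      # activate the virtual environment, if you made one
> downloadcmd --help           # re-check the available options
> df -h                        # double check your storage space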

Remember that the size of your download will be larger than that of the original package (e.g. 102 KB for HCPDevImgManifestBeh in the table below), which didn’t include associated files.  You are now about to download the associated image files (big data), but there is no good way to know how large this download will be using the downloadcmd --help options described in the screenshot above.  You will have to estimate the space you’ll need based on the shared package sizes and the sizes of the HCP-style packages for all subjects below.  (Eventually we will add updated tables for HCP package sizes for one complete subject.)

For reference, the list of links you created in Step 7 represents two subjects’ Preproc Structural Recommended data; based on the tables below, this means that AT MOST it will contain in the neighborhood of 62 GB (31*2) of data.  Given that the recommended data is roughly 7% (1669/22789) of all the data, you may be able to reduce this estimate: 7% of 62 GB is about 4.5 GB.  Use df -h (screenshot above) to see if you have that much space at your disposal.
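
As one more sanity check before you kick off the download, count how many S3 links ended up in your subset file (it will be a small number here, since it covers only preprocessed structural data for two subjects):

> wc -l PreprocStrucRecommendedSubjectSubsetS3s     # number of files downloadcmd will attempt to fetch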

Lifespan 2.0 Shared Packages

| Shared Package ID | Shared Package Name | Size |
| --- | --- | --- |
| 1185256 | HCPDevImgManifestBeh | 102 KB |
| 1185057 | HCPAgingImgManifestBeh | 102 KB |
| 1185341 | HCPDevelopment1Sub | 31 GB |
| 1185340 | HCPAging1Sub | 27 GB |
| 1185264 | HCPDevelopmentRec | 1669 GB |
| 1185234 | HCPAgingRec | 1742 GB |
| 1185249 | HCPDevAllFiles | 22789 GB |
| 1184998 | HCPAgingAllFiles | 22858 GB |

Lifespan 2.0 Datasets (HCP-style Packages) available in OPTION 2:

| Study | NDA structure | HCP-Style Package shortname | HCP-Style Package OPTION 2 Filter Name | Size All Subjects | HCP Package Contents |
| --- | --- | --- | --- | --- | --- |
| HCA HCD | imagingcollection01 | UnprocStruc | Structural Unprocessed | 194 GB / 203 GB | multi-echo MPRAGE (T1 weighted) and T2-SPACE (T2 weighted) scans (in NIFTI format) |
| HCA | imagingcollection01 | UnprocTseHires | HiResHp Structural Unprocessed | 10 GB | turbo-spin-echo high spatial resolution hippocampal structural scan (in NIFTI format) |
| HCA HCD | imagingcollection01 | UnprocRfmri | Resting State rfMRI Unprocessed | 1.3 TB / 1.3 TB | both pairs of resting state fMRI scans (in NIFTI format) |
| HCA HCD | imagingcollection01 | UnprocTfmriCarit | tfMRI CARIT Unprocessed | 237 GB / 417 GB | fMRI scans for the CARIT task (in NIFTI format; Go/NoGo Conditioned Approach Response Inhibition Task) |
| HCA | imagingcollection01 | UnprocTfmriFacename | tfMRI FACENAME Unprocessed | 271 GB | fMRI scan for the FACENAME task (in NIFTI format; paired-associative memory task) |
| HCA | imagingcollection01 | UnprocTfmriVismotor | tfMRI VISMOTOR Unprocessed | 154 GB | fMRI scan for the VISMOTOR task (in NIFTI format; simultaneous motor and visual activation task) |
| HCD | imagingcollection01 | UnprocTfmriEmotion | tfMRI EMOTION Unprocessed | 126 GB | fMRI scan for the EMOTION task (in NIFTI format; emotion and face-processing task) |
| HCD | imagingcollection01 | UnprocTfmriGuessing | tfMRI GUESSING Unprocessed | 391 GB | fMRI scans for the GUESSING task (in NIFTI format; reward, punishment, anticipatory reactivity task) |
| HCA HCD | imagingcollection01 | UnprocDmri | Diffusion Unprocessed | 528 GB / 566 GB | dMRI scans (in NIFTI format), bval, and bvec files for the two sets of diffusion sensitizing directions (‘dir98’ and ‘dir99’) |
| HCA HCD | imagingcollection01 | UnprocPcasl | ASL Unprocessed | 27 GB / 27 GB | mbPCASLhr scan (in NIFTI format; multiband 2D EPI pseudo-continuous arterial spin labeling with high spatial resolution) |
| HCA HCD | fmriresults01 | PreprocStrucRecommended | Structural Preprocessed Recommended | 521 GB / 476 GB | recommended starting point for structural analyses and contains files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCA HCD | fmriresults01 | PreprocStrucLegacy | Structural Preprocessed Legacy | 526 GB / 481 GB | structural files coarsely aligned across subjects using the MSMSulc folding surface registration |
| HCA HCD | fmriresults01 | PreprocStrucFreesurfer | Structural Preprocessed FreeSurfer | 585 GB / 541 GB | actual outputs from the FreeSurferPipeline stage of the HCP Structural Preprocessing, in FreeSurfer’s native file formats and directory structure |
| HCA HCD | fmriresults01 | PreprocStrucExtended | Structural Preprocessed Extended | 160 GB / 147 GB | additional files related to QC on structural preprocessing outputs and other extra files that may be useful to select users |
| HCA HCD | fmriresults01 | PreprocRfmriRecommended | rfMRI Preprocessed Recommended | 991 GB / 883 GB | recommended starting point for rfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCA HCD | fmriresults01 | PreprocRfmriLegacySurface | rfMRI Preprocessed Legacy Surface | 989 GB / 881 GB | cleaned files coarsely aligned across subjects using the MSMSulc folding surface registration, and hcp_fix_multi_run |
| HCA HCD | fmriresults01 | PreprocRfmriLegacyVolume | rfMRI Preprocessed Legacy Volume | 2.1 TB / 2.0 TB | cleaned rfMRI files poorly aligned across subjects using nonlinear volume registration |
| HCA HCD | fmriresults01 | PreprocRfmriUncleaned | rfMRI Preprocessed Uncleaned | 3.5 TB / 2.4 TB | uncleaned resting state data of all registration types for use in testing alternative data cleanup strategies |
| HCA HCD | fmriresults01 | PreprocRfmriExtended | rfMRI Preprocessed Extended | 4.9 TB / 5.1 TB | additional files related to rfMRI data cleanup and other extra files that may be useful to select users |
| HCA HCD | fmriresults01 | PreprocTfmriCaritRecommended | tfMRI CARIT Preprocessed Recommended | 82 GB / 154 GB | recommended starting point for CARIT tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCA HCD | fmriresults01 | PreprocTfmriCaritLegacySurface | tfMRI CARIT Preprocessed Legacy Surface | 82 GB / 154 GB | cleaned CARIT tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration |
| HCA HCD | fmriresults01 | PreprocTfmriCaritLegacyVolume | tfMRI CARIT Preprocessed Legacy Volume | 166 GB / 328 GB | cleaned CARIT tfMRI files poorly aligned across subjects using nonlinear volume registration |
| HCA HCD | fmriresults01 | PreprocTfmriCaritUncleaned | tfMRI CARIT Preprocessed Uncleaned | 540 GB / 1.0 TB | uncleaned tfMRI CARIT data of all registration types for use in testing alternative data cleanup strategies |
| HCA HCD | fmriresults01 | PreprocTfmriCaritExtended | tfMRI CARIT Preprocessed Extended | 50 GB / 93 GB | additional CARIT tfMRI files related to data cleanup and other extra files that may be useful to select users |
| HCA | fmriresults01 | PreprocTfmriFacenameRecommended | tfMRI FACENAME Preprocessed Recommended | 93 GB | recommended starting point for FACENAME tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCA | fmriresults01 | PreprocTfmriFacenameLegacySurface | tfMRI FACENAME Preprocessed Legacy Surface | 93 GB | cleaned FACENAME tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration |
| HCA | fmriresults01 | PreprocTfmriFacenameLegacyVolume | tfMRI FACENAME Preprocessed Legacy Volume | 189 GB | cleaned FACENAME tfMRI files poorly aligned across subjects using nonlinear volume registration |
| HCA | fmriresults01 | PreprocTfmriFacenameUncleaned | tfMRI FACENAME Preprocessed Uncleaned | 619 GB | uncleaned tfMRI FACENAME data of all registration types for use in testing alternative data cleanup strategies |
| HCA | fmriresults01 | PreprocTfmriFacenameExtended | tfMRI FACENAME Preprocessed Extended | 50 GB | additional FACENAME tfMRI files related to data cleanup and other extra files that may be useful to select users |
| HCA | fmriresults01 | PreprocTfmriVismotorRecommended | tfMRI VISMOTOR Preprocessed Recommended | 56 GB | recommended starting point for VISMOTOR tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCA | fmriresults01 | PreprocTfmriVismotorLegacySurface | tfMRI VISMOTOR Preprocessed Legacy Surface | 56 GB | cleaned VISMOTOR tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration |
| HCA | fmriresults01 | PreprocTfmriVismotorLegacyVolume | tfMRI VISMOTOR Preprocessed Legacy Volume | 109 GB | cleaned VISMOTOR tfMRI files poorly aligned across subjects using nonlinear volume registration |
| HCA | fmriresults01 | PreprocTfmriVismotorUncleaned | tfMRI VISMOTOR Preprocessed Uncleaned | 346 GB | uncleaned tfMRI VISMOTOR data of all registration types for use in testing alternative data cleanup strategies |
| HCA | fmriresults01 | PreprocTfmriVismotorExtended | tfMRI VISMOTOR Preprocessed Extended | 50 GB | additional VISMOTOR tfMRI files related to data cleanup and other extra files that may be useful to select users |
| HCD | fmriresults01 | PreprocTfmriEmotionRecommended | tfMRI EMOTION Preprocessed Recommended | 55 GB | recommended starting point for EMOTION tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCD | fmriresults01 | PreprocTfmriEmotionLegacySurface | tfMRI EMOTION Preprocessed Legacy Surface | 55 GB | cleaned EMOTION tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration |
| HCD | fmriresults01 | PreprocTfmriEmotionLegacyVolume | tfMRI EMOTION Preprocessed Legacy Volume | 114 GB | cleaned EMOTION tfMRI files poorly aligned across subjects using nonlinear volume registration |
| HCD | fmriresults01 | PreprocTfmriEmotionUncleaned | tfMRI EMOTION Preprocessed Uncleaned | 308 GB | uncleaned tfMRI EMOTION data of all registration types for use in testing alternative data cleanup strategies |
| HCD | fmriresults01 | PreprocTfmriEmotionExtended | tfMRI EMOTION Preprocessed Extended | 47 GB | additional EMOTION tfMRI files related to data cleanup and other extra files that may be useful to select users |
| HCD | fmriresults01 | PreprocTfmriGuessingRecommended | tfMRI GUESSING Preprocessed Recommended | 101 GB | recommended starting point for GUESSING tfMRI analyses and contains cleaned files precisely aligned across subjects using the MSMAll multi-modal surface registration |
| HCD | fmriresults01 | PreprocTfmriGuessingLegacySurface | tfMRI GUESSING Preprocessed Legacy Surface | 145 GB | cleaned GUESSING tfMRI files coarsely aligned across subjects using the MSMSulc folding surface registration |
| HCD | fmriresults01 | PreprocTfmriGuessingLegacyVolume | tfMRI GUESSING Preprocessed Legacy Volume | 309 GB | cleaned GUESSING tfMRI files poorly aligned across subjects using nonlinear volume registration |
| HCD | fmriresults01 | PreprocTfmriGuessingUncleaned | tfMRI GUESSING Preprocessed Uncleaned | 651 GB | uncleaned tfMRI GUESSING data of all registration types for use in testing alternative data cleanup strategies |
| HCD | fmriresults01 | PreprocTfmriGuessingExtended | tfMRI GUESSING Preprocessed Extended | 94 GB | additional GUESSING tfMRI files related to data cleanup and other extra files that may be useful to select users |

Now, back to the download itself: activate your virtual environment (if applicable) and pass the list of S3 links to downloadcmd.

> source ../nda/bin/activate 
> downloadcmd -dp 1185249 -t  PreprocStrucRecommendedSubjectSubsetS3s -u <your username> -d  HCPDevImgManifestBehSUBSET -wt 8

Note: you can also just download a single file as follows:

> downloadcmd -dp 1185249 s3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz -u <your username> -d HCPDevImgManifestBehSingleLink -wt 8

Once again, check that your download contains the data expected, for example with du and tree commands.

>  du -h HCPDevImgManifestBehSUBSET
>  tree HCPDevImgManifestBehSUBSET

Helpful tip:  If you’re on a VPN and the VPN times out before the download completes, you may wish to preface your command with nohup (no hangup) to keep the command running in the background.  You can also time how long the download takes.

e.g. the following will download a specific S3 link and send the verbose output AND the time it took to download everything to jobtimer.txt:

> nohup time downloadcmd -dp 1185249 s3://NDAR_Central_3/submission_33230/HCD0008117_V1_MR/T1w/T1w_acpc_dc_restore.nii.gz -u <your username> -d HCPDevImgManifestBehSingleLink -wt 8 > jobtimer.txt 2>&1 &

ALL CAPS EXTRA SPECIAL NOTE:

You must put BOTH a package number AND a list of S3 links in the command to download subsets via the command line.

BUT HERE IS THE CATCH:  the number of the package has to be associated with a ‘fully loaded’ and shared package (one containing all imaging data files, e.g. HCPDevAllFiles), which you don’t want to download in full unless you have 23 TB handy and endless free time to babysit the process.  Currently, the only way to obtain the datastructure_manifest.txt without downloading the entire release is through the HCPDevImgManifestBeh package.  Contact NDA’s helpdesk (ndahelp@mail.nih.gov) for updates, and look for updates on the NDAR repo pages (https://github.com/NDAR).

Attachments