This article shows how to download hundreds of BAM files in parallel using the anthill cluster.
Assume you have a file download.list that contains, one location per line, the files you would like to download with wget. The beginning of this file can look like this:
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223712/GVA311_CGG_1_019861.Horse_mt.realigned.r.t.s.m.s.bam
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223713/NewBotai_31_CGG_1_017023.Horse_mt.realigned.r.t.s.merged.bam
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223714/Arz3_CGG_1_017089_i9_GATCAG_U.r.t.s.bam
Log in to pier23 or anthill23. Go to the W01_NFS storage with cd /workspace/${USER} and create a new folder for your download.list file. Remember to check with quota -sg how much free space you have. If there is not enough, compress or remove some old files, or request a quota increase.
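For example (a minimal sketch; old_project is a hypothetical folder name standing in for whatever you want to archive):

quota -sg
# pack a no-longer-needed folder into one compressed archive, then remove the original
tar -czf old_project.tar.gz old_project && rm -r old_project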
Let's assume your download.list file is located in the new folder /workspace/${USER}/anthill23_wget on the anthill23 host.
In the same folder (/workspace/${USER}/anthill23_wget) create a subfolder batch.
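The whole setup can be done with a few commands (a sketch; it assumes download.list currently sits in your home directory):

cd /workspace/${USER}
mkdir -p anthill23_wget/batch
# assumption: the list was in ${HOME}; adjust the source path to wherever yours is
mv ~/download.list anthill23_wget/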
In /workspace/${USER}/anthill23_wget/batch/ create the file download_list.batch with the following content:
#!/bin/bash -l
#SBATCH --job-name="wget list download"
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=medium
#SBATCH --time=1-01:59:59

WORK_DIR="/workspace/${USER}/anthill23_wget"
LIST_FILE_PATH="${WORK_DIR}/download.list"
DOWNLOAD_DIR="${WORK_DIR}/results"

cd ${WORK_DIR}
mkdir -p ${DOWNLOAD_DIR}

# pick the line of download.list that corresponds to this array task
GET_LINE=`sed "${SLURM_ARRAY_TASK_ID}q;d" ${LIST_FILE_PATH}`

echo "date: `date -R`"
echo "host: `hostname`"
echo "task id: ${SLURM_ARRAY_TASK_ID}"
echo "will download ${GET_LINE}"
echo "start timestamp: `date +%s`"
echo " "

cd ${DOWNLOAD_DIR}
# -c resumes a partial download if the task is re-run
wget -c "${GET_LINE}"

echo " "
echo "ended"
echo "end timestamp: `date +%s`"
echo " "
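Before submitting the whole array, it can be worth a quick smoke test that submits only array index 1 and downloads just the first file:

cd /workspace/${USER}/anthill23_wget/batch/
sbatch --array=1 download_list.batch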
To run /workspace/${USER}/anthill23_wget/batch/download_list.batch we need to provide an additional argument giving the number of lines in the download.list file, e.g. --array=1-999. Below, the number of lines in download.list is counted automatically (with the wc command).
cd /workspace/${USER}/anthill23_wget/batch/
sbatch --array=1-`cat ../download.list | wc -l` download_list.batch
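To double-check the array range before submitting, you can count the lines yourself first:

wc -l ../download.list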
To see how many array jobs are waiting to be executed (PD, PENDING):
[tmatejuk@anthill23 ]$ squeue -u ${USER} -t PD
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   11547_[714-878]    medium wget lis tmatejuk PD       0:00      1 (Resources)
To see how many array jobs are in progress (R, RUNNING):
[tmatejuk@anthill23 ]$ squeue -u ${USER} -t R
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         11547_385    medium wget lis tmatejuk  R      37:02      1 ant009
         11547_383    medium wget lis tmatejuk  R      37:13      1 ant008
         11547_382    medium wget lis tmatejuk  R      37:16      1 ant008
...
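If you only want plain counts instead of full listings, a small sketch using standard squeue flags (-h suppresses the header, -r lists one array task per line):

squeue -u ${USER} -t PD -h -r | wc -l   # number of pending array tasks
squeue -u ${USER} -t R -h -r | wc -l    # number of running array tasks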
Where the jobs' standard output goes:
[tmatejuk@anthill23 ]$ cd /workspace/${USER}/anthill23_wget/batch
[tmatejuk@anthill23 ]$ cat slurm-11547_682.out
date: Sun, 22 Sep 2019 13:51:31 +0200
host: ant606
task id: 682
will download ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3225474/LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai
start timestamp: 1569153091

--2019-09-22 13:51:31--  http://ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3225474/LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.192.7
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.192.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7198680 (6.9M) [application/octet-stream]
Saving to: ‘LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai’

     0K .......... .......... .......... .......... ..........  0%  469K 15s
    50K .......... .......... .......... .......... ..........  1%  228K 23s
...
  6900K .......... .......... .......... .......... .......... 98%  574K 0s
  6950K .......... .......... .......... .......... .......... 99%  257K 0s
  7000K .......... .......... .........                       100%  695K=24s

2019-09-22 13:51:56 (297 KB/s) - ‘LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai’ saved [7198680/7198680]

ended
end timestamp: 1569153116
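Since every task prints "ended" on success, a quick way to spot failed or incomplete tasks is to list the log files that do not contain that marker (grep -L prints files without a match; the job id 11547 is the one from this example run):

cd /workspace/${USER}/anthill23_wget/batch
grep -L "ended" slurm-11547_*.out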
Where the downloaded files end up:
[tmatejuk@anthill23 ]$ cd /workspace/${USER}/anthill23_wget/results
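A simple sanity check is to compare the number of downloaded files with the number of entries in the list (they should match once all tasks have finished):

ls /workspace/${USER}/anthill23_wget/results | wc -l
wc -l /workspace/${USER}/anthill23_wget/download.list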
Usually one wget process uses only a few percent of a CPU core, and the download speed for a single file from public FTP servers is typically around ~1 MB/s. A more efficient approach would therefore be to submit more than one wget process per core, or one wget process with more than one URL. That change, however, would significantly complicate the batch file and reduce the clarity of the whole process; a rough sketch of such a variant is shown below.
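For illustration only, here is a minimal, untested sketch of the multi-URL variant. The file name download_list_multi.batch and the URLS_PER_TASK=4 setting are assumptions, not part of the original setup:

#!/bin/bash -l
#SBATCH --job-name="wget multi download"
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=medium
#SBATCH --time=1-01:59:59

WORK_DIR="/workspace/${USER}/anthill23_wget"
LIST_FILE_PATH="${WORK_DIR}/download.list"
DOWNLOAD_DIR="${WORK_DIR}/results"
URLS_PER_TASK=4   # hypothetical choice: how many URLs each array task handles

mkdir -p ${DOWNLOAD_DIR}
cd ${DOWNLOAD_DIR}

# each array task takes its own slice of URLS_PER_TASK consecutive lines
START=$(( (SLURM_ARRAY_TASK_ID - 1) * URLS_PER_TASK + 1 ))
END=$(( START + URLS_PER_TASK - 1 ))

# xargs runs up to URLS_PER_TASK wget processes in parallel, one URL each
sed -n "${START},${END}p" ${LIST_FILE_PATH} | xargs -n 1 -P ${URLS_PER_TASK} wget -c

The array range then has to be the line count divided by URLS_PER_TASK, rounded up, for example:

sbatch --array=1-$(( ( $(wc -l < ../download.list) + 3 ) / 4 )) download_list_multi.batch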