This article shows how to download hundreds of BAM files in parallel using the anthill cluster.
Assume you have a file download.list that contains, one location per line, the files you would like to download with wget. The beginning of this file can look like this:
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223712/GVA311_CGG_1_019861.Horse_mt.realigned.r.t.s.m.s.bam
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223713/NewBotai_31_CGG_1_017023.Horse_mt.realigned.r.t.s.merged.bam
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223714/Arz3_CGG_1_017089_i9_GATCAG_U.r.t.s.bam
Log in to pier23 or anthill23. Go to the W01_NFS storage with cd /workspace/${USER} and create a new folder for your download.list file. Remember to check with quota -sg how much free space you have. If there is not enough, compress or remove some old files, or request a quota increase.
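For example (a minimal sketch; old_project is a hypothetical folder name standing in for whatever you want to archive):

quota -sg
# pack a no-longer-needed folder into one compressed archive, then remove the original
tar -czf old_project.tar.gz old_project && rm -r old_project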
Let's assume your download.list file is located in the new folder /workspace/${USER}/anthill23_wget on the anthill23 host.
In the same folder (/workspace/${USER}/anthill23_wget) create a subfolder batch.
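The whole setup can be done with a few commands (a sketch; it assumes download.list currently sits in your home directory):

cd /workspace/${USER}
mkdir -p anthill23_wget/batch
# assumption: the list was in ${HOME}; adjust the source path to wherever yours is
mv ~/download.list anthill23_wget/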
In /workspace/${USER}/anthill23_wget/batch/ create the file download_list.batch with the following content:
#!/bin/bash -l
#SBATCH --job-name="wget list download"
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=medium
#SBATCH --time=1-01:59:59

WORK_DIR="/workspace/${USER}/anthill23_wget"
LIST_FILE_PATH="${WORK_DIR}/download.list"
DOWNLOAD_DIR="${WORK_DIR}/results"

cd ${WORK_DIR}
mkdir -p ${DOWNLOAD_DIR}

# pick the line of download.list that corresponds to this array task
GET_LINE=`sed "${SLURM_ARRAY_TASK_ID}q;d" ${LIST_FILE_PATH}`

echo "date: `date -R`"
echo "host: `hostname`"
echo "task id: ${SLURM_ARRAY_TASK_ID}"
echo "will download ${GET_LINE}"
echo "start timestamp: `date +%s`"
echo " "

cd ${DOWNLOAD_DIR}
# -c resumes a partial download if the task is re-run
wget -c "${GET_LINE}"

echo " "
echo "ended"
echo "end timestamp: `date +%s`"
echo " "
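Before submitting the whole array, it can be worth a quick smoke test that submits only array index 1 and downloads just the first file:

cd /workspace/${USER}/anthill23_wget/batch/
sbatch --array=1 download_list.batch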
To run /workspace/${USER}/anthill23_wget/batch/download_list.batch we need to provide an additional argument giving the number of lines in the download.list file, e.g. --array=1-999. Below, the number of lines in download.list is counted automatically (with the wc command).
cd /workspace/${USER}/anthill23_wget/batch/
sbatch --array=1-`cat ../download.list | wc -l` download_list.batch
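To double-check the array range before submitting, you can count the lines yourself first:

wc -l ../download.list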
To see how many array jobs are waiting to be executed (PD, PENDING):
[tmatejuk@anthill23 ]$ squeue -u ${USER} -t PD
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   11547_[714-878]    medium wget lis tmatejuk PD       0:00      1 (Resources)
To see how many array jobs are in progress (R, RUNNING):
[tmatejuk@anthill23 ]$ squeue -u ${USER} -t R
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         11547_385    medium wget lis tmatejuk  R      37:02      1 ant009
         11547_383    medium wget lis tmatejuk  R      37:13      1 ant008
         11547_382    medium wget lis tmatejuk  R      37:16      1 ant008
...
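If you only want plain counts instead of full listings, a small sketch using standard squeue flags (-h suppresses the header, -r lists one array task per line):

squeue -u ${USER} -t PD -h -r | wc -l   # number of pending array tasks
squeue -u ${USER} -t R -h -r | wc -l    # number of running array tasks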
Where the jobs' standard output goes:
[tmatejuk@anthill23 ]$ cd /workspace/${USER}/anthill23_wget/batch
[tmatejuk@anthill23 ]$ cat slurm-11547_682.out
date: Sun, 22 Sep 2019 13:51:31 +0200
host: ant606
task id: 682
will download ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3225474/LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai
start timestamp: 1569153091

--2019-09-22 13:51:31--  http://ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3225474/LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.192.7
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.192.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7198680 (6.9M) [application/octet-stream]
Saving to: ‘LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai’

     0K .......... .......... .......... .......... ..........  0%  469K 15s
    50K .......... .......... .......... .......... ..........  1%  228K 23s
...
  6900K .......... .......... .......... .......... .......... 98%  574K 0s
  6950K .......... .......... .......... .......... .......... 99%  257K 0s
  7000K .......... .......... .........                       100%  695K=24s

2019-09-22 13:51:56 (297 KB/s) - ‘LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai’ saved [7198680/7198680]

ended
end timestamp: 1569153116
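Since every task prints "ended" on success, a quick way to spot failed or incomplete tasks is to list the log files that do not contain that marker (grep -L prints files without a match; the job id 11547 is the one from this example run):

cd /workspace/${USER}/anthill23_wget/batch
grep -L "ended" slurm-11547_*.out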
Where the downloaded files end up:
[tmatejuk@anthill23 ]$ cd /workspace/${USER}/anthill23_wget/results
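A simple sanity check is to compare the number of downloaded files with the number of entries in the list (they should match once all tasks have finished):

ls /workspace/${USER}/anthill23_wget/results | wc -l
wc -l /workspace/${USER}/anthill23_wget/download.list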
Usually one wget process uses only a few percent of a CPU core, and the download speed for a single file from public FTP servers is typically around ~1 MB/s. A more efficient approach would therefore be to submit more than one wget process per core, or one wget process with more than one URL. That change, however, would significantly complicate the batch file and reduce the clarity of the whole process; a rough sketch of such a variant is shown below.
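For illustration only, here is a minimal, untested sketch of the multi-URL variant. The file name download_list_multi.batch and the URLS_PER_TASK=4 setting are assumptions, not part of the original setup:

#!/bin/bash -l
#SBATCH --job-name="wget multi download"
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=medium
#SBATCH --time=1-01:59:59

WORK_DIR="/workspace/${USER}/anthill23_wget"
LIST_FILE_PATH="${WORK_DIR}/download.list"
DOWNLOAD_DIR="${WORK_DIR}/results"
URLS_PER_TASK=4   # hypothetical choice: how many URLs each array task handles

mkdir -p ${DOWNLOAD_DIR}
cd ${DOWNLOAD_DIR}

# each array task takes its own slice of URLS_PER_TASK consecutive lines
START=$(( (SLURM_ARRAY_TASK_ID - 1) * URLS_PER_TASK + 1 ))
END=$(( START + URLS_PER_TASK - 1 ))

# xargs runs up to URLS_PER_TASK wget processes in parallel, one URL each
sed -n "${START},${END}p" ${LIST_FILE_PATH} | xargs -n 1 -P ${URLS_PER_TASK} wget -c

The array range then has to be the line count divided by URLS_PER_TASK, rounded up, for example:

sbatch --array=1-$(( ( $(wc -l < ../download.list) + 3 ) / 4 )) download_list_multi.batch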