==== description ====

This article shows how to download hundreds of BAM files in parallel using the Anthill cluster. \\
It can also be used as a template for similar tasks on the Anthill cluster.

==== file list ====

Assume you have a file ''download.list'' whose lines contain the locations of files that you would like to download with ''wget''. The beginning of this file can look like this:

<code>
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223712/GVA311_CGG_1_019861.Horse_mt.realigned.r.t.s.m.s.bam
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223713/NewBotai_31_CGG_1_017023.Horse_mt.realigned.r.t.s.merged.bam
ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3223714/Arz3_CGG_1_017089_i9_GATCAG_U.r.t.s.bam
</code>

==== where to download ====

Log in to anthill23, go to the ''W01_NFS'' storage with ''cd /workspace/${USER}'' and create a new folder for your ''download.list'' file. Remember to check with ''quota -sg'' how much free space you have; if there is not enough, compress or remove some old files, or request a quota increase. \\
Let's assume your ''download.list'' file is located in the new folder ''/workspace/${USER}/anthill23_wget'' on the anthill23 host.

==== sbatch example ====

In the same folder (''/workspace/${USER}/anthill23_wget'') create a subfolder ''batch''. \\
In ''/workspace/${USER}/anthill23_wget/batch/'' create a file ''download_list.batch''.

File ''download_list.batch'':

<code bash>
#!/bin/bash -l
#SBATCH --job-name="wget list download"
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=medium
#SBATCH --time=1-01:59:59

WORK_DIR="/workspace/${USER}/anthill23_wget"
LIST_FILE_PATH="${WORK_DIR}/download.list"
DOWNLOAD_DIR="${WORK_DIR}/results"

cd ${WORK_DIR}
mkdir -p ${DOWNLOAD_DIR}

# pick the single line of download.list that matches this array task's index
GET_LINE=`sed "${SLURM_ARRAY_TASK_ID}q;d" ${LIST_FILE_PATH}`

echo "date: `date -R`"
echo "host: `hostname`"
echo "task id: ${SLURM_ARRAY_TASK_ID}"
echo "will download ${GET_LINE}"
echo "start timestamp: `date +%s`"
echo " "

cd ${DOWNLOAD_DIR}
wget -c "${GET_LINE}"

echo " "
echo "ended"
echo "end timestamp: `date +%s`"
echo " "
</code>

==== run sbatch example ====

To run ''/workspace/${USER}/anthill23_wget/batch/download_list.batch'' we need to provide an additional ''--array=1-999''-style argument saying how many lines there are in the ''download.list'' file. In our case the number of lines in ''download.list'' is counted automatically (with the ''wc'' command):

<code bash>
cd /workspace/${USER}/anthill23_wget/batch/
sbatch --array=1-`cat ../download.list | wc -l` download_list.batch
</code>
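If you want to limit how many of those array tasks run at the same time (for example, to be gentle to the remote FTP server), SLURM's ''--array'' option also accepts a ''%'' throttle suffix. A minimal sketch; the limit of 50 concurrent tasks is an arbitrary assumption:

<code bash>
cd /workspace/${USER}/anthill23_wget/batch/
# the %50 suffix lets at most 50 array tasks run simultaneously
sbatch --array=1-$(wc -l < ../download.list)%50 download_list.batch
</code>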
==== examine download / job status ====

How many array jobs are waiting (PD - PENDING) to be executed:

<code>
[tmatejuk@anthill23 ]$ squeue -u ${USER} -t PD
          JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
11547_[714-878]    medium wget lis tmatejuk PD       0:00      1 (Resources)
</code>

How many array jobs are in progress (R - RUNNING):

<code>
[tmatejuk@anthill23 ]$ squeue -u ${USER} -t R
          JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      11547_385    medium wget lis tmatejuk  R      37:02      1 ant009
      11547_383    medium wget lis tmatejuk  R      37:13      1 ant008
      11547_382    medium wget lis tmatejuk  R      37:16      1 ant008
...
</code>

Where the jobs' standard output is:

<code>
[tmatejuk@anthill23 ]$ cd /workspace/${USER}/anthill23_wget/batch
[tmatejuk@anthill23 ]$ cat slurm-11547_682.out
date: Sun, 22 Sep 2019 13:51:31 +0200
host: ant606
task id: 682
will download ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3225474/LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai
start timestamp: 1569153091

--2019-09-22 13:51:31--  http://ftp.sra.ebi.ac.uk/vol1/run/ERR322/ERR3225474/LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.192.7
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.192.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7198680 (6.9M) [application/octet-stream]
Saving to: ‘LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai’

     0K .......... .......... .......... .......... ..........  0%  469K 15s
    50K .......... .......... .......... .......... ..........  1%  228K 23s
...
  6900K .......... .......... .......... .......... .......... 98%  574K 0s
  6950K .......... .......... .......... .......... .......... 99%  257K 0s
  7000K .......... .......... .........                       100%  695K=24s

2019-09-22 13:51:56 (297 KB/s) - ‘LOBOT_B_CGG_1_020182.Horse_nuc_wY.realigned.r.t.m.3p1.S.bam.bai’ saved [7198680/7198680]

ended
end timestamp: 1569153116
</code>

Where the downloaded files are:

<code>
[tmatejuk@anthill23 ]$ cd /workspace/${USER}/anthill23_wget/results
</code>

==== other / limits ====

Usually one ''wget'' process uses only a few percent of a CPU core, and the download speed for a single file from a public FTP server is typically about 1 MB/s. A more efficient approach would therefore be to run more than one ''wget'' process on one core, or one ''wget'' process with more than one URL (see the sketch below). That change, however, would significantly complicate the batch file and reduce the clarity of the whole process, and it is hard to say how much more efficient the download would actually be. It is very unlikely, but currently (2019.09) the download process described here could generate roughly 5 GB/s of total transfer on the Anthill cluster. Such a transfer would still be handled by CeNT's network/storage infrastructure, but it is quite possible that the resulting latency increase of CeNT's storage/network would be noticeable for all users.
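As an illustration of the "one ''wget'' process with more than one URL" idea, here is a hedged sketch of a chunked variant of the batch file. The file name ''download_chunk.batch'', the ''CHUNK_SIZE'' of 10 URLs per task, and the ceiling division in the submit command are assumptions for illustration, not a tested recipe:

<code bash>
#!/bin/bash -l
#SBATCH --job-name="wget chunk download"
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --partition=medium
#SBATCH --time=1-01:59:59

# assumption: 10 URLs per array task; tune as needed
CHUNK_SIZE=10

WORK_DIR="/workspace/${USER}/anthill23_wget"
LIST_FILE_PATH="${WORK_DIR}/download.list"
DOWNLOAD_DIR="${WORK_DIR}/results"

mkdir -p "${DOWNLOAD_DIR}"
cd "${DOWNLOAD_DIR}"

# range of download.list lines handled by this array task
FIRST_LINE=$(( (SLURM_ARRAY_TASK_ID - 1) * CHUNK_SIZE + 1 ))
LAST_LINE=$(( SLURM_ARRAY_TASK_ID * CHUNK_SIZE ))

# sed prints this task's chunk of URLs; wget -i - reads the URL list from stdin
sed -n "${FIRST_LINE},${LAST_LINE}p" "${LIST_FILE_PATH}" | wget -c -i -
</code>

The submit command then needs one array task per chunk (ceiling division, so a final partial chunk is not lost; the 10 here must match ''CHUNK_SIZE''):

<code bash>
LINES=$(wc -l < /workspace/${USER}/anthill23_wget/download.list)
sbatch --array=1-$(( (LINES + 9) / 10 )) download_chunk.batch
</code>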