CD-HIT is a fast program for clustering and comparing large sets of protein or nucleotide sequences. It has been accelerated for clustering next-generation sequencing data. Software page: http://cd-hit.org.
CD-HIT was installed from the Ubuntu 18.04 repository (CD-HIT version 4.7, built on Jul 1 2017).
The input file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833), the complete UniProtKB/Swiss-Prot data set in flat file format, was downloaded from here.
Take note of the -M option, which sets the cd-hit memory limit in MB. The job script does not read the allocated memory from a Slurm environment variable, so the memory requested for the job (--mem) and the value passed to -M have to be kept consistent by hand. In this test the job requests 15G and cd-hit is limited to 14000 MB.
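One way to keep the two values in step is to compute the -M value inside the job script instead of typing it twice. The following is only a sketch under stated assumptions, not part of the original test: the helper name `mem_to_mb`, the variables `JOB_MEM` and `HEADROOM_MB`, and the 1000 MB margin are all hypothetical, and the helper only handles whole numbers with an optional K, M, G, or T suffix.

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm-style memory string (e.g. 15G) to MB.
# A bare number is treated as MB, which is Slurm's default unit for --mem.
mem_to_mb() {
    local value=$1
    local number=${value%[KkMmGgTt]}   # strip one trailing unit letter, if any
    local unit=${value#"$number"}      # whatever was stripped (may be empty)
    case "$unit" in
        K|k)    echo $(( number / 1024 )) ;;
        M|m|"") echo "$number" ;;
        G|g)    echo $(( number * 1024 )) ;;
        T|t)    echo $(( number * 1024 * 1024 )) ;;
    esac
}

# Assumption: JOB_MEM repeats the value given to "#SBATCH --mem".
JOB_MEM=15G
# Assumption: margin left for everything that is not covered by the cd-hit limit.
HEADROOM_MB=1000

CDHIT_MEM=$(( $(mem_to_mb "$JOB_MEM") - HEADROOM_MB ))
echo "cd-hit memory limit (-M): ${CDHIT_MEM} MB"
```

With these assumed values the script prints a limit of 14360 MB, which could then be passed as `-M "${CDHIT_MEM}"` in place of a hard-coded number.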

    #!/bin/bash -l
    #SBATCH --partition=short
    #SBATCH --ntasks=8
    #SBATCH --mem 15G

    #input filename
    FASTAFILE=uniprot_sprot.fasta

    #dir with input file and dir for results
    INPUTFILEDIR="/workspace/${USER}/uniprot_files/"

    #printout to output file some info
    echo "cd-hit anthill test"
    echo "input_file : " ${FASTAFILE}
    echo "date: " `date`
    echo "node_name: " `hostname`
    echo "cores: " ${SLURM_NPROCS}

    #run cdhit computation and remove output files
    cd ${INPUTFILEDIR}
    cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000

The tables below show how efficiently cd-hit scales with the number of cores used. Each computation was run 3 times, separately for each node type (ant6nn and ant00n).
node | cores | min [s] | avg [s] | median [s] | max [s] | efficiency [%] |
---|---|---|---|---|---|---|
ant6nn | 1 | 107.66 | 108.18 | 108.21 | 108.78 | 100.00% |
ant6nn | 2 | 103.79 | 104.00 | 104.02 | 104.16 | 52.01% |
ant6nn | 4 | 75.78 | 76.11 | 76.21 | 76.29 | 35.50% |
ant6nn | 8 | 62.15 | 62.18 | 62.17 | 62.26 | 21.76% |
ant6nn | 10 | 57.41 | 58.53 | 58.92 | 59.10 | 18.36% |

node | cores | min [s] | avg [s] | median [s] | max [s] | efficiency [%] |
---|---|---|---|---|---|---|
ant00n | 1 | 150.84 | 160.29 | 151.97 | 178.35 | 100.00% |
ant00n | 2 | 154.55 | 158.30 | 155.96 | 170.05 | 48.72% |
ant00n | 4 | 114.23 | 117.90 | 116.52 | 127.71 | 32.61% |
ant00n | 8 | 92.79 | 108.49 | 99.84 | 139.51 | 19.03% |
ant00n | 10 | 90.37 | 96.00 | 96.08 | 104.37 | 15.82% |
ant00n | 14 | 84.71 | 88.76 | 88.17 | 94.25 | 12.31% |
ant00n | 16 | 82.85 | 88.40 | 90.33 | 92.43 | 10.52% |
ant00n | 21 | 80.98 | 86.11 | 88.26 | 89.28 | 8.20% |
ant00n | 28 | 78.94 | 89.15 | 91.17 | 93.79 | 5.95% |

*) ant6nn = ant602 and ant604; ant00n = ant007 and ant008
*) efficiency is computed as t1 / (N * tN), where t1 is the computation time on one core and tN is the computation time on N cores; the values listed in the tables correspond to the median times
*) command run: cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000 on the file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833)
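
The efficiency formula from the footnote can be checked against the tables with a few lines of shell. This is a minimal sketch: the function names `efficiency` and `speedup` are hypothetical, the speedup figure (t1 / tN) is added here only for illustration, and the inputs are the median times from the tables, which appear to be the values the efficiency column was computed from.

```shell
#!/bin/bash
# efficiency T1 CORES TN: parallel efficiency in percent, t1 / (cores * tn) * 100
efficiency() {
    awk -v t1="$1" -v cores="$2" -v tn="$3" \
        'BEGIN { printf "%.2f\n", 100 * t1 / (cores * tn) }'
}

# speedup T1 TN: how many times faster the N-core run is than the 1-core run
speedup() {
    awk -v t1="$1" -v tn="$2" \
        'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Median times taken from the tables above.
echo "ant6nn, 2 cores: $(efficiency 108.21 2 104.02) %"
echo "ant6nn, 8 cores: $(efficiency 108.21 8 62.17) %"
echo "ant00n, 4 cores: $(efficiency 151.97 4 116.52) %"
echo "ant6nn, 8 cores speedup: $(speedup 108.21 62.17)x"
```

The three efficiency lines reproduce the table entries (52.01, 21.76, and 32.61), and the last line shows what such an efficiency means in practice: on ant6nn, eight cores finish only about 1.74 times faster than one.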