CD-HIT is a fast program for clustering and comparing large sets of protein or nucleotide sequences, accelerated for clustering next-generation sequencing data. Software page: http://cd-hit.org.
CD-HIT was installed from the Ubuntu 18.04 repository (CD-HIT version 4.7, built on Jul 1 2017).
The file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833), the complete UniProtKB/Swiss-Prot data set in flat file format, was downloaded from here.
Take note of the -M option, which sets the cd-hit memory limit in MB. The job script does not read the allocated memory from a Slurm environment variable, so the value of -M has to be set by hand and kept consistent with the memory requested for the job (--mem): here -M 14000 against --mem 15G.
#!/bin/bash -l
#SBATCH --partition=short
#SBATCH --ntasks=8
#SBATCH --mem 15G
#input filename
FASTAFILE=uniprot_sprot.fasta
#dir with input file and dir for results
INPUTFILEDIR="/workspace/${USER}/uniprot_files/"
#print some info to the output file
echo "cd-hit anthill test"
echo "input_file: ${FASTAFILE}"
echo "date: $(date)"
echo "node_name: $(hostname)"
echo "cores: ${SLURM_NPROCS}"
#run cd-hit computation; stop if the input directory is missing
cd "${INPUTFILEDIR}" || exit 1
cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000
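The hard-coded -M 14000 has to be edited whenever --mem changes. A minimal sketch of one way to derive it instead is shown here. The helper name mem_to_cdhit_mb and the 1024 MB headroom are assumptions, not part of the original script. Some Slurm versions export SLURM_MEM_PER_NODE (in MB) to the job; whether this cluster does is also an assumption, so the sketch falls back to the value requested with --mem. Only whole numbers with an optional G or M suffix are handled.

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm-style memory string (e.g. 15G, 15360M,
# or a plain number of MB) into a value for cd-hit's -M option (in MB),
# leaving some headroom below the allocation.
mem_to_cdhit_mb() {
    local mem="$1"
    local headroom_mb="${2:-1024}"   # assumed safety margin, not from the source
    local value="${mem%[GgMm]}"      # numeric part
    local unit="${mem#"$value"}"     # trailing unit letter, if any
    case "$unit" in
        G|g) value=$(( value * 1024 )) ;;
        M|m|"") ;;                   # already in MB
    esac
    echo $(( value - headroom_mb ))
}

# Use SLURM_MEM_PER_NODE (MB) when the scheduler exports it;
# otherwise fall back to the value requested with --mem in the job script.
CDHIT_MEM_MB=$(mem_to_cdhit_mb "${SLURM_MEM_PER_NODE:-15G}")
echo "cd-hit memory limit: ${CDHIT_MEM_MB} MB"
```

The cd-hit call would then take -M "${CDHIT_MEM_MB}" in place of the fixed number.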
The results below show how efficiently cd-hit scales with the number of cores used. Each computation was run 3 times, separately for each node type (ant6nn and ant00n).
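How the runs were submitted is not recorded here. One possible way to set up such a sweep is sketched below, relying on the fact that options given on the sbatch command line override the #SBATCH directives in the script. The script name cdhit_job.sh, the job names, and the function print_sweep are placeholders. The function only prints the commands, so they can be checked before being submitted.

```shell
#!/bin/bash
# Print (not submit) one sbatch command per core count and repetition.
# cdhit_job.sh is a placeholder name for the job script shown above.
print_sweep() {
    local node="$1"; shift
    local cores rep
    for cores in "$@"; do
        for rep in 1 2 3; do
            echo "sbatch --nodelist=${node} --ntasks=${cores} --job-name=cdhit_${cores}c_r${rep} cdhit_job.sh"
        done
    done
}

print_sweep ant602 1 2 4 8 10
```

Piping the output to bash would submit the jobs; pinning a node with --nodelist keeps each series on a single node type.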
| node | cores | min [s] | avg [s] | median [s] | max [s] | efficiency [%] |
|---|---|---|---|---|---|---|
| ant6nn | 1 | 107.66 | 108.18 | 108.21 | 108.78 | 100.00% |
| ant6nn | 2 | 103.79 | 104.00 | 104.02 | 104.16 | 52.01% |
| ant6nn | 4 | 75.78 | 76.11 | 76.21 | 76.29 | 35.50% |
| ant6nn | 8 | 62.15 | 62.18 | 62.17 | 62.26 | 21.76% |
| ant6nn | 10 | 57.41 | 58.53 | 58.92 | 59.10 | 18.36% |
| node | cores | min [s] | avg [s] | median [s] | max [s] | efficiency [%] |
|---|---|---|---|---|---|---|
| ant00n | 1 | 150.84 | 160.29 | 151.97 | 178.35 | 100.00% |
| ant00n | 2 | 154.55 | 158.30 | 155.96 | 170.05 | 48.72% |
| ant00n | 4 | 114.23 | 117.90 | 116.52 | 127.71 | 32.61% |
| ant00n | 8 | 92.79 | 108.49 | 99.84 | 139.51 | 19.03% |
| ant00n | 10 | 90.37 | 96.00 | 96.08 | 104.37 | 15.82% |
| ant00n | 14 | 84.71 | 88.76 | 88.17 | 94.25 | 12.31% |
| ant00n | 16 | 82.85 | 88.40 | 90.33 | 92.43 | 10.52% |
| ant00n | 21 | 80.98 | 86.11 | 88.26 | 89.28 | 8.20% |
| ant00n | 28 | 78.94 | 89.15 | 91.17 | 93.79 | 5.95% |
*) ant6nn = ant602 and ant604; ant00n = ant007 and ant008
*) efficiency = t1 / (cores * tn), where t1 is the median computation time on one core and tn is the median computation time on the given number of cores
*) command run: cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000 on the file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833)
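The efficiency column can be reproduced by plugging the median times from the tables into the formula above. A small sketch using awk for the floating-point arithmetic follows; the function name efficiency is only illustrative.

```shell
#!/bin/bash
# efficiency T1 CORES TN -> parallel efficiency in percent, two decimals
efficiency() {
    LC_ALL=C awk -v t1="$1" -v n="$2" -v tn="$3" \
        'BEGIN { printf "%.2f\n", 100 * t1 / (n * tn) }'
}

efficiency 108.21 2 104.02    # ant6nn, 2 cores  -> 52.01
efficiency 151.97 28 91.17    # ant00n, 28 cores -> 5.95
```

The low percentages reflect a small absolute speedup: on ant6nn, going from 1 to 8 cores shortens the median time from 108.21 s to 62.17 s, a factor of about 1.74.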