anthill_sri001

description

CD-HIT is a fast program for clustering and comparing large sets of protein or nucleotide sequences. It has been accelerated for clustering next-generation sequencing data. Software page: http://cd-hit.org

software version

CD-HIT from the Ubuntu 18.04 repository (CD-HIT version 4.7, built on Jul 1 2017).

sbatch example

The file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833), the complete UniProtKB/Swiss-Prot data set in flat file format, was downloaded from here.

Take note of the -M option, which sets the memory limit for cd-hit in MB. The script does not read the allocated memory from the Slurm environment, so the memory requested for the job (--mem) and the value passed to -M must be kept consistent by hand, with -M somewhat below --mem.
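One way to keep the two values coherent is to compute -M from a single number. The sketch below is only an illustration: the function name cdhit_mem_limit and the default headroom of 1024 MB are assumptions, not part of the original script. It also assumes that the installed Slurm exports SLURM_MEM_PER_NODE (in MB) when --mem is given, which recent releases do; otherwise it falls back to 15360 MB, the equivalent of --mem=15G.

```shell
# Hypothetical helper: derive a cd-hit -M value (MB) from the job memory (MB),
# leaving some headroom for the shell and the runtime.
cdhit_mem_limit() {
    alloc_mb="$1"               # memory allocated to the job, in MB
    headroom_mb="${2:-1024}"    # assumed safety margin, in MB
    limit_mb=$(( alloc_mb - headroom_mb ))
    if [ "$limit_mb" -le 0 ]; then
        # -M 0 means "unlimited" in cd-hit, so refuse non-positive values
        echo "allocation too small for the requested headroom" >&2
        return 1
    fi
    echo "$limit_mb"
}

# Use the Slurm variable if present, else assume the 15G from the example.
ALLOC_MB="${SLURM_MEM_PER_NODE:-15360}"
MEM_LIMIT="$(cdhit_mem_limit "${ALLOC_MB}")"
echo "cd-hit -M value: ${MEM_LIMIT}"
```

The result could then be passed as -M "${MEM_LIMIT}" instead of a hard-coded number.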

#!/bin/bash -l
#SBATCH --partition=short
#SBATCH --ntasks=8
#SBATCH --mem=15G

# input filename
FASTAFILE=uniprot_sprot.fasta

# directory with the input file; results are written here too
INPUTFILEDIR="/workspace/${USER}/uniprot_files/"

# print some info to the job output file
echo "cd-hit anthill test"
echo "input_file: ${FASTAFILE}"
echo "date: $(date)"
echo "node_name: $(hostname)"
echo "cores: ${SLURM_NPROCS}"

# run the cd-hit computation (-M is in MB; keep it below --mem)
cd "${INPUTFILEDIR}" || exit 1
cdhit -i "${FASTAFILE}" -o out -c 0.9 -n 5 -T "${SLURM_NPROCS}" -M 14000
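cd-hit is a multithreaded program, so all cores have to be on one node. With --ntasks alone, Slurm may place tasks on more than one node unless the job is constrained (for example with --nodes=1). A possible alternative, shown here only as a sketch, is to request the cores with --cpus-per-task and pick the thread count from whichever variable is set. The function name pick_threads is an assumption made for illustration.

```shell
# Hypothetical helper: choose the thread count for cd-hit -T.
# Prefers SLURM_CPUS_PER_TASK (set when --cpus-per-task is given),
# then SLURM_NTASKS (set when --ntasks is given), else 1.
pick_threads() {
    echo "${SLURM_CPUS_PER_TASK:-${SLURM_NTASKS:-1}}"
}

THREADS="$(pick_threads)"
echo "threads for cd-hit -T: ${THREADS}"
```

Outside a Slurm job both variables are unset, so the helper falls back to a single thread.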

performance tests

The results below show how efficiently cd-hit scales with the number of cores used. Each computation was run 3 times, separately for each node type (ant6nn and ant00n).

node    cores  min [s]  avg [s]  median [s]  max [s]  efficiency [%]
ant6nn      1   107.66   108.18      108.21   108.78          100.00
ant6nn      2   103.79   104.00      104.02   104.16           52.01
ant6nn      4    75.78    76.11       76.21    76.29           35.50
ant6nn      8    62.15    62.18       62.17    62.26           21.76
ant6nn     10    57.41    58.53       58.92    59.10           18.36

node    cores  min [s]  avg [s]  median [s]  max [s]  efficiency [%]
ant00n      1   150.84   160.29      151.97   178.35          100.00
ant00n      2   154.55   158.30      155.96   170.05           48.72
ant00n      4   114.23   117.90      116.52   127.71           32.61
ant00n      8    92.79   108.49       99.84   139.51           19.03
ant00n     10    90.37    96.00       96.08   104.37           15.82
ant00n     14    84.71    88.76       88.17    94.25           12.31
ant00n     16    82.85    88.40       90.33    92.43           10.52
ant00n     21    80.98    86.11       88.26    89.28            8.20
ant00n     28    78.94    89.15       91.17    93.79            5.95

*) ant6nn = nodes ant602 and ant604; ant00n = nodes ant007 and ant008
*) efficiency = t1 / ( cores * tn ), where t1 is the computation time on one core and tn is the computation time on N cores (the tabulated values match the median times)
*) command run: cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000 on the file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833)
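As a worked example of the efficiency formula in the notes above, the following sketch recomputes one table entry with awk. The function name parallel_efficiency is an assumption made for illustration; the input numbers are the median times for ant6nn on 1 and 2 cores.

```shell
# Hypothetical helper: parallel efficiency in percent,
# computed as 100 * t1 / (cores * tn).
parallel_efficiency() {
    awk -v t1="$1" -v n="$2" -v tn="$3" \
        'BEGIN { printf "%.2f\n", 100 * t1 / (n * tn) }'
}

# ant6nn, 2 cores: t1 = 108.21 s, tn = 104.02 s (median times)
parallel_efficiency 108.21 2 104.02   # prints 52.01
```

An efficiency of about 52% on 2 cores means the run was only about 4% faster than on one core (108.21 s versus 104.02 s).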

anthill_sri001.txt · Last modified: 2023/08/01 01:08 by 127.0.0.1