==== description ====

CD-HIT is a fast program for clustering and comparing large sets of protein or nucleotide sequences. It is also accelerated for clustering next-generation sequencing data. Software page: [[http://cd-hit.org]].

==== software version ====

CD-HIT from the Ubuntu 18.04 repository (CD-HIT version 4.7, built on Jul 1 2017).

==== sbatch example ====

The file ''uniprot_sprot.fasta'' (md5sum: 28930f98b76a8475b1d4f8291f2a5833), the complete UniProtKB/Swiss-Prot data set in flat file format, was downloaded from [[ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/|here]].

Take note of the ''-M'' option (cd-hit memory limit in MB). The value is not taken from a Slurm environment variable here, so it has to be kept consistent by hand with the memory allocated to the job (''--mem''). In the example below the job requests 15G and cd-hit is limited to 14000 MB.

  #!/bin/bash -l
  #SBATCH --partition=short
  #SBATCH --ntasks=8
  #SBATCH --mem 15G
  #input filename
  FASTAFILE=uniprot_sprot.fasta
  #dir with input file and dir for results
  INPUTFILEDIR="/workspace/${USER}/uniprot_files/"
  #print some info to the output file
  echo "cd-hit anthill test"
  echo "input_file : ${FASTAFILE}"
  echo "date: $(date)"
  echo "node_name: $(hostname)"
  echo "cores: ${SLURM_NPROCS}"
  #run cdhit computation
  cd "${INPUTFILEDIR}"
  cdhit -i "${FASTAFILE}" -o out -c 0.9 -n 5 -T "${SLURM_NPROCS}" -M 14000

==== performance tests ====

The results below show how cd-hit scales with the number of cores used. Each computation was run 3 times, separately for each node type (ant6nn and ant00n).
{{ :anthill:cdhit4.7_anthill_result_001.png?nolink|}}

^ node ^ cores ^ min [s] ^ avg [s] ^ median [s] ^ max [s] ^ efficiency [%] ^
| ant6nn | 1 | 107.66 | 108.18 | 108.21 | 108.78 | 100.00% |
| ant6nn | 2 | 103.79 | 104.00 | 104.02 | 104.16 | 52.01% |
| ant6nn | 4 | 75.78 | 76.11 | 76.21 | 76.29 | 35.50% |
| ant6nn | 8 | 62.15 | 62.18 | 62.17 | 62.26 | 21.76% |
| ant6nn | 10 | 57.41 | 58.53 | 58.92 | 59.10 | 18.36% |

^ node ^ cores ^ min [s] ^ avg [s] ^ median [s] ^ max [s] ^ efficiency [%] ^
| ant00n | 1 | 150.84 | 160.29 | 151.97 | 178.35 | 100.00% |
| ant00n | 2 | 154.55 | 158.30 | 155.96 | 170.05 | 48.72% |
| ant00n | 4 | 114.23 | 117.90 | 116.52 | 127.71 | 32.61% |
| ant00n | 8 | 92.79 | 108.49 | 99.84 | 139.51 | 19.03% |
| ant00n | 10 | 90.37 | 96.00 | 96.08 | 104.37 | 15.82% |
| ant00n | 14 | 84.71 | 88.76 | 88.17 | 94.25 | 12.31% |
| ant00n | 16 | 82.85 | 88.40 | 90.33 | 92.43 | 10.52% |
| ant00n | 21 | 80.98 | 86.11 | 88.26 | 89.28 | 8.20% |
| ant00n | 28 | 78.94 | 89.15 | 91.17 | 93.79 | 5.95% |

*) ant6nn = ant602 and ant604, ant00n = ant007 and ant008 \\
*) efficiency is calculated as ''t1 / ( cores * tn )'', where t1 is the computation time on one core and tn is the computation time on N cores; the tabulated values are consistent with using the median times \\
*) command run: ''cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000'' on file ''uniprot_sprot.fasta'' (md5sum: 28930f98b76a8475b1d4f8291f2a5833)
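
The efficiency formula from the footnote can be checked with a small shell sketch. The function names ''efficiency'' and ''speedup'' are hypothetical helpers introduced only for this illustration, and the input values are the median times of the ant6nn rows for 1 and 8 cores; it is assumed that ''awk'' is available on the node.

```shell
#!/bin/bash
# Hypothetical helper: parallel efficiency in percent, t1 / (cores * tn) * 100.
# Arguments: t1 (time on one core), tn (time on N cores), cores (N).
efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * t1 / (n * tn) }'
}

# Hypothetical helper: plain speedup, t1 / tn.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Median times from the ant6nn table: 108.21 s on 1 core, 62.17 s on 8 cores.
efficiency 108.21 62.17 8   # prints 21.76
speedup 108.21 62.17        # prints 1.74
```

With these inputs the sketch reproduces the 21.76% listed for ant6nn on 8 cores: the run is only about 1.74 times faster than on one core, which is why the efficiency drops quickly as cores are added.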