CD-HIT is a fast program for clustering and comparing large sets of protein or nucleotide sequences. It is also accelerated for clustering next-generation sequencing data. Software page: http://cd-hit.org
CD-HIT was installed from the Ubuntu 18.04 repository (CD-HIT version 4.7, built on Jul 1 2017).
The file uniprot_sprot.fasta (md5sum: 28930f98b76a8475b1d4f8291f2a5833), the complete UniProtKB/Swiss-Prot data set in flat file format, was downloaded from UniProt.
Take note of the -M option, which sets the cd-hit memory limit in MB. The script below does not read the allocated memory from a Slurm environment variable, so it is important to keep the values of --mem and -M coherent by hand.
#!/bin/bash -l
#SBATCH --partition=short
#SBATCH --ntasks=8
#SBATCH --mem 15G
#input filename
FASTAFILE=uniprot_sprot.fasta
#dir with input file and dir for results
INPUTFILEDIR="/workspace/${USER}/uniprot_files/"
#printout to output file some info
echo "cd-hit anthill test"
echo "input_file : " ${FASTAFILE}
echo "date: " `date`
echo "node_name: " `hostname`
echo "cores: " ${SLURM_NPROCS}
#run cdhit computation
cd ${INPUTFILEDIR}
cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000
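One way to keep --mem and -M coherent is to derive both from a single number at submission time. The sketch below only prints the sbatch command line rather than submitting anything. The names build_sbatch_cmd and CDHIT_MEM_MB, the file name cdhit.batch, and the 1360 MB headroom are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# build_sbatch_cmd MEM_MB HEADROOM_MB BATCH_FILE
# Prints an sbatch command whose --mem and exported cd-hit limit come from one value.
# (Hypothetical helper; CDHIT_MEM_MB is an illustrative variable name.)
build_sbatch_cmd() {
    mem_mb=$1        # memory requested from Slurm, in MB
    headroom_mb=$2   # MB kept free on top of cd-hit's own limit
    batch_file=$3
    cdhit_mb=$(( mem_mb - headroom_mb ))
    if [ "${cdhit_mb}" -le 0 ]; then
        echo "headroom must be smaller than the allocation" >&2
        return 1
    fi
    # --export passes the derived limit into the job environment
    echo "sbatch --mem=${mem_mb}M --export=ALL,CDHIT_MEM_MB=${cdhit_mb} ${batch_file}"
}

build_sbatch_cmd 15360 1360 cdhit.batch
# prints: sbatch --mem=15360M --export=ALL,CDHIT_MEM_MB=14000 cdhit.batch
```

With this approach the batch script would call cdhit with -M ${CDHIT_MEM_MB} instead of a hard-coded number. Options given on the sbatch command line take precedence over the #SBATCH lines in the script.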
The results below show how efficiently cd-hit scales with the number of cores used.
The files come from the UniProt webpage (UK mirror: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/).
batch file 
#!/bin/bash -l
#SBATCH --partition=long
#SBATCH --nodelist=ant006
#SBATCH --ntasks=8
#SBATCH --mem 14G
#test filename
FASTAFILE=uniprot_sprot.fasta
#dir with input file and dir for results
INPUTFILEDIR="/workspace/${USER}/uniprot_files/"
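#check that the input file is present (it is not downloaded automatically)
if [ ! -f ${INPUTFILEDIR}/${FASTAFILE} ]; then
echo "File ${FASTAFILE} not found!"
echo "please download file ${FASTAFILE} to ${INPUTFILEDIR}/ "
#non-zero exit status so that Slurm records the job as failed
exit 1
fi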
#printout to output file some info
echo "cd-hit anthill test"
echo "input_file : " ${FASTAFILE}
echo "date: " `date`
echo "node_name: " `hostname`
echo "cores: " ${SLURM_NPROCS}
#create temporary folder on local disk for input and output data
TESTDIR=`mktemp -d`
mkdir -p ${TESTDIR}
cp ${INPUTFILEDIR}/${FASTAFILE} ${TESTDIR}
#run cdhit computation and remove output files
cd ${TESTDIR}
time -p cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 14000
#remove data after test:
cd ~
rm -rf ${TESTDIR}
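The script above removes its temporary directory only if it reaches the last line. A variant that also cleans up when the wrapped command fails or the script exits early can be sketched with a trap on EXIT inside a subshell. The helper name run_in_tmpdir is hypothetical.

```shell
#!/bin/sh
# run_in_tmpdir CMD [ARGS...]
# Runs CMD inside a fresh temporary directory and removes that directory
# when the subshell exits, whether CMD succeeded or not.
# (run_in_tmpdir is an illustrative helper name.)
run_in_tmpdir() {
    (
        workdir=$(mktemp -d) || exit 1
        # the trap belongs to this subshell only, so the caller is unaffected
        trap 'rm -rf "${workdir}"' EXIT
        cd "${workdir}" || exit 1
        "$@"
    )
}

# Example: print the temporary directory that was used (it no longer exists afterwards)
run_in_tmpdir pwd
```

In the benchmark script, the input file would still have to be copied into the directory before cdhit is started, for example from a small wrapper function passed to run_in_tmpdir.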
test conditions
Each job was run alone, that is, only one job was running on the whole cluster at a time, so jobs did not compete for NFS shares, the memory bus, etc. To automate the whole test process, the cd-hit job batch file was modified and all jobs were submitted with a bash script. To make sure that jobs did not run simultaneously, Slurm's option "--dependency=afterany:" was used.
Auxiliary batch script (empty.batch):
#!/bin/bash -l
#SBATCH --partition=short
#SBATCH --ntasks=1
sleep 1;
cd-hit job script (cdhit_test.batch):
#!/bin/bash -l
#SBATCH --partition=short
#SBATCH --mem 2048M
#test filename
FASTAFILE=uniprot_sprot.fasta
#dir with input file and dir for results
INPUTFILEDIR="/workspace/${USER}/anthill23/testfiles"
CDHITTESTDIR="/workspace/${USER}/anthill23/test08"
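#check that the input file is present (it is not downloaded automatically)
if [ ! -f ${INPUTFILEDIR}/${FASTAFILE} ]; then
echo "File ${FASTAFILE} not found!"
echo "please download file ${FASTAFILE} to ${INPUTFILEDIR}/ "
#non-zero exit status so that Slurm records the job as failed
exit 1
fi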
#printout to output file some info
echo "cd-hit anthill test"
echo "input file : " ${FASTAFILE}
echo "date: " `date`
echo "nodename: " `hostname`
echo "cores: " ${SLURM_NPROCS}
#copy uniprot_sprot.fasta to result directory if not present
mkdir -p ${CDHITTESTDIR}
cd ${CDHITTESTDIR}
if [ ! -f ${FASTAFILE} ]; then
echo "File uniprot_sprot.fasta not present in results dir!"
cp ${INPUTFILEDIR}/${FASTAFILE} ${CDHITTESTDIR}
fi
#run cdhit computation and remove output files
cd ${CDHITTESTDIR}
time -p cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 2048
rm out out.clstr
Bash script used to submit all jobs (run_cdhit_test_on_anthill.sh):
#!/bin/sh
BATCHNAME=cdhit_test.batch
# first job - no dependencies, run empty job and get jobid
previd=$(sbatch empty.batch | awk '{print $4}');
for i in $(seq 1 3) #repeat each test 3 times
do
#nodes : ant001, ant002
for nodename in ant001 ant002
do
for ntasksvalue in 1 2
do
echo "node : ${nodename}"
echo "ntasks : ${ntasksvalue}"
nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
previd=${nextid}
done
done
#nodes : ant003, ant004
for nodename in ant003 ant004
do
for ntasksvalue in 1 2 4
do
echo "node : ${nodename}"
echo "ntasks : ${ntasksvalue}"
nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
previd=${nextid}
done
done
#nodes : ant005, ant006
for nodename in ant005 ant006
do
for ntasksvalue in 1 2 4 8
do
echo "node : ${nodename}"
echo "ntasks : ${ntasksvalue}"
nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
previd=${nextid}
done
done
#nodes : ant007, ant008
for nodename in ant007 ant008
do
for ntasksvalue in 1 2 4 8 14 16 28
do
echo "node : ${nodename}"
echo "ntasks : ${ntasksvalue}"
nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
previd=${nextid}
done
done
done # repeat each test 3 times
# show dependencies in squeue output:
squeue -u $USER -o "%.8A %.4C %.10m %.20E"
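The four nearly identical loops above differ only in the node names and the list of core counts. As a sketch, the same (node, ntasks) sequence for one repetition can be generated from a small table. The function name plan is hypothetical, and the code only prints the plan instead of calling sbatch.

```shell
#!/bin/sh
# plan: print one "node ntasks" pair per line, in the same order as the
# submission script above (node in the outer loop, ntasks in the inner loop).
# (plan is an illustrative helper; nothing is submitted here.)
plan() {
    while read -r nodes counts; do
        for node in $(echo "${nodes}" | tr ',' ' '); do
            for n in $(echo "${counts}" | tr ',' ' '); do
                echo "${node} ${n}"
            done
        done
    done <<EOF
ant001,ant002 1,2
ant003,ant004 1,2,4
ant005,ant006 1,2,4,8
ant007,ant008 1,2,4,8,14,16,28
EOF
}

plan
```

Each printed pair corresponds to one sbatch call with --nodelist and --ntasks set accordingly and --dependency=afterany: pointing at the previously submitted job. Adding a node group then means adding one line to the table.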
results
In this test each node was occupied by only one job. File: uniprot_sprot.fasta. Memory allocated for each job: 2048 MB (cd-hit reported memory usage lower than 1400 MB).
node name 	cores used 	result1 [s] 	result2 [s] 	result3 [s] 	average [s]
ant001 	1 	365.94 	371.68 	367.26 	368.29
ant001 	2 	244.42 	240.67 	242.41 	242.50
ant002 	1 	409.78 	369.57 	368.46 	382.60
ant002 	2 	243.23 	238.74 	244.39 	242.12
ant003 	1 	360.88 	361.21 	365.22 	362.44
ant003 	2 	245.87 	251.52 	245.33 	247.57
ant003 	4 	202.77 	205.5 	203.77 	204.01
ant004 	1 	357.43 	364.11 	372.73 	364.76
ant004 	2 	239.34 	239.92 	236.58 	238.61
ant004 	4 	189.77 	194 	194.24 	192.67
ant005 	1 	334.39 	331.71 	318.99 	328.36
ant005 	2 	238.52 	234.26 	239.69 	237.49
ant005 	4 	196.94 	199.28 	198.59 	198.27
ant005 	8 	175.24 	178.24 	173.24 	175.57
ant006 	1 	318.6 	330.38 	330.18 	326.39
ant006 	2 	239.27 	238.46 	240.01 	239.25
ant006 	4 	192.16 	191.58 	196.91 	193.55
ant006 	8 	156.69 	176.7 	178.24 	170.54
ant007 	1 	149.8 	140.85 	147.45 	146.03
ant007 	2 	114.65 	112.76 	118.23 	115.21
ant007 	4 	98.15 	96.56 	98.67 	97.79
ant007 	8 	90.63 	89.64 	88.38 	89.55
ant007 	14 	85.28 	87.12 	83.37 	85.26
ant007 	16 	82.46 	86.69 	86.67 	85.27
ant007 	28 	75.35 	87.45 	85.33 	82.71
ant008 	1 	140.42 	147.09 	142.25 	143.25
ant008 	2 	113.35 	116.83 	118.27 	116.15
ant008 	4 	99.1 	91.73 	90.12 	93.65
ant008 	8 	88.89 	80.2 	89.31 	86.13
ant008 	14 	76.03 	85.79 	86.91 	82.91
ant008 	16 	86.97 	82.77 	85.11 	84.95
ant008 	28 	83.19 	89.83 	89.45 	87.49
ant009 	1 	135.40 	135.35 	140.31 	137.02
ant009 	2 	108.71 	111.89 	108.27 	109.62
ant009 	4 	85.06 	90.44 	84.27 	86.59
ant009 	8 	72.98 	78.29 	77.61 	76.29
ant009 	14 	71.93 	72.95 	67.78 	70.89
ant009 	16 	67.35 	71.71 	72.00 	70.35
ant011 	1 	104.71 	105.06 	104.5 	104.76
ant011 	2 	89.57 	90.33 	89.65 	89.85
ant011 	4 	72.20 	71.92 	72.22 	72.11
ant012 	1 	106.92 	106.50 	107.06 	106.83
ant012 	2 	93.41 	93.48 	93.27 	93.39
ant012 	4 	75.24 	75.16 	75.23 	75.21
ant100 	1 	498.72 	499.5 	486.61 	494.94
ant100 	2 	452.51 	422.12 	447.15 	440.59
ant100 	4 	389.68 	388.35 	407.17 	395.07
ant100 	8 	388.59 	382.68 	396.2 	389.16
ant100 	14 	368.64 	351.9 	401.13 	373.89
ant100 	16 	391.47 	396.53 	357.93 	381.98
ant101 	1 	503.45 	452.31 	487.26 	481.01
ant101 	2 	458.42 	435.75 	451.08 	448.42
ant101 	4 	417.15 	382.07 	394.57 	397.93
ant101 	8 	369.29 	383.49 	397.68 	383.49
ant101 	14 	367.95 	379.02 	377.11 	374.69
ant101 	16 	359.85 	377.17 	383.34 	373.45
ant102 	1 	458.29 	493.13 	550.64 	500.69
ant102 	2 	426.48 	431.78 	425.33 	427.86
ant102 	4 	389.84 	397.64 	400.21 	395.90
ant102 	8 	376.8 	386.24 	384.44 	382.49
ant102 	14 	378.7 	380.39 	363.19 	374.09
ant102 	16 	386.91 	377.53 	362.45 	375.63
ant103 	1 	542.24 	512.03 	536.04 	530.10
ant103 	2 	419.03 	421.15 	439.53 	426.57
ant103 	4 	403.98 	401.51 	413.29 	406.26
ant103 	8 	419.37 	401.93 	400.54 	407.28
ant103 	14 	382.57 	390.96 	399.59 	391.04
ant103 	16 	381.75 	352.94 	408.68 	381.12
ant104 	1 	506.13 	491.18 	498.93 	498.75
ant104 	2 	436.75 	457.8 	460.15 	451.57
ant104 	4 	392.18 	399.32 	426.82 	406.11
ant104 	8 	398.33 	376.98 	410.96 	395.42
ant104 	14 	410.99 	407.83 	361.08 	393.30
ant104 	16 	379.62 	446.03 	387.65 	404.43
ant105 	1 	531.36 	496.54 	497.53 	508.48
ant105 	2 	427.13 	436.78 	420.61 	428.17
ant105 	4 	390.03 	421.38 	402.52 	404.64
ant105 	8 	405.04 	381.83 	398.82 	395.23
ant105 	14 	392.35 	381.58 	400.12 	391.35
ant105 	16 	394.61 	372.8 	391.66 	386.36
ant106 	1 	571.33 	521.11 	501.05 	531.16
ant106 	2 	426.67 	459.36 	420.83 	435.62
ant106 	4 	373.5 	389.3 	409.29 	390.70
ant106 	8 	406.36 	419.5 	387.7 	404.52
ant106 	14 	404.83 	403.86 	365.56 	391.42
ant106 	16 	389.17 	390.04 	387.46 	388.89
ant107 	1 	499.09 	556.72 	503.26 	519.69
ant107 	2 	440.45 	447.31 	433.92 	440.56
ant107 	4 	400.1 	396.56 	401.83 	399.50
ant107 	8 	390.98 	396.28 	389.81 	392.36
ant107 	14 	372.23 	348.51 	378.69 	366.48
ant107 	16 	388.68 	389.47 	380.07 	386.07
ant108 	1 	495.9 	556.77 	548.07 	533.58
ant108 	2 	444.71 	412.32 	425.88 	427.64
ant108 	4 	391.17 	400.76 	388.44 	393.46
ant108 	8 	412.31 	402.63 	394.59 	403.18
ant108 	14 	362.67 	398.34 	394.56 	385.19
ant108 	16 	386.53 	388.5 	362.56 	379.20
ant109 	1 	545.91 	556.75 	494.27 	532.31
ant109 	2 	461.51 	444.07 	450.6 	452.06
ant109 	4 	403.82 	399.62 	397.57 	400.34
ant109 	8 	386.67 	372.51 	382.94 	380.71
ant109 	14 	394.64 	398.26 	403.91 	398.94
ant109 	16 	393.43 	380.63 	381.21 	385.09
ant110 	1 	483.4 	504.87 	565.43 	517.90
ant110 	2 	436.53 	445.72 	440.23 	440.83
ant110 	4 	386.86 	376.82 	397.58 	387.09
ant110 	8 	406.49 	400.98 	404.7 	404.06
ant110 	14 	377.74 	376.83 	387.77 	380.78
ant110 	16 	388.67 	414.8 	338.77 	380.75
ant300 	1 	505.57 	508.71 	507.09 	507.12
ant300 	2 	465.51 	469.14 	443.56 	459.40
ant300 	4 	391.08 	387.43 	423.27 	400.59
ant300 	8 	375.34 	391.55 	384.94 	383.94
ant300 	14 	370 	385.67 	392.79 	382.82
ant300 	16 	417.78 	371.86 	394 	394.55
ant301 	1 	529.53 	555.58 	554.29 	546.47
ant301 	2 	446.13 	428.45 	433.15 	435.91
ant301 	4 	391.19 	400.25 	381.3 	390.91
ant301 	8 	378.6 	403.8 	398.7 	393.70
ant301 	14 	387.28 	404.41 	416.27 	402.65
ant301 	16 	407.38 	384.49 	393.4 	395.09
ant302 	1 	487.93 	540.43 	509.09 	512.48
ant302 	2 	455.83 	416.38 	422.88 	431.70
ant302 	4 	402.54 	402.98 	411.03 	405.52
ant302 	8 	410.27 	396.98 	387.37 	398.21
ant302 	14 	388.08 	361.4 	404.88 	384.79
ant302 	16 	393.81 	388.89 	350.89 	377.86
ant303 	1 	543 	496.53 	497.46 	512.33
ant303 	2 	438.08 	431.19 	450.57 	439.95
ant303 	4 	374.69 	412.4 	415.05 	400.71
ant303 	8 	392.73 	402.06 	371.14 	388.64
ant303 	14 	385.49 	403.59 	377.39 	388.82
ant303 	16 	385.74 	380.52 	368.13 	378.13
ant304 	1 	488.44 	530.69 	505.74 	508.29
ant304 	2 	426.9 	429.9 	416.27 	424.36
ant304 	4 	400.71 	411.94 	409.4 	407.35
ant304 	8 	397.68 	390.12 	395.15 	394.32
ant304 	14 	400.9 	399.97 	395.25 	398.71
ant304 	16 	376.71 	411.95 	391.9 	393.52
ant305 	1 	493.29 	457.06 	499.45 	483.27
ant305 	2 	435.51 	416.19 	441.24 	430.98
ant305 	4 	383.56 	383.94 	374.93 	380.81
ant305 	8 	390.53 	374.54 	381.38 	382.15
ant305 	14 	383.11 	378.42 	366.69 	376.07
ant305 	16 	339.12 	371.78 	384.37 	365.09
ant306 	1 	525.12 	562.76 	554.43 	547.44
ant306 	2 	422.76 	428.61 	455.4 	435.59
ant306 	4 	399.04 	418.49 	393.59 	403.71
ant306 	8 	401.78 	398.08 	398.79 	399.55
ant306 	14 	372.71 	344.34 	387.41 	368.15
ant306 	16 	402.07 	397.08 	363.71 	387.62
ant307 	1 	529.49 	491.09 	551.61 	524.06
ant307 	2 	436.61 	450.36 	410.17 	432.38
ant307 	4 	369.26 	382.06 	372.96 	374.76
ant307 	8 	399.93 	368.99 	381.69 	383.54
ant307 	14 	393.04 	402.61 	392.46 	396.04
ant307 	16 	397.69 	366.65 	368.11 	377.48
ant308 	1 	493.37 	485 	495.46 	491.28
ant308 	2 	424.94 	407.66 	440.15 	424.25
ant308 	4 	379.14 	381.28 	370.09 	376.84
ant308 	8 	393.84 	385.62 	379.88 	386.45
ant308 	14 	386.46 	367.32 	382.12 	378.63
ant308 	16 	374.33 	381.08 	394.61 	383.34
ant309 	1 	497.76 	454.63 	534.66 	495.68
ant309 	2 	425.73 	411.48 	429.8 	422.34
ant309 	4 	397.27 	402.02 	390.32 	396.54
ant309 	8 	401.44 	391.34 	365.22 	386.00
ant309 	14 	344.78 	387.31 	390.56 	374.22
ant309 	16 	381.41 	370.21 	381.87 	377.83
ant310 	1 	530.09 	535.96 	484.61 	516.89
ant310 	2 	452.09 	452.9 	439.9 	448.30
ant310 	4 	394.19 	387.42 	411.44 	397.68
ant310 	8 	390.54 	401.96 	396.51 	396.34
ant310 	14 	358.21 	397.18 	377.33 	377.57
ant310 	16 	379.86 	399.12 	386.91 	388.63
ant311 	1 	460.55 	491.7 	497.99 	483.41
ant311 	2 	442.68 	419.38 	411.68 	424.58
ant311 	4 	365.68 	411.03 	386.79 	387.83
ant311 	8 	400.72 	381.16 	358.73 	380.20
ant311 	14 	399.81 	385.22 	377.79 	387.61
ant311 	16 	399.33 	400.72 	371.35 	390.47
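To read scaling off the table, the average times can be converted into speedup (single-core time divided by n-core time) and parallel efficiency (speedup divided by n). The following is a minimal sketch using awk. It assumes whitespace-separated rows with the node name first, the core count second and the time of interest in the last column, and that the single-core row of a node comes before its other rows, as in the table above. speedup_table is a hypothetical name.

```shell
#!/bin/sh
# speedup_table: read rows "node cores ... time" on stdin and print
# "node cores speedup efficiency", using the last column as the time.
# (speedup_table is an illustrative helper name.)
speedup_table() {
    awk '
        $2 == 1 { base[$1] = $NF }           # remember the single-core time per node
        $1 in base {
            s = base[$1] / $NF               # speedup relative to one core
            printf "%s %d %.2f %.2f\n", $1, $2, s, s / $2
        }
    '
}

# Example with three averages taken from the ant007 rows of the table:
printf '%s\n' "ant007 1 146.03" "ant007 8 89.55" "ant007 28 82.71" | speedup_table
# prints:
# ant007 1 1.00 1.00
# ant007 8 1.63 0.20
# ant007 28 1.77 0.06
```

For these ant007 rows the speedup stays below 2 even with 28 cores, so most of the gain is already reached with a few cores.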