CD-HIT, installed from the Ubuntu 18.04 repository (CD-HIT version 4.7, built on Jul 1 2017). Project page: http://cd-hit.org
A fast program for clustering and comparing large sets of protein or nucleotide sequences. Accelerated for clustering next-generation sequencing data.
Files come from the UniProt webpage (UK mirror: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/).

batch file

    #!/bin/bash -l
    #SBATCH --partition=long
    #SBATCH --nodelist=ant006
    #SBATCH --ntasks=8
    #SBATCH --mem 2048M

    #test filename
    FASTAFILE=uniprot_sprot.fasta

    #dir with input file and dir for results
    INPUTFILEDIR="/workspace/${USER}/anthill23/testfiles"
    CDHITTESTDIR="/workspace/${USER}/anthill23/test08"

    #abort if the input file is not present
    if [ ! -f "${INPUTFILEDIR}/${FASTAFILE}" ]; then
        echo "File ${FASTAFILE} not found!"
        echo "please download file ${FASTAFILE} to ${INPUTFILEDIR}/ "
        exit 1
    fi

    #print some info to the output file
    echo "cd-hit anthill test"
    echo "input_file : " ${FASTAFILE}
    echo "date: " $(date)
    echo "node_name: " $(hostname)
    echo "cores: " ${SLURM_NPROCS}

    #copy fasta file to result directory if not present
    mkdir -p "${CDHITTESTDIR}"
    cd "${CDHITTESTDIR}"
    if [ ! -f "${FASTAFILE}" ]; then
        echo "File ${FASTAFILE} not present in results dir!"
        cp "${INPUTFILEDIR}/${FASTAFILE}" "${CDHITTESTDIR}"
    fi

    #run cdhit computation and remove output files
    cd "${CDHITTESTDIR}"
    time -p cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 24576
    rm out out.clstr

test conditions

Each job was run alone, that is, only one job ran on the whole cluster at a time, so jobs did not compete for NFS shares, the memory bus, etc. To automate the whole test process, the cd-hit job batch file was modified: the node and the number of tasks are no longer set in the batch file but passed on the sbatch command line. All jobs were submitted with a bash script. To make sure that jobs did not run simultaneously, Slurm's option "--dependency=afterany:" was used.

Auxiliary batch script (empty.batch):

    #!/bin/bash -l
    #SBATCH --partition=short
    #SBATCH --ntasks=1
    sleep 1

cd-hit job script (cdhit_test.batch):

    #!/bin/bash -l
    #SBATCH --partition=short
    #SBATCH --mem 2048M

    #test filename
    FASTAFILE=uniprot_sprot.fasta

    #dir with input file and dir for results
    INPUTFILEDIR="/workspace/${USER}/anthill23/testfiles"
    CDHITTESTDIR="/workspace/${USER}/anthill23/test08"

    #abort if the input file is not present
    if [ ! -f "${INPUTFILEDIR}/${FASTAFILE}" ]; then
        echo "File ${FASTAFILE} not found!"
        echo "please download file ${FASTAFILE} to ${INPUTFILEDIR}/ "
        exit 1
    fi

    #print some info to the output file
    echo "cd-hit anthill test"
    echo "input file : " ${FASTAFILE}
    echo "date: " $(date)
    echo "nodename: " $(hostname)
    echo "cores: " ${SLURM_NPROCS}

    #copy fasta file to result directory if not present
    mkdir -p "${CDHITTESTDIR}"
    cd "${CDHITTESTDIR}"
    if [ ! -f "${FASTAFILE}" ]; then
        echo "File ${FASTAFILE} not present in results dir!"
        cp "${INPUTFILEDIR}/${FASTAFILE}" "${CDHITTESTDIR}"
    fi

    #run cdhit computation and remove output files
    cd "${CDHITTESTDIR}"
    time -p cdhit -i ${FASTAFILE} -o out -c 0.9 -n 5 -T ${SLURM_NPROCS} -M 2048
    rm out out.clstr

Bash script used to submit all jobs (run_cdhit_test_on_anthill.sh):

    #!/bin/sh
    BATCHNAME=cdhit_test.batch

    # first job - no dependencies, run empty job and get jobid
    previd=$(sbatch empty.batch | awk '{print $4}')

    for i in $(seq 1 3)   #repeat each test 3 times
    do
        #nodes : ant001, ant002
        for nodename in ant001 ant002
        do
            for ntasksvalue in 1 2
            do
                echo "node : ${nodename}"
                echo "ntasks : ${ntasksvalue}"
                nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
                previd=${nextid}
            done
        done

        #nodes : ant003, ant004
        for nodename in ant003 ant004
        do
            for ntasksvalue in 1 2 4
            do
                echo "node : ${nodename}"
                echo "ntasks : ${ntasksvalue}"
                nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
                previd=${nextid}
            done
        done

        #nodes : ant005, ant006
        for nodename in ant005 ant006
        do
            for ntasksvalue in 1 2 4 8
            do
                echo "node : ${nodename}"
                echo "ntasks : ${ntasksvalue}"
                nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
                previd=${nextid}
            done
        done

        #nodes : ant007, ant008
        for nodename in ant007 ant008
        do
            for ntasksvalue in 1 2 4 8 14 16 28
            do
                echo "node : ${nodename}"
                echo "ntasks : ${ntasksvalue}"
                nextid=$(sbatch --dependency=afterany:${previd} --nodelist=${nodename} --ntasks=${ntasksvalue} ${BATCHNAME} | awk '{print $4}')
                previd=${nextid}
            done
        done
    done   # repeat each test 3 times

    # show dependencies in squeue output:
    squeue -u $USER -o "%.8A %.4C %.10m %.20E"

results

In this test each node was occupied by only one job. File: uniprot_sprot.fasta. Memory allocated for each job: 2048 MB (cd-hit reported memory usage lower than 1400 MB).

node name  cores used  result1 [s]  result2 [s]  result3 [s]  average [s]
ant001     1           365.94       371.68       367.26       368.29
ant001     2           244.42       240.67       242.41       242.50
ant002     1           409.78       369.57       368.46       382.60
ant002     2           243.23       238.74       244.39       242.12
ant003     1           360.88       361.21       365.22       362.44
ant003     2           245.87       251.52       245.33       247.57
ant003     4           202.77       205.5        203.77       204.01
ant004     1           357.43       364.11       372.73       364.76
ant004     2           239.34       239.92       236.58       238.61
ant004     4           189.77       194          194.24       192.67
ant005     1           334.39       331.71       318.99       328.36
ant005     2           238.52       234.26       239.69       237.49
ant005     4           196.94       199.28       198.59       198.27
ant005     8           175.24       178.24       173.24       175.57
ant006     1           318.6        330.38       330.18       326.39
ant006     2           239.27       238.46       240.01       239.25
ant006     4           192.16       191.58       196.91       193.55
ant006     8           156.69       176.7        178.24       170.54
ant007     1           149.8        140.85       147.45       146.03
ant007     2           114.65       112.76       118.23       115.21
ant007     4           98.15        96.56        98.67        97.79
ant007     8           90.63        89.64        88.38        89.55
ant007     14          85.28        87.12        83.37        85.26
ant007     16          82.46        86.69        86.67        85.27
ant007     28          75.35        87.45        85.33        82.71
ant008     1           140.42       147.09       142.25       143.25
ant008     2           113.35       116.83       118.27       116.15
ant008     4           99.1         91.73        90.12        93.65
ant008     8           88.89        80.2         89.31        86.13
ant008     14          76.03        85.79        86.91        82.91
ant008     16          86.97        82.77        85.11        84.95
ant008     28          83.19        89.83        89.45        87.49
ant009     1           135.40       135.35       140.31       137.02
ant009     2           108.71       111.89       108.27       109.62
ant009     4           85.06        90.44        84.27        86.59
ant009     8           72.98        78.29        77.61        76.29
ant009     14          71.93        72.95        67.78        70.89
ant009     16          67.35        71.71        72.00        70.35
ant011     1           104.71       105.06       104.5        104.76
ant011     2           89.57        90.33        89.65        89.85
ant011     4           72.20        71.92        72.22        72.11
ant012     1           106.92       106.50       107.06       106.83
ant012     2           93.41        93.48        93.27        93.39
ant012     4           75.24        75.16        75.23        75.21
ant100     1           498.72       499.5        486.61       494.94
ant100     2           452.51       422.12       447.15       440.59
ant100     4           389.68       388.35       407.17       395.07
ant100     8           388.59       382.68       396.2        389.16
ant100     14          368.64       351.9        401.13       373.89
ant100     16          391.47       396.53       357.93       381.98
ant101     1           503.45       452.31       487.26       481.01
ant101     2           458.42       435.75       451.08       448.42
ant101     4           417.15       382.07       394.57       397.93
ant101     8           369.29       383.49       397.68       383.49
ant101     14          367.95       379.02       377.11       374.69
ant101     16          359.85       377.17       383.34       373.45
ant102     1           458.29       493.13       550.64       500.69
ant102     2           426.48       431.78       425.33       427.86
ant102     4           389.84       397.64       400.21       395.90
ant102     8           376.8        386.24       384.44       382.49
ant102     14          378.7        380.39       363.19       374.09
ant102     16          386.91       377.53       362.45       375.63
ant103     1           542.24       512.03       536.04       530.10
ant103     2           419.03       421.15       439.53       426.57
ant103     4           403.98       401.51       413.29       406.26
ant103     8           419.37       401.93       400.54       407.28
ant103     14          382.57       390.96       399.59       391.04
ant103     16          381.75       352.94       408.68       381.12
ant104     1           506.13       491.18       498.93       498.75
ant104     2           436.75       457.8        460.15       451.57
ant104     4           392.18       399.32       426.82       406.11
ant104     8           398.33       376.98       410.96       395.42
ant104     14          410.99       407.83       361.08       393.30
ant104     16          379.62       446.03       387.65       404.43
ant105     1           531.36       496.54       497.53       508.48
ant105     2           427.13       436.78       420.61       428.17
ant105     4           390.03       421.38       402.52       404.64
ant105     8           405.04       381.83       398.82       395.23
ant105     14          392.35       381.58       400.12       391.35
ant105     16          394.61       372.8        391.66       386.36
ant106     1           571.33       521.11       501.05       531.16
ant106     2           426.67       459.36       420.83       435.62
ant106     4           373.5        389.3        409.29       390.70
ant106     8           406.36       419.5        387.7        404.52
ant106     14          404.83       403.86       365.56       391.42
ant106     16          389.17       390.04       387.46       388.89
ant107     1           499.09       556.72       503.26       519.69
ant107     2           440.45       447.31       433.92       440.56
ant107     4           400.1        396.56       401.83       399.50
ant107     8           390.98       396.28       389.81       392.36
ant107     14          372.23       348.51       378.69       366.48
ant107     16          388.68       389.47       380.07       386.07
ant108     1           495.9        556.77       548.07       533.58
ant108     2           444.71       412.32       425.88       427.64
ant108     4           391.17       400.76       388.44       393.46
ant108     8           412.31       402.63       394.59       403.18
ant108     14          362.67       398.34       394.56       385.19
ant108     16          386.53       388.5        362.56       379.20
ant109     1           545.91       556.75       494.27       532.31
ant109     2           461.51       444.07       450.6        452.06
ant109     4           403.82       399.62       397.57       400.34
ant109     8           386.67       372.51       382.94       380.71
ant109     14          394.64       398.26       403.91       398.94
ant109     16          393.43       380.63       381.21       385.09
ant110     1           483.4        504.87       565.43       517.90
ant110     2           436.53       445.72       440.23       440.83
ant110     4           386.86       376.82       397.58       387.09
ant110     8           406.49       400.98       404.7        404.06
ant110     14          377.74       376.83       387.77       380.78
ant110     16          388.67       414.8        338.77       380.75
ant300     1           505.57       508.71       507.09       507.12
ant300     2           465.51       469.14       443.56       459.40
ant300     4           391.08       387.43       423.27       400.59
ant300     8           375.34       391.55       384.94       383.94
ant300     14          370          385.67       392.79       382.82
ant300     16          417.78       371.86       394          394.55
ant301     1           529.53       555.58       554.29       546.47
ant301     2           446.13       428.45       433.15       435.91
ant301     4           391.19       400.25       381.3        390.91
ant301     8           378.6        403.8        398.7        393.70
ant301     14          387.28       404.41       416.27       402.65
ant301     16          407.38       384.49       393.4        395.09
ant302     1           487.93       540.43       509.09       512.48
ant302     2           455.83       416.38       422.88       431.70
ant302     4           402.54       402.98       411.03       405.52
ant302     8           410.27       396.98       387.37       398.21
ant302     14          388.08       361.4        404.88       384.79
ant302     16          393.81       388.89       350.89       377.86
ant303     1           543          496.53       497.46       512.33
ant303     2           438.08       431.19       450.57       439.95
ant303     4           374.69       412.4        415.05       400.71
ant303     8           392.73       402.06       371.14       388.64
ant303     14          385.49       403.59       377.39       388.82
ant303     16          385.74       380.52       368.13       378.13
ant304     1           488.44       530.69       505.74       508.29
ant304     2           426.9        429.9        416.27       424.36
ant304     4           400.71       411.94       409.4        407.35
ant304     8           397.68       390.12       395.15       394.32
ant304     14          400.9        399.97       395.25       398.71
ant304     16          376.71       411.95       391.9        393.52
ant305     1           493.29       457.06       499.45       483.27
ant305     2           435.51       416.19       441.24       430.98
ant305     4           383.56       383.94       374.93       380.81
ant305     8           390.53       374.54       381.38       382.15
ant305     14          383.11       378.42       366.69       376.07
ant305     16          339.12       371.78       384.37       365.09
ant306     1           525.12       562.76       554.43       547.44
ant306     2           422.76       428.61       455.4        435.59
ant306     4           399.04       418.49       393.59       403.71
ant306     8           401.78       398.08       398.79       399.55
ant306     14          372.71       344.34       387.41       368.15
ant306     16          402.07       397.08       363.71       387.62
ant307     1           529.49       491.09       551.61       524.06
ant307     2           436.61       450.36       410.17       432.38
ant307     4           369.26       382.06       372.96       374.76
ant307     8           399.93       368.99       381.69       383.54
ant307     14          393.04       402.61       392.46       396.04
ant307     16          397.69       366.65       368.11       377.48
ant308     1           493.37       485          495.46       491.28
ant308     2           424.94       407.66       440.15       424.25
ant308     4           379.14       381.28       370.09       376.84
ant308     8           393.84       385.62       379.88       386.45
ant308     14          386.46       367.32       382.12       378.63
ant308     16          374.33       381.08       394.61       383.34
ant309     1           497.76       454.63       534.66       495.68
ant309     2           425.73       411.48       429.8        422.34
ant309     4           397.27       402.02       390.32       396.54
ant309     8           401.44       391.34       365.22       386.00
ant309     14          344.78       387.31       390.56       374.22
ant309     16          381.41       370.21       381.87       377.83
ant310     1           530.09       535.96       484.61       516.89
ant310     2           452.09       452.9        439.9        448.30
ant310     4           394.19       387.42       411.44       397.68
ant310     8           390.54       401.96       396.51       396.34
ant310     14          358.21       397.18       377.33       377.57
ant310     16          379.86       399.12       386.91       388.63
ant311     1           460.55       491.7        497.99       483.41
ant311     2           442.68       419.38       411.68       424.58
ant311     4           365.68       411.03       386.79       387.83
ant311     8           400.72       381.16       358.73       380.20
ant311     14          399.81       385.22       377.79       387.61
ant311     16          399.33       400.72       371.35       390.47
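
The averages in the results table, and the scaling they imply, can be recomputed with a few lines of shell. The following is only a minimal sketch, assuming awk is available; the helper names avg_times, speedup and efficiency are hypothetical and are not part of the original test scripts. Speedup is taken here as T(1 core) / T(n cores), and efficiency as speedup divided by the number of cores.

```shell
#!/bin/bash
# Hypothetical helpers for post-processing the timings (assumption:
# these are NOT part of the original anthill test scripts).

# Mean of the wall-clock times given as arguments, two decimals.
avg_times() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { if (NR) printf "%.2f\n", s / NR }'
}

# Speedup = T(1 core) / T(n cores).
# $1 = time on 1 core, $2 = time on n cores
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Parallel efficiency = speedup / n.
# $1 = time on 1 core, $2 = time on n cores, $3 = n
efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.2f\n", t1 / tn / n }'
}

# ant001, 1 core: the three measured runs
avg_times 365.94 371.68 367.26

# ant007: average on 1 core versus average on 28 cores
speedup 146.03 82.71
efficiency 146.03 82.71 28
```

With the ant007 averages from the table this gives a speedup of about 1.77 on 28 cores, which corresponds to a parallel efficiency of roughly 6%.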