Program gto_fastq_xs

The gto_fastq_xs program is a FASTQ read simulation tool that is flexible, portable (it does not need a reference sequence) and tunable in terms of sequence complexity. XS simulates the Ion Torrent, Roche-454, Illumina and ABI-SOLiD sequencing types. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing for large-scale projects, and at testing FASTQ compression algorithms. XS can also simulate the three main FASTQ components individually (headers, DNA sequences and quality scores). Quality scores can be simulated using uniform and Gaussian distributions.

For help type:

./gto_fastq_xs -h


In the following subsections, we explain the input and output parameters.

Input parameters

The gto_fastq_xs program needs a FASTQ file to compute.

The options are given according to:

Usage: ./gto_fastq_xs [OPTION]... [FILE]

System options:
-h give this help
-v verbose mode

Main FASTQ options:
-t < sequencingType > type: 1=Roche-454, 2=Illumina, 3=ABI SOLiD, 4=Ion Torrent
-hf < headerFormat > header format: 1=Length appendix, 2=Pair End
-i n=< instrumentName > the unique instrument name (use n= before name)
-o use the same header in third line of the read
-ls < lineSize > static line (bases/quality scores) size
-ld < minSize >:< maxSize > dynamic line (bases/quality scores) size
-n < numberOfReads > number of reads per file

DNA options:
-f < A >,< C >,< G >,< T >,< N > symbols frequency
-rn < numberOfRepeats > repeats: number (default: 0)
-ri < repeatsMinSize > repeats: minimum size
-ra < repeatsMaxSize > repeats: maximum size
-rm < mutationRate > repeats: mutation frequency
-rr repeats: use reverse complement repeats

Quality scores options:
-qt < assignmentType > quality scores distribution: 1=uniform, 2=gaussian
-qf < statsFile > load file: mean, standard deviation (when: -qt 2)
-qc < template > custom template ascii alphabet

Filtering options:
-eh excludes the use of headers from output
-eo excludes the use of optional headers (+) from output
-ed excludes the use of DNA bases from output
-edb excludes '\n' when DNA bases line size is reached
-es excludes the use of quality scores from output

Stochastic options:
-s < seed > generation seed

< genFile > simulated output file

Common usage:
./XS -v -t 1 -i n=MySeq -ld 30:80 -n 20000 -qt=1 -qc 33,36,39:43 File
./XS -v -ls 100 -n 10000 -eh -eo -es -edb -f 0.3,0.2,0.2,0.3,0.0 -rn 50 -ri 300 -ra 3000 -rm 0.1 File
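
To make the meaning of the DNA and line-size options more concrete, the following Python sketch draws a few DNA lines with a dynamic length (as in -ld 30:80) and fixed symbol frequencies (as in -f 0.3,0.2,0.2,0.3,0.0). It is only an illustration of what these options describe, not the algorithm used by XS; the function name simulate_read and the seed value are assumptions made for this example.

```python
import random

def simulate_read(rng, freqs, min_len, max_len):
    """Draw one DNA line: length in [min_len, max_len], symbols by frequency."""
    symbols = "ACGTN"  # same order as the -f option: A, C, G, T, N
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    return "".join(rng.choices(symbols, weights=freqs, k=length))

# Hypothetical seed; a fixed seed makes the run reproducible, in the spirit of -s.
rng = random.Random(42)
freqs = [0.3, 0.2, 0.2, 0.3, 0.0]  # values taken from the second common-usage line
reads = [simulate_read(rng, freqs, 30, 80) for _ in range(5)]

for read in reads:
    print(len(read), read)
```

With an N frequency of 0.0, no N symbols are drawn, and every line length falls between the two bounds.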


An example of such an input file is:

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=60
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIIIIIDIII
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGAAGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBIIIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
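
Each record in the example above spans four lines: a header starting with @, the DNA bases, a + line and the quality scores, with one quality symbol per base. A minimal Python sketch of reading such records is shown below; the function name parse_fastq and the short sample record are hypothetical and serve only as an illustration.

```python
def parse_fastq(text):
    """Yield (header, bases, qualities) from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(lines), 4):
        header, bases, plus, quals = lines[i:i + 4]
        record_no = i // 4 + 1
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record {record_no}")
        if len(bases) != len(quals):
            raise ValueError(f"length mismatch in record {record_no}")
        yield header[1:], bases, quals

# Hypothetical one-record sample, not taken from a real run.
sample = "@read1 length=8\nACGTNACG\n+\nIIII9III\n"
records = list(parse_fastq(sample))
for header, bases, quals in records:
    print(header, len(bases), len(quals))
```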


Output

The output of the gto_fastq_xs program is a FASTQ file.

Using the input above and the common usage with 5 reads (-n 5), an example of the output is the following:

@output.fastq.598 LQGQLWH01D5WVZ length=62
TTCNTNCCAGGTAAAGAGAACATNCCGNCGCACTACTCGTAAGACTTGCTGGNCGAGAAAGG
+
)(+!*!$')($(()+'))$$()'!)!$!!$*+)+''('!)))!+!)(!+!*$!'$*)**++!
@output.fastq.1510 LQGQLWH01A7LJI length=57
CTAGACTACTCGAGCACTAGGCTCGCGTNTACCANGGGGNCTGCGNGTTGGCNCGGT
+
)+(*(+$*)+!*)!'!!(!(!!(*'$!+!(()$'!!+*+!!))!*!')***+!$+''
@output.fastq.2153 LQGQLWH01CHBQJ length=33
ACTTTTTGCTCAAGCAGGGTTGCCTAGCAANAC
+
*)++!+$''')*)**!+)$(*((*)$!'!+!!*
@output.fastq.3251 LQGQLWH01C8OY4 length=75
TCTTTCCTTCNCGNCCNAATTCCCCATAANAACTTAAAATCNCNNGCTGCGCGTGATCAACAATATTAATACTCC
+
!*''+*'!''!+!!!*!'!+(++)*(*($!!*((')$*!$(!'!!'+)$+*!$*!**!'()$!*'+'*'+!!+'(
@output.fastq.3934 LQGQLWH01AQDXM length=36
GGTAACNNGGAATTCTTCCAATTANCCNTGTCCGGC
+
$+)'!'!!)+)+!''**$$*!!')!+)!)*()!))$
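
The quality lines above appear to use only the symbols ! $ ' ( ) * +, which correspond to the ASCII codes 33, 36 and 39 to 43, matching the template -qc 33,36,39:43 from the common usage. The Python sketch below expands such a template into its symbols. It assumes that commas separate entries and that a colon denotes an inclusive range of codes; this reading is inferred from the example output rather than from the program documentation, and the function name expand_template is hypothetical.

```python
def expand_template(template):
    """Expand e.g. '33,36,39:43' into the corresponding ASCII symbols."""
    codes = []
    for part in template.split(","):
        if ":" in part:
            # Assumed meaning: inclusive range of ASCII codes.
            low, high = (int(x) for x in part.split(":"))
            codes.extend(range(low, high + 1))
        else:
            codes.append(int(part))
    return "".join(chr(c) for c in codes)

alphabet = expand_template("33,36,39:43")
print(alphabet)  # !$'()*+

# Third quality line of the example output above.
quality_line = "*)++!+$''')*)**!+)$(*((*)$!'!+!!*"
uses_only_template = set(quality_line) <= set(alphabet)
print(uses_only_template)  # True
```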