Program gto_fastq_metagenomics

The gto_fastq_metagenomics is an ultra-fast method to infer metagenomic composition of sequenced reads relative to a database. gto_fastq_metagenomics measures similarity between any FASTQ file (or FASTA), independently from the size, against any multi-FASTA database, such as the entire set of complete genomes from the NCBI. gto_fastq_metagenomics supports single reads, paired-end reads, and compositions of both. It has been tested in many plataforms, such as Illumina MySeq, HiSeq, Novaseq, IonTorrent.

gto_fastq_metagenomics is efficient to detect the presence and authenticate a given species in the FASTQ reads. The core of the method is based on relative data compression. gto_fastq_metagenomics uses variable multi-threading, without multiplying the memory for each thread, being able to run efficiently in a common laptop.

For help type:

./gto_fastq_metagenomics -h

In the following subsections, we explain the input and output paramters.

Input parameters

The gto_fastq_metagenomics program needs a FASTQ file to compute.

The attribution is given according to:

gto_fastq_metagenomics v3.1: a tool to infer metagenomic composition.

gto_fastq_metagenomics [OPTION]... [FILE1]:[FILE2]:... [FILE]

gto_fastq_metagenomics -v -F -l 47 -Z -y reads1.fq:reads2.fq DB.fa

It infers metagenomic sample composition of sequenced reads.
The core of the method uses a cooperation between multiple
context and tolerant context models with several depths.
The reference sequences must be in a multi-FASTA format.
The sequenced reads must be trimmed and in FASTQ format.

Non-mandatory arguments:

-h give this help,
-F force mode (overwrites top file),
-V display version number,
-v verbose mode (more information),
-Z database local similarity,
-s show compression levels,

-l compression level [1;47],
-p subsampling (default: 1),
-t top of similarity (default: 20),
-n number of threads (default: 2),

-x similarity top filename,
-y profile filename (-Z must be on).

Mandatory arguments:

[FILE1]:[FILE2]:... metagenomic filename (FASTQ),
Use ":" for splitting files.

[FILE] database filename (Multi-FASTA).

Copyright (C) 2014-2019, IEETA, University of Aveiro.
This is a Free software, under GPLv3. You may redistribute
copies of it under the terms of the GNU - General Public
License v3 .

An example of such an input file is:

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72
+SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=72


The output of the gto_fastq_metagenomics program is a CSV file (top.csv) with the highest probability of being contained in the samples. An example for this CSV file is the following:

1 66725 12.263 NC_037703.1_Saccharomycodes_ludwigii_strain_Y-8871_mitochondrion
2 66725 12.263 NC_037703.1_Saccharomycodes_ludwigii_strain_Y-8871_mitochondrion
3 107123 11.492 NC_012621.1_Nakaseomyces_bacillisporus_mitochondrion
4 107123 11.492 NC_012621.1_Nakaseomyces_bacillisporus_mitochondrion
5 16592 11.153 NC_024030.1_Equus_przewalskii_mitochondrial_DNA
6 14583 10.851 NC_021120.1_Bursaphelenchus_mucronatus_mitochondrion
7 162504 10.607 NC_018415.1_Candidatus_Carsonella_ruddii_CS_isolate_Thao2000
8 10315 10.586 NC_016117.1_Mnemiopsis_leidyi_mitochondrion
9 162589 10.550 NC_018414.1_Candidatus_Carsonella_ruddii_CE_isolate_Thao2000
10 166163 10.476 NC_018416.1_Candidatus_Carsonella_ruddii_HC_isolate_Thao2000