Program gto_genomic_compressor

The gto_genomic_compressor compresses genomic sequences and can provide additional compression gains over several top specialized tools. As an analysis tool, it determines absolute measures, which are used in many distance computations, and local measures, such as the information content of each element, which provide a way to quantify and locate specific genomic events.

For help type:

./gto_genomic_compressor -h

In the following subsections, we explain the input and output parameters.

Input parameters

The gto_genomic_compressor program needs a sequence to compress.

The arguments are given according to:

./gto_genomic_compressor [OPTION]... -r [FILE] [FILE]:[FILE]:[FILE]:[...]

Run Compression : ./gto_genomic_compressor -v -l 3 sequence.txt
Run Decompression : ./gto_genomic_decompressor -v
Run Information Profile : ./gto_genomic_compressor -v -l 3 -e sequence.txt

Compress and decompress genomic sequences for storage purposes.
Measure an upper bound of the sequence entropy.
Compute information profiles of genomic sequences.
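
As a rough illustration of the entropy upper bound mentioned above, the size of the compressed output can be converted into an average number of bits per base. The sketch below is not part of the tool; the function name and the sizes used are hypothetical, and it assumes the number of bases and the compressed size in bytes are already known.

```python
def bits_per_base(compressed_bytes, n_bases):
    """Average number of bits spent per base in the compressed output."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return 8.0 * compressed_bytes / n_bases


# Hypothetical sizes: a 1,000,000-base sequence compressed to 237,500 bytes.
print(bits_per_base(237_500, 1_000_000))  # 1.9
```

For a four-symbol alphabet, a value below 2 bits per base means the compressor found some redundancy in the sequence.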

-h, --help
usage guide (help menu).

-V, --version
Display program and version information.

-F, --force
force mode. Overwrites old files.

-v, --verbose
verbose mode (more information).

-x, --examples
show several running examples (parameter examples).

-s, --show-levels
show pre-computed compression levels (configured parameters).

-e, --estimate
Creates a file with the extension ".iae" containing the
respective information content. If the file is FASTA or
FASTQ, only the "ACGT" (genomic) sequence is used.

-l [NUMBER], --level [NUMBER]
Compression level (integer).
Default level: 5.
It balances compressibility against computational
resources (RAM & time). Use -s to list the levels.

-tm [NB_C]:[NB_D]:[NB_I]:[NB_H]:[NB_G]/[NB_S]:[NB_E]:[NB_A]
Template of a target context model.
[NB_C]: (integer [1;20]) order size of the regular context
model. Higher values use more RAM but, usually, are
related to a better compression score.
[NB_D]: (integer [1;5000]) denominator used to build alpha,
the estimator parameter. Alpha is given by 1/[NB_D].
Higher values are usually used with higher [NB_C],
and correspond to more confident bets. When [NB_D] is
one, the probabilities follow the Laplace estimator.
[NB_I]: (integer {0,1,2}) selects whether the sub-program that
addresses a specific property of DNA sequences
(inverted repeats) is used. The value 2 turns ON this
sub-program without the regular context model (only
inverted repeats). The value 1 turns ON the
sub-program together with the regular context model.
The value 0 turns it OFF (inverted repeats are not
used). Using this sub-program increases the time
needed to compress but does not affect the RAM.
[NB_H]: (integer [1;254]) size of the cache-hash for deeper
context models, namely for [NB_C] > 14. When
[NB_C] <= 14 use, for example, 1 as a default. RAM
usage depends strongly on this value (higher values
require more RAM).
[NB_G]: (real [0;1)) real number that defines gamma, the
decay (forgetting) factor of the regular context
model being defined.
[NB_S]: (integer [0;20]) maximum number of edits allowed
when using a substitutional tolerant model with the
same memory model as the regular context model with
order size equal to [NB_C]. The value 0 turns the
tolerant context model off. When the model is on, it
pauses when the number of edits is higher than
[NB_C], and it is turned on again when a complete
match of size [NB_C] is seen. This is a
probabilistic-algorithmic model, very useful for
handling the highly substitutional nature of genomic
sequences. When [NB_S] > 0, the compressor uses more
processing time, but uses the same RAM and, usually,
achieves a substantially higher compression ratio.
The impact of this model is usually only noticed for
[NB_C] >= 14.
[NB_E]: (integer [1;5000]) denominator to build alpha for the
substitutional tolerant context model. It is
analogous to [NB_D], but it is used only in the
probabilistic model that computes the statistics of
the substitutional tolerant context model.
[NB_A]: (real [0;1)) real number that defines gamma, the
decay (forgetting) factor of the substitutional
tolerant context model being defined. Its definition
and use are analogous to [NB_G].

... (you may use several target models with custom parameters)
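
To make the roles of [NB_C] and [NB_D] more concrete, the following minimal sketch estimates per-base information content with a single adaptive context model, using the common estimator (count + alpha) / (total + alpha * alphabet size) with alpha = 1/[NB_D]. It is an illustration under stated assumptions, not the tool's implementation: the function name is hypothetical, the input is assumed to contain only "ACGT" symbols, and the cache-hash, inverted repeats, gamma and the tolerant model are all left out.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: the input contains only these symbols


def information_profile(seq, order, nb_d):
    """Per-base information content (bits) under one adaptive context model."""
    alpha = 1.0 / nb_d  # estimator parameter, built from the denominator
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]  # preceding symbols (shorter at the start)
        seen = counts[ctx]
        total = sum(seen.values())
        p = (seen[sym] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))
        seen[sym] += 1  # update the counts only after the symbol is coded
    return profile


profile = information_profile("ACGT" * 50, order=2, nb_d=16)
average_bits = sum(profile) / len(profile)
```

With a larger denominator, alpha is smaller, so a symbol that has always followed a given context quickly receives a probability close to one. This is the sense in which higher values correspond to more confident bets.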

-rm [NB_C]:[NB_D]:[NB_I]:[NB_H]:[NB_G]/[NB_S]:[NB_E]:[NB_A]
Template of a reference context model.
Use only when -r [FILE] is set (referential compression).
Parameters: the same as in -tm.

... (you may use several reference models with custom parameters)
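
Because a template packs eight values into one string, a small helper that splits it and checks the documented ranges can catch mistakes before a long run. The sketch below follows the template layout and ranges listed above; the class and function names are hypothetical, and the program itself may accept or reject values differently.

```python
from typing import NamedTuple


class ModelTemplate(NamedTuple):
    nb_c: int
    nb_d: int
    nb_i: int
    nb_h: int
    nb_g: float
    nb_s: int
    nb_e: int
    nb_a: float


def parse_template(text):
    """Split [NB_C]:[NB_D]:[NB_I]:[NB_H]:[NB_G]/[NB_S]:[NB_E]:[NB_A] and check ranges."""
    regular, _, tolerant = text.partition("/")
    c, d, i, h, g = regular.split(":")   # raises ValueError on a wrong field count
    s, e, a = tolerant.split(":")
    tpl = ModelTemplate(int(c), int(d), int(i), int(h), float(g),
                        int(s), int(e), float(a))
    checks = {
        "NB_C": 1 <= tpl.nb_c <= 20,
        "NB_D": 1 <= tpl.nb_d <= 5000,
        "NB_I": tpl.nb_i in (0, 1, 2),
        "NB_H": 1 <= tpl.nb_h <= 254,
        "NB_G": 0.0 <= tpl.nb_g < 1.0,
        "NB_S": 0 <= tpl.nb_s <= 20,
        "NB_E": 1 <= tpl.nb_e <= 5000,
        "NB_A": 0.0 <= tpl.nb_a < 1.0,
    }
    bad = [name for name, ok in checks.items() if not ok]
    if bad:
        raise ValueError("out of range: " + ", ".join(bad))
    return tpl


# Illustrative values only, not a recommended configuration.
print(parse_template("16:200:1:10:0.9/3:50:0.9"))
```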

-r [FILE], --reference [FILE]
Reference sequence filename (the "-rm" models are trained here).
Example: -r file1.txt.

Input sequence filename (to compress) -- MANDATORY.
File(s) to compress (last argument).
To compress several files, separate them with ":" characters.
Example: file1.txt:file2.txt:file3.txt.
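
When the compressor is driven from a script, the argument list can be assembled programmatically. The following sketch only builds the list, using the options described above (-v, -l, -r) and the ":" separator for several input files. The function name and the default program path are assumptions for illustration.

```python
def build_command(targets, level=None, reference=None, verbose=False,
                  program="./gto_genomic_compressor"):
    """Build an argument list; the input files always come last, joined by ':'."""
    if not targets:
        raise ValueError("at least one file to compress is required")
    if any(":" in name for name in targets):
        raise ValueError("file names must not contain ':'")
    cmd = [program]
    if verbose:
        cmd.append("-v")
    if level is not None:
        cmd += ["-l", str(level)]
    if reference is not None:
        cmd += ["-r", reference]
    cmd.append(":".join(targets))
    return cmd


print(build_command(["file1.txt", "file2.txt"], level=3, verbose=True))
# ['./gto_genomic_compressor', '-v', '-l', '3', 'file1.txt:file2.txt']
```

The resulting list can be passed to subprocess.run without going through a shell.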

The following example downloads seventeen DNA sequences, then compresses and decompresses one of the smallest (BuEb). Finally, it checks whether the decompressed sequence is identical to the original.

cp DNACorpus/BuEb .
../../bin/gto_genomic_compressor -v -l 2 BuEb
../../bin/gto_genomic_decompressor -v
cmp BuEb -l
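
The final comparison can also be done from a script. This is a minimal sketch using Python's standard filecmp module; the function name is hypothetical, and the path of the decompressed file is left to the caller because it depends on what the decompressor writes.

```python
import filecmp


def round_trip_ok(original_path, restored_path):
    """Return True when both files have identical contents (byte-by-byte check)."""
    # shallow=False forces a content comparison instead of relying on os.stat data.
    return filecmp.cmp(original_path, restored_path, shallow=False)
```

A lossless round trip should make round_trip_ok return True for the original file and its decompressed copy.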