The **gto_genomic_compressor** is able to provide additional compression gains over several top specific tools, while as an analysis tool, it is able to determine absolute measures, namely for many distance computations, and local measures, such as the information content contained in each element, providing a way to quantify and locate specific genomic events.

For help type:

```
```./gto_genomic_compressor -h

In the following subsections, we explain the input and output paramters.

The **gto_genomic_compressor** program needs a sequence to compress.

The attribution is given according to:

```
```SYNOPSIS

./gto_genomic_compressor [OPTION]... -r [FILE] [FILE]:[FILE]:[FILE]:[...]

SAMPLE

Run Compression : ./gto_genomic_compressor -v -l 3 sequence.txt

Run Decompression : ./gto_genomic_decompressor -v sequence.txt.co

Run Information Profile : ./gto_genomic_compressor -v -l 3 -e sequence.txt

DESCRIPTION

Compress and decompress genomic sequences for storage purposes.

Measure an upper bound of the sequences entropy.

Compute information profiles of genomic sequences.

-h, --help

usage guide (help menu).

-V, --version

Display program and version information.

-F, --force

force mode. Overwrites old files.

-v, --verbose

verbose mode (more information).

-x, --examples

show several running examples (parameter examples).

-s, --show-levels

show pre-computed compression levels (configured parameters).

-e, --estimate

it creates a file with the extension ".iae" with the

respective information content. If the file is FASTA or

FASTQ it will only use the "ACGT" (genomic) sequence.

-l [NUMBER], --level [NUMBER]

Compression level (integer).

Default level: 5.

It defines compressibility in balance with computational

resources (RAM & time). Use -s for levels perception.

-tm [NB_C]:[NB_D]:[NB_I]:[NB_H]:[NB_G]/[NB_S]:[NB_E]:[NB_A]

Template of a target context model.

Parameters:

[NB_C]: (integer [1;20]) order size of the regular context

model. Higher values use more RAM but, usually, are

related to a better compression score.

[NB_D]: (integer [1;5000]) denominator to build alpha, which

is a parameter estimator. Alpha is given by 1/[NB_D].

Higher values are usually used with higher [NB_C],

and related to confiant bets. When [NB_D] is one,

the probabilities assume a Laplacian distribution.

[NB_I]: (integer {0,1,2}) number to define if a sub-program

which addresses the specific properties of DNA

sequences (Inverted repeats) is used or not. The

number 2 turns ON this sub-program without the

regular context model (only inverted repeats). The

number 1 turns ON the sub-program using at the same

time the regular context model. The number 0 does

not contemple its use (Inverted repeats OFF). The

use of this sub-program increases the necessary time

to compress but it does not affect the RAM.

[NB_H]: (integer [1;254]) size of the cache-hash for deeper

context models, namely for [NB_C] > 14. When the

[NB_C] <= 14 use, for example, 1 as a default. The

RAM is highly dependent of this value (higher value

stand for higher RAM).

[NB_G]: (real [0;1)) real number to define gamma. This value

represents the decayment forgetting factor of the

regular context model in definition.

[NB_S]: (integer [0;20]) maximum number of editions allowed

to use a substitutional tolerant model with the same

memory model of the regular context model with

order size equal to [NB_C]. The value 0 stands for

turning the tolerant context model off. When the

model is on, it pauses when the number of editions

is higher that [NB_C], while it is turned on when

a complete match of size [NB_C] is seen again. This

is probabilistic-algorithmic model very usefull to

handle the high substitutional nature of genomic

sequences. When [NB_S] > 0, the compressor used more

processing time, but uses the same RAM and, usually,

achieves a substantial higher compression ratio. The

impact of this model is usually only noticed for

[NB_C] >= 14.

[NB_E]: (integer [1;5000]) denominator to build alpha for

substitutional tolerant context model. It is

analogous to [NB_D], however to be only used in the

probabilistic model for computing the statistics of

the substitutional tolerant context model.

[NB_A]: (real [0;1)) real number to define gamma. This value

represents the decayment forgetting factor of the

substitutional tolerant context model in definition.

Its definition and use is analogus to [NB_G].

... (you may use several target models with custom parameters)

-rm [NB_C]:[NB_D]:[NB_I]:[NB_H]:[NB_G]/[NB_S]:[NB_E]:[NB_A]

Template of a reference context model.

Use only when -r [FILE] is set (referential compression).

Parameters: the same as in -tm.

... (you may use several reference models with custom parameters)

-r [FILE], --reference [FILE]

Reference sequence filename ("-rm" are trainned here).

Example: -r file1.txt.

[FILE]

Input sequence filename (to compress) -- MANDATORY.

File(s) to compress (last argument).

For more files use splitting ":" characters.

Example: file1.txt:file2.txt:file3.txt.

In the following example, it will be downloaded seventeen DNA sequences, and compress and decompress one of the smallest (BuEb). Finally, it compares if the uncompressed sequence is equal to the original.

```
```wget http://sweet.ua.pt/pratas/datasets/DNACorpus.zip

unzip DNACorpus.zip

cp DNACorpus/BuEb .

../../bin/gto_genomic_compressor -v -l 2 BuEb

../../bin/gto_genomic_decompressor -v BuEb.co

cmp BuEb BuEb.de -l