Tutorial
Contents
- Simulate DNA sequencing
- Simulate RNA sequencing
- Customize simulation models
- Customize sequence identifiers
- Make your simulations reproducible
- Persistently customize simulation models with Docker
Simulate DNA sequencing
To simulate a whole-genome sequencing, you just need a reference genome file in FASTA
format for the desired species. If you don’t have one, you can follow this step:
or
Then, to simulate a whole-genome sequencing with paired-end reads, simply run:
Alternatively, you may customize several parameters for simulation:
Output files
The simulation output file generated for whole-genome sequencing will depend on the --output-format
(fastq
, bam
),
--join-paired-ends
(mate read pairs into a single file) and --sequencing-type
(single-end, paired-end)
options. Along with the simulation output, a file with read counts per chromosome will be produced. In the previous
example, the following output files will be produced:
Simulate DNA sequencing with custom quality profiles
By default, Sandy generates phred-scores using a statistical model based on the Poisson distribution.
Alternatively, Sandy may generate fastq
quality entries that mimic the Illumina,
PacBio and Nanopore sequencers. You can change it by passing
--quality-profile
option:
Which will output the phred-scores according to the MySeq sequencer with a read length of 150 bases.
To see all available quality profiles, run:
Simulate DNA sequencing with genomic variations
The user can also tune the reference genome, adding homozygous or heterozygous genomic variations such as SNVs, Indels, gene fusions and other types of structural variations (e.g. CNVs, retroCNVs). Sandy provides several bult-in genomic variations obtained from the 1KGP and from COSMIC.
So, let’s simulate a genome which includes the fusion between the genes NPM1 and ALK:
To see all available genomic variations, run:
Simulate RNA sequencing
To simulate a RNA sequencing, you just need a reference transcriptome file in FASTA
format for the desired species. If you don’t have one, you can follow this step:
or
Then, to simulate a RNA sequencing with paired-end reads, simply run:
Alternatively, you may customize several parameters for simulation:
Output files
The simulation output file generated for RNA sequencing will depend on the --output-format
(fastq
, bam
),
--join-paired-ends
(mate read pairs into a single file) and --sequencing-type
(single-end, paired-end)
options. Along with the simulation output, a file with the abundances per transcript will be produced,
and if there is the relationship between genes and their transcripts at the fasta
header, a file with the
abundances per gene are produced as well. In the previous example, the following output files will be produced:
Simulate RNA sequencing with custom expression profiles
By default, Sandy simulates RNA sequencing raffling transcripts according to their lengths. It is also possible to simulate RNA data that reflects the expression (abundance) of transcripts or genes in a particular tissue. For this purpose, built-in expression matrices were created from the gene expression profiles of 54 tissues from the GTExV8 project.
You can select an expression profile with the --expression-matrix
option. For example, let’s
simulate an RNA-Seq for liver tissue:
To see all available expression profiles, run:
Customize simulation models
Users can include their own models of sequencing quality, expression profiles and genomic variations in order to adapt the simulation to their needs.
Sequencing quality
To add a custom sequencing quality profile to Sandy, you should provide a file in fastq
format or
a file containing only the ASCII-encoded phred-scores, as in this example:
And with the command, you can add your custom profile to Sandy:
Expression profiles
To add a custom expression profile to Sandy, you should provide a file containing an expression matrix with two columns. The first column contains the transcript or gene ids and the second column contains raw counts. Counts will be treated as weights. Example:
And with the command, you can add your custom profile to Sandy:
Genomic variations
A genomic variation may be represented by a genomic position (seqid, position), a reference sequence at that position, an alternate sequence and a genotype (homozygous or heterozygous). To add a custom set of genomic variations to Sandy, you should provide a vcf file or a custom file.
For vcf
files, the user should point out the sample present in the vcf header and then its column will be used to extract
the genotype. If the user does not pass the option --sample-name
, then Sandy will use the first sample. Example:
In the my_variations.vcf
file, if you do not point out sample NA002
by passing the
option --sample-name=NA002
, the sample NA001
will be used by default.
Alternatively, you may provide a genomic variation file, which is a representation of a reduced VCF, that is, without the columns: QUAL, FILTER, INFO and FORMAT. There is only one SAMPLE column with the genotype for the entry in the format HO for homozygous and HE for heterozygous. See the example bellow:
And with the command, you can add the set of genomic variations to Sandy:
Customize sequence identifiers
The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually
nucleotides) within sequencing data. For example, the fasta
format includes the sequence identifier
always after the >
character at the beginning of the line; the fastq
format always includes it after
the @
character at the beginning of the line; the sam
format uses the first column (called the
query template name).
Sequence identifier | File format |
---|---|
>MYID and Optional information ATCGATCG |
fasta |
@MYID and Optional information ATCGATCG + ABCDEFGH |
fastq |
MYID 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH | sam |
Sequence identifiers may be customized in Sandy output using a format string passed by the user. This format
is a combination of literal and escaped characters, in a similar fashion to that used in C programming
language’s printf
function.
For example, simulating a paired-end sequencing you can add the read length, read position and mate position to all sequence identifiers with the following format:
In this case, results in fastq
format would be:
Make your simulations reproducible
Sandy comes with the option --seed
which receives an integer and is used to initiate the random number
generator. The ability to set a seed is useful for those who want reproducible simulations. Pay attention
to the number of jobs (--jobs
) set, because each job receives a different seed calculated from the main seed.
So, for reproducibility, the same seed set before needs the same number of jobs set before as well.
Obviously the user also needs to use the same options for the simulation type, sequencing type, quality profile, expression matrix, genomic variation, number of reads, coverage and the file with the reference sequences.
So, let’s test the reproducibility with the following examples:
Reproducibility in genome
Let’s make the same simulation twice and compare them:
Comparing both the results:
And you should receive the message:
Reproducibility in transcriptome
in the same way as the examples above for genome, you can test the reproducibility with:
Comparing the simulations for R1:
And you should receive the message:
Comparing the simulations for R2
And you should receive the message:
Persistently customize simulation models with Docker
Sandy expression matrix, quality profile and structural variation patterns are stored within docker container, that is, any database changes during runtime will last as long as the container is not removed.
A named Docker volume or a mounted host directory should be used in order to keep your changes to the
database. If our container detects that the path /sandy/db
is mounted, then the database
/sandy/db/db.sqlite3
will be used instead of the default database. In the same way, if there is no
database db.sqlite3
inside the mounted path /sandy/db/
, then the default database will be copied to
/sandy/db/
and used consecutively.
Named volume
sandy_db
volume will be created at first run and will persist after container deletion.
You can verify the created volume with the commands:
And in more detail with the command:
Mounted directory
/path/to/DB
will receive the default database at first run and any further changes will be stored in it.
Now, verify the directory /path/to/DB
. You should find the file db.sqlite3
.
As you add your custom patterns to Sandy, the alterations will be kept safely outside the container.
Add custom sequencing quality profiles
Check the new quality profile at sandy_db
:
Add custom expression profiles
Check the new expression matrix at sandy_db
:
Add custom genomic variations
Check the new structural variation at sandy_db
: