Tutorial
Contents
- Simulate DNA sequencing
- Simulate RNA sequencing
- Customize simulation models
- Customize sequence identifiers
- Make your simulations reproducible
- Persistently customize simulation models with Docker
Simulate DNA sequencing
To simulate a whole-genome sequencing, you just need a reference genome file in FASTA
format for the desired species. If you don’t have one, you can follow this step:
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
or
$ curl https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Then, to simulate a whole-genome sequencing with paired-end reads, simply run:
$ sandy genome --sequencing-type paired-end hg38.fa.gz
Alternatively, you may customize several parameters for simulation:
$ sandy genome \
--verbose \ #enable log messages
--jobs 2 \ #set number of jobs
--coverage 8 \ #set genome coverage
--prefix out \ #set prefix output
--output-format fastq.gz \ #set output format
--sequencing-type paired-end \ #set sequencing type
--quality-profile poisson \ #set sequencing quality profile
--fragment-mean 300 \ #set fragment mean size
--fragment-stdd 50 \ #set standard deviation of fragment size
hg38.fa.gz
Output files
The simulation output file generated for whole-genome sequencing will depend on the --output-format
(fastq
, bam
),
--join-paired-ends
(mate read pairs into a single file) and --sequencing-type
(single-end, paired-end)
options. Along with the simulation output, a file with read counts per chromosome will be produced. In the previous
example, the following output files will be produced:
out_R1_001.fastq.gz
out_R2_001.fastq.gz
out_coverage.tsv
Simulate DNA sequencing with custom quality profiles
By default, Sandy generates phred-scores using a statistical model based on the Poisson distribution.
Alternatively, Sandy may generate fastq
quality entries that mimic the Illumina,
PacBio and Nanopore sequencers. You can change it by passing
--quality-profile
option:
$ sandy genome --verbose --quality-profile miseq_150 hg38.fa.gz
Which will output the phred-scores according to the MySeq sequencer with a read length of 150 bases.
To see all available quality profiles, run:
$ sandy quality
Simulate DNA sequencing with genomic variations
The user can also tune the reference genome, adding homozygous or heterozygous genomic variations such as SNVs, Indels, gene fusions and other types of structural variations (e.g. CNVs, retroCNVs). Sandy provides several bult-in genomic variations obtained from the 1KGP and from COSMIC.
So, let’s simulate a genome which includes the fusion between the genes NPM1 and ALK:
$ sandy genome --genomic-variation fusion_hg38_NPM1-ALK hg38.fa.gz
To see all available genomic variations, run:
$ sandy variation
Simulate RNA sequencing
To simulate a RNA sequencing, you just need a reference transcriptome file in FASTA
format for the desired species. If you don’t have one, you can follow this step:
$ wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz
or
$ curl http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gz
Then, to simulate a RNA sequencing with paired-end reads, simply run:
$ sandy transcriptome --sequencing-type paired-end gencode.v40.transcripts.fa.gz
Alternatively, you may customize several parameters for simulation:
$ sandy transcriptome \
--verbose \ #enable log messages
--jobs 2 \ #set number of jobs
--prefix out \ #set prefix output
--output-format fastq.gz \ #set output format
--sequencing-type paired-end \ #set paired-end sequencing
--quality-profile poisson \ #set sequencing quality profile
--fragment-mean 300 \ #set fragment mean size
--fragment-stdd 50 \ #set standard deviation of fragment size
--number-of-reads 1000000 \ #set the number of reads
gencode.v40.transcripts.fa.gz
Output files
The simulation output file generated for RNA sequencing will depend on the --output-format
(fastq
, bam
),
--join-paired-ends
(mate read pairs into a single file) and --sequencing-type
(single-end, paired-end)
options. Along with the simulation output, a file with the abundances per transcript will be produced,
and if there is the relationship between genes and their transcripts at the fasta
header, a file with the
abundances per gene are produced as well. In the previous example, the following output files will be produced:
out_R1_001.fastq.gz
out_R2_001.fastq.gz
out_abundance_transcripts.tsv
out_abundance_genes.tsv
Simulate RNA sequencing with custom expression profiles
By default, Sandy simulates RNA sequencing raffling transcripts according to their lengths. It is also possible to simulate RNA data that reflects the expression (abundance) of transcripts or genes in a particular tissue. For this purpose, built-in expression matrices were created from the gene expression profiles of 54 tissues from the GTExV8 project.
You can select an expression profile with the --expression-matrix
option. For example, let’s
simulate an RNA-Seq for liver tissue:
$ sandy transcriptome --expression-matrix liver gencode.v40.transcripts.fa.gz
To see all available expression profiles, run:
$ sandy expression
Customize simulation models
Users can include their own models of sequencing quality, expression profiles and genomic variations in order to adapt the simulation to their needs.
Sequencing quality
To add a custom sequencing quality profile to Sandy, you should provide a file in fastq
format or
a file containing only the ASCII-encoded phred-scores, as in this example:
$ cat my_qualities.txt
BECGF@F@DEBIDBE@DCC?HFH?BBB?H@FEEIFDCCECCCIGDIDI?@?CCC?AE?EC?F?@FB;<9<>9:599=>7:57614,30,440&"!***)#
@DCGIDBDECIHIG@FII?G?GCAD@BFECDCEF?H?GIHE?@GEECBCIHCABAFHDFAHBEBEB:5575678=75>657673-14,.113#"()#&)$
F?B@@DFAHIDD?EBFADICBFABCBBAHFCGF@@@?DEIAIEAFCEADC?B@IB?BIEABIBG@C<:;96<968:>::;778,+0203-3,#&'$$#&!
...
And with the command, you can add your custom profile to Sandy:
$ sandy quality add --verbose --quality-profile new_quality my_qualities.txt
Expression profiles
To add a custom expression profile to Sandy, you should provide a file containing an expression matrix with two columns. The first column contains the transcript or gene ids and the second column contains raw counts. Counts will be treated as weights. Example:
$ cat my_custom_expression_matrix.txt
#feature count
ENST00000000233.9 2463
ENST00000000412.7 2494
ENST00000000442.10 275
ENST00000001008.5 5112
ENST00000001146.6 637
ENST00000002125.8 660
ENST00000002165.10 478
ENST00000002501.10 57
ENST00000002596.5 183
...
And with the command, you can add your custom profile to Sandy:
$ sandy expression add --verbose --expression-matrix new_tissue my_custom_expression_matrix.tsv
Genomic variations
A genomic variation may be represented by a genomic position (seqid, position), a reference sequence at that position, an alternate sequence and a genotype (homozygous or heterozygous). To add a custom set of genomic variations to Sandy, you should provide a vcf file or a custom file.
For vcf
files, the user should point out the sample present in the vcf header and then its column will be used to extract
the genotype. If the user does not pass the option --sample-name
, then Sandy will use the first sample. Example:
$ cat my_variations.vcf
##fileformat=VCFv4.3
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA001 NA002
chr20 14370 rs81 G A 29 PASS NS=3;DP=14 GT 0/1 0/0
chr20 17330 rs82 T AAA 3 PASS NS=3;DP=20 GT 1/1 0/0
chr20 110696 rs83 A GTCT 10 PASS NS=2;DP=11 GT 0/1 1/1
...
In the my_variations.vcf
file, if you do not point out sample NA002
by passing the
option --sample-name=NA002
, the sample NA001
will be used by default.
Alternatively, you may provide a genomic variation file, which is a representation of a reduced VCF, that is, without the columns: QUAL, FILTER, INFO and FORMAT. There is only one SAMPLE column with the genotype for the entry in the format HO for homozygous and HE for heterozygous. See the example bellow:
$ cat my_variations.txt
#seqid position id reference alternate genotype
chr20 14370 rs81 G A HE
chr20 17330 rs82 T AAA HO
chr20 110696 rs83 A GTCT HE
And with the command, you can add the set of genomic variations to Sandy:
$ sandy variation add --verbose --genomic-variation=my_variations my_variations.txt
Customize sequence identifiers
The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually
nucleotides) within sequencing data. For example, the fasta
format includes the sequence identifier
always after the >
character at the beginning of the line; the fastq
format always includes it after
the @
character at the beginning of the line; the sam
format uses the first column (called the
query template name).
Sequence identifier | File format |
---|---|
>MYID and Optional information ATCGATCG |
fasta |
@MYID and Optional information ATCGATCG + ABCDEFGH |
fastq |
MYID 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH | sam |
Sequence identifiers may be customized in Sandy output using a format string passed by the user. This format
is a combination of literal and escaped characters, in a similar fashion to that used in C programming
language’s printf
function.
For example, simulating a paired-end sequencing you can add the read length, read position and mate position to all sequence identifiers with the following format:
%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r
In this case, results in fastq
format would be:
==> Into R1
@SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
...
==> Into R2
@SR.1 read=chr6:736-835 mate=chr6:979-880 length=100
Make your simulations reproducible
Sandy comes with the option --seed
which receives an integer and is used to initiate the random number
generator. The ability to set a seed is useful for those who want reproducible simulations. Pay attention
to the number of jobs (--jobs
) set, because each job receives a different seed calculated from the main seed.
So, for reproducibility, the same seed set before needs the same number of jobs set before as well.
Obviously the user also needs to use the same options for the simulation type, sequencing type, quality profile, expression matrix, genomic variation, number of reads, coverage and the file with the reference sequences.
So, let’s test the reproducibility with the following examples:
Reproducibility in genome
Let’s make the same simulation twice and compare them:
$ sandy genome -v -t single-end -q nextseq_85 -j 5 -s 1717 -c 1 -a NA12878_hg38_chr17 -O fastq -o my_sim1/ chr17.fa
$ sandy genome -v -t single-end -q nextseq_85 -j 5 -s 1717 -c 1 -a NA12878_hg38_chr17 -O fastq -o my_sim2/ chr17.fa
Comparing both the results:
$ diff -s my_sim1/out_R1_001.fastq my_sim2/out_R1_001.fastq
And you should receive the message:
Files my_sim1/out_R1_001.fastq and my_sim2/out_R1_001.fastq are identical
Reproducibility in transcriptome
in the same way as the examples above for genome, you can test the reproducibility with:
$ sandy transcriptome \
-v \
-t paired-end \
-q hiseq_101 \
-j 5 \
-s 1717 \
-n 1000000 \
-f liver \
-O fastq \
-o my_sim1/ \
gencode.v40.transcripts.fa.gz
$ sandy transcriptome \
-v \
-t paired-end \
-q hiseq_101 \
-j 5 \
-s 1717 \
-n 1000000 \
-f liver \
-O fastq \
-o my_sim2/ \
gencode.v40.transcripts.fa.gz
Comparing the simulations for R1:
$ diff -s my_sim1/out_R1_001.fastq my_sim2/out_R1_001.fastq
And you should receive the message:
Files my_sim1/out_R1_001.fastq and my_sim2/out_R1_001.fastq are identical
Comparing the simulations for R2
$ diff -s my_sim1/out_R2_001.fastq my_sim2/out_R2_001.fastq
And you should receive the message:
Files my_sim1/out_R2_001.fastq and my_sim2/out_R2_001.fastq are identical
Persistently customize simulation models with Docker
Sandy expression matrix, quality profile and structural variation patterns are stored within docker container, that is, any database changes during runtime will last as long as the container is not removed.
A named Docker volume or a mounted host directory should be used in order to keep your changes to the
database. If our container detects that the path /sandy/db
is mounted, then the database
/sandy/db/db.sqlite3
will be used instead of the default database. In the same way, if there is no
database db.sqlite3
inside the mounted path /sandy/db/
, then the default database will be copied to
/sandy/db/
and used consecutively.
Named volume
sandy_db
volume will be created at first run and will persist after container deletion.
$ docker run \
--rm \
-v sandy_db:/sandy/db \
galantelab/sandy
You can verify the created volume with the commands:
$ docker volume ls
And in more detail with the command:
$ docker volume inspect sandy_db
Mounted directory
/path/to/DB
will receive the default database at first run and any further changes will be stored in it.
$ docker run \
--rm \
-v /path/to/DB:/sandy/db \
galantelab/sandy
Now, verify the directory /path/to/DB
. You should find the file db.sqlite3
.
As you add your custom patterns to Sandy, the alterations will be kept safely outside the container.
Add custom sequencing quality profiles
$ docker run \
--rm \
-v /path/to/quality_profile.txt:/quality_profile.txt \
-v sandy_db:/sandy/db \
galantelab/sandy quality add -q new_profile /quality_profile.txt
Check the new quality profile at sandy_db
:
$ docker run --rm -v sandy_db:/sandy/db galantelab/sandy quality
Add custom expression profiles
$ docker run \
--rm \
-v /path/to/tissue_counts.txt:/tissue_counts.txt \
-v sandy_db:/sandy/db \
galantelab/sandy expression add -f new_tissue /tissue_counts.txt
Check the new expression matrix at sandy_db
:
$ docker run --rm -v sandy_db:/sandy/db galantelab/sandy expression
Add custom genomic variations
$ docker run \
--rm \
-v /path/to/sv.txt:/sv.txt \
-v sandy_db:/sandy/db \
galantelab/sandy variation add -a new_sv /sv.txt
Check the new structural variation at sandy_db
:
$ docker run --rm -v sandy_db:/sandy/db galantelab/sandy variation