Tutorial
Contents
- Simulate DNA sequencing
- Simulate RNA sequencing
- Customize simulation models
- Customize sequence identifiers
- Make your simulations reproducible
- Persistently customize simulation models with Docker
Simulate DNA sequencing
To simulate a whole-genome sequencing, you just need a reference genome file in FASTA 
format for the desired species. If you don’t have one, you can follow this step:
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gzor
$ curl https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gzThen, to simulate a whole-genome sequencing with paired-end reads, simply run:
$ sandy genome --sequencing-type paired-end hg38.fa.gzAlternatively, you may customize several parameters for simulation:
$ sandy genome \
    --verbose \    #enable log messages
    --jobs 2 \    #set number of jobs
    --coverage 8 \    #set genome coverage
    --prefix out \    #set prefix output
    --output-format fastq.gz \    #set output format
    --sequencing-type paired-end \    #set sequencing type
    --quality-profile poisson \    #set sequencing quality profile
    --fragment-mean 300 \    #set fragment mean size
    --fragment-stdd 50 \    #set standard deviation of fragment size
    hg38.fa.gzOutput files
The simulation output file generated for whole-genome sequencing will depend on the --output-format (fastq, bam),
--join-paired-ends (mate read pairs into a single file) and --sequencing-type (single-end, paired-end) 
options. Along with the simulation output, a file with read counts per chromosome will be produced. In the previous
example, the following output files will be produced:
out_R1_001.fastq.gz
out_R2_001.fastq.gz
out_coverage.tsvSimulate DNA sequencing with custom quality profiles
By default, Sandy generates phred-scores using a statistical model based on the Poisson distribution.
Alternatively, Sandy may generate fastq quality entries that mimic the Illumina,
PacBio and Nanopore sequencers. You can change it by passing
 --quality-profile option:
$ sandy genome --verbose --quality-profile miseq_150 hg38.fa.gzWhich will output the phred-scores according to the MySeq sequencer with a read length of 150 bases.
To see all available quality profiles, run:
$ sandy qualitySimulate DNA sequencing with genomic variations
The user can also tune the reference genome, adding homozygous or heterozygous genomic variations such as SNVs, Indels, gene fusions and other types of structural variations (e.g. CNVs, retroCNVs). Sandy provides several bult-in genomic variations obtained from the 1KGP and from COSMIC.
So, let’s simulate a genome which includes the fusion between the genes NPM1 and ALK:
$ sandy genome --genomic-variation fusion_hg38_NPM1-ALK hg38.fa.gzTo see all available genomic variations, run:
$ sandy variationSimulate RNA sequencing
To simulate a RNA sequencing, you just need a reference transcriptome file in FASTA
format for the desired species. If you don’t have one, you can follow this step:
$ wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gzor
$ curl http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.transcripts.fa.gzThen, to simulate a RNA sequencing with paired-end reads, simply run:
$ sandy transcriptome --sequencing-type paired-end gencode.v40.transcripts.fa.gzAlternatively, you may customize several parameters for simulation:
$ sandy transcriptome \
    --verbose \    #enable log messages
    --jobs 2 \    #set number of jobs
    --prefix out \    #set prefix output
    --output-format fastq.gz \    #set output format
    --sequencing-type paired-end \    #set paired-end sequencing
    --quality-profile poisson \    #set sequencing quality profile
    --fragment-mean 300 \    #set fragment mean size
    --fragment-stdd 50 \    #set standard deviation of fragment size
    --number-of-reads 1000000 \    #set the number of reads
    gencode.v40.transcripts.fa.gzOutput files
The simulation output file generated for RNA sequencing will depend on the --output-format (fastq, bam), 
--join-paired-ends (mate read pairs into a single file) and --sequencing-type (single-end, paired-end) 
options. Along with the simulation output, a file with the abundances per transcript will be produced, 
and if there is the relationship between genes and their transcripts at the fasta header, a file with the
abundances per gene are produced as well. In the previous example, the following output files will be produced:
out_R1_001.fastq.gz
out_R2_001.fastq.gz
out_abundance_transcripts.tsv
out_abundance_genes.tsvSimulate RNA sequencing with custom expression profiles
By default, Sandy simulates RNA sequencing raffling transcripts according to their lengths. It is also possible to simulate RNA data that reflects the expression (abundance) of transcripts or genes in a particular tissue. For this purpose, built-in expression matrices were created from the gene expression profiles of 54 tissues from the GTExV8 project.
You can select an expression profile with the --expression-matrix option. For example, let’s
simulate an RNA-Seq for liver tissue:
$ sandy transcriptome --expression-matrix liver gencode.v40.transcripts.fa.gzTo see all available expression profiles, run:
$ sandy expressionCustomize simulation models
Users can include their own models of sequencing quality, expression profiles and genomic variations in order to adapt the simulation to their needs.
Sequencing quality
To add a custom sequencing quality profile to Sandy, you should provide a file in fastq format or 
a file containing only the ASCII-encoded phred-scores, as in this example:
$ cat my_qualities.txtBECGF@F@DEBIDBE@DCC?HFH?BBB?H@FEEIFDCCECCCIGDIDI?@?CCC?AE?EC?F?@FB;<9<>9:599=>7:57614,30,440&"!***)#
@DCGIDBDECIHIG@FII?G?GCAD@BFECDCEF?H?GIHE?@GEECBCIHCABAFHDFAHBEBEB:5575678=75>657673-14,.113#"()#&)$
F?B@@DFAHIDD?EBFADICBFABCBBAHFCGF@@@?DEIAIEAFCEADC?B@IB?BIEABIBG@C<:;96<968:>::;778,+0203-3,#&'$$#&!
...And with the command, you can add your custom profile to Sandy:
$ sandy quality add --verbose --quality-profile new_quality my_qualities.txtExpression profiles
To add a custom expression profile to Sandy, you should provide a file containing an expression matrix with two columns. The first column contains the transcript or gene ids and the second column contains raw counts. Counts will be treated as weights. Example:
$ cat my_custom_expression_matrix.txt#feature	count
ENST00000000233.9	2463
ENST00000000412.7	2494
ENST00000000442.10	275
ENST00000001008.5	5112
ENST00000001146.6	637
ENST00000002125.8	660
ENST00000002165.10	478
ENST00000002501.10	57
ENST00000002596.5	183
...And with the command, you can add your custom profile to Sandy:
$ sandy expression add --verbose --expression-matrix new_tissue my_custom_expression_matrix.tsvGenomic variations
A genomic variation may be represented by a genomic position (seqid, position), a reference sequence at that position, an alternate sequence and a genotype (homozygous or heterozygous). To add a custom set of genomic variations to Sandy, you should provide a vcf file or a custom file.
For vcf files, the user should point out the sample present in the vcf header and then its column will be used to extract
the genotype. If the user does not pass the option --sample-name, then Sandy will use the first sample. Example:
$ cat my_variations.vcf##fileformat=VCFv4.3
...
#CHROM POS     ID    REF ALT   QUAL FILTER INFO        FORMAT NA001 NA002
chr20  14370   rs81  G   A     29   PASS   NS=3;DP=14  GT     0/1   0/0
chr20  17330   rs82  T   AAA   3    PASS   NS=3;DP=20  GT     1/1   0/0
chr20  110696  rs83  A   GTCT  10   PASS   NS=2;DP=11  GT     0/1   1/1
...In the my_variations.vcf file, if you do not point out sample NA002 by passing the
option --sample-name=NA002, the sample NA001 will be used by default.
Alternatively, you may provide a genomic variation file, which is a representation of a reduced VCF, that is, without the columns: QUAL, FILTER, INFO and FORMAT. There is only one SAMPLE column with the genotype for the entry in the format HO for homozygous and HE for heterozygous. See the example bellow:
$ cat my_variations.txt#seqid  position id        reference alternate	genotype
chr20   14370    rs81      G         A          HE
chr20   17330    rs82      T         AAA        HO
chr20   110696   rs83      A         GTCT       HEAnd with the command, you can add the set of genomic variations to Sandy:
$ sandy variation add --verbose --genomic-variation=my_variations my_variations.txtCustomize sequence identifiers
The sequence identifier, as the name implies, is a string that identifies a biological sequence (usually
nucleotides) within sequencing data. For example, the fasta format includes the sequence identifier
always after the > character at the beginning of the line; the fastq format always includes it after
the @ character at the beginning of the line; the sam format uses the first column (called the
query template name).
| Sequence identifier | File format | 
|---|---|
| >MYID and Optional information ATCGATCG | fasta | 
| @MYID and Optional information ATCGATCG + ABCDEFGH | fastq | 
| MYID 99 chr1 123456 20 8M chr1 123478 30 ATCGATCG ABCDEFGH | sam | 
Sequence identifiers may be customized in Sandy output using a format string passed by the user. This format
is a combination of literal and escaped characters, in a similar fashion to that used in C programming
language’s printf function.
For example, simulating a paired-end sequencing you can add the read length, read position and mate position to all sequence identifiers with the following format:
%i.%U read=%c:%t-%n mate=%c:%T-%N length=%rIn this case, results in fastq format would be:
==> Into R1
@SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
...
==> Into R2
@SR.1 read=chr6:736-835 mate=chr6:979-880 length=100Make your simulations reproducible
Sandy comes with the option --seed which receives an integer and is used to initiate the random number
generator. The ability to set a seed is useful for those who want reproducible simulations. Pay attention
to the number of jobs (--jobs) set, because each job receives a different seed calculated from the main seed.
So, for reproducibility, the same seed set before needs the same number of jobs set before as well.
Obviously the user also needs to use the same options for the simulation type, sequencing type, quality profile, expression matrix, genomic variation, number of reads, coverage and the file with the reference sequences.
So, let’s test the reproducibility with the following examples:
Reproducibility in genome
Let’s make the same simulation twice and compare them:
$ sandy genome -v -t single-end -q nextseq_85 -j 5 -s 1717 -c 1 -a NA12878_hg38_chr17 -O fastq -o my_sim1/ chr17.fa$ sandy genome -v -t single-end -q nextseq_85 -j 5 -s 1717 -c 1 -a NA12878_hg38_chr17 -O fastq -o my_sim2/ chr17.faComparing both the results:
$ diff -s my_sim1/out_R1_001.fastq my_sim2/out_R1_001.fastqAnd you should receive the message:
Files my_sim1/out_R1_001.fastq and my_sim2/out_R1_001.fastq are identicalReproducibility in transcriptome
in the same way as the examples above for genome, you can test the reproducibility with:
$ sandy transcriptome \
    -v \
    -t paired-end \
    -q hiseq_101 \
    -j 5 \
    -s 1717 \
    -n 1000000 \
    -f liver \
    -O fastq \
    -o my_sim1/ \
    gencode.v40.transcripts.fa.gz$ sandy transcriptome \
    -v \
    -t paired-end \
    -q hiseq_101 \
    -j 5 \
    -s 1717 \
    -n 1000000 \
    -f liver \
    -O fastq \
    -o my_sim2/ \
    gencode.v40.transcripts.fa.gzComparing the simulations for R1:
$ diff -s my_sim1/out_R1_001.fastq my_sim2/out_R1_001.fastqAnd you should receive the message:
Files my_sim1/out_R1_001.fastq and my_sim2/out_R1_001.fastq are identicalComparing the simulations for R2
$ diff -s my_sim1/out_R2_001.fastq my_sim2/out_R2_001.fastqAnd you should receive the message:
Files my_sim1/out_R2_001.fastq and my_sim2/out_R2_001.fastq are identicalPersistently customize simulation models with Docker
Sandy expression matrix, quality profile and structural variation patterns are stored within docker container, that is, any database changes during runtime will last as long as the container is not removed.
A named Docker volume or a mounted host directory should be used in order to keep your changes to the
database. If our container detects that the path /sandy/db is mounted, then the database
/sandy/db/db.sqlite3 will be used instead of the default database. In the same way, if there is no
database db.sqlite3 inside the mounted path /sandy/db/, then the default database will be copied to
/sandy/db/ and used consecutively.
Named volume
sandy_db volume will be created at first run and will persist after container deletion.
$ docker run \
    --rm \
    -v sandy_db:/sandy/db \
    galantelab/sandyYou can verify the created volume with the commands:
$ docker volume lsAnd in more detail with the command:
$ docker volume inspect sandy_dbMounted directory
/path/to/DB will receive the default database at first run and any further changes will be stored in it.
$ docker run \
    --rm \
    -v /path/to/DB:/sandy/db \
    galantelab/sandyNow, verify the directory /path/to/DB. You should find the file db.sqlite3.
As you add your custom patterns to Sandy, the alterations will be kept safely outside the container.
Add custom sequencing quality profiles
$ docker run \
    --rm \
    -v /path/to/quality_profile.txt:/quality_profile.txt \
    -v sandy_db:/sandy/db \
    galantelab/sandy quality add -q new_profile /quality_profile.txtCheck the new quality profile at sandy_db:
$ docker run --rm -v sandy_db:/sandy/db galantelab/sandy qualityAdd custom expression profiles
$ docker run \
    --rm \
    -v /path/to/tissue_counts.txt:/tissue_counts.txt \
    -v sandy_db:/sandy/db \
    galantelab/sandy expression add -f new_tissue /tissue_counts.txtCheck the new expression matrix at sandy_db:
$ docker run --rm -v sandy_db:/sandy/db galantelab/sandy expressionAdd custom genomic variations
$ docker run \
    --rm \
    -v /path/to/sv.txt:/sv.txt \
    -v sandy_db:/sandy/db \
    galantelab/sandy variation add -a new_sv /sv.txtCheck the new structural variation at sandy_db:
$ docker run --rm -v sandy_db:/sandy/db galantelab/sandy variation