Contents
- Overview
- Installation
- Dependencies
- Commands and options
- Finding signatures of genes or transcripts
- Applying signatures of genes or transcripts in survival
- Integrative analysis
- Toy example
- Authors
Overview
Reboot is a flexible, easy-to-use algorithm to identify and validate gene or transcript signatures whose expression is highly correlated with patient survival. This tool innovates by using a multivariate strategy with penalized Cox regression (LASSO method) combined with a bootstrap approach, which gives robust convergence of the regression procedure, and by offering a variety of statistical tests for the signature score. Reboot comprises two modules developed in R (version 4.0.4). The regression module obtains gene/transcript signatures from a given set of samples. The survival module, in turn, produces, applies and validates a score, calculated from the obtained signature, in patient datasets. In this module, a different set of samples may be provided for validation purposes, and clinical variables may also be taken into account. Reboot also has the execution option complete, which runs the two modules one after the other.
Reboot workflow: The first module (regression) performs a regression analysis to identify a gene or transcript signature. The second module (survival) runs a survival analysis of a score calculated from the obtained signature.
Installation
Reboot can be obtained from GitHub and installed via a Docker container (recommended) or through direct installation.
-
Docker container
This method works on any distribution or operating system, as long as Docker is installed.
docker pull galantelab/reboot
-
Direct installation
This method requires R (version >= 4.0.4) to be installed beforehand:
git clone https://github.com/galantelab/reboot.git
sudo sh reboot/install.sh
Dependencies
In order to work properly, the following packages are necessary (included in the installation procedures):
- argparse
- BiocManager
- BioinformaticsFMRP/TCGAbiolinks (from BiocManager)
- data.table
- extrafont
- forestmodel
- hash
- OptimalCutpoints
- optparse
- penalized
- remotes (from BiocManager)
- R.utils
- sjlabelled
- sjmisc
- sjstats
- survcomp (from BiocManager)
- survival
- survivalROC
- survminer
- tidyverse
- mice
Commands and options
Reboot works with a command and subcommands structure:
reboot.R [subcommand] <options>
The available subcommands are listed in the help menu:
# for docker container
docker run --rm galantelab/reboot reboot.R -h
Optionally:
# for direct installation
reboot.R -h
Similarly, the version may be displayed with:
# for docker container
docker run --rm galantelab/reboot reboot.R -v
Optionally:
# for direct installation
reboot.R -v
In summary, 3 subcommands are available:
regression | generates a signature through multivariate Cox regression |
survival | applies a signature score in survival analysis |
complete | generates a signature and applies its derived score in survival analysis |
Finding signatures of genes or transcripts
Reboot searches for a genetic signature (significance coefficients) correlated with patient survival, based on a multivariate Cox regression of genes or transcripts. This module uses a LASSO algorithm combined with a bootstrap approach to cope with dimensionality problems (especially when the data has many attributes and few instances).
In addition, a number of filters are implemented. The analysis starts by checking that the provided expression dataset has at least 20 variables (genes/transcripts), the minimum needed for the next steps to run correctly. At least 10 samples are also required because of the cross-validation steps performed to optimize the choice of LASSO coefficients. Then, data attributes with variance lower than a user-defined cutoff are removed. If low variance is detected in the provided follow-up times, or if the proportion between individuals with diverging survival status (e.g., dead and alive) is greater than 80% or lower than 20%, the analysis halts and returns a warning because of possible convergence problems in the regression process. Before the analysis starts, a univariate Schoenfeld’s test filter is also applied to exclude variables that do not meet the proportional hazards assumption of Cox models. Next, a Spearman’s correlation filter is applied in every iteration of the bootstrap process, based on the user-defined allowed fraction of pairs with correlation coefficients higher than 0.8 and p-values lower than 0.05. To minimize false positives in the results, empirical coefficient cutoffs were established based on random bootstrap resampling analyses: genes with coefficients higher than 0.003 are selected, whereas only transcripts with coefficients higher than 0.012 are chosen.
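The dataset-size and survival-status checks described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not Reboot's R implementation: the function name `precheck` and its parameters are hypothetical, and the variance, Schoenfeld and Spearman filters are left out.

```python
def precheck(statuses, n_variables, min_samples=10, min_variables=20,
             low=0.20, high=0.80):
    """Return a list of problems; an empty list means the checks passed.

    statuses: one survival status per sample (0 = alive, 1 = dead).
    n_variables: number of gene/transcript columns in the expression table.
    All names here are hypothetical; only the thresholds come from the text.
    """
    problems = []
    if n_variables < min_variables:
        problems.append(f"only {n_variables} variables, need at least {min_variables}")
    if len(statuses) < min_samples:
        problems.append(f"only {len(statuses)} samples, need at least {min_samples}")
    if statuses:
        # Fraction of samples in which the event (status 1) was observed.
        event_fraction = sum(1 for s in statuses if s == 1) / len(statuses)
        if event_fraction > high or event_fraction < low:
            problems.append(
                f"event fraction {event_fraction:.2f} is outside [{low}, {high}]"
            )
    return problems


# 12 samples with balanced status and 50 variables: nothing to report.
print(precheck([0, 1] * 6, n_variables=50))
```

Returning a list of messages rather than stopping at the first failure is a design choice of this sketch, so that all problems in a dataset can be reported at once.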
Coverage (Cv) is defined as the average number of times an attribute is drawn after B iterations, according to the formula:
Cv = (B * G) / N
Where (N) is the number of attributes (genes or transcripts) and (G) is the group size, i.e., the number of attributes to be included in each iteration. For optimal algorithm convergence, we recommend a group size (G) between 10 and 15 attributes per iteration. To ensure a satisfactory level of group combinations, we suggest running each attribute in N/G simulations on average. In other words: Cv = N/G, which yields:
B = (N / G)²
For instance, when (N) is 100 and (G) is 10, the recommended number of bootstrap iterations would be 100 and, as a consequence, the number of appearances of each attribute would be 10, as exemplified in the following table:
N | G | B | Cv |
---|---|---|---|
100 | 10 | 100 | 10 |
300 | 15 | 400 | 20 |
500 | 12 | 1736 | 42 |
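The rows of the table can be reproduced with a short Python sketch of the two formulas above. The function name `recommended_bootstrap` is hypothetical, and rounding B to the nearest whole iteration is an assumption made here to match the last row.

```python
def recommended_bootstrap(n_attributes, group_size):
    """Return (B, Cv) for N attributes and group size G, using B = (N / G)^2."""
    if n_attributes <= 0 or group_size <= 0:
        raise ValueError("N and G must be positive")
    # Assumption: B is rounded to the nearest whole number of iterations.
    b = round((n_attributes / group_size) ** 2)
    # Cv = (B * G) / N, the average number of times each attribute is drawn.
    cv = b * group_size / n_attributes
    return b, cv


for n, g in [(100, 10), (300, 15), (500, 12)]:
    b, cv = recommended_bootstrap(n, g)
    print(f"N={n} G={g} B={b} Cv={cv:.0f}")
```

The resulting B is the value that would be passed to the -B option, with G passed to -G.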
We recommend avoiding regression sets that do not present a unimodal distribution, since they may introduce bias. The running time of the regression analysis implemented in Reboot grows approximately linearly with the number of bootstrap iterations. If the analysis should take no longer than a period of the day, we encourage a maximum of 10,000 iterations.
Usage
To obtain a gene or transcript signature, run the following:
# for docker container
docker run --rm -v $(pwd):$(pwd) galantelab/reboot reboot.R regression <options>
Optionally:
# for direct installation
reboot.R regression <options>
Regression options are:
Options | Description |
---|---|
-I, --filein | Input file name. Tab-separated values (.tsv) file containing genes or transcripts expression and survival parameters |
-O, --outprefix | Output file prefix (string). Default: reboot |
-B, --bootstrap | Number of iterations for bootstrap simulation (integer). Default: 1 |
-G, --groupsize | Number of genes or transcripts to be selected in each bootstrap simulation (integer). Default: 10 |
-P, --pcentfilter | Percentage of correlated gene or transcript pairs allowed in each iteration (double). Default: 0.3 |
-V, --varfilter | Minimum normalized variance (0-1) required for each gene or transcript among samples (double). Default: 0.01 |
-F, --force | Choose -F to bypass OS and OStime filters |
-h, --help | Show this help message and exit |
Input
To produce a genetic signature, Reboot requires a .tsv file containing normalized expression values in Transcripts Per Million (TPM) or in Fragments Per Kilobase per Million (FPKM) for genes or transcripts across multiple samples, in addition to survival data: survival status (e.g., 0 = alive or 1 = dead) and follow up time:
Sample ID | OS | OS.time | PARPBP | RAD51 | … |
---|---|---|---|---|---|
patient_1 | 1 | 448 | 41.81557 | 34.70869 | … |
patient_2 | 0 | 466 | 24.78227 | 64.80153 | … |
… | … | … | … | … | … |
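As an illustration of this layout, the sketch below parses a small tab-separated table with Python's standard library and separates the survival columns from the expression values. The helper `split_input` is hypothetical, and treating the first column as the sample identifier is an assumption; the column names OS and OS.time are taken from the table above.

```python
import csv
import io

SURVIVAL_COLUMNS = ("OS", "OS.time")


def split_input(tsv_text):
    """Split an input table into survival data and expression values per sample."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_column = reader.fieldnames[0]  # assumption: first column is the sample ID
    missing = [c for c in SURVIVAL_COLUMNS if c not in reader.fieldnames]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    # Every remaining column is treated as a gene/transcript expression column.
    features = [c for c in reader.fieldnames[1:] if c not in SURVIVAL_COLUMNS]
    survival, expression = {}, {}
    for row in reader:
        sample = row[id_column]
        survival[sample] = (int(row["OS"]), float(row["OS.time"]))
        expression[sample] = {f: float(row[f]) for f in features}
    return survival, expression


example = (
    "Sample ID\tOS\tOS.time\tPARPBP\tRAD51\n"
    "patient_1\t1\t448\t41.81557\t34.70869\n"
    "patient_2\t0\t466\t24.78227\t64.80153\n"
)
survival, expression = split_input(example)
print(survival["patient_1"], expression["patient_2"]["RAD51"])
```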
Output
As a result, Reboot generates one log file, a .tsv file containing regression coefficients and 2 plots. The .tsv file is in the following format:
Feature name | coefficient |
---|---|
PARPBP | 0.17014 |
CXCR6 | 0.22173 |
… | … |
Coefficients may be interpreted according to their absolute value and sign. The larger the absolute value of a coefficient, the higher its significance. Positive coefficients contribute to a bad prognosis, while negative coefficients contribute to a good prognosis.
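This reading of the coefficients can be sketched as a small Python helper that ranks features by absolute value and labels the direction of each one. The function `rank_signature` and the feature `GENE_X` with its negative coefficient are hypothetical, added only to show both signs; the other two values come from the table above.

```python
def rank_signature(coefficients):
    """Sort features by absolute coefficient (largest first) and label the sign."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        (name, value, "worse prognosis" if value > 0 else "better prognosis")
        for name, value in ranked
        if value != 0  # a zero coefficient carries no direction
    ]


signature = {
    "PARPBP": 0.17014,
    "CXCR6": 0.22173,
    "GENE_X": -0.30,  # hypothetical feature, not part of any real signature
}
for name, value, label in rank_signature(signature):
    print(f"{name}\t{value}\t{label}")
```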
The plots generated are a histogram with the distribution of the regression coefficients and a lollipop plot with the most relevant coefficients (see below).
Applying signatures of genes or transcripts in survival
Reboot produces and applies a score for all samples based on the signature previously obtained from the regression module. Reboot also offers a multivariate option, in which further clinical variables (e.g., therapy, age and gender) can be loaded into a multivariate survival model. Multiple univariate analyses are executed, and only variables with a p-value <= 0.2 that also pass the Schoenfeld’s test are selected for the final multivariate model. Statistical tests are performed to evaluate the relevance of the signature score, along with the co-variables, as prognostic factors of a given event (overall / progression-free / recurrence-free survival).
By default, both univariate and multivariate survival analyses use the median score value as a cutoff to stratify patients into high and low signature score groups. Alternatively, this cutoff value may be based on a Receiver Operating Characteristic (ROC) curve using the Nearest Neighbour Estimate (NNE) method and the Youden statistic, where J = sensitivity + specificity - 1. If more than one optimal J value is available, the first one is chosen.
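The two cutoff strategies can be illustrated with a simplified Python sketch. It is not Reboot's procedure: Reboot estimates a time-dependent ROC curve with the NNE method, whereas this example treats the survival status as a plain binary label. The function names, the example scores and the convention that a score equal to the median goes to the low group are all assumptions.

```python
from statistics import median


def median_split(scores):
    """Return the median cutoff and a high/low label per score."""
    cutoff = median(scores)
    # Assumption: scores equal to the cutoff are placed in the low group.
    return cutoff, ["high" if s > cutoff else "low" for s in scores]


def youden_cutoff(scores, events):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A sample is predicted as an event when its score >= cutoff.
    On ties, the first (lowest) cutoff reaching the maximum is kept.
    """
    positives = sum(events)
    negatives = len(events) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both event classes are required")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, events) if e == 1 and s >= cutoff)
        tn = sum(1 for s, e in zip(scores, events) if e == 0 and s < cutoff)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # strict comparison keeps the first maximum
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]  # made-up signature scores
events = [0, 0, 1, 1, 1, 0]               # made-up survival status
print(median_split(scores))
print(youden_cutoff(scores, events))
```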
If a multivariate analysis is performed based on a ROC curve, a bootstrap resampling method is applied, provided that the clinical dataset passes the following additional filters: (i) the final dataset retains at least 70% of the original one (Not Available, NA, filter); and (ii) the frequency of the less abundant category of each co-variable is not less than 20% (proportion filter). Otherwise, the multivariate analysis is performed without the bootstrap method. After 100 iterations, the frequency with which each co-variable is relevant to the event is calculated.
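The NA and proportion filters can be sketched in Python as follows. This is one possible reading of the two conditions, not Reboot's code: it assumes the "final dataset" is the set of samples with no NA in any co-variable, and the function name `passes_bootstrap_filters` and the example data are hypothetical.

```python
from collections import Counter


def passes_bootstrap_filters(rows, min_complete=0.70, min_minor=0.20):
    """Check the NA filter and the proportion filter on a clinical table.

    rows: list of dicts mapping co-variable name -> category (None means NA).
    """
    if not rows:
        return False
    # NA filter: keep samples with no missing value and compare with the original size.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    if len(complete) / len(rows) < min_complete:
        return False
    # Proportion filter: the rarer category of each co-variable must reach min_minor.
    for variable in rows[0]:
        counts = Counter(r[variable] for r in complete)
        if len(counts) < 2:
            return False  # a single category leaves nothing to compare
        if min(counts.values()) / len(complete) < min_minor:
            return False
    return True


clinical = (
    [{"gender": "male", "therapy": "radiation"}] * 5
    + [{"gender": "female", "therapy": "radiation"}] * 3
    + [{"gender": "female", "therapy": "chemoradiation"}] * 2
)
print(passes_bootstrap_filters(clinical))
```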
Usage
To validate a signature of genes or transcripts in survival analysis, run the following:
# for docker container
docker run --rm -v $(pwd):$(pwd) galantelab/reboot reboot.R survival <options>
Optionally:
# for direct installation
reboot.R survival <options>
Survival options are:
Options | Description |
---|---|
-I, --filein | Input file name. Tab-separated values (.tsv) file containing genes or transcripts expression and survival parameters |
-O, --outprefix | Output file prefix (string). Default: reboot |
-M, --multivariate | If clinical variables should be included, choose -M. This option is tied with -C option. Default: FALSE |
-C, --clinical | Tab-separated values (.tsv) file containing binary categorical variables only. Required if -M option is chosen |
-R, --roc | To categorize the genetic score according to a ROC curve instead of median value, choose -R. Default: FALSE |
-S, --signature | Tab-separated values (.tsv) file containing a set of genes or transcripts and corresponding cox coefficients |
-F, --force | Choose -F to bypass OS and OStime filters |
-h, --help | Show this help message and exit |
Inputs
Survival analyses may be run in univariate or multivariate mode. Required inputs depend on this choice.
-
Univariate mode
This is the simplest mode and requires a single input file. The expected .tsv file contains a set of features (genes/transcripts) and their corresponding coefficients provided as output by the regression module:
Feature name | coefficient |
---|---|
PARPBP | 0.17014 |
CXCR6 | 0.22173 |
… | … |
-
Multivariate mode
In case multivariate mode is chosen, a .tsv file containing clinical information is also necessary. Note that all clinical variables MUST be categorical and present ONLY 2 classes (NA values are allowed):
Sample ID | age | gender | therapy | … |
---|---|---|---|---|
patient_1 | 18-55 years | male | radiation | … |
patient_2 | 56+ years | female | chemoradiation | … |
… | … | … | … | … |
Outputs
Depending on whether the survival analysis was performed in univariate or multivariate mode, a different set of output files is created.
-
Univariate mode
If the analysis is performed in univariate mode, Reboot returns a log and a lograng.txt file, containing the survival results for the signature score:
feature | coefficient | hazard.ratio | log.rank.pvalue | low.high.samples | median.survival.low | median.survival.high | prognosis |
---|---|---|---|---|---|---|---|
score | -1.0091 | 0.3645 (95% CI, 0.2456-0.541) | 0.003 | 52/53 | 532 (95% CI, 455-648) | 313 (95% CI, 231-362) | better |
Plots returned in this mode include a proportional hazard assumptions plot (result of the Schoenfeld test) and a Kaplan Meier plot (see below).
-
Multivariate mode
If the analysis is performed in multivariate mode, Reboot returns all files created in the univariate mode in addition to a multicox.txt file, which contains the survival results of the signature score along with all other clinical variables:
variable | reference | univariate.hazard.ratio | univariate.Cox.pvalue | univariate.prognosis | multivariate.hazard.ratio | multivariate.Cox.pvalue | multivariate.prognosis |
---|---|---|---|---|---|---|---|
score | low | 0.3645 (95% CI, 0.2456-0.541) | 0.001 | better | 0.3904 (95% CI, 0.2248-0.6779) | 8e-04 | better |
age | 56+ years | 1.369 (95% CI, 0.9086-2.0625) | 0.1332 | worse | 1.1104 (95% CI, 0.6314-1.9531) | 0.7161 | ---- |
gender | MALE | 0.9474 (95% CI, 0.6381-1.4066) | 0.7886 | ---- | ---- | ---- | ---- |
... | ... | ... | ... | ... | ... | ... | ... |
Plots returned in this mode include a forest plot for all clinical variables, a Kaplan Meier plot, and a proportional hazard assumptions plot (Schoenfeld tests). If the -R (--roc) option is selected, only the most relevant variables (p-value <= 0.05 in at least 25% of iterations) are plotted. A ROC curve and a histogram of co-variable frequencies are also provided (see below).
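The relevance criterion used for plotting in ROC mode (p-value <= 0.05 in at least 25% of iterations) can be sketched as follows. The function `relevant_variables`, the input layout and the p-values are hypothetical, and the sketch assumes that every variable is tested in every iteration, so the denominator is the total number of iterations.

```python
def relevant_variables(pvalues_by_iteration, alpha=0.05, min_fraction=0.25):
    """Return {variable: relevance frequency} for variables reaching min_fraction.

    pvalues_by_iteration: one dict per bootstrap iteration, variable -> p-value.
    """
    total = len(pvalues_by_iteration)
    if total == 0:
        return {}
    hits = {}
    for iteration in pvalues_by_iteration:
        for variable, p in iteration.items():
            hits.setdefault(variable, 0)
            if p <= alpha:
                hits[variable] += 1
    return {v: n / total for v, n in hits.items() if n / total >= min_fraction}


# Four made-up iterations instead of the 100 used by Reboot.
iterations = [
    {"score": 0.001, "age": 0.30, "gender": 0.70},
    {"score": 0.010, "age": 0.04, "gender": 0.80},
    {"score": 0.200, "age": 0.50, "gender": 0.60},
    {"score": 0.030, "age": 0.60, "gender": 0.90},
]
print(relevant_variables(iterations))
```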
Integrative analysis
Reboot also provides a subcommand to perform the full analyses (regression followed by survival) in a one-step process. To execute it, run the following:
# for docker container
docker run --rm -v $(pwd):$(pwd) galantelab/reboot reboot.R complete <options>
Optionally:
# for direct installation
reboot.R complete <options>
Complete options are:
Options | Description |
---|---|
-I, --filein | Input file name. Tab-separated values (.tsv) file containing genes or transcripts expression and survival parameters |
-O, --outprefix | Output file prefix (string). Default: reboot |
-B, --bootstrap | Number of iterations for bootstrap simulation (integer). Default: 1 |
-G, --groupsize | Number of genes or transcripts to be selected in each bootstrap simulation (integer). Default: 10 |
-P, --pcentfilter | Percentage of correlated gene or transcript pairs allowed in each iteration (double). Default: 0.3 |
-V, --varfilter | Minimum normalized variance (0-1) required for each gene or transcript among samples (double). Default: 0.01 |
-M, --multivariate | If clinical variables should be included, choose -M. This option is tied with -C option. Default: FALSE |
-C, --clinical | Tab-separated values (.tsv) file containing binary categorical variables only. Required if -M option is chosen |
-R, --roc | To categorize the genetic score according to a ROC curve instead of median value, choose -R. Default: FALSE |
-F, --force | Choose -F to bypass OS and OStime filters |
-h, --help | Show this help message and exit |
Toy example
To illustrate how easy it is to use Reboot, we provide a framework to retrieve a small dataset (here called the toy dataset) from TCGA and use it in Reboot.
We provide a script to download and format gene expression and clinical data of glioblastoma patients from TCGA. It can be run with the following command from the reboot directory:
# for docker container
docker run --env MYID=$(id -u) --rm -ti -v $(pwd):$(pwd) -w $(pwd) galantelab/reboot toyfordocker.R
Optionally:
# for direct usage
toyscript.R
This command returns the 2 .tsv files mentioned above, called expression.tsv and clinical.tsv. A MANIFEST.txt file and a set of expression and clinical data files are also created as intermediates of the TCGA download process. The expression dataset comprises the clinical variables OS (survival status) and OS.time (follow-up time), plus the expression (FPKM) of 50 randomly picked genes.
Finally, Reboot can be run in complete mode:
# for docker container
docker run -u $(id -u):$(id -g) --rm -v $(pwd):$(pwd) -w $(pwd) galantelab/reboot reboot.R complete -I expression.tsv -O toy -B 100 -G 10 -M -C clinical.tsv -F
Optionally:
# for direct usage
reboot.R complete -I expression.tsv -O toy -B 100 -G 10 -M -C clinical.tsv -F
Authors
Felipe Rodolfo Camargo dos Santos*
Gabriela Der Agopian Guardia*
Filipe Ferreira dos Santos*
Daniel Takatori Ohara
Pedro Alexandre Favoretto Galante
*These authors contributed equally to this work