Reboot is a flexible, easy-to-use algorithm to identify and validate gene or transcript signatures whose expression is highly correlated with patient survival. This tool innovates by using a multivariate strategy with penalized Cox regression (Lasso method) combined with a bootstrap approach, which provides robust convergence of the regression procedure, and by offering a variety of statistical tests for the signature score. Reboot comprises two modules developed in R (version 4.0.4). The regression module obtains gene/transcript signatures from a given set of samples. In turn, the survival module produces, applies and validates a score, calculated from the obtained signature, in patient datasets. In this module, a different set of samples may be provided for validation purposes, and clinical variables may also be taken into account. Moreover, Reboot has the execution option complete, which runs the two aforementioned modules in a single step.

Reboot workflow: the first module (regression) performs a regression analysis to identify a gene or transcript signature. The second module (survival) runs a survival analysis of a score calculated from the obtained signature.

Installation

Reboot can be obtained from GitHub and installed via a Docker container (recommended) or through direct installation.

Docker container

This method works on any distribution or operating system, as long as Docker is installed.

docker pull galantelab/reboot

Direct installation

This method requires R (version >= 4.0.4) to be installed beforehand:

git clone https://github.com/galantelab/reboot.git

sudo sh reboot/install.sh

Dependencies

In order to work properly, the following packages are necessary (included in the installation procedures):

argparse
BiocManager
BioinformaticsFMRP/TCGAbiolinks (from BiocManager)
data.table
extrafont
forestmodel
hash
OptimalCutpoints
optparse
penalized
remotes (from BiocManager)
R.utils
sjlabelled
sjmisc
sjstats
survcomp (from BiocManager)
survival
survivalROC
survminer
tidyverse
mice

Commands and options

Reboot works with a command and subcommands structure:

reboot.R [subcommand] <options>

The available subcommands are listed in the help menu:

# for docker container
docker run --rm galantelab/reboot reboot.R -h

Optionally:

# for direct installation
reboot.R -h

Similarly, the version may be displayed with:

# for docker container
docker run --rm galantelab/reboot reboot.R -v

Optionally:

# for direct installation
reboot.R -v

In summary, 3 subcommands are available:

regression	generates a signature through multivariate Cox regression
survival	applies a signature score in survival analysis
complete	generates a signature and applies its derived score in survival analysis

Finding signatures of genes or transcripts

Reboot searches for a genetic signature (significance coefficients) correlated with patient survival based on a multivariate Cox regression of genes or transcripts. This module uses a LASSO algorithm combined with a bootstrap approach to deal with possible dimension vulnerabilities (especially when data has many attributes and few instances).

In addition, a number of filters are implemented. The Reboot analysis starts by checking that the provided expression dataset has the minimum of 20 variables (genes/transcripts) required for the next steps to run correctly. A minimum of 10 samples is also required because of the cross-validation steps performed to optimize LASSO coefficient choice. Then, data attributes with variance lower than a user-defined cutoff are removed. If low variance is detected in the provided follow-up times, or if the proportion between individuals with diverging survival status (e.g., dead and alive) is greater than 80% or lower than 20%, the analysis halts and returns a warning because of possible convergence problems in the regression process. Before the analysis starts, a univariate Schoenfeld’s test filter is also applied to exclude variables that do not meet the proportional hazards assumption of Cox models. Next, a Spearman’s correlation filter is applied at every iteration of the bootstrap process, based on the user-defined allowed fraction of pairs with correlation coefficients higher than 0.8 and p-values lower than 0.05. To minimize false positives in the results, empirical coefficient cutoffs were established based on random bootstrap resampling analyses: genes are selected only with coefficients higher than 0.003, and transcripts only with coefficients higher than 0.012.
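The first of these checks can be sketched in a few lines of Python. This is only an illustration of the thresholds stated above (20 variables, 10 samples, survival status proportion between 20% and 80%); the function name `precheck` and its return format are hypothetical and are not part of Reboot, whose actual implementation is in R.

```python
MIN_FEATURES = 20           # minimum genes/transcripts, as stated in the text
MIN_SAMPLES = 10            # minimum samples, as stated in the text
STATUS_RANGE = (0.2, 0.8)   # allowed proportion of samples with status 1


def precheck(n_features, statuses):
    """Return a list of problems found; an empty list means the checks pass.

    `statuses` holds one survival status per sample (0 = alive, 1 = dead).
    Hypothetical helper, not Reboot's own code.
    """
    problems = []
    if n_features < MIN_FEATURES:
        problems.append(f"only {n_features} features, need at least {MIN_FEATURES}")
    if len(statuses) < MIN_SAMPLES:
        problems.append(f"only {len(statuses)} samples, need at least {MIN_SAMPLES}")
    if statuses:
        proportion = sum(1 for s in statuses if s == 1) / len(statuses)
        low, high = STATUS_RANGE
        if proportion < low or proportion > high:
            problems.append(f"status proportion {proportion:.2f} outside [{low}, {high}]")
    return problems


# 50 features, 12 samples, balanced status: nothing to report.
print(precheck(50, [1] * 6 + [0] * 6))
```

The variance, Schoenfeld and Spearman filters are not covered here, since they depend on statistical routines that this sketch does not attempt to reproduce.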

Coverage (Cv) is defined as the average number of times an attribute is drawn after B iterations, according to the formula:

Cv = (B * G) / N

Where (N) is the number of attributes (genes or transcripts) and (G) is the group size, i.e., the number of attributes included in each iteration. For optimal algorithm convergence, we recommend a group size (G) between 10 and 15 attributes per iteration. To ensure a satisfactory level of group combinations, we suggest that each attribute take part in N/G simulations on average. In other words, Cv = N/G, which, substituted into the formula above, yields:

B = (N / G)²

For instance, when (N) is 100 and (G) is 10, the recommended number of bootstrap iterations is 100 and, as a consequence, each attribute appears 10 times on average. Further examples (with B and Cv rounded to the nearest integer) are given in the following table:

N	G	B	Cv
100	10	100	10
300	15	400	20
500	12	1736	42
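The two formulas above are easy to evaluate for any dataset size. The following Python sketch reproduces the rows of the table; the function name `recommended_bootstrap` is hypothetical, and rounding B to the nearest integer is an assumption made to match the table.

```python
def recommended_bootstrap(n_attributes, group_size):
    """Return (B, Cv) using B = (N / G)^2 and Cv = (B * G) / N."""
    if n_attributes <= 0 or group_size <= 0:
        raise ValueError("N and G must be positive")
    # B is rounded because the number of iterations must be an integer.
    b = round((n_attributes / group_size) ** 2)
    cv = b * group_size / n_attributes
    return b, cv


for n, g in [(100, 10), (300, 15), (500, 12)]:
    b, cv = recommended_bootstrap(n, g)
    print(f"N={n} G={g} B={b} Cv={cv:.1f}")
```

The resulting B is the value that would be passed to the -B option, with G passed to -G.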

We recommend avoiding regression sets that do not present a unimodal distribution, since they may introduce bias. The running time of the regression analysis implemented in Reboot grows approximately linearly with the number of bootstrap iterations. If the analysis should take no longer than part of a day, we recommend a maximum of 10,000 iterations.

Usage

To obtain a gene or transcript signature, run the following:

   # for docker container
   docker run --rm -v $(pwd):$(pwd) galantelab/reboot reboot.R regression <options>

Optionally:

   # for direct installation
   reboot.R regression <options>

Regression options are:

Options	Description
-I, --filein	Input file name. Tab-separated values (.tsv) file containing genes or transcripts expression and survival parameters
-O, --outprefix	Output file prefix (string). Default: reboot
-B, --bootstrap	Number of iterations for bootstrap simulation (integer). Default: 1
-G, --groupsize	Number of genes or transcripts to be selected in each bootstrap simulation (integer). Default: 10
-P, --pcentfilter	Percentage of correlated gene or transcript pairs allowed in each iteration (double). Default: 0.3
-V, --varfilter	Minimum normalized variance (0-1) required for each gene or transcript among samples (double). Default: 0.01
-F, --force	Choose -F to bypass OS and OStime filters
-h, --help	Show this help message and exit

Input

To produce a genetic signature, Reboot requires a .tsv file containing normalized expression values in Transcripts Per Million (TPM) or in Fragments Per Kilobase per Million (FPKM) for genes or transcripts across multiple samples, in addition to survival data: survival status (e.g., 0 = alive or 1 = dead) and follow-up time:

Sample ID	OS	OS.time	PARPBP	RAD51	…
patient_1	1	448	41.81557	34.70869	…
patient_2	0	466	24.78227	64.80153	…
…	…	…	…	…	…
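To check a file against this layout before running Reboot, it can be read with Python's standard csv module. The sketch below assumes the column order shown in the example (sample identifier first, then OS and OS.time, then one column per gene or transcript); the function name `read_expression` is hypothetical and the two data rows are copied from the table above.

```python
import csv
import io

# Tiny stand-in for an input file, using the rows of the example table.
EXAMPLE_TSV = (
    "Sample ID\tOS\tOS.time\tPARPBP\tRAD51\n"
    "patient_1\t1\t448\t41.81557\t34.70869\n"
    "patient_2\t0\t466\t24.78227\t64.80153\n"
)


def read_expression(handle):
    """Split each row into survival data and a feature -> expression mapping."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_column = reader.fieldnames[0]
    features = [c for c in reader.fieldnames if c not in (id_column, "OS", "OS.time")]
    samples = {}
    for row in reader:
        samples[row[id_column]] = {
            "status": int(row["OS"]),        # 0 = alive, 1 = dead
            "time": float(row["OS.time"]),   # follow-up time
            "expression": {f: float(row[f]) for f in features},
        }
    return features, samples


features, samples = read_expression(io.StringIO(EXAMPLE_TSV))
print(features, len(samples))
```

A conversion error raised by int() or float() here points to a non-numeric cell, which is worth fixing before the file is passed to -I.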

Output

As a result, Reboot generates one log file, a .tsv file containing regression coefficients and 2 plots. The .tsv file is in the following format:

Feature name	coefficient
PARPBP	0.17014
CXCR6	0.22173
…	…

Coefficients may be interpreted according to their absolute value and sign. The larger the absolute value of a coefficient, the higher its significance. Positive coefficients contribute to a worse prognosis, while negative coefficients contribute to a better prognosis.
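As an illustration of this reading, the Python sketch below orders a signature by decreasing absolute coefficient and labels the direction of each entry. The function name `rank_signature` is hypothetical, and GENE_X is a made-up entry added only to show a negative coefficient; the other two values come from the example above.

```python
# PARPBP and CXCR6 are taken from the example output; GENE_X is invented.
signature = {"PARPBP": 0.17014, "CXCR6": 0.22173, "GENE_X": -0.05}


def direction(coefficient):
    """Map the sign of a coefficient to its prognostic reading."""
    if coefficient > 0:
        return "worse prognosis"
    if coefficient < 0:
        return "better prognosis"
    return "no contribution"


def rank_signature(signature):
    """Order features by decreasing |coefficient| and label the direction."""
    ranked = sorted(signature.items(), key=lambda item: abs(item[1]), reverse=True)
    return [(name, coef, direction(coef)) for name, coef in ranked]


for name, coef, label in rank_signature(signature):
    print(f"{name}\t{coef}\t{label}")
```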

The plots generated are a histogram with the distribution of the regression coefficients and a lollipop plot with the most relevant coefficients (see below).

Applying signatures of genes or transcripts in survival

Reboot produces and applies a score for all samples based on the signature previously obtained from the regression module. In addition, Reboot offers a multivariate option, in which further clinical variables (e.g., therapy, age and gender) can be included in a multivariate survival model. Multiple univariate analyses are executed, and only variables that have a p-value <= 0.2 and pass the Schoenfeld’s test are selected for the final multivariate model. Statistical tests are performed to evaluate the relevance of the signature score, along with the co-variables, as prognostic factors of a given event (overall / progression-free / recurrence-free survival).
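The selection rule for the multivariate model can be written as a simple filter. In the Python sketch below, the function name `select_for_multivariate` and the input format are hypothetical, the Schoenfeld flags are assumed values, and the "therapy" entry is invented for illustration; the p-values for score, age and gender match the example multicox.txt output shown in the Outputs section.

```python
P_VALUE_THRESHOLD = 0.2  # univariate cutoff stated in the text


def select_for_multivariate(univariate_results):
    """Keep variables with p-value <= 0.2 that also passed the Schoenfeld test.

    `univariate_results` maps a variable name to a tuple
    (univariate p-value, passed_schoenfeld).
    """
    return [
        name
        for name, (p_value, passed_schoenfeld) in univariate_results.items()
        if p_value <= P_VALUE_THRESHOLD and passed_schoenfeld
    ]


results = {
    "score": (0.001, True),
    "age": (0.1332, True),
    "gender": (0.7886, True),   # dropped: p-value above 0.2
    "therapy": (0.05, False),   # invented entry: fails the Schoenfeld test
}
print(select_for_multivariate(results))
```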

By default, both univariate and multivariate survival analyses use the median score value as a cutoff to stratify patients into high and low signature score groups. Alternatively, this cutoff value may be based on a Receiver Operating Characteristic (ROC) curve using the Nearest Neighbour Estimate (NNE) method and Youden’s J statistic, where J = sensitivity + specificity - 1. If more than one cutoff yields the maximum J, the first one is chosen.
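Both cutoff strategies can be sketched in Python. The function names are hypothetical, the assignment of samples exactly at the median to the low group is an assumption, and the sensitivity and specificity values are invented; the NNE estimation of the ROC curve itself is not reproduced here, so the candidate cutoffs are taken as given.

```python
from statistics import median


def stratify_by_median(scores):
    """Label each sample 'high' or 'low' relative to the median score."""
    cutoff = median(scores.values())
    # Assumption: samples exactly at the median fall in the low group.
    return {sample: ("high" if value > cutoff else "low")
            for sample, value in scores.items()}


def youden_cutoff(candidates):
    """Pick the cutoff with the highest J = sensitivity + specificity - 1.

    `candidates` is a list of (cutoff, sensitivity, specificity) tuples.
    The strict comparison keeps the first cutoff when several share the
    maximum J.
    """
    best_cutoff, best_j = None, float("-inf")
    for cutoff, sensitivity, specificity in candidates:
        j = sensitivity + specificity - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


groups = stratify_by_median({"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0})
cutoff, j = youden_cutoff([(0.5, 0.6, 0.7), (1.0, 0.8, 0.75), (1.5, 0.75, 0.8)])
print(groups, cutoff)
```

In the example call, the last two candidates share the same J, and the earlier one is returned.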

If a multivariate analysis is performed based on a ROC curve, a bootstrap resampling method is applied, provided that the clinical dataset passes the following additional filters: (i) the final dataset retains at least 70% of the original one (Not Available, or NA, filter); and (ii) the frequency of the less abundant category of each co-variable is not less than 20% (proportion filter). Otherwise, the multivariate analysis is performed without the bootstrap method. After 100 iterations, the relevance frequency of each co-variable with the event is calculated.
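A rough Python sketch of these two filters follows. It assumes that the NA filter drops every sample with at least one missing value and that the proportion filter is evaluated on the retained samples; the source does not spell out either detail, and the function name `passes_bootstrap_filters` is hypothetical.

```python
from collections import Counter


def passes_bootstrap_filters(rows, min_retained=0.7, min_minor_fraction=0.2):
    """Return True if the clinical data pass the NA and proportion filters.

    `rows` is a list of dicts mapping variable name -> category, with None
    standing for NA.
    """
    if not rows:
        return False
    # NA filter: keep complete samples, require at least 70% of the original.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    if len(complete) / len(rows) < min_retained:
        return False
    # Proportion filter: the rarer category must reach 20% for each variable.
    for variable in rows[0]:
        counts = Counter(r[variable] for r in complete)
        if len(counts) < 2:
            return False
        if min(counts.values()) / len(complete) < min_minor_fraction:
            return False
    return True


balanced = [{"gender": "male"}] * 5 + [{"gender": "female"}] * 5
print(passes_bootstrap_filters(balanced))
```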

Usage

To validate a signature of genes or transcripts in survival analysis, run the following:

   # for docker container
   docker run --rm -v $(pwd):$(pwd) galantelab/reboot reboot.R survival <options>

Optionally:

   # for direct installation
   reboot.R survival <options>

Survival options are:

Options	Description
-I, --filein	Input file name. Tab-separated values (.tsv) file containing genes or transcripts expression and survival parameters
-O, --outprefix	Output file prefix (string). Default: reboot
-M, --multivariate	If clinical variables should be included, choose -M. This option is tied with -C option. Default: FALSE
-C, --clinical	Tab-separated values (.tsv) file containing binary categorical variables only. Required if -M option is chosen
-R, --roc	To categorize the genetic score according to a ROC curve instead of median value, choose -R. Default: FALSE
-S, --signature	Tab-separated values (.tsv) file containing a set of genes or transcripts and corresponding cox coefficients
-F, --force	Choose -F to bypass OS and OStime filters
-h, --help	Show this help message and exit

Inputs

Survival analyses may be run in univariate or multivariate mode. Required inputs depend on this choice.

Univariate mode

This is the simplest mode and, besides the expression file given with -I, requires a single signature file (-S). The expected .tsv file contains a set of features (genes/transcripts) and their corresponding coefficients, as provided in the output of the regression module:

Feature name	coefficient
PARPBP	0.17014
CXCR6	0.22173
…	…

Multivariate mode

If multivariate mode is chosen, a .tsv file containing clinical information is also necessary. Note that all clinical variables MUST be categorical and present ONLY 2 classes (NA values are allowed):

Sample ID	age	gender	therapy	…
patient_1	18-55 years	male	radiation	…
patient_2	56+ years	female	chemoradiation	…
…	…	…	…	…
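The two-class requirement can be verified before running the survival module. The Python sketch below reports variables that do not have exactly two classes once NA values are ignored; the function name `non_binary_variables` is hypothetical, treating None, an empty string and the literal "NA" as missing is an assumption, and the third row (including the "none" therapy class) is invented to trigger the check.

```python
MISSING = (None, "", "NA")  # assumed representations of a missing value


def non_binary_variables(rows):
    """Return the variables whose non-NA values do not form exactly 2 classes."""
    classes = {}
    for row in rows:
        for variable, value in row.items():
            seen = classes.setdefault(variable, set())
            if value not in MISSING:
                seen.add(value)
    return sorted(v for v, seen in classes.items() if len(seen) != 2)


clinical = [
    {"age": "18-55 years", "gender": "male", "therapy": "radiation"},
    {"age": "56+ years", "gender": "female", "therapy": "chemoradiation"},
    {"age": "NA", "gender": "male", "therapy": "none"},  # invented row
]
print(non_binary_variables(clinical))
```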

Outputs

Depending on whether the survival analysis was performed in univariate or multivariate mode, a different set of output files is created.

Univariate mode

If the analysis is performed in univariate mode, Reboot returns a log file and a lograng.txt file, the latter containing the survival results for the signature score:

feature	coefficient	hazard.ratio	log.rank.pvalue	low.high.samples	median.survival.low	median.survival.high	prognosis
score	-1.0091	0.3645 (95% CI, 0.2456-0.541)	0.003	52/53	532 (95% CI, 455-648)	313 (95% CI, 231-362)	better
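For downstream processing, the estimate and confidence interval cells can be split into numbers. The following Python sketch assumes the cell format shown in the row above, "estimate (95% CI, lower-upper)"; the function name `parse_estimate` is hypothetical.

```python
import re

# Matches cells such as "0.3645 (95% CI, 0.2456-0.541)".
ESTIMATE_PATTERN = re.compile(r"([\d.]+) \(95% CI, ([\d.]+)-([\d.]+)\)")


def parse_estimate(cell):
    """Return (estimate, lower, upper) as floats from an output cell."""
    match = ESTIMATE_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    estimate, lower, upper = (float(group) for group in match.groups())
    return estimate, lower, upper


hazard_ratio = parse_estimate("0.3645 (95% CI, 0.2456-0.541)")
median_low = parse_estimate("532 (95% CI, 455-648)")
print(hazard_ratio, median_low)
```

The same helper applies to the hazard ratio columns of the multivariate output, apart from cells filled with "----", which raise a ValueError here.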

Plots returned in this mode include a proportional hazard assumptions plot (result of the Schoenfeld test) and a Kaplan-Meier plot (see below).

Multivariate mode

If the analysis is performed in multivariate mode, Reboot returns all files created in the univariate mode in addition to a multicox.txt file, which contains the survival results of the signature score along with all other clinical variables:

variable	reference	univariate.hazard.ratio	univariate.Cox.pvalue	univariate.prognosis	multivariate.hazard.ratio	multivariate.Cox.pvalue	multivariate.prognosis
score	low	0.3645 (95% CI, 0.2456-0.541)	0.001	better	0.3904 (95% CI, 0.2248-0.6779)	8e-04	better
age	56+ years	1.369 (95% CI, 0.9086-2.0625)	0.1332	worse	1.1104 (95% CI, 0.6314-1.9531)	0.7161	----
gender	MALE	0.9474 (95% CI, 0.6381-1.4066)	0.7886	----	----	----	----
...	...	...	...	...	...	...	...

Plots returned in this mode include a forest plot for all clinical variables, a Kaplan-Meier plot, and a proportional hazard assumptions plot (Schoenfeld tests). If the option --roc is selected, only the most relevant variables (p-value <= 0.05 in at least 25% of iterations) are plotted. A ROC curve and a histogram of co-variable frequencies are also provided (see below).
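The relevance criterion used for plotting can be expressed as a frequency over bootstrap iterations. The Python sketch below applies the thresholds stated above (p-value <= 0.05 in at least 25% of iterations); the function name `relevant_variables` and the per-iteration p-values are invented for illustration.

```python
def relevant_variables(p_values_by_variable, alpha=0.05, min_frequency=0.25):
    """Return {variable: frequency} for variables relevant often enough.

    `p_values_by_variable` maps a variable name to the list of p-values
    obtained for it across bootstrap iterations.
    """
    selected = {}
    for variable, p_values in p_values_by_variable.items():
        if not p_values:
            continue
        frequency = sum(1 for p in p_values if p <= alpha) / len(p_values)
        if frequency >= min_frequency:
            selected[variable] = frequency
    return selected


# Invented p-values for 100 iterations per variable.
iterations = {
    "age": [0.01] * 30 + [0.5] * 70,     # relevant in 30% of iterations
    "gender": [0.01] * 10 + [0.5] * 90,  # relevant in 10% of iterations
}
print(relevant_variables(iterations))
```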

Integrative analysis

Reboot also provides a subcommand to perform the full analyses (regression followed by survival) in a one-step process. To execute it, run the following:

   # for docker container
   docker run --rm -v $(pwd):$(pwd) galantelab/reboot reboot.R complete <options>

Optionally:

   # for direct installation
   reboot.R complete <options>

Complete options are:

Options	Description
-I, --filein	Input file name. Tab-separated values (.tsv) file containing genes or transcripts expression and survival parameters
-O, --outprefix	Output file prefix (string). Default: reboot
-B, --bootstrap	Number of iterations for bootstrap simulation (integer). Default: 1
-G, --groupsize	Number of genes or transcripts to be selected in each bootstrap simulation (integer). Default: 10
-P, --pcentfilter	Percentage of correlated gene or transcript pairs allowed in each iteration (double). Default: 0.3
-V, --varfilter	Minimum normalized variance (0-1) required for each gene or transcript among samples (double). Default: 0.01
-M, --multivariate	If clinical variables should be included, choose -M. This option is tied with -C option. Default: FALSE
-C, --clinical	Tab-separated values (.tsv) file containing binary categorical variables only. Required if -M option is chosen
-R, --roc	To categorize the genetic score according to a ROC curve instead of median value, choose -R. Default: FALSE
-F, --force	Choose -F to bypass OS and OStime filters
-h, --help	Show this help message and exit

Toy example

To illustrate how easy it is to use Reboot, we provide a workflow that retrieves a small dataset (here called the toy dataset) from TCGA and runs it through Reboot.

We provide a script to download and format gene expression and clinical data of glioblastoma patients from TCGA. It can be run with the following command from within the reboot directory:

   # for docker container
   docker run --env MYID=$(id -u) --rm -ti -v $(pwd):$(pwd) -w $(pwd) galantelab/reboot toyfordocker.R

Optionally:

   # for direct usage
   toyscript.R

This command returns the 2 .tsv files described above, called expression.tsv and clinical.tsv. A MANIFEST.txt file and a set of expression and clinical data files are also created as intermediates of the TCGA download process. The expression dataset comprises the survival variables OS (survival status) and OS.time (follow-up time), along with expression values (FPKM) for 50 randomly picked genes.

Finally, Reboot can be run in the complete mode:

   # for docker container
   docker run -u $(id -u):$(id -g) --rm -v $(pwd):$(pwd) -w $(pwd) galantelab/reboot reboot.R complete -I expression.tsv -O toy -B 100 -G 10 -M -C clinical.tsv -F

Optionally:

   # for direct usage
   reboot.R complete -I expression.tsv -O toy -B 100 -G 10 -M -C clinical.tsv -F

Authors

Felipe Rodolfo Camargo dos Santos*

Gabriela Der Agopian Guardia*

Filipe Ferreira dos Santos*

Daniel Takatori Ohara

Pedro Alexandre Favoretto Galante

*These authors contributed equally to this work
