This guide provides step-by-step instructions to reproduce the analyses and results from the fedRBE preprint. It leverages the utility scripts and data provided in this repository to demonstrate both centralized and federated batch effect correction using limma and fedRBE, respectively.
The required Python and R packages are listed in requirements.txt and requirements_r.txt, respectively.

Set up Git LFS:
Install Git LFS following the Git LFS documentation.
Make sure you have set up Git LFS:
git lfs install
If you already cloned the repository, you now need to download the large files from the large file storage:
git lfs pull
If you have not yet cloned the repository, git clone will automatically download LFS files if git lfs install has been run before.
Clone the repository:
If you have not cloned the repository yet, please do so:
git clone https://github.com/Freddsle/fedRBE.git
cd fedRBE
Set up Python environment:
We recommend using a virtual environment:
python3 -m venv fedrbe_env
source fedrbe_env/bin/activate # on Windows: fedrbe_env\Scripts\activate
pip install -r requirements.txt
Set up R environment:
limma and variancePartition are Bioconductor packages and must be installed separately from the CRAN packages:
# Install CRAN packages
Rscript -e 'pkgs <- readLines("requirements_r.txt"); pkgs <- pkgs[!grepl("^#", pkgs) & nzchar(trimws(pkgs))]; install.packages(pkgs[!tolower(pkgs) %in% c("limma", "variancepartition")], repos = "https://cran.rstudio.com/")'
# Install Bioconductor packages
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install(c("limma", "variancePartition"))'
This section guides you through running both federated and centralized batch effect corrections and comparing their results.
Use the provided utility script to perform federated batch effect correction on your datasets.
python3 ./generate_fedrbe_corrected_datasets.py
Steps Performed by the Script:
- Takes the uncorrected input data from evaluation_data/[dataset]/before/.

Output:
- Stores each client's corrected data in evaluation_data/[dataset]/after/individual_results/.
- Stores the merged corrected data of all clients directly in evaluation_data/[dataset]/after/ as FedApp_corrected_data.tsv, or FedApp_corrected_data_smpc.tsv if SMPC was used.
- Detailed logs and correction reports can be found in evaluation_data/[dataset]/after/individual_results/.

Note: The script may take some time to complete, depending on the dataset size and the number of clients; it usually takes a few hours up to a day.
Note 2: Processing the microarray dataset requires more than 16 GB of RAM. To skip the correction of the microarray datasets, comment out the corresponding lines in the script (in generate_fedrbe_corrected_datasets.py, search for experiments.append(microarray_experiment) to find the relevant 4 lines of code).
Note 3: This step was already performed, and the fedRBE-corrected data is stored in the repository.
You can skip it if you just want to look at the results.
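The output layout described above can be sanity-checked with a small helper. Below is a minimal sketch; the function name expected_outputs and the example dataset path are illustrative and not part of the repository:

```python
from pathlib import Path

def expected_outputs(dataset_dir: str, smpc: bool = False) -> list:
    """Return the paths the correction script is expected to produce
    for one dataset, per the description above."""
    after = Path(dataset_dir) / "after"
    merged = "FedApp_corrected_data_smpc.tsv" if smpc else "FedApp_corrected_data.tsv"
    return [after / merged, after / "individual_results"]

# Example: check which expected outputs already exist for a dataset.
for path in expected_outputs("evaluation_data/simulated", smpc=True):
    print(path, "exists" if path.exists() else "missing")
```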
Customization:
- To include or exclude experiments, edit the section marked ## ADD EXPERIMENTS, CHANGE HERE TO INCLUDE/EXCLUDE EXPERIMENTS in generate_fedrbe_corrected_datasets.py.
- To add your own dataset, place it in evaluation_data/[dataset]/before/ following the existing structure.

Perform centralized batch effect correction using limma’s removeBatchEffect for comparison.
Navigate to the dataset directory:
cd evaluation_data/[dataset_name]
Run the data preprocessing and centralized correction code in the provided Jupyter notebooks.
The code is located in the *central_RBE.ipynb notebooks in the evaluation_data/[dataset]/ directories.
Output:
- The corrected data is saved in evaluation_data/[dataset]/after/ for each dataset.

Note: The preprocessing steps and centralized correction are already implemented in the provided notebooks. You can skip this step entirely and use the provided corrected data.
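Conceptually, removeBatchEffect fits a linear model and subtracts the estimated batch effects. A minimal NumPy sketch of the simplest case (no covariates, per-feature batch-mean centering) illustrates the idea; this is not the repository's implementation:

```python
import numpy as np

def remove_batch_means(expr: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Center each batch on the overall mean, per feature (row).

    expr:  features x samples matrix
    batch: length-samples array of batch labels
    """
    corrected = expr.astype(float).copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    for b in np.unique(batch):
        cols = batch == b
        batch_mean = expr[:, cols].mean(axis=1, keepdims=True)
        corrected[:, cols] -= batch_mean - grand_mean
    return corrected

# One feature, two batches with a shift of +2 in the second batch:
expr = np.array([[1.0, 1.0, 3.0, 3.0]])
batch = np.array(["A", "A", "B", "B"])
print(remove_batch_means(expr, batch))  # both batches centered at 2.0
```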
Use the provided script to analyze and compare the results of federated and centralized batch effect corrections.
python3 ./analyse_fedvscentral.py
What This Does:
Output:
The comparison results are saved as fed_vc_cent_results.tsv in the evaluation_data/ directory.

To reproduce the tables and figures from the preprint, run the provided Jupyter notebooks in the evaluation/ directory.
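As a rough illustration of such a comparison, the two corrected matrices can be aligned on shared features and samples and compared element-wise. This is a generic sketch, not the script's exact metrics:

```python
import numpy as np
import pandas as pd

def compare_corrections(fed: pd.DataFrame, cent: pd.DataFrame) -> float:
    """Align two corrected feature-by-sample tables on shared
    features/samples and return the maximum absolute difference."""
    features = fed.index.intersection(cent.index)
    samples = fed.columns.intersection(cent.columns)
    diff = fed.loc[features, samples] - cent.loc[features, samples]
    return float(np.abs(diff.to_numpy()).max())

fed = pd.DataFrame({"s1": [1.0, 2.0], "s2": [3.0, 4.0]}, index=["g1", "g2"])
cent = fed + 1e-7  # nearly identical correction
print(compare_corrections(fed, cent))
```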
This evaluation is split into three Python scripts. Make sure the required Python packages from requirements.txt are installed!
First, run the classification experiments. This takes multiple hours, so the repository already contains the relevant results in case you want to skip this step.
The classification experiments are split into two different experiment types:
python3 evaluation_classification_after_correction/run_classification_train_test_split.py
python3 evaluation_classification_after_correction/run_classification_leave_one_cohort_out.py
The experiment results are saved in evaluation_classification_after_correction/results.
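The leave-one-cohort-out setup holds each cohort out as a test set once while training on the remaining cohorts. A minimal sketch of generating such splits (illustrative only, not the repository's code):

```python
import numpy as np

def leave_one_cohort_out(cohorts: np.ndarray):
    """Yield (cohort, train_idx, test_idx), holding out one cohort per split."""
    for cohort in np.unique(cohorts):
        test = np.where(cohorts == cohort)[0]
        train = np.where(cohorts != cohort)[0]
        yield cohort, train, test

cohorts = np.array(["lab1", "lab1", "lab2", "lab3", "lab3"])
for name, train, test in leave_one_cohort_out(cohorts):
    print(name, train.tolist(), test.tolist())
```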
To visualize the experiments with plots, simply run the corresponding analysis script:
python3 evaluation_classification_after_correction/analyse_classification_metric_report.py
The resulting plots can be found in evaluation_classification_after_correction/plots.
Understanding the repository layout helps in navigating the files and scripts.
fedRBE/
├── README.md # General repository overview
├── batchcorrection/ # fedRBE FeatureCloud app
├── evaluation_data/ # Data used for evaluation
│ ├── microarray/ # Microarray datasets
│ │ ├── before/ # Uncorrected data with structure needed to run the app
│ │ ├── after/ # Corrected data
│ │ └── 01_Preprocessing_and_RBE.ipynb # Data preparation notebook with centralized removeBatchEffect run
│ ├── microbiome/ # Microbiome datasets with similar structure as microarray
│ ├── proteomics/ # Proteomics datasets
│ ├── proteomics_multibatch/ # Multi-batch proteomics datasets (several batches)
│ └── simulated/ # Simulated datasets
├── analyse_fedvscentral.py # Compares federated and centralized batch effect corrections.
├── generate_fedrbe_corrected_datasets.py # A script performing fedRBE on all datasets and saving the results.
├── run_sample_experiment.py # A script performing fedRBE on one dataset only
├── evaluation_utils/ # Utility scripts for evaluations
│ ├── evaluation_funcs.R
│ ├── featurecloud_api_extension.py
│ ├── fedRBE_simulation_scrip_simdata.py
│ ├── filtering.R
│ ├── plots_eda.R
│ ├── simulation_func.R
│ ├── upset_plot.py
│ └── utils_analyse.py
├── evaluation/ # Main evaluation scripts to produce results and figures
│ ├── eval_simulation/ # Evaluations on simulated data
│ ├── evaluation_microarray.ipynb # Evaluation of microarray datasets
│ ├── evaluation_microbiome.ipynb
│ └── evaluation_proteomics.ipynb
└── [other directories/files]
This repository includes several utility scripts to facilitate data processing, analysis, and visualization. All main scripts are located at the repository root; the remaining helpers are placed in evaluation_utils/.
generate_fedrbe_corrected_datasets.py (repo root): Automates the federated batch effect correction process using fedRBE.
Functionality:
analyse_fedvscentral.py (repo root): Compares the results of federated and centralized batch effect corrections.
Functionality:
featurecloud_api_extension.py: Extends FeatureCloud API functionalities to support custom workflows and simulations.
Functionality: Used by generate_fedrbe_corrected_datasets.py.
filtering.R: Includes necessary filters for data preprocessing before centralized batch effect correction using limma’s removeBatchEffect.
Functionality:
plots_eda.R: Includes functions that generate plots to visualize data distributions and corrections.
Functionality:
upset_plot.py: Generates UpSet plots to visualize intersections and overlaps in datasets or features.
Functionality:
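UpSet plots summarize set intersections; the underlying counts can be computed with plain Python sets. A toy sketch of the kind of input such a plot visualizes (the feature sets here are made up, and this is not the script's actual interface):

```python
from itertools import combinations

# Hypothetical feature sets detected in three datasets:
feature_sets = {
    "proteomics": {"P1", "P2", "P3"},
    "microarray": {"P2", "P3", "P4"},
    "simulated": {"P3", "P4", "P5"},
}

# Size of every pairwise and higher-order intersection,
# as the bars of an UpSet plot would show.
for r in range(2, len(feature_sets) + 1):
    for names in combinations(feature_sets, r):
        common = set.intersection(*(feature_sets[n] for n in names))
        print(" & ".join(names), "->", len(common))
```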
Encountering issues? Below are common problems and their solutions:
Ensure the required input files are present in the evaluation_data/[dataset]/before/ directory.

For unresolved issues, consider reaching out via the GitHub Issues page.
For questions, issues, or support, please: