fedRBE


Reproduce the fedRBE Preprint

This guide provides step-by-step instructions to reproduce the analyses and results from the fedRBE preprint. It leverages the utility scripts and data provided in this repository to demonstrate both centralized and federated batch effect correction using limma and fedRBE, respectively.


Prerequisites and setup

  1. Docker: Installation Instructions
  2. Git LFS (required for large files used in this workflow)
  3. Python 3.8+: Installation Instructions with dependencies from requirements.txt.
  4. R 4.0+ with dependencies from requirements_r.txt
  5. System resources:
    • ≥ 16 GB RAM (the main script uses close to 16 GB, so close other memory-hungry programs while it runs!)
    • ≥ 20 GB free disk space
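
Before starting, you can sanity-check the resource requirements above. The snippet below is an illustrative helper (not part of the repository) that reports free disk space and, on POSIX systems, total RAM:

```python
# Illustrative resource check (not part of the repository).
import os
import shutil

# Free disk space in the current directory.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")

# Total physical RAM (POSIX systems exposing these sysconf names).
try:
    total_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024**3
    print(f"Total RAM: {total_gb:.1f} GB")
except (ValueError, OSError):
    print("Could not determine RAM size on this platform.")
```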

Setup steps

  1. Set up Git LFS:

    Install Git LFS following the Git LFS documentation.

    Make sure you set up Git LFS:

    git lfs install
    

    If you have already cloned the repository, download the large files from the large file storage now:

    git lfs pull 
    

    If you have not yet cloned the repository, git clone will automatically download LFS files if git lfs install has been run before.

  2. Clone the repository:

    If you have not yet cloned the repository, do so now:

    git clone https://github.com/Freddsle/fedRBE.git
    cd fedRBE
    
  3. Set up Python environment:

    We recommend using a virtual environment:

    python3 -m venv fedrbe_env
    source fedrbe_env/bin/activate  # on Windows: fedrbe_env\Scripts\activate
    pip install -r requirements.txt
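
After activating the environment, a quick check confirms the interpreter meets the 3.8+ requirement (illustrative snippet, not from the repository):

```python
import sys

# The evaluation scripts require Python 3.8 or newer.
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version.split()[0]}"
print("Python version OK:", sys.version.split()[0])
```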
    
  4. Set up R environment:

    limma and variancePartition are Bioconductor packages and must be installed separately from the CRAN packages:

    # Install CRAN packages
    Rscript -e 'pkgs <- readLines("requirements_r.txt"); pkgs <- pkgs[!grepl("^#", pkgs) & nzchar(trimws(pkgs))]; install.packages(pkgs[!pkgs %in% c("limma", "variancePartition")], repos = "http://cran.rstudio.com/")'
    # Install Bioconductor packages
    Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install(c("limma", "variancePartition"))'
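
The CRAN one-liner above filters requirements_r.txt down to non-empty, non-comment entries and excludes the Bioconductor packages. For reference, the same filtering logic in Python (a sketch, assuming the file lists one package name per line):

```python
def cran_packages(lines, bioc=("limma", "variancePartition")):
    """Keep non-empty, non-comment entries that are not Bioconductor packages."""
    pkgs = [ln.strip() for ln in lines]
    pkgs = [p for p in pkgs if p and not p.startswith("#")]
    return [p for p in pkgs if p not in bioc]

print(cran_packages(["# plotting", "ggplot2", "", "limma", "dplyr"]))
# → ['ggplot2', 'dplyr']
```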
    

Running the analysis

This section guides you through running both federated and centralized batch effect corrections and comparing their results.

1. Obtaining federated corrected data

Use the provided utility script to perform federated batch effect correction on your datasets.

python3 ./generate_fedrbe_corrected_datasets.py

Steps Performed by the Script:

  1. Sets up multiple clients: Simulates clients based on the datasets in evaluation_data/[dataset]/before/.
  2. Runs fedRBE on each client: Applies federated batch effect correction using the FeatureCloud testing environment.
  3. Aggregates results: Combines corrected data.

Output:

Note 1: The script may take some time to complete, depending on the dataset sizes and the number of clients — usually a few hours up to a day. Note 2: Processing the microarray datasets requires more than 16 GB of RAM. To skip the correction on the microarray datasets, comment out the corresponding lines in generate_fedrbe_corrected_datasets.py (search for experiments.append(microarray_experiment) to find the relevant 4 lines of code). Note 3: This step has already been performed and the fedRBE-corrected data is stored in the repository, so you can skip it if you only want to look at the results.

Customization:

2. Obtaining centrally corrected data

Perform centralized batch effect correction using limma’s removeBatchEffect for comparison.

  1. Navigate to the dataset directory:

    cd evaluation_data/[dataset_name]
    
  2. Run the data preprocessing and centralized correction notebook.

The code is located in the *central_RBE.ipynb Jupyter notebooks in the evaluation_data/[dataset]/ directories.

Output:

Note: The preprocessing steps and centralized correction are already implemented in the provided notebooks. It is possible to skip this step completely and use the provided corrected data.

3. Comparing federated and central corrections

Use the provided script to analyze and compare the results of federated and centralized batch effect corrections.

python3 ./analyse_fedvscentral.py
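
To illustrate the kind of comparison such a script performs, the sketch below computes element-wise differences between two toy matrices standing in for the federated and centralized outputs (hypothetical example; the actual script's metrics may differ):

```python
# Toy stand-ins for fedRBE-corrected and limma-corrected expression matrices.
fed = [[1.00, 2.01], [3.00, 4.00]]
central = [[1.00, 2.00], [3.00, 4.00]]

# Flatten both matrices and collect per-entry absolute differences.
diffs = [abs(f - c)
         for row_f, row_c in zip(fed, central)
         for f, c in zip(row_f, row_c)]
print("max abs diff:", max(diffs))
print("mean abs diff:", sum(diffs) / len(diffs))
```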

What This Does:

Output:

4. Produce tables and figures

To reproduce the tables and figures from the preprint, run the provided Jupyter notebooks in the evaluation/ directory.

5. Reproduce the classification analysis comparing fedRBE-corrected to uncorrected data

This is split into three Python scripts. Make sure the required Python packages from requirements.txt are installed!

First run the classification experiments. This takes multiple hours, so the repo already contains the relevant results if you want to skip this.

The classification experiments are split into the two different experiment types:

  1. train_test_split: Each client reserves 20% of its data as test data, trains on the other 80%, and reports metrics when predicting on the test data. This takes up to an hour. To run it, simply run:
    python3 evaluation_classification_after_correction/run_classification_train_test_split.py
    
  2. leave_one_cohort_out: The classification model is trained on all clients except one. The model then predicts on all data of the left-out client, and that client reports the metrics. Therefore, n_clients models are trained, which takes multiple hours. To run it, simply run:
    python3 evaluation_classification_after_correction/run_classification_leave_one_cohort_out.py
    

The experiment results are saved in evaluation_classification_after_correction/results.
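
The per-client 80/20 split described in the first experiment can be sketched as follows (illustrative only; the repository scripts may implement it differently):

```python
import random

def split_client(samples, test_frac=0.2, seed=0):
    """Shuffle a client's samples and reserve test_frac of them as test data."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = split_client(range(10))
print(len(train), len(test))  # → 8 2
```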

Finally, to visualize the experiment results as plots, run the corresponding analysis script:

python3 evaluation_classification_after_correction/analyse_classification_metric_report.py

The resulting plots can be found in evaluation_classification_after_correction/plots.


Repository structure

Understanding the repository layout helps in navigating the files and scripts.

fedRBE/
├── README.md                                   # General repository overview
├── batchcorrection/                            # fedRBE FeatureCloud app
├── evaluation_data/                            # Data used for evaluation
│   ├── microarray/                             # Microarray datasets
│   │   ├── before/                             # Uncorrected data with the structure needed to run the app
│   │   ├── after/                              # Corrected data
│   │   └── 01_Preprocessing_and_RBE.ipynb      # Data preparation notebook with centralized removeBatchEffect run
│   ├── microbiome/                             # Microbiome datasets with similar structure as microarray
│   ├── proteomics/                             # Proteomics datasets
│   ├── proteomics_multibatch/                  # Multi-batch proteomics datasets (several batches)
│   └── simulated/                              # Simulated datasets
├── analyse_fedvscentral.py                     # Compares federated and centralized batch effect corrections.
├── generate_fedrbe_corrected_datasets.py       # A script that runs fedRBE on all datasets and saves the results.
├── run_sample_experiment.py                    # A script that runs fedRBE on one dataset only.
├── evaluation_utils/                           # Utility scripts for evaluations
│   ├── evaluation_funcs.R
│   ├── featurecloud_api_extension.py
│   ├── fedRBE_simulation_scrip_simdata.py
│   ├── filtering.R
│   ├── plots_eda.R
│   ├── simulation_func.R
│   ├── upset_plot.py
│   └── utils_analyse.py
├── evaluation/                                 # Main evaluation scripts to produce results and figures
│   ├── eval_simulation/                        # Evaluations on simulated data
│   ├── evaluation_microarray.ipynb             # Evaluation of microarray datasets
│   ├── evaluation_microbiome.ipynb
│   └── evaluation_proteomics.ipynb
└── [other directories/files]

Utility scripts overview

This repository includes several utility scripts to facilitate data processing, analysis, and visualization. All main scripts are located at the repository root; the remaining helpers are placed in evaluation_utils/.

Troubleshooting

Encountering issues? Below are common problems and their solutions:

For unresolved issues, consider reaching out via the GitHub Issues page.

Additional resources

Contact information

For questions, issues, or support, please: