Batch integration of Omics Data

Several integration algorithms were introduced in the previous tutorials. However, the demonstrated approach was limited to a single sample. In some cases, multiple samples are available and the context-specific models are required for each. Hence, making the integration of multiple samples a necessity.

batch_run is a function from Cobamp that allows multiprocessing and is fully compatible with the Troppo framework. Thus allowing the integration of multiple samples in a single run. This function requires four parameters:

  • function: the function that will run the reconstruction that needs to be parallelized.

  • sequence: a list with the containers for each sample.

  • paramargs: a dictionary with the parameters for the function.

  • threads: the number of parallel processes to run.

Initial setup

import pandas as pd
import cobra
import re

from troppo.omics.readers.generic import TabularReader
from troppo.methods_wrappers import ReconstructionWrapper
from cobamp.utilities.parallel import batch_run

patt = re.compile('__COBAMPGPRDOT__[0-9]{1}')
replace_alt_transcripts = lambda x: patt.sub('', x)

# load the model
model = cobra.io.read_sbml_model('data\HumanGEM_Consistent_COVID19_HAM.xml')

# Create the reconstruction wrapper
model_wrapper = ReconstructionWrapper(model=model, ttg_ratio=9999,
                                      gpr_gene_parse_function=replace_alt_transcripts)

# load the data
omics_data = pd.read_csv(filepath_or_buffer=r'data\Desai-GTEx_ensembl.csv', index_col=0)
omics_data = omics_data.loc[['Lung_Healthy','Lung_COVID19']]

# creat omics container
omics_container = TabularReader(path_or_df=omics_data, nomenclature='entrez_id',
                                omics_type='transcriptomics').to_containers()

Define the function to be parallelized

This function uses the run_from_omics method from the ReconstructionWrapper class. This requires the following parameters:

  • omics_data: the omics data container for the sample.

  • algorithm: a string containing the algorithm to use for the reconstruction.

  • and_or_funcs: a tuple with the functions to use for the AND and OR operations of the GPR.

  • integration_strategy: a tuple with the integration strategy and the function to apply to the scores.

  • solver: the solver to use for the optimization.

  • **kwargs: additional parameters for the reconstruction that are specific to used algorithm.

def reconstruction_function_gimme(omics_container, parameters: dict):

    def score_apply(reaction_map_scores):
        return {k:0  if v is None else v for k, v in reaction_map_scores.items()}

    flux_threshold, obj_frac, rec_wrapper, method = [parameters[parameter] for parameter in
                                      ['flux_threshold', 'obj_frac', 'reconstruction_wrapper',
                                       'algorithm']]

    reac_ids = rec_wrapper.model_reader.r_ids
    metab_ids = rec_wrapper.model_reader.m_ids
    AND_OR_FUNCS = (min, sum)

    return rec_wrapper.run_from_omics(omics_data=omics_container, algorithm=method,
                                      and_or_funcs=AND_OR_FUNCS,
                                      integration_strategy=('continuous', score_apply),
                                      solver='CPLEX', obj_frac=obj_frac,
                                      objectives=[{'biomass_human': 1}], preprocess=True,
                                      flux_threshold=flux_threshold, reaction_ids=reac_ids,
                                      metabolite_ids=metab_ids)

Considering the function above, the parameters for the reconstruction are defined in a dictionary as follows:

parameters = {'flux_threshold': 0.8, 'obj_frac': 0.8, 'reconstruction_wrapper': model_wrapper,
              'algorithm': 'gimme'}

Run the batch integration

batch_gimme_res = batch_run(reconstruction_function_gimme, omics_container, parameters, threads=2)