Batch integration of Omics Data
Several integration algorithms were introduced in the previous tutorials, but each was demonstrated on a single sample. When multiple samples are available and a context-specific model is required for each one, the integration has to be repeated for every sample.
batch_run is a function from Cobamp that supports multiprocessing and is fully compatible with the Troppo framework, which allows multiple samples to be integrated in a single run. This function requires four parameters:
function: the function that runs the reconstruction and is to be parallelized.
sequence: a list with the containers for each sample.
paramargs: a dictionary with the parameters for the function.
threads: the number of parallel processes to run.
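The calling convention described above can be sketched with a small stand-in. This is not Cobamp's implementation: batch_run_sketch and toy_reconstruction are hypothetical names, and a thread pool from the standard library is used in place of Cobamp's process-based machinery. The only point is to show that the function receives one item of the sequence plus the shared parameter dictionary, and that one result per item comes back.

```python
from multiprocessing.pool import ThreadPool


def batch_run_sketch(function, sequence, paramargs, threads):
    """Call function(item, paramargs) for every item in sequence.

    Hypothetical stand-in for illustration only; the real batch_run
    may differ internally.
    """
    with ThreadPool(processes=threads) as pool:
        # map returns the results in the same order as `sequence`
        return pool.map(lambda item: function(item, paramargs), sequence)


def toy_reconstruction(sample, parameters):
    # Hypothetical worker: keep the values at or above a threshold
    return [value for value in sample if value >= parameters['threshold']]


samples = [[0.1, 0.9, 0.5], [0.7, 0.2]]
results = batch_run_sketch(toy_reconstruction, samples, {'threshold': 0.5}, threads=2)
print(results)
```

Every worker receives the same paramargs dictionary, which is why sample-independent settings are placed there while the per-sample data goes into the sequence.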
Initial setup
import pandas as pd
import cobra
import re
from troppo.omics.readers.generic import TabularReader
from troppo.methods_wrappers import ReconstructionWrapper
from cobamp.utilities.parallel import batch_run
patt = re.compile('__COBAMPGPRDOT__[0-9]{1}')
replace_alt_transcripts = lambda x: patt.sub('', x)
# load the model
model = cobra.io.read_sbml_model(r'data\HumanGEM_Consistent_COVID19_HAM.xml')
# Create the reconstruction wrapper
model_wrapper = ReconstructionWrapper(model=model, ttg_ratio=9999,
                                      gpr_gene_parse_function=replace_alt_transcripts)
# load the data
omics_data = pd.read_csv(filepath_or_buffer=r'data\Desai-GTEx_ensembl.csv', index_col=0)
omics_data = omics_data.loc[['Lung_Healthy','Lung_COVID19']]
# create the omics containers (one per sample)
omics_container = TabularReader(path_or_df=omics_data, nomenclature='entrez_id',
                                omics_type='transcriptomics').to_containers()
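The gene parsing function passed to the wrapper is easier to follow with a small example. The gene identifiers below are made up for illustration, and the reading that the __COBAMPGPRDOT__ token marks an alternative-transcript suffix is an assumption based on the name of the function.

```python
import re

# Same pattern and helper as in the setup code above
patt = re.compile('__COBAMPGPRDOT__[0-9]{1}')
replace_alt_transcripts = lambda x: patt.sub('', x)

# Hypothetical identifiers, not taken from the model
with_suffix = replace_alt_transcripts('1234__COBAMPGPRDOT__1')
without_suffix = replace_alt_transcripts('ENSG00000000001')
two_digit_suffix = replace_alt_transcripts('1234__COBAMPGPRDOT__12')

print(with_suffix)       # 1234
print(without_suffix)    # ENSG00000000001
print(two_digit_suffix)  # 12342
```

Because the pattern matches exactly one digit after the token, a two-digit suffix leaves its second digit behind. If such suffixes can occur in a model, a pattern ending in [0-9]+ would be worth considering.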
Define the function to be parallelized
The function to be parallelized calls the run_from_omics method of the ReconstructionWrapper class, which takes the following parameters:
omics_data: the omics data container for the sample.
algorithm: a string containing the algorithm to use for the reconstruction.
and_or_funcs: a tuple with the functions to use for the AND and OR operations of the GPR.
integration_strategy: a tuple with the integration strategy and the function to apply to the scores.
solver: the solver to use for the optimization.
**kwargs: additional parameters for the reconstruction that are specific to the algorithm used.
def reconstruction_function_gimme(omics_container, parameters: dict):
    def score_apply(reaction_map_scores):
        # Reactions without a score (None) are given a score of 0
        return {k: 0 if v is None else v for k, v in reaction_map_scores.items()}

    # Unpack the shared parameters passed through batch_run
    flux_threshold, obj_frac, rec_wrapper, method = [parameters[parameter] for parameter in
                                                     ['flux_threshold', 'obj_frac',
                                                      'reconstruction_wrapper', 'algorithm']]

    reac_ids = rec_wrapper.model_reader.r_ids
    metab_ids = rec_wrapper.model_reader.m_ids
    # Functions applied to the AND and OR operators of each GPR rule
    AND_OR_FUNCS = (min, sum)

    return rec_wrapper.run_from_omics(omics_data=omics_container, algorithm=method,
                                      and_or_funcs=AND_OR_FUNCS,
                                      integration_strategy=('continuous', score_apply),
                                      solver='CPLEX', obj_frac=obj_frac,
                                      objectives=[{'biomass_human': 1}], preprocess=True,
                                      flux_threshold=flux_threshold, reaction_ids=reac_ids,
                                      metabolite_ids=metab_ids)
Considering the function above, the parameters for the reconstruction are defined in a dictionary as follows:
parameters = {'flux_threshold': 0.8, 'obj_frac': 0.8, 'reconstruction_wrapper': model_wrapper,
              'algorithm': 'gimme'}
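To make the roles of AND_OR_FUNCS and score_apply more concrete, the sketch below evaluates one GPR rule by hand. It does not use Troppo's GPR parser; the gene names, expression values and reaction identifiers are hypothetical, and the rule is hard-coded purely to show how min and sum combine gene-level values into a reaction score.

```python
def score_apply(reaction_map_scores):
    # Same helper as in the reconstruction function: None becomes 0
    return {k: 0 if v is None else v for k, v in reaction_map_scores.items()}


AND_OR_FUNCS = (min, sum)
and_func, or_func = AND_OR_FUNCS

# Hypothetical expression values for three genes
expression = {'g1': 5.0, 'g2': 2.0, 'g3': 1.5}

# Hand evaluation of the rule "(g1 and g2) or g3"
complex_score = and_func([expression['g1'], expression['g2']])  # limiting subunit
reaction_score = or_func([complex_score, expression['g3']])     # isoenzymes add up

# R2 stands for a reaction with no gene data, so its score is None
scores = score_apply({'R1': reaction_score, 'R2': None})
print(scores)
```

With min for AND, a complex is limited by its least expressed subunit; with sum for OR, isoenzymes contribute additively. Other pairs such as (min, max) lead to different reaction scores for the same data.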
Run the batch integration
batch_gimme_res = batch_run(reconstruction_function_gimme, omics_container, parameters, threads=2)
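Once the batch run has finished, the results can be paired with the sample names for comparison. The sketch below relies on two assumptions that should be checked against the actual output: that the results come back as a list in the same order as the input containers, and that each result is a dictionary mapping reaction identifiers to True (kept) or False (removed). The values shown are stand-ins, not real output.

```python
# Stand-ins with assumed shapes; replace with the real objects
sample_names = ['Lung_Healthy', 'Lung_COVID19']
batch_res = [
    {'R1': True, 'R2': False, 'R3': True},
    {'R1': True, 'R2': True, 'R3': False},
]

# Pair each sample name with its reconstruction result
results_by_sample = dict(zip(sample_names, batch_res))

# Number of reactions kept in each context-specific model
kept_counts = {name: sum(res.values()) for name, res in results_by_sample.items()}

# Reactions whose presence differs between the two samples
healthy, covid = batch_res
differing = sorted(r for r in healthy if healthy[r] != covid[r])

print(kept_counts)
print(differing)
```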