Readers for Omics data

Generic Reader

Created by Jorge Gomes on 06/09/2018 source generic_reader

class troppo.omics.readers.generic.GenericReader(path: str, idCol: int, expCol: int, header_start: int = 0, sep: str = ',')[source]

Bases: object

A generic reader for omics files that cannot be loaded by ProbeReader or HpaReader, such as RNA-seq files from the GDC. It can handle files that contain additional information before the file header, provided the user supplies header_start.

Arguments

path: str: complete path to the file from which expression data is read.
idCol: int or str: either the name of the identifier column or its index in the file header
expCol: int or str: either the name of the expression values column or its index in the file header
header_start: int: line of the file header. Default = 0
sep: str: field separator used in the omics file. Default = “,”

load(**kwargs)[source]: Executes the loading of supplied omics file.

Returns

dict: dictionary with the identifiers as keys and the expression values as values.
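
The behaviour described above (skip the lines before the header, pick an identifier column and an expression column by name or index, return a dictionary) can be sketched with the standard library alone. This is a minimal illustration, not troppo's implementation: the function name `load_expression`, the file contents, and the conversion of values to `float` are all assumptions made for the example.

```python
import csv
import os
import tempfile


def load_expression(path, id_col, exp_col, header_start=0, sep=","):
    """Return {identifier: expression value} from a delimited text file."""
    with open(path, newline="") as handle:
        # Skip any free-text lines that precede the header row.
        for _ in range(header_start):
            next(handle)
        reader = csv.reader(handle, delimiter=sep)
        header = next(reader)

        def resolve(col):
            # Accept either a column name or a zero-based column index.
            return header.index(col) if isinstance(col, str) else col

        id_idx, exp_idx = resolve(id_col), resolve(exp_col)
        # Assumption: expression values are numeric and parsed as floats.
        return {row[id_idx]: float(row[exp_idx]) for row in reader if row}


# Hypothetical file: two free-text lines, then a tab-separated table.
content = (
    "# generated for illustration\n"
    "# units: FPKM\n"
    "gene_id\tvalue\n"
    "ENSG01\t12.5\n"
    "ENSG02\t0.0\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(content)

values = load_expression(tmp.name, "gene_id", "value", header_start=2, sep="\t")
os.remove(tmp.name)
print(values)
```

Passing `id_col=0, exp_col=1` instead of the column names selects the same columns by position.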

class troppo.omics.readers.generic.TabularReader(path_or_df: str, index_col: int = 0, sample_in_rows: bool = True, header_offset: int = 0, cache_df: bool = False, ignore_samples: Optional[list] = None, omics_type: str = 'transcriptomics', nomenclature: Optional[str] = None, dsapply=None, **kwargs)[source]

Bases: object

A generic reader for tabular files. It can be used to read any tabular file, but it is recommended to use specialized readers for specific file types, such as ProbeReader for microarray files, or HpaReader for HPA files.

Arguments

path_or_df: str or pandas.DataFrame: The path to the file to be read, or a pandas DataFrame
index_col: int, optional: The index column of the file, by default 0
sample_in_rows: bool, optional: Whether the samples are in rows or columns, by default True
header_offset: int, optional: The number of lines to skip before the header, by default 0
cache_df: bool, optional: Whether to cache the DataFrame, by default False
ignore_samples: list, optional: A list of samples to ignore, by default None
omics_type: str, optional: The type of omics, by default ‘transcriptomics’
nomenclature: str, optional: The nomenclature of the omics, by default None
dsapply: function, optional: A function to apply to the DataFrame, by default None
**kwargs: dict, optional: Additional arguments to pass to pandas.read_csv

Methods

__iter__: Iterates over the file, yielding a tuple of (sample, data).
to_containers: Converts the file to a list of OmicsContainers.

to_containers()[source]: Converts the file to a list of OmicsContainers

Returns

list : A list of OmicsContainers
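
To make the roles of sample_in_rows and ignore_samples concrete, here is a plain-Python stand-in for iterating over (sample, data) pairs. It is only a sketch under stated assumptions: it uses neither pandas nor OmicsContainer, the helper name `iter_samples` is hypothetical, and the first column is assumed to hold the row labels (the equivalent of index_col = 0).

```python
def iter_samples(table, sample_in_rows=True, ignore_samples=None):
    """Yield (sample, {feature: value}) pairs from a small in-memory table.

    `table` is a list of rows; the first row is the header and the first
    column holds the row labels.
    """
    ignored = set(ignore_samples or [])
    header, *body = table
    if not sample_in_rows:
        # Transpose so that every remaining row describes one sample.
        header, *body = [list(col) for col in zip(*table)]
    features = header[1:]
    for row in body:
        sample, values = row[0], row[1:]
        if sample not in ignored:
            yield sample, dict(zip(features, values))


# Hypothetical table with genes in rows and samples in columns.
genes_in_rows = [
    ["gene", "s1", "s2"],
    ["g1", 1.0, 2.0],
    ["g2", 3.0, 4.0],
]
all_samples = dict(iter_samples(genes_in_rows, sample_in_rows=False))
kept = dict(iter_samples(genes_in_rows, sample_in_rows=False, ignore_samples=["s2"]))
print(all_samples)
print(kept)
```

With sample_in_rows left at True, the same table would instead be read as two samples named g1 and g2, which is why the orientation flag has to match the file layout.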

HPA Reader

Created by Jorge Gomes on 09/03/2018 source HPA_Reader

class troppo.omics.readers.hpa.HpaReader(fpath: str, tissue: str, id_col: int = 0, includeNA: bool = False)[source]

Bases: object

Reads the HPA pathology.tsv file from a path (fpath) on the local system. Discrete expression levels are converted to numerical values, and the expression value of each gene corresponds to the level reported for the most patients.

Parameters

fpath: str: complete path to the file from which omics data is read
tissue: str: Exactly as in the file, regarding the column where expression values should be retrieved
id_col: int: either 0 (= “ensembl”) or 1 (= “gene_symbol”), indicating which column is used for the gene ID
includeNA: bool: whether NA values should be included

load()[source]: Executes the loading of supplied omics file.

Returns

dict: a dictionary of geneID: expressionValue
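
The rule described for HpaReader (take the level reported for the most patients, then convert it to a number) can be illustrated as follows. This is a sketch, not the reader's actual code: the numeric scale in `LEVEL_SCORES`, the tie-breaking behaviour, the gene names, and the patient counts are all assumptions chosen for the example.

```python
# Hypothetical numeric scale; the values troppo assigns may differ.
LEVEL_SCORES = {"High": 3, "Medium": 2, "Low": 1, "Not detected": 0}


def dominant_level_score(patient_counts):
    """Return the score of the level reported for the most patients.

    Returns None when no patients were recorded (treated here as NA).
    """
    if not any(patient_counts.values()):
        return None
    # max() keeps the first level encountered when counts are tied.
    level = max(patient_counts, key=patient_counts.get)
    return LEVEL_SCORES[level]


# Made-up patient counts per staining level for two placeholder genes.
rows = {
    "GENE_A": {"High": 1, "Medium": 7, "Low": 3, "Not detected": 0},
    "GENE_B": {"High": 0, "Medium": 0, "Low": 0, "Not detected": 0},
}

include_na = False  # plays the role of the includeNA flag
expression = {}
for gene, counts in rows.items():
    score = dominant_level_score(counts)
    if score is not None or include_na:
        expression[gene] = score
print(expression)
```

Setting `include_na = True` keeps GENE_B in the result with a value of None.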

Microarray Reader

Created by Jorge Gomes on 19/03/2018 source probe_reader

class troppo.omics.readers.microarray.ProbeReader(fPath: str, expCol: int, annotFile: str, convTarget: str, convSep: str = ',', expSep: str = ',')[source]

Bases: object

Reads expression files sourced from microarray databases such as Gene Expression Barcode or Gene Expression Omnibus. Each value is assumed to be identified by a probe ID in the first column of the file. An annotation file provided by the microarray chip vendor must be supplied for probe-to-gene ID conversion. Probes with no match in the convTarget nomenclature are ignored. The reader handles cases where more than one probe translates to the same gene, and where a probe translates to more than one gene.

Parameters

fPath: str: complete path to the file from which expression data is read.
expCol: int: index of the column where expression values are retrieved from.
annotFile: str: complete path to the annotation file.
convTarget: str: exact match to the column name of the nomenclature used for probeID to geneID conversion. Recommended: either Gene Symbol or Entrez Gene, or equivalent.
convSep: str: field separator used in the annotation file. Default is “,”
expSep: str: field separator used in the probe intensity/expression file. Default is “,”

load() → dict[source]: Executes the loading of supplied omics file.

Returns

dict: a dictionary of geneID: expressionValue
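
The probe-to-gene conversion described for ProbeReader involves three cases: unmatched probes, several probes for one gene, and one probe for several genes. The sketch below shows one way to handle them. It does not reproduce troppo's logic: averaging probes that share a gene and splitting multi-gene annotations on "///" are assumptions made for illustration, as are the function name `probes_to_genes` and the sample data.

```python
from collections import defaultdict
from statistics import mean


def probes_to_genes(probe_values, annotation, multi_sep="///"):
    """Collapse probe-level values into a {gene: value} dictionary."""
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        target = annotation.get(probe, "").strip()
        if not target:
            continue  # no match in the target nomenclature: ignore the probe
        # A probe annotated with several genes contributes to each of them.
        for gene in (g.strip() for g in target.split(multi_sep)):
            if gene:
                per_gene[gene].append(value)
    # Assumed policy: average all probes that map to the same gene.
    return {gene: mean(values) for gene, values in per_gene.items()}


# Hypothetical probe intensities and probe -> gene annotation.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 6.0, "p4": 9.0}
annotation = {"p1": "TP53", "p2": "TP53", "p3": "BRCA1 /// BRCA2", "p4": ""}

genes = probes_to_genes(probe_values, annotation)
print(genes)
```

Other aggregation policies, such as taking the maximum probe value per gene, only require swapping `mean` for another function.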
