Readers for Omics data

Generic Reader

Created by Jorge Gomes on 06/09/2018 source generic_reader

class troppo.omics.readers.generic.GenericReader(path: str, idCol: int, expCol: int, header_start: int = 0, sep: str = ',')[source]

Bases: object

A generic reader for omics files that cannot be loaded by ProbeReader or HpaReader, such as RNA-seq files from the GDC. It can handle files with additional information before the file header, provided the user supplies header_start.

Arguments

path: str

complete path to the file from which expression data is read.

idCol: int or str

either the name of the identifier column or its index in the file header

expCol: int or str

either the name of the expression values column or its index in the file header

header_start: int

line of the file header. Default = 0

sep: str

field separator used in the omics file. Default = “,”

load(**kwargs)[source]

Loads the supplied omics file.

Returns

dict: dictionary with the identifiers as keys and the expression values as values.
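The role of header_start and of the name-or-index columns can be illustrated with a minimal standard-library sketch. It is not troppo's implementation: read_expression is a hypothetical helper, and it assumes that header_start is a zero-based line index and that expression values parse as floats.

```python
import csv
import io


def read_expression(handle, id_col, exp_col, header_start=0, sep=","):
    """Return {identifier: expression} from a delimited text stream.

    Hypothetical helper for illustration only, not part of troppo.
    """
    rows = csv.reader(handle, delimiter=sep)
    # Skip any free-text lines that precede the header row
    # (assumes header_start is a zero-based line index).
    for _ in range(header_start):
        next(rows)
    header = next(rows)

    def resolve(col):
        # Accept either a column name or a zero-based column index.
        return header.index(col) if isinstance(col, str) else col

    id_idx, exp_idx = resolve(id_col), resolve(exp_col)
    return {row[id_idx]: float(row[exp_idx]) for row in rows if row}


# Two comment lines precede the header, so header_start=2.
sample = io.StringIO(
    "# exported from a portal\n"
    "# units: FPKM\n"
    "gene_id\tlength\tfpkm\n"
    "ENSG00000000003\t2968\t12.5\n"
    "ENSG00000000005\t1610\t0.0\n"
)
values = read_expression(sample, id_col="gene_id", exp_col=2, header_start=2, sep="\t")
print(values)
```

The identifier column is given by name and the expression column by index here, mirroring the "int or str" options described for idCol and expCol.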

class troppo.omics.readers.generic.TabularReader(path_or_df: str, index_col: int = 0, sample_in_rows: bool = True, header_offset: int = 0, cache_df: bool = False, ignore_samples: Optional[list] = None, omics_type: str = 'transcriptomics', nomenclature: Optional[str] = None, dsapply=None, **kwargs)[source]

Bases: object

A generic reader for tabular files. It can be used to read any tabular file, but it is recommended to use specialized readers for specific file types, such as ProbeReader for microarray files, or HpaReader for HPA files.

Arguments

path_or_df: str or pandas.DataFrame

The path to the file to be read, or a pandas DataFrame

index_col: int, optional

The index column of the file, by default 0

sample_in_rows: bool, optional

Whether the samples are in rows (True) or in columns (False), by default True

header_offset: int, optional

The number of lines to skip before the header, by default 0

cache_df: bool, optional

Whether to cache the DataFrame, by default False

ignore_samples: list, optional

A list of samples to ignore, by default None

omics_type: str, optional

The type of omics, by default ‘transcriptomics’

nomenclature: str, optional

The nomenclature of the omics, by default None

dsapply: function, optional

A function to apply to the DataFrame, by default None

**kwargs: dict, optional

Additional arguments to pass to pandas.read_csv

Methods

__iter__:

Iterates over the file, yielding a tuple of (sample, data)

to_containers:

Converts the file to a list of OmicsContainers

to_containers()[source]

Converts the file to a list of OmicsContainers

Returns

list : A list of OmicsContainers
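To make the effect of sample_in_rows and ignore_samples concrete, the following sketch yields (sample, data) pairs from a small table using only the standard library. It is an illustration under stated assumptions rather than TabularReader's actual code: iter_samples is a hypothetical function, the first row and first column are assumed to hold labels, and all values are assumed to be numeric.

```python
import csv
import io


def iter_samples(handle, sample_in_rows=True, ignore_samples=None, sep=","):
    """Yield (sample, {feature: value}) pairs from a delimited table.

    Hypothetical helper; the first row and first column hold the labels.
    """
    ignored = set(ignore_samples or [])
    table = [row for row in csv.reader(handle, delimiter=sep) if row]
    if not sample_in_rows:
        # Transpose so that every row after the header describes one sample.
        table = [list(col) for col in zip(*table)]
    features = table[0][1:]
    for row in table[1:]:
        sample, raw = row[0], row[1:]
        if sample in ignored:
            continue
        yield sample, {f: float(v) for f, v in zip(features, raw)}


# Genes are in rows and samples in columns, so sample_in_rows=False.
genes_in_rows = io.StringIO(
    "gene,S1,S2,S3\n"
    "G1,1.0,2.0,3.0\n"
    "G2,4.0,5.0,6.0\n"
)
samples = dict(iter_samples(genes_in_rows, sample_in_rows=False, ignore_samples=["S2"]))
print(samples)
```

Transposing up front keeps the iteration logic identical for both orientations, which is one simple way to support a sample_in_rows switch.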

HPA Reader

Created by Jorge Gomes on 09/03/2018 source HPA_Reader

class troppo.omics.readers.hpa.HpaReader(fpath: str, tissue: str, id_col: int = 0, includeNA: bool = False)[source]

Bases: object

Reads the HPA pathology.tsv file from the given fpath. Discrete expression levels are converted to numerical values, and the value reported for each gene corresponds to the level observed in the most patients.

Parameters

fpath: str

complete path to the file from which omics data is read

tissue: str

tissue name, exactly as written in the file, identifying the column from which expression values should be retrieved

id_col: int

either 0 (“ensembl”) or 1 (“gene_symbol”), indicating which column is used for the gene ID

includeNA: bool

whether NA values should be included

load()[source]

Loads the supplied omics file.

Returns

dict: a dictionary of geneID: expressionValue
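The conversion described for HpaReader, where the level with the most patients is turned into a number, might look roughly like the sketch below. The numeric encoding in LEVEL_SCORES and the tie-breaking rule are assumptions made for illustration, and dominant_level_score is a hypothetical name; the values troppo actually uses may differ.

```python
# Assumed numeric encoding of the discrete HPA levels (illustrative only).
LEVEL_SCORES = {"High": 3, "Medium": 2, "Low": 1, "Not detected": 0}


def dominant_level_score(patient_counts, scores=LEVEL_SCORES):
    """Return the numeric score of the level reported for the most patients.

    Ties go to the higher level, an arbitrary choice made for this sketch.
    """
    level = max(patient_counts, key=lambda name: (patient_counts[name], scores[name]))
    return scores[level]


# Seven of eleven patients show "Medium" expression for this gene.
counts = {"High": 2, "Medium": 7, "Low": 1, "Not detected": 1}
score = dominant_level_score(counts)
print(score)
```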

Microarray Reader

Created by Jorge Gomes on 19/03/2018 source probe_reader

class troppo.omics.readers.microarray.ProbeReader(fPath: str, expCol: int, annotFile: str, convTarget: str, convSep: str = ',', expSep: str = ',')[source]

Bases: object

Reads expression files sourced from microarray databases such as Gene Expression Barcode or Gene Expression Omnibus. Each value is assumed to be identified by a probe ID in the first column of the file. An annotation file from the microarray chip vendor must be supplied for probe-to-gene ID conversion. Probes with no match in the convTarget nomenclature are ignored. Handles cases where more than one probe translates to the same gene, and where a probe translates to more than one gene.

Parameters

fPath: str

complete path to the file from which expression data is read.

expCol: int

index of the column where expression values are retrieved from.

annotFile: str

complete path to the annotation file.

convTarget: str

exact match to the column name of the nomenclature used for probeID to geneID conversion. Recommended: either Gene Symbol, Entrez Gene, or equivalent.

expSep: str

field separator used in the probe intensity/expression file. Default is “,”

load() → dict[source]

Loads the supplied omics file.

Returns

dict: a dictionary of geneID: expressionValue
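The probe-to-gene cases mentioned for ProbeReader (unmatched probes, several probes per gene, one probe for several genes) can be sketched as follows. This is not the library's code: probes_to_genes is a hypothetical helper, and merging several probes for one gene by taking the maximum is an assumption, since the documentation does not state which aggregation is used.

```python
from collections import defaultdict


def probes_to_genes(probe_values, probe_to_genes):
    """Collapse probe-level values into a {gene: value} dictionary.

    Hypothetical helper. Probes missing from the annotation are skipped,
    a probe annotated with several genes contributes to each of them, and
    several probes for one gene are merged by taking the maximum
    (an assumption made for this sketch).
    """
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        for gene in probe_to_genes.get(probe, []):
            per_gene[gene].append(value)
    return {gene: max(values) for gene, values in per_gene.items()}


probe_values = {"p1": 5.0, "p2": 7.5, "p3": 2.0, "p4": 9.0}
# p1 and p2 share a gene, p3 maps to two genes, p4 has no annotation.
annotation = {"p1": ["TP53"], "p2": ["TP53"], "p3": ["BRCA1", "BRCA2"]}
gene_values = probes_to_genes(probe_values, annotation)
print(gene_values)
```

Collecting all values per gene before aggregating makes it easy to swap the maximum for a mean or median if a different policy is preferred.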