BNoL Modules

bnol.data

Data for use in examples and tests.

bnol.data.BerrettaExpression()[source]

Example frequency distributions with desirable properties:

  • First two specimens are permutations of each other;
  • Final specimen is uniformly distributed.

See Berretta et al. Cancer biomarker discovery: The entropic hallmark.

Returns:shape (4,5)
Return type:numpy.ndarray
bnol.data.epsilon()[source]

Epsilon value for use in testing as the tolerance in numpy.isclose().

Returns:1e-12.
Return type:float
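
A minimal sketch of how such a tolerance might be used, assuming it is passed to numpy.isclose() as the absolute tolerance (atol). The name EPSILON is a local stand-in for the documented value and is not imported from bnol.

```python
import numpy as np

EPSILON = 1e-12  # stand-in for the value documented for bnol.data.epsilon()

# Floating-point arithmetic rarely gives exact equality...
total = 0.1 + 0.2
exactly_equal = (total == 0.3)  # False

# ...so compare within a small absolute tolerance instead.
within_tolerance = bool(np.isclose(total, 0.3, rtol=0, atol=EPSILON))  # True
```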

bnol.information

Information-theoretic measures (e.g. entropy and divergence) and analyses (e.g. Minimum Description Length Principle for feature selection).

bnol.information.Complexity(distributions, reference=None)[source]

For each of a group of n specimens, return the specimen entropy multiplied by the Jensen-Shannon divergence (each in bits).

See Berretta et al. Cancer biomarker discovery: The entropic hallmark.

Parameters:
  • distributions (numpy.ndarray) – shape (n,p) defining the relative frequencies of features in the n specimens.
  • reference (Optional[numpy.ndarray]) – shape (1,p) defining a reference sequence against which complexity is calculated. If not provided, the average of all distributions is used.
Returns:

n complexity values, one for each specimen.

Return type:

numpy.ndarray
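
The calculation described above can be sketched in plain NumPy. This is an illustration of the stated definition (entropy multiplied by Jensen-Shannon divergence, both in bits), not the library's source. The function names are hypothetical, and rows are assumed to already sum to one.

```python
import numpy as np

def entropy_bits(d):
    """Row-wise Shannon entropy in bits, treating 0 * log2(0) as 0."""
    logs = np.zeros_like(d)
    np.log2(d, out=logs, where=d > 0)
    return -(d * logs).sum(axis=1)

def kl_bits(p, q):
    """Row-wise KLD(p||q) in bits; terms with p == 0 contribute nothing."""
    p, q = np.broadcast_arrays(p, q)
    ratio = np.ones(p.shape)
    np.divide(p, q, out=ratio, where=p > 0)
    return (p * np.log2(ratio)).sum(axis=1)

def complexity_sketch(distributions, reference=None):
    """Hypothetical stand-in: entropy times JSD against a reference."""
    d = np.atleast_2d(np.asarray(distributions, dtype=float))
    if reference is None:
        # Default reference: the average of all distributions.
        ref = d.mean(axis=0, keepdims=True)
    else:
        ref = np.atleast_2d(np.asarray(reference, dtype=float))
    m = (d + ref) / 2
    jsd = (kl_bits(ref, m) + kl_bits(d, m)) / 2
    return entropy_bits(d) * jsd

# One specimen measured against an explicit reference.
with_reference = complexity_sketch([[0.5, 0.5]], reference=[[1.0, 0.0]])

# Identical specimens equal their own average, so the divergence is zero.
identical = complexity_sketch([[0.25, 0.75], [0.25, 0.75]])
```

Because the two measures are multiplied, a specimen scores zero whenever it has zero entropy or matches the reference exactly.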

class bnol.information.Discretize[source]

Information-theoretic method for discrete feature selection based on Minimum Description Length Principle (MDLP).

This method additionally provides an entropy-minimising means of defining a context-dependent threshold for over- / under-expression of genes across categorical designations. This is the (binary) discretisation step.

Along with gene-expression values, all specimens are labelled by class (i.e. context-dependency) such as cancer vs normal tissue, and a threshold is produced such that the entropy of classes within the groups (above and below said threshold) is minimised. Any number of classes (>=2) can be used.

Features are only included if the reduction in entropy, compared to no separation by a threshold, is sufficiently great (the MDLP criterion).

See Fayyad and Irani. Multi-interval discretization of continuous-valued attributes for classification learning.
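
The binary discretisation step can be illustrated for a single feature. The sketch below assumes that candidate thresholds are the observed feature values and that groups are weighted by their share of specimens, as in Fayyad and Irani. The function names are hypothetical and are not part of the Discretize API.

```python
import numpy as np

def class_entropy(classes):
    """Shannon entropy (bits) of the class labels in one group of specimens."""
    _, counts = np.unique(classes, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def best_threshold(feature, classes):
    """Return (threshold, weighted class entropy) minimising the latter."""
    feature = np.asarray(feature, dtype=float)
    classes = np.asarray(classes)
    best = (None, np.inf)
    # Skip the maximum value: nothing is strictly greater than it.
    for t in np.unique(feature)[:-1]:
        above = feature > t
        weighted = (above.mean() * class_entropy(classes[above])
                    + (~above).mean() * class_entropy(classes[~above]))
        if weighted < best[1]:
            best = (float(t), weighted)
    return best

expression = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]
threshold, entropy_after = best_threshold(expression, labels)
```

In this toy data the classes separate cleanly, so the chosen threshold leaves both groups pure and the weighted entropy falls to zero.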

baseEntropy

float

class entropy for the whole set of specimens.

bestThresholds

numpy.ndarray

float; shape (p,); optimal, entropy-minimising threshold for each of the p features.

discretizedFeatures

numpy.ndarray

bool; shape (n,p); whether or not each specimen-feature value exceeds the optimal threshold for said feature.

includeFeatures

numpy.ndarray

bool; shape (p,); whether or not the optimal threshold for each feature is sufficient such that the decrease in entropy meets the MDLP criterion.

gains

numpy.ndarray

float; shape (p,); entropy improvement based on best threshold for each feature.

mdlpCriteria

numpy.ndarray

float; shape (p,); minimum gain required for inclusion of each feature.

static deltaMDLP(classes, baseEntropy, bestSeparation, bestSeparationEntropies)[source]

Calculate the delta value for a particular feature, as defined in Fayyad and Irani paper for calculating MDLP criterion.
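
For orientation, the acceptance test from the Fayyad and Irani paper can be sketched as follows. The function name, arguments, and return value are hypothetical and do not mirror the deltaMDLP signature. At least two specimens are assumed.

```python
import numpy as np

def class_entropy(classes):
    """Shannon entropy (bits) of the class labels in one group of specimens."""
    _, counts = np.unique(classes, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def mdlp_check(classes, above):
    """Return (gain, criterion, accepted) for one candidate separation."""
    classes = np.asarray(classes)
    above = np.asarray(above, dtype=bool)
    n = classes.size
    s1, s2 = classes[above], classes[~above]
    ent, ent1, ent2 = (class_entropy(s) for s in (classes, s1, s2))
    # k, k1, k2: number of distinct classes overall and within each group.
    k, k1, k2 = (np.unique(s).size for s in (classes, s1, s2))
    gain = ent - (s1.size * ent1 + s2.size * ent2) / n
    delta = np.log2(3.0 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    criterion = (np.log2(n - 1) + delta) / n
    return gain, criterion, bool(gain > criterion)

# A separation that splits two classes perfectly is accepted...
gain, criterion, accepted = mdlp_check(list("aaabbb"), [False] * 3 + [True] * 3)

# ...whereas one that leaves both groups evenly mixed is rejected.
_, _, accepted_mixed = mdlp_check(list("abab"), [True, True, False, False])
```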

fit(distributions, classes)[source]

Determine threshold values for each feature.

Parameters:
  • distributions (numpy.ndarray) – shape (n,p) defining the relative frequencies of features in the n specimens.
  • classes (numpy.ndarray) – shape (n,) defining categorical designation of each specimen.
Raises:

Exception – if number of samples is different in distributions and classes arguments.

fit_transform(distributions, classes, allFeatures=False)[source]

Determine threshold values as if only calling fit(distributions, classes) and return the discretized features.

Tip

Use Discretize.includeFeatures attribute to determine which features are included when allFeatures==False.

Parameters:
  • distributions (numpy.ndarray) – shape (n,p) defining the relative frequencies of features in the n specimens.
  • classes (numpy.ndarray) – shape (n,) defining categorical designation of each specimen.
  • allFeatures (bool) – if False will exclude those features for which the MDLP criterion was not met.
Returns:

boolean; shape (n,p) if allFeatures==True, otherwise shape (n,p`) where p` is the number of features for which the MDLP criterion was met; each boolean value represents whether or not the specimen-feature value exceeds the optimal threshold for said feature.

Return type:

numpy.ndarray
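
The effect of allFeatures on the returned shape can be pictured as boolean column selection. The arrays below are invented stand-ins for the discretizedFeatures and includeFeatures attributes, not output from the library.

```python
import numpy as np

# Hypothetical values after fitting 3 specimens with 4 features.
discretized = np.array([
    [True, False, True, False],
    [False, False, True, True],
    [True, True, False, False],
])
include = np.array([True, False, False, True])

# Boolean indexing along axis 1 keeps only the features flagged True,
# so the specimen count is unchanged and the feature count shrinks.
selected = discretized[:, include]
```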

static getSeparation(feature, threshold)[source]

Ensure that thresholding is performed identically at all times. Threshold candidates are simply the values of the features so we MUST use > and not >=.

Parameters:
  • feature (numpy.ndarray) – shape (n,) vector defining relative frequency of feature across all specimens.
  • threshold (float) – value for separation of specimens into groups based on their feature value.
Returns:

boolean array defining if a specimen is greater than the threshold.

Return type:

numpy.ndarray
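
A short NumPy illustration of why the choice of operator matters: the two comparisons disagree exactly on those specimens whose value equals the threshold, which always occurs when thresholds are drawn from the observed values. The variable names are illustrative only.

```python
import numpy as np

feature = np.array([0.1, 0.4, 0.4, 0.7])
threshold = 0.4  # a candidate taken from the observed values

strict = feature > threshold      # [False, False, False, True]
inclusive = feature >= threshold  # [False, True, True, True]

# Specimens sitting exactly on the threshold change group between operators.
disagree = strict != inclusive    # [False, True, True, False]
```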

static groupClassEntropy(classes)[source]

Calculate the entropy of classes within a group of specimens. Note that this group may be the full set of specimens or one of the separations based on a threshold.

transform(distributions, allFeatures=False)[source]

Not yet implemented.

Todo

Define a function to discretize novel specimens that weren’t used in determining the optimal thresholds and feature selection.

class bnol.information.Divergence(P=None, Qs=None)[source]

Bases: object

Calculate different divergence measures, D(P||Q), to describe the ‘statistical distance’ between a set of probability distributions and a reference.

Utilises scipy.stats.entropy() under the hood but can accept multiple values for Q simultaneously.

See Wikipedia: Statistical distance as well as links in individual measures.

Parameters:
  • P (Optional[numpy.ndarray]) – shape (1,p) defining a reference sequence against which divergence is calculated. If not provided then a p-dimensional discrete uniform is used.
  • Qs (numpy.ndarray) – shape (n,p) defining the relative frequencies of features in the n specimens. NOTE: Although it has a default value of None, this is not optional (the ordering of P and Qs parameters is a sane choice given scipy.stats.entropy and this requires that Qs have a default if P does).
Raises:
  • Exception – If no value is provided for Qs.
  • Exception – If P is provided and it has a different number of features to Qs (i.e. not P.shape[1]==Qs.shape[1]).
  • Exception – If P is provided and it contains more than one reference (i.e. P.shape[0]>1).
JS()[source]

Return the Jensen-Shannon divergence (in bits).

Unlike Kullback-Leibler divergence (KLD), this is symmetrical. The square-root of the value is also a metric so it will satisfy the triangle inequality.

See Wikipedia: Jensen-Shannon divergence.

Note

JSD(P||Q) is calculated from KLD(P||Q) as follows:

let M = (P+Q) / 2
JSD(P||Q) = (KLD(P||M) + KLD(Q||M)) / 2
Returns:n divergence values, one for each specimen.
Return type:numpy.ndarray
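
The note above translates directly into NumPy. This sketch uses hypothetical function names, assumes every row sums to one, and is not the library's implementation (which is documented as relying on scipy.stats.entropy()).

```python
import numpy as np

def kl_bits(p, q):
    """Row-wise KLD(p||q) in bits; terms with p == 0 contribute nothing."""
    p, q = np.broadcast_arrays(p, q)
    ratio = np.ones(p.shape)
    np.divide(p, q, out=ratio, where=p > 0)
    return (p * np.log2(ratio)).sum(axis=1)

def js_bits(P, Qs):
    """JSD between one reference P, shape (1,p), and each row of Qs, shape (n,p)."""
    P = np.atleast_2d(np.asarray(P, dtype=float))
    Qs = np.atleast_2d(np.asarray(Qs, dtype=float))
    M = (P + Qs) / 2
    return (kl_bits(P, M) + kl_bits(Qs, M)) / 2

# Disjoint, identical, and partially overlapping specimens against P = [1, 0].
divergences = js_bits([[1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])

# Swapping the roles of P and Q gives the same value.
swapped = js_bits([[0.5, 0.5]], [[1.0, 0.0]])
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (distributions with no shared support).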
KL()[source]

Return the Kullback-Leibler divergence (in bits).

Note that this is not symmetrical.

See Wikipedia: Kullback-Leibler divergence.

Returns:n divergence values, one for each specimen.
Return type:numpy.ndarray
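
The asymmetry is easy to demonstrate numerically. The helper below is a hypothetical single-pair version, assuming q is non-zero wherever p is non-zero.

```python
import numpy as np

def kl_bits(p, q):
    """KLD(p||q) in bits for one pair of distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P = [0.5, 0.5]
Q = [0.9, 0.1]

forward = kl_bits(P, Q)   # D(P||Q)
reverse = kl_bits(Q, P)   # D(Q||P), a different value
self_div = kl_bits(P, P)  # zero: a distribution does not diverge from itself
```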
bnol.information.Entropy(distributions)[source]

Calculate information entropy (in bits) for a set of distributions. Wrapper for scipy.stats.entropy(distributions.T) with base=2.

Parameters:distributions (numpy.ndarray) – shape (n,p) representing relative frequencies of the p features across n specimens.
Returns:n information entropy values, one for each specimen.
Return type:numpy.ndarray
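
As a rough pure-NumPy equivalent of the wrapper described above (the function name is hypothetical, and rows are normalized first, as scipy.stats.entropy() does with its input):

```python
import numpy as np

def entropy_bits(distributions):
    """Row-wise Shannon entropy in bits for an (n,p) array."""
    d = np.atleast_2d(np.asarray(distributions, dtype=float))
    probs = d / d.sum(axis=1, keepdims=True)
    # Treat 0 * log2(0) as 0 by only taking logs of the non-zero entries.
    logs = np.zeros_like(probs)
    np.log2(probs, out=logs, where=probs > 0)
    return -(probs * logs).sum(axis=1)

specimens = np.array([
    [0.25, 0.25, 0.25, 0.25],  # uniform over 4 features: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # two equally likely features: 1 bit
    [1.0, 0.0, 0.0, 0.0],      # a single certain feature: 0 bits
])
entropies = entropy_bits(specimens)
```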

bnol.utility

General helper functions used throughout modules but likely of use in other settings.

bnol.utility.BooleanListIndexing(listVals, included=None)[source]

Mimic numpy indexing by boolean flags, i.e. return those values of listVals for which the corresponding value in included is truthy.

If included is None then return the original list.

Parameters:
  • listVals (list) – values from which we wish to return a subset.
  • included (list) – whether or not to return the corresponding value in listVals.
Returns:

numpy equivalent of listVals[included==True]

Return type:

list

Raises:

Exception – if lengths of listVals and included differ.
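
The documented behaviour can be approximated in a few lines of plain Python. The function below is a hypothetical re-implementation for illustration. It raises ValueError, a subclass of Exception, which is an assumption about the exact exception type.

```python
def boolean_list_indexing(list_vals, included=None):
    """Return the values of list_vals whose flag in included is truthy."""
    if included is None:
        return list_vals
    if len(list_vals) != len(included):
        raise ValueError("list_vals and included must have the same length")
    return [value for value, keep in zip(list_vals, included) if keep]

genes = ["BRCA1", "TP53", "EGFR"]
kept = boolean_list_indexing(genes, [True, False, 1])  # 1 is truthy
unchanged = boolean_list_indexing(genes)
```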

bnol.utility.DiscreteUniform(p)[source]

Generate a numpy array of equal-value floats representing a discrete uniform distribution.

Parameters:p (int) – the number of possible outcomes for the distribution; values are cast as int(p).
Returns:shape (1,p) where all values are equal to 1/p.
Return type:numpy.ndarray
Raises:Exception – if p, after being cast to an integer, is less than one.
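
A minimal sketch of the described behaviour, using a hypothetical function name and ValueError as an assumed exception type:

```python
import numpy as np

def discrete_uniform(p):
    """Return a (1,p) array in which every value is 1/p."""
    p = int(p)
    if p < 1:
        raise ValueError("p must be at least one")
    return np.full((1, p), 1.0 / p)

uniform = discrete_uniform(4)
```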
bnol.utility.Normalize(freqs)[source]

Probability-normalize features for a set of specimens (i.e. divide each by their total such that they sum to one).

Parameters:freqs (numpy.ndarray) – shape (n,p) defining relative frequencies of p features in each of n specimens.
Returns:shape (n,p) where the sum across axis=1 is one for all n specimens.
Return type:numpy.ndarray
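
In NumPy terms, the normalization amounts to dividing each row by its own total. The function name here is hypothetical and every row is assumed to have a non-zero sum.

```python
import numpy as np

def normalize_rows(freqs):
    """Scale each row of an (n,p) array so that it sums to one."""
    freqs = np.asarray(freqs, dtype=float)
    # keepdims=True keeps the totals as a column so they broadcast per row.
    return freqs / freqs.sum(axis=1, keepdims=True)

counts = np.array([[2, 2, 4], [1, 0, 3]])
normalized = normalize_rows(counts)
```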
bnol.utility.VectorToMatrix(vec)[source]

Most BNoL functions are developed to work on matrices in which each row represents a specimen. In some cases we may wish to pass a single specimen defined as a one-dimensional numpy.ndarray, and this will cause problems wherever ndarray.shape[1] is used.

Expand dimensions for such vectors such that they have shape (1,p) instead of (p,). Leave matrices unchanged.

Parameters:vec (numpy.ndarray) – either a (n,p) matrix defining n specimens or a (p,) vector defining a single specimen.
Returns:the original object if it was an (n,p) matrix or an expanded-dimension (1,p) vector of the single-specimen values.
Return type:numpy.ndarray
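
The dimension expansion can be sketched as follows. The function name is hypothetical and only one- and two-dimensional inputs are considered.

```python
import numpy as np

def vector_to_matrix(vec):
    """Give a (p,) vector shape (1,p); return matrices unchanged."""
    vec = np.asarray(vec)
    return vec[np.newaxis, :] if vec.ndim == 1 else vec

single = vector_to_matrix(np.array([0.2, 0.3, 0.5]))
matrix = vector_to_matrix(np.ones((4, 3)))
```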

bnol.workflows

class bnol.workflows.CuffnormInformativeGenes(cuffnormOutputPath, classes)[source]

Bases: bnol.workflows.PandasInformativeGenes, bnol.workflows.CuffnormReader

Convenience wrapper for PandasInformativeGenes to automatically load data from cuffnorm gene and FPKM counts.

Parameters:
  • cuffnormOutputPath (string) – path to cuffnorm output
  • classes (numpy.ndarray) – shape (n,) defining categorical designation of each specimen present in cuffnormOutputPath
class bnol.workflows.CuffnormMultiClass(cuffnormOutputPath, multiClasses)[source]

Bases: bnol.workflows.PandasMultiClass, bnol.workflows.CuffnormReader

Convenience wrapper for PandasMultiClass to automatically load data from cuffnorm gene and FPKM counts.

Parameters:
  • cuffnormOutputPath (string) – path to cuffnorm output
  • multiClasses (list) – length (n) class designation for each of the n specimens present in cuffnormOutputPath; can include any type that may be used as a dict key.
class bnol.workflows.CuffnormReader[source]

Bases: object

Class to abstract conversion of cuffnorm output to Pandas DataFrame.

static getSpecimens(cuffnormOutputPath)[source]
bnol.workflows.EnsemblLookup(ensemblIDs, lookupFormat='full', rebuildCache=False)[source]

Use the Ensembl REST API ‘lookup’ to return data corresponding to a particular ID. Will create a local cache.

http://rest.ensembl.org/documentation/info/lookup

Parameters:
  • ensemblIDs (list or string) – Ensembl IDs in a list; will accept a string.
  • lookupFormat (string) – One of ‘full’ or ‘condensed’ as described in the API documentation.
  • rebuildCache (bool) – If True, fetch fresh data from Ensembl even if a locally-cached version exists.
Returns:

The JSON data returned by the API, converted to a dict. Returns a single dict if a single string ID is passed.

Return type:

list or dict

Raises:

Exception – if an invalid format string is passed.

class bnol.workflows.PandasInformativeGenes(specimens, classes)[source]

Bases: object

Binary comparison of sub-classes of specimens. In the case of multiple specimen classes, expected use is in a one-vs-rest fashion.

Provides standardized access to means of ranking genes e.g. through entropy improvement via Discretize.

Parameters:
  • specimens (pandas.DataFrame) – shape (n,p) where the n index values constitute the specimens and the p columns the genes.
  • classes (numpy.ndarray) – shape (n,) class designation for each of the n specimens.
Raises:

Exception – if there is not exactly one class designation for each specimen.

informativeGenes(allGenes=False)[source]

Determine genes considered to be informative by means of decreasing class entropy. Convenience wrapper for Discretize.

Will additionally provide an optimal threshold for determining over- vs under-expression between the two classes. The proportion of specimens in each class, considered over-expressed by this threshold, will also be provided.

Parameters:allGenes (bool) – if True, return all genes; otherwise only return those genes for which the entropy gain is greater than the MDLP criterion.
Returns:details regarding genes, ranked in descending order of the amount by which the entropy gain exceeds the MDLP criterion.
Return type:pandas.DataFrame
class bnol.workflows.PandasMultiClass(specimens, multiClasses)[source]

Bases: bnol.workflows.PandasInformativeGenes

Convenience wrapper to run PandasInformativeGenes analyses on multi-class data by running c different one-vs-rest analyses, where c is the total number of classes.

Parameters:
  • specimens (pandas.DataFrame) – shape (n,p) where the n index values constitute the specimens and the p columns the genes.
  • multiClasses (list) – length (n) class designation for each of the n specimens; can include any type that may be used as a dict key.
informativeGenes(allGenes=False)[source]

See bnol.workflows.PandasInformativeGenes.informativeGenes for base behaviour, repeated for each class.

Parameters:allGenes (bool) – if True, return all genes; otherwise only return those genes for which the entropy gain is greater than the MDLP criterion.
Returns:dict with one entry per one-vs-rest analysis, each entry being the pandas.DataFrame returned by that analysis.
Return type:dict
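
The one-vs-rest decomposition can be illustrated without pandas. The sketch below builds one binary labelling per class, assuming (this is not confirmed by the documentation) that results are keyed by class. The function name is hypothetical.

```python
def one_vs_rest_labels(multi_classes):
    """Map each distinct class to a per-specimen 'is this class' flag list."""
    distinct = dict.fromkeys(multi_classes)  # preserves first-seen order
    return {c: [label == c for label in multi_classes] for c in distinct}

labels = one_vs_rest_labels(["tumour", "normal", "tumour", "metastasis"])
```

Each of the c binary labellings could then be analysed separately in the manner of PandasInformativeGenes.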