BNoL

Source code available on GitHub

BNoL (pron. bee-noll) is a Python library for discrete feature selection with a primary focus on transcriptomic (gene-expression) analysis. Transcriptomics often calls for the discovery of specific features (biomarkers) rather than the weighted combinations of features produced by feature extraction methods such as PCA.

Although documented and tested, the code is very much in its infancy and function interfaces are open to change. Unit tests currently focus on sanity checks (e.g. dimensions of matrices / vectors) and very simple hand-worked expected results.

BNoL stands for Bare Necessities of Life, inspired by the line in the Jungle Book song: “Old Mother Nature’s recipes; that bring the bare necessities of life”.

Example Usage

Differential gene expression can be assessed by information-theoretic means with bnol.information.Discretize which, for every gene, finds the expression threshold that minimises the class entropy of the specimens above and below the threshold. Classes are categorical and may take two or more values; e.g. cancer vs normal, or treatment vs control. A threshold that perfectly segregates classes will result in zero entropy.
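The idea of an entropy-minimising threshold can be illustrated with a minimal sketch in plain Python. This is not BNoL's implementation: the function names `entropy` and `best_threshold` are hypothetical, and placing candidate thresholds midway between adjacent sorted expression values is an assumption made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) for the cut that minimises the
    size-weighted class entropy of the specimens on either side of it."""
    pairs = sorted(zip(values, labels))
    total = len(pairs)
    best = (None, float("inf"))
    for i in range(1, total):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point exists between identical values
        below = [c for _, c in pairs[:i]]
        above = [c for _, c in pairs[i:]]
        weighted = (len(below) * entropy(below)
                    + len(above) * entropy(above)) / total
        if weighted < best[1]:
            best = ((lo + hi) / 2, weighted)  # assumed: midpoint cut
    return best


# Toy data (made up for illustration): one gene, six specimens, two classes.
expression = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
classes = ["normal"] * 3 + ["cancer"] * 3
threshold, residual = best_threshold(expression, classes)
print(threshold, residual)
```

In this toy case the two classes separate perfectly, so the remaining entropy at the chosen threshold is zero.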

Automated workflows ease the process of analysis by, for example, importing directly from Cufflinks as below. Note that the same workflows can be used for any ordinal data by directly utilising bnol.workflows.PandasInformativeGenes.

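from bnol import workflows
# One class label per specimen: 30 + 29 + 30 + 30 = 119 specimens across four tissues
specimenClasses = ['TissueA']*30 + ['TissueB']*29 + ['TissueC']*30 + ['TissueD']*30
# Load the gene count table from the test data together with the class labels
analysis = workflows.CuffnormInformativeGenes('tests/data/genes.count_table', specimenClasses)
# allGenes=True includes genes that are not considered informative in the output
genes = analysis.informativeGenes(allGenes=True)
print(genes)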

The decrease in entropy, or the gain as described by Fayyad and Irani [1], is used to determine whether or not each gene is considered informative as per the minimum description length principle (MDLP), which is a formalised Occam’s razor. Output columns include the entropy Gain, determined at the optimal Threshold that defines the cutoff for classification as over- or under-expression. The Gain must exceed the MDLP-Criterion [1] for a gene to be considered Informative. Output is ranked by (Gain - MDLP-Criterion), in descending order, and entropy values are base-2 (bits). For each category included in classes, the proportion of specimens that exceed the expression threshold for the gene in question is also calculated and reported in the corresponding OverExpressed column.

        Informative      Gain  MDLP-Criterion   Threshold  \
GENE_F          1.0  0.769482        0.116362  307.829500   
GENE_D          1.0  0.596045        0.120020  413.201500   
GENE_C          1.0  0.177876        0.138511  184.746000   
GENE_I          1.0  0.168772        0.142998    0.146106   
GENE_G          0.0  0.127129        0.169225    0.802509   
GENE_A          0.0  0.034032        0.110796    0.715290   
GENE_E          0.0  0.040814        0.124415    0.413522   
GENE_H          0.0  0.034888        0.127572    8.038390   
GENE_B          0.0  0.043815        0.173531    2.299720   

        OverExpressed-TissueA  OverExpressed-TissueB  OverExpressed-TissueC  \
GENE_F               1.000000               0.931034               0.033333   
GENE_D               0.033333               0.000000               0.200000   
GENE_C               0.000000               0.482759               0.066667   
GENE_I               0.366667               0.000000               0.400000   
GENE_G               0.166667               0.448276               0.400000   
GENE_A               0.000000               0.000000               0.000000   
GENE_E               0.100000               0.000000               0.033333   
GENE_H               0.000000               0.068966               0.000000   
GENE_B               0.800000               0.655172               0.866667   

        OverExpressed-TissueD  
GENE_F               0.066667  
GENE_D               0.966667  
GENE_C               0.133333  
GENE_I               0.066667  
GENE_G               0.733333  
GENE_A               0.066667  
GENE_E               0.000000  
GENE_H               0.066667  
GENE_B               0.600000  

[1] Fayyad and Irani (1993). Multi-interval discretization of continuous-valued attributes for classification learning.
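For readers who want to see how Gain and the acceptance criterion relate, the following is a minimal sketch of the test as published in [1], for a single fixed threshold. It is an illustration only and not BNoL's own code; the function name `mdlp_gain_and_criterion` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def mdlp_gain_and_criterion(values, labels, threshold):
    """Return (gain, criterion) for one cut, following Fayyad and Irani."""
    above = [c for v, c in zip(values, labels) if v > threshold]
    below = [c for v, c in zip(values, labels) if v <= threshold]
    n = len(labels)
    ent_all, ent_above, ent_below = (entropy(labels), entropy(above),
                                     entropy(below))
    # Gain: class entropy before the cut minus the weighted entropy after it.
    gain = ent_all - (len(above) * ent_above + len(below) * ent_below) / n
    # Number of distinct classes overall and on each side of the cut.
    k, k_above, k_below = len(set(labels)), len(set(above)), len(set(below))
    delta = math.log2(3 ** k - 2) - (k * ent_all
                                     - k_above * ent_above
                                     - k_below * ent_below)
    criterion = (math.log2(n - 1) + delta) / n
    return gain, criterion


# Toy data (made up for illustration): one gene, six specimens, two classes.
expression = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
classes = ["normal"] * 3 + ["cancer"] * 3
gain, criterion = mdlp_gain_and_criterion(expression, classes, threshold=6.5)
print(gain, criterion, gain > criterion)
```

A cut is accepted only when the gain exceeds the criterion, which grows with the number of classes and shrinks as the number of specimens increases.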

Compatibility

Continuous integration is performed against the following versions of Python (see BNoL testing results):

  • 2.7
  • 3.3
  • 3.4
  • 3.5

Acknowledgements

This work was made possible by funding from, in alphabetical order: