BNoL
Source code available on GitHub
BNoL (pron. bee-noll) is a Python library for discrete feature selection with a primary focus on transcriptomic (gene-expression) analysis. Transcriptomics often calls for the discovery of specific features (biomarkers) rather than the weighted combinations of features produced by feature-extraction methods such as PCA.
Although documented and tested, the code is very much in its infancy and function interfaces are open to change. Unit tests are currently focussed on sanity (e.g. dimensions of matrices / vectors) and very simple hand-worked expected results.
BNoL stands for Bare Necessities of Life, inspired by the line in the Jungle Book song: “Old Mother Nature’s recipes; that bring the bare necessities of life”.
Example Usage
Differential gene expression can be determined optimally via information-theoretic means with bnol.information.Discretize. For every gene, it determines the expression threshold that minimises the entropy of the specimen classes above and below that threshold. Classes are categorical and may take two or more values, e.g. cancer vs normal, or treatment vs control. A threshold that perfectly segregates the classes results in zero entropy.
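The idea of an entropy-minimising threshold can be illustrated with a minimal, standalone sketch. This is not BNoL's implementation and does not use its API: the function names class_entropy and best_threshold are hypothetical, and taking candidate thresholds at midpoints between consecutive distinct expression values is an assumption of this sketch.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1/p) summed over classes; a pure partition gives 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def best_threshold(expression, labels):
    """Return (threshold, weighted_entropy) for the split with minimum entropy.

    Hypothetical helper: candidate thresholds are midpoints between
    consecutive distinct expression values (an assumption of this sketch).
    Returns (None, entropy_of_all_labels) if no split improves on no split.
    """
    pairs = sorted(zip(expression, labels), key=lambda pair: pair[0])
    values = [value for value, _ in pairs]
    classes = [label for _, label in pairs]
    n = len(pairs)
    best = (None, class_entropy(classes))
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # no threshold can separate identical values
        below, above = classes[:i], classes[i:]
        # Entropy of each side, weighted by the share of specimens on that side
        weighted = (len(below) * class_entropy(below)
                    + len(above) * class_entropy(above)) / n
        if weighted < best[1]:
            best = ((values[i - 1] + values[i]) / 2, weighted)
    return best


# Toy data: one gene measured in three normal and three cancer specimens
expression = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ['normal'] * 3 + ['cancer'] * 3
print(best_threshold(expression, labels))  # -> (6.5, 0.0)
```

In the toy data the two classes are perfectly separated by a threshold of 6.5, so the weighted entropy of the split is zero.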
Automated workflows ease the process of analysis by, for example, importing directly from Cufflinks as below. Note that the same workflows
can be used for any ordinal data by directly utilising bnol.workflows.PandasInformativeGenes.
from bnol import workflows
specimenClasses = ['TissueA']*30 + ['TissueB']*29 + ['TissueC']*30 + ['TissueD']*30
analysis = workflows.CuffnormInformativeGenes('tests/data/genes.count_table', specimenClasses)
genes = analysis.informativeGenes(allGenes=True)
print(genes)
The decrease in entropy, or the gain as described by Fayyad and Irani[1], is used to determine whether or not each gene is considered informative as per the minimum description length principle (MDLP), a formalisation of Occam’s razor. Output columns include the entropy Gain, determined at the optimal Threshold that defines the cutoff for classification as over- or under-expression. The gain must exceed the MDLP-Criterion [1] for the gene to be considered Informative. Output is ranked by (Gain - MDLP-Criterion) in descending order, and entropy values are base-2 (bits). For each category included in classes, the proportion of specimens that exceed the expression threshold for the gene in question is also calculated.
        Informative      Gain  MDLP-Criterion   Threshold  \
GENE_F          1.0  0.769482        0.116362  307.829500
GENE_D          1.0  0.596045        0.120020  413.201500
GENE_C          1.0  0.177876        0.138511  184.746000
GENE_I          1.0  0.168772        0.142998    0.146106
GENE_G          0.0  0.127129        0.169225    0.802509
GENE_A          0.0  0.034032        0.110796    0.715290
GENE_E          0.0  0.040814        0.124415    0.413522
GENE_H          0.0  0.034888        0.127572    8.038390
GENE_B          0.0  0.043815        0.173531    2.299720

        OverExpressed-TissueA  OverExpressed-TissueB  OverExpressed-TissueC  \
GENE_F               1.000000               0.931034               0.033333
GENE_D               0.033333               0.000000               0.200000
GENE_C               0.000000               0.482759               0.066667
GENE_I               0.366667               0.000000               0.400000
GENE_G               0.166667               0.448276               0.400000
GENE_A               0.000000               0.000000               0.000000
GENE_E               0.100000               0.000000               0.033333
GENE_H               0.000000               0.068966               0.000000
GENE_B               0.800000               0.655172               0.866667

        OverExpressed-TissueD
GENE_F               0.066667
GENE_D               0.966667
GENE_C               0.133333
GENE_I               0.066667
GENE_G               0.733333
GENE_A               0.066667
GENE_E               0.000000
GENE_H               0.066667
GENE_B               0.600000
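The acceptance test behind the Informative column can be sketched for a single candidate split using the criterion as published by Fayyad and Irani[1]: a split of N specimens is accepted when Gain > (log2(N - 1) + Δ) / N, where Δ = log2(3^k - 2) - (k·Ent(S) - k1·Ent(S1) - k2·Ent(S2)) and k, k1 and k2 count the distinct classes in the whole set and in each partition. The function names below (entropy_bits, mdlp_split) are hypothetical and the sketch is independent of BNoL's own code, so it is not guaranteed to reproduce the values in the table above.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy, in bits, of a non-empty sequence of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(labels).values())


def mdlp_split(labels_below, labels_above):
    """Return (gain, criterion, informative) for one candidate split.

    Hypothetical helper following the published Fayyad-Irani criterion;
    both partitions are assumed to be non-empty.
    """
    everything = list(labels_below) + list(labels_above)
    n = len(everything)
    # Number of distinct classes overall and on each side of the threshold
    k, k1, k2 = (len(set(part))
                 for part in (everything, labels_below, labels_above))
    ent, ent1, ent2 = (entropy_bits(part)
                       for part in (everything, labels_below, labels_above))
    # Gain: entropy before the split minus the weighted entropy after it
    gain = ent - (len(labels_below) * ent1 + len(labels_above) * ent2) / n
    delta = math.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    criterion = (math.log2(n - 1) + delta) / n
    return gain, criterion, gain > criterion


# A threshold that separates 10 normal from 10 cancer specimens perfectly
gain, criterion, informative = mdlp_split(['normal'] * 10, ['cancer'] * 10)
print(gain, criterion, informative)

# A threshold that leaves both sides evenly mixed carries no information
print(mdlp_split(['normal', 'cancer'], ['normal', 'cancer']))
```

The first split gains a full bit of entropy and comfortably exceeds its criterion, whereas the second gains nothing and is rejected.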
Compatibility
Continuous integration is performed against the following versions of Python (See BNoL testing results):
- 2.7
- 3.3
- 3.4
- 3.5
Acknowledgements
This work was made possible by funding from, in alphabetical order:
- Cambridge Trust
  - Cambridge Trust Scholarship
- The Royal College of Pathologists of Australasia
  - RCPA Foundation Mike and Carole Ralston Travelling Fellowship
- The University of Sydney Travelling Scholarships
  - Charles Gilbert Heydon Travelling Fellowship in Biological Sciences
  - Eleanor Sophia Wood Postgraduate Research Travelling Scholarship