Making Sense from Sequence

cogent3 is a Python library for the analysis of biological sequence data. We endeavour to provide a first-class experience within Jupyter notebooks, but the algorithms also support parallel execution on compute systems with thousands of processors.

Check out the other tabs on this page for installation instructions and highlights of what you can do with cogent3. See the links at the top of the page for an image gallery and detailed user guides.

For most uses, we recommend installation with the “extra” dependencies as these add support for visualisation and Jupyter notebooks.

pip install "cogent3[extra]"

On HPC systems, use the plain installation without the extra dependencies.

pip install cogent3
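To confirm which installation is active in the current environment, the following minimal sketch uses only the Python standard library. The helper names are illustrative, and treating plotly as a marker for the “extra” dependencies is an assumption rather than something stated on this page.

```python
from importlib import metadata
from importlib.util import find_spec
from typing import Optional


def installed_version(package: str = "cogent3") -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_module(name: str) -> bool:
    """Report whether a top-level module is importable, without importing it."""
    return find_spec(name) is not None


version = installed_version()
print(f"cogent3: {version or 'not installed'}")
# Assumption: plotly is one of the packages pulled in by "cogent3[extra]".
print(f"plotly available: {has_module('plotly')}")
```

Running this inside the environment used by Jupyter (or by the HPC job) shows whether the interpreter in use is the one pip installed into.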

cogent3 provides an extensive suite of capabilities for manipulating and analysing sequence data. These include reading standard biological data formats, manipulating sequences by their annotations, performing multiple sequence alignment (see the app docs) using any of our substitution models, phylogenetic reconstruction and tree manipulation, manipulation of tabular data, visualisation of phylogenies (see the image gallery), and much more.
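As a small illustration of a couple of these capabilities, the sketch below builds an alignment from in-memory data and estimates a quick tree from it. The toy sequences and the function name are made up for illustration, and it assumes a cogent3 release in which make_aligned_seqs() accepts a moltype argument and Alignment.quick_tree() works with its default settings.

```python
# Toy data, invented for this example: four aligned DNA sequences of equal length.
EXAMPLE_SEQS = {
    "Human": "ATGAAACCCGGGTTTACG",
    "Chimp": "ATGAAACCCGGGTTTACA",
    "Mouse": "ATGAGACCTGGGTATACG",
    "Rat": "ATGAGACCTGGCTATACA",
}


def summarise_alignment(data: dict) -> dict:
    """Build a DNA alignment from a name -> sequence mapping and summarise it."""
    # Imported inside the function so this module still loads where
    # cogent3 is not installed.
    from cogent3 import make_aligned_seqs

    aln = make_aligned_seqs(data, moltype="dna")
    # A quick distance-based tree using the library's default settings.
    tree = aln.quick_tree()
    return {
        "num_seqs": aln.num_seqs,
        "length": len(aln),
        "tips": sorted(tree.get_tip_names()),
    }
```

Calling summarise_alignment(EXAMPLE_SEQS) returns the number of sequences, the alignment length and the tip names of the estimated tree.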

🎬 Data wrangling with sequence annotations

Differences in the frequency of nucleotides between species are common. In such cases, non-reversible models of sequence evolution are required for robust estimation of important quantities such as branch lengths, or for measuring natural selection [1, 2] (see using non-stationary models). We have done more than just invent these new methods: we have established the most robust algorithms [3] for their implementation and their suitability for real data [4].

🎬 Testing a hypothesis involving a non-stationary nucleotide process
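A hedged sketch of such a test follows. It compares a time-reversible model (GTR) against the general nucleotide model (GN) with a likelihood ratio test, using the "model" and "hypothesis" apps. The function name is hypothetical, the keyword arguments are those we understand the apps to accept, and optimiser settings are left at their defaults, so treat it as a starting point rather than a recommended analysis.

```python
def compare_gtr_vs_gn(aln, tree=None):
    """Fit GTR (null) and GN (alternate) to `aln` and compare them.

    `aln` is a cogent3 DNA alignment. `tree` is a tree (or Newick string)
    whose tip names match the sequence names; with exactly three sequences
    it can be left as None.
    """
    # Imported inside the function so this module still loads where
    # cogent3 is not installed.
    from cogent3 import get_app

    # Null: time-reversible, stationary process.
    null = get_app("model", "GTR", tree=tree, optimise_motif_probs=True)
    # Alternate: general (non-reversible, non-stationary) nucleotide process.
    alt = get_app("model", "GN", tree=tree, optimise_motif_probs=True)
    # The hypothesis app fits both models to the same alignment.
    hypothesis = get_app("hypothesis", null, alt)
    return hypothesis(aln)
```

The returned object should expose the likelihood ratio statistic, degrees of freedom and p-value. Check the result before using it, since cogent3 apps report failures by returning a NotCompleted object instead of raising.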

You don’t have to be an expert in structural programming languages (like Python) to use cogent3! Interactive usage in Jupyter notebooks and a functional programming style interface lower the barrier to entry. Individuals comfortable with R should find this interface less complex. (See the cogent3.app documentation.)

🎬 Using cogent3 apps
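To give a flavour of the functional style, the sketch below composes two apps into a single callable and applies it to an alignment. It assumes the omit_degenerates and take_codon_positions apps are available under those names with the arguments shown. The toy data and function name are invented for illustration.

```python
# Toy data, invented for this example: three in-frame DNA sequences.
CODING_SEQS = {
    "a": "ATGAAACCCGGG",
    "b": "ATGAAGCCTGGA",
    "c": "ATGAAACCAGGT",
}


def third_codon_positions(data: dict):
    """Drop codons containing ambiguous characters, then keep third positions."""
    # Imported inside the function so this module still loads where
    # cogent3 is not installed.
    from cogent3 import get_app, make_aligned_seqs

    aln = make_aligned_seqs(data, moltype="dna")
    # Each app is a configured function ...
    no_degenerates = get_app("omit_degenerates", moltype="dna", motif_length=3)
    third_positions = get_app("take_codon_positions", 3)
    # ... and apps with compatible types compose with "+".
    process = no_degenerates + third_positions
    return process(aln)
```

The composed process behaves like any other function, so the same object can be applied to one alignment after another without rewriting the individual steps.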

cogent3 has a plugin architecture that allows third-party packages to extend its capabilities. Plugins integrate seamlessly: users access new functionality through familiar cogent3 methods without changing their workflow. Plugins can provide hook-style computation backends (e.g. piqtree for phylogenetic inference via Alignment.quick_tree()), Rust-based k-mer counting (via cogent3-pykmertools), new formats for reading and writing sequences, alternate storage backends such as cogent3-h5seqs for HDF5-compressed sequence collections (see third-party storage), and custom annotation database backends. Want to write a plugin? Get in touch.

🆕 Features & 📣 Announcements

🆕 Drawing genome annotations

The new cogent3.draw_annotations() function allows drawing genomic features from the annotation database alone. Check out the new section in the Gallery.

📣 The cogent3 code-sharing site

Share your cogent3 ecosystem code solutions for others to benefit from your awesomeness 😎. Click the “Code Sharing” link at the top of this page to read more.

📣 The diverse-seq package has been rewritten in Rust 🚀!

The sequence sampling tool diverse-seq, which provides multiple apps for sampling representative sequences, just got faster! The performance-critical code has been rewritten in Rust. Give it a try 😀.

🆕 Improved import performance 🎉

The import cogent3 statement is now much faster! Previously, this statement would also trigger imports of many of our dependencies. Give it a try and report any issues you encounter.


Citations

[1]

Benjamin D Kaehler, Von Bing Yap, Rongli Zhang, and Gavin A Huttley. Genetic distance for a general non-stationary Markov substitution process. Systematic Biology, 64:281–93, 2015. URL: https://www.ncbi.nlm.nih.gov/pubmed/25503772.

[2]

Benjamin D Kaehler, Von Bing Yap, and Gavin A Huttley. Standard codon substitution models overestimate purifying selection for non-stationary data. Genome Biology and Evolution, 9:134–149, 2017. URL: https://www.ncbi.nlm.nih.gov/pubmed/28175284.

[3]

Harold W Schranz, Von Bing Yap, Simon Easteal, Rob Knight, and Gavin A Huttley. Pathological rate matrices: from primates to pathogens. BMC Bioinformatics, 9:550, 2008. URL: https://www.ncbi.nlm.nih.gov/pubmed/19099591.

[4]

Klara L Verbyla, Von Bing Yap, Anuj Pahwa, Yunli Shao, and Gavin A Huttley. The embedding problem for Markov models of nucleotide substitution. PLoS ONE, 8:e69187, 2013. URL: https://pubmed.ncbi.nlm.nih.gov/23935949/.