Removing degenerate characters

A degenerate IUPAC base symbol marks a site position whose state could be any one of several characters. In DNA, for example, “Y” stands for a pyrimidine, so the site can be either “C” or “T”.

Note

In many molecular evolutionary and phylogenetic analyses, the gap character “-” is treated as “N”, meaning any base.
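As a rough illustration, the standard IUPAC nucleotide ambiguity codes can be written as a lookup table. The names `IUPAC_DNA_AMBIGUITY` and `possible_bases` are hypothetical helpers for this sketch and are not part of cogent3. The `gap_as_any` flag mirrors the convention described in the note, not the IUPAC standard itself.

```python
# Standard IUPAC ambiguity codes for DNA, mapped to the bases they can represent.
IUPAC_DNA_AMBIGUITY = {
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG",
    "W": "AT",
    "K": "GT",
    "M": "AC",
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def possible_bases(symbol, gap_as_any=True):
    """Return the set of bases a single DNA symbol may stand for.

    Hypothetical helper: gaps are expanded to any base only when
    gap_as_any is True, following the convention in the note above.
    """
    symbol = symbol.upper()
    if symbol == "-":
        return set("ACGT") if gap_as_any else {"-"}
    # Canonical bases are not in the table, so they map to themselves.
    return set(IUPAC_DNA_AMBIGUITY.get(symbol, symbol))


print(sorted(possible_bases("Y")))
print(sorted(possible_bases("-")))
```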

Let’s create a sample alignment that contains a gap and a degenerate character.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs({"s1": "ACGA-GACG", "s2": "GATGATGYT"}, moltype="dna")
aln

	0
s1	ACGA-GACG
s2	GATGATGYT

2 x 9 dna alignment

Omit aligned columns containing a degenerate character

from cogent3 import get_app

# With the default settings, gaps count as degenerate, so gap columns are removed too
omit_degens = get_app("omit_degenerates", moltype="dna")
result = omit_degens(aln)
result

	0
s1	ACGAGAG
s2	GATGTGT

2 x 7 dna alignment

Omit all degenerate characters except gaps from an alignment

If we create the app with the argument gap_is_degen=False, we can omit degenerate characters but retain gaps.

from cogent3 import get_app

omit_degens_keep_gaps = get_app("omit_degenerates", moltype="dna", gap_is_degen=False)
result = omit_degens_keep_gaps(aln)
result

	0
s2	GATGATGT
s1	ACGA-GAG

2 x 8 dna alignment
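To make the two column-filtering modes concrete, here is a minimal plain-Python sketch of the idea. It is not how cogent3 implements the app: `drop_degenerate_columns` is a hypothetical function that works on a dict of equal-length strings, and it assumes that only A, C, G and T count as non-degenerate DNA characters.

```python
def drop_degenerate_columns(seqs, gap_is_degen=True, canonical="ACGT", gap="-"):
    """Remove every aligned column that holds a non-canonical character.

    Hypothetical illustration only. `seqs` maps sequence names to
    equal-length aligned strings.
    """
    allowed = set(canonical)
    if not gap_is_degen:
        # Retain gap columns by treating the gap as an allowed character.
        allowed.add(gap)

    names = list(seqs)
    # zip(*...) walks the alignment one column at a time.
    columns = zip(*(seqs[name].upper() for name in names))
    kept = [col for col in columns if set(col) <= allowed]

    # Rebuild each sequence from the columns that survived.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


data = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(drop_degenerate_columns(data))
print(drop_degenerate_columns(data, gap_is_degen=False))
```

On the sample data this sketch yields the same sequences as the two app results shown above.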

Omit k-mers which contain degenerate characters

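If we create omit_degenerates with the argument motif_length, the app splits the aligned sequences into non-overlapping tuples of the specified length and excludes any tuple that contains a degenerate character. Here gap_is_degen keeps its default value, so tuples containing a gap are excluded too.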

from cogent3 import get_app

omit_degenerates_app = get_app("omit_degenerates", moltype="dna", motif_length=2)
result = omit_degenerates_app(aln)

result

	0
s1	ACGA
s2	GATG

2 x 4 dna alignment
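The tuple-wise filtering can be sketched in plain Python as well. Again, `drop_degenerate_motifs` is a hypothetical function, not cogent3 code. It assumes that gaps count as degenerate and that a trailing partial tuple is discarded, which is consistent with the 4-column result above, where the ninth column does not appear.

```python
def drop_degenerate_motifs(seqs, motif_length, canonical="ACGT"):
    """Remove non-overlapping tuples that contain a non-canonical character.

    Hypothetical illustration only. `seqs` maps sequence names to
    equal-length aligned strings. Gaps are treated as degenerate here.
    """
    allowed = set(canonical)
    length = len(next(iter(seqs.values())))
    # Assumption: columns that do not fill a complete tuple are dropped.
    usable = length - length % motif_length

    keep = []
    for start in range(0, usable, motif_length):
        end = start + motif_length
        # A tuple is kept only if it is clean in every sequence.
        if all(set(seq[start:end].upper()) <= allowed for seq in seqs.values()):
            keep.append(start)

    return {
        name: "".join(seq[start:start + motif_length] for start in keep)
        for name, seq in seqs.items()
    }


data = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(drop_degenerate_motifs(data, motif_length=2))
```

With `motif_length=2`, the tuples starting at columns 4 and 6 are dropped because they contain “-” and “Y” respectively, which leaves the first two tuples of each sequence.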