Loading unaligned sequence data

We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.

Loading unaligned protein sequences from a single fasta file

In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format="fasta").

from cogent3 import get_app

load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
                   0
1091044_fragment   IPLDFDKEFRDKTVVIVAIPGAFTPT
13541053_fragment  KKKNTEVISVSEDTVYVHKAWVQYD
15605725_fragment  FEILAINMDPENLTGFLKNNP

3 x {min=21, median=25, max=26} protein sequence collection
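To make the summary line concrete, the following plain-Python sketch recomputes the count and the min/median/max lengths from the three sequences in the output shown above. It is only an illustration of what the numbers mean; it does not use cogent3 and is not how the library produces its summary.

```python
from statistics import median

# Names and sequences copied from the output shown above
seqs = {
    "1091044_fragment": "IPLDFDKEFRDKTVVIVAIPGAFTPT",
    "13541053_fragment": "KKKNTEVISVSEDTVYVHKAWVQYD",
    "15605725_fragment": "FEILAINMDPENLTGFLKNNP",
}

# Sequence lengths in ascending order
lengths = sorted(len(s) for s in seqs.values())

summary = (
    f"{len(seqs)} x {{min={lengths[0]}, "
    f"median={median(lengths)}, max={lengths[-1]}}}"
)
print(summary)  # 3 x {min=21, median=25, max=26}
```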

Loading unaligned DNA sequences from multiple fasta files

To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of our apps of interest.

1. A data store that identifies the files we are interested in

Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory, limiting the data store to two members to keep the example minimal.

from cogent3 import get_app, open_data_store

fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
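As a rough analogy for what "identifying files by suffix, with a limit" means, here is a standard-library sketch. The helper find_members is hypothetical and is not part of cogent3. The sorted order, and therefore which members survive the limit, is an assumption made for this illustration; a real data store may choose differently.

```python
import tempfile
from itertools import islice
from pathlib import Path


def find_members(directory, suffix, limit=None):
    """Return up to `limit` paths in `directory` ending in `.suffix`.

    Hypothetical helper for illustration only; not a cogent3 function.
    """
    matches = sorted(Path(directory).glob(f"*.{suffix}"))
    # islice with limit=None yields every match
    return list(islice(matches, limit))


# Build a throwaway directory so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fasta", "b.fasta", "c.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">s1\nACGT\n")
    members = [p.name for p in find_members(tmp, "fasta", limit=2)]

print(members)  # ['a.fasta', 'b.fasta']
```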

2. A composed process that defines our workflow

In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
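For readers unfamiliar with the idea, a k-mer-based Jaccard distance can be sketched in plain Python as one minus the fraction of k-mers shared between two sequences. The functions, the toy sequences, and the choice of k=3 below are all assumptions made for illustration; the jaccard_dist app may differ in its details and defaults.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |intersection| / |union|: 0.0 for identical sets, 1.0 for disjoint."""
    union = a | b
    if not union:
        # Two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy sequences invented for this sketch
s1 = "ACGTACGTGA"
s2 = "ACGTACGTCA"

# Each sequence has 6 distinct 3-mers; 4 are shared, 8 in the union
d = jaccard_distance(kmers(s1, 3), kmers(s2, 3))
print(d)  # 0.5
```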

Note

Apps that are “writers” require a data store to write to. See the documentation on writers to learn more.

out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")

load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format="tsv")

process = load_unaligned_app + jdist + writer

Tip

When running this code on your machine, remember to replace path_to_dir with an actual directory path.

Now we’re good to go! We can apply process to our data store of fasta sequences. The returned result is a data store, which you can index to access individual data members. Calling .read() on a data member returns its contents.

result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
dim-1	dim-2	value
DogFaced	FlyingFox	0.5686327806469149
DogFaced	FreeTaile	0.8466550825369245
DogFaced	LittleBro	0.8740257004423847
DogFaced	TombBat	0.8704297626683771
FlyingFox	DogFaced	0.5686327806469149
FlyingFox	FreeTaile	0.8094075881961393
FlyingFox	LittleBro	0.8354894351013368
FlyingFox	TombBat	0.8446223761090673
FreeTaile	DogFaced	0.8466550825369245
FreeTaile	FlyingFox	0.8094075881961393
FreeTaile	LittleBro	0.7395833333333333
FreeTaile	TombBat	0.7732452142206017
LittleBro	DogFaced	0.8740257004423847
LittleBro	FlyingFox	0.8354894351013368
LittleBro	FreeTaile	0.7395833333333333
LittleBro	TombBat	0.8105378704720088
TombBat	DogFaced	0.8704297626683771
TombBat	FlyingFox	0.8446223761090673
TombBat	FreeTaile	0.7732452142206017
TombBat	LittleBro	0.8105378704720088
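The text returned by .read() is tab-separated, so it can be parsed with the standard library. The sketch below uses a few rows copied from the output above, assuming the fields are separated by tabs as shown. It builds a lookup keyed on name pairs, checks that each pair appears in both orders with the same value, and finds the smallest distance.

```python
import csv
import io

# A subset of rows copied from the output above; fields are tab-separated
text = (
    "dim-1\tdim-2\tvalue\n"
    "DogFaced\tFlyingFox\t0.5686327806469149\n"
    "DogFaced\tTombBat\t0.8704297626683771\n"
    "FlyingFox\tDogFaced\t0.5686327806469149\n"
    "TombBat\tDogFaced\t0.8704297626683771\n"
)

reader = csv.DictReader(io.StringIO(text), delimiter="\t")
dists = {(row["dim-1"], row["dim-2"]): float(row["value"]) for row in reader}

# Every (a, b) entry should have a matching (b, a) entry with the same value
symmetric = all(dists[(b, a)] == v for (a, b), v in dists.items())

# Pair with the smallest distance (first one encountered on ties)
closest = min(dists, key=dists.get)
print(symmetric, closest)  # True ('DogFaced', 'FlyingFox')
```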