Loading unaligned sequence data#
We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.
Loading unaligned protein sequences from a single fasta file#
In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format="fasta").
from cogent3 import get_app
load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
0 | |
1091044_fragment | IPLDFDKEFRDKTVVIVAIPGAFTPT |
13541053_fragment | KKKNTEVISVSEDTVYVHKAWVQYD |
15605725_fragment | FEILAINMDPENLTGFLKNNP |
3 x {min=21, median=25, max=26} protein sequence collection
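To make the summary line above concrete, here is a minimal standard-library sketch of what reading a fasta file involves: pairing each ">" header with the sequence lines that follow it. It is an illustration only, not how cogent3 parses files. The parse_fasta function is hypothetical, and the inline fasta_text is an assumption built from the three sequences displayed above rather than the actual contents of data/inseqs_protein.fasta.

```python
from statistics import median


def parse_fasta(text):
    """Return a dict mapping each sequence name to its sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first whitespace-delimited token of the header
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence data may span several lines, so collect the pieces
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Assumed file contents, reconstructed from the displayed collection
fasta_text = """\
>1091044_fragment
IPLDFDKEFRDKTVVIVAIPGAFTPT
>13541053_fragment
KKKNTEVISVSEDTVYVHKAWVQYD
>15605725_fragment
FEILAINMDPENLTGFLKNNP
"""

records = parse_fasta(fasta_text)
lengths = [len(seq) for seq in records.values()]
summary = (
    f"{len(records)} x "
    f"{{min={min(lengths)}, median={median(lengths)}, max={max(lengths)}}}"
)
print(summary)  # 3 x {min=21, median=25, max=26}
```

The sequences are unaligned, so their lengths differ, which is why the collection reports a length range rather than a single alignment length.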
Loading unaligned DNA sequences from multiple fasta files#
To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of our apps of interest.
1. A data store that identifies the files we are interested in#
Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory, limiting the data store to two members as a minimal example.
from cogent3 import get_app, open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
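As a rough mental model, a read-only data store over a directory can be pictured as a filtered, capped list of file paths. The sketch below uses only the standard library. The find_members function and the file names are hypothetical, and it makes no claim about how cogent3 orders or selects members internally.

```python
import tempfile
from pathlib import Path


def find_members(directory, suffix, limit=None):
    """Return sorted paths in directory ending with .suffix, at most limit."""
    paths = sorted(Path(directory).glob(f"*.{suffix}"))
    return paths if limit is None else paths[:limit]


# Build a throwaway directory so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fasta", "b.fasta", "c.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">seq1\nACGT\n")
    members = [p.name for p in find_members(tmp, suffix="fasta", limit=2)]

print(members)  # ['a.fasta', 'b.fasta']
```

Only files with the requested suffix are considered, and the limit caps how many are kept, which is handy for testing a workflow on a small subset first.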
2. A composed process that defines our workflow#
In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
Note
Apps that are “writers” require a data store to write to. See the documentation on writers to learn more.
out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")
load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format="tsv")
process = load_unaligned_app + jdist + writer
Tip
When running this code on your machine, remember to replace path_to_dir with an actual directory path.
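The + used to build process can be pictured as ordinary function chaining: the output of each step becomes the input of the next. The sketch below shows that idea with a hypothetical Step class. It is not cogent3's implementation, whose apps do considerably more than this, and the three toy steps are stand-ins invented for illustration.

```python
class Step:
    """Hypothetical stand-in for an app: wraps a function and supports +."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __add__(self, other):
        # The composed step runs self first, then passes the result to other
        return Step(lambda value: other(self(value)))


# Toy steps mirroring the load -> compute -> report shape of the workflow
load = Step(lambda text: text.split())
measure = Step(lambda seqs: {s: len(s) for s in seqs})
report = Step(lambda lengths: sorted(lengths.items()))

pipeline = load + measure + report
result = pipeline("ACGT AC ACGTT")
print(result)  # [('AC', 2), ('ACGT', 4), ('ACGTT', 5)]
```

Composing steps this way keeps each one small and reusable, and the order of the operands is the order of execution.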
Now we’re good to go! We can apply process to our data store of fasta sequences. result is a data store, which you can index to see individual data members. We can inspect a given data member using its .read() method.
result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
dim-1 dim-2 value
DogFaced FlyingFox 0.5686327806469149
DogFaced FreeTaile 0.8466550825369245
DogFaced LittleBro 0.8740257004423847
DogFaced TombBat 0.8704297626683771
FlyingFox DogFaced 0.5686327806469149
FlyingFox FreeTaile 0.8094075881961393
FlyingFox LittleBro 0.8354894351013368
FlyingFox TombBat 0.8446223761090673
FreeTaile DogFaced 0.8466550825369245
FreeTaile FlyingFox 0.8094075881961393
FreeTaile LittleBro 0.7395833333333333
FreeTaile TombBat 0.7732452142206017
LittleBro DogFaced 0.8740257004423847
LittleBro FlyingFox 0.8354894351013368
LittleBro FreeTaile 0.7395833333333333
LittleBro TombBat 0.8105378704720088
TombBat DogFaced 0.8704297626683771
TombBat FlyingFox 0.8446223761090673
TombBat FreeTaile 0.7732452142206017
TombBat LittleBro 0.8105378704720088
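To give a sense of what a kmer-based Jaccard distance measures, the sketch below computes the textbook definition: one minus the ratio of shared kmers to all kmers observed in either sequence. This is an assumption-laden illustration, not the jaccard_dist app itself. The function names, the choice of k=3 and the two short sequences are all hypothetical, and the app may use a different k or other refinements.

```python
def kmers(seq, k):
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq1, seq2, k=3):
    """Return 1 - |A & B| / |A | B| for the kmer sets A and B."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return 1 - len(a & b) / len(union)


d_same = jaccard_distance("ACGTACGT", "ACGTACGT")
d_diff = jaccard_distance("ACGTACGT", "ACGTTCGT")
d_swap = jaccard_distance("ACGTTCGT", "ACGTACGT")

print(d_same)  # 0.0, identical kmer sets
print(d_diff)  # 5/7: 2 shared kmers out of 7 distinct kmers
print(d_diff == d_swap)  # True, the measure is symmetric
```

The symmetry matches the table above, where each pair of sequences appears twice with the same value. Because only kmer sets are compared, no alignment is needed, which is why this kind of distance suits unaligned sequences.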