In order to elucidate the potential origin of the lineage and transmission cluster, a phylogenetic analysis of full genomes representing a small subset of GISAID was performed. The sequences chosen for analysis were the union of two sets: (i) sequences that were assigned to the B.1.1.523 lineage by pangolin and (ii) sequences that were at least 99.3% identical to the Latvian B.1.1.523 sequence EPI_ISL_1590462, with the number of matched residues amounting to at least 95% of the reference sequence. The two limits, 99.3% and 95%, are of a different nature: the 99.3% limit refers to the identity level within the aligned region, whereas the 95% limit refers to the level of query coverage in the alignment, that is, the aligned fraction, regardless of the identity level. The reference sequence was chosen because it was the closest to the first sequence of this lineage obtained in Lithuania, but with smaller gap regions. The alignment against the GISAID sequences was conducted using minimap2 2.20-r1061. The limits were chosen arbitrarily, after several attempts to find cut-offs that result in a set of sequences including most sequences from the lineage and some more diverged ones classified by pangolin as belonging to other lineages. Sequences with a poor overall quality control status or with more than 1000 bp missing (as indicated by Nextclade analysis) were discarded.

The maximum likelihood tree was calculated using a modified version of the Nextstrain workflow. A general time reversible model with unequal rates and unequal base frequencies was used, allowing for a proportion of invariable sites together with a discrete Gamma model. Ultrafast bootstrap with 1000 replicates was used. The maximum likelihood emergence time and country of origin for inner nodes were calculated by treetime 0.8.1 as described by the aforementioned workflow. The set of sequences collected as described above was clustered into transmission clusters using Phydelity v2.0.

The S protein-based phylogeny was built from the S protein sequences extracted from GISAID by Nextclade and aligned to the reference SARS-CoV-2 sequence. Sequences shorter than 1175 residues, containing more than one stop codon, or containing any undetermined residues were discarded. The remaining sequences were further clustered into identical sequence clusters using CD-HIT v4.8.1 (command line option “-c 1.0”). The sequences representing all high-quality S protein variants were used for maximum likelihood tree calculation by VeryFastTree 3.0.1 using an LG substitution model.
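The full-genome selection criteria (at least 99.3% identity within the aligned region, matched residues covering at least 95% of the reference, and the union with the pangolin-assigned set) can be illustrated with a minimal Python sketch. This is not the authors' code. The `Hit` record, the function names, and the way identity and coverage are computed from alignment counts are all assumptions made for illustration; the text does not state how these values were derived from the minimap2 output.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else below is hypothetical.
IDENTITY_MIN = 0.993  # identity within the aligned region
COVERAGE_MIN = 0.95   # matched residues relative to the reference length


@dataclass(frozen=True)
class Hit:
    """One alignment of a GISAID genome against the reference (illustrative)."""
    name: str
    matches: int      # residues identical to the reference
    aligned_len: int  # length of the aligned region
    ref_len: int      # length of the reference sequence


def passes_limits(hit: Hit) -> bool:
    """Apply both limits; they measure different things and are checked separately."""
    if hit.aligned_len == 0 or hit.ref_len == 0:
        return False
    identity = hit.matches / hit.aligned_len  # assumed definition of identity
    coverage = hit.matches / hit.ref_len      # assumed definition of coverage
    return identity >= IDENTITY_MIN and coverage >= COVERAGE_MIN


def select_sequences(pangolin_assigned, hits):
    """Union of (i) pangolin-assigned names and (ii) names passing both limits."""
    similar = {h.name for h in hits if passes_limits(h)}
    return set(pangolin_assigned) | similar


# Made-up example records, not real GISAID data.
hits = [
    Hit("a", matches=29000, aligned_len=29100, ref_len=29903),  # passes both
    Hit("b", matches=20000, aligned_len=20010, ref_len=29903),  # low coverage
    Hit("c", matches=29000, aligned_len=29500, ref_len=29903),  # low identity
]
selected = select_sequences({"p1"}, hits)
```

Keeping the two checks separate reflects the point made in the text: a short but near-identical alignment (record `b`) satisfies the identity limit yet fails the coverage limit, so neither threshold can stand in for the other.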