Are lineages the same as “Variants of Concern”?
No, they are not, although many people use the terms interchangeably. Variants of concern (VOCs) typically exist on the same phylogenetic scale as Pango lineages, but just as not every lineage is a variant of concern, not every variant of concern is a lineage. Lineages represent shared ancestry, and variants may not necessarily fit this criterion. For example, VOC B.1.1.7+E484K refers to lineage B.1.1.7 with the E484K mutation in the spike protein. Within the UK alone this mutation has arisen independently a number of times, and each independent occurrence isn’t given a new lineage designation. Pango does not attempt to identify or define VOCs. That is an activity for public health agencies such as PHE in the UK, CDC in the USA, and the WHO.
I’m confused, are B117, B.1.1.7, and B.1.177 the same lineage?
The dots are very important for distinguishing among different lineages. B.1.1.7 is a different lineage to B.1.177. B117 isn’t a proper lineage name. When spoken out loud, the dots often get omitted and B.1.1.7 becomes “Bee-One-One-Seven”. This omission is usually clear when talking about a well-known lineage like B.1.1.7, but could be ambiguous in other contexts. Resist the urge to shorten further, for example, “Bee-One-Seventeen” is B.1.17 and is a different lineage to B.1.1.7.
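As a rough illustration of why the dots matter, the sketch below splits lineage names into their levels and compares them. The regular expression is a simplified assumption made for this example, not the official naming rule, and `parse_lineage` is a hypothetical helper rather than part of pangolin.

```python
import re

# Simplified, assumed pattern: one to three capital letters followed by
# zero or more dot-separated numeric levels. The real naming rules are
# more involved than this.
LINEAGE_PATTERN = re.compile(r"^[A-Z]{1,3}(\.\d+)*$")


def parse_lineage(name: str) -> tuple[str, ...]:
    """Split a dotted lineage name into its levels, rejecting malformed names."""
    if not LINEAGE_PATTERN.match(name):
        raise ValueError(f"not a well-formed lineage name: {name!r}")
    return tuple(name.split("."))


# The same digits with different dots give different lineages.
print(parse_lineage("B.1.1.7"))
print(parse_lineage("B.1.177"))
print(parse_lineage("B.1.1.7") == parse_lineage("B.1.177"))

# A name with the dots stripped out is rejected rather than guessed at.
try:
    parse_lineage("B117")
except ValueError as err:
    print(err)
```

Rejecting "B117" outright, instead of trying to guess where the dots belong, reflects the point above: the shortened form cannot be mapped back to a single lineage.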
How is P.1 related to B.1.1.28?
Based on the hierarchical system, lineage P.1 (like other lineages P.2, P.3, etc.) is a descendant lineage of B.1.1.28. The prefix P is an alias for B.1.1.28.
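The alias relationship can be sketched in a few lines. The alias table here is a deliberately tiny, hypothetical stand-in that contains only the P mapping stated above, and `expand_alias` and `is_descendant` are illustrative names, not functions from any Pango tool.

```python
# Hypothetical alias table for illustration; only the P -> B.1.1.28
# mapping is taken from the text above.
ALIASES = {"P": "B.1.1.28"}


def expand_alias(name: str) -> str:
    """Replace an aliased prefix with the full lineage name it stands for."""
    prefix, _, rest = name.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


def is_descendant(child: str, parent: str) -> bool:
    """True if the expanded child name sits strictly below the expanded parent."""
    # The trailing dot stops B.1.1.281 from matching B.1.1.28.
    return expand_alias(child).startswith(expand_alias(parent) + ".")


print(expand_alias("P.1"))
print(is_descendant("P.1", "B.1.1.28"))
print(is_descendant("B.1.1.7", "B.1.1.28"))
```

Comparing the expanded names level by level is what makes the hierarchy visible again once an alias has shortened it.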
I understand the hierarchy and the aliases, but why are there A and B lineages at the start?
A and B represent the two earliest SARS-CoV-2 lineages that were sampled and sequenced during the pandemic. Either is plausibly the first lineage to begin circulating in humans, although at present A is slightly more closely related to the most similar SARS-CoV-like virus from bats. The Pango system treats both A and B as lineages whose ancestral lineages are not known with certainty, and therefore they are special-case lineages. The NextStrain clade 19A corresponds to Pango lineage B and NextStrain clade 19B corresponds to Pango lineage A.
What is the difference between designation and assignation of a sequence?
A sequence designation is made based on manual curation and analysis by the Pango team. We take the genome sequences and epidemiological data into account to determine which lineage each sequence clusters within, so designations have been examined carefully. A sequence assignment is performed by a software tool such as pangolin and is the best estimate of the lineage the sequence belongs to, based on current data. As outlined here, both the designation and assignment of a sequence might change as more data comes in.
A more detailed explanation of designations vs assignations is available here.
Do you have the defining mutations for a lineage?
Pango lineages are defined through phylogenetic analysis, not by the presence or absence of particular SNPs in the genome. That said, it is usually possible to list the mutations that are carried by the majority of designated sequences within a lineage. This is a resource we’re working on providing, but in the meantime there are a number of very useful resources that can provide you with mutation/SNP information, including https://outbreak.info/, https://covariants.org/ and the constellations repository https://github.com/cov-lineages/constellations.
Where does the data come from?
Almost all of the data that we use to designate lineages is sourced from GISAID, much of which is generated by the COG-UK project. We are unable to share genomic data from GISAID, but we provide sequence names so that users with a GISAID account can access the relevant data.
How do I know if a lineage is still circulating?
Each lineage has a regularly updated status in the lineage description list, which defines when the lineage was last seen. This can be used to indicate whether lineages are known to be circulating. The potential lineage statuses are:
- ACTIVE – sampled within the last 3 months
- UNOBSERVED – last sampled 3 – 9 months ago so not seen in the last 3 months
- INACTIVE – not sampled in the last 9 months
- WITHDRAWN – name no longer in use
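The status thresholds above can be expressed as a small function. This is a sketch under stated assumptions: months are counted as whole calendar months elapsed, exactly 3 months counts as UNOBSERVED and exactly 9 months as INACTIVE, and `lineage_status` is a hypothetical helper. How the lineage description list actually handles these boundaries is not stated here.

```python
from datetime import date


def lineage_status(last_sampled: date, today: date, withdrawn: bool = False) -> str:
    """Classify a lineage by how long ago it was last sampled."""
    if withdrawn:
        return "WITHDRAWN"
    # Whole calendar months elapsed between the two dates (an assumption
    # about how "months" should be counted).
    months = (today.year - last_sampled.year) * 12 + (today.month - last_sampled.month)
    if today.day < last_sampled.day:
        months -= 1
    if months < 3:
        return "ACTIVE"
    if months < 9:
        return "UNOBSERVED"
    return "INACTIVE"


today = date(2021, 6, 15)
print(lineage_status(date(2021, 5, 1), today))
print(lineage_status(date(2021, 1, 10), today))
print(lineage_status(date(2020, 6, 1), today))
```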
Why is it called Pango?
Pango is a Latin verb meaning “I fix or set”, or record, or tell accounts of something. The nomenclature system and its lineages are called Pango. Our logo represents this in three ways (good luck finding them all!). The name pangolin only refers to the software tool for sequence assignment, available here, and is an acronym of “Phylogenetic Assignment of Named Global Outbreak LINeages”.
What is required for a new lineage?
A new lineage is designated when there is support for a clade of viruses being epidemiologically distinct from their parental viruses. A lineage needs to be a clade within the phylogenetic tree that has one or more defining evolutionary events. The lineage also needs to exhibit evidence of onward transmission (defined as having a branch within the clade that has at least one SNP and multiple descendants).
While the clade needs to have at least one defining mutation, a mutation is not sufficient to designate a new lineage. The clade should represent one or more epidemiological events to distinguish it from continued circulation of the parental lineage. These events can include:
- Introduction into a new geographical area with onward transmission there
- Rapid growth compared to other lineages in the same area
- Observed or predicted changes in phenotype, e.g. transmissibility or immunogenicity
- A constellation of mutations of interest
So if the clade in the tree has one or more defining mutations, at least 5 sequences, an internal non-zero length branch and one or more epidemiological events, then it might constitute a new lineage.
A formal description of the characteristics required for a lineage is found here.
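The summary checklist above can be written as a simple pre-screen. This is only a sketch: `CandidateClade` and its field names are hypothetical, and passing the check means a clade might constitute a new lineage, not that it will be designated. That decision rests with the Pango team.

```python
from dataclasses import dataclass


@dataclass
class CandidateClade:
    """Hypothetical summary of a clade being considered for designation."""
    defining_mutations: int            # number of defining mutations
    sequence_count: int                # number of sequences in the clade
    has_internal_nonzero_branch: bool  # evidence of onward transmission
    epidemiological_events: int        # e.g. introduction, rapid growth


def might_be_new_lineage(clade: CandidateClade) -> bool:
    """Check the four minimum conditions listed in the summary above."""
    return (
        clade.defining_mutations >= 1
        and clade.sequence_count >= 5
        and clade.has_internal_nonzero_branch
        and clade.epidemiological_events >= 1
    )


# A mutation alone, with no epidemiological event, does not pass.
print(might_be_new_lineage(CandidateClade(1, 12, True, 0)))
print(might_be_new_lineage(CandidateClade(2, 12, True, 1)))
```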
What do I do if I think there should be a new lineage?
Follow the detailed instructions on what to do if you think you’ve found a new lineage and submit a GitHub issue, so one of the Pango team can assess the proposal.
What if my lineage suggestion isn’t designated?
There may be times when, after a thorough review of the clade, the Pango Team decide it doesn’t match the required criteria to designate a new lineage. One of the Pango team will reply to your GitHub issue and outline why we don’t think it qualifies at this stage. In these cases, there are two possible outcomes:
- If the clade looks like it might become a lineage in the future but isn’t a lineage yet, we’ll add the “monitor” flag to the GitHub issue and leave it open. As more data comes in, we’ll revisit the clade and see if it does subsequently meet the criteria for a lineage, and if it does, designate it then.
- If the clade doesn’t currently meet the lineage criteria and it doesn’t look like it will in the future, we’ll close the issue after explaining why it doesn’t qualify.
Will you merge lineages if required?
Yes. We anticipate that there might be cases where multiple lineages are designated initially but as more data comes in, it becomes clear that there’s nothing to distinguish them and they should therefore be treated as a single lineage. If you think you’ve found a case where lineages should be merged, please submit a GitHub issue (as outlined here) that includes a description of why the lineages should be merged.
What is a withdrawn lineage?
In some cases, a set of genome sequences will initially satisfy the criteria for a lineage designation but new data means that one or more of those criteria are no longer met (see the FAQ “Why might a lineage designation change?”). In these cases, the lineage is withdrawn.
Alternatively, it might become clear that two or more designated lineages should actually be merged into a single lineage. In these cases, the lineages are merged and will retain one of the original lineage names while the other lineage names will be withdrawn.
How often are the lineages, software tools and websites updated?
New lineages are designated regularly as they appear. New lineage designations are added to the lineage description and sequence designation lists at the next update.
As new lineages are designated, the new information feeds into pangolin, for which new models are trained on a weekly basis. The cov-lineages.org website is updated on a daily basis using all full genome sequences on GISAID.
Why might a lineage designation change?
A lineage designation is made taking the genome sequence data and associated epidemiological data into account. As different labs from different countries around the world are producing data at different rates, we may only have part of the picture at a given time point when lineages are designated.
The example below doesn’t represent a real lineage, but aims to illustrate a possible situation. In panel (a) we see, at time point 1, what looks like a clear geographic distinction, potentially an introduction event from one location to another with evidence of onward spread, as reflected by the internal nodes within the bottom clade. At time point 2, new data has come to light that shows the events may not be straightforward. There doesn’t appear to be a single introduction event, and it’s unclear whether there has been onward spread. This now no longer fits what we would describe as a distinct lineage.
Panel (b) shows a second example, with additional data back-filling the tree. What looked like a distinct cluster is now phylogenetically indistinguishable from the diversity of the parent lineage.
These are simple illustrative examples, but show how new data can change the narrative around a cluster of sequences.
It’s also possible that a sequence might be designated to a lineage initially (let’s call it lineage Z), but as more data arrives, it becomes clear that there’s an emerging descendant lineage within Z (e.g. Z.1) that is distinct enough to be a new lineage. In this case the sequence would change its designation from Z to Z.1.
Why might a lineage assignment change?
A lineage assignment is a “best guess” at what the lineage of an unknown or new sequence may be. This assignment comes with some uncertainty. The accuracy of assignment may depend on a number of factors, including the number of sequences in the lineage (i.e. quantity of data), the amount of ambiguity in those sequences (i.e. quality of data), and how unique the SNPs are for that lineage (e.g. E484K may be found in a number of different lineages). The assignment may change as new sequence designations are made and as new releases of the pangolin model are produced. It may be that a sequence becomes included in a new lineage designation that didn’t exist when you first ran your sequence through pangolin. The full list of designated sequences can be found here.
Can my partial genomes get a lineage designation?
We currently define and designate lineages only from full genomes, with less than 5% ambiguity across the virus genome. Partial genomes may be able to get a predicted lineage assignment using pangolin or another software tool. However, if a lot of data is missing then assignment may be unreliable.
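A quick way to screen a genome against the 5% threshold is sketched below. Two assumptions are made for illustration: any character other than A, C, G or T (including N, other IUPAC codes and gaps) is counted as ambiguous, and the fraction is taken over the length of the supplied sequence rather than a reference genome. The function names are hypothetical, and the check says nothing about whether the genome is full length.

```python
def ambiguity_fraction(sequence: str) -> float:
    """Fraction of positions that are not an unambiguous A, C, G or T."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    return (len(sequence) - unambiguous) / len(sequence)


def below_ambiguity_threshold(sequence: str, max_ambiguity: float = 0.05) -> bool:
    """True if the sequence has less than the allowed ambiguity."""
    return ambiguity_fraction(sequence) < max_ambiguity


# Toy 100-base sequences, far shorter than a real genome.
mostly_clean = "ACGT" * 24 + "NNNN"      # 4 ambiguous positions out of 100
too_ambiguous = "ACGT" * 23 + "N" * 8    # 8 ambiguous positions out of 100
print(below_ambiguity_threshold(mostly_clean))
print(below_ambiguity_threshold(too_ambiguous))
```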
How do these lineages relate to NextStrain or GISAID clades?
Alm et al. 2020 provide an overview of the diversity of SARS-CoV-2 circulating in Europe in 2020. That paper contains a figure that summarises the structure of NextStrain and GISAID clades and Pango lineages, and how they relate to one another. The figure is now somewhat outdated but is still a useful guide.
How long do I need to wait to have a lineage designation issue investigated?
We have a team of people working on lineage designation requests. However, we’re academics working on Pango alongside our day jobs, and designation requests may take some time to process. Urgent requests for lineages of public health or epidemiological significance will be prioritised. The more information you provide with your request, the more easily we can process it. Remember that not all lineage requests will get approved. Lineage designation is at the discretion of the Lineage Designation Committee and the issue may be closed without a designation being made. We do appreciate all suggestions, however, and we thank the community for contributing to this ongoing surveillance effort.
What if I think one or more sequence designations is incorrect?
It’s very possible that sequence designations will change through time as new lineages are designated and withdrawn. If you think that a designation needs to be updated, please let us know by submitting a GitHub issue outlining which sequence designations need to be updated and why.
I think my sequence has been misassigned, what do I do?
Assignments are distinct from lineage designations. Assignments can be performed by software such as pangolin and may be subject to change depending on the inference engine used to predict the lineage of a given sequence. That being said, the best solution to a mis-assignment will often be to examine the sequence in question and file a designation request to rectify this issue on the pango-designation GitHub.
Can we designate a lineage based on a mutation?
In general, new lineages won’t be designated on the basis that they carry a single mutation of interest. Rather, we expect the clade to also represent one or more epidemiological events, as outlined in the rules statement. These events can include the appearance of a set (constellation) of interesting mutations, so if a clade has such a set then this may be sufficient to designate a lineage. One mutation, by itself, usually won’t be enough.
I’ve detected E484K in our sequences, should I submit a lineage designation request?
No, not unless there is an associated epidemiological event or other extenuating factor (see FAQ “Can we designate a lineage based on a mutation?”). In general, the presence of one particular mutation of interest won’t lead to the designation of a new lineage. Some researchers choose to add +E484K after a Pango lineage name to indicate the presence of that particular mutation within one or more members of the lineage. The use of such a suffix is not part of the Pango Nomenclature system, but may prove useful in some circumstances.
What about N501Y?
Surely L452R though?