After assembling a genome with Supernova v2.0.0, how can we identify the two possible variants (phased blocks in the fasta output) of a heterozygous region in the megabubbles (style 2) output? For example, after aligning some Supernova scaffolds against the genome of a closely related individual, there are a few scaffold pairs that align to the exact same region but with noticeable alignment differences between them (although only about 50% of the scaffolds can align to this region, they do not align elsewhere in the genome). I would guess these scaffolds represent heterozygous phase bubble arms, but looking at their fasta headers it seems impossible to make the link between the pairs.
For example, the fasta headers of two scaffolds that align in the same overlapping region are as follows:
>190 edges=1705814,1073804,1741316,1103168,1741315,1071237,1701025,20609,1502706,361360,1530762,1893530,1528592,1893494,1740937,1073104,1091307,1073102,1740758,1740615,1740225,1893502,1740618,1779548,1465216,1720539,1720536,1720531,1720533,1893523,1528817,1787450,1702918 left=42374 right=78474 ver=1.9 style=2
>191 edges=1743396,1175444,1743412,1175440,1743392,1175438,1743415,1175437,1743421,1743406,1743410,1175442,1534756,1175443,1743408,1904778,1743419,1904788,1743404,1780562,1743400,1743398,1743417,1800536,1743394 left=42374 right=78474 ver=1.9 style=2
However, there's no shared edge between the two. I'm a bit puzzled about this, because I was expecting at least some edges to be shared between the two, corresponding for example to linked homozygous edges... Or is my interpretation incorrect?
Edit 05 Mar: actually, in the example above, although the edges are not the same, the "left" and "right" edges (?) are the same, but this doesn't seem to happen always. In another example, both scaffolds 184187 and 184188 align to the same reference sequence in the entire span of the scaffolds, but their headers are totally different:
>184187 edges=2056916,2056898,2056895,2056896,2056901 left=1633472 right=1633470 ver=1.9 style=2
>184188 edges=2056871,2056866,2056865,2056868,2056886 left=1633450 right=1633452 ver=1.9 style=2
Again, how can we make the link between the two haplotypes of a single region in the fasta megabubbles output?
By the way, these two "haplotypes" are 100% identical, but I would say that might be related to the FAQ "I can't find some known heterozygous sites in the megabubble FASTA. How is that possible?"
You've discovered the basic relationship. If left/right are the same between two fasta entries, that is an indication they are homologous megabubble arms. The left/right keywords describe the vertices that the edges of the megabubble arms connect to. See https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/output/graphs for a more complete description.
If two sequences are matching megabubble arms, then they should have the same left and right. If the sequence is precisely the same, then they’re unlikely to be matching megabubble arms (although it’s possible). If they have different left and right vertices, then it's most likely repeated sequence from different places in the assembly.
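The left/right rule above can be applied to a whole megabubbles FASTA with a short script. The following is only a minimal sketch in Python, not a 10x tool: the function names are hypothetical, and it assumes every header follows the `>id edges=... left=... right=... ver=... style=2` layout seen in the question. Records that share both vertices are reported as candidate homologous arms; as noted above, a shared left/right is an indication, not proof. The edge lists in the demo are shortened from the headers in the question.

```python
from collections import defaultdict


def parse_header(line):
    """Split a header such as '>190 edges=1,2 left=42374 right=78474 ver=1.9 style=2'
    into (record_id, attrs). Assumes space-separated key=value fields."""
    fields = line.lstrip(">").split()
    record_id = fields[0]
    attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return record_id, attrs


def candidate_arm_pairs(lines):
    """Group FASTA records by their (left, right) vertices.

    Returns only the groups with more than one record, i.e. records that
    may be homologous megabubble arms. Sequence lines are ignored.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # skip sequence lines
        record_id, attrs = parse_header(line)
        if "left" in attrs and "right" in attrs:
            groups[(attrs["left"], attrs["right"])].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    # Headers from the question, with the edge lists shortened for brevity.
    demo = [
        ">190 edges=1705814,1073804 left=42374 right=78474 ver=1.9 style=2",
        ">191 edges=1743396,1175444 left=42374 right=78474 ver=1.9 style=2",
        ">184187 edges=2056916,2056898 left=1633472 right=1633470 ver=1.9 style=2",
        ">184188 edges=2056871,2056866 left=1633450 right=1633452 ver=1.9 style=2",
    ]
    print(candidate_arm_pairs(demo))
```

On the demo headers only 190 and 191 are grouped; 184187 and 184188 have different vertices, so they are not reported. To scan a real file, an open file handle can be passed directly, for example `candidate_arm_pairs(open("megabubbles.fasta"))`, where the file name is a placeholder.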
The "happy" situation where megabubbles are separated by nominally homologous sequence in between (unphased regions due to lack of heterozygosity) describes the vast majority of sequence in long scaffolds, but weird things happen at the ends of scaffolds and in portions of the assembly that, for instance, contain loops. Those things should still be described correctly in the fasta file, but if the underlying assembly graph doesn't look like that typical situation, that can produce apparent inconsistencies.
I hope this helps. If you have any further questions feel free to contact 10x Customer Support directly at email@example.com.