Visualise presence of coding sequences in multiple genomes

Posted By: Sanderkr, on Jul 5, 2017 at 8:31 AM
Hello everyone!

First of all, I'm new to this forum, so forgive me if I'm posting in the wrong subforum.

We are currently working on a project in which we compare the genomes of bacteria. We have 160 genomes to compare based on their coding sequences (CDS), with which we plan to make a phylogeny.

At the moment we have 160 genomes, each with approximately 2,500 CDSs. About half of these CDSs are present in 100 or more genomes and half in fewer than 100. We have the unique accession numbers for these CDSs. We have processed the data and know exactly which genomes contain which CDSs, and now need to choose which CDSs (approximately 1,000) and which genomes to base our phylogeny on.

So we need to visualise our data as a sort of binary heatmap that profiles each genome and the CDSs it contains, which we can then use to choose the genomes and CDSs for the phylogeny.

The problem is the data format, which most heatmap tools can't handle. It looks like this:

Chr1: gene1, gene3, gene4, gene5
Chr2: gene1, gene2, gene4, gene5
Chr3: gene1, gene4, gene5

So now we need to visualise this data so it will become apparent that gene1, gene4 and gene5 are present in all genomes and will be used for our phylogeny.
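
To make the target concrete, here is a minimal sketch in Python (standard library only) of the 0/1 table we are after, built from the toy example above. The function names are hypothetical, and it assumes every line has the form "genome: gene, gene, ...". It only builds the table and lists the shared genes; it does not draw anything.

```python
def parse_presence(lines):
    """Parse 'Genome: geneA, geneB' lines into {genome: set of genes}."""
    presence = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        genome, _, genes = line.partition(":")
        presence[genome.strip()] = {g.strip() for g in genes.split(",") if g.strip()}
    return presence


def binary_matrix(presence):
    """Return (genomes, genes, rows) where rows[i][j] is 1 if genome i has gene j."""
    genomes = list(presence)
    genes = sorted(set().union(*presence.values()))  # plain string sort
    rows = [[1 if gene in presence[genome] else 0 for gene in genes]
            for genome in genomes]
    return genomes, genes, rows


def core_genes(presence):
    """Genes present in every genome."""
    if not presence:
        return []
    return sorted(set.intersection(*presence.values()))


# Toy input copied from the example above.
example = [
    "Chr1: gene1, gene3, gene4, gene5",
    "Chr2: gene1, gene2, gene4, gene5",
    "Chr3: gene1, gene4, gene5",
]

presence = parse_presence(example)
genomes, genes, rows = binary_matrix(presence)
core = core_genes(presence)

# Print the table as tab-separated text, one genome per row.
print("\t".join(["genome"] + genes))
for genome, row in zip(genomes, rows):
    print("\t".join([genome] + [str(v) for v in row]))
print("shared by all genomes:", ", ".join(core))
```

A tab-separated table like this is the kind of input that most heatmap tools accept, with genomes as rows and CDSs as columns.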

Does anyone have any idea how we can work this out? We have not managed to find a useful tool that can help us.

Thanks in advance!