From 747ded5aed8304a84e4b44a22719070f1d71a4c7 Mon Sep 17 00:00:00 2001
From: slhogle <shane.hogle@gmail.com>
Date: Thu, 7 Dec 2023 14:10:23 +0200
Subject: [PATCH] Cleaning up some of the readme

---
 README.md | 46 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 94dba7c..a401aea 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,8 @@ This repository contains the final genome [assemblies](data/assemblies), [logs](
 
 30 Species from the HAMBI community were sent for Oxford Nanopore or PacBio HiFi sequencing at [SeqCenter](https://www.seqcenter.com/) early November 2022. I assembled multiple subsets of the read data for each genome and generated consensus assemblies using [Trycycler](https://github.com/rrwick/Trycycler).
 
+All sequence data is available under BioProject [PRJNA1047486](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1047486)
+
 # Results
 
 | GTDB species | Strain | Trycycler consensus steps | Sum length (bp) | Chromosomes | Plasmids | Coding Genes | 16S rRNA gene copies |
@@ -39,8 +41,14 @@ This repository contains the final genome [assemblies](data/assemblies), [logs](
 | *Microvirga lotononidis* | HAMBI_3237 | [Link](analysis/HAMBI_3237/HAMBI_3237.md) | 7,316,924 | 3 | 1 | 7211 | 6 |
 | *Pseudomonas fluorescens* SBW25 | NA | [Link](analysis/PsFluSBW25/PsFluSBW25.md) | 6,722,400 | 1 | 0 | 6020 | 5 |
 | *Serratia marcescens* ATCC 13880 | NA | [Link](analysis/Smarc13880/Smarc13880.md) | 5,160,509 | 1 | 1 | 4734 | 7 |
+
+# Consensus assembly pipeline
+Long-read PacBio HiFi and Oxford Nanopore assemblies were generated using Trycycler ([paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02483-z), [code](https://github.com/rrwick/Trycycler)) following the general instructions on the [project wiki](https://github.com/rrwick/Trycycler/wiki). Full Trycycler analysis walkthroughs are available for each species in the [`analysis`](/analysis) directory. The general steps are as follows:
+
+## 1. [Generate Assemblies](https://github.com/rrwick/Trycycler/wiki/Generating-assemblies)
+Generate assemblies using read subsets and different assemblers. I followed the [Trycycler recommendations]().
 
-# Assemblers
+### a) Assemblers used in this study
 - [Flye v2.9.1](https://github.com/fenderglass/Flye)
   - `apptainer build flye.sif docker://quay.io/biocontainers/flye:2.9.1--py310h590eda1_0`
 - [Hifiasm v0.16.1](https://github.com/chhylp123/hifiasm)
@@ -48,17 +56,35 @@ This repository contains the final genome [assemblies](data/assemblies), [logs](
 - [Canu v2.2](https://github.com/marbl/canu/releases/tag/v2.2)
 - [Raven v1.8.1](https://github.com/lbcb-sci/raven)
   - `apptainer build raven.sif docker://quay.io/biocontainers/raven-assembler:1.8.1--h5b5514e_1`
-  - Note: Raven was only used for HAMBI_3237 for scaffolding the uncircularized contigs.
-
-# Consensus assembly pipeline
-Long-read PacBio HiFi and Oxford Nanopore assemblies were generated using Trycycler ([paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02483-z), [code](https://github.com/rrwick/Trycycler)) following the general instructions on [project wiki](https://github.com/rrwick/Trycycler/wiki). Full Trycycler analysis walkthroughs are available for each species in the [`analysis`](/analysis) directory. The first step after clustering the assembly contigs by mash distance is to manually inspect the clusters as in the [Trycycler tutorial.](https://github.com/rrwick/Trycycler/wiki/Clustering-contigs) In general we look for clusters that contain contigs of similar length and that were assembled by multiple different assemblers, which suggests the contig is real/correct. If a cluster consists entirely of assemblies from one assembler then this is probably not real. Canu seems to produce a lot of fragmented short contigs, but it also finds the major chromosome/plasmids that the other assemblers find. Thus, I will keep the Canu assemblies, but they required some extra effort to manually inspect.
+  - Note: Raven was only used for ONT reads and for HAMBI_3237 from the HiFi reads for scaffolding the uncircularized contigs.
+
+## 2. [Cluster contigs](https://github.com/rrwick/Trycycler/wiki/Clustering-contigs)
+Here we cluster contigs from the different assemblers/read subsets into per-contig/replicon groups. This step also helps remove any junk, but we need to do some manual inspection. In general we look for clusters that contain contigs of similar length and that were assembled by multiple different assemblers, which suggests the contig is real/correct. If a cluster consists entirely of assemblies from one assembler, then it is probably not real. Canu seems to produce a lot of fragmented short contigs, but it also finds the major chromosome/plasmids that the other assemblers find.
+
+
+## 3. [Reconciling contigs](https://github.com/rrwick/Trycycler/wiki/Reconciling-contigs)
+Here Trycycler tries to reconcile the contigs with each other on a per-cluster basis. This step is done per-cluster, so if your assemblies yielded three good contig clusters (e.g. one chromosome and two plasmids) then you will carry out this step on each of them.
+
+We also run Trycycler dotplot on all clusters to visualise the relationship between the sequences. For example, sometimes an assembler will duplicate the same sequence within a single contig. This will be visible in the contig's dotplot against itself and in its dotplots against other sequences. An example of this is in [Cluster 001 for HAMBI_2160](analysis/HAMBI_2160/HAMBI_2160.md).
+
+## Steps 4, 5, & 6
+
+The remaining steps are pretty straightforward and can just be followed as a recipe from the Trycycler wiki.
+
+4. [Multiple sequence alignment](https://github.com/rrwick/Trycycler/wiki/Multiple-sequence-alignment)
+5. [Partition Reads](https://github.com/rrwick/Trycycler/wiki/Partitioning-reads)
+6. [Generate consensus](https://github.com/rrwick/Trycycler/wiki/Generating-a-consensus)
+
+## 7. [Polishing](https://github.com/rrwick/Trycycler/wiki/Polishing-after-Trycycler)
+
+The PacBio and ONT assemblies were polished first using polypolish and then polcapolish. For the PacBio data this was unnecessary (zero to only a handful of changes per assembly), but I wanted to see if there were any clear issues where the Illumina data disagreed with the PacBio HiFi data. Most assemblies had no or very few changes (< 10 single-bp changes), all in homopolymer regions.
 
 # Short read plamid target assembly
-Short read assemblies of ancestral HAMBI clones from earlier projects generated using Unicycler. These were used to find any small plasmids that the Trycycler asembly may have missed.
+Short read assemblies of ancestral HAMBI clones from earlier projects were generated using Unicycler. These were used to find any small plasmids that the Trycycler assembly may have missed.
 
 1. I generated [Unicycler Spades optimized assemblies](https://github.com/rrwick/Unicycler#method-illumina-only-assembly) using short read only data from:
-   - Illumina [HiSeq sequencing](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA476209) reads from [Cairns 2018 Frontiers Genetics](https://doi.org/10.3389/fgene.2018.00312)
-   - Illumina MiSeq sequencing reads from the 24 species HAMBI ancestor clones sequenced for other projects in the lab.
+   - Illumina HiSeq sequencing ([PRJNA476209](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA476209)) reads from [Cairns 2018 Frontiers Genetics](https://doi.org/10.3389/fgene.2018.00312)
+   - Illumina MiSeq sequencing reads from the 24 species HAMBI ancestor clones released here: [PRJNA1047486](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1047486)
 2. These assemblies were manually inspected in [Bandage](https://github.com/rrwick/Bandage) for short circularized contigs/replicons. The assembly graphs were then blasted against the consensus long-read assemblies from Trycycler. If there was no match between a replicon in the graph and the long-read assembly I added the plasmid sequence to the long-read consensus fasta file.
 
 
@@ -77,7 +103,3 @@ The HiSeq data produced ciruclarized plasmid sequences for HAMBI_0097 and HAMBI
 | HAMBI_2792 | PRJNA476209 | 4,851 | yes | yes | | |
 | **HAMBI_2792** | **PRJNA476209** | **3,696** | **yes** | **no** | **HAMBI_2792_plas02_circ** | |
 
-# Polishing
-The PacBio and ONT assemblies were polished first using polypolish and then polcapolish. For the PacBio data this was unecessary (zero to only a handful of changes per assembly), but I wanted to see if there were any clear issues where the Illumina data disagreed with the PacBio HiFi data.
-
-Most assemblies had no to very few changes (< 10 single bp changes) all in homopolymer regions.
-- 
GitLab