Background & Summary

Cannabis sativa L. is a primarily annual, dioecious or monoecious herb that has traditionally been cultivated for fiber production, with a history dating back to around 8,000 BC1,2. Although many fiber-type cannabis plants are still cultivated today, interest in the unique chemical components of cannabis, called cannabinoids, has recently increased, and their research and medicinal applications have been growing3,4,5,6,7,8.

Generally, Δ9-tetrahydrocannabinol (Δ9-THC) and cannabidiol (CBD) are the best known of the more than 100 cannabinoids, as they are the most abundant9,10. Within the plant, these two components exist mainly as Δ9-tetrahydrocannabinolic acid (Δ9-THCA) and cannabidiolic acid (CBDA), which are converted to Δ9-THC and CBD, respectively, through decarboxylation, a chemical reaction in which the carboxyl group is removed upon heating and light exposure11.

These two main cannabinoids are used for different purposes. Δ9-THC, a representative drug permitted in 25 states of the USA and a few countries such as Canada, the UK, Croatia, and the Czech Republic, is often used for recreational purposes due to its psychoactive properties12,13. However, medical research into its potential uses is ongoing. CBD, on the other hand, is reported to have medically relevant anti-anxiety14, antioxidant and anti-inflammatory15, and anticonvulsant16 effects, as well as synergistic effects with anti-cancer drugs17. In the medical cannabis cultivation industry, there have been active breeding efforts for several years to reduce Δ9-THC levels and increase CBD levels18. Researchers continue to seek a better understanding of the biological and physiological characteristics of medicinal (Type III) cannabis to further advance its breeding19.

Since the completion of the initial draft genome of the marijuana strain ‘Purple Kush’ in 201120, efforts have been made to establish a comprehensive database and obtain high-quality genome data for various strains (Table 1)21,22,23,24. The current cannabis assemblies are inconsistent in total assembly size, and the naming of chromosome numbers and orientations is not standardized25. Previously published chromosome-level cannabis assemblies contain at least 147 scaffolds, indicating a need for better continuity (Table 1). Additionally, the average number of N’s per 100 kbp is 2,772, reflecting a very high proportion of unknown sequence. Kovalchuk et al. (2020) pointed out that the Cannabis genome assembly is incomplete, contains gaps, and is poorly aligned at low resolution, and that the quality of the consensus sequence obscures the accuracy of annotations26. Furthermore, such assemblies make it difficult for data users to distinguish real genomic differences from assembly errors.

Table 1 The list of assemblies of Cannabis sativa L. currently available in NCBI GenBank.

With the increasing use of Cannabis for both agricultural and medicinal purposes, it has become essential to establish a comprehensive and high-resolution cannabis genomic database. This resource is crucial for comparative genomics, evolutionary studies, breeding improvements, and understanding the genetic regulation of key agronomic traits, such as cannabinoid production. Recently, a growing number of studies have used genomic data to examine small-scale variations, such as single nucleotide polymorphisms (SNPs) in specific genes, as well as mid- to large-scale variations, such as long terminal repeats (LTRs). The accurate identification of variations relies on the quality of sequencing and genome assembly. Therefore, ensuring high-quality genomic data is critical for the reliable interpretation of genetic variation.

To achieve a high-precision cannabis genome assembly, we used three sequencing technologies in a hybrid assembly: PacBio Single Molecule, Real-Time (SMRT) sequencing, Oxford Nanopore Technologies (ONT) sequencing, and Illumina high-throughput short-read sequencing. We generated two types of third-generation primary reads of ‘Pink Pepper’: PacBio SMRT reads, well established for their high accuracy27, and ONT reads, which are advantageous for their longer read lengths28. The accuracy of the genome assembly was then increased by aligning it with Illumina sequencing data of the same variety, resulting in a chromosome-level genome (Fig. 1a). The assembled genome was organized into 10 chromosomes, with a total size of 770 Mb. The GC content was 34.09%, the number of N’s per 100 kbp was 0.69, and the complete Benchmarking Universal Single-Copy Orthologs (BUSCO) scores were 99.6% (viridiplantae_odb10), 97.8% (eudicots_odb10), and 98.6% (embryophyta_odb10). Overall, repeats accounted for 77.13% of the entire genome. Based on transcriptome data from leaves, flowers, roots, and stems, together with cannabis-related protein sets, 30,459 protein-coding genes were predicted, accounting for 92.92% of the total 32,779 genes.
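The two base-level metrics quoted above (GC content and N's per 100 kbp) can be reproduced from any set of sequences with a few lines of Python. The following is a minimal sketch, not the tool used in this study; the function name `assembly_base_stats` and the toy sequences are hypothetical, and the sketch assumes that GC content is computed over unambiguous A/C/G/T bases only.

```python
from typing import Dict, Iterable


def assembly_base_stats(sequences: Iterable[str]) -> Dict[str, float]:
    """Return total length, GC content (%) and N count per 100 kbp."""
    total_len = 0
    gc = 0
    acgt = 0
    n_count = 0
    for seq in sequences:
        seq = seq.upper()
        total_len += len(seq)
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
        n_count += seq.count("N")
    return {
        "total_length": total_len,
        # Assumption: GC% is taken over unambiguous bases only.
        "gc_percent": 100.0 * gc / acgt if acgt else 0.0,
        # Unknown bases normalised to a 100 kbp window.
        "n_per_100kbp": 1e5 * n_count / total_len if total_len else 0.0,
    }


if __name__ == "__main__":
    toy = ["ACGTNN", "GGCC"]  # hypothetical toy sequences, not real data
    print(assembly_base_stats(toy))
```

On a real assembly, the sequences would be read from the FASTA file one record at a time rather than held in a list.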

Fig. 1

Schematic diagram of the genome assembly of Cannabis sativa L. conducted in this study (a). The reference genome used for scaffolding was GCA_900626175.2 from the NCBI GenBank database. K-mer frequency distributions from GenomeScope 2.0 (k = 19), with the maximum k-mer coverage set to 300× (b) and 1,000,000× (c). The blue portion in the figure represents the analyzed k-mer frequency, while the orange and yellow lines represent errors and unique sequences, respectively (b, c).

In this study, we present the complete genome sequence of the Pink Pepper cultivar, selectively bred for high CBD production. This assembled genome provides more precise fundamental information not only for cannabis breeding but also for studies of biological characteristics and plant responses through the analysis of differentially expressed genes. Consequently, a better understanding of the cannabinoid and terpene biosynthesis mechanisms in cannabis could ultimately contribute to the development and application of medical cannabis.

Methods

Cannabis variety and cultivation

The variety of C. sativa used in this study was ‘Pink Pepper,’ a Type III cannabis strain with a high CBD content (open field, 11.404 ± 1.117%·inflorescence dry weight, 3.267 ± 0.335%·leaf dry weight). The plant material was a cut clone, and rooting was induced in tap water before cultivation. To secure rooting space, a large pot (15 L) was filled with bed soil (bio bed soil, Heungnong Jongmyo Co., Pyeongtaek, Korea) for cultivation. The plants were grown in a greenhouse for 90 days (24 ± 4°C), and the photoperiod was adjusted to 18 hours/day using shading curtains. Although the strain was auto-flowering, the light was adjusted to 12 hours/day to activate flower differentiation and induce flower development. Throughout the entire growth cycle, the plants were irrigated with 400 mL of tap water once daily.

Nucleic acid extraction

High molecular weight genomic DNA was extracted from fresh leaf tissue during the vegetative growth phase using the cetrimonium bromide (CTAB)-based extraction method. Total RNA was extracted during the flowering stage from four types of plant tissue (flower, leaf, root, and stem) using the Quick-RNA MiniPrep kit (Zymo Research, Irvine, CA, USA). To preserve the integrity of the nucleic acids, the sampled plant tissues were immediately submerged in liquid nitrogen and subsequently stored at −80 °C in a deep freezer (DAIHAN Scientific Co., Ltd., Wonju, Korea) until further analysis.

Quality control and library preparation

DNA concentration, quality, and integrity were assessed using Victor 3 fluorometry (PerkinElmer Inc., Waltham, MA, USA) and gel electrophoresis. A DNA integrity number (DIN) of seven or higher was confirmed. Quality control and normalization of the Illumina library involved quantification according to the Illumina qPCR quantification protocol guide. For nanopore sequencing, the library was prepared with a ligation sequencing kit and quantified using Qubit 3.0 (Thermo Fisher Scientific Inc., Waltham, MA, USA). The PacBio library was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences of California Inc., Menlo Park, CA, USA).

RNA was quantified with the Agilent Technologies 2100 Bioanalyzer (Santa Clara, CA, USA), achieving an RNA integrity number (RIN) of seven or higher, indicative of high quality. RNA integrity was further verified by gel electrophoresis. mRNA purification was conducted using the TruSeq stranded mRNA kit (Illumina, San Diego, CA, USA), followed by cDNA reverse transcription for library preparation. Illumina paired-end sequencing was subsequently performed.

Sequencing and pre-processing

Using the Illumina NovaSeq 6000 (San Diego, CA, USA), we generated paired-end read data comprising 815,329,552 reads and totaling 123 giga base pairs (Gbp). To remove contaminants and adaptors, fastp v0.21.0 (https://github.com/OpenGene/fastp) and BBDuk v38.87 (k = 31, mcf = 0.5; https://sourceforge.net/projects/bbmap/) were used. The contaminant databases included viral, rRNA, human, and bacterial sequences. After quality and adaptor trimming, 94 Gbp of read data were obtained, with 0.06% viral, 2.61% rRNA, 0.03% human, and 0.03% bacterial reads removed. For long-read sequencing, ONT sequencing was performed on the ONT GridION (Oxford, UK) and repeated five times for high reliability. Adaptors were removed from the resulting long reads using Porechop (v0.2.3, https://github.com/rrwick/Porechop), and the reads were confirmed to pass a quality score of seven or higher. The number of reads was 11,484,123, with a total of 89 Gbp and an N50 of 26,677 bp. Using the PacBio Sequel II system (Menlo Park, CA, USA), single molecule, real-time (SMRT) sequencing was performed to generate polymerase reads. Using the SMRT Link v11.1 software with the PacBio Sequel II system, adaptors were removed and subreads were aligned, resulting in 2,039,056 reads with a total of 21 Gbp.
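Because N50 is used throughout this descriptor for both reads and assemblies, a short illustration of its definition may help. The sketch below assumes the standard definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases); the function name `n50` and the example lengths are hypothetical.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of read or contig lengths."""
    if not lengths:
        return 0
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # not reached for non-empty input


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]  # hypothetical lengths, not real data
    print(n50(toy_lengths))
```

For the toy lengths, the total is 20, so the two longest sequences (6 + 5 = 11) already cover half of the bases and the N50 is 5.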

De novo assembly and scaffolding

To obtain basic genomic statistics, Jellyfish v2.2.1029 and GenomeScope 2.030 were used to estimate the genome size from the Illumina sequence reads. The analysis was conducted with k-mer sizes of 17, 19, and 21, and the 19-mer result was used for the final genome size estimate. Homozygosity ranged from 98.64% to 98.69%, while heterozygosity ranged from 1.31% to 1.36%. The estimated haploid genome length ranged from 776 mega base pairs (Mbp) to 779 Mbp, while the repeat length ranged from 554 Mbp to 556 Mbp. The unique length was estimated at 222 Mbp to 223 Mbp (Fig. 1b and c).
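The idea behind k-mer-based genome size estimation can be illustrated with a deliberately naive sketch: count k-mers in the reads, find the main coverage peak of the k-mer histogram, and divide the number of k-mer occurrences by the peak depth. This is not what Jellyfish and GenomeScope 2.0 do internally (GenomeScope fits a mixture model that also accounts for heterozygosity, repeats, and errors); the function names, the `min_depth` cutoff, and the simulated reads are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def tiled_reads(genome: str, read_len: int, step: int) -> List[str]:
    """Simulate error-free reads by tiling the genome at a fixed step."""
    return [genome[s:s + read_len]
            for s in range(0, len(genome) - read_len + 1, step)]


def kmer_histogram(reads: List[str], k: int) -> Dict[int, int]:
    """Map k-mer depth -> number of distinct k-mers observed at that depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram: Dict[int, int],
                         min_depth: int = 5) -> Tuple[float, int]:
    """Naive estimate: k-mer occurrences (depth >= min_depth) / peak depth."""
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


if __name__ == "__main__":
    # Hypothetical 2 kbp random "genome"; real data would come from FASTQ files.
    genome = "".join(random.Random(0).choices("ACGT", k=2000))
    hist = kmer_histogram(tiled_reads(genome, read_len=100, step=5), k=19)
    size, peak = estimate_genome_size(hist)
    print(f"peak depth {peak}, estimated size {size:.0f} bp")
```

The `min_depth` cutoff stands in for the removal of low-frequency, error-derived k-mers; on real data, strand-canonical k-mers would normally be counted as well.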

NextDenovo v2.3.1 (https://github.com/Nextomics/NextDenovo) was used to assemble the ONT reads, and the PacBio reads were then mapped31, resulting in 130 contigs with a total length of 809 Mbp. The contigs were then polished using Illumina short reads, yielding contigs totaling 810 Mbp. Finally, the contigs were filtered using PurgeHaplotigs, which removed about 4.99% of the sequence as redundant and left 70 contigs (770 Mbp). Using RagTag (https://github.com/malonge/RagTag) with default parameters, the resulting contigs were mapped to the previous version of the C. sativa reference genome (GCA_900626175.2)32, and pseudomolecules totaling 770 Mbp were created. The longest chromosome was chromosome 2, at 92 Mbp, the shortest was chromosome 8, at 51 Mbp, and the N50 value was 77 Mbp (Table 2).

Table 2 Genome statistics during assembly and scaffolding process.

Repeat annotation

In general, long-read sequencing techniques, such as PacBio and ONT sequencing, are advantageous for the accurate detection of repeats, including tandem repeats (TRs). These methods can assemble long repeats spanning genes relatively accurately and detect the length, nucleotide composition, and nucleotide variations of TRs33. De novo repeat families were identified using RepeatModeler software (https://github.com/Dfam-consortium/RepeatModeler), and the distribution of repeats within the assembled genomic sequences was analyzed using RepeatMasker v4.1.2 software34 (https://github.com/rmhubley/RepeatMasker). To enhance readability, the distribution of repeats was categorized into DNA elements, long interspersed nuclear elements (LINEs), LTR elements, rolling-circle (RC) elements, and short interspersed nuclear elements (SINEs). Overall, repeats represented 77.13% of the cannabis assembly, consistent with previous research reporting high repeat levels in the cannabis cultivars ‘Purple Kush’ and ‘Finola’ (73.9% and 73.3%, respectively)23.

The results indicated a slightly higher repeat content in cannabis compared to its taxonomically close relative Humulus lupulus (71.46%)35,36. Additionally, it was on the higher side compared to other plants such as Xanthoceras sorbifolium (56.39%)37, Oryza sativa (51.63% – 54.34%)38, Panax ginseng (56.9%)39, and Nicotiana tabacum (67.05%)40. The most abundant repeat regions were LTR-Gypsy retrotransposons and LTR-Copia retrotransposons, comprising 24.45% and 25.81% of the genome, respectively (Table 3).

Table 3 Result of repeat annotation statistics.

Gene annotation

Total RNA from plant tissues, including stems, leaves, roots, and flowers, was reverse-transcribed, and paired-end sequencing was performed using the Illumina NovaSeq 6000. Subsequently, de novo assembly was conducted to obtain transcriptome data41. In parallel, an evidence dataset was constructed using protein sequences from 10 species registered at NCBI (Table 4), and the first gene prediction was performed using MAKER (v3.01.03)42. Among the predicted genes, only those with an annotation edit distance (AED) of 0.25 or lower were selected. GeneMark (v4.38)43, SNAP (v20060728)44, and AUGUSTUS (v3.3.2)45 were then used for ab initio gene prediction training.
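The AED cutoff described above can be sketched as a simple filter over a GFF3 file. This is an illustrative sketch rather than the authors' script: it assumes that each mRNA feature carries an `_AED=` key in its attribute column (as MAKER-style GFF3 output typically does), and the function name and example records are hypothetical.

```python
import re
from typing import Iterable, List

# Matches the "_AED" attribute but not the related "_eAED" attribute.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def filter_mrna_by_aed(gff_lines: Iterable[str],
                       max_aed: float = 0.25) -> List[str]:
    """Return IDs of mRNA features whose AED is at or below max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only transcript-level features carry the AED here
        aed = AED_PATTERN.search(cols[8])
        if aed and float(aed.group(1)) <= max_aed:
            feature_id = ID_PATTERN.search(cols[8])
            kept.append(feature_id.group(1) if feature_id else cols[8])
    return kept


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    example = [
        "##gff-version 3",
        "chr1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.10;_eAED=0.10",
        "chr1\tmaker\tmRNA\t1000\t1900\t.\t-\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.40;_eAED=0.35",
    ]
    print(filter_mrna_by_aed(example))
```

An AED of 0 indicates perfect agreement between a gene model and its supporting evidence, so a lower cutoff keeps only the better-supported models for training.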

Table 4 Used protein database of related species for evidence dataset.

By integrating the results of the first gene prediction and the ab initio training dataset, a second round of gene model prediction was conducted. EvidenceModeler v1.1.146 was used to apply different weights to each dataset. The weights were set to 7 for GeneMark data and 10 for the others.

To predict the function of the identified genes, DIAMOND (v5.34-73.0; maximum target sequence = 20, e-value threshold = 1e-5)47 was used to analyze the similarity with the non-redundant protein database48 from NCBI and Araport1149 from Arabidopsis thaliana. Gene ontology (GO) analysis was conducted using BLAST2GO (v5.2.5)50, protein domains were identified using InterProScan (v5.34-73.0)51, and KEGG (Kyoto encyclopedia of genes and genomes) pathway analysis was performed using the KAAS web-tool52. The numbers of annotated genes were as follows: 30,395 (92.73%) for NCBI nr, 22,093 (67.40%) for Araport11, 21,878 (66.74%) for InterProScan, 16,464 (50.23%) for BLAST2GO, and 10,376 (31.65%) for the KAAS web-tool. The data from each source were combined and complemented, resulting in 30,459 annotated genes, which accounted for 92.92% of the 32,779 predicted genes (Fig. 2 and Table 5).
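The percentages above follow directly from the reported counts and the total of 32,779 predicted genes, and the combined set is the union of the genes annotated by each source. The sketch below recomputes the percentages and shows the union step on toy ID sets; the function names and the toy gene IDs are hypothetical, and only the counts are taken from the text.

```python
from typing import Dict, Iterable, Set

TOTAL_GENES = 32_779  # total predicted genes reported in the text

# Annotated gene counts per source, as reported in the text.
ANNOTATED_COUNTS: Dict[str, int] = {
    "NCBI nr": 30_395,
    "Araport11": 22_093,
    "InterProScan": 21_878,
    "BLAST2GO": 16_464,
    "KAAS": 10_376,
    "combined": 30_459,
}


def percent_annotated(count: int, total: int = TOTAL_GENES) -> float:
    """Share of genes with an annotation, in percent (two decimals)."""
    return round(100.0 * count / total, 2)


def combine_annotations(id_sets: Iterable[Set[str]]) -> Set[str]:
    """A gene counts as annotated if at least one source annotates it."""
    combined: Set[str] = set()
    for ids in id_sets:
        combined |= ids
    return combined


if __name__ == "__main__":
    for source, count in ANNOTATED_COUNTS.items():
        print(f"{source}: {count} genes, {percent_annotated(count)}%")
    # Hypothetical toy ID sets illustrating the union step.
    print(sorted(combine_annotations([{"g1", "g2"}, {"g2", "g3"}])))
```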

Fig. 2

Number and percentage of annotations by different annotation methods. The Venn diagram shows the protein IDs shared among the functional annotation tools: Araport11, annotation database of Arabidopsis thaliana; NCBInr, NCBI protein sequence database; InterProScan, InterPro protein sequence database; KAAS, Kyoto encyclopedia of genes and genomes (KEGG) protein sequence database; Blast2GO, tool for Gene Ontology (GO) analysis and functional annotation.

Table 5 Functional annotation statistics of software for gene prediction.

Data Records

The raw data sets generated in this study are available in the NCBI SRA database53. Specifically, the PacBio sequencing data for the genome is deposited under accession number SRX1788736154. The ONT sequencing data is available under accession number SRX1788736055, and the Illumina data under accession number SRX1788735556. The raw mRNA data generated for genome annotation have also been registered in the NCBI SRA database, associated with the following accession numbers: SRX17887359 (stem)57, SRX17887358 (root)58, SRX17887357 (leaf)59, and SRX17887356 (flower)60.

The assembled genome can be accessed in the GenBank database61. Comprehensive gene annotation information, including gene structure, functional predictions, and the transcriptome and protein data sets, can be accessed in the Figshare database62.

Technical Validation

Plant sample validation

The DNA concentration of the leaf sample was 23.616 ng/µl, and 100 µl was extracted (total DNA amount: 3.262 µg). The DIN value was determined to be 7.5, and after passing the quality check, it was used for library preparation. The RNA concentration was 107.024 ng/µl, and 96 µl was extracted (total RNA amount: 10.274 µg). The RIN value was confirmed to be 8.4, and the rRNA ratio was determined to be 2.0.

The RNA concentration of the root sample was 41.937 ng/µl, and 50 µl was extracted (total RNA amount: 2.097 µg). The RIN value was confirmed to be 7.7, and the rRNA ratio was determined to be 4.2.

The RNA concentration of the stem sample was 59.53 ng/µl, and 50 µl was extracted (total RNA amount: 0.281 µg). The RIN value was confirmed to be 7.7, and the rRNA ratio was determined to be 8.3.

The RNA concentration of the inflorescence (flower) sample was 952.552 ng/µl, and 50 µl was extracted (total RNA amount: 47.628 µg). The RIN value was confirmed to be 8.3, and the rRNA ratio was determined to be 2.7.

Comparison of read statistics and BUSCO with existing cannabis assemblies

Raw reads for the chromosome-level assemblies publicly available on NCBI (excluding the Abacus reads, which are not available in the Sequence Read Archive (SRA)) were collected using the SRA Toolkit (v3.1.1-ubuntu). Statistics were then generated using SeqKit63 (v2.8.2, Supplementary Table 1).

Among them, the Illumina NovaSeq 6000 run used for this assembly produced the highest number of reads, generating 815,329,552 paired-end reads totaling 123 Gbp. This is 2.7 times more reads than JL’s HiSeq X Ten run (SRA accession: SRX6757267), which previously held the highest number of reads, with comparable read lengths. The reads produced by the ONT GridION had an N50 value of 26,677 and an N60 value of 73,606; the N50 is 1.8 times higher than the N50 value of 14,716 for cs10’s ERX3863365 reads, the only other reads produced using ONT. It is also 1.7 times higher than the N50 value of 16,037 for Cannbio-2’s PacBio Sequel reads. Longer reads provide more opportunity for overlap between reads, potentially leading to a more contiguous assembly. The reads produced by the PacBio Sequel II were evaluated with a Q20 of 98.88% and a Q30 of 97.42%, the highest values after those of Purple Kush (SRA accession: SRX4178554). These statistics demonstrate the impact of rapidly advancing sequencing technologies on producing high-quality reads. Furthermore, they emphasize the importance of hybrid assembly in offsetting the disadvantages and leveraging the advantages of each technology for downstream analysis.

To compare the completed chromosome-level assembly (Fig. 3) with other assemblies, the final assembly versions of the chromosome-level Cannabis genomes registered in NCBI were collected20,21,24,25 (GenBank accession: GCA_025232715.1, GCA_013030365.1, GCA_003417725.2, GCA_016165845.1, GCA_000230575.5, GCA_900626175.2), and the collected genome data were validated for integrity using vdb-validate. The BUSCO (v5.2.2) analysis of NCBI’s chromosome-level assemblies was conducted using the viridiplantae_odb10, eudicots_odb10, and embryophyta_odb10 databases (released Jan 08, 2024). Among the registered chromosome-level assemblies, this assembly showed the highest percentage of complete BUSCOs for all three databases (Fig. 4a-c). Specifically, for the viridiplantae_odb10 database, the complete percentage was 99.6% (single-copy: 95.8%, duplicated: 3.8%), for the eudicots_odb10 database it was 97.8% (single-copy: 91.6%, duplicated: 6.2%), and for the embryophyta_odb10 database it was 98.6% (single-copy: 92.7%, duplicated: 5.9%). Our assembly also showed a high percentage of single-copy BUSCOs (Fig. 4a-c). The treemap, which represents the relative sizes of the scaffolds in each assembly, highlights the improved continuity of our assembly. Specifically, the number of scaffolds in the chromosome-level assemblies is 5,303 for Finola, 147 for Cannbio-2, 12,836 for Purple Kush, 220 for cs10 (CBDRx), 160 for Abacus, and 483 for JL, while this assembly contains only 17 scaffolds, confirming its superior continuity (Table 1 and Fig. 4d).
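BUSCO reports completeness in a compact notation in which complete BUSCOs (C) are the sum of single-copy (S) and duplicated (D) BUSCOs. The sketch below parses the C/S/D part of such a string and checks that the values reported above are internally consistent. The function names and the tolerance are hypothetical, and the example strings contain only the C, S, and D values given in the text.

```python
import re
from typing import Dict

# Matches the "C:..%[S:..%,D:..%]" portion of a BUSCO-style summary string.
BUSCO_PATTERN = re.compile(r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\]")


def parse_busco(summary: str) -> Dict[str, float]:
    """Extract complete, single-copy and duplicated percentages."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"no BUSCO scores found in: {summary!r}")
    complete, single, duplicated = (float(v) for v in match.groups())
    return {"C": complete, "S": single, "D": duplicated}


def is_consistent(scores: Dict[str, float], tolerance: float = 0.2) -> bool:
    """Check C == S + D, allowing for rounding to one decimal place."""
    return abs(scores["S"] + scores["D"] - scores["C"]) <= tolerance


if __name__ == "__main__":
    # C/S/D values as reported in the text for the three lineage datasets.
    reported = {
        "viridiplantae_odb10": "C:99.6%[S:95.8%,D:3.8%]",
        "eudicots_odb10": "C:97.8%[S:91.6%,D:6.2%]",
        "embryophyta_odb10": "C:98.6%[S:92.7%,D:5.9%]",
    }
    for lineage, summary in reported.items():
        print(lineage, parse_busco(summary), is_consistent(parse_busco(summary)))
```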

Fig. 3

Circle plot of the Cannabis sativa L. cv Pink Pepper genome assembly. From the outermost to innermost layers: Chromosome number, gene, CDS (coding sequence) frequency, mRNA frequency, and the relationship of the main cannabinoid gene. The protruding segments on the chromosomes represent unscaffolded regions. The scale indicating chromosome size is in units of Mbp (mega base pairs). CDS frequency and mRNA frequency are visualized after trimming at the 1 Mbp level. The red links connecting the center represent annotated genes involved in Δ9-THCA synthesis, while the blue lines represent annotated genes involved in CBDA synthesis (based on the description).

Fig. 4

Assembly completeness evaluation using Benchmarking Universal Single-Copy Orthologs (BUSCO) and comparison of assembly continuity using a tree map chart. The evaluations were conducted using viridiplantae_odb10 (a), eudicots_odb10 (b), and embryophyta_odb10 (c). C: complete BUSCOs (S + D), S: Single-copy, D: Duplicated, F: Fragmented, M: Missing. The tree map chart visualizes the continuity of the assembly (d). The GenBank accession numbers for the varieties are as follows: Pink Pepper, the assembly data from this study (GCA_029168945.1); Abacus, GCA_025232715.1; Cannbio-2, GCA_016165845.1; JL, GCA_013030365.1; cs10, GCA_900626175.2; Finola, GCA_003417725.2; Purple Kush, GCA_000230575.5.

Synteny analysis with close genetic relatives of C. sativa

Synteny comparison was conducted with BLASTp (v2.12.0)64 and MCScanX65, using the protein sequences (protein.fasta) and annotation files (annotation.gff) generated during annotation. Previous studies using C. sativa genomes reported synteny comparison results with Ziziphus jujuba, which belongs to the same order, Rosales24. In our synteny analysis between the Pink Pepper genome assembly and the Z. jujuba reference genome (RefSeq: GCF_031755915.1), a total of 72,921 genes were identified, with 30,456 classified as collinear. This indicates that C. sativa and Z. jujuba share 41.77% synteny (Fig. 5a and b).
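MCScanX does not read GFF3 directly; it expects a simplified, tab-separated gene position file with four columns (chromosome label, gene ID, start, end), where the chromosome label is a short species prefix plus a number, such as the `cs1` to `cs10` labels used in Fig. 5. The sketch below shows one way such a file could be derived from GFF3 gene features and how the collinearity percentage is obtained. It is an illustration under these assumptions, not the authors' script; the function names, the sequence-name mapping, and the toy records are hypothetical.

```python
from typing import Dict, Iterable, List


def gff3_to_mcscanx(gff_lines: Iterable[str],
                    chrom_labels: Dict[str, str]) -> List[str]:
    """Convert GFF3 gene features to 'label<TAB>gene<TAB>start<TAB>end' rows."""
    rows = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        label = chrom_labels.get(cols[0])
        if label is None:
            continue  # skip sequences without a chromosome label
        attrs = dict(item.split("=", 1)
                     for item in cols[8].split(";") if "=" in item)
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue
        rows.append(f"{label}\t{gene_id}\t{cols[3]}\t{cols[4]}")
    return rows


def collinear_percent(collinear_genes: int, total_genes: int) -> float:
    """Percentage of genes located in collinear blocks."""
    return round(100.0 * collinear_genes / total_genes, 2)


if __name__ == "__main__":
    # Hypothetical GFF3 records and sequence-name mapping.
    toy_gff = [
        "##gff-version 3",
        "chrA\tsrc\tgene\t100\t500\t.\t+\t.\tID=geneA;Name=x",
        "chrA\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=mrnaA;Parent=geneA",
    ]
    print(gff3_to_mcscanx(toy_gff, {"chrA": "cs1"}))
    print(collinear_percent(30_456, 72_921))  # counts reported in the text
```

The gene IDs in this file must match the protein IDs used in the BLASTp tabular output so that MCScanX can link hits to positions.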

Fig. 5

Synteny analysis between the assembled Pink Pepper genome and the reference genomes of closely related species. The multicolored connecting curves between the chromosomes of the two species represent syntenic blocks, indicating conserved gene blocks between the genomes (a). The dot plots generated from the synteny data show the conserved synteny between Cannabis sativa L. and other genomes (b, c). cs1-10: Chromosome number of C. sativa formed by this assembly. zj1-12: chromosome number of the Ziziphus jujuba reference genome (RefSeq: GCF_031755915.1). hl1-10: chromosome number of the Humulus lupulus reference genome (RefSeq: GCF_963169125.1).

We further conducted a synteny analysis using the reference genome of H. lupulus (RefSeq: GCF_963169125.1), which belongs to the Cannabaceae family, a more specific clade within Rosales, and shares significant genetic similarity with C. sativa. Of the 79,354 identified genes, 55,832 were classified as collinear, revealing a high synteny of 70.36% (Fig. 5a and c). These results further confirm the close genetic relationship between C. sativa and H. lupulus. The synteny analysis data are available on Figshare for further analysis and use66.

Structural comparison between cannabis genomes

To examine genomic structure, we compared the Pink Pepper assembly with the previous reference genome, cs10 (GCA_900626175.2). Whole genome alignment (WGA) was performed using D-GENIES (v1.5.0)67 with Minimap2 (v2.26; -f = 0.02)68 as the aligner. The dot plot, with Pink Pepper as the target (reference) and cs10 as the query, revealed significant structural variations, such as gaps, inversions, and repeats, across the chromosomes, despite both genomes being from the same species. The comparison showed 19.89% no match, 9.12% matching <25%, 57.40% matching <50%, 13.36% matching <75%, and only 0.23% matching >75% (Fig. 6a). Additionally, distinct structural differences and variations were identified on chromosome 7, which contains a high density of CBDAS and THCAS loci (including pseudogenized and fragmented copies) in both our assembly and cs1021,62 (Figs. 3 and 6b).

Fig. 6

Whole genome alignment (WGA) dot plot between the assembled Pink Pepper genome and cs10. The dots generated in the plot represent regions of similarity between the two genomes that have been aligned. p01-p10: Chromosome numbers of Pink Pepper, c01-c10: Chromosome numbers of cs10. The WGA excluded unscaffolded contigs (a), and the dot plot of chromosome 7, which contains loci related to cannabidiolic acid synthase (CBDAS) and Δ9-tetrahydrocannabinolic acid synthase (THCAS), shows significant structural differences despite both strains being high-CBD varieties (b). Structural variations (SVs) at the chromosome level include breakpoints, duplications, sequence differences, gaps, and jumps, and the variant count was calculated per 10 Mbp (c). The GenBank accession numbers for the varieties are as follows: Pink Pepper, the assembly data from this study (GCA_029168945.1); Abacus, GCA_025232715.1; Cannbio-2, GCA_016165845.1; JL, GCA_013030365.1; cs10, GCA_900626175.2; Finola, GCA_003417725.2; Purple kush, GCA_000230575.5.

Figure 6c presents the distribution of structural variations (SVs), categorized by chromosome and interval, using the current assembly as a reference against previously registered genomic datasets (Abacus, Cannbio-2, cs10, Finola, JL, and Purple Kush). The analysis was conducted using NUCmer (v3.1; l = 40, g = 90, b = 100, c = 200) and dnadiff (v1.3) with the Pink Pepper assembly as the reference. Overall, the number of breakpoints was highest in JL, with 577,769 instances, while Finola exhibited the highest number of relocations (12,979) and translocations (32,620). Inversions were most frequent in Purple Kush, totaling 3,158, and Cannbio-2 showed the greatest number of insertions, reaching 220,308. Although visually distinct large-scale structural variations were observed in Fig. 6a, cs10 showed the lowest values across all SV comparisons when compared to other cultivars. This finding suggests significant structural genomic variation among cannabis cultivars bred for diverse purposes and by different methods.
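The per-interval counts shown in Fig. 6c (variants per 10 Mbp) amount to binning variant positions along each chromosome. The sketch below illustrates that binning step only; it does not parse NUCmer or dnadiff output, and the function name, the `(chromosome, position, type)` record layout, the 0-based coordinates, and the toy records are all assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple

BIN_SIZE = 10_000_000  # 10 Mbp windows, as used for Fig. 6c

Variant = Tuple[str, int, str]  # (chromosome, 0-based position, SV type)


def count_variants_per_bin(variants: Iterable[Variant],
                           bin_size: int = BIN_SIZE) -> Counter:
    """Count variants per (chromosome, window start, SV type)."""
    counts: Counter = Counter()
    for chrom, position, sv_type in variants:
        window_start = (position // bin_size) * bin_size
        counts[(chrom, window_start, sv_type)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical variant records, not taken from the real comparison.
    toy_variants = [
        ("chr1", 5, "GAP"),
        ("chr1", 9_999_999, "GAP"),
        ("chr1", 10_000_000, "GAP"),
        ("chr7", 25_000_000, "INV"),
    ]
    for key, n in sorted(count_variants_per_bin(toy_variants).items()):
        print(key, n)
```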

These intra-species WGA results may stem from fragmented assemblies, as previously suggested in cannabis genomics69. However, they could also be due to phenotypic changes induced by chemical treatments such as silver nitrate and sodium thiosulfate, which are used to induce male flowers through inhibition of ethylene synthesis70, and to repeated inbreeding for strain stabilization71. Additionally, these results may be influenced by inbreeding within a limited population to achieve desired chemotypes or phenotypes. Given these differences, the accumulation of multiple high-quality cannabis genome assemblies can significantly enhance the resolution of molecular phylogenetic analyses, enabling the identification of subtle differences in evolutionary relationships and the precise elucidation of phylogenetic dynamics. The SV data are available on Figshare for further analysis and use72.

Usage Notes

Table 6 provides a summary of the chromosome labels for easier data accessibility.

Table 6 Chromosome and annotation labels of the Cannabis sativa L. assembly from this study.