The original 2,504 samples from the 1000 Genomes Project were re-sequenced and new related samples were added for generation of an improved publicly accessible whole-genome sequencing resource
Seven years ago, the 1000 Genomes Project (1kGP) published an open-access resource based primarily on low-coverage whole-genome sequencing (WGS) data of 2,504 individuals from 26 populations representing five continental regions of the world, making it the first large-scale WGS effort to deliver a catalog of human genetic variation.
Now, researchers at the New York Genome Center (NYGC), in collaboration with groups at the Massachusetts General Hospital, Yale University, and the Human Genome Structural Variation Consortium (HGSVC), have expanded the 1kGP resource to include nearly all parent-child trios in the collection, alongside the original samples, and sequenced them at high coverage using Illumina NovaSeq instruments. The study, published in Cell, presents comprehensive analyses of the high-coverage WGS data on the expanded 1kGP cohort, which now consists of 3,202 samples, including 602 trios.
Pictured, left to right: NYGC’s Dr. Michael Zody (senior author), Marta Byrska-Bishop and Uday Shanker Evani (co-first authors)
“The 1000 Genomes Project cohort is such a valuable resource, we felt it would be useful to the community to bring the sequencing up to date with the latest version of short-read technology while adding in the richness of the previously omitted family samples,” explained Michael Zody, PhD, Scientific Director of Computational Biology at the NYGC, and the study’s senior author.
Using state-of-the-art methods and algorithms, researchers at the NYGC sequenced DNA derived from lymphoblastoid cell lines (LCLs; i.e., immortalized human B cells from peripheral blood) from the expanded cohort to a targeted depth of 30X genome coverage. Next, the group called single nucleotide variants (SNVs) and short insertions and deletions (INDELs). Calling consists of two steps: identifying variant sites in the sequence data relative to the human reference genome, and genotyping the discovered sites across all samples in the cohort.
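To illustrate why sequencing depth matters for the genotyping step described above, here is a minimal sketch of genotype calling at a single site from read counts. It is not the study's pipeline: the simple binomial model, the 1% per-read error rate, and the function names are all assumptions chosen for illustration.

```python
import math


def genotype_log_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Log-likelihood of each diploid genotype given ref/alt read counts.

    Toy binomial model (an assumption, not the study's method): each read
    shows the alternate allele with probability error_rate for 0/0,
    0.5 for 0/1, and 1 - error_rate for 1/1.
    """
    p_alt_by_genotype = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    depth = ref_reads + alt_reads
    log_comb = math.log(math.comb(depth, alt_reads))
    return {
        genotype: log_comb
        + alt_reads * math.log(p_alt)
        + ref_reads * math.log(1.0 - p_alt)
        for genotype, p_alt in p_alt_by_genotype.items()
    }


def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Return the most likely genotype and a Phred-scaled confidence.

    The confidence is 10 * log10 of the likelihood ratio between the best
    and second-best genotypes.
    """
    log_likelihoods = genotype_log_likelihoods(ref_reads, alt_reads, error_rate)
    best = max(log_likelihoods, key=log_likelihoods.get)
    ranked = sorted(log_likelihoods.values(), reverse=True)
    confidence = 10.0 * (ranked[0] - ranked[1]) / math.log(10)
    return best, confidence


# The same 50/50 allele balance at low depth (4 reads) and at 30 reads.
low_depth_call = call_genotype(ref_reads=2, alt_reads=2)
high_depth_call = call_genotype(ref_reads=15, alt_reads=15)
```

Under this toy model both calls are heterozygous, but the 30-read call carries far higher confidence than the 4-read call, which is the intuition behind moving from low-coverage to high-coverage data.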
Additionally, a team from Dr. Michael Talkowski’s group at the Harvard Medical School, Broad Institute and Massachusetts General Hospital, in collaboration with Dr. Ira Hall’s group at Yale University and the Washington University School of Medicine, as well as the HGSVC, discovered and genotyped a comprehensive set of structural variants (SVs) across the 3,202 1kGP samples by integrating multiple analytic approaches.
Overall, the study shows significant improvements in both discovery power and precision of variant calls, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum, which were previously inaccessible with low-coverage sequencing.
An important aspect of the original 1kGP resource is its use as a reference panel for variant imputation, i.e., the statistical inference of unobserved genotypes in sparse, array-based samples. Imputation relies on groupings of variants that are typically inherited together in the population, which are learned from the reference panel. This use of the resource has facilitated numerous genome-wide association studies (GWAS). Now, with the expansion of the original resource, the team upgraded the reference imputation panel to include more variants discovered through high-coverage WGS and trio families.
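The idea behind imputation can be shown with a deliberately simplified sketch. Production imputation tools use probabilistic models over many reference haplotypes; the version below only copies alleles from the single closest-matching reference haplotype. The function name, the toy panel, and the 0/1 allele encoding are assumptions made for illustration.

```python
def impute_haplotype(target, panel):
    """Fill untyped sites in target from the closest reference haplotype.

    target: list of alleles (0 or 1), with None at sites the array did not type.
    panel:  list of fully observed reference haplotypes of the same length.
    This nearest-match rule is a toy stand-in for real imputation models.
    """

    def mismatches(reference):
        # Count disagreements at the sites that were actually typed.
        return sum(
            1
            for observed, ref_allele in zip(target, reference)
            if observed is not None and observed != ref_allele
        )

    best_match = min(panel, key=mismatches)
    # Keep observed alleles; copy the reference allele where data is missing.
    return [
        ref_allele if observed is None else observed
        for observed, ref_allele in zip(target, best_match)
    ]


# Hypothetical reference panel of three haplotypes over five variant sites.
panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
]
# Array-typed sample: sites 2 and 4 (1-based) were not genotyped.
target = [1, None, 0, None, 0]
imputed = impute_haplotype(target, panel)
```

Here the sample matches the second reference haplotype at every typed site, so the two missing alleles are copied from it. A panel with more variant sites, such as the INDELs and SVs added in the expanded resource, lets more untyped variants be inferred this way.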
“The new imputation panel includes more sites, especially many more common INDELs and SVs, thus expanding the number of variants accessible for GWAS, which, given the large effect size of non-SNV variation, is likely to enable discovery of new genetic associations that help pinpoint the causative variant,” explains Marta Byrska-Bishop, PhD, Senior Bioinformatics Scientist at the NYGC, and the study’s co-first author.
All raw sequence data and variant call sets were immediately released to the public upon sequencing completion via several genomic data repositories, including the International Genome Sample Resource (IGSR), which is maintained by co-authors from the European Bioinformatics Institute at the European Molecular Biology Laboratory (EMBL-EBI).
“Our goal is to have this public resource serve as the benchmark for future population genetic studies and methods development,” adds Xuefang Zhao, PhD, Postdoctoral Fellow at the Center for Genomic Medicine at Massachusetts General Hospital, and the study’s co-first author.
The data have already attracted interest from the genetics and genomics community. This interest will likely continue for years to come thanks to the fully open-access nature of the 1kGP samples, which, unlike those in most newly emerging WGS efforts, are consented for public distribution of genetic data without access or use restrictions.
Sequencing was supported by grants from the National Human Genome Research Institute (NHGRI). This analysis was partly supported by grants from NHGRI, the National Institute of Child Health and Human Development (NICHD), the National Institute of Mental Health (NIMH), the European Molecular Biology Laboratory (EMBL), and the Wellcome Trust.
About the New York Genome Center
The New York Genome Center (NYGC) is an independent, nonprofit academic research institution that serves as a multi-institutional hub for collaborative genomic research. Leveraging our strengths in technology development, computational biology, and whole-genome sequencing and analysis, our mission is to advance genomic science and its application to novel biomedical discoveries. NYGC’s areas of focus include the development of computational and experimental genomic methods and disease-focused research to advance the understanding of the genetic basis of cancer, neurodegenerative disease, and neuropsychiatric disease. In 2020, the NYGC began working with hospital and academic partners to advance COVID-19 research, performing whole-genome sequencing of more than 12,000 viral samples to discover new viral variants and explore the genetic basis of severe disease.
Institutional founding members are: Cold Spring Harbor Laboratory, Columbia University, Albert Einstein College of Medicine, Memorial Sloan Kettering Cancer Center, Icahn School of Medicine at Mount Sinai, New York-Presbyterian Hospital, New York University, Northwell Health, The Rockefeller University, Stony Brook University, and Weill Cornell Medicine. Institutional associate members are: American Museum of Natural History, Hospital for Special Surgery, Georgetown Lombardi Comprehensive Cancer Center, Hackensack Meridian Health, The New York Stem Cell Foundation, Princeton University, Roswell Park Cancer Institute and Rutgers Cancer Institute of New Jersey. For more information on the NYGC, please visit: nygenome.org