Discovery of common sequences absent in the human reference genome using pooled samples from next generation sequencing.

August 16, 2014

Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the Human Genome Project. As the reference genome is used in probe design for microarray technology and in mapping short reads in next generation sequencing (NGS), these missing sequences could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in the reference genome. However, no study has yet investigated the distribution, evolution and functionality of those sequences in human populations.

To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from a large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting that they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution, culminating in most micSeqs being present in Africans, which suggests that some micSeqs may be important sources of human diversity.
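The read-selection step described above can be illustrated with a minimal sketch. It assumes the input is SAM-format alignment text and uses only the standard SAM FLAG bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function names, the category labels, and the per-sample dictionary layout are hypothetical choices made for illustration; they are not the authors' pipeline, and the strict filtering and assembly steps are not shown.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag: int) -> str:
    """Classify one read of a pair from its SAM FLAG (labels are illustrative)."""
    if not flag & PAIRED:
        return "unpaired"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "orphan"       # neither end maps to the reference
    if read_unmapped:
        return "oea"          # unmapped end whose mate is anchored
    if mate_unmapped:
        return "anchor"       # mapped end whose mate is unmapped
    return "mapped_pair"


def pool_oea_reads(sam_lines_by_sample):
    """Collect (sample, read name, sequence) for OEA reads across all samples.

    `sam_lines_by_sample` is a hypothetical mapping: sample ID -> iterable of
    SAM text lines.
    """
    pooled = []
    for sample, lines in sam_lines_by_sample.items():
        for line in lines:
            if line.startswith("@"):      # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            qname, flag, seq = fields[0], int(fields[1]), fields[9]
            if classify_read(flag) == "oea":
                pooled.append((sample, qname, seq))
    return pooled
```

Pooling the unmapped ends across individuals, rather than processing each genome separately, is what allows sequences with modest per-sample coverage to accumulate enough reads to be assembled; the anchor reads then indicate where in the reference the sequence is inserted.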

76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates.

Figure: PCR of individuals and evolutionary conservation of micSeq30. a) PCR results for micSeq30 in samples from the three populations used for validation. The allele frequency shows clear population stratification for this micSeq, with a higher frequency in Africans and a much lower frequency in the other populations (p-value < 10^-5 for African vs non-African; Table 1). b) Sanger sequencing results for PCR-amplified DNA, the predicted sequence of micSeq30, and homologs from other primates. The genomic coordinates for the other primates are: chimp, chrX:132,861,450-132,861,750; gorilla, chrX:129,774,750-129,775,050; orangutan, chrX:131,684,602-131,684,902; and rhesus, chrX:130,465,777-130,466,071. The red arrows show the corresponding coordinates in the human reference genome (hg19). The insertion position of micSeq30 in the reference genome is chrX:131,393,592, and the corresponding anchor reads for micSeq30 are located in the region chrX:131,393,154-131,393,947.

Results from: Liu Y, Koyuturk M, Maxwell S, Xiang M, Veigl M, Cooper RS, Tayo BO, Li L, LaFramboise T, Wang Z, Zhu X, Chance MR. BMC Genomics. 2014 Aug 16;15(1):685. [Epub ahead of print]