Customer Name | DEMO | Customer Institution | DEMO | Customer Email | demo | Project ID | DEMO |
---|---|
Seller Name | DEMO | Seller Email | demo@lcsciences.com |
Your reliable partner in genomics, transcriptomics and proteomics
Whole genome re-sequencing (WGS) reveals the complete DNA make-up of an individual for a species with a complete reference genome, enabling a better understanding of variation both within and between individuals or populations. By aligning reads to the reference genome, we can identify many variants, such as SNPs, InDels, SVs and CNVs, in an individual's sequencing data. These variants can be applied to clinical research, population genetics, association studies, evolutionary biology and many other related fields.
Database | link | web_link |
Genome | hg19 | ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz |
Analysis item | Software | Version/date |
Quality control | fastp | 0.19.4 |
Alignment | BWA | 0.7.12 |
Alignment | SAMtools | 1.9 |
Alignment | Picard | 2.17.3 |
SNP/INDEL calling | GATK | 4.0.4.0 |
SNP/INDEL annotation | VEP |
Total DNA was isolated from whole blood collected in EDTA tubes, or from tissue samples, using a standard DNA extraction protocol. DNA quality was assessed by measuring the A260/280 ratio with a spectrophotometer. Samples with an A260/280 ratio between 1.8 and 2.0 were considered suitable for use.
Genomic DNA (gDNA, 1 µg) was sheared using a Covaris series sonicator to achieve a target peak of 200 to 300 bp. Size selection after fragmentation was performed using AMPure XP beads, and blunt-end DNA fragments were generated by a combination of fill-in reactions and exonuclease activity. An A-base was then added to the blunt ends of each strand, preparing them for ligation to the T-tailed sequencing adapters. Five cycles of PCR were performed to amplify the ligation products and generate the gDNA library. Finally, 150 bp paired-end sequencing was performed on an Illumina NovaSeq 6000 following the vendor's recommended protocol.
Workflow of WGS
Sequencing generated a total of DEMO paired-end reads of 150 bp in length, yielding DEMO Gb of sequence. Prior to alignment, low-quality reads were removed: (1) reads containing sequencing adaptors; (2) nucleotides with a quality score lower than 20. After filtering, a total of DEMO Gb of cleaned, paired-end reads remained.
For the alignment step, BWA is used to align the reads in the paired FASTQ files to the reference genome. As the first post-alignment processing step, Picard tools is used to identify and mark duplicate reads in the BAM file.
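The alignment and duplicate-marking steps described above can be sketched as command lines assembled in Python. This is only an illustration: the file names, thread count, read-group string and the use of the `bwa mem` algorithm and a `picard` wrapper script are assumptions, not the exact parameters used for this report. The function only builds the command strings and does not execute them.

```python
import shlex


def build_alignment_commands(ref, fq1, fq2, sample, threads=8):
    """Return shell command strings for alignment, sorting and duplicate marking.

    All paths, the read-group layout and the output names are hypothetical.
    """
    # bwa expects a literal backslash-t between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"

    # Align paired reads and pipe the SAM stream into a coordinate sort.
    align = (
        f"bwa mem -t {threads} -R {shlex.quote(read_group)} "
        f"{shlex.quote(ref)} {shlex.quote(fq1)} {shlex.quote(fq2)} "
        f"| samtools sort -@ {threads} -o {shlex.quote(sorted_bam)} -"
    )
    # Mark (not remove) duplicate reads in the sorted BAM.
    mark_dup = (
        f"picard MarkDuplicates I={shlex.quote(sorted_bam)} "
        f"O={shlex.quote(dedup_bam)} M={shlex.quote(metrics)}"
    )
    return [align, mark_dup]


commands = build_alignment_commands(
    "hg19.fa", "PA_proband_R1.fq.gz", "PA_proband_R2.fq.gz", "PA_proband"
)
for cmd in commands:
    print(cmd)
```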
In the second post-alignment processing step, local read realignment is performed to correct for potential alignment errors around indels. Mapping of reads around the edges of indels often results in misaligned bases creating false positive SNP calls. Local realignment uses these mismatching bases to determine if a site should be realigned, and applies a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel and remove misalignment artifacts.
Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. Due to systematic biases, the reported quality scores are known to be inaccurate and must therefore be recalibrated prior to genotyping. After recalibration, the quality scores in the output BAM correspond more closely to the probability of a sequencing error.
Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for variation from the reference via Bayesian inference.
A Gaussian mixture model is fitted to assign an accurate confidence score to each putative variant call and to evaluate new potential variants.
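The relationship between a base quality score and the probability of a sequencing error follows the standard Phred definition, Q = -10 log10(p). A minimal sketch of the conversion in both directions is given below; the function names are illustrative and not part of any pipeline tool.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a base-call error."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Under this definition a Q20 base has a 1% chance of being wrong and a Q30 base a 0.1% chance, which is why Q20% and Q30% are reported as quality summaries.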
Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is utilized to add biological information to a set of variants.
Bioinformatics pipeline for WGS
Family | Sample | Patient/Normal | gender |
PA | PA_proband | noset | noset |
PA | PA_sister | noset | noset |
PA | PA_brother | noset | noset |
document location:summary/1_RawData/sample_info_mendelian.xlsx
Sample | Raw Read(M) | Raw Base(G) | Valid Read(M) | Valid Base(G) | Valid% | Q20% | Q30% | GC% |
PA_brother | 816.06 | 122.41 | 796.07 | 113.87 | 97.55 | 96.21 | 90.07 | 40.68 |
PA_proband | 821.90 | 123.29 | 795.56 | 114.01 | 96.79 | 96.18 | 90.02 | 40.87 |
PA_sister | 773.91 | 116.09 | 750.52 | 107.60 | 96.98 | 95.83 | 89.35 | 41.26 |
document location:summary/1_RawData/ReadsQC.xlsx
Fig:Reads Depth Coverage
document location:summary/2_MappedData/ReadsDepthCoverage.png
Depth of coverage on each chromosome
Depth of coverage = total covered length / total length of each chromosome
document location:summary/2_MappedData/DepthCoverageByChr.png
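The coverage ratio above, and the related "percentage of bases at or above N×" metrics, can be computed from per-base depths as in the sketch below. The depth values are made up for illustration, and the function name and output keys are hypothetical rather than taken from the pipeline.

```python
def coverage_stats(depths, thresholds=(1, 10, 20, 30)):
    """Summarise per-base depths for one chromosome or sample.

    depths: iterable of integer read depths, one per reference position.
    Returns the mean depth and the percentage of positions at or above
    each depth threshold.
    """
    depths = list(depths)
    total = len(depths)
    if total == 0:
        raise ValueError("no positions supplied")
    stats = {"mean_depth": sum(depths) / total}
    for t in thresholds:
        covered = sum(1 for d in depths if d >= t)
        stats[f"pct_bases_{t}x"] = 100.0 * covered / total
    return stats


toy_depths = [0, 5, 12, 25, 31, 18, 0, 22, 9, 40]  # made-up values
toy_stats = coverage_stats(toy_depths)
print(toy_stats)
```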
Term | PA_brother | PA_proband | PA_sister |
Total Reads | 816055810 | 821903186 | 773906286 |
Valid Reads | 796067230 | 795555384 | 750518372 |
Mapped Reads | 794845447 | 794233789 | 749275898 |
Duplicate Reads | 149115353 | 167447724 | 137788372 |
Valid Rate % | 97.55 | 96.79 | 96.98 |
Mapped Rate % | 99.85 | 99.83 | 99.83 |
Duplicate Rate % | 18.76 | 21.08 | 18.39 |
MEAN_READ_LENGTH | 143.04 | 143.31 | 143.37 |
MEAN_DEPTH | 20.18 | 20.11 | 18.50 |
PCT_TARGET_BASES_1X | 98.29 | 97.80 | 97.79 |
PCT_TARGET_BASES_10X | 93.05 | 94.05 | 91.24 |
PCT_TARGET_BASES_20X | 53.69 | 52.37 | 40.00 |
PCT_TARGET_BASES_30X | 8.81 | 8.61 | 7.45 |
document location:summary/2_MappedData/MappedStatistics.xlsx
Table Description:
Term | Description |
---|---|
Total Reads | Number of total reads |
Valid Reads | Number of Valid reads |
Mapped Reads | Number of mapped reads on reference genome |
Duplicate Reads | Number of duplicate reads |
Valid Rate % | ratio:Valid reads/Total Reads |
Mapped Rate % | ratio:Mapped Reads/Valid Reads |
Duplicate Rate % | ratio:Duplicate Reads/Mapped Reads |
MEAN_READ_LENGTH | Average length of mapped reads |
MEAN_DEPTH | Mean depth |
PCT_TARGET_BASES_1X | percentage of base>=1x |
PCT_TARGET_BASES_2X | percentage of base>=2x |
PCT_TARGET_BASES_10X | percentage of base>=10x |
PCT_TARGET_BASES_20X | percentage of base>=20x |
PCT_TARGET_BASES_30X | percentage of base>=30x |
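As a check on the ratio definitions in this table, the sketch below recomputes the three rates for PA_brother from the read counts reported in the mapping statistics. The function name and dictionary keys are illustrative.

```python
def mapping_rates(total_reads, valid_reads, mapped_reads, duplicate_reads):
    """Compute the percentage rates defined in the table description."""
    return {
        "valid_rate_pct": round(100 * valid_reads / total_reads, 2),
        "mapped_rate_pct": round(100 * mapped_reads / valid_reads, 2),
        "duplicate_rate_pct": round(100 * duplicate_reads / mapped_reads, 2),
    }


# Read counts for PA_brother, taken from the mapping statistics table.
pa_brother = mapping_rates(
    total_reads=816055810,
    valid_reads=796067230,
    mapped_reads=794845447,
    duplicate_reads=149115353,
)
print(pa_brother)
```

The results agree with the reported values of 97.55, 99.85 and 18.76.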
Fig: Depth of coverage on each sample
A single-nucleotide polymorphism, often abbreviated to SNP, is a variation in a single nucleotide that occurs at a specific position in the genome including transition and transversion, where each variation is present to some appreciable degree within a population (e.g. > 1%).
Indel is a molecular biology term for an insertion or deletion of bases in the genome of an organism. It is classified among small genetic variations, measuring from 1 to 10 000 base pairs in length, including insertion and deletion events that may be separated by many years, and may not be related to each other in any way. A microindel is defined as an Indel that results in a net change of 1 to 50 nucleotides.
Indels can also be used as genetic markers in natural populations, especially in phylogenetic studies. It has been shown that genomic regions with multiple Indels can be used for species-identification procedures.
A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes.
A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration.
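The definitions above can be illustrated with a small classifier that labels a variant from its reference and alternative alleles. This is a simplified sketch that assumes single, left-anchored REF/ALT alleles as written in a VCF record; the function name and labels are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_variant(ref, alt):
    """Label a variant as a transition, transversion, insertion or deletion."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) == 1 and len(alt) == 1:
        if ref == alt:
            raise ValueError("REF and ALT are identical")
        # Transition: purine<->purine or pyrimidine<->pyrimidine.
        same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
        return "SNV:transition" if same_class else "SNV:transversion"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "substitution"  # equal-length multi-nucleotide change


for ref, alt in [("A", "G"), ("A", "C"), ("A", "ATG"), ("CTT", "C")]:
    print(ref, alt, classify_variant(ref, alt))
```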
SNP_Class | NON_SYNONYMOUS_CODING | START_GAINED | START_LOST | STOP_GAINED | STOP_LOST | SYNONYMOUS_CODING | |
PA | 41046 | 0 | 88 | 483 | 55 | 45407 | |
SNP_Pos | DOWNSTREAM | INTRON | SPLICE_SITE_ACCEPTOR | UPSTREAM | UTR_3_PRIME | UTR_5_PRIME | |
PA | 1873859 | 11932927 | 474 | 1912141 | 87345 | 31299 | |
INDEL_Class | CODON_CHANGE_PLUS_CODON_DELETION | CODON_CHANGE_PLUS_CODON_INSERTION | CODON_DELETION | CODON_INSERTION | FRAME_SHIFT | FRAME_SHIFT+STOP_GAINED | FRAME_SHIFT+START_LOST |
PA | 583 | 280 | 204 | 308 | 1363 | 46 | 7 |
INDEL_Pos | DOWNSTREAM | INTRON | SPLICE_SITE_ACCEPTOR | SPLICE_SITE_DONOR | UPSTREAM | UTR_3_PRIME | UTR_5_PRIME |
PA | 462912 | 2869068 | 431 | 186 | 478933 | 21352 | 4017 |
document location:summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx
High-Impact Effects | Moderate-Impact Effects | Low-Impact Effects |
---|---|---|
SPLICE_SITE_ACCEPTOR | NON_SYNONYMOUS_CODING | SYNONYMOUS_START |
SPLICE_SITE_DONOR | CODON_CHANGE | NON_SYNONYMOUS_START |
START_LOST | CODON_INSERTION | START_GAINED |
EXON_DELETED | CODON_CHANGE_PLUS_CODON_INSERTION | SYNONYMOUS_CODING |
FRAME_SHIFT | CODON_DELETION | SYNONYMOUS_STOP |
STOP_GAINED | CODON_CHANGE_PLUS_CODON_DELETION | NON_SYNONYMOUS_STOP |
STOP_LOST | UTR_5_DELETED | INTRON |
FRAME_SHIFT+START_LOST | UTR_3_DELETED | UPSTREAM |
FRAME_SHIFT+STOP_GAINED | | DOWNSTREAM |
| | UTR_3_PRIME |
| | UTR_5_PRIME |
document location:summary/3_VariantData/PA/PA_SNV.png
Sample | all | genotype.Het | genotype.Hom | novel | in dbSNP | novel_proportion | dbSNP_proportion | Ts | Tv | novel.Ts | novel.Tv |
PA | 5960805 | 4353332 | 1607473 | 1182640 | 4778165 | 0.20 | 0.80 | 3970908 | 1996891 | 768994 | 414792 |
document location:summary/3_VariantData/VariantsType_SNP.xlsx
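The proportion columns in the SNP summary can be reproduced from the counts, and the Ts and Tv columns give the transition/transversion ratio, a common sanity check on SNP calls. The sketch below uses the values reported for family PA; the variable names are illustrative.

```python
# Counts for family PA, taken from the SNP summary table.
snp = {
    "all": 5960805,
    "novel": 1182640,
    "in_dbsnp": 4778165,
    "ts": 3970908,
    "tv": 1996891,
}

novel_proportion = snp["novel"] / snp["all"]
dbsnp_proportion = snp["in_dbsnp"] / snp["all"]
ts_tv_ratio = snp["ts"] / snp["tv"]

print(f"novel proportion: {novel_proportion:.2f}")
print(f"dbSNP proportion: {dbsnp_proportion:.2f}")
print(f"Ts/Tv ratio:      {ts_tv_ratio:.2f}")
```

A Ts/Tv ratio of roughly 2 is what is generally expected for human whole-genome SNP calls.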
document location:ssrc/summary/3_VariantData/PA/PA_SNP_VariantsType.png
All SNPs were annotated by VEP in VCF format: summary/3_VariantData/PA/PA_SNP.annotation.fixed.function.vcf
Table Description:
Term | Description |
---|---|
CHROM | chromosome id |
POS | chromosome position |
ID | dbSNP ID |
REF | reference allele |
ALT | alternative allele |
QUAL | quality |
FILTER | filter |
INFO | information |
AD | Allelic depths |
DP | Approximate read depth |
GQ | Genotype Quality |
GT | genotype |
PL | Phred-scaled likelihoods |
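The per-sample fields listed above (GT, AD, DP, GQ, PL) appear in a VCF record as a colon-separated string whose order is given by the FORMAT column. A minimal parsing sketch follows; the example values are hypothetical and not taken from the delivered files.

```python
def _to_int(value):
    """Convert a VCF field to int, mapping the missing value '.' to None."""
    return None if value == "." else int(value)


def parse_sample_field(format_str, sample_str):
    """Pair FORMAT keys with one sample's values and convert numeric fields."""
    record = dict(zip(format_str.split(":"), sample_str.split(":")))
    for key in ("AD", "PL"):  # comma-separated integer lists
        if key in record:
            record[key] = [_to_int(v) for v in record[key].split(",")]
    for key in ("DP", "GQ"):  # single integers
        if key in record:
            record[key] = _to_int(record[key])
    return record


# Hypothetical heterozygous call: 12 reference reads, 9 alternative reads.
example = parse_sample_field("GT:AD:DP:GQ:PL", "0/1:12,9:21:99:255,0,380")
print(example)
```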
document location:src/summary/3_VariantData/PA/PA_INDEL.png
Sample | all | genotype.Het | genotype.Hom | novel | in dbSNP | novel_proportion | dbSNP_proportion |
PA | 1181622 | 877853 | 303769 | 330523 | 851099 | 0.28 | 0.72 |
document location:src/summary/3_VariantData/VariantsType_INDEL.xlsx
document location:ssrc/summary/3_VariantData/PA/PA_INDEL_VariantsType.png
All InDels were annotated by VEP in VCF format: summary/3_VariantData/PA/PA_INDEL.annotation.fixed.function.vcf
Term | Description |
---|---|
CHROM | chromosome id |
POS | chromosome position |
ID | dbSNP ID |
REF | reference allele |
ALT | alternative allele |
QUAL | quality |
FILTER | filter |
INFO | information |
AD | Allelic depths |
DP | Approximate read depth |
GQ | Genotype Quality |
GT | Genotype |
PL | Phred-scaled likelihoods |
Copy number variations (CNVs) on the genome were calculated by Control-FREEC.
Chromosome | Start | End | Copy number | Status |
1 | 562000 | 572999 | 79 | gain |
1 | 664000 | 850999 | 5 | gain |
1 | 2581000 | 2631999 | 12 | gain |
1 | 5725000 | 5737999 | 12 | gain |
1 | 12900000 | 12951999 | 3 | gain |
1 | 12950000 | 13166999 | 1 | loss |
1 | 13165000 | 13220999 | 3 | gain |
1 | 13219000 | 13782999 | 1 | loss |
1 | 16832000 | 16844999 | 4 | gain |
1 | 16884000 | 16973999 | 7 | gain |
1 | 16972000 | 17013999 | 3 | gain |
1 | 17012000 | 17059999 | 6 | gain |
1 | 17058000 | 17089999 | 4 | gain |
1 | 17190000 | 17276999 | 4 | gain |
1 | 72765000 | 72812999 | 0 | loss |
1 | 85975000 | 86006999 | 4 | gain |
document location:summary/4_CNV/PA_proband/PA_proband_CNVs.xlsx
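The CNV rows above can be summarised by status, as in the sketch below. It uses four of the listed segments and assumes a diploid baseline (copy number 2) and inclusive start and end coordinates; both are assumptions made for illustration, and the function name is hypothetical.

```python
# (chromosome, start, end, copy number, status) for four listed segments.
cnv_rows = [
    ("1", 562000, 572999, 79, "gain"),
    ("1", 12950000, 13166999, 1, "loss"),
    ("1", 72765000, 72812999, 0, "loss"),
    ("1", 85975000, 86006999, 4, "gain"),
]


def summarise_cnvs(rows, ploidy=2):
    """Count segments and total bases per status, checking status against ploidy."""
    summary = {"gain": {"count": 0, "bases": 0}, "loss": {"count": 0, "bases": 0}}
    for chrom, start, end, copy_number, status in rows:
        expected = "gain" if copy_number > ploidy else "loss"
        if copy_number == ploidy or expected != status:
            raise ValueError(f"unexpected status for {chrom}:{start}-{end}")
        summary[status]["count"] += 1
        summary[status]["bases"] += end - start + 1  # inclusive coordinates assumed
    return summary


cnv_summary = summarise_cnvs(cnv_rows)
print(cnv_summary)
```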
document location:summary/4_CNV/PA_proband/PA_proband.makeGraph_Chromosome_*.png
Structural variations (SVs) on the genome were calculated by Lumpy and reported in VCF format.
We divided SVs into 4 types: duplication (DUP), deletion (DEL), inversion (INV) and breakend (BND).
Term Description:
Term | Description |
---|---|
CHROM | chromosome id |
POS | chromosome position |
ID | variant ID |
REF | reference allele |
ALT | alternative allele |
QUAL | quality |
FILTER | filter |
INFO | information |
FORMAT | format |
document location:summary/5_SV/PA_proband/PA_proband.sv.vcf
Sample | DUP | DEL | INV | BND |
PA_proband | 264 | 2425 | 3 | 6152 |
Term Description:
Term | Description |
---|---|
Sample | Sample ID |
DEL | deletion |
DUP | duplication |
INV | Inversion |
BND | breakend (may indicate a translocation) |
document location:summary/5_SV/PA_proband/PA_proband.SV_stat.xlsx
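Per-type counts like those in the SV statistics table can be derived from the SV VCF by tallying the `SVTYPE` key in the INFO column. The sketch below runs on a few made-up VCF lines rather than the delivered file, and the function name is hypothetical.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count records per SVTYPE value found in the INFO column (8th field)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]
        for entry in info.split(";"):
            if entry.startswith("SVTYPE="):
                counts[entry.split("=", 1)[1]] += 1
                break
    return counts


# Made-up records for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10000\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=12000",
    "1\t50000\t2\tN\t<DUP>\t.\t.\tSVTYPE=DUP;END=58000",
    "2\t70000\t3\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=70500",
    "3\t90000\t4_1\tN\tN]5:1000]\t.\t.\tSVTYPE=BND",
]
sv_counts = count_sv_types(toy_vcf)
print(sv_counts)
```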
ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences, and reported according to the HGVS standard. ClinVar then presents the data for interactive users as well as those wishing to use ClinVar in daily workflows and other local applications. ClinVar works in collaboration with interested organizations to meet the needs of the medical genetics community as efficiently and effectively as possible.
document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.ClinVar.xlsx
dbNSFP (http://varianttools.sourceforge.net/Annotation/DbNSFP) is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. This database significantly facilitates the process of querying predictions and annotations from different databases/web-servers for large amounts of nsSNVs discovered in exome-sequencing studies.
document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.dbNSFP.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.dbNSFP.xlsx
DisGeNET (http://www.disgenet.org/home/) is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation.
DisGeNET has since been updated to v6.0, described as follows: "The current version contains 628,685 gene-disease associations (GDAs), between 17,549 genes and 24,166 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 210,498 variant-disease associations (VDAs), between 117,337 variants and 10,358 diseases, traits, and phenotypes".
document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.DisGeNET.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.DisGeNET.xlsx
OMIM (http://www.omim.org/) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 15,000 genes. OMIM focuses on the relationship between phenotype and genotype, and its entries contain copious links to other genetics resources.
document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.OMIM.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.OMIM.xlsx
PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) is a tool that integrates the search and retrieval of associated genotype-phenotype data from the National Human Genome Research Institute (NHGRI) Genome-wide Association Study (GWAS) Catalog with data housed in Gene, dbGaP, OMIM, GTEx and dbSNP. It supports searches by genotype and by phenotype.
document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.PheGenI.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.PheGenI.xlsx
document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.VEP.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.VEP.xlsx
The original fluorescence images obtained from high throughput sequencing platforms are transformed to short reads by base calling. These short reads (Raw data) are recorded in FASTQ format, which contains sequence information (reads) and corresponding sequencing quality information.
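A FASTQ record consists of four lines: a header starting with `@`, the read sequence, a `+` separator, and a quality string with one character per base. The sketch below parses such records and computes the fraction of bases at or above Q20 and Q30. It assumes the Phred+33 quality encoding, and the read shown is made up for illustration.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33 encoding assumed: quality = ASCII code - 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def fraction_at_least(quals, threshold):
    """Fraction of bases whose quality is at or above the threshold."""
    return sum(q >= threshold for q in quals) / len(quals)


toy_fastq = ["@read1", "ACGTN", "+", "IIF5#"]  # made-up record
name, seq, quals = next(parse_fastq(toy_fastq))
print(name, seq, quals)
print("Q20 fraction:", fraction_at_least(quals, 20))
print("Q30 fraction:", fraction_at_least(quals, 30))
```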
Sequencing error rate and base quality can be affected by various factors such as the sequencing platform, chemical reagents and sample quality. The first several bases show higher error rates, which is caused by a less sensitive fluorescence image signal at the beginning of sequencing. The error rate also shows an increasing trend with read extension, due to the consumption of chemical reagents. These two features are common for Illumina sequencing platforms.
GC content distribution evaluation aims to check for the potential AT-GC separation phenomenon, which may result from sample contamination, sequencing bias or errors in library preparation. In theory, GC or AT content should be constant across read positions. However, it is common to see the first 6 to 7 bases in both reads (read1 and read2) fluctuate in GC content, due to primer amplification bias and other causes.
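The per-position GC check described above can be sketched as follows: for each read position, count the fraction of called bases that are G or C across all reads. The reads are made up for illustration and the function name is hypothetical; a QC tool such as fastp reports an equivalent curve.

```python
def gc_by_position(reads):
    """Return the GC fraction at each read position across all reads.

    Reads may differ in length; uncalled bases (N) are ignored.
    """
    gc_counts, totals = [], []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(totals):  # first read reaching this position
                totals.append(0)
                gc_counts.append(0)
            if base in "ACGT":
                totals[i] += 1
                if base in "GC":
                    gc_counts[i] += 1
    return [g / t if t else 0.0 for g, t in zip(gc_counts, totals)]


toy_reads = ["GCAT", "GGAT", "ACAT", "TCGT"]  # made-up reads
gc_profile = gc_by_position(toy_reads)
print(gc_profile)
```

A flat profile indicates constant GC content along the read, while large swings at the first few positions correspond to the fluctuation described above.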
document location:src/summary/8_Quality_Control/1_fastp
2575 West Bellfort Street
Suite 270
Houston, TX
77054 USA
Local (713) 664-7087
Toll Free: 1-888-528-8818
Fax: (713) 664-8181