LC Sciences Human WGS Report



Customer Name: DEMO
Customer Institution: DEMO
Customer Email: demo
Project ID: DEMO
Seller Name: DEMO
Seller Email: demo@lcsciences.com

Your reliable partner in genomics, transcriptomics and proteomics



1. Introduction

Whole genome re-sequencing (WGS) reveals the complete DNA make-up of an individual from a species with a complete reference genome, enabling us to better understand variation both within and between individuals or populations. By aligning the sequencing reads to the reference genome, we can detect many variants, such as SNPs, InDels, SVs and CNVs, in an individual's sequencing data. These variants can be applied to clinical research, population genetics, association studies, evolutionary biology and many other related fields.






2. Project Information

2.1. Sample information


Species name: Human

Latin name: Homo sapiens

2.2. Disease information


Disease: DEMO

2.3. Database


Database     link   web_link
Genome     hg19   ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

2.4. Bioinformatics software


Analysis item         Software   Version/date
Quality control       fastp      0.19.4
Alignment             BWA        0.7.12
                      SAMtools   1.9
                      Picard     2.17.3
SNP/INDEL calling     GATK       4.0.4.0
SNP/INDEL annotation  VEP




3. Materials and Methods

3.1. DNA extraction and whole genome sequencing


Total DNA was isolated from whole blood collected in EDTA tubes, or from tissue samples, using a standard DNA extraction protocol. DNA quality was assessed from the A260/280 ratio measured on a spectrophotometer. Samples with an A260/280 ratio between 1.8 and 2.0 were considered suitable for library preparation.


Genomic DNA (gDNA, 1 µg) was sheared with a Covaris Series sonicator to a target peak size of 200 to 300 bp. After fragmentation, size selection was performed with AMPure XP beads, and blunt-ended DNA fragments were generated by a combination of fill-in reactions and exonuclease activity. An A base was then added to the blunt ends of each strand to prepare the fragments for ligation to the T-tailed sequencing adapters. The ligation products were amplified with 5 cycles of PCR to generate the gDNA library. Finally, 150 bp paired-end sequencing was performed on an Illumina NovaSeq 6000 following the vendor's recommended protocol.



Workflow of WGS


3.2. Bioinformatics analysis


Sequencing generated a total of DEMO paired-end reads of 150 bp, yielding DEMO G of sequence. Prior to alignment, low-quality reads were removed: (1) reads containing sequencing adaptors and (2) reads containing nucleotides with a quality score lower than 20. After filtering, a total of DEMO G bp of clean paired-end reads remained.


For the alignment step, BWA is used to align the reads in the paired FASTQ files to the reference genome. As the first post-alignment processing step, Picard tools is used to identify and mark duplicate reads in the BAM file.

In the second post-alignment processing step, local read realignment is performed to correct for potential alignment errors around indels. Mapping of reads around the edges of indels often results in misaligned bases creating false positive SNP calls. Local realignment uses these mismatching bases to determine if a site should be realigned, and applies a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel and remove misalignment artifacts.


Each base of each read has an associated quality score that corresponds to the probability of a sequencing error. Because of systematic biases, the reported quality scores are known to be inaccurate and must therefore be recalibrated prior to genotyping. After recalibration, the quality scores in the output BAM correspond more closely to the probability of a sequencing error.
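As a brief illustration of how a quality score relates to an error probability, the sketch below uses the standard Phred definition Q = -10*log10(P). The function names are illustrative only and are not part of the pipeline described in this report.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Return the error probability P implied by a Phred score Q (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Return the Phred score Q implied by an error probability P (Q = -10*log10(P))."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Q20 and Q30 are the thresholds used in the sequencing quality statistics.
for q in (20, 30):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4f}")
```

Under this definition, Q20 corresponds to an error probability of 1% and Q30 to 0.1%.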



Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for variation from the reference via Bayesian inference.



A Gaussian mixture model is then fitted to assign an accurate confidence score to each putative variant call and to evaluate new potential variants.



Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is utilized to add biological information to a set of variants.




Bioinformatics pipeline for WGS





4. Overview

4.1. Sample information



Family  Sample      Patient/Normal  Gender
PA      PA_proband  noset           noset
PA      PA_sister   noset           noset
PA      PA_brother  noset           noset

document location:summary/1_RawData/sample_info_mendelian.xlsx


4.2. Statistics of Sequencing Quality


Sample      Raw Read(M)  Raw Base(G)  Valid Read(M)  Valid Base(G)  Valid%  Q20%   Q30%   GC%
PA_brother  816.06       122.41       796.07         113.87         97.55   96.21  90.07  40.68
PA_proband  821.90       123.29       795.56         114.01         96.79   96.18  90.02  40.87
PA_sister   773.91       116.09       750.52         107.60         96.98   95.83  89.35  41.26

document location:summary/1_RawData/ReadsQC.xlsx


4.3. Sequencing depth



Fig:Reads Depth Coverage



document location:summary/2_MappedData/ReadsDepthCoverage.png




Depth of coverage on each chromosome


Depth of coverage=covered total length/total length of each chromosome
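The ratio above can be sketched in a few lines of Python. The per-base depth lists below are invented toy data, not values from this project, and `coverage_fraction` is a hypothetical helper name. Raising the `min_depth` threshold gives the same kind of figure as the PCT_TARGET_BASES_NX terms reported in section 4.4.

```python
def coverage_fraction(depths, min_depth=1):
    """Fraction of positions with depth >= min_depth (covered length / total length)."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy per-base depths for two made-up chromosomes (illustrative only).
toy_depths = {
    "chrA": [0, 3, 5, 0, 12, 20, 7, 0, 1, 30],
    "chrB": [4, 4, 0, 0],
}

for chrom, depths in toy_depths.items():
    at_1x = coverage_fraction(depths, min_depth=1)
    at_10x = coverage_fraction(depths, min_depth=10)
    print(f"{chrom}: >=1x {at_1x:.2f}, >=10x {at_10x:.2f}")
```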



document location:summary/2_MappedData/DepthCoverageByChr.png



4.4. Statistics of mapped reads

Term                  PA_brother  PA_proband  PA_sister
Total Reads           816055810   821903186   773906286
Valid Reads           796067230   795555384   750518372
Mapped Reads          794845447   794233789   749275898
Duplicate Reads       149115353   167447724   137788372
Valid Rate %          97.55       96.79       96.98
Mapped Rate %         99.85       99.83       99.83
Duplicate Rate %      18.76       21.08       18.39
MEAN_READ_LENGTH      143.04      143.31      143.37
MEAN_DEPTH            20.18       20.11       18.50
PCT_TARGET_BASES_1X   98.29       97.80       97.79
PCT_TARGET_BASES_10X  93.05       94.05       91.24
PCT_TARGET_BASES_20X  53.69       52.37       40.00
PCT_TARGET_BASES_30X  8.81        8.61        7.45



document location:summary/2_MappedData/MappedStatistics.xlsx


Table Description:


Term                  Description
Total Reads           Number of total reads
Valid Reads           Number of valid reads
Mapped Reads          Number of reads mapped to the reference genome
Duplicate Reads       Number of duplicate reads
Valid Rate %          Ratio: Valid Reads/Total Reads
Mapped Rate %         Ratio: Mapped Reads/Valid Reads
Duplicate Rate %      Ratio: Duplicate Reads/Mapped Reads
MEAN_READ_LENGTH      Average length of mapped reads
MEAN_DEPTH            Mean depth
PCT_TARGET_BASES_1X   Percentage of bases with depth >= 1x
PCT_TARGET_BASES_2X   Percentage of bases with depth >= 2x
PCT_TARGET_BASES_10X  Percentage of bases with depth >= 10x
PCT_TARGET_BASES_20X  Percentage of bases with depth >= 20x
PCT_TARGET_BASES_30X  Percentage of bases with depth >= 30x
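The three rate terms can be checked against the read counts with a short calculation. The sketch below copies the counts from the table in this section and applies the ratios given in the table description. The dictionary layout and function names are illustrative choices, not the format of MappedStatistics.xlsx.

```python
# Read counts copied from the mapped-read statistics table in this section.
counts = {
    "PA_brother": {"total": 816055810, "valid": 796067230,
                   "mapped": 794845447, "duplicate": 149115353},
    "PA_proband": {"total": 821903186, "valid": 795555384,
                   "mapped": 794233789, "duplicate": 167447724},
    "PA_sister": {"total": 773906286, "valid": 750518372,
                  "mapped": 749275898, "duplicate": 137788372},
}


def pct(numerator, denominator):
    """Percentage rounded to two decimals, matching the table's precision."""
    return round(100 * numerator / denominator, 2)


def derive_rates(c):
    """Apply the ratio definitions from the table description."""
    return {
        "valid_rate": pct(c["valid"], c["total"]),
        "mapped_rate": pct(c["mapped"], c["valid"]),
        "duplicate_rate": pct(c["duplicate"], c["mapped"]),
    }


rates = {sample: derive_rates(c) for sample, c in counts.items()}
for sample, r in rates.items():
    print(sample, r)
```

The derived values reproduce the Valid Rate %, Mapped Rate % and Duplicate Rate % rows of the table.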



Fig: Depth of coverage on each sample




5. Variant calling

A single-nucleotide polymorphism, often abbreviated to SNP, is a variation in a single nucleotide that occurs at a specific position in the genome including transition and transversion, where each variation is present to some appreciable degree within a population (e.g. > 1%).


Indel is a molecular biology term for an insertion or deletion of bases in the genome of an organism. It is classified among small genetic variations, measuring from 1 to 10 000 base pairs in length, including insertion and deletion events that may be separated by many years, and may not be related to each other in any way. A microindel is defined as an Indel that results in a net change of 1 to 50 nucleotides.


Indels can also be used as genetic markers in natural populations, especially in phylogenetic studies. It has been shown that genomic regions with multiple Indels can be used for species-identification procedures.


A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes.


A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration.




5.1. SNP Statistics


SNP_ClassNON_SYNONYMOUS_CODINGSTART_GAINEDSTART_LOSTSTOP_GAINEDSTOP_LOSTSYNONYMOUS_CODING
PA410460884835545407
SNP_PosDOWNSTREAMINTRONSPLICE_SITE_ACCEPTORUPSTREAMUTR_3_PRIMEUTR_5_PRIME
PA18738591193292747419121418734531299
INDEL_ClassCODON_CHANGE_PLUS_CODON_DELETIONCODON_CHANGE_PLUS_CODON_INSERTIONCODON_DELETIONCODON_INSERTIONFRAME_SHIFTFRAME_SHIFT+STOP_GAINEDFRAME_SHIFT+START_LOST
PA5832802043081363467
INDEL_PosDOWNSTREAMINTRONSPLICE_SITE_ACCEPTORSPLICE_SITE_DONORUPSTREAMUTR_3_PRIMEUTR_5_PRIME
PA4629122869068431186478933213524017

document location:summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx


High-Impact Effects      Moderate-Impact Effects            Low-Impact Effects
SPLICE_SITE_ACCEPTOR     NON_SYNONYMOUS_CODING              SYNONYMOUS_START
SPLICE_SITE_DONOR        CODON_CHANGE                       NON_SYNONYMOUS_START
START_LOST               CODON_INSERTION                    START_GAINED
EXON_DELETED             CODON_CHANGE_PLUS_CODON_INSERTION  SYNONYMOUS_CODING
FRAME_SHIFT              CODON_DELETION                     SYNONYMOUS_STOP
STOP_GAINED              CODON_CHANGE_PLUS_CODON_DELETION   NON_SYNONYMOUS_STOP
STOP_LOST                UTR_5_DELETED                      INTRON
FRAME_SHIFT+START_LOST   UTR_3_DELETED                      UPSTREAM
FRAME_SHIFT+STOP_GAINED                                     DOWNSTREAM
                                                            UTR_3_PRIME
                                                            UTR_5_PRIME




document location:summary/3_VariantData/PA/PA_SNV.png



Sample  all      genotype.Het  genotype.Hom  novel    in dbSNP  novel_proportion  dbSNP_proportion  Ts       Tv       novel.Ts  novel.Tv
PA      5960805  4353332       1607473       1182640  4778165   0.20              0.80              3970908  1996891  768994    414792

document location:summary/3_VariantData/VariantsType_SNP.xlsx
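Two summary figures often derived from a table like the one above are the novel proportion and the transition/transversion (Ts/Tv) ratio. The sketch below computes both from the PA counts. The counts are read from the table on the assumption that genotype.Het + genotype.Hom and novel + in dbSNP each sum to the total, and the variable names are illustrative.

```python
# PA SNP counts as read from the table above (column split is an assumption,
# checked by Het + Hom == all and novel + in_dbsnp == all).
snp = {
    "all": 5960805,
    "het": 4353332,
    "hom": 1607473,
    "novel": 1182640,
    "in_dbsnp": 4778165,
    "ts": 3970908,
    "tv": 1996891,
    "novel_ts": 768994,
    "novel_tv": 414792,
}

# Consistency checks on the column split.
assert snp["het"] + snp["hom"] == snp["all"]
assert snp["novel"] + snp["in_dbsnp"] == snp["all"]

novel_proportion = snp["novel"] / snp["all"]
ts_tv_ratio = snp["ts"] / snp["tv"]
novel_ts_tv_ratio = snp["novel_ts"] / snp["novel_tv"]

print(f"novel proportion: {novel_proportion:.2f}")
print(f"Ts/Tv (all SNPs): {ts_tv_ratio:.2f}")
print(f"Ts/Tv (novel SNPs): {novel_ts_tv_ratio:.2f}")
```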




document location:ssrc/summary/3_VariantData/PA/PA_SNP_VariantsType.png



5.2. VEP Annotation for SNP


All SNPs were annotated by VEP in VCF format: summary/3_VariantData/PA/PA_SNP.annotation.fixed.function.vcf


Table Description:

Term Description
CHROM chromosome id
POS chromosome position
ID dbSNP ID
REF reference allele
ALT alternative allele
QUAL quality
FILTER filter
INFO information
AD Allelic depths
DP Approximate read depth
GQ Genotype Quality
GT Genotype
PL Phred-scaled likelihoods
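The per-sample terms (GT, AD, DP, GQ, PL) are stored in a VCF record as a colon-separated FORMAT string paired with a colon-separated sample string. The sketch below shows one way to decode such a pair. The example values are invented, `parse_genotype` is a hypothetical helper, and the field order in the delivered VCF files is an assumption.

```python
def parse_genotype(format_field: str, sample_field: str) -> dict:
    """Decode one sample column of a VCF record using its FORMAT string."""
    raw = dict(zip(format_field.split(":"), sample_field.split(":")))
    parsed = {"GT": raw.get("GT")}
    # AD and PL are comma-separated integer lists; DP and GQ are single integers.
    for key in ("AD", "PL"):
        if raw.get(key) not in (None, "."):
            parsed[key] = [int(x) for x in raw[key].split(",")]
    for key in ("DP", "GQ"):
        if raw.get(key) not in (None, "."):
            parsed[key] = int(raw[key])
    return parsed


# Invented heterozygous call: 12 reads support REF, 9 reads support ALT.
example = parse_genotype("GT:AD:DP:GQ:PL", "0/1:12,9:21:99:255,0,380")
alt_fraction = example["AD"][1] / sum(example["AD"])
print(example)
print(f"ALT allele fraction: {alt_fraction:.2f}")
```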


5.3. InDel Statistics


SNP_ClassNON_SYNONYMOUS_CODINGSTART_GAINEDSTART_LOSTSTOP_GAINEDSTOP_LOSTSYNONYMOUS_CODING
PA410460884835545407
SNP_PosDOWNSTREAMINTRONSPLICE_SITE_ACCEPTORUPSTREAMUTR_3_PRIMEUTR_5_PRIME
PA18738591193292747419121418734531299
INDEL_ClassCODON_CHANGE_PLUS_CODON_DELETIONCODON_CHANGE_PLUS_CODON_INSERTIONCODON_DELETIONCODON_INSERTIONFRAME_SHIFTFRAME_SHIFT+STOP_GAINEDFRAME_SHIFT+START_LOST
PA5832802043081363467
INDEL_PosDOWNSTREAMINTRONSPLICE_SITE_ACCEPTORSPLICE_SITE_DONORUPSTREAMUTR_3_PRIMEUTR_5_PRIME
PA4629122869068431186478933213524017

document location:summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx







document location:src/summary/3_VariantData/PA/PA_INDEL.png




Sample  all      genotype.Het  genotype.Hom  novel   in dbSNP  novel_proportion  dbSNP_proportion
PA      1181622  877853        303769        330523  851099    0.28              0.72

document location:src/summary/3_VariantData/VariantsType_INDEL.xlsx







document location:ssrc/summary/3_VariantData/PA/PA_INDEL_VariantsType.png



5.4. VEP Annotation for InDel


All InDels were annotated by VEP in VCF format: summary/3_VariantData/PA/PA_INDEL.annotation.fixed.function.vcf


Term Description
CHROM chromosome id
POS chromosome position
ID dbSNP ID
REF reference allele
ALT alternative allele
QUAL quality
FILTER filter
INFO information
AD Allelic depths
DP Approximate read depth
GQ Genotype Quality
GT Genotype
PL Phred-scaled likelihoods



5.5. CNV Statistics

CNVs (copy number variations) across the genome were called with Control-FREEC.



Chr  Start     End       CopyNumber  Status
1    562000    572999    79          gain
1    664000    850999    5           gain
1    2581000   2631999   12          gain
1    5725000   5737999   12          gain
1    12900000  12951999  3           gain
1    12950000  13166999  1           loss
1    13165000  13220999  3           gain
1    13219000  13782999  1           loss
1    16832000  16844999  4           gain
1    16884000  16973999  7           gain
1    16972000  17013999  3           gain
1    17012000  17059999  6           gain
1    17058000  17089999  4           gain
1    17190000  17276999  4           gain
1    72765000  72812999  0           loss
1    85975000  86006999  4           gain



document location:summary/4_CNV/PA_proband/PA_proband_CNVs.xlsx
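A CNV list of this kind can be summarized by status with a short script. The sketch below uses four of the rows listed above, split into fields on the assumption that the columns are chromosome, start, end, copy number and status, and that coordinates are inclusive. The whitespace-separated text layout and the function name are illustrative, not the format of the delivered .xlsx file.

```python
from collections import defaultdict

# Four rows from the CNV list above, written one segment per line
# (assumed columns: chromosome, start, end, copy number, status).
cnv_text = """\
1 562000 572999 79 gain
1 664000 850999 5 gain
1 12950000 13166999 1 loss
1 72765000 72812999 0 loss
"""


def summarize_cnvs(text):
    """Count segments and total bases per status (gain or loss)."""
    totals = defaultdict(lambda: {"segments": 0, "bases": 0})
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, copy_number, status = line.split()
        totals[status]["segments"] += 1
        # Inclusive coordinates assumed, hence the +1.
        totals[status]["bases"] += int(end) - int(start) + 1
    return dict(totals)


summary = summarize_cnvs(cnv_text)
for status, t in summary.items():
    print(f"{status}: {t['segments']} segments, {t['bases']} bp")
```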








document location:summary/4_CNV/PA_proband/PA_proband.makeGraph_Chromosome_*.png


5.6. SV Statistics

SVs (structural variations) across the genome were called with Lumpy and reported in VCF format.


We divided SVs into 4 types: duplication (DUP), insertion (INS), deletion (DEL) and inversion (INV).



Term Description:


Term Description
CHROM chromosome id
POS chromosome position
ID variant ID
REF reference allele
ALT alternative allele
QUAL quality
FILTER filter
INFO information
FORMAT format

document location:summary/5_SV/PA_proband/PA_proband.sv.vcf
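Per-type SV counts can be tallied from a VCF file by reading the SVTYPE key in the INFO column, which is the key the VCF specification reserves for the structural variant type. The sketch below does this for a few invented records. The function name and example lines are hypothetical, and it counts VCF records only, which may differ from how breakend (BND) pairs are counted in the delivered statistics.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count VCF records per SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info = fields[7]  # INFO is the eighth VCF column
        for entry in info.split(";"):
            if entry.startswith("SVTYPE="):
                counts[entry.split("=", 1)[1]] += 1
                break
    return counts


# Invented records for illustration only.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\tsv1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=2000",
    "1\t5000\tsv2\tN\t<DUP>\t.\t.\tSVTYPE=DUP;END=9000",
    "2\t3000\tsv3\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=3500",
]

sv_counts = count_sv_types(example_lines)
print(dict(sv_counts))
```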






SampleDUPDELINVBND
PA_proband264242536152

Term Description:


Term Description
Sample Sample ID
DEL deletion
DUP duplication
INV Inversion
BND translocation

document location:summary/5_SV/PA_proband/PA_proband.SV_stat.xlsx




6. Variant Annotation

6.1. ClinVar

ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences, and reported according to the HGVS standard. ClinVar then presents the data for interactive users as well as those wishing to use ClinVar in daily workflows and other local applications. ClinVar works in collaboration with interested organizations to meet the needs of the medical genetics community as efficiently and effectively as possible.


document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.ClinVar.xlsx

6.2. dbNSFP

dbNSFP (http://varianttools.sourceforge.net/Annotation/DbNSFP) is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. This database significantly facilitates the process of querying predictions and annotations from different databases/web-servers for large amounts of nsSNVs discovered in exome-sequencing studies.

document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.dbNSFP.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.dbNSFP.xlsx


6.3. DisGeNET

DisGeNET (http://www.disgenet.org/home/) is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation.

DisGeNET has since been updated to v6.0, described as follows: "The current version contains 628,685 gene-disease associations (GDAs), between 17,549 genes and 24,166 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 210,498 variant-disease associations (VDAs), between 117,337 variants and 10,358 diseases, traits, and phenotypes".

document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.DisGeNET.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.DisGeNET.xlsx


6.4. OMIM

OMIM (http://www.omim.org/) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 15,000 genes. OMIM focuses on the relationship between phenotype and genotype, and its entries contain copious links to other genetics resources.

document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.OMIM.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.OMIM.xlsx


6.5. PheGenI

PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) is a tool that integrates the search and retrieval of genotype-phenotype association data from the National Human Genome Research Institute (NHGRI) Genome-wide Association Study (GWAS) Catalog with data housed in Gene, dbGaP, OMIM, GTEx and dbSNP. It supports searching by genotype and by phenotype.

document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.PheGenI.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.PheGenI.xlsx



6.6. VEP


document location:summary/6_VariantMultiAnno/*/SNP/*_SNP.VEP.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.VEP.xlsx







7. Quality control


The original fluorescence images obtained from high throughput sequencing platforms are transformed to short reads by base calling. These short reads (Raw data) are recorded in FASTQ format, which contains sequence information (reads) and corresponding sequencing quality information.
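To make the FASTQ layout concrete, the sketch below reads four-line records (header, sequence, separator, quality string) and decodes the quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina output. The record shown is invented, `read_fastq` is a hypothetical helper that loads all lines into memory, and the pipeline itself relies on fastp rather than on code like this.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list


def read_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, four lines per read (Phred+33 assumed)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Invented record: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
example = ["@read1", "GATTACA", "+", "IIIII55"]
records = list(read_fastq(example))
print(records[0])
```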






Sequencing error rate and base quality can be affected by various factors such as the sequencing platform, chemical reagents and sample quality. The first several bases show higher error rates, caused by a less sensitive fluorescence image signal at the beginning of sequencing. The error rate also tends to increase as the read extends, due to the consumption of chemical reagents. These two features are common to Illumina sequencing platforms.


GC content distribution evaluation aims to check for potential AT-GC separation, which may result from sample contamination, sequencing bias or errors in library preparation. In theory, GC or AT content should be constant across read positions. In practice, it is common for the first 6 to 7 bases of both reads (read1 and read2) to fluctuate in GC content, due to primer amplification bias and other factors.
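The per-position GC profile described here can be approximated with a short calculation. The sketch below computes the GC fraction at each read position for a handful of invented reads. The function name is hypothetical, and the plots in this report come from fastp, not from this code.

```python
def gc_by_position(reads):
    """Return the GC fraction at each read position across a set of reads."""
    if not reads:
        return []
    max_len = max(len(read) for read in reads)
    fractions = []
    for i in range(max_len):
        # Only reads long enough to cover position i contribute to it.
        bases = [read[i] for read in reads if i < len(read)]
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases))
    return fractions


# Invented reads for illustration only.
toy_reads = ["GATC", "GCTA", "AATC"]
profile = gc_by_position(toy_reads)
print([round(x, 2) for x in profile])
```

A flat profile across positions is the expected pattern. Large swings at particular positions are the kind of signal the GC content check is meant to reveal.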






document location:src/summary/8_Quality_Control/1_fastp





8. Reference


1. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 2009, 25(14): 1754-1760.

2. Kent W J, Sugnet C W, Furey T S, et al. The human genome browser at UCSC. Genome research, 2002, 12(6): 996-1006.

3. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009, 25(16): 2078-2079.

4. Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research, 2001, 29(1): 308-311.

5. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 2010, 38(16): e164-e164.

6. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 2012, 491(7422): 56-65.





9. Contact us

2575 West Bellfort Street

Suite 270

Houston, TX

77054 USA


Local (713) 664-7087

Toll Free: 1-888-528-8818

Fax: (713) 664-8181

http://www.lcsciences.com/