LC Sciences Human WGS Report

Customer Name	DEMO
Customer Institution	DEMO
Customer Email	demo
Project ID	DEMO

Seller Name	DEMO
Seller Email	demo@lcsciences.com

Your reliable partner in genomics, transcriptomics and proteomics

LC Sciences

1. Introduction

Whole genome re-sequencing (WGS) determines the complete DNA sequence of an individual from a species with an available reference genome, which helps us better understand variation within and between individuals or populations. By aligning the reads to the reference genome, we can detect many kinds of variants in an individual's sequencing data, including SNPs, InDels, SVs and CNVs. These variants can be applied to clinical research, population genetics, association studies, evolutionary biology and many other related fields.

2. Project Information

2.1. Sample information

Species name: Human

Latin name: Homo sapiens

2.2. Disease information

Disease: DEMO

2.3. Database

Database	Build	Download link
Genome	hg19	ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

2.4. Bioinformatics software

Analysis item	Software	Version/date
Quality control	fastp	0.19.4
Alignment	BWA	0.7.12
	SAMtools	1.9
	Picard	2.17.3
SNP/INDEL calling	GATK	4.0.4.0
SNP/INDEL annotation	VEP

3. Materials and Methods

3.1. DNA extraction and whole genome sequencing

Total DNA was isolated from whole blood collected in EDTA tubes, or from tissue samples, using a standard DNA extraction protocol. DNA purity was assessed from the A260/280 ratio measured on a spectrophotometer. Samples with an A260/280 ratio between 1.8 and 2.0 were considered suitable for library preparation.

Genomic DNA (gDNA, 1 µg) was sheared with a Covaris Series sonicator to a target peak size of 200 to 300 bp. After fragmentation, size selection was performed with AMPure XP beads, and blunt-end DNA fragments were generated by a combination of fill-in reactions and exonuclease activity. An A base was then added to the blunt ends of each strand to prepare them for ligation to the T-tailed sequencing adapters. The ligation products were amplified with 5 cycles of PCR to generate the gDNA library. Finally, 150 bp paired-end sequencing was performed on an Illumina NovaSeq 6000 following the vendor's recommended protocol.

Workflow of WGS

3.2. Bioinformatics analysis

Sequencing generated a total of DEMO paired-end reads of 150 bp, yielding DEMO G of sequence. Prior to alignment, low-quality data were removed: (1) reads containing sequencing adaptors; (2) nucleotides with a quality score (Q) lower than 20. After filtering, a total of DEMO G bp of clean paired-end reads remained.
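The Q20 threshold refers to Phred quality scores, where a score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1% chance that the base call is wrong. The following is a minimal sketch of that relation, assuming Phred+33 encoded FASTQ quality strings; the function names are illustrative and are not part of fastp.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_least(quality_string: str, threshold: int = 20) -> float:
    """Fraction of bases in one read whose quality is >= threshold."""
    scores = decode_qualities(quality_string)
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)
```

Applied to every read, fraction_at_least with thresholds 20 and 30 gives per-read values of the kind summarised as Q20% and Q30% in the sequencing quality table.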

For the alignment step, BWA was used to align the reads in the paired FASTQ files to the reference genome. As the first post-alignment processing step, Picard tools were used to identify and mark duplicate reads in the BAM file.
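The idea behind duplicate marking can be sketched as follows. This is a simplified illustration and not Picard's actual implementation: it assumes that reads sharing chromosome, 5' position, mate position and strand come from the same original fragment, and it keeps the read with the highest summed base quality as the representative. The Alignment class and its fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alignment:
    name: str
    chrom: str
    pos: int               # 5' alignment position of this read
    mate_pos: int          # 5' alignment position of its mate
    reverse: bool          # True if the read maps to the reverse strand
    base_quality_sum: int  # sum of base qualities, used to pick the best copy
    is_duplicate: bool = False


def mark_duplicates(alignments: list[Alignment]) -> None:
    """Flag all but the best-quality read in each group of identical placements."""
    groups = defaultdict(list)
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.mate_pos, aln.reverse)
        groups[key].append(aln)
    for group in groups.values():
        best = max(group, key=lambda a: a.base_quality_sum)
        for aln in group:
            aln.is_duplicate = aln is not best
```

Duplicates are flagged rather than deleted, so downstream tools can decide whether to ignore them.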

In the second post-alignment processing step, local read realignment is performed to correct for potential alignment errors around indels. Mapping of reads around the edges of indels often results in misaligned bases creating false positive SNP calls. Local realignment uses these mismatching bases to determine if a site should be realigned, and applies a computationally intensive algorithm to determine the most consistent placement of the reads with respect to the indel and remove misalignment artifacts.

Each base of each read has an associated quality score, corresponding to the probability of a sequencing error. Because of systematic biases, the reported quality scores are known to be inaccurate and must therefore be recalibrated prior to genotyping. After recalibration, the quality scores in the output BAM correspond more closely to the probability of a sequencing error.
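The core of recalibration can be illustrated by comparing reported qualities with observed mismatch rates. The sketch below is a deliberately reduced version of that idea: it bins bases only by their reported quality, whereas a real recalibration also conditions on covariates such as read group, machine cycle and sequence context, and excludes known variant sites. The function name and input layout are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirical Phred quality.

    observations: iterable of (reported_quality, is_mismatch) pairs, one per base
    at sites that are assumed not to be true variants.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_mismatch in observations:
        totals[reported_q] += 1
        errors[reported_q] += int(is_mismatch)
    table = {}
    for q, n in totals.items():
        # +1 / +2 smoothing avoids log(0) when a bin contains no mismatches.
        error_rate = (errors[q] + 1) / (n + 2)
        table[q] = round(-10 * math.log10(error_rate))
    return table
```

For example, if bases reported as Q30 mismatch the reference about 1% of the time, the table maps 30 to roughly 20, and the recalibrated BAM would carry the lower score.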

Variant calls can be generated with GATK HaplotypeCaller or UnifiedGenotyper, which examine the evidence for variation from the reference via Bayesian inference.
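As an illustration of Bayesian inference at a single biallelic site, the sketch below combines a genotype prior with the likelihood of the observed bases given each diploid genotype. It is a textbook-style simplification, not the GATK model: the priors are illustrative values, reads are treated as independent, and HaplotypeCaller additionally performs local re-assembly, which is not shown.

```python
import math


def genotype_posteriors(bases, error_probs, ref, alt, het_prior=1e-3):
    """Posterior probabilities of ref/ref, ref/alt and alt/alt at one site.

    bases: observed base calls; error_probs: per-base error probability (> 0).
    """
    priors = {
        "ref/ref": 1 - 1.5 * het_prior,
        "ref/alt": het_prior,
        "alt/alt": het_prior / 2,
    }
    # Expected fraction of reads carrying the alt allele under each genotype.
    alt_fraction = {"ref/ref": 0.0, "ref/alt": 0.5, "alt/alt": 1.0}
    log_post = {}
    for gt, prior in priors.items():
        f = alt_fraction[gt]
        log_lik = 0.0
        for base, e in zip(bases, error_probs):
            p_if_ref = 1 - e if base == ref else e / 3
            p_if_alt = 1 - e if base == alt else e / 3
            log_lik += math.log((1 - f) * p_if_ref + f * p_if_alt)
        log_post[gt] = math.log(prior) + log_lik
    # Normalise in log space for numerical stability.
    peak = max(log_post.values())
    unnorm = {gt: math.exp(lp - peak) for gt, lp in log_post.items()}
    total = sum(unnorm.values())
    return {gt: v / total for gt, v in unnorm.items()}
```

With five reference and five alternative bases at Q20, the heterozygous genotype receives almost all of the posterior mass even though its prior is small.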

A Gaussian mixture model is fitted to assign an accurate confidence score to each putative variant call and to evaluate new potential variants.

Biological functional annotation is a crucial step in finding the links between genetic variation and disease. SnpEff is utilized to add biological information to a set of variants.

Bioinformatics pipeline for WGS

4. Overview

4.1. Sample information

Family	Sample	Patient/Normal	gender
PA	PA_proband	noset	noset
PA	PA_sister	noset	noset
PA	PA_brother	noset	noset

document location:summary/1_RawData/sample_info_mendelian.xlsx

4.2. Statistics of Sequencing Quality

Sample	Raw Reads (M)	Raw Bases (G)	Valid Reads (M)	Valid Bases (G)	Valid%	Q20%	Q30%	GC%
PA_brother	816.06	122.41	796.07	113.87	97.55	96.21	90.07	40.68
PA_proband	821.90	123.29	795.56	114.01	96.79	96.18	90.02	40.87
PA_sister	773.91	116.09	750.52	107.60	96.98	95.83	89.35	41.26

document location:summary/1_RawData/ReadsQC.xlsx

4.3. Sequencing depth

Fig:Reads Depth Coverage

document location:summary/2_MappedData/ReadsDepthCoverage.png

Depth of coverage on each chromosome

Depth of coverage = total covered length / total length of each chromosome

document location:summary/2_MappedData/DepthCoverageByChr.png
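The coverage ratio above, and the depth thresholds reported in the mapping statistics, can be sketched from a list of per-base read depths. The input format (one integer depth per reference position) is an assumption for this example; the report does not state which tool produced the figures.

```python
def coverage_fraction(depths, min_depth=1):
    """Fraction of positions whose read depth is >= min_depth."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)


def mean_depth(depths):
    """Average read depth over all positions, including uncovered ones."""
    if not depths:
        return 0.0
    return sum(depths) / len(depths)


# Toy example: ten positions of one chromosome.
toy_depths = [0, 5, 12, 25, 31, 0, 18, 22, 9, 40]
toy_summary = {n: coverage_fraction(toy_depths, n) for n in (1, 10, 20, 30)}
```

Here toy_summary[1] is the covered fraction of the chromosome, and the other keys correspond to the 10x, 20x and 30x thresholds.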

4.4. Statistics of mapped reads

Term	PA_brother	PA_proband	PA_sister
Total Reads	816055810	821903186	773906286
Valid Reads	796067230	795555384	750518372
Mapped Reads	794845447	794233789	749275898
Duplicate Reads	149115353	167447724	137788372
Valid Rate %	97.55	96.79	96.98
Mapped Rate %	99.85	99.83	99.83
Duplicate Rate %	18.76	21.08	18.39
MEAN_READ_LENGTH	143.04	143.31	143.37
MEAN_DEPTH	20.18	20.11	18.50
PCT_TARGET_BASES_1X	98.29	97.80	97.79
PCT_TARGET_BASES_10X	93.05	94.05	91.24
PCT_TARGET_BASES_20X	53.69	52.37	40.00
PCT_TARGET_BASES_30X	8.81	8.61	7.45

document location:summary/2_MappedData/MappedStatistics.xlsx

Table Description：

Term	Description
Total Reads	Number of total reads
Valid Reads	Number of valid reads
Mapped Reads	Number of reads mapped to the reference genome
Duplicate Reads	Number of duplicate reads
Valid Rate %	Valid Reads / Total Reads
Mapped Rate %	Mapped Reads / Valid Reads
Duplicate Rate %	Duplicate Reads / Mapped Reads
MEAN_READ_LENGTH	Average length of mapped reads
MEAN_DEPTH	Mean depth
PCT_TARGET_BASES_1X	Percentage of bases with depth >= 1x
PCT_TARGET_BASES_2X	Percentage of bases with depth >= 2x
PCT_TARGET_BASES_10X	Percentage of bases with depth >= 10x
PCT_TARGET_BASES_20X	Percentage of bases with depth >= 20x
PCT_TARGET_BASES_30X	Percentage of bases with depth >= 30x
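The three rate definitions can be checked against the read counts in the mapped-read table. The short sketch below recomputes them for sample PA_brother; the function name is illustrative.

```python
def mapping_rates(total_reads, valid_reads, mapped_reads, duplicate_reads):
    """Recompute the percentage columns from the raw read counts."""
    return {
        "Valid Rate %": round(100 * valid_reads / total_reads, 2),
        "Mapped Rate %": round(100 * mapped_reads / valid_reads, 2),
        "Duplicate Rate %": round(100 * duplicate_reads / mapped_reads, 2),
    }


# Read counts for PA_brother taken from the table in section 4.4.
pa_brother = mapping_rates(816055810, 796067230, 794845447, 149115353)
```

The result reproduces the reported values of 97.55, 99.85 and 18.76.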

Fig: Depth of coverage on each sample

5. Variant calling

A single-nucleotide polymorphism, often abbreviated to SNP, is a variation in a single nucleotide that occurs at a specific position in the genome including transition and transversion, where each variation is present to some appreciable degree within a population (e.g. > 1%).
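Transitions exchange a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), while transversions exchange a purine for a pyrimidine or the reverse. A minimal sketch of that classification, with an illustrative function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    same_chemical_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_chemical_class else "transversion"
```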

Indel is a molecular biology term for an insertion or deletion of bases in the genome of an organism. It is classified among small genetic variations, measuring from 1 to 10 000 base pairs in length, including insertion and deletion events that may be separated by many years, and may not be related to each other in any way. A microindel is defined as an Indel that results in a net change of 1 to 50 nucleotides.

Indels can also be used as genetic markers in natural populations, especially in phylogenetic studies. Genomic regions with multiple Indels have been shown to be useful for species-identification procedures.

A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes.

A single-nucleotide variant (SNV) is a variation in a single nucleotide without any limitations of frequency and may arise in somatic cells. A somatic single nucleotide variation (e.g., caused by cancer) may also be called a single-nucleotide alteration.

5.1. SNP Statistics

SNP_Class	NON_SYNONYMOUS_CODING	START_GAINED	START_LOST	STOP_GAINED	STOP_LOST	SYNONYMOUS_CODING
PA	41046	0	88	483	55	45407

SNP_Pos	DOWNSTREAM	INTRON	SPLICE_SITE_ACCEPTOR	UPSTREAM	UTR_3_PRIME	UTR_5_PRIME
PA	1873859	11932927	474	1912141	87345	31299

INDEL_Class	CODON_CHANGE_PLUS_CODON_DELETION	CODON_CHANGE_PLUS_CODON_INSERTION	CODON_DELETION	CODON_INSERTION	FRAME_SHIFT	FRAME_SHIFT+STOP_GAINED	FRAME_SHIFT+START_LOST
PA	583	280	204	308	1363	46	7

INDEL_Pos	DOWNSTREAM	INTRON	SPLICE_SITE_ACCEPTOR	SPLICE_SITE_DONOR	UPSTREAM	UTR_3_PRIME	UTR_5_PRIME
PA	462912	2869068	431	186	478933	21352	4017

document location:summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx

High-Impact Effects	Moderate-Impact Effects	Low-Impact Effects
SPLICE_SITE_ACCEPTOR	NON_SYNONYMOUS_CODING	SYNONYMOUS_START
SPLICE_SITE_DONOR	CODON_CHANGE	NON_SYNONYMOUS_START
START_LOST	CODON_INSERTION	START_GAINED
EXON_DELETED	CODON_CHANGE_PLUS_CODON_INSERTION	SYNONYMOUS_CODING
FRAME_SHIFT	CODON_DELETION	SYNONYMOUS_STOP
STOP_GAINED	CODON_CHANGE_PLUS_CODON_DELETION	NON_SYNONYMOUS_STOP
STOP_LOST	UTR_5_DELETED	INTRON
FRAME_SHIFT+START_LOST	UTR_3_DELETED	UPSTREAM
FRAME_SHIFT+STOP_GAINED		DOWNSTREAM
		UTR_3_PRIME
		UTR_5_PRIME

document location:summary/3_VariantData/PA/PA_SNV.png

Sample	all	genotype.Het	genotype.Hom	novel	in dbSNP	novel_proportion	dbSNP_proportion	Ts	Tv	novel.Ts	novel.Tv
PA	5960805	4353332	1607473	1182640	4778165	0.20	0.80	3970908	1996891	768994	414792

document location:summary/3_VariantData/VariantsType_SNP.xlsx
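A few summary ratios can be derived from the SNP counts in the table above, which is a quick sanity check on a call set. The sketch below uses the counts reported for family PA; the function name is illustrative. A transition/transversion ratio close to 2 is commonly reported for human whole-genome data, although the exact value depends on the filtering applied.

```python
def snp_summary(total, het, hom, novel, ts, tv):
    """Derive summary ratios from SNP counts."""
    return {
        "het_hom_ratio": round(het / hom, 2),
        "novel_proportion": round(novel / total, 2),
        "ts_tv_ratio": round(ts / tv, 2),
    }


# Counts for PA taken from the SNP table in section 5.1.
pa = snp_summary(
    total=5960805, het=4353332, hom=1607473,
    novel=1182640, ts=3970908, tv=1996891,
)
```

The novel proportion of 0.20 matches the novel_proportion column of the table.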

document location:ssrc/summary/3_VariantData/PA/PA_SNP_VariantsType.png

5.2. VEP Annotation for SNP

All SNPs were annotated by VEP in VCF format: summary/3_VariantData/PA/PA_SNP.annotation.fixed.function.vcf

Table Description：

Term	Description
CHROM	chromosome id
POS	chromosome position
ID	dbSNP ID
REF	reference allele
ALT	alternative allele
QUAL	quality
FILTER	filter
INFO	information
AD	Allelic depths
DP	Approximate read depth
GQ	Genotype Quality
GT	genotype
PL	Phred-scaled likelihoods
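The fields listed above can be read from a VCF data line as sketched below. The record used in the usage note is hypothetical and only illustrates the layout; real files from this project may carry additional INFO annotations and FORMAT keys.

```python
def parse_vcf_record(line, sample_names):
    """Split one tab-separated VCF data line into a dictionary.

    The per-sample columns are decoded with the keys given in the FORMAT column
    (for example GT:AD:DP:GQ:PL).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),  # several alternative alleles are comma-separated
        "QUAL": qual,
        "FILTER": filt,
        "INFO": info,
        "samples": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, value in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, value.split(":")))
    return record
```

For a hypothetical line such as "1	12345	.	A	G	512.77	PASS	DP=35	GT:AD:DP:GQ:PL	0/1:18,17:35:99:541,0,560", the sample entry has GT "0/1" (heterozygous) and AD "18,17" (18 reference and 17 alternative reads).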

5.3. InDel Statistics

SNP_Class	NON_SYNONYMOUS_CODING	START_GAINED	START_LOST	STOP_GAINED	STOP_LOST	SYNONYMOUS_CODING
PA	41046	0	88	483	55	45407

SNP_Pos	DOWNSTREAM	INTRON	SPLICE_SITE_ACCEPTOR	UPSTREAM	UTR_3_PRIME	UTR_5_PRIME
PA	1873859	11932927	474	1912141	87345	31299

INDEL_Class	CODON_CHANGE_PLUS_CODON_DELETION	CODON_CHANGE_PLUS_CODON_INSERTION	CODON_DELETION	CODON_INSERTION	FRAME_SHIFT	FRAME_SHIFT+STOP_GAINED	FRAME_SHIFT+START_LOST
PA	583	280	204	308	1363	46	7

INDEL_Pos	DOWNSTREAM	INTRON	SPLICE_SITE_ACCEPTOR	SPLICE_SITE_DONOR	UPSTREAM	UTR_3_PRIME	UTR_5_PRIME
PA	462912	2869068	431	186	478933	21352	4017

document location:summary/3_VariantData/SNP_INDEL_PositionType_VariantsType.xlsx

document location：src/summary/3_VariantData/PA/PA_INDEL.png

Sample	all	genotype.Het	genotype.Hom	novel	in dbSNP	novel_proportion	dbSNP_proportion
PA	1181622	877853	303769	330523	851099	0.28	0.72

document location：src/summary/3_VariantData/VariantsType_INDEL.xlsx

document location：ssrc/summary/3_VariantData/PA/PA_INDEL_VariantsType.png

5.4. VEP Annotation for InDel

All InDels were annotated by VEP in VCF format: summary/3_VariantData/PA/PA_INDEL.annotation.fixed.function.vcf

Term	Description
CHROM	chromosome id
POS	chromosome position
ID	dbSNP ID
REF	reference allele
ALT	alternative allele
QUAL	quality
FILTER	filter
INFO	information
AD	Allelic depths
DP	Approximate read depth
GQ	Genotype Quality
GT	Genotype
PL	Phred-scaled likelihoods

5.5. CNV Statistics

Copy number variations (CNVs) on the genome were called with Control-FREEC.

Chr	Start	End	Copy number	Status
1	562000	572999	79	gain
1	664000	850999	5	gain
1	2581000	2631999	12	gain
1	5725000	5737999	12	gain
1	12900000	12951999	3	gain
1	12950000	13166999	1	loss
1	13165000	13220999	3	gain
1	13219000	13782999	1	loss
1	16832000	16844999	4	gain
1	16884000	16973999	7	gain
1	16972000	17013999	3	gain
1	17012000	17059999	6	gain
1	17058000	17089999	4	gain
1	17190000	17276999	4	gain
1	72765000	72812999	0	loss
1	85975000	86006999	4	gain

document location：summary/4_CNV/PA_proband/PA_proband_CNVs.xlsx

document location：summary/4_CNV/PA_proband/PA_proband.makeGraph_Chromosome_*.png
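The CNV rows can be summarised per status as sketched below. The column meaning (chromosome, start, end, predicted copy number, gain or loss) is assumed here because the table is shown without a header, and the end coordinate is treated as inclusive. The three example rows are copied from the table above.

```python
from collections import Counter


def summarise_cnvs(rows):
    """Count CNV segments and total affected length for each status."""
    counts = Counter()
    span = Counter()
    for line in rows:
        chrom, start, end, copy_number, status = line.split("\t")
        counts[status] += 1
        span[status] += int(end) - int(start) + 1  # inclusive end assumed
    return counts, span


rows = [
    "1\t562000\t572999\t79\tgain",
    "1\t12950000\t13166999\t1\tloss",
    "1\t72765000\t72812999\t0\tloss",
]
cnv_counts, cnv_span = summarise_cnvs(rows)
```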

5.6. SV Statistics

Structural variations (SVs) on the genome were called with LUMPY and reported in VCF format.

SVs were divided into 4 types: duplication (DUP), deletion (DEL), inversion (INV) and breakend (BND).

Term Description:

Term	Description
CHROM	chromosome id
POS	chromosome position
ID	variant ID
REF	reference allele
ALT	alternative allele
QUAL	quality
FILTER	filter
INFO	information
FORMAT	format

document location：summary/5_SV/PA_proband/PA_proband.sv.vcf

Sample	DUP	DEL	INV	BND
PA_proband	264	2425	3	6152

Term Description:

Term	Description
Sample	Sample ID
DEL	deletion
DUP	duplication
INV	Inversion
BND	breakend (e.g. translocation)

document location：summary/5_SV/PA_proband/PA_proband.SV_stat.xlsx
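A per-type count like the one in the table above can be obtained from the SVTYPE tag in the INFO column of the SV VCF. The sketch below counts VCF lines. How paired BND records were counted for this report is not stated, so the numbers it returns may differ from the table for that type.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Count records per SVTYPE value found in the INFO column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        info = line.rstrip("\n").split("\t")[7]
        tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        counts[tags.get("SVTYPE", "UNKNOWN")] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for the SV VCF can be passed directly.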

6. Variant Annotation

6.1. ClinVar

ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. ClinVar processes submissions reporting variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data. The alleles described in submissions are mapped to reference sequences and reported according to the HGVS standard. ClinVar then presents the data for interactive users as well as those wishing to use ClinVar in daily workflows and other local applications. ClinVar works in collaboration with interested organizations to meet the needs of the medical genetics community as efficiently and effectively as possible.

document location：summary/6_VariantMultiAnno/*/SNP/*_SNP.ClinVar.xlsx

6.2. dbNSFP

dbNSFP (http://varianttools.sourceforge.net/Annotation/DbNSFP) is a database developed for functional prediction and annotation of all potential non-synonymous single-nucleotide variants (nsSNVs) in the human genome. This database significantly facilitates the process of querying predictions and annotations from different databases/web-servers for large amounts of nsSNVs discovered in exome-sequencing studies.

document location: summary/6_VariantMultiAnno/*/SNP/*_SNP.dbNSFP.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.dbNSFP.xlsx

6.3. DisGeNET

DisGeNET (http://www.disgenet.org/home/) is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation.

DisGeNET has since been updated to v6.0, which is described as follows: "The current version contains 628,685 gene-disease associations (GDAs), between 17,549 genes and 24,166 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 210,498 variant-disease associations (VDAs), between 117,337 variants and 10,358 diseases, traits, and phenotypes".

document location: summary/6_VariantMultiAnno/*/SNP/*_SNP.DisGeNET.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.DisGeNET.xlsx

6.4. OMIM

OMIM (http://www.omim.org/) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 15,000 genes. OMIM focuses on the relationship between phenotype and genotype, and its entries contain copious links to other genetics resources.

document location: summary/6_VariantMultiAnno/*/SNP/*_SNP.OMIM.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.OMIM.xlsx

6.5. PheGenI

PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) is a tool that integrates the search and retrieval of genotype-phenotype association data from the National Human Genome Research Institute (NHGRI) Genome-wide Association Study (GWAS) Catalog with data housed in Gene, dbGaP, OMIM, GTEx and dbSNP. It supports searching by genotype and by phenotype.

document location: summary/6_VariantMultiAnno/*/SNP/*_SNP.PheGenI.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.PheGenI.xlsx

6.6. VEP

document location: summary/6_VariantMultiAnno/*/SNP/*_SNP.VEP.xlsx
document location: summary/6_VariantMultiAnno/*/INDEL/*_INDEL.VEP.xlsx

7. Quality control

The original fluorescence images obtained from high throughput sequencing platforms are transformed to short reads by base calling. These short reads (Raw data) are recorded in FASTQ format, which contains sequence information (reads) and corresponding sequencing quality information.

Sequencing error rate and base quality can be affected by various factors such as the sequencing platform, chemical reagents and sample quality. The first several bases show higher error rates, caused by a less sensitive fluorescence image signal at the beginning of sequencing. The error rate also increases along the read because chemical reagents are consumed. Both features are common on Illumina sequencing platforms.

GC content distribution is evaluated to check for potential AT-GC separation, which may result from sample contamination, sequencing bias or errors in library preparation. In theory, GC and AT content should be constant across read positions. In practice, the first 6 to 7 bases of both reads (read1 and read2) commonly fluctuate in GC content because of primer amplification bias and other factors.
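The per-position GC check can be sketched as follows. This is an illustrative calculation on a list of read sequences, not the fastp implementation. Ambiguous N bases are skipped so they do not dilute the GC fraction.

```python
def gc_content_by_position(reads):
    """GC fraction at each read position across a collection of reads."""
    if not reads:
        return []
    length = max(len(read) for read in reads)
    gc = [0] * length
    total = [0] * length
    for read in reads:
        for i, base in enumerate(read.upper()):
            if base == "N":
                continue  # ambiguous calls carry no GC information
            total[i] += 1
            if base in "GC":
                gc[i] += 1
    return [g / t if t else 0.0 for g, t in zip(gc, total)]
```

A flat profile is the expected outcome; strong fluctuation beyond the first few positions would point to the contamination or library problems described above.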

document location：src/summary/8_Quality_Control/1_fastp

8. References

1. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 2009, 25(14): 1754-1760.

2. Kent W J, Sugnet C W, Furey T S, et al. The human genome browser at UCSC. Genome research, 2002, 12(6): 996-1006.

3. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009, 25(16): 2078-2079.

4. Sherry S T, Ward M H, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research, 2001, 29(1): 308-311.

5. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic acids research, 2010, 38(16): e164-e164.

6. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 2012, 491(7422): 56-65.

9. Contact us

2575 West Bellfort Street

Suite 270

Houston, TX

77054 USA

Local (713) 664-7087

Toll Free: 1-888-528-8818

Fax: (713) 664-8181

http://www.lcsciences.com/