#############################################################################
# README for GoNL re-analysis from scratch with GRCh38 as reference genome.
#############################################################################

This README describes the data available for the re-analysis of GoNL
with GRCh38 as reference genome.

Versions:
  1.0: Initial release.
       Note: these variant calls are not nearly as polished as the ones
       previously released based on analysis with GRCh37 as reference genome.
       Most importantly, with regard to filtering and QC: VQSR has not yet
       been applied to this new callset and missing genotype calls were not
       imputed.

#############################################################################
# Files
#############################################################################

 * multisample.parents_only.info_only.vcf.gz
     Bgzip-compressed VCF file containing INFO fields with the summary counts
     of variants seen in the unrelated individuals: only the parents
     = 250 fathers + 248 mothers.
     Note: just as in the original GoNL data set based on GRCh37 as
     reference, 2 mothers failed QC and these are excluded from the results.
 * multisample.parents_only.info_only.vcf.gz.tbi
     The Tabix index for the corresponding bgzip-compressed VCF.

#############################################################################
# Reference genome
#############################################################################

We have used the GRCh38_no_alt_plus_hs38d1_analysis_set supplemented with
PhiX174 as decoy sequence.
The FastA file that was used is in the ./reference subdirectory.
Reference sequences were downloaded from the NCBI FTP server:

  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz

#############################################################################
# Pipeline
#############################################################################

We have used a pipeline with the following steps:

 For each library:
  - cutadapt 1.13
  - bwa mem 0.7.15-r1140
  - Picard SortSam 2.9.0
  - GATK BaseRecalibrator 3.7

 For each sample:
  - sambamba merge 0.6.6
  - sambamba markdup 0.6.6
  - GATK HaplotypeCaller 3.7

 For each family:
  - GATK CombineGVCFs 3.7

 For all families at once:
  - GATK CombineGVCFs 3.7
  - GATK GenotypeGVCFs 3.7
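
#############################################################################
# Example: reading the summary counts
#############################################################################

The released VCF contains only INFO fields, so its summary counts can be read
with a few lines of standard-library Python. The sketch below is an
illustration only and is not part of the GoNL pipeline. The function names
parse_info and iter_info are hypothetical. It relies on two general facts:
bgzip output is readable as ordinary gzip, and the first 8 VCF columns are
tab-separated with INFO as the 8th. Which INFO keys are present (for example
AC and AN) is an assumption; check the header of the file itself.

```python
import gzip


def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=3;AN=996;DB' into a dict.

    Key=value entries map to a list of strings (values split on ',');
    flag entries without '=' map to True. A lone '.' means no INFO.
    """
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value.split(",") if sep else True
    return info


def iter_info(vcf_path):
    """Yield (chrom, pos, ref, alts, info) for each record in a gzipped VCF."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header line
            fields = line.rstrip("\n").split("\t")
            chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
            yield chrom, int(pos), ref, alt.split(","), parse_info(info)
```

Calling iter_info("multisample.parents_only.info_only.vcf.gz") yields one
tuple per variant. Values are kept as strings because their types are
declared in the VCF header, which this sketch does not interpret. It also
streams the whole file and does not use the .tbi index; region queries need a
Tabix-aware tool.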