#############################################################################
# README for GoNL re-analysis from scratch with GRCh38 as reference genome.
#############################################################################

This README describes the data available for the re-analysis of GoNL with
GRCh38 as reference genome. Versions:

 1.0: Initial release. Note that these variant calls are not nearly as polished
      as the ones previously released based on analysis with GRCh37 as
      reference genome. Most importantly, with regard to filtering and QC:
      VQSR has not yet been applied to this new callset and missing genotype
      calls have not been imputed.
      
#############################################################################
# Files
#############################################################################

 *  multisample.parents_only.info_only.vcf.gz
    Bgzip-compressed VCF file containing INFO fields with the summary counts
    of variants seen in the unrelated individuals, i.e. only the parents:
    250 fathers + 248 mothers.
    Note: just as in the original GoNL data set based on GRCh37 as reference,
    2 mothers failed QC and these are excluded from the results.
 *  multisample.parents_only.info_only.vcf.gz.tbi
    The Tabix index for the corresponding bgzip compressed VCF.
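
Because the VCF carries only site-level INFO fields and no genotype columns,
it can be read with a few lines of code. The following is a minimal sketch in
Python, not part of the GoNL release: the function names parse_info,
iter_records and open_vcf are hypothetical, and the INFO keys actually present
in the file should be checked against its header rather than assumed.

```python
import gzip


def parse_info(info):
    """Turn a VCF INFO column into a dict; keys without a value map to True."""
    fields = {}
    if info == ".":  # "." marks an empty INFO column
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def iter_records(lines):
    """Yield (chrom, pos, ref, alt, info_dict) for each data line of a VCF."""
    for line in lines:
        # Skip blank lines, meta-information lines and the column header.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[0], int(cols[1]), cols[3], cols[4], parse_info(cols[7])


def open_vcf(path):
    """Open a bgzip-compressed VCF as text; bgzip output is valid gzip."""
    return gzip.open(path, "rt")
```

To use it, pass the handle returned by
open_vcf("multisample.parents_only.info_only.vcf.gz") to iter_records and loop
over the resulting tuples. This streams the whole file; for region queries the
Tabix index is the better route.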

#############################################################################
# Reference genome
#############################################################################

We have used the GRCh38_no_alt_plus_hs38d1_analysis_set supplemented with
PhiX174 as decoy sequence. The FASTA file that was used is in the ./reference
subdirectory.

Reference sequences were downloaded from the NCBI FTP server:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/819/615/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz
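
This README does not record how the two downloads were merged into one FASTA
file. As an illustration only, the sketch below shows one plausible way to do
it in Python: decompress both files in order and append them to a single
uncompressed FASTA. The function names and the output file name are
hypothetical; the FASTA file in ./reference remains the authoritative copy.

```python
import gzip


def concatenate_gzipped_fasta(sources, destination):
    """Decompress each gzipped FASTA in order and append it to one plain FASTA."""
    with open(destination, "wb") as out:
        for source in sources:
            last = b"\n"
            with gzip.open(source, "rb") as handle:
                while True:
                    chunk = handle.read(1 << 20)  # copy in 1 MiB pieces
                    if not chunk:
                        break
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next file's header starts on its own line.
            if last != b"\n":
                out.write(b"\n")


def sequence_names(fasta_path):
    """Yield the first word of every FASTA header line (the sequence name)."""
    with open(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                words = line[1:].split()
                yield words[0] if words else ""
```

Called with the two downloaded .fna.gz files, human analysis set first and
PhiX174 second, this would write a combined FASTA; listing sequence_names on
the result is a quick check that the decoy sequence was appended at the end.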

#############################################################################
# Pipeline
#############################################################################

We have used a pipeline with the following steps:

For each library:
- cutadapt 1.13
- bwa mem 0.7.15-r1140
- Picard SortSam 2.9.0
- GATK BaseRecalibrator 3.7

For each sample:
- sambamba merge 0.6.6
- sambamba markdup 0.6.6
- GATK HaplotypeCaller 3.7

For each family:
- GATK CombineGVCFs 3.7

For all families at once:
- GATK CombineGVCFs 3.7
- GATK GenotypeGVCFs 3.7
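
The four levels above (library, sample, family, all families) nest inside each
other. The Python sketch below makes that nesting explicit by expanding a
cohort layout into an ordered list of steps. It is an illustration only: the
tool names and versions are taken from the list above, but the PIPELINE table,
the plan function and the cohort layout are hypothetical, and no command-line
options are implied.

```python
# Tools per level, in the order listed in this README.
PIPELINE = {
    "library": [
        "cutadapt 1.13",
        "bwa mem 0.7.15-r1140",
        "Picard SortSam 2.9.0",
        "GATK BaseRecalibrator 3.7",
    ],
    "sample": [
        "sambamba merge 0.6.6",
        "sambamba markdup 0.6.6",
        "GATK HaplotypeCaller 3.7",
    ],
    "family": ["GATK CombineGVCFs 3.7"],
    "cohort": ["GATK CombineGVCFs 3.7", "GATK GenotypeGVCFs 3.7"],
}


def plan(cohort):
    """Expand {family: {sample: [library, ...]}} into (level, unit, tool) steps."""
    steps = []
    for family, samples in cohort.items():
        for sample, libraries in samples.items():
            # Every library is processed on its own before the sample-level merge.
            for library in libraries:
                steps += [("library", library, tool) for tool in PIPELINE["library"]]
            steps += [("sample", sample, tool) for tool in PIPELINE["sample"]]
        # Per-family gVCFs are combined once all samples of the family are called.
        steps += [("family", family, tool) for tool in PIPELINE["family"]]
    # The final combine and joint genotyping run once over all families.
    steps += [("cohort", "all families", tool) for tool in PIPELINE["cohort"]]
    return steps
```

For a layout such as {"fam1": {"father": ["libA", "libB"], "mother": ["libC"]}}
the plan starts with the library-level steps for libA and ends with the joint
genotyping step over all families.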