The FLOSSIES sequencing project was inspired by the need for frequencies of genomic variants from appropriate controls for breast cancer genetics studies. Existing public databases are large but not stratified by gender or age. The FLOSSIES project includes allele frequencies of all classes of variants at known and candidate genes for breast cancer susceptibility among women who are cancer free and older than age 70. The participants are Fabulous Ladies Over Seventy. We hope these data will be useful as controls for breast cancer genetics studies by many groups.
Participants in this sequencing project are women who joined the Women’s Health Initiative (WHI) between 1993 and 2005. All participants were 50 to 79 years old and without any history of cancer at enrollment. More than 160,000 women enrolled in WHI, contributed DNA at enrollment, and have been followed ever since. From the WHI participants who are now older than 70 years and have remained cancer-free, approximately 10,000 women were selected at random for this sequencing project: approximately 7,000 women who self-identified as being of European American ancestry and approximately 3,000 women who self-identified as being of African American ancestry. Ancestries were then estimated genetically with more than 500 ancestry-informative markers.
Genes analyzed include established breast cancer genes, both high and moderate penetrance, and genes that have been suggested to predispose to breast cancer when mutant. Strength of evidence varies for these candidate genes. The genes are:
For each gene, rare and common variants in coding regions, 5’UTR, 3’UTR, donor splice sites with 6 flanking bp, and acceptor splice sites with 20 flanking bp are included. Point mutations, small indels, and CNVs are included (although CNV calls are still being updated).
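The splice-site flanks described above can be sketched as simple interval arithmetic. This is only an illustration, assuming that "flanking bp" means intronic bases immediately adjacent to an exon, that coordinates are 1-based and inclusive, and that the exon is internal (it has both an acceptor and a donor site). The function name `padded_exon` and the constants are hypothetical and are not part of the project's pipeline.

```python
ACCEPTOR_FLANK = 20  # intronic bp kept next to the acceptor splice site (assumed meaning)
DONOR_FLANK = 6      # intronic bp kept next to the donor splice site (assumed meaning)


def padded_exon(start, end, strand):
    """Return (start, end) of an internal exon extended by its splice-site flanks.

    Coordinates are 1-based and inclusive on the reference genome.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand == "+":
        # Acceptor site precedes the exon; donor site follows it.
        return start - ACCEPTOR_FLANK, end + DONOR_FLANK
    if strand == "-":
        # On the minus strand the acceptor lies at the higher genomic coordinate.
        return start - DONOR_FLANK, end + ACCEPTOR_FLANK
    raise ValueError("strand must be '+' or '-'")


plus_interval = padded_exon(1000, 1100, "+")
minus_interval = padded_exon(1000, 1100, "-")
```

The asymmetry between the two flanks means that strand matters: the same exon coordinates give different padded intervals on the plus and minus strands.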
Sequencing was carried out in the King Lab at the University of Washington, Seattle, on an Illumina HiSeq with 2x100 bp paired-end reads, and by Color Genomics on an Illumina NextSeq with 2x150 bp paired-end reads, in both cases using modified versions of the BROCA panel [1]. Median coverage per sample was >250x.
Paired-end sequence reads were aligned to the human reference genome (hg19) using Burrows-Wheeler Aligner (BWA) v0.7.9a. Removal of PCR duplicates, sorting, and indexing were carried out with SAMtools v0.1.19. Data were excluded for 3 samples with low coverage and 2 samples with low quality. For quality control, WHI included 105 duplicates and 20 triplicates, blinded to the sequencing teams. These duplicates and triplicates were all identified and removed.
Indel realignment and base quality score recalibration were performed with the Genome Analysis Toolkit (GATK v3.0) using recommended parameters. Variants were detected with the GATK UnifiedGenotyper. Variants were included if the variant fraction was at least 0.25. Copy number variants were identified using our in-house read-depth-based pipeline.
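As a minimal sketch of the inclusion threshold, the snippet below assumes that the variant fraction is the number of reads supporting the variant allele divided by the total number of reads at the site. That definition, and the names `variant_fraction` and `passes_filter`, are assumptions made for illustration and do not describe the project's actual code.

```python
MIN_VARIANT_FRACTION = 0.25  # threshold stated in the text


def variant_fraction(alt_reads, total_reads):
    """Fraction of reads at a site that support the variant allele (assumed definition)."""
    if total_reads <= 0:
        return 0.0  # no coverage: treat as no support for the variant
    return alt_reads / total_reads


def passes_filter(alt_reads, total_reads, threshold=MIN_VARIANT_FRACTION):
    """Keep a call only if its variant fraction is at least the threshold."""
    return variant_fraction(alt_reads, total_reads) >= threshold


# Hypothetical read counts, for illustration only.
calls = [(25, 100), (24, 100), (130, 260), (0, 0)]
kept = [c for c in calls if passes_filter(*c)]
```

With these hypothetical counts, a call supported by 25 of 100 reads sits exactly on the threshold and is kept, while one supported by 24 of 100 reads is dropped.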
Each sample was assigned to the population with highest estimated likelihood of ancestry. Ancestry analysis yielded estimates of 7,325 women of European American ancestry and 2,559 women of African American ancestry.
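The assignment rule amounts to taking, for each sample, the population with the largest estimated likelihood. The sketch below shows that rule on made-up numbers; the function name `assign_population`, the sample identifiers, and the likelihood values are hypothetical, and how the likelihoods are estimated from the ancestry-informative markers is not shown.

```python
from collections import Counter


def assign_population(likelihoods):
    """Return the population label with the highest estimated likelihood."""
    if not likelihoods:
        raise ValueError("at least one population likelihood is required")
    return max(likelihoods, key=likelihoods.get)


# Hypothetical per-sample likelihoods, for illustration only.
samples = {
    "sample_A": {"European American": 0.92, "African American": 0.08},
    "sample_B": {"European American": 0.15, "African American": 0.85},
    "sample_C": {"European American": 0.60, "African American": 0.40},
}

assignments = {name: assign_population(lk) for name, lk in samples.items()}
counts = Counter(assignments.values())
```

Tallying the assignments with `Counter` gives per-population totals of the same kind as the counts reported above.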