A while ago I purchased whole genome sequencing (WGS) with an average of 30x depth. Partially for personal medical reasons, but mostly out of curiosity. I’ve always been interested in genetics, but never really had a good reason to get too much into it. Getting a copy of my own genome gave me the necessary motivation, but also a massive amount of genetic data to work with. As someone fairly new to genetics, but familiar with computing and data analysis, I did find a lot of the software fairly difficult to get the hang of at first. I had no issues running the software, but not knowing the various terms and acronyms used within genetics was an initial barrier to entry. As most consumers who purchase WGS will stick with the analysis that their genetics laboratory provides, most online guides and tutorials assume a great degree of preliminary genetics and genetic analysis knowledge. This series is meant to provide an alternative to that, without being a straight up tutorial. I’ll mostly go over the software that I used, what it does, and some cool things you can do with your genetic data. While I had initially intended this to be a single post, it got way too long and honestly fits better as a series.

I’ll include links to the other posts here, once they’re written.

Who to go with for WGS?

Firstly, if you haven’t had whole genome sequencing done yet, getting that data is the first step. Fair warning though, this is a process that can take a very long time. Depending on where you live and what provider you go with, you could potentially be looking at 6+ months from the order date to get your results. Mine personally was a few days short of 7 months, with the first two months mostly in transit to and from Australia. I don’t want to recommend any particular provider, but I will go over some of the factors that I considered when choosing the one that I went with.

Regional Availability

This one is pretty obvious, but a large factor in my personal choice was whether they shipped to where I live (Australia). This was the largest factor, as if a provider didn’t ship to me they were automatically not an option. Depending on where you live you might have more or less choice available to you.

Coverage

Different providers advertise different coverages (also called average read depths) at different prices, the most common being 30x. This is basically a measure of how many times, on average, they will read each position in your genome, which lowers the chance of errors. While 30x might seem like it’s enough for all cases, it’s only an average, so it doesn’t actually guarantee you’ll have sufficient read depth across your entire genome. Due to the physical structure of DNA, some areas will inherently be harder or easier to read. This means that while some areas might end up being read 70 times, others might only be read once or twice. The majority of the genome will have sufficient read depth, but if your primary motivation is rare genetic diseases affecting specific sections of genes, or something else requiring high accuracy everywhere, higher read depth may be something to consider.
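To get a feel for why an average doesn’t guarantee depth everywhere, here’s a rough sketch. It assumes the depth at each position follows a Poisson distribution around the average, which is an idealised model chosen here purely for illustration; real sequencing is more uneven than this, so treat the numbers as a best case. The function names and the 10x threshold are made up for the example.

```python
import math


def average_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage: total bases sequenced divided by genome size."""
    return num_reads * read_length / genome_size


def fraction_below_depth(mean_depth: float, min_depth: int) -> float:
    """Fraction of positions expected to have fewer than min_depth reads.

    Assumes depth per position is Poisson-distributed around mean_depth,
    which is an idealised model; real data has more low-depth regions.
    """
    total = 0.0
    # Sum the Poisson probabilities P(depth = k) for k = 0 .. min_depth - 1.
    for k in range(min_depth):
        # Work in log space so large factorials don't overflow.
        log_p = -mean_depth + k * math.log(mean_depth) - math.lgamma(k + 1)
        total += math.exp(log_p)
    return total


if __name__ == "__main__":
    for mean in (15, 30, 60):
        share = fraction_below_depth(mean, 10)
        print(f"{mean}x average: {share:.6%} of positions below 10x")
```

Even under this optimistic model some positions fall below the threshold, and the share shrinks quickly as the average goes up, which is the intuition behind paying for higher coverage.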

Output Data Formats

Many providers offer multiple formats for your data, but it’s important to be aware of what each of these formats means, and the limitations that some of them have. In general, you want to make sure your chosen provider gives you your data in the format that’s the most versatile and contains the most extractable information. Ideally your provider will provide data in all of these formats. For a deeper look into the actual content of these file formats, I found this useful video.

FASTQ

FASTQ files are commonly provided, and are basically the raw output from the sequencer. These are functionally useless by themselves, but with alignment software these files provide the most versatility. They’re useful to have but unless you want to put in a lot of time aligning your genome (potentially days), you also want to ensure your provider provides other formats. These files will also generally be very large, around 100GB for a 30x coverage. These files are especially useful if you want to align the genome yourself to a specific chosen reference genome, but you can generally produce these files from other file formats that contain all genome data in a worst case situation.
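To make “raw output from the sequencer” a bit more concrete, here’s a minimal sketch of reading a FASTQ file. It assumes the common layout of four lines per read (an `@` header, the bases, a `+` separator, and a quality string) with Phred+33 quality encoding. The record below is made up for illustration, and `read_fastq` is just a name I’ve picked, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: list[int]  # Phred scores, one per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed, four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        # Phred+33: each quality character's ASCII code minus 33 is the score.
        scores = [ord(ch) - 33 for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A made-up record purely for illustration.
EXAMPLE = "@read1\nGATTACA\n+\nIIII#II\n"

if __name__ == "__main__":
    for record in read_fastq(io.StringIO(EXAMPLE)):
        print(record.read_id, record.sequence, record.qualities)
```

Notice there’s no chromosome or position anywhere in a record, only the bases and how confident the sequencer was in each one. That’s why these files need alignment before they’re useful. Real files are usually gzip-compressed, in which case opening them with `gzip.open(path, "rt")` from the standard library should work with the same function.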

CRAM/BAM/SAM

These files are similar to FASTQ files in that they contain all of the output of the sequencer, but they’ve already had the alignment process applied to them with a reference genome of your provider’s choice. These are the most “ready to go” files as they contain all of the data, but in a format that actually lines up with your genome. Every read that could be aligned has been placed at a position on a chromosome. If you want to look at a specific gene with an analysis program, you can do it with these files.

The difference between CRAM, BAM, and SAM is basically the level of compression applied. SAM files are plain text, and opening them with a text editor like Notepad will show raw DNA and alignment data, if it doesn’t crash your computer by trying to load a 100GB+ file first. BAM files are a compressed binary format instead. They don’t store the data in plain text, so they save some space at the cost of not being able to open the file with a text editor. CRAM files are a newer format, and basically just an even more compressed version of BAM files. Out of these three, CRAM is the format a provider is most likely to offer. They’re functionally the same and fairly trivial (albeit time consuming) to convert between. In general, expect everything DNA related to be time consuming.
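Since SAM is the plain text version, it’s the easiest one to show what an aligned read actually looks like. The sketch below pulls out some of the mandatory tab-separated columns from a single SAM alignment line. The example line is made up, and `parse_sam_line` and `is_unmapped` are names chosen for this illustration rather than functions from an existing tool.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    read_name: str
    flag: int
    reference: str  # chromosome/contig name, '*' if unmapped
    position: int   # 1-based leftmost mapped position
    mapq: int       # mapping quality
    cigar: str      # how the read lines up against the reference
    sequence: str


def parse_sam_line(line: str) -> SamAlignment:
    """Parse some of the mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines have at least 11 tab-separated fields")
    return SamAlignment(
        read_name=fields[0],
        flag=int(fields[1]),
        reference=fields[2],
        position=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


def is_unmapped(alignment: SamAlignment) -> bool:
    # Bit 0x4 of the FLAG field marks a read that could not be aligned.
    return bool(alignment.flag & 0x4)


# A made-up alignment line purely for illustration.
EXAMPLE_LINE = "read1\t0\tchr1\t10468\t60\t7M\t*\t0\t0\tGATTACA\tIIII#II"

if __name__ == "__main__":
    print(parse_sam_line(EXAMPLE_LINE))
```

Compared to a FASTQ record, the same read now carries a chromosome and a position. In practice you wouldn’t parse these by hand; a tool such as samtools handles decoding and converting between the three formats.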

CRAM files can also come with a CRAI file, which acts as an index to allow faster accesses when opened by various genetics programs. If you’ve been given the option to download a CRAM and CRAI file, download both and put them in the same place.
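If you want to double check that the index ended up where programs will look for it, a small helper like the one below can do it. It assumes the index is named either `sample.cram.crai` or `sample.crai` and sits in the same folder as the CRAM file. Those are the naming conventions I’d expect, but check what your provider and tools actually use. `find_cram_index` is a hypothetical name for this sketch.

```python
from pathlib import Path
from typing import Optional


def find_cram_index(cram_path: Path) -> Optional[Path]:
    """Return the index file sitting next to a CRAM file, if one exists."""
    # Assumed naming conventions; adjust if your tools expect something else.
    candidates = [
        cram_path.with_name(cram_path.name + ".crai"),  # sample.cram.crai
        cram_path.with_suffix(".crai"),                 # sample.crai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # "sample.cram" is a placeholder path for illustration.
    index = find_cram_index(Path("sample.cram"))
    print(index if index else "No index found next to the CRAM file")
```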

VCF

VCF files can be useful, but are generally the least useful out of these options. Similar to CRAM files, they are mapped to a reference genome. These files, however, do not contain all of the data. In order to significantly save on file size, this file type only stores data that differs from the reference. This means that rather than storing your entire DNA, they’re instead storing a list of variants (differences) compared to a reference genome. The reference genome used will also usually be listed, as these files are functionally useless without knowing what they were built against.

The main downside of this file format is that you’re missing all data that could not be mapped with the reference genome. If you want to re-map this with another reference genome that covers a different area of the genome or covers it differently, you’re going to have missing data. Sites and services that accept VCF files also generally only accept VCF files mapped against certain reference genomes, so they aren’t as universally versatile as the other file formats.
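To show what “a list of differences against a reference” looks like in practice, here’s a minimal sketch that reads VCF text. It assumes the standard layout of `##` header lines followed by tab-separated data lines starting with CHROM, POS, ID, REF, and ALT, and it looks for a `##reference=` header line, which not every file includes. The example content is made up, and `read_vcf` is a name chosen for this illustration.

```python
from typing import Iterable, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int         # 1-based position on the reference
    ref: str         # what the reference genome has here
    alts: list[str]  # what was observed instead


def read_vcf(lines: Iterable[str]) -> tuple[Optional[str], list[Variant]]:
    """Return (reference named in the header, variants) from VCF text lines."""
    reference = None
    variants = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##reference="):
            reference = line.split("=", 1)[1]
        elif line.startswith("#") or not line:
            continue  # other header lines and blank lines
        else:
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            variants.append(Variant(chrom, int(pos), ref, alt.split(",")))
    return reference, variants


# Made-up file content purely for illustration.
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "##reference=GRCh38",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10468\t.\tC\tG\t50\tPASS\t.",
    "chr2\t20500\t.\tA\tAT,G\t99\tPASS\t.",
]

if __name__ == "__main__":
    reference, variants = read_vcf(EXAMPLE)
    print("Built against:", reference)
    for variant in variants:
        print(variant)
```

Every position is only meaningful relative to the reference named in the header, which is why a VCF built against one reference can’t simply be reused with another.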

Data Privacy

Genetic data is kind of personal, arguably one of the most personal kinds of data that exist. Not only is it personal to you, but your genetic data also has privacy implications for those related to you. While most people purchasing a WGS service will be doing it to use the platform’s DNA reports, if you’re looking to do this for purely raw DNA data access it might be a good idea to look for a service that lets you fully erase your DNA data once downloaded.

Due to the rising levels of data breaches targeting DNA, and the potential implications of this sort of data being stolen, I would recommend heavily considering this factor. Ensuring that the provider has a solid privacy policy that allows requesting complete data removal lowers this risk substantially. While it does mean you lose access to their online reports, it means you’re 100% in control of your own data. If you still want to have an online backup, one option is to upload an encrypted copy of the FASTQ or CRAM files to a cloud storage provider such as Azure Blob Storage’s archival tier.

This risk is not just hypothetical, and you may be targeted in ways that you wouldn’t expect. There was previously a targeted attack on 23andMe, specifically targeting users of Ashkenazi descent. As DNA sequencing becomes more popular, these attacks will likely become more frequent.

Sample Type

This one might be a bit more niche, but the part of your body that the DNA sample is taken from can affect what you can do with the data. For example, DNA sequencing of a buccal (cheek) swab allows the provider to also provide the DNA of your oral microbiome, which many do. On the other hand, a blood draw is going to provide a significantly purer sample due to less bacterial contamination. Blood draws come with other downsides however, such as requiring a phlebotomist and being generally annoying to do. In general, if you follow the instructions from your provider carefully your sample should be viable.

Conclusion

There are many factors to consider when choosing a WGS provider, so it’s important to make the choice that’s right for you. The next article in this series will cover what to do once you actually have your data, and some basic usage of the tool WGSExtract.

About the Author

Maddy Miller

Hi, I'm Maddy Miller, a Senior Software Engineer at Clipchamp at Microsoft. In my spare time I love writing articles, and I also develop the Minecraft mods WorldEdit, WorldGuard, and CraftBook. My opinions are my own and do not represent those of my employer in any capacity.