The Microbiome Quality Control project
Final data products
Note that all data products included here and below have been blinded to anonymize the handling labs (abbreviated HLs) and bioinformatics labs (BLs) that participated in the MBQC-baseline.
- Integrated OTU table. This includes all ~16,500 samples and OTUs that were deposited by any bioinformatics lab in an appropriate format, in addition to metadata describing each sample's originating biospecimen, bioinformatics lab, and handling lab.
- Specimen list. The MBQC-baseline included 22 specimens (plus negative controls) of four types: fresh and freeze-dried human stool, chemostat aliquots, and fecal and oral artificial communities (as positive controls).
- Sample set aliquots. From these originating specimens, aliquots were generated and assembled into standardized 96-sample sets for distribution to handling participants. This table lists the first-stage blinded identifiers, specimen, and aliquot information for all sample set contents.
- Sample handling protocols. While labs could choose their own data generation protocols, as long as they resulted in demultiplexed Illumina 16S amplicon sequences, a detailed form systematically recorded protocol variables.
- Bioinformatics protocols. Likewise, while labs could choose any bioinformatics protocol that resulted in a standardized OTU table, the MBQC-baseline systematically recorded protocol variables in this table.
- Bioinformatics distribution blinding. A second internal blinding was used to hide the originating handling laboratory for each raw data file distributed prior to bioinformatics processing. This table links the handling lab ID and original specimen ID to this internally blinded random identifier (not used for final data integration, but useful with the blinded bioinformatics distribution below).
- Mock community composition. The microbial strains and approximate quantities (by loop count) used in constructing the fecal and oral derived artificial communities.
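As an illustration of how the bioinformatics distribution blinding table could be used, the following minimal Python sketch resolves blinded identifiers back to a handling lab and specimen. Only the "Bioinformatics.ID" field name comes from this page; the other column names (Handling.Lab, Specimen.ID), the tab-separated layout, and all values are assumptions made for the example and may not match the actual file.

```python
import csv
import io

# Hypothetical excerpt of the blinding table. Every column name except
# "Bioinformatics.ID" is assumed, and all values are made up.
BLINDING_TSV = (
    "Bioinformatics.ID\tHandling.Lab\tSpecimen.ID\n"
    "BX001\tHL-A\tSP01\n"
    "BX002\tHL-A\tSP02\n"
    "BX003\tHL-B\tSP01\n"
)


def load_blinding(handle):
    """Map each blinded identifier to its (handling lab, specimen) pair."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        key = row["Bioinformatics.ID"]
        if key in mapping:
            # A blinded identifier should point to exactly one sample.
            raise ValueError(f"duplicate blinded identifier: {key}")
        mapping[key] = (row["Handling.Lab"], row["Specimen.ID"])
    return mapping


def unblind(sample_ids, mapping):
    """Return (id, lab, specimen) per ID; unknown IDs get None values."""
    resolved = []
    for sample_id in sample_ids:
        lab, specimen = mapping.get(sample_id, (None, None))
        resolved.append((sample_id, lab, specimen))
    return resolved


blinding = load_blinding(io.StringIO(BLINDING_TSV))
resolved = unblind(["BX003", "BX001", "BX999"], blinding)
```

To work with the real table, replace the in-memory string with an open file handle and adjust the column names to those in the downloaded file.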
Raw data from sample handling
All raw sequencing data from the MBQC-baseline are available at SRA BioProject SRP047083. This file table additionally provides completely raw data as deposited by each of the MBQC-baseline’s sample handling labs. This includes Illumina 16S sequences as provided by each lab and (anonymized) manifests linking each file to one (if demultiplexed) or more originating biospecimens. Sample IDs and metadata are as provided above and in the MBQC supplement.
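Because a manifest may link a file to one sample (demultiplexed) or to several, it can help to group manifest rows by file before processing. The sketch below assumes a two-column, tab-separated manifest with hypothetical column names (file, sample_id) and made-up file names; the deposited manifests may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical manifest excerpt; column names and values are assumptions.
MANIFEST_TSV = (
    "file\tsample_id\n"
    "lab1_run1.fastq.gz\tS001\n"
    "lab1_run1.fastq.gz\tS002\n"
    "lab2_S003.fastq.gz\tS003\n"
)


def samples_by_file(handle):
    """Group sample IDs by the sequence file that contains them."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["file"]].append(row["sample_id"])
    return dict(groups)


def multiplexed_files(groups):
    """Files linked to more than one sample still need demultiplexing."""
    return sorted(name for name, ids in groups.items() if len(ids) > 1)


groups = samples_by_file(io.StringIO(MANIFEST_TSV))
needs_demux = multiplexed_files(groups)
```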
Blinded sample handling data for bioinformatics distribution
Between the MBQC-baseline’s sample handling / data generation phase and its bioinformatics phase, all sequence data were deposited by the handling labs and re-blinded prior to distribution for bioinformatics. This file table provides the raw, re-blinded sequence data as distributed to bioinformatics teams. Each handling lab is anonymized, and each sample is centrally demultiplexed and identified by the “Bioinformatics.ID” field as provided above and in the MBQC supplement.
Raw data from bioinformatics
This file table provides completely raw data as deposited by each of the MBQC-baseline's bioinformatics labs. This includes, at the least, an OTU table identified using Greengenes identifiers and a Newick-formatted phylogeny incorporating any additional de novo OTUs. Some labs chose to also deposit supplemental files such as processing scripts or protocol documentation.
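One simple consistency check on a deposit is to confirm that every OTU identifier in the table also appears as a leaf in the accompanying Newick phylogeny. The following is a rough sketch using only the Python standard library. It assumes unquoted leaf labels, and the tree and OTU identifiers shown are invented for illustration; a dedicated Newick parser would be more robust for real files.

```python
import re


def newick_leaf_names(newick):
    """Collect leaf labels: unquoted names that directly follow '(' or ','.

    Internal node labels follow ')' and branch lengths follow ':', so
    neither is captured. Quoted labels are not handled.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def otus_missing_from_tree(otu_ids, newick):
    """Return, in sorted order, the OTU IDs with no matching leaf."""
    leaves = newick_leaf_names(newick)
    return sorted(otu for otu in otu_ids if otu not in leaves)


# Made-up example: numeric IDs stand in for reference OTUs, and "denovo"
# names stand in for de novo OTUs added by a lab.
TREE = "((4479944:0.1,denovo12:0.2)0.95:0.3,(1000269:0.05,359105:0.07):0.4);"
OTU_IDS = ["4479944", "1000269", "denovo12", "denovo77"]

missing = otus_missing_from_tree(OTU_IDS, TREE)
```

Here `missing` lists the OTUs that would need to be added to the tree or dropped from the table before any phylogeny-aware analysis.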