CDCB Accepted genotype file formats

Important notice

Files shared with CDCB must NOT contain:

special characters
hexadecimal characters
values in scientific notation
characters from non-Latin alphabets (Chinese, Japanese, Cyrillic, etc.)
letters with accents

Please check your files before submission to identify and correct such issues.
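
A pre-submission check along these lines could be scripted. The sketch below is only an illustration and rests on two assumptions that are not CDCB rules: "special characters" is read as anything outside printable ASCII, and scientific notation is detected with a simple pattern. It does not attempt to detect hexadecimal characters, and the name find_issues is hypothetical.

```python
import re

# Assumption: "special characters" is read here as anything outside printable
# ASCII (plus tab and line endings); CDCB may apply a different rule.
ALLOWED = re.compile(r"^[\x20-\x7E\t\r\n]*$")
# Heuristic for values such as 7.99655E+09 (a barcode damaged by a spreadsheet).
SCI_NOTATION = re.compile(
    r"(?<![A-Za-z0-9])\d+(?:\.\d+)?[eE][+-]?\d+(?![A-Za-z0-9])"
)

def find_issues(text):
    """Return a list of (line_number, problem) tuples for already-decoded text."""
    issues = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not ALLOWED.match(line):
            issues.append((number, "non-ASCII or control character"))
        if SCI_NOTATION.search(line):
            issues.append((number, "value in scientific notation"))
    return issues
```

Because the scientific-notation pattern is a heuristic, any line it flags should be reviewed by hand rather than corrected automatically.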

SampleSheet

Type of file: CSV, comma delimited, zipped (.zip)
File name: Year (4 bytes) + Month (2 bytes) + Day (2 bytes) + Sample_set within day + Serial number for sample-set submission/Chip type + "SampleSheet"
File name example: 2014042812_50KSampleSheet.zip (the file inside the zip archive should be named "2014042812_50KSampleSheet.csv"), where

20140428 is the date
1 is the submission within date (sequential number)
2 is the version within submission (sequential number)
50K is the array ID (as detailed in Array ID information - please confirm with CDCB Staff)
SampleSheet is the type of file
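
The naming scheme can be checked mechanically. The following sketch splits a file name into the components listed above; it assumes that the submission and version numbers are one digit each (as in the example) and that the array ID is purely alphanumeric. Neither is stated as a rule in this page, and parse_file_name is a hypothetical helper. The same pattern also accepts FinalReport names.

```python
import re
from datetime import datetime

# Assumptions: one digit each for submission and version (as in the example),
# and an alphanumeric array ID between "_" and the file type.
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})(?P<submission>\d)(?P<version>\d)_"
    r"(?P<array_id>[A-Za-z0-9]+?)(?P<file_type>SampleSheet|FinalReport)"
    r"\.(?P<extension>zip|csv|txt)$"
)

def parse_file_name(name):
    """Split a submission file name into its components."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    parts = match.groupdict()
    # strptime raises ValueError for impossible dates such as 20141340.
    parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    parts["submission"] = int(parts["submission"])
    parts["version"] = int(parts["version"])
    return parts
```

For example, parse_file_name("2014042812_50KSampleSheet.zip") yields the date 2014-04-28, submission 1, version 2, array ID "50K" and file type "SampleSheet".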

Content: Consecutive commas indicate a null column
Example (fake data):

[Header],,,,,,,,,,,,,,,,,,,,,,,,
Investigator Name,Chris Doe,,,,,,,,,,,,,,,,,,,,,,, 
Project Name,14-177,,,,,,,,,,,,,,,,,,,,,,,
Experiment Name,14-177_Shipment_20140415-171700-1,,,,,,,,,,,,,,,,,,,,,,,
Date,2014-04-28_11h19m27s,,,,,,,,,,,,,,,,,,,,,,,
[Manifests],,,,,,,,,,,,,,,,,,,,,,,,
A,BovineLD_C,,,,,,,,,,,,,,,,,,,,,,,,
[Data]
Sample_ID,Sample_Plate,Sample_Name,Project,AMP_Plate,Sample_Well,SentrixBarcode_A,SentrixPosition_A,Scanner,Date_Scan,Replicate,Parent1,Parent2,Gender,Sample_Source,Study,Subclient,Well on DNA plate,Sex,Family Info,Other_Name,Sire,Dam,Comment
HOITAF12345678903,Master-DLMDNA-2014-1727_1-1,HOITAF12345678903,AIPL,InfWP2014-0380-Reaction1,A01,7996554214,R01C01,,,,,,,Hair,14-177,,B5,Female,,Mona Lisa,HOITAM12345678901,HOITAF12345678902,
HODEUF123456789,Master-DLMDNA-2014-1727_1-1,HODEUF123456789,AIPL,InfWP2014-0380-Reaction1,B01,7996554214,R03C01,,,,,,,Hair,14-177,,C5,Female,,Schoene Maedchen,HOUSAM12345678,HONLDF123456789,
HODEUF123456790,Master-DLMDNA-2014-1727_1-1,HODEUF123456790,AIPL,InfWP2014-0380-Reaction1,C01,7996554214,R05C01,,,,,,,Hair,14-177,,D5,Female,,Tupfen Tulpe,HOUSAM12345679,HONLDF123456790,
[...]

None of the first 6 required columns may contain missing data.

File Structure and Description

File Section	Required	Description
[Header]	No	Not used for processing. Provides additional information about the project
[Manifests]	No	Not used for processing. Provides additional information about the chip
[Data]	Yes	Used for processing. Identifies where the data starts
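
As an illustration of how the section markers divide the file, the sketch below groups the lines of a SampleSheet under their [Section] names so that the [Data] block can be picked out. The function name split_sections is hypothetical, and the code assumes that a marker is always the first comma-separated cell on its line, as in the example above.

```python
def split_sections(lines):
    """Group the lines of a SampleSheet under their [Section] markers."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # A marker may carry trailing commas, e.g. "[Header],,,,".
        first_cell = line.split(",", 1)[0].strip()
        if first_cell.startswith("[") and first_cell.endswith("]"):
            current = first_cell[1:-1]
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    return sections
```

A caller would typically reject the file when "Data" is not among the returned keys, since that is the only required section.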

Data Section Structure and Description

[Data] Column Name	Required	Max. Length	Description
Sample_ID	Yes	20 bytes	Sample ID. Defined by the Nominator and Lab, who must have an established methodology. Used for processing and storage; must match the FinalReport Sample_ID
Sample_Plate	No	13 bytes	Plate ID. Used to investigate problems
Sample_Name	Yes	18 bytes	Animal ID. Used for processing and storage; must contain the animal ID, but leading zeros may be excluded. Ex: HOUSAF00123456789 = HOUSAF123456789
Project	Yes	12 bytes	Approved Requester ID; the organization arranging for genotyping. Used for processing and storage
AMP_Plate	No	14 bytes	Amplification plate ID. Used to investigate problems
Sample_Well	Yes	3 bytes	Well ID or coordinates in the amplification plate. Used to investigate problems
SentrixBarcode_A	Yes	12 bytes	Chip's barcode. Used for processing and storage. Together with SentrixPosition_A, constitutes the unique genotype identification
SentrixPosition_A	Yes	6 bytes	Position on the chip. Used for processing and storage. Together with SentrixBarcode_A, constitutes the unique genotype identification
Sample_Source	Yes	6 bytes	Tissue or DNA source. Used for processing and storage. The column may be labeled "DNA_source" or "Tissue_source". Accepted values: hair, blood, semen, tissue (including extracted DNA), nasal
Additional Columns	No	N/A	May be included but are not stored
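
The required columns and length limits in the table above lend themselves to an automated check. The sketch below is one possible way to do it, not CDCB's own validation: validate_data_rows is a hypothetical name, lengths are counted in characters (equal to bytes for plain ASCII), Sample_Source is compared case-insensitively as an assumption, and the alternative labels "DNA_source" and "Tissue_source" are not handled.

```python
import csv
import io

# Required [Data] columns and their maximum lengths, taken from the table above.
REQUIRED_MAX_LENGTH = {
    "Sample_ID": 20,
    "Sample_Name": 18,
    "Project": 12,
    "Sample_Well": 3,
    "SentrixBarcode_A": 12,
    "SentrixPosition_A": 6,
    "Sample_Source": 6,
}
# Assumption: values are matched case-insensitively ("Hair" counts as "hair").
ACCEPTED_SOURCES = {"hair", "blood", "semen", "tissue", "nasal"}

def validate_data_rows(data_text):
    """Yield (row_number, message) for each problem in the text below [Data]."""
    reader = csv.DictReader(io.StringIO(data_text))
    for row_number, row in enumerate(reader, start=1):
        for column, max_length in REQUIRED_MAX_LENGTH.items():
            value = (row.get(column) or "").strip()
            if not value:
                yield row_number, f"{column} is missing"
            elif len(value) > max_length:
                yield row_number, f"{column} exceeds {max_length} bytes"
        source = (row.get("Sample_Source") or "").strip().lower()
        if source and source not in ACCEPTED_SOURCES:
            yield row_number, f"Sample_Source '{source}' is not an accepted value"
```

The input is the column-name line followed by the sample rows, that is, everything after the [Data] marker. Row numbers count sample rows and exclude the column-name line.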

Notes: Standard GenomeStudio matrix output is expected. This means 1 column per genotype and 1 row per SNP.

FinalReport

Type of file: TXT, tab delimited (including the header), zipped (.zip)
File name: Year (4 bytes) + Month (2 bytes) + Day (2 bytes) + Sample_set within day + Serial number for sample-set submission/Chip type + "FinalReport"
File name example: 2014042812_50KFinalReport.zip (the file inside the zip archive should be named "2014042812_50KFinalReport.txt"), where

20140428 is the date
1 is the submission within date (sequential number)
2 is the version within submission (sequential number)
50K is the array ID (as detailed in Array ID information - please confirm with CDCB Staff)
FinalReport is the type of file

Genotype coding: A/B format (use -- for no call)
Example (fake data):

[Header]
GSGT Version 1.8.4
Processing Date 04/28/2014 9:17 AM
Content BovineLD_C.bpm
Num SNPs 6909
Total SNPs 6909
Num Samples 262
Total Samples 264
[Data]
                   HOITAF12345678903    HODEUF123456789    HODEUF123456790    HODEUF123456791    HOITAF12345678904  [...]
ARS-BFGL-BAC-10975   AB   AB   AB   AB   AB   [...]
ARS-BFGL-BAC-11025   AB   AA   AB   AA   AB   [...]
[...]

Notes:

Use "AB" coding for SNP genotypes.
Use 1 column per animal.
The first row contains the sample IDs; each following row contains the genotypes for one SNP (1 row per SNP).
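
To make the matrix layout concrete, here is a minimal reader for the lines of an unzipped FinalReport. It is a sketch under stated assumptions: the sample-ID row is taken to begin with an empty cell above the SNP-name column (as the example suggests), the accepted genotype codes are taken to be AA, AB, BB and -- (only A/B coding and -- for no call are stated in this page), and read_final_report is a hypothetical name.

```python
# Assumption: A/B coding means AA, AB or BB, with "--" for a no call.
VALID_CALLS = {"AA", "AB", "BB", "--"}

def read_final_report(lines):
    """Return (sample_ids, genotypes) from the lines of a FinalReport file."""
    rows = [line.rstrip("\r\n") for line in lines]
    start = rows.index("[Data]") + 1  # raises ValueError if [Data] is absent
    # Assumption: the sample-ID row starts with an empty cell above the SNP names.
    sample_ids = rows[start].split("\t")[1:]
    genotypes = {}
    for row in rows[start + 1:]:
        if not row:
            continue
        snp, *calls = row.split("\t")
        if len(calls) != len(sample_ids):
            raise ValueError(
                f"{snp}: expected {len(sample_ids)} calls, got {len(calls)}"
            )
        bad = set(calls) - VALID_CALLS
        if bad:
            raise ValueError(f"{snp}: unexpected genotype codes {sorted(bad)}")
        genotypes[snp] = dict(zip(sample_ids, calls))
    return sample_ids, genotypes
```

The returned sample IDs can then be compared with the Sample_ID column of the matching SampleSheet, since the two must agree.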
