CDCB Accepted genotype file formats¶
Important notice¶
Files shared with CDCB should NOT contain:
- special characters
- hexadecimal characters
- values in scientific notation
- characters from non-Latin alphabets (Chinese, Japanese, Cyrillic, etc.)
- letters with accents
Please check your files before submission to identify and correct such issues.
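A pre-submission check along these lines can be sketched in Python. This is a minimal sketch, not a CDCB tool: the function name `find_issues` is hypothetical, "non-ASCII" is used as an assumed stand-in for foreign alphabets and accented letters, and the scientific-notation pattern is an assumption about what such values look like (for example, a barcode that a spreadsheet has rewritten as 7.99655E+09). It does not attempt to detect "special" or "hexadecimal" characters, which the notice does not define precisely.

```python
import re

# Assumption: a value in scientific notation is a standalone token such as 7.99655E+09.
SCI_NOTATION = re.compile(
    r"(?<![A-Za-z0-9_.])[+-]?\d+(?:\.\d+)?[eE][+-]?\d+(?![A-Za-z0-9_.])"
)

def find_issues(text):
    """Return (line number, description) pairs for suspicious content."""
    issues = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Any character outside 7-bit ASCII covers accents and non-Latin alphabets.
        bad = sorted({ch for ch in line if ord(ch) > 127})
        if bad:
            issues.append((line_no, "non-ASCII characters: " + " ".join(bad)))
        for match in SCI_NOTATION.finditer(line):
            issues.append((line_no, "scientific notation: " + match.group()))
    return issues
```

Running `find_issues` on the unzipped file content before zipping gives a list of lines to inspect by hand; an empty list means none of these two patterns were found.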
SampleSheet¶
Type of file: CSV, comma delimited, zipped (.zip)
File name: Year (4 bytes) + Month (2 bytes) + Day (2 bytes) + Sample_set within day + Serial number for sample-set submission/Chip type + "SampleSheet"
File name example: 2014042812_50KSampleSheet.zip (file contained in the zip file should be "2014042812_50KSampleSheet.csv")
- 20140428 is the date
- 1 is the submission within date (sequential number)
- 2 is the version within submission (sequential number)
- 50K is the array ID (as detailed in Array ID information - please confirm with CDCB Staff)
- SampleSheet is the type of file
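The naming convention can be checked programmatically. The sketch below is one possible reading of the pattern, based only on the example name: it assumes the submission and version numbers are one digit each and that the array ID consists of letters and digits. The names `FILE_NAME` and `parse_file_name` are hypothetical.

```python
import re
from datetime import datetime

# Assumptions: one digit each for submission and version; alphanumeric array ID.
FILE_NAME = re.compile(
    r"^(?P<date>\d{8})(?P<submission>\d)(?P<version>\d)_"
    r"(?P<array>[A-Za-z0-9]+?)(?P<kind>SampleSheet|FinalReport)"
    r"\.(?P<ext>zip|csv|txt)$"
)

def parse_file_name(name):
    """Split a file name into its parts, or return None if it does not fit."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        # Reject impossible dates such as month 13.
        parts["date"] = datetime.strptime(parts["date"], "%Y%m%d").date()
    except ValueError:
        return None
    parts["submission"] = int(parts["submission"])
    parts["version"] = int(parts["version"])
    return parts
```

For `2014042812_50KSampleSheet.zip` this yields the date 2014-04-28, submission 1, version 2, array ID `50K`, and file type `SampleSheet`. The array ID itself still needs to be confirmed with CDCB staff.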
Content: Consecutive commas indicate a null column
Example (fake data):
[Header],,,,,,,,,,,,,,,,,,,,,,,,
Investigator Name,Chris Doe,,,,,,,,,,,,,,,,,,,,,,,
Project Name,14-177,,,,,,,,,,,,,,,,,,,,,,,
Experiment Name,14-177_Shipment_20140415-171700-1,,,,,,,,,,,,,,,,,,,,,,,
Date,2014-04-28_11h19m27s,,,,,,,,,,,,,,,,,,,,,,,
[Manifests],,,,,,,,,,,,,,,,,,,,,,,,
A,BovineLD_C,,,,,,,,,,,,,,,,,,,,,,,,
[Data]
Sample_ID,Sample_Plate,Sample_Name,Project,AMP_Plate,Sample_Well,SentrixBarcode_A,SentrixPosition_A,Scanner,Date_Scan,Replicate,Parent1,Parent2,Gender,Sample_Source,Study,Subclient,Well on DNA plate,Sex,Family Info,Other_Name,Sire,Dam,Comment
HOITAF12345678903,Master-DLMDNA-2014-1727_1-1,HOITAF12345678903,AIPL,InfWP2014-0380-Reaction1,A01,7996554214,R01C01,,,,,,,Hair,14-177,,B5,Female,,Mona Lisa,HOITAM12345678901,HOITAF12345678902,
HODEUF123456789,Master-DLMDNA-2014-1727_1-1,HODEUF123456789,AIPL,InfWP2014-0380-Reaction1,B01,7996554214,R03C01,,,,,,,Hair,14-177,,C5,Female,,Schoene Maedchen,HOUSAM12345678,HONLDF123456789,
HODEUF123456790,Master-DLMDNA-2014-1727_1-1,HODEUF123456790,AIPL,InfWP2014-0380-Reaction1,C01,7996554214,R05C01,,,,,,,Hair,14-177,,D5,Female,,Tupfen Tulpe,HOUSAM12345679,HONLDF123456790,
[...]
The first 6 required columns must not contain missing data.
File Structure and Description¶
File Section | Required | Description |
[Header] | No | Not used for processing. Provides additional information about the project |
[Manifests] | No | Not used for processing. Provides additional information about the chip |
[Data] | Yes | Used for processing. Identifies where data starts |
Data Section structure and Description¶
[Data] Column Name | Required | Max. Length | Description |
Sample_ID | Yes | 20 bytes | Sample ID. Defined by the Nominator and Lab; they must have an established methodology. Used for processing and storage, must match FinalReport Sample_ID (Required) |
Sample_Plate | Yes | 13 bytes | Plate ID. Used to investigate problems |
Sample_Name | Yes | 18 bytes | Animal ID. Used for processing and storage; must contain the animal ID, but leading zeros may be excluded (Required). Ex: HOUSAF00123456789 = HOUSAF123456789 |
Project | Yes | 12 bytes | Approved Requester ID; the organization arranging for genotyping. Used for processing and storage (Required) |
AMP_Plate | Yes | 14 bytes | Amplification plate ID. Used to investigate problems |
Sample_Well | Yes | 3 bytes | Well ID or coordinates in amplification plate. Used to investigate problems |
SentrixBarcode_A | Yes | 12 bytes | Chip's barcode. Used for processing and storage. Together with SentrixPosition_A, constitutes the unique genotype identification (Required) |
SentrixPosition_A | Yes | 6 bytes | Position on the chip. Used for processing and storage. Together with SentrixBarcode_A, constitutes the unique genotype identification (Required) |
Sample_Source | Yes | 6 bytes | Tissue or DNA source. Used for processing and storage. It may be labeled "DNA_source" or "Tissue_source" (Required). Accepted values: hair, blood, semen, tissue (including extracted DNA), nasal |
Additional Columns | No | N/A | May be included but are not stored |
Notes: Standard GenomeStudio matrix output is expected. This means 1 column per genotype and 1 row per SNP.
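The [Data] section requirements of the SampleSheet can be checked before submission. The following is a minimal sketch, not CDCB's own validation: the function name `check_sample_sheet` and the message wording are hypothetical. It only checks that the required columns exist and are not empty, and it accepts the alternative source labels listed in the table; it does not enforce maximum lengths or the accepted Sample_Source values.

```python
import csv
import io

REQUIRED = ["Sample_ID", "Sample_Plate", "Sample_Name", "Project", "AMP_Plate",
            "Sample_Well", "SentrixBarcode_A", "SentrixPosition_A"]
# The source column may carry any of these labels.
SOURCE_ALIASES = ("Sample_Source", "DNA_source", "Tissue_source")

def check_sample_sheet(text):
    """Return a list of problems found in the [Data] section of a SampleSheet."""
    rows = list(csv.reader(io.StringIO(text)))
    # The [Data] marker identifies where the data starts; the next row is the header.
    start = next((i for i, r in enumerate(rows) if r and r[0].strip() == "[Data]"), None)
    if start is None or start + 1 >= len(rows):
        return ["no [Data] section with a column header found"]
    header = [h.strip() for h in rows[start + 1]]
    problems = [f"missing column {c}" for c in REQUIRED if c not in header]
    source_col = next((c for c in SOURCE_ALIASES if c in header), None)
    if source_col is None:
        problems.append("missing tissue/DNA source column")
    needed = [c for c in REQUIRED if c in header]
    if source_col:
        needed.append(source_col)
    for n, row in enumerate(rows[start + 2:], start=1):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        record = dict(zip(header, row))
        for col in needed:
            if not record.get(col, "").strip():
                problems.append(f"data row {n}: empty {col}")
    return problems
```

An empty list means every required column is present and filled in each data row.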
FinalReport¶
Type of file: TXT, tab delimited (including the header), zipped (.zip)
File name: Year (4 bytes) + Month (2 bytes) + Day (2 bytes) + Sample_set within day + Serial number for sample-set submission/Chip type + "FinalReport"
File name example: 2014042812_50KFinalReport.zip (file contained in the zip file should be "2014042812_50KFinalReport.txt"), where
- 20140428 is the date
- 1 is the submission within date (sequential number)
- 2 is the version within submission (sequential number)
- 50K is the array ID (as detailed in Array ID information - please confirm with CDCB Staff)
- FinalReport is the type of file
Genotype coding: A/B format (use -- for no call)
Example (fake data):
[Header]
GSGT Version 1.8.4
Processing Date 04/28/2014 9:17 AM
Content BovineLD_C.bpm
Num SNPs 6909
Total SNPs 6909
Num Samples 262
Total Samples 264
[Data]
HOITAF12345678903 HODEUF123456789 HODEUF123456790 HODEUF123456791 HOITAF12345678904 […]
ARS-BFGL-BAC-10975 AB AB AB AB AB […]
ARS-BFGL-BAC-11025 AB AA AB AA AB […]
[...]
Notes:
- use "AB" for SNP genotype … 1 column/animal … 1st row is sample ID, and following rows are genotypes with 1 row/SNP
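The matrix layout described in these notes can be read and checked with a short script. This is a sketch under stated assumptions rather than CDCB's parser: the function name `parse_final_report` is hypothetical, the allowed calls are assumed to be AA, AB, BB and -- (the text only states A/B format with -- for no call), and a leading empty cell in the sample ID row is tolerated in case the export includes one.

```python
# Assumption: these are the only valid genotype codes in A/B format.
ALLOWED_CALLS = {"AA", "AB", "BB", "--"}

def parse_final_report(text):
    """Return (sample IDs, {SNP name: calls}, problems) from a tab-delimited FinalReport."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    start = next((i for i, ln in enumerate(lines) if ln.strip() == "[Data]"), None)
    if start is None or start + 1 >= len(lines):
        raise ValueError("no [Data] section with a sample ID row found")
    # First row after [Data] holds the sample IDs, one column per animal.
    samples = lines[start + 1].split("\t")
    if samples[0] == "":
        samples = samples[1:]
    genotypes, problems = {}, []
    # Each following row is one SNP: its name, then one call per sample.
    for ln in lines[start + 2:]:
        snp, *calls = ln.split("\t")
        genotypes[snp] = calls
        if len(calls) != len(samples):
            problems.append(f"{snp}: expected {len(samples)} calls, found {len(calls)}")
        for sample, call in zip(samples, calls):
            if call not in ALLOWED_CALLS:
                problems.append(f"{snp}: invalid call {call!r} for sample {sample}")
    return samples, genotypes, problems
```

The returned sample IDs can then be compared against the Sample_ID column of the matching SampleSheet, since the two must agree.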