CDCB Accepted genotype file formats

Important notice

Files shared with CDCB must NOT contain:
  • special characters
  • hexadecimal characters
  • values in scientific notation
  • characters from non-Latin alphabets (Chinese, Japanese, Cyrillic, etc.)
  • letters with accents

Please check your files before submission to identify and correct such issues.
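
A pre-submission check along these lines might look like the sketch below. It is only an illustration: the function name `find_suspect_lines` is hypothetical, "special", non-Latin, and accented characters are all read as "anything outside printable ASCII", and hexadecimal characters are not checked because this page does not define them.

```python
import re

# Assumption: special, non-Latin, and accented characters are treated as
# anything outside printable ASCII (tab is allowed for tab-delimited files).
NON_PRINTABLE_ASCII = re.compile(r"[^\x20-\x7E\t]")
# Stand-alone numbers such as 7.99655E+09, which spreadsheets often
# produce when they rewrite a long numeric barcode.
SCIENTIFIC_NOTATION = re.compile(
    r"(?<![A-Za-z0-9])\d+(?:\.\d+)?[eE][+-]?\d+(?![A-Za-z0-9])"
)


def find_suspect_lines(lines):
    """Return (line_number, reason) pairs for lines that need a manual look."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if NON_PRINTABLE_ASCII.search(text):
            problems.append((number, "non-ASCII or control character"))
        if SCIENTIFIC_NOTATION.search(text):
            problems.append((number, "scientific notation"))
    return problems
```

The function only flags lines; deciding how to correct each one is left to the submitter.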

SampleSheet

Type of file: CSV, comma delimited, zipped (.zip)
File name: Year (4 bytes) + Month (2 bytes) + Day (2 bytes) + Sample_set within day + Serial number for sample-set submission + "_" + Chip type + "SampleSheet"
File name example: 2014042812_50KSampleSheet.zip (file contained in the zip file should be "2014042812_50KSampleSheet.csv")
  • 20140428 is the date
  • 1 is the submission within date (sequential number)
  • 2 is the version within submission (sequential number)
  • 50K is the array ID (as detailed in Array ID information - please confirm with CDCB Staff)
  • SampleSheet is the type of file
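
The naming pattern in the example can be checked mechanically. The following is a minimal sketch, not an official CDCB validator: `parse_submission_name` is a hypothetical helper, and it assumes the submission and version numbers are one digit each, as in the example (the page does not state their width). The same layout is used for FinalReport file names, so the pattern accepts both suffixes.

```python
import re
from datetime import date

# Assumption: submission and version are a single digit each, and the
# array ID contains only letters and digits (for example "50K").
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})(?P<submission>\d)(?P<version>\d)_"
    r"(?P<array_id>[A-Za-z0-9]+?)(?P<file_type>SampleSheet|FinalReport)\.zip$"
)


def parse_submission_name(file_name):
    """Split a file name into its parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    parts = match.groupdict()
    stamp = parts["date"]
    try:
        # Reject impossible calendar dates such as month 13.
        date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    except ValueError:
        return None
    return parts
```

For the example name, `parse_submission_name("2014042812_50KSampleSheet.zip")` yields the date `20140428`, submission `1`, version `2`, array ID `50K`, and file type `SampleSheet`.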

Content: Consecutive commas indicate a null column
Example: Fake data is provided as example

[Header],,,,,,,,,,,,,,,,,,,,,,,,
Investigator Name,Chris Doe,,,,,,,,,,,,,,,,,,,,,,, 
Project Name,14-177,,,,,,,,,,,,,,,,,,,,,,,
Experiment Name,14-177_Shipment_20140415-171700-1,,,,,,,,,,,,,,,,,,,,,,,
Date,2014-04-28_11h19m27s,,,,,,,,,,,,,,,,,,,,,,,
[Manifests],,,,,,,,,,,,,,,,,,,,,,,,
A,BovineLD_C,,,,,,,,,,,,,,,,,,,,,,,,
[Data]
Sample_ID,Sample_Plate,Sample_Name,Project,AMP_Plate,Sample_Well,SentrixBarcode_A,SentrixPosition_A,Scanner,Date_Scan,Replicate,Parent1,Parent2,Gender,Sample_Source,Study,Subclient,Well on DNA plate,Sex,Family Info,Other_Name,Sire,Dam,Comment
HOITAF12345678903,Master-DLMDNA-2014-1727_1-1,HOITAF12345678903,AIPL,InfWP2014-0380-Reaction1,A01,7996554214,R01C01,,,,,,,Hair,14-177,,B5,Female,,Mona Lisa,HOITAM12345678901,HOITAF12345678902,
HODEUF123456789,Master-DLMDNA-2014-1727_1-1,HODEUF123456789,AIPL,InfWP2014-0380-Reaction1,B01,7996554214,R03C01,,,,,,,Hair,14-177,,C5,Female,,Schoene Maedchen,HOUSAM12345678,HONLDF123456789,
HODEUF123456790,Master-DLMDNA-2014-1727_1-1,HODEUF123456790,AIPL,InfWP2014-0380-Reaction1,C01,7996554214,R05C01,,,,,,,Hair,14-177,,D5,Female,,Tupfen Tulpe,HOUSAM12345679,HONLDF123456790,
[...]


None of the nine required columns (see "Data Section structure and Description" below) can have missing data.

File Structure and Description

File Section | Required | Description
[Header] | No | Not used for processing. Provides Project's additional information
[Manifests] | No | Not used for processing. Provides Chip's additional information
[Data] | Yes | Used for processing. Identifies where data starts
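
As an illustration of how the three sections can be told apart when reading a SampleSheet, here is a minimal sketch. The function name `split_sections` is hypothetical, and it assumes a section marker is any line whose first comma-separated field is wrapped in square brackets.

```python
def split_sections(lines):
    """Group lines under the most recent [Section] marker."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # A marker may be followed by empty columns, e.g. "[Header],,,,".
        first_field = line.split(",", 1)[0].strip()
        if first_field.startswith("[") and first_field.endswith("]"):
            current = first_field[1:-1]
            sections[current] = []
        elif current is not None and line.strip(",").strip():
            # Keep non-empty lines; skip blank or comma-only lines.
            sections[current].append(line)
    return sections
```

Only the lines stored under `"Data"` matter for processing; the other two sections can be ignored or kept for reference.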

Data Section structure and Description

[Data] Column Name | Required | Max. Length | Description
Sample_ID | Yes | 20 bytes | Sample ID. Defined by the Nominator and Lab; they must have an established methodology. Used for processing and storage; must match the FinalReport Sample_ID.
Sample_Plate | Yes | 13 bytes | Plate ID. Used to investigate problems.
Sample_Name | Yes | 18 bytes | Animal ID. Used for processing and storage; must contain the animal ID, but leading zeros may be excluded. Ex: HOUSAF00123456789 = HOUSAF123456789
Project | Yes | 12 bytes | Approved Requester ID; the organization arranging for genotyping. Used for processing and storage.
AMP_Plate | Yes | 14 bytes | Amplification plate ID. Used to investigate problems.
Sample_Well | Yes | 3 bytes | Well ID or coordinates in the amplification plate. Used to investigate problems.
SentrixBarcode_A | Yes | 12 bytes | Chip's barcode. Used for processing and storage. Together with SentrixPosition_A, constitutes the unique genotype identification.
SentrixPosition_A | Yes | 6 bytes | Position on the chip. Used for processing and storage. Together with SentrixBarcode_A, constitutes the unique genotype identification.
Sample_Source | Yes | 6 bytes | Tissue or DNA source. Used for processing and storage. It may be labeled "DNA_source" or "Tissue_source". Accepted values: hair, blood, semen, tissue (including extracted DNA), nasal
Additional Columns | No | N/A | May be included but are not stored.
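
The required-column rules in the table above can be checked before submission. The sketch below is an assumption-laden illustration rather than CDCB's own validation: `check_data_rows` is a hypothetical name, Sample_Source is compared case-insensitively (the example uses "Hair" while the accepted values are lower case), and the Max. Length limits are left unchecked to keep the example short.

```python
import csv

REQUIRED_COLUMNS = [
    "Sample_ID", "Sample_Plate", "Sample_Name", "Project", "AMP_Plate",
    "Sample_Well", "SentrixBarcode_A", "SentrixPosition_A", "Sample_Source",
]
# The table allows these alternative labels for Sample_Source.
SOURCE_ALIASES = ("Sample_Source", "DNA_source", "Tissue_source")
ACCEPTED_SOURCES = {"hair", "blood", "semen", "tissue", "nasal"}


def check_data_rows(data_lines):
    """Check the lines after [Data]: one header line, then one line per sample."""
    rows = list(csv.reader(data_lines))
    if not rows:
        return ["no header line after [Data]"]
    header, records = rows[0], rows[1:]
    # Map whichever Sample_Source label is present to the canonical name.
    names = ["Sample_Source" if n in SOURCE_ALIASES else n for n in header]
    problems = [f"missing column {c}" for c in REQUIRED_COLUMNS if c not in names]
    if problems:
        return problems
    for row_number, record in enumerate(records, start=1):
        row = dict(zip(names, record))
        for column in REQUIRED_COLUMNS:
            if not row.get(column, "").strip():
                problems.append(f"row {row_number}: {column} is empty")
        # Assumption: "Hair" and "hair" are treated as the same value.
        source = row.get("Sample_Source", "").strip().lower()
        if source and source not in ACCEPTED_SOURCES:
            problems.append(f"row {row_number}: unexpected Sample_Source {source!r}")
    return problems
```

An empty list means no problems were found; row numbers count sample rows, not file lines.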

Notes: Standard GenomeStudio matrix output is expected. This means 1 column per genotype and 1 row per SNP.

FinalReport

Type of file: TXT, tab delimited (including the header), zipped (.zip)
File name: Year (4 bytes) + Month (2 bytes) + Day (2 bytes) + Sample_set within day + Serial number for sample-set submission + "_" + Chip type + "FinalReport"
File name example: 2014042812_50KFinalReport.zip (file contained in the zip file should be "2014042812_50KFinalReport.txt"), where
  • 20140428 is the date
  • 1 is the submission within date (sequential number)
  • 2 is the version within submission (sequential number)
  • 50K is the array ID (as detailed in Array ID information - please confirm with CDCB Staff)
  • FinalReport is the type of file
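
The requirement that the zipped file carries the same base name as the archive can be verified with Python's standard `zipfile` module. This is a sketch under one assumption that the page only implies: the archive holds exactly one file. The helper name `zip_member_matches` is hypothetical.

```python
import zipfile


def zip_member_matches(zip_source, zip_name, extension=".txt"):
    """True if the archive holds exactly one file named like the zip itself.

    zip_source is a path or a binary file object; zip_name is the archive's
    file name, e.g. "2014042812_50KFinalReport.zip".
    """
    if not zip_name.endswith(".zip"):
        return False
    expected = zip_name[: -len(".zip")] + extension
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: a single member is expected, with no folders inside.
        return archive.namelist() == [expected]
```

Passing `extension=".csv"` applies the same check to a SampleSheet archive.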

Genotype coding: A/B format (use -- for no call)
Example: (note that fake data is provided as example).

[Header]
GSGT Version 1.8.4
Processing Date 04/28/2014 9:17 AM
Content BovineLD_C.bpm
Num SNPs 6909
Total SNPs 6909
Num Samples 262
Total Samples 264
[Data]
                   HOITAF12345678903    HODEUF123456789    HODEUF123456790    HODEUF123456791    HOITAF12345678904  [...]
ARS-BFGL-BAC-10975   AB   AB   AB   AB   AB   [...]
ARS-BFGL-BAC-11025   AB   AA   AB   AA   AB   [...]
[...]
Notes:
  • use "AB" for SNP genotype … 1 column/animal … 1st row is sample ID, and following rows are genotypes with 1 row/SNP
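
To make the matrix layout concrete, the sketch below reads a tab-delimited FinalReport into one dictionary per animal. It is an illustration only: `read_genotype_matrix` is a hypothetical name, the first cell of the sample ID row is assumed to be empty (as the example suggests), and the accepted calls are assumed to be AA, AB, BB, and -- for no call.

```python
# Assumption: "BA" is not an accepted spelling of the heterozygote.
ALLOWED_CALLS = {"AA", "AB", "BB", "--"}


def read_genotype_matrix(lines):
    """Parse a tab-delimited FinalReport into {sample_id: {snp_name: call}}."""
    iterator = iter(lines)
    # Skip the [Header] block; genotypes start after the [Data] marker.
    for line in iterator:
        if line.strip() == "[Data]":
            break
    else:
        raise ValueError("no [Data] section found")
    header = next(iterator, None)
    if header is None:
        raise ValueError("no sample ID row after [Data]")
    # The first cell of the sample ID row is blank (SNP name column).
    sample_ids = header.rstrip("\r\n").split("\t")[1:]
    genotypes = {sample_id: {} for sample_id in sample_ids}
    for line in iterator:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        snp_name, calls = fields[0], fields[1:]
        if len(calls) != len(sample_ids):
            raise ValueError(
                f"{snp_name}: expected {len(sample_ids)} calls, found {len(calls)}"
            )
        for sample_id, call in zip(sample_ids, calls):
            if call not in ALLOWED_CALLS:
                raise ValueError(f"{snp_name}: unexpected call {call!r} for {sample_id}")
            genotypes[sample_id][snp_name] = call
    return genotypes
```

The keys of the returned dictionary are the FinalReport sample IDs, so they can be compared directly with the Sample_ID column of the matching SampleSheet.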
