Help: Common Sequence Formats

The sequence formats commonly used in BxSeqTools are GenBank, FASTA, and Free Text.

GenBank Format:

See detailed explanation, fields, feature keys, feature qualifiers, or feature locations

Example:

LOCUS       AAURRA                   118 bp ss-rRNA    linear       16-JUN-1986
DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.
ACCESSION   K03160
VERSION     K03160.1  GI:173593
KEYWORDS    5S ribosomal RNA; ribosomal RNA.
SOURCE      A.auricula-judae (mushroom) ribosomal RNA.
  ORGANISM  Auricularia auricula-judae
            Eukaryota; Fungi; Eumycota; Basidiomycotina; Phragmobasidiomycetes;
            Heterobasidiomycetidae; Auriculariales; Auriculariaceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.
  TITLE     The nucleotide sequences of the 5S rRNAs of four mushrooms and
            their use in studying the phylogenetic position of basidiomycetes
            among the eukaryotes
  JOURNAL   Nucleic Acids Res. 11, 2871-2880 (1983)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
ORIGIN      
        1 ATCCACGGCC ATAGGACTCT GAAAGCACTG CATCCCGTCC GATCTGCAAA GTTAACCAGA
       61 GTACCGCCCA GTTAGTACCA CGGTGGGGGA CCACGCGGGA ATCCTGGGTG CTGTGGTT
//
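
As an illustration of how the record above is laid out, the sequence itself sits between the ORIGIN line and the terminating "//" line, with a base-count column and spaces between blocks of ten bases. The following is a minimal Python sketch of pulling that sequence out. It is not how BxSeqTools reads GenBank files; the function name genbank_sequence is hypothetical, and the record string is shortened to the lines the sketch needs.

```python
def genbank_sequence(record_text):
    """Return the sequence found between the ORIGIN line and the // line."""
    in_origin = False
    chunks = []
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Keep letters only: this drops the base-count column and the
            # spaces that separate the blocks of ten bases.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(chunks).upper()


# Shortened copy of the example record (header fields omitted).
record = """LOCUS       AAURRA                   118 bp ss-rRNA    linear       16-JUN-1986
ORIGIN
        1 ATCCACGGCC ATAGGACTCT GAAAGCACTG CATCCCGTCC GATCTGCAAA GTTAACCAGA
       61 GTACCGCCCA GTTAGTACCA CGGTGGGGGA CCACGCGGGA ATCCTGGGTG CTGTGGTT
//
"""

seq = genbank_sequence(record)
print(len(seq))  # 118, matching the length on the LOCUS line
```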

FASTA Format:

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.

The description line starts with a greater than symbol (">").
The word immediately following the greater than symbol (">") is the "ID" (name) of the sequence; the rest of the line is the description.
The "ID" and the description are optional.
A sequence ends where another greater than symbol (">") appears at the beginning of a line, which starts the next sequence.

The following example contains three sequences
(Example1, Example2, and hCdk9, whose ID is gi|17017983|ref|NM_001261.2|):

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTV
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS
>gi|17017983|ref|NM_001261.2| Homo sapiens cyclin-dependent kinase 9
CGCCCGCCGGAGGGGCCTGGAGTGCGGCGGCGGCGGGACCCGGAGCAGGAGCGGCGGCAGC
AGCGACTGGGGGCGGCGGCGGCGCGTTGGAGGCGGCCATGGCAAAGCAGTACGACTCGGTG
GAGTGCCCTTTTTGTGATGAAGTTTCCAAATACGAGAAGCTCGCCAAGATCGGCCAAGGCA
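
The FASTA rules above can be sketched as a small Python reader. This is a minimal illustration, not the parser BxSeqTools uses; the function name parse_fasta is hypothetical. It assumes the first word after ">" is always the ID (so a header that holds only a description would have its first word read as the ID) and it ignores any sequence lines that appear before the first ">" line. The second record in the sample input is made up to show a header with an ID and no description.

```python
def parse_fasta(text):
    """Return a list of (seq_id, description, sequence) tuples."""
    records = []
    current = None  # [seq_id, description, list of sequence lines]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" at the start of a line closes the previous record
            # and opens a new one.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            current = [seq_id, description, []]
            records.append(current)
        elif current is not None:
            current[2].append(line)
    return [(i, d, "".join(lines)) for i, d, lines in records]


example = """>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS
>toy1
ACGT
ACGT
"""

records = parse_fasta(example)
for seq_id, description, sequence in records:
    print(seq_id, "|", description, "|", sequence)
```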

Free Text Format:

Example (most BxSeqTools programs automatically remove non-IUPAC characters):

        121 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaCCgccca gttagtaccGGGa cggtggggga ccagga atcctgggtg ctgtggtt
//

Sequences are expected to use the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and mapped to upper-case; a single hyphen or dash can represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, remove any numerical digits from the query sequence or replace them with appropriate letter codes (e.g., N for an unknown nucleic acid residue or X for an unknown amino acid residue).

The nucleic acid codes supported are:

        A --> adenosine           M --> A C (amino)
        C --> cytidine            S --> G C (strong)
        G --> guanosine           W --> A T (weak)
        T --> thymidine           B --> G T C
        U --> uridine             D --> G A T
        R --> G A (purine)        H --> A C T
        Y --> T C (pyrimidine)    V --> G C A
        K --> G T (keto)          N --> A G C T (any)
                                  -  gap of indeterminate length
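
The table above can be transcribed into a lookup, which also gives one way to clean free-text nucleic acid input: map to upper-case and keep only the listed codes. The Python sketch below is an assumption about what such cleaning could look like, not the actual BxSeqTools behavior; the names NUCLEIC_CODES, clean_nucleic, and expand are hypothetical.

```python
# Ambiguity table transcribed from the list above
# (code -> bases the code can stand for).
NUCLEIC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def clean_nucleic(text, keep_gaps=True):
    """Upper-case the input and drop every character that is not a listed code."""
    allowed = set(NUCLEIC_CODES)
    if keep_gaps:
        allowed.add("-")
    return "".join(ch for ch in text.upper() if ch in allowed)


def expand(code):
    """Return the set of bases that an upper- or lower-case code can represent."""
    return set(NUCLEIC_CODES[code.upper()])


print(clean_nucleic("       61 gtaCCgccca gttagtacca\n//"))  # GTACCGCCCAGTTAGTACCA
print(sorted(expand("r")))  # ['A', 'G']
```

Note that a filter like this keeps any letter that happens to be a valid code, so stray words in the input would contribute letters to the result; it only removes digits, whitespace, punctuation, and letters outside the table.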

The accepted amino acid codes are:

    A  alanine                         P  proline
    B  aspartate or asparagine         Q  glutamine
    C  cysteine                        R  arginine
    D  aspartate                       S  serine
    E  glutamate                       T  threonine
    F  phenylalanine                   U  selenocysteine
    G  glycine                         V  valine
    H  histidine                       W  tryptophan
    I  isoleucine                      Y  tyrosine
    K  lysine                          Z  glutamate or glutamine
    L  leucine                         X  any
    M  methionine                      *  translation stop
    N  asparagine                      -  gap of indeterminate length
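
A quick way to check a protein query against the table above is to report every character that is not an accepted code. The Python sketch below is illustrative only; AMINO_CODES and invalid_residues are hypothetical names, and positions are reported 1-based as an arbitrary choice.

```python
# Accepted amino acid letters from the table above, plus "*" and "-".
AMINO_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def invalid_residues(sequence):
    """Return (position, character) pairs for characters outside the table.

    Lower-case letters are mapped to upper-case before checking.
    Positions are 1-based.
    """
    return [
        (pos, ch)
        for pos, ch in enumerate(sequence.upper(), start=1)
        if ch not in AMINO_CODES
    ]


print(invalid_residues("HITREPLKHIPKERYRGTND"))  # []
print(invalid_residues("mkj1"))                  # [(3, 'J'), (4, '1')]
```

Digits show up in the report, which matches the advice above to remove them or replace them with X before submitting a request.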