GenBank Format Description

GBREL.TXT          Genetic Sequence Data Bank

                         February 15 2004



               NCBI-GenBank Flat File Release 140.0



                    Distribution Release Notes



==========================================================================

TABLE OF CONTENTS

==========================================================================



3. FILE FORMATS



3.4 Sequence Entry Files

     3.4.1 File Organization

     3.4.2  Entry Organization

     3.4.3 Sample Sequence Data File

     3.4.4 LOCUS Format

     3.4.5 DEFINITION Format

          3.4.5.1 DEFINITION Format for NLM Entries

     3.4.6 ACCESSION Format

     3.4.7 VERSION Format

     3.4.8 KEYWORDS Format

     3.4.9 SEGMENT Format

     3.4.10 SOURCE Format

     3.4.11 REFERENCE Format

     3.4.12 FEATURES Format

          3.4.12.1 Feature Key Names

          3.4.12.2 Feature Location

          3.4.12.3  Feature Qualifiers

          3.4.12.4 Cross-Reference Information

          3.4.12.5 Feature Table Examples

     3.4.13 ORIGIN Format

     3.4.14 SEQUENCE Format



==========================================================================



3. FILE FORMATS



3.4 Sequence Entry Files



  GenBank releases contain one or more sequence entry data files, one

for each "division" of GenBank.



3.4.1 File Organization



  Each of these files has the same format and consists of two parts:

header information (described in section 3.1) and sequence entries for

that division (described in the following sections).



3.4.2  Entry Organization



  In the second portion of a sequence entry file (containing the

sequence entries for that division), each record (line) consists of

two parts. The first part is found in positions 1 to 10 and may

contain:



1. A keyword, beginning in column 1 of the record (e.g., REFERENCE is

a keyword).



2. A subkeyword beginning in column 3, with columns 1 and 2 blank

(e.g., AUTHORS is a subkeyword of REFERENCE), or a subkeyword beginning

in column 4, with columns 1, 2, and 3 blank (e.g., PUBMED is a

subkeyword of REFERENCE).



3. Blank characters, indicating that this record is a continuation of

the information under the keyword or subkeyword above it.



4. A code, beginning in column 6, indicating the nature of an entry

(feature key) in the FEATURES table; these codes are described in

Section 3.4.12.1 below.



5. A number, ending in column 9 of the record. This number occurs in

the portion of the entry describing the actual nucleotide sequence and

designates the numbering of sequence positions.



6. Two slashes (//) in positions 1 and 2, marking the end of an entry.



  The second part of each sequence entry record contains the information

appropriate to its keyword, in positions 13 to 80 for keywords and

positions 11 to 80 for the sequence.
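
  The column rules above can be sketched in Python as a small classifier.
This is an illustration only: the function name split_record and the
returned labels are hypothetical, and the feature-key widths (columns 6
to 20, with the location from column 22) are taken from Section 3.4.12.

```python
def split_record(line):
    """Classify one record of a sequence entry as (kind, label, data)."""
    line = line.rstrip("\r\n")
    if line.startswith("//"):
        # Two slashes in positions 1 and 2 end the entry.
        return ("end", "//", "")
    head = line[:10]                      # positions 1 to 10
    label = head.strip()
    if not label:
        # Blank first part: continuation of the record above.
        return ("continuation", "", line[10:].strip())
    if label.isdigit():
        # Sequence record: the position number ends in column 9.
        return ("sequence", label, line[10:].strip())
    indent = len(head) - len(head.lstrip(" "))
    if indent == 0:
        return ("keyword", label, line[10:].strip())
    if indent in (2, 3):
        # Subkeywords begin in column 3 or column 4.
        return ("subkeyword", label, line[10:].strip())
    if indent == 5:
        # Feature key begins in column 6 and may run to column 20.
        return ("feature", line[5:20].strip(), line[20:].strip())
    return ("unknown", label, line[10:].strip())
```

  Digits are tested before indentation because a four-digit sequence
position also begins in column 6, where a feature key would start.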



  The following is a brief description of each entry field. Detailed

information about each field may be found in Sections 3.4.4 to 3.4.14.



LOCUS	- A short mnemonic name for the entry, chosen to suggest the

sequence's definition. Mandatory keyword/exactly one record.



DEFINITION	- A concise description of the sequence. Mandatory

keyword/one or more records.



ACCESSION	- The primary accession number is a unique, unchanging

code assigned to each entry. (Please use this code when citing

information from GenBank.) Mandatory keyword/one or more records.



VERSION		- A compound identifier consisting of the primary

accession number and a numeric version number associated with the

current version of the sequence data in the record. This is followed

by an integer key (a "GI") assigned to the sequence by NCBI.

Mandatory keyword/exactly one record.



NID		- An alternative method of presenting the NCBI GI

identifier (described above). The NID is obsolete and was removed

from the GenBank flatfile format in December 1999.



KEYWORDS	- Short phrases describing gene products and other

information about an entry. Mandatory keyword in all annotated

entries/one or more records.



SEGMENT	- Information on the order in which this entry appears in a

series of discontinuous sequences from the same molecule. Optional

keyword (only in segmented entries)/exactly one record.



SOURCE	- Common name of the organism or the name most frequently used

in the literature. Mandatory keyword in all annotated entries/one or

more records/includes one subkeyword.



   ORGANISM	- Formal scientific name of the organism (first line)

and taxonomic classification levels (second and subsequent lines).

Mandatory subkeyword in all annotated entries/two or more records.



REFERENCE	- Citations for all articles containing data reported

in this entry. Includes seven subkeywords and may repeat. Mandatory

keyword/one or more records.



   AUTHORS	- Lists the authors of the citation. Optional

subkeyword/one or more records.



   CONSRTM	- Lists the collective names of consortiums associated

with the citation (e.g., International Human Genome Sequencing Consortium),

rather than individual author names. Optional subkeyword/one or more records.



   TITLE	- Full title of citation. Optional subkeyword (present

in all but unpublished citations)/one or more records.



   JOURNAL	- Lists the journal name, volume, year, and page

numbers of the citation. Mandatory subkeyword/one or more records.



   MEDLINE	- Provides the Medline unique identifier for a

citation. Optional subkeyword/one record.



    PUBMED 	- Provides the PubMed unique identifier for a

citation. Optional subkeyword/one record.



   REMARK	- Specifies the relevance of a citation to an

entry. Optional subkeyword/one or more records.



COMMENT	- Cross-references to other sequence entries, comparisons to

other collections, notes of changes in LOCUS names, and other remarks.

Optional keyword/one or more records/may include blank records.



FEATURES	- Table containing information on portions of the

sequence that code for proteins and RNA sequences and information on

experimentally determined sites of biological significance. Optional

keyword/one or more records.



BASE COUNT	- Summary of the number of occurrences of each base

code in the sequence. Mandatory keyword/exactly one record.



ORIGIN	- Specification of how the first base of the reported sequence

is operationally located within the genome. Where possible, this

includes its location within a larger genetic map. Mandatory

keyword/exactly one record.



	- The ORIGIN line is followed by sequence data (multiple records).



// 	- Entry termination symbol. Mandatory at the end of an

entry/exactly one record.



3.4.3 Sample Sequence Data File



  An example of a complete sequence entry file follows. (This example

has only two entries.) Note that in this example, as throughout the

data bank, numbers in square brackets indicate items in the REFERENCE

list. For example, in ACARR58S, [1] refers to the paper by Mackay, et

al.



1       10        20        30        40        50        60        70       79

---------+---------+---------+---------+---------+---------+---------+---------

GBSMP.SEQ          Genetic Sequence Data Bank

                         15 December 1992



                 GenBank Flat File Release 74.0



                     Structural RNA Sequences



      2 loci,       236 bases, from     2 reported sequences



LOCUS       AAURRA        118 bp ss-rRNA            RNA       16-JUN-1986

DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.

ACCESSION   K03160

VERSION     K03160.1  GI:173593

KEYWORDS    5S ribosomal RNA; ribosomal RNA.

SOURCE      A.auricula-judae (mushroom) ribosomal RNA.

  ORGANISM  Auricularia auricula-judae

            Eukaryota; Fungi; Eumycota; Basidiomycotina; Phragmobasidiomycetes;

            Heterobasidiomycetidae; Auriculariales; Auriculariaceae.

REFERENCE   1  (bases 1 to 118)

  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.

  TITLE     The nucleotide sequences of the 5S rRNAs of four mushrooms and

            their use in studying the phylogenetic position of basidiomycetes

            among the eukaryotes

  JOURNAL   Nucleic Acids Res. 11, 2871-2880 (1983)

FEATURES             Location/Qualifiers

     rRNA            1..118

                     /note="5S ribosomal RNA"

BASE COUNT       27 a     34 c     34 g     23 t

ORIGIN      5' end of mature rRNA.

        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga

       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt

//

LOCUS       ABCRRAA       118 bp ss-rRNA            RNA       15-SEP-1990

DEFINITION  Acetobacter sp. (strain MB 58) 5S ribosomal RNA, complete sequence.

ACCESSION   M34766

VERSION     M34766.1  GI:173603

KEYWORDS    5S ribosomal RNA.

SOURCE      Acetobacter sp. (strain MB 58) rRNA.

  ORGANISM  Acetobacter sp.

            Prokaryotae; Gracilicutes; Scotobacteria; Aerobic rods and cocci;

            Azotobacteraceae.

REFERENCE   1  (bases 1 to 118)

  AUTHORS   Bulygina,E.S., Galchenko,V.F., Govorukhina,N.I., Netrusov,A.I.,

            Nikitin,D.I., Trotsenko,Y.A. and Chumakov,K.M.

  TITLE     Taxonomic studies of methylotrophic bacteria by 5S ribosomal RNA

            sequencing

  JOURNAL   J. Gen. Microbiol. 136, 441-446 (1990)

FEATURES             Location/Qualifiers

     rRNA            1..118

                     /note="5S ribosomal RNA"

BASE COUNT       27 a     40 c     32 g     17 t      2 others

ORIGIN      

        1 gatctggtgg ccatggcggg agcaaatcag ccgatcccat cccgaactcg gccgtcaaat

       61 gccccagcgc ccatgatact ctgcctcaag gcacggaaaa gtcggtcgcc gccagayy

//

---------+---------+---------+---------+---------+---------+---------+---------

1       10        20        30        40        50        60        70       79



Example 9. Sample Sequence Data File
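
  A file laid out like the sample above could be read with a sketch along
the following lines. The function names (iter_entries, entry_sequence,
base_count) are hypothetical, and the code assumes only what the sample
shows: header text before the first LOCUS record, '//' ending each entry,
and sequence records after ORIGIN with bases starting at position 11.

```python
from collections import Counter


def iter_entries(lines):
    """Yield one list of records per entry, skipping the file header."""
    entry = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("//"):
            # Entry termination symbol: hand back what was collected.
            if entry:
                yield entry
            entry = []
        elif entry or line.startswith("LOCUS"):
            # Ignore everything before the first LOCUS record.
            entry.append(line)


def entry_sequence(entry):
    """Concatenate the bases from the records that follow ORIGIN."""
    bases = []
    in_sequence = False
    for line in entry:
        if line.startswith("ORIGIN"):
            in_sequence = True
        elif in_sequence:
            # Positions 1 to 9 hold the number; bases begin at position 11.
            bases.append(line[10:].replace(" ", ""))
    return "".join(bases)


def base_count(sequence):
    """Count occurrences of each base code, for comparison with BASE COUNT."""
    return Counter(sequence.lower())
```

  Comparing base_count(entry_sequence(entry)) with the BASE COUNT record
is one inexpensive consistency check on a parsed entry.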





3.4.4 LOCUS Format



  The items of information contained in the LOCUS record are always

found in fixed positions. The locus name (or entry name), which is

always sixteen characters or less, begins in position 13. The locus name

is designed to help group entries with similar sequences: the first

three characters usually designate the organism; the fourth and fifth

characters can be used to show other group designations, such as gene

product; for segmented entries the last character is one of a series

of sequential integers.



  The number of bases or base pairs in the sequence ends in position 40.

The letters `bp' are in positions 42 to 43. Positions 45 to 47 provide

the number of strands of the sequence. Positions 48 to 53 indicate the

type of molecule sequenced. Topology of the molecule is indicated in

positions 56 to 63.



  GenBank sequence entries are divided among many different

'divisions'. Each entry's division is specified by a three-letter code

in positions 65 to 67. See Section 3.3 for an explanation of division

codes.



  Positions 69 to 79 of the record contain the date the entry was

entered or underwent any substantial revisions, such as the addition

of newly published data, in the form dd-MMM-yyyy.



The detailed format of the LOCUS line is as follows:



Positions  Contents

---------  --------

01-05      'LOCUS'

06-12      spaces

13-28      Locus name

29-29      space

30-40      Length of sequence, right-justified

41-41      space

42-43      bp

44-44      space

45-47      spaces, ss- (single-stranded), ds- (double-stranded), or

           ms- (mixed-stranded)

48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), 

           mRNA (messenger RNA), uRNA (small nuclear RNA), snRNA,

           snoRNA. Left justified.

54-55      spaces

56-63      'linear' followed by two spaces, or 'circular'

64-64      space

65-67      The division code (see Section 3.3)

68-68      space

69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)



  Although each of these data values can be found at column-specific

positions, we encourage those who parse the contents of the LOCUS

line to use a token-based approach. This will prevent the need for

software changes if the spacing of the data values ever has to be

modified.
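
  A token-based reading of the LOCUS line might look like the sketch
below. The function name parse_locus and the dictionary keys are
hypothetical. The sketch treats the topology token as optional, because
some of the sample LOCUS lines in this document omit it, and it decodes
the month by hand so that it does not depend on the system locale.

```python
from datetime import date

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def parse_locus(line):
    """Split a LOCUS line on whitespace instead of fixed columns."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unrecognized LOCUS line: {line!r}")
    name, length = tokens[1], int(tokens[2])
    rest = tokens[4:]

    # The strandedness prefix, when present, is attached to the molecule type.
    molecule = rest.pop(0)
    strands = ""
    for prefix in ("ss-", "ds-", "ms-"):
        if molecule.startswith(prefix):
            strands, molecule = prefix, molecule[len(prefix):]
            break

    # Topology is treated as optional (an assumption of this sketch).
    topology = rest.pop(0) if rest[0] in ("linear", "circular") else None
    if len(rest) != 2:
        raise ValueError(f"unexpected trailing tokens in: {line!r}")
    division, date_text = rest

    day, month, year = date_text.split("-")
    return {
        "name": name,
        "length": length,
        "strands": strands,
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date(int(year), MONTHS.index(month.upper()) + 1, int(day)),
    }
```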



3.4.5 DEFINITION Format



  The DEFINITION record gives a brief description of the sequence,

proceeding from general to specific. It starts with the common name of

the source organism, then gives the criteria by which this sequence is

distinguished from the remainder of the source genome, such as the

gene name and what it codes for, or the protein name and mRNA, or some

description of the sequence's function (if the sequence is

non-coding). If the sequence has a coding region, the description may

be followed by a completeness qualifier, such as cds (complete coding

sequence). There is no limit on the number of lines that may be part

of the DEFINITION.  The last line must end with a period.



3.4.5.1 DEFINITION Format for NLM Entries



  The DEFINITION line for entries derived from journal-scanning at the NLM is

an automatically generated descriptive summary that accompanies each DNA and

protein sequence. It contains information derived from fields in a database 

that summarize the most important attributes of the sequence.  The DEFINITION

lines are designed to supplement the accession number and the sequence itself

as a means of uniquely and completely specifying DNA and protein sequences. The

following are examples of NLM DEFINITION lines:



NADP-specific isocitrate dehydrogenase [swine, mRNA, 1 gene, 1585 nt]



94 kda fiber cell beaded-filament structural protein [rats, lens, mRNA

Partial, 1 gene, 1873 nt]



inhibin alpha {promoter and exons} [mice, Genomic, 1 gene, 1102 nt, segment

1 of 2]



cefEF, cefG=acetyl coenzyme A:deacetylcephalosporin C o-acetyltransferase

[Acremonium chrysogenum, Genomic, 2 genes, 2639 nt]



myogenic factor 3, qmf3=helix-loop-helix protein [Japanese quails,

embryo, Peptide Partial, 246 aa]





  The first part of the definition line contains information describing

the genes and proteins represented by the molecular sequences.  This can

be gene locus names, protein names and descriptions that replace or augment

actual names.  Gene and gene product are linked by "=".  Any special

identifying terms are presented within braces, such as: {promoter},

{N-terminal}, {EC 2.13.2.4}, {alternatively spliced}, or {3' region}.



  The second part of the definition line is delimited by square brackets, '[]',

and provides details about the biological source, molecule type, and length.

The source is the genus and species or common name, as cited by the author.

Developmental stage, tissue type and strain are included if available.

The molecule types include: Genomic, mRNA, Peptide, and Other Genomic

Material. Genomic sequences are assumed to be partial sequence unless

"Complete" is specified, whereas mRNA and peptide sequences are assumed

to be complete unless "Partial" is noted.



3.4.6 ACCESSION Format



  This field contains a series of six-character and/or eight-character

identifiers called 'accession numbers'. The six-character accession

number format consists of a single uppercase letter, followed by 5 digits.

The eight-character accession number format consists of two uppercase

letters, followed by 6 digits. The 'primary', or first, of the accession

numbers occupies positions 13 to 18 (6-character format) or positions

13 to 20 (8-character format). Subsequent 'secondary' accession numbers

(if present) are separated from the primary, and from each other, by a

single space. In some cases, multiple lines of secondary accession

numbers might be present, starting at position 13.
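
  The two accession formats described above can be checked with a
regular expression. The following sketch uses hypothetical names
(ACCESSION_RE, parse_accession_lines) and recognizes only the
six-character and eight-character forms given in this section. It
expects the ACCESSION record followed by any continuation records.

```python
import re

# One uppercase letter + 5 digits, or two uppercase letters + 6 digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def parse_accession_lines(lines):
    """Return (primary, secondaries) from ACCESSION and continuation records."""
    accessions = []
    for line in lines:
        # Accession numbers start at position 13 and are space-separated.
        accessions.extend(line[12:].split())
    if not accessions:
        raise ValueError("no accession numbers found")
    bad = [acc for acc in accessions if not ACCESSION_RE.match(acc)]
    if bad:
        raise ValueError(f"unrecognized accession format: {bad}")
    return accessions[0], accessions[1:]
```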



  The primary accession number of a GenBank entry provides a stable identifier

for the biological object that the entry represents. Accessions do not change

when the underlying sequence data or associated features change.



  Secondary accession numbers arise for a number of reasons. For example, a

single accession number may initially be assigned to a sequence described in

a publication. If it is later discovered that the sequence must be entered

into the database as multiple entries, each entry would receive a new primary

accession number, and the original accession number would appear as a secondary

accession number on each of the new entries.



3.4.7 VERSION Format



  This line contains two types of identifiers for a GenBank database entry:

a compound accession number and an NCBI GI identifier. 



LOCUS       AF181452     1294 bp    DNA             PLN       12-OCT-1999

DEFINITION  Hordeum vulgare dehydrin (Dhn2) gene, complete cds.

ACCESSION   AF181452

VERSION     AF181452.1  GI:6017929

            ^^^^^^^^^^  ^^^^^^^^^^

            Compound    NCBI GI

            Accession   Identifier

            Number



  A compound accession number consists of two parts: a stable, unchanging

primary-accession number portion (see Section 3.4.6 for a description of

accession numbers), and a sequentially increasing numeric version number.

The accession and version numbers are separated by a period. The initial

version number assigned to a new sequence is one. Compound accessions are

often referred to as "Accession.Version".



  An accession number allows one to retrieve the same biological object in the

database, regardless of any changes that are made to the entry over time. But

those changes can include changes to the sequence data itself, which is of

fundamental importance to many database users. So a numeric version number is

associated with the sequence data in every database entry. If an entry (for

example, AF181452) undergoes two sequence changes, its compound accession

number on the VERSION line would start as AF181452.1 . After the first sequence

change this would become: AF181452.2 . And after the second change: AF181452.3 .



  The NCBI GI identifier of the VERSION line also serves as a method for

identifying the sequence data that has existed for a database entry over

time. GI identifiers are numeric values of one or more digits. Since they

are integer keys, they are less human-friendly than the Accession.Version

system described above. Returning to our example for AF181452, it was

initially assigned GI 6017929. If the sequence changes, a new integer GI will

be assigned, perhaps 7345003 . And after the second sequence change, perhaps

the GI would become 10456892 .
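
  As an illustration, the two identifiers on the VERSION line can be
pulled apart with a single pattern. The names VERSION_RE and
parse_version are hypothetical, and the accession part of the pattern
covers only the two formats described in Section 3.4.6.

```python
import re

VERSION_RE = re.compile(
    r"^VERSION\s+"
    r"(?P<accession>[A-Z]\d{5}|[A-Z]{2}\d{6})"   # primary accession number
    r"\.(?P<version>\d+)"                        # numeric version number
    r"\s+GI:(?P<gi>\d+)\s*$"                     # NCBI GI identifier
)


def parse_version(line):
    """Return (accession, version, gi) from a VERSION record."""
    match = VERSION_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized VERSION line: {line!r}")
    return (
        match.group("accession"),
        int(match.group("version")),
        int(match.group("gi")),
    )
```

  Because the version number increases with each sequence change, two
parsed records for the same accession can be ordered by comparing their
version fields.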



  Why are both these methods for identifying the version of the sequence

associated with a database entry in use? For two reasons:



- Some data sources processed by NCBI for incorporation into its Entrez

  sequence retrieval system do not version their own sequences.



- GIs provide a uniform, integer identifier system for every sequence

  NCBI has processed. Some products and systems derived from (or reliant

  upon) NCBI products and services prefer to use these integer identifiers

  because they can all be processed in the same manner.



GenBank Releases contain only the most recent versions of all sequences

in the database. However, older versions can be obtained via GI-based or

Accession.Version-based queries with NCBI's web-Entrez and network-Entrez

applications. A sequence revision history web page is also available:



	  http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/girevhist



NOTE: All the version numbers for the compound Accession.Version identifier

system were initialized to a value of one in February 1999, when that

system was introduced.



3.4.8 KEYWORDS Format



  The KEYWORDS field does not appear in unannotated entries, but is

required in all annotated entries. Keywords are separated by

semicolons; a "keyword" may be a single word or a phrase consisting of

several words. Each line in the keywords field ends in a semicolon;

the last line ends with a period. If no keywords are included in the

entry, the KEYWORDS record contains only a period.
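
  The punctuation rules above translate into a short parsing sketch. The
function name parse_keywords is hypothetical. The input is the KEYWORDS
record plus any continuation records, with data starting at position 13.

```python
def parse_keywords(lines):
    """Return the list of keywords from KEYWORDS and continuation records."""
    text = " ".join(line[12:].strip() for line in lines)
    if not text.endswith("."):
        raise ValueError("the last KEYWORDS line must end with a period")
    text = text[:-1]                      # drop the terminating period
    # Keywords are separated by semicolons; a keyword may contain spaces.
    return [kw.strip() for kw in text.split(";") if kw.strip()]
```

  A record that contains only a period yields an empty list.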



3.4.9 SEGMENT Format



  The SEGMENT keyword is used when two (or more) entries of known

relative orientation are separated by a short (<10 kb) stretch of DNA.

It is limited to one line of the form `n of m', where `n' is the

segment number of the current entry and `m' is the total number of

segments.
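
  A minimal sketch for reading the `n of m' form follows. The names
SEGMENT_RE and parse_segment are hypothetical, and the range check
(1 <= n <= m) is an added sanity test rather than a rule stated in these
notes.

```python
import re

SEGMENT_RE = re.compile(r"^SEGMENT\s+(\d+)\s+of\s+(\d+)\s*$")


def parse_segment(line):
    """Return (n, m) from a SEGMENT record of the form 'n of m'."""
    match = SEGMENT_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized SEGMENT line: {line!r}")
    number, total = int(match.group(1)), int(match.group(2))
    if not 1 <= number <= total:
        raise ValueError(f"segment {number} is outside 1..{total}")
    return number, total
```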



3.4.10 SOURCE Format



  The SOURCE field consists of two parts. The first part is found after

the SOURCE keyword and contains free-format information including an

abbreviated form of the organism name followed by a molecule type;

multiple lines are allowed, but the last line must end with a period.

The second part consists of information found after the ORGANISM

subkeyword. The formal scientific name for the source organism (genus

and species, where appropriate) is found on the same line as ORGANISM.

The records following the ORGANISM line list the taxonomic

classification levels, separated by semicolons and ending with a

period.



3.4.11 REFERENCE Format



  The REFERENCE field consists of the keyword REFERENCE and the

subkeywords AUTHORS, CONSRTM (optional), TITLE (optional), JOURNAL,

MEDLINE (optional), PUBMED (optional), and REMARK (optional).



  The REFERENCE line contains the number of the particular reference and

(in parentheses) the range of bases in the sequence entry reported in

this citation. Additional prose notes may also be found within the

parentheses. The numbering of the references does not reflect

publication dates or priorities.



  The AUTHORS line lists the authors in the order in which they appear

in the cited article. Last names are separated from initials by a

comma (no space); there is no comma before the final `and'. The list

of authors ends with a period.  The TITLE line is an optional field,

although it appears in the majority of entries. It does not appear in

unpublished sequence data entries that have been deposited directly

into the GenBank data bank, the EMBL Nucleotide Sequence Data Library,

or the DNA Data Bank of Japan. The TITLE field does not end with a

period.
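
  The AUTHORS conventions (a comma with no space between last name and
initials, and no comma before the final `and') suggest a simple way to
split the list. The sketch below uses a hypothetical function name,
parse_authors, and assumes that no author name itself contains ", " or
" and ", which may not hold for every record.

```python
def parse_authors(text):
    """Split an AUTHORS value into (last_name, initials) pairs."""
    text = " ".join(text.split())          # join continuation lines
    head, sep, last = text.rpartition(" and ")
    # Earlier authors are separated by ", "; the final one follows " and ".
    names = head.split(", ") + [last] if sep else [last]
    authors = []
    for name in names:
        last_name, comma, initials = name.partition(",")
        if not comma:
            raise ValueError(f"no comma between name and initials: {name!r}")
        authors.append((last_name, initials))
    return authors
```

  The final period is kept because it belongs to the last author's
initials as well as terminating the list.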



  The JOURNAL line gives the appropriate literature citation for the

sequence in the entry. The word `Unpublished' will appear after the

JOURNAL subkeyword if the data did not appear in the scientific

literature, but was directly deposited into the data bank. For

published sequences the JOURNAL line gives the Thesis, Journal, or

Book citation, including the year of publication, the specific

citation, or In press.



  For Book citations, the JOURNAL line is specially-formatted, and

includes:



	editor name(s)

	book title

	page number(s)

	publisher-name/publisher-location

	year



For example:



LOCUS       AY277550                1440 bp    DNA     linear   BCT 17-JUN-2003

DEFINITION  Stenotrophomonas maltophilia strain CSC13-6 16S ribosomal RNA gene,

            partial sequence.

ACCESSION   AY277550

....

REFERENCE   1  (bases 1 to 1440)

  AUTHORS   Gonzalez,J.M., Laiz,L. and Saiz-Jimenez,C.

  TITLE     Classifying bacterial isolates from hypogean environments:

            Application of a novel fluorimetric method dor the estimation of

            G+C mol% content in microorganisms by thermal denaturation

            temperature

  JOURNAL   (in) Saiz-Jimenez,C. (Ed.);

            MOLECULAR BIOLOGY AND CULTURAL HERITAGE: 47-54;

            A.A. Balkema, The Netherlands (2003)



  The presence of "(in)" signals the fact that the reference is for a book

rather than a journal article. A semi-colon signals the end of the editor

names. The next semi-colon signals the end of the page numbers, and the

colon that immediately *precedes* the page numbers signals the end of the

book title. The publisher name and location are a free-form text string.

Finally, the year appears at the very end of the JOURNAL line, enclosed in

parentheses.
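
  The delimiter rules for Book citations can be sketched as follows. The
function name parse_book_journal and the dictionary keys are
hypothetical. The input is the JOURNAL value with its continuation lines
already gathered into one string, and the year is assumed to be four
digits.

```python
import re


def parse_book_journal(text):
    """Split a Book-style JOURNAL value into its component parts."""
    text = " ".join(text.split())          # collapse continuation lines
    if not text.startswith("(in)"):
        raise ValueError("not a Book citation: '(in)' is missing")
    # The first semicolon ends the editor names.
    editors, _, rest = text[4:].partition(";")
    # The next semicolon ends the page numbers.
    title_pages, _, publisher_year = rest.partition(";")
    # The colon immediately before the page numbers ends the book title.
    title, colon, pages = title_pages.rpartition(":")
    # The year appears at the very end, enclosed in parentheses.
    year = re.search(r"\((\d{4})\)\s*$", publisher_year)
    if not colon or year is None:
        raise ValueError(f"unrecognized Book citation: {text!r}")
    return {
        "editors": editors.strip(),
        "title": title.strip(),
        "pages": pages.strip(),
        "publisher": publisher_year[:year.start()].strip(),
        "year": int(year.group(1)),
    }
```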



  The MEDLINE line provides the National Library of Medicine's Medline

unique identifier for a citation (if known). Medline UIs are 8-digit

numbers.



  The PUBMED line provides the PubMed unique identifier for a citation

(if known). PUBMED ids are numeric, and are record identifiers for article

abstracts in the PubMed database:



       http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed



  Citations in PubMed that do not fall within Medline's scope will have only

a PUBMED identifier. Similarly, citations that *are* in Medline's scope but

which have not yet been assigned Medline UIs will have only a PUBMED identifier.

If a citation is present in both the PubMed and Medline databases, both a

MEDLINE and a PUBMED line will be present.



  The REMARK line is a textual comment that specifies the relevance

of the citation to the entry.



3.4.12 FEATURES Format



  GenBank releases use a feature table format designed jointly by

GenBank, the EMBL Nucleotide Sequence Data Library, and the DNA Data

Bank of Japan. This format is in use by all three databases. The

most complete and accurate Feature Table documentation can be found

on the Web at:



	http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html



  Any discrepancy between the abbreviated feature table description

of these release notes and the complete documentation on the Web

should be resolved in favor of the version at the above URL.



  The Feature Table specification is also available as a printed

document: `The DDBJ/EMBL/GenBank Feature Table: Definition'. Contact

GenBank at the address shown on the first page of these Release Notes

if you would like a copy.



  The feature table contains information about genes and gene products,

as well as regions of biological significance reported in the

sequence, including regions that code for proteins and RNA

sequences. It also enumerates

differences between different reports of the same sequence, and

provides cross-references to other data collections, as described in

more detail below.



  The first line of the feature table is a header that includes the

keyword `FEATURES' and the column header `Location/Qualifiers'. Each

feature consists of a descriptor line containing a feature key and a

location (see sections below for details). If the location does not

fit on this line, a continuation line may follow. If further

information about the feature is required, one or more lines

containing feature qualifiers may follow the descriptor line.



  The feature key begins in column 6 and may be no more than 15

characters in length. The location begins in column 22. Feature

qualifiers begin on subsequent lines at column 22. Location,

qualifier, and continuation lines may extend from column 22 to 80.
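
  The column layout above is enough to group feature table lines into
features. The sketch below uses a hypothetical function name,
parse_feature_table, and takes the lines that follow the FEATURES
header. It treats any column-22 text that starts with a slash as a new
qualifier, which is an assumption: a wrapped qualifier value that happens
to begin with a slash would be misread.

```python
def parse_feature_table(lines):
    """Group feature table lines into dicts of key, location, qualifiers."""
    features = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line[:5].strip():
            raise ValueError(f"not a feature table line: {line!r}")
        key = line[5:20].strip()           # columns 6 to 20
        body = line[21:].strip()           # column 22 onward
        if key:
            # Descriptor line: feature key plus (start of) location.
            features.append({"key": key, "location": body, "qualifiers": []})
        elif not features:
            raise ValueError("continuation line before any feature key")
        elif body.startswith("/"):
            features[-1]["qualifiers"].append(body)
        elif features[-1]["qualifiers"]:
            # Continuation of the previous qualifier's value.
            features[-1]["qualifiers"][-1] += " " + body
        else:
            # Continuation of a location that did not fit on one line.
            features[-1]["location"] += body
    return features
```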



  Feature tables are required, due to the mandatory presence of the

source feature. The sections below provide a brief introduction to

the feature table format.



3.4.12.1 Feature Key Names



  The first column of the feature descriptor line contains the feature

key. It starts at column 6 and can continue to column 20. The list of

valid feature keys is shown below.



  Remember, the most definitive documentation for the feature table can

be found at:



	http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html



allele		Obsolete; see variation feature key

attenuator	Sequence related to transcription termination

C_region	Span of the C immunological feature

CAAT_signal	`CAAT box' in eukaryotic promoters

CDS		Sequence coding for amino acids in protein (includes

		stop codon)

conflict	Independent sequence determinations differ

D-loop      	Displacement loop

D_segment	Span of the D immunological feature

enhancer	Cis-acting enhancer of promoter function

exon		Region that codes for part of spliced mRNA

gene            Region that defines a functional gene, possibly

                including upstream (promoter, enhancer, etc.)

		and downstream control elements, and for which

		a name has been assigned.

GC_signal	`GC box' in eukaryotic promoters

iDNA		Intervening DNA eliminated by recombination

intron		Transcribed region excised by mRNA splicing

J_region	Span of the J immunological feature

LTR		Long terminal repeat

mat_peptide	Mature peptide coding region (does not include stop codon)

misc_binding	Miscellaneous binding site

misc_difference	Miscellaneous difference feature

misc_feature	Region of biological significance that cannot be described

		by any other feature

misc_recomb	Miscellaneous recombination feature

misc_RNA	Miscellaneous transcript feature not defined by other RNA keys

misc_signal	Miscellaneous signal

misc_structure	Miscellaneous DNA or RNA structure

modified_base	The indicated base is a modified nucleotide

mRNA		Messenger RNA

mutation 	Obsolete: see variation feature key

N_region	Span of the N immunological feature

old_sequence	Presented sequence revises a previous version

polyA_signal	Signal for cleavage & polyadenylation

polyA_site	Site at which polyadenine is added to mRNA

precursor_RNA	Any RNA species that is not yet the mature RNA product

prim_transcript	Primary (unprocessed) transcript

primer		Primer binding region used with PCR

primer_bind	Non-covalent primer binding site

promoter	A region involved in transcription initiation

protein_bind	Non-covalent protein binding site on DNA or RNA

RBS		Ribosome binding site

rep_origin	Replication origin for duplex DNA

repeat_region	Sequence containing repeated subsequences

repeat_unit	One repeated unit of a repeat_region

rRNA		Ribosomal RNA

S_region	Span of the S immunological feature

satellite	Satellite repeated sequence

scRNA		Small cytoplasmic RNA

sig_peptide	Signal peptide coding region

snRNA		Small nuclear RNA

source		Biological source of the sequence data represented by

		a GenBank record. Mandatory feature, one or more per record.

		For organisms that have been incorporated within the

		NCBI taxonomy database, an associated /db_xref="taxon:NNNNN"

		qualifier will be present (where NNNNN is the numeric

		identifier assigned to the organism within the NCBI taxonomy

		database).

stem_loop	Hair-pin loop structure in DNA or RNA

STS		Sequence Tagged Site; operationally unique sequence that

		identifies the combination of primer spans used in a PCR assay

TATA_signal	`TATA box' in eukaryotic promoters

terminator	Sequence causing transcription termination

transit_peptide	Transit peptide coding region

transposon	Transposable element (TN)

tRNA 		Transfer RNA

unsure		Authors are unsure about the sequence in this region

V_region	Span of the V immunological feature

variation 	A related population contains stable mutation

- (hyphen)	Placeholder

-10_signal	`Pribnow box' in prokaryotic promoters

-35_signal	`-35 box' in prokaryotic promoters

3'clip		3'-most region of a precursor transcript removed in processing

3'UTR		3' untranslated region (trailer)

5'clip		5'-most region of a precursor transcript removed in processing

5'UTR		5' untranslated region (leader)





3.4.12.2 Feature Location



  The second column of the feature descriptor line designates the

location of the feature in the sequence. The location descriptor

begins at position 22. Several conventions are used to indicate

sequence location.



  Base numbers in location descriptors refer to numbering in the entry,

which is not necessarily the same as the numbering scheme used in the

published report. The first base in the presented sequence is numbered

base 1. Sequences are presented in the 5' to 3' direction.



Location descriptors can be one of the following:



1. A single base;



2. A contiguous span of bases;



3. A site between two bases;



4. A single base chosen from a range of bases;



5. A single base chosen from among two or more specified bases;



6. A joining of sequence spans;



7. A reference to an entry other than the one to which the feature

belongs (i.e., a remote entry), followed by a location descriptor

referring to the remote sequence.



  A site between two residues, such as an endonuclease cleavage site, is

indicated by listing the two bases separated by a caret (e.g., 23^24).



  A single residue chosen from a range of residues is indicated by the

number of the first and last bases in the range separated by a single

period (e.g., 23.79). The symbols < and > indicate that the end point

of the range is beyond the specified base number.



  A contiguous span of bases is indicated by the number of the first and

last bases in the range separated by two periods (e.g., 23..79). The

symbols < and > indicate that the end point of the range is beyond the

specified base number. Starting and ending positions can be indicated

by base number or by one of the operators described below.
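
  The simple descriptors above (single base, span, site between bases,
and one base from a range) can be told apart by their separator. The
following is a minimal sketch, not part of the GenBank distribution; the
names SimpleLocation and parse_simple_location are hypothetical, and
operators and remote entries are deliberately left out.

```python
import re
from typing import NamedTuple


class SimpleLocation(NamedTuple):
    kind: str         # "base", "span", "between", or "one_of_range"
    start: int
    end: int
    start_open: bool  # True when the start is written with '<'
    end_open: bool    # True when the end is written with '>'


# Separators: '..' (span), '^' (site between bases), '.' (one base in a range).
# '..' is listed first so that it is not mistaken for a single period.
_PATTERN = re.compile(r"^(<?)(\d+)(?:(\.\.|\^|\.)(>?)(\d+))?$")
_KINDS = {"..": "span", "^": "between", ".": "one_of_range"}


def parse_simple_location(text: str) -> SimpleLocation:
    """Parse one operator-free location descriptor such as '23..79'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple location descriptor: {text!r}")
    lt, first, sep, gt, last = match.groups()
    start = int(first)
    if sep is None:
        # A lone number designates a single base.
        return SimpleLocation("base", start, start, lt == "<", False)
    return SimpleLocation(_KINDS[sep], start, int(last), lt == "<", gt == ">")
```

  For instance, parse_simple_location("1..>66") reports a span from base
1 to base 66 whose end lies beyond the presented sequence.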



  Operators are prefixes that specify what must be done to the indicated

sequence to locate the feature. The following are the operators

available, along with their most common format and a description.



complement (location): The feature is complementary to the location

indicated. Complementary strands are read 5' to 3'.



join (location, location, .. location): The indicated elements should

be placed end to end to form one contiguous sequence.



order (location, location, .. location): The elements are found in the

specified order in the 5' to 3' direction, but nothing is implied about

the rationality of joining them.
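
  As an illustration of how the complement and join operators act on a
sequence, the sketch below extracts the bases a location designates. It
is an assumption-laden example rather than an official algorithm: the
function names are hypothetical, only spans and single bases within the
local entry are handled, and order() is rejected because the text above
implies nothing about joining its elements.

```python
from typing import List

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def extract(location: str, sequence: str) -> str:
    """Return the bases designated by a location, read 5' to 3'."""
    location = location.strip()
    if location.startswith("order("):
        raise ValueError("order() does not define a joined sequence")
    if location.startswith("complement(") and location.endswith(")"):
        inner = extract(location[len("complement("):-1], sequence)
        # Complement each base, then reverse so the result reads 5' to 3'.
        return inner.translate(_COMPLEMENT)[::-1]
    if location.startswith("join(") and location.endswith(")"):
        pieces = split_top_level(location[len("join("):-1])
        return "".join(extract(piece, sequence) for piece in pieces)
    # Plain span or single base; '<' and '>' markers are ignored here.
    start, _, end = location.replace("<", "").replace(">", "").partition("..")
    first = int(start)
    last = int(end) if end else first
    return sequence[first - 1:last]  # base 1 is index 0
```

  With the sequence "aaccggtt", extract("complement(1..4)", ...) takes
"aacc", complements it, and reverses it so that the result is again
written 5' to 3'.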



3.4.12.3  Feature Qualifiers



  Qualifiers provide additional information about features. They take

the form of a slash (/) followed by a qualifier name and, if

applicable, an equal sign (=) and a qualifier value. Feature

qualifiers begin at column 22.
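
  The slash, name, and optional value can be separated with a few lines
of code. The sketch below is illustrative only; parse_qualifier is a
hypothetical name, and the value is returned exactly as written
(quotation marks or brackets included).

```python
from typing import Optional, Tuple


def parse_qualifier(text: str) -> Tuple[str, Optional[str]]:
    """Split '/name=value' into (name, value); value is None if absent."""
    text = text.strip()
    if not text.startswith("/"):
        raise ValueError(f"qualifier must begin with a slash: {text!r}")
    # Split on the first '=' only, so '=' inside the value is preserved.
    name, sep, value = text[1:].partition("=")
    return name, (value if sep else None)
```

  A qualifier without a value, such as /pseudo, yields ("pseudo", None).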



Qualifiers convey many types of information. Their values can,

therefore, take several forms:



1. Free text;

2. Controlled vocabulary or enumerated values;

3. Citations or reference numbers;

4. Sequences;

5. Feature labels.



  Text qualifier values must be enclosed in double quotation marks. The

text can consist of any printable characters (ASCII values 32-126

decimal). If the text string includes double quotation marks, each set

must be `escaped' by placing a double quotation mark in front of it

(e.g., /note="This is an example of ""escaped"" quotation marks").
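
  The escaping rule amounts to doubling every embedded quotation mark.
A minimal sketch follows; quote_value and unquote_value are hypothetical
helper names, not part of any GenBank tool.

```python
def quote_value(text: str) -> str:
    """Enclose free text in double quotes, doubling embedded quotes."""
    return '"' + text.replace('"', '""') + '"'


def unquote_value(value: str) -> str:
    """Reverse quote_value: strip the outer quotes and undouble the rest."""
    if len(value) < 2 or value[0] != '"' or value[-1] != '"':
        raise ValueError(f"value is not enclosed in double quotes: {value!r}")
    return value[1:-1].replace('""', '"')
```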



  Some qualifiers require values selected from a limited set of choices.

For example, the `/direction' qualifier has only three values `left,'

`right,' or `both.' These are called controlled vocabulary qualifier

values. Controlled qualifier values are not case sensitive; they can

be entered in any combination of upper- and lowercase without changing

their meaning.
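
  Because controlled values are not case sensitive, a reader of the
format may wish to fold case before comparing. The sketch below does so
for the /direction qualifier only; the function name is hypothetical and
the set of values is the one given in the paragraph above.

```python
DIRECTION_VALUES = {"left", "right", "both"}


def normalize_direction(value: str) -> str:
    """Return the lowercase form of a /direction value, or raise."""
    folded = value.strip().lower()
    if folded not in DIRECTION_VALUES:
        raise ValueError(f"not a valid /direction value: {value!r}")
    return folded
```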



  Citation or published reference numbers for the entry should be

enclosed in square brackets ([]) to distinguish them from other

numbers.



  A literal sequence of bases (e.g., "atgcatt") should be enclosed in

quotation marks. Literal sequences are distinguished from free text by

context. Qualifiers that take free text as their values do not take

literal sequences, and vice versa.



  The `/label=' qualifier takes a feature label as its value.

Although feature labels are optional, they allow unambiguous

references to the feature. The feature label identifies a feature

within an entry; when combined with the accession number and the name

of the data bank from which it came, it is a unique tag for that

feature. Feature labels must be unique within an entry, but can be the

same as a feature label in another entry. Feature labels are not case

sensitive; they can be entered in any combination of upper- and

lowercase without changing their meaning.



The following is a partial list of feature qualifiers.



/anticodon	Location of the anticodon of tRNA and the amino acid

		for which it codes



/bound_moiety	Moiety bound



/citation	Reference to a citation providing the claim of or

		evidence for a feature



/codon		Specifies a codon that is different from any found in the

		reference genetic code



/codon_start	Indicates the first base of the first complete codon

		in a CDS (as 1 or 2 or 3)



/cons_splice	Identifies intron splice sites that do not conform to

		the 5'-GT... AG-3' splice site consensus



/db_xref	A database cross-reference; pointer to related information

		in another database. A description of all cross-references

		can be found at:



		http://www.ncbi.nlm.nih.gov/collab/db_xref.html



/direction	Direction of DNA replication



/EC_number	Enzyme Commission number for the enzyme product of the

		sequence



/evidence	Value indicating the nature of supporting evidence



/frequency	Frequency of the occurrence of a feature



/function	Function attributed to a sequence



/gene		Symbol of the gene corresponding to a sequence region (usable

		with all features)



/label		A label used to permanently identify a feature



/map		Map position of the feature in free-format text



/mod_base	Abbreviation for a modified nucleotide base



/note		Any comment or additional information



/number		A number indicating the order of genetic elements

		(e.g., exons or introns) in the 5' to 3' direction



/organism	Name of the organism that is the source of the

		sequence data in the record. 



/partial	Differentiates between complete regions and partial ones



/phenotype	Phenotype conferred by the feature



/product	Name of a product encoded by a coding region (CDS)

		feature



/pseudo		Indicates that this feature is a non-functional

		version of the element named by the feature key



/rpt_family	Type of repeated sequence; Alu or Kpn, for example



/rpt_type	Organization of repeated sequence



/rpt_unit	Identity of repeat unit that constitutes a repeat_region



/standard_name	Accepted standard name for this feature



/transl_except	Translational exception: single codon, the translation

		of which does not conform to the reference genetic code



/translation	Amino acid translation of a coding region



/type		Name of a strain if different from that in the SOURCE field



/usedin		Indicates that feature is used in a compound feature

		in another entry



3.4.12.4 Cross-Reference Information



  One type of information in the feature table lists cross-references to

the annual compilation of transfer RNA sequences in Nucleic Acids

Research, which has kindly been sent to us on CD-ROM by Dr. Sprinzl.

Each tRNA entry of the feature table contains a /note= qualifier that

includes a reference such as `(NAR: 1234)' to identify code 1234 in

the NAR compilation. When such a cross-reference appears in an entry

that contains a gene coding for a transfer RNA molecule, it refers to

the code in the tRNA gene compilation. Similar cross-references in

entries containing mature transfer RNA sequences refer to the

companion compilation of tRNA sequences published by D.H. Gauss and M.

Sprinzl in Nucleic Acids Research.
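
  A cross-reference of the form `(NAR: 1234)' can be pulled out of a
/note value with a regular expression. This is a sketch under the
assumption that the code is always a run of digits inside parentheses,
as in the description above; nar_code is a hypothetical name.

```python
import re
from typing import Optional

_NAR = re.compile(r"\(NAR:\s*(\d+)\)")


def nar_code(note: str) -> Optional[int]:
    """Return the NAR compilation code found in a note, or None."""
    match = _NAR.search(note)
    return int(match.group(1)) if match else None
```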



3.4.12.5 Feature Table Examples



  In the first example a number of key names, feature locations, and

qualifiers are illustrated, taken from different sequences. The first

table entry is a coding region consisting of a simple span of bases

and including a /gene qualifier. In the second table entry, an NAR

cross-reference is given (see the previous section for a discussion of

these cross-references). The third and fourth table entries use the

symbols `<' and `>' to indicate that the beginning or end of the

feature is beyond the range of the presented sequence. In the fifth

table entry, the symbol `^' indicates that the feature is between

bases.



1       10        20        30        40        50        60        70       79

---------+---------+---------+---------+---------+---------+---------+---------

     CDS             5..1261

                     /product="alpha-1-antitrypsin precursor"

                     /map="14q32.1"

                     /gene="PI"

     tRNA            1..87

                     /note="Leu-tRNA-CAA (NAR: 1057)"

                     /anticodon=(pos:35..37,aa:Leu)

     mRNA            1..>66

                     /note="alpha-1-acid glycoprotein mRNA"

     transposon      <1..267

                     /note="insertion element IS5"

     misc_recomb     105^106

                     /note="B.subtilis DNA end/IS5 DNA start"

     conflict        258

                     /replace="t"

                     /citation=[2]

---------+---------+---------+---------+---------+---------+---------+---------

1       10        20        30        40        50        60        70       79



Example 10. Feature Table Entries
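
  Feature table lines such as those in Example 10 can be grouped into
features by column position. The sketch below assumes, from the layout
of the example, that anything in columns 1 to 21 is the feature key and
that locations and qualifiers begin at column 22. The function name is
hypothetical, and continuation lines are appended without a separator,
which suits /translation values but not necessarily free text.

```python
from typing import Dict, Iterable, List


def parse_feature_table(lines: Iterable[str]) -> List[Dict]:
    """Group feature table lines into key, location, and qualifiers."""
    features: List[Dict] = []
    for line in lines:
        if not line.strip():
            continue
        key = line[:21].strip()    # columns 1-21: feature key, if any
        body = line[21:].strip()   # column 22 onward: location or qualifier
        if key:
            features.append({"key": key, "location": body, "qualifiers": []})
        elif not features:
            raise ValueError("qualifier line found before any feature key")
        elif body.startswith("/"):
            features[-1]["qualifiers"].append(body)
        elif features[-1]["qualifiers"]:
            # Continuation of the previous qualifier value.
            features[-1]["qualifiers"][-1] += body
        else:
            # Continuation of a long location descriptor.
            features[-1]["location"] += body
    return features
```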





The next example shows the representation for a CDS that spans more

than one entry.



1       10        20        30        40        50        60        70       79

---------+---------+---------+---------+---------+---------+---------+---------

LOCUS       HUMPGAMM1    3688 bp ds-DNA             PRI       15-OCT-1990

DEFINITION  Human phosphoglycerate mutase (muscle specific isozyme) (PGAM-M)

            gene, 5' end.

ACCESSION   M55673 M25818 M27095

KEYWORDS    phosphoglycerate mutase.

SEGMENT     1 of 2

  .

  .

  .

FEATURES             Location/Qualifiers

     CAAT_signal     1751..1755

                     /gene="PGAM-M"

     TATA_signal     1791..1799

                     /gene="PGAM-M"

     exon            1820..2274

                     /number=1

                     /EC_number="5.4.2.1"

                     /gene="PGAM-M"

     intron          2275..2377

                     /number=1

                     /gene="PGAM2"

     exon            2378..2558

                     /number=2

                     /gene="PGAM-M"

  .

  .

  .

//

LOCUS       HUMPGAMM2     677 bp ds-DNA             PRI       15-OCT-1990

DEFINITION  Human phosphoglycerate mutase (muscle specific isozyme) (PGAM-M),

            exon 3.

ACCESSION   M55674 M25818 M27096

KEYWORDS    phosphoglycerate mutase.

SEGMENT     2 of 2

  .

  .

  .

FEATURES             Location/Qualifiers

     exon            255..457

                     /number=3

                     /gene="PGAM-M"

     intron          order(M55673:2559..>3688,<1..254)

                     /number=2

                     /gene="PGAM-M"

     mRNA            join(M55673:1820..2274,M55673:2378..2558,255..457)

                     /gene="PGAM-M"

     CDS             join(M55673:1861..2274,M55673:2378..2558,255..421)

                     /note="muscle-specific isozyme"

                     /gene="PGAM2"

                     /product="phosphoglycerate mutase"

                     /codon_start=1

                     /translation="MATHRLVMVRHGESTWNQENRFCGWFDAELSEKGTEEAKRGAKA

                     IKDAKMEFDICYTSVLKRAIRTLWAILDGTDQMWLPVVRTWRLNERHYGGLTGLNKAE

                     TAAKHGEEQVKIWRRSFDIPPPPMDEKHPYYNSISKERRYAGLKPGELPTCESLKDTI

                     ARALPFWNEEIVPQIKAGKRVLIAAHGNSLRGIVKHLEGMSDQAIMELNLPTGIPIVY

                     ELNKELKPTKPMQFLGDEETVRKAMEAVAAQGKAK"

  .

  .

  .

//

---------+---------+---------+---------+---------+---------+---------+---------

1       10        20        30        40        50        60        70       79



Example 11. Joining Sequences
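
  To show how the join in Example 11 mixes remote and local spans, the
sketch below resolves each element to an accession number and a span,
then sums the span lengths. It is illustrative only: the names are
hypothetical, only simple spans are accepted, and the accession number
of the local entry must be supplied by the caller.

```python
import re
from typing import List, Tuple

# Optional 'ACCESSION:' prefix, then a span with optional '<' and '>'.
_PART = re.compile(
    r"^(?:(?P<acc>[A-Za-z0-9_.]+):)?<?(?P<start>\d+)\.\.>?(?P<end>\d+)$"
)


def join_parts(location: str, local_accession: str) -> List[Tuple[str, int, int]]:
    """Resolve join(...) into (accession, start, end) triples."""
    location = location.strip()
    if not (location.startswith("join(") and location.endswith(")")):
        raise ValueError(f"not a join: {location!r}")
    parts = []
    for piece in location[len("join("):-1].split(","):
        match = _PART.match(piece.strip())
        if match is None:
            raise ValueError(f"unsupported join element: {piece!r}")
        # Elements without a prefix refer to the entry holding the feature.
        accession = match.group("acc") or local_accession
        parts.append((accession, int(match.group("start")), int(match.group("end"))))
    return parts


def joined_length(parts: List[Tuple[str, int, int]]) -> int:
    """Total number of bases once the spans are placed end to end."""
    return sum(end - start + 1 for _, start, end in parts)
```

  Applied to the CDS location in the second entry, with M55674 as the
local accession number, the three spans contribute 414, 181, and 167
bases, for a total of 762 bases (254 codons). This is consistent with
the 253 residues shown in the /translation qualifier plus a stop codon.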





3.4.13 ORIGIN Format



  The ORIGIN record may be left blank, may appear as `Unreported.' or

may give a local pointer to the sequence start, usually involving an

experimentally determined restriction cleavage site or the genetic

locus (if available). The ORIGIN record ends in a period if it

contains data, but does not include the period if the record is left

empty (in contrast to the KEYWORDS field which contains a period

rather than being left blank).



3.4.14 SEQUENCE Format



  The nucleotide sequence for an entry is found in the records following

the ORIGIN record. The sequence is reported in the 5' to 3' direction.

There are sixty bases per record, listed in groups of ten bases

followed by a blank, starting at position 11 of each record. The

number of the first nucleotide in the record is given in columns 4 to

9 (right justified) of the record.
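
  The record layout just described can be sketched as a pair of
functions, one writing and one reading sequence records. The function
names are hypothetical, and one simplification is assumed: no blank is
written after the final group of a record.

```python
from typing import List, Tuple


def format_sequence_records(sequence: str) -> List[str]:
    """Lay out a sequence as records of sixty bases in groups of ten."""
    records = []
    for offset in range(0, len(sequence), 60):
        chunk = sequence[offset:offset + 60]
        groups = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        # Columns 1-3 blank, base number right justified in columns 4-9,
        # column 10 blank, bases from position 11 onward.
        records.append("   " + f"{offset + 1:>6}" + " " + groups)
    return records


def parse_sequence_record(record: str) -> Tuple[int, str]:
    """Return (number of first base, bases) for one sequence record."""
    number = int(record[3:9])
    bases = record[10:].replace(" ", "")
    return number, bases
```

  For an 80-base sequence this yields two records, numbered 1 and 61.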