Contents of 3AA-20 to 3AA-21
3AA-20. THE NEED FOR A CONCISE REPRESENTATION OF SEQUENCE
3AA-20.1. General Considerations Regarding the One-Letter System
There are difficulties in using the three-letter system (3AA-14 to 3AA-19) in presenting long protein sequences. A one-letter code is much more concise, and is helpful in summarizing large amounts of data, in aligning and comparing homologous sequences, and in computer techniques for these processes. It may also be used to label residues in three-dimensional pictures of protein molecules.
The possibility of using one-letter symbols was mentioned by Gamow & Yčas [26] in 1958. Šorm et al. [27] systematized the idea in 1961 (see, for example, [28]), and Dayhoff and Eck used one-letter symbols derived partly from the code of Šorm et al. in their compilations of protein sequences ([29], latest edition [30]). IUB-IUPAC recommendations [11] were approved in 1968 on the basis of proposals of a subcommittee of W. E. Cohn, M. O. Dayhoff, R. V. Eck, and B. Keil, and these recommendations are given here with no substantial change.
3AA-20.2. Limits of Application of the One-Letter System
The one-letter system is less easily understood than the three-letter system by those not familiar with it, so it should not be used in simple text or in reporting experimental details of sequence determination. It is therefore recommended for comparisons of long sequences in tables and lists, and in other special uses where brevity is important. If both it and the single-letter system for nucleotide sequences [23] are used in the same paper, particular care should be taken to avoid confusion.
3AA-21. DESCRIPTION OF THE ONE-LETTER SYSTEM
3AA-21.1. Use of the Code
The letter written at the left-hand end is that of the amino-acid residue carrying the free amino group, and the letter written at the right-hand end is that of the residue carrying the free carboxyl group. The absence of punctuation beyond either end of a sequence implies that the residue indicated at that end is known to be terminal. A fragmentary sequence is preceded or followed by a slash (/) if its end is not known to be the end of the complete molecule.
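As an illustration only, the end-marking convention above can be checked mechanically. The Python sketch below assumes the sequence is held as a plain string; the function name `terminal_status` is hypothetical and is not part of the recommendations.

```python
def terminal_status(sequence: str) -> tuple[bool, bool]:
    """Return (n_terminus_known, c_terminus_known) for a one-letter sequence.

    Assumption: a leading or trailing slash marks an end that is not known
    to be the end of the complete molecule; no slash means the end residue
    is known to be terminal.
    """
    s = sequence.strip()
    if not s:
        raise ValueError("empty sequence")
    return (not s.startswith("/"), not s.endswith("/"))


if __name__ == "__main__":
    # Left end is the residue with the free amino group, right end the free carboxyl group.
    print(terminal_status("A C D E"))    # both ends known to be terminal
    print(terminal_status("/A C D E"))   # amino end not known to be terminal
```

The function only inspects the two ends of the string; it does not validate the symbols in between.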
3AA-21.2. The Code Symbols
The symbols are listed, in alphabetical order of amino-acid names, in Table 1. Table 5 gives them in alphabetical order of symbols.
Table 5. The One-Letter Symbols
| One-letter symbol | Three-letter symbol | Amino acid |
|---|---|---|
| A | Ala | alanine |
| B | Asx | aspartic acid or asparagine |
| C | Cys | cysteine |
| D | Asp | aspartic acid |
| E | Glu | glutamic acid |
| F | Phe | phenylalanine |
| G | Gly | glycine |
| H | His | histidine |
| I | Ile | isoleucine |
| K | Lys | lysine |
| L | Leu | leucine |
| M | Met | methionine |
| N | Asn | asparagine |
| P | Pro | proline |
| Q | Gln | glutamine |
| R | Arg | arginine |
| S | Ser | serine |
| T | Thr | threonine |
| U* | Sec | selenocysteine |
| V | Val | valine |
| W | Trp | tryptophan |
| X** | Xaa | unknown or 'other' amino acid |
| Y | Tyr | tyrosine |
| Z | Glx | glutamic acid or glutamine (or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides) |

** See the Addendum for an alternative use of X.
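For computer handling, Table 5 can be held as a simple lookup. The following Python sketch transcribes the table into a dictionary and converts a fully determined one-letter sequence into hyphenated three-letter symbols. The names `ONE_TO_THREE` and `to_three_letter` are illustrative assumptions, and the sketch deliberately rejects the punctuation used for partly known sequences.

```python
# One-letter -> three-letter symbols, transcribed from Table 5.
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "U": "Sec", "V": "Val",
    "W": "Trp", "X": "Xaa", "Y": "Tyr", "Z": "Glx",
}


def to_three_letter(sequence: str) -> str:
    """Convert a one-letter sequence to hyphenated three-letter symbols.

    Blanks between letters are skipped; any other character that is not
    a symbol from Table 5 raises ValueError.
    """
    symbols = []
    for ch in sequence:
        if ch == " ":
            continue
        if ch not in ONE_TO_THREE:
            raise ValueError(f"not a one-letter symbol: {ch!r}")
        symbols.append(ONE_TO_THREE[ch])
    return "-".join(symbols)


if __name__ == "__main__":
    print(to_three_letter("A C D E F G H I K L M N P Q"))
```

A dictionary keeps the mapping explicit and makes unassigned letters (J and O) fail loudly rather than pass through unnoticed.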
Note on the Choice of Symbols
Initial letters of the names of the amino acids were chosen where there was no ambiguity. There are six such cases: cysteine, histidine, isoleucine, methionine, serine and valine. All the other amino acids share the initial letters A, G, L, P or T, so arbitrary assignments were made. These letters were assigned to the most frequently occurring and structurally simplest of the amino acids with these initials: alanine (A), glycine (G), leucine (L), proline (P) and threonine (T).
Other assignments were made on the basis of associations that might be helpful in remembering the code, e.g. the phonetic associations of F for phenylalanine and R for arginine. For tryptophan the double ring of the molecule is associated with the bulky letter W. The letters N and Q were assigned to asparagine and glutamine respectively; D and E to aspartic and glutamic acids respectively. K and Y were chosen for the two remaining amino acids, lysine and tyrosine, because, of the few remaining letters, they were close alphabetically to the initial letters of the names. U and O were avoided because U is easily confused with V in handwritten material, and O with G, Q, C and D in imperfect computer print-outs, and also with zero. J was avoided because it is absent from several languages.
Two other symbols are often necessary in partly determined sequences, so B was assigned to aspartic acid or asparagine when these have not been distinguished; Z was similarly assigned to glutamic acid or glutamine. X means that the identity of an amino acid is undetermined, or that the amino acid is atypical. See the Addendum for an alternative use of X.
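When partly determined sequences are compared by computer, the symbols B, Z and X have to be matched against specific residues. The Python sketch below shows one possible way to test whether two symbols could denote the same residue. Treating X as matching anything is an assumption made for this example (X may also denote an atypical amino acid), and the names `AMBIGUOUS` and `compatible` are hypothetical.

```python
# Residues that each ambiguous symbol may stand for, as described in the text.
AMBIGUOUS = {
    "B": {"D", "N"},  # aspartic acid or asparagine
    "Z": {"E", "Q"},  # glutamic acid or glutamine
}


def compatible(a: str, b: str) -> bool:
    """Return True if two one-letter symbols could denote the same residue."""
    if a == "X" or b == "X":
        # Assumption: an undetermined residue is compatible with any symbol.
        return True
    # A specific symbol stands only for itself.
    candidates_a = AMBIGUOUS.get(a, {a})
    candidates_b = AMBIGUOUS.get(b, {b})
    return bool(candidates_a & candidates_b)


if __name__ == "__main__":
    print(compatible("B", "D"))  # Asx against Asp
    print(compatible("B", "E"))  # Asx against Glu
```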
3AA-21.3. Spacing
An important use of the one-letter notation is in presenting alignment of homologous sequences. It is therefore vital not to destroy alignment by variable punctuation or variable width of letters. A single space is therefore left between symbols as a blank if not occupied by punctuation (3AA-21.4 and 3AA-21.5), so that such punctuation can be inserted without destroying alignment. Exactly the same spacing is given to each letter, each blank and each punctuation mark, as in typewritten material or when printed in a 'typewriter' (fixed-width) type font.
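The spacing rule amounts to requiring that successive letters sit exactly two character columns apart, whether the intervening column holds a blank or a punctuation mark. A minimal Python sketch of such a check follows; the function names are hypothetical, and the sketch assumes the lines are displayed in a fixed-width font and carry no row labels.

```python
def letter_columns(line: str) -> list[int]:
    """Return the column index of every letter in a line."""
    return [i for i, ch in enumerate(line) if ch.isalpha()]


def keeps_alignment(line: str) -> bool:
    """True if exactly one character (blank or punctuation) separates successive letters."""
    cols = letter_columns(line)
    return all(b - a == 2 for a, b in zip(cols, cols[1:]))


def mutually_aligned(lines: list[str]) -> bool:
    """True if every line keeps the spacing and all letters share the same column parity."""
    parities = {cols[0] % 2 for cols in map(letter_columns, lines) if cols}
    return all(keeps_alignment(line) for line in lines) and len(parities) <= 1


if __name__ == "__main__":
    rows = [
        "(A,C,D)E F G(H.I.K.L=M,N)P Q",
        " R S T E F G H I K L A D P Q",
        " A C D E F/G H I K L(M,N)P Q",
    ]
    print(mutually_aligned(rows))
```

Note that a row beginning with an unparenthesized residue needs a leading blank so that its first letter falls in the same column as a row that opens with a parenthesis.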
3AA-21.4. Known Sequences
A blank between letters indicates that the sequence was determined experimentally. For example,
A C D E F G H I K L M N P Q
means Ala-Cys-Asp-Glu-Phe-Gly-His-Ile-Lys-Leu-Met-Asn-Pro-Gln
3AA-21.5. Punctuation in Partly Known Sequences
Parentheses are used to indicate regions of a sequence in which the composition is known but the sequence undetermined; they are also placed round the symbol for a single residue to show that its identification is tentative. The one-space symbol '=' can be used for ')(' to indicate the end of one unknown sequence and the beginning of another.
If the residues inside parentheses can be positioned with confidence by homology with related proteins, the letters are separated by dots. If their position is arbitrary for lack of even indirect evidence, the letters are separated by commas. A slash (/) may be used to separate the symbols for residues that have not been shown experimentally to be connected, because they are derived from different peptides. A slash before or after a sequence shows that termination has not been demonstrated (3AA-21.1).
This punctuation is illustrated in the comparison of three sequences, where two partly known (a, c) are aligned with a known one (b):
(a) (A,C,D)E F G(H.I.K.L=M,N)P Q
(b)  R S T E F G H I K L A D P Q
(c)  A C D E F/G H I K L(M,N)P Q
Thus the sequence of one of the fragments (H.I.K.L) can be inferred with confidence for (a), whereas that of fragments (A,C,D) and (M,N) cannot. Two fragments were sequenced independently in (c). Their positioning is made only by analogy with (b).
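The punctuation rules of this section can be read back by a program. The Python sketch below is one possible interpretation, not a reference parser: it labels each residue of a punctuated sequence according to how its position is known. The status labels and the function name `classify` are assumptions of this example, and a group that mixes commas and dots is conservatively treated as composition only.

```python
import re


def classify(sequence: str) -> list[tuple[str, str]]:
    """Return (residue, status) pairs for a punctuated one-letter sequence."""
    # '=' stands for ')(' (end of one unknown region, start of another).
    expanded = sequence.replace("=", ")(")
    result = []
    # Match either a whole parenthesized group or a single letter outside parentheses.
    for match in re.finditer(r"\(([^()]*)\)|([A-Z])", expanded):
        group, letter = match.group(1), match.group(2)
        if letter is not None:
            result.append((letter, "sequenced"))
            continue
        residues = re.findall(r"[A-Z]", group)
        if len(residues) == 1:
            status = "tentative"            # single residue in parentheses
        elif "," in group:
            status = "composition only"     # order arbitrary
        elif "." in group:
            status = "ordered by homology"  # order inferred from related proteins
        else:
            status = "unspecified"
        result.extend((residue, status) for residue in residues)
    return result


if __name__ == "__main__":
    for residue, status in classify("(A,C,D)E F G(H.I.K.L=M,N)P Q"):
        print(residue, status)
```

Blanks and slashes are skipped by this sketch, so it does not report where independently sequenced fragments meet.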
If more elaborate punctuation is required for special circumstances, it is essential that only one character (or a blank of similar size) should appear between the letters of the code.
7. International Union of Biochemistry (1978) Biochemical Nomenclature and Related Documents, The Biochemical Society, London.
11. IUPAC-IUB Commission on Biochemical Nomenclature (CBN), A One-Letter Notation for Amino Acid Sequences, 1968, Arch. Biochem. Biophys. 125(3), i-v (1968); Biochem. J. 113, 1-4 (1969); Biochemistry, 7, 2703-2705 (1968); Biochim. Biophys. Acta, 168, 6-10 (1968); Bull. Soc. Chim. Biol. 50, 1577-1582 (1968) (in French); Eur. J. Biochem. 5, 151-153 (1968); Hoppe-Seyler's Z. Physiol. Chem. 350, 793-797 (1969) (in German); J. Biol. Chem. 243, 3557-3559 (1968); Mol. Biol. 3, 473-477 (1969) (in Russian); Pure Appl. Chem. 31, 641-645 (1972); also pp. 91-93 in [7].
23. IUPAC-IUB Commission on Biochemical Nomenclature (CBN), Abbreviations and Symbols for Nucleic Acids, Polynucleotides and their Constituents, Recommendations 1970, Arch. Biochem. Biophys. 145, 425-436 (1971); Biochem. J. 120, 449-454 (1970); Biochemistry, 9, 4022-4027 (1970); Biochim. Biophys. Acta, 247, 1-12 (1971); Eur. J. Biochem. 15, 203-208 (1970), corrected 25, 1 (1972); Hoppe-Seyler's Z. Physiol. Chem. 351, 1055-1063 (1970) (in German); J. Biol. Chem. 245, 5171-5176 (1970); Mol. Biol. 6, 166-174 (1972) (in Russian); Pure Appl. Chem. 40, 277-290 (1974); also pp. 116-121 in [7].
26. Gamow, G. & Yčas, M. (1958) Symposium on Information Theory in Biology. Pergamon Press, New York.
27. Šorm, F., Keil, B., Vaněček, J., Tomášek, V., Mikeš, O., Meloun, B., Kostka, V. & Holeyšovský, V. (1961). Collect. Czech. Chem. Commun. 26, 531-578.
28. Keil, B., Prusik, Z. & Šorm, F. (1963) Biochim. Biophys. Acta, 78, 559-578.
29. Dayhoff, M. O., Eck, R. V., Chang, M. A. & Sochard, M. R. (1965) Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Silver Spring, Maryland.
30. Dayhoff, M. O., in Atlas of Protein Sequence and Structure, vol. 5 (1972), suppl. 1 (1973), suppl. 2 (1976) and suppl. 3 (1979). National Biomedical Research Foundation, Washington, DC.