Protein Naming Guidelines

Preamble

Consistent nomenclature is indispensable for communication, literature searching and entry retrieval. Therefore, the JCBN has, in cooperation with the EBI, the SIB and PIR, agreed on minimal protein nomenclature rules. Ambiguous gene and protein names are a major problem in the literature, and the problem is even worse in the sequence databases, which tend to propagate the confusion.

Warning: this is a preliminary document; many rules still have to be added, modified or expanded.

General naming rules

If it exists, the approved nomenclature should be used. If no accepted unification exists, and several alternatives are of equal frequency in the literature, the one with the easiest extensibility or standardization should be used. In addition, preference is given to names that best reflect the common acronym or gene symbol.

The protein naming guidelines are based on the premise that a good and stable recommended name for a protein is a name that is as neutral as possible. A recommended name should be, as far as possible, unique and attributed to all orthologs. One reason for this is that it should be possible to propagate a protein name to all orthologous proteins from various organisms. This is why, ideally, the protein name should not contain a specific characteristic of the protein, and in particular it should not reflect the function or role of the protein, nor its subcellular location, its domain structure, its tissue specificity, its molecular weight or its species of origin.

Therefore:

A recommended name should not contain information about the molecular weight of the protein.
e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit."

A recommended name should not be based on the name of a disease.
e.g. "Bloom syndrome protein" is not suitable.

A recommended name should not be based on tissue specificity.
e.g. "testis-specific protein ..." is not suitable.

A recommended name must not be based on the species name.
e.g. "Yeast Ku70 protein" is not suitable.

A recommended name should not be based on the gene induction.
e.g. "androgen-induced protein 1" is not suitable.

The ideal recommended name is a word that ends in "in" and can be easily pronounced in English.

e.g. "zyxin", "insulin", "hemoglobin", "caveolin", "desmoglein", "secretin", etc.

Names ending in "ine" should be avoided, e.g. "maurocalcin" instead of "maurocalcine".

Wherever appropriate, the recommended name should use American spelling conventions (as opposed to British spelling), e.g. "hemoglobin" instead of "haemoglobin".

A recommended name should not contain a Roman numeral, e.g. "caveolin-2" instead of "caveolin-II". Exception: historical cases, e.g. "coagulation factor IX", "casein kinase II", "HLA class I", etc.
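The Roman numeral rule can be illustrated with a small check. The following is a minimal sketch, not part of the guidelines: the function names and the exception set are hypothetical, the exception set contains only the three historical cases quoted above, and the pattern is naive (it would also rewrite a name that legitimately ends in a single letter such as "X").

```python
import re

# Historical exceptions quoted in the guideline text; a real list would be longer.
HISTORICAL_EXCEPTIONS = {"coagulation factor IX", "casein kinase II", "HLA class I"}

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}

# A trailing token made only of I, V or X, preceded by a dash or a space.
TRAILING_ROMAN = re.compile(r"^(?P<stem>.+?)[- ](?P<roman>[IVX]+)$")


def roman_to_int(roman: str) -> int:
    """Convert a small Roman numeral (I, V, X only) to an integer."""
    total = 0
    for i, ch in enumerate(roman):
        value = ROMAN_VALUES[ch]
        following = ROMAN_VALUES[roman[i + 1]] if i + 1 < len(roman) else 0
        # A smaller value placed before a larger one is subtracted (IV, IX).
        total += -value if value < following else value
    return total


def arabic_suffix(name: str) -> str:
    """Rewrite a trailing Roman numeral as '-<Arabic number>' unless excepted."""
    if name in HISTORICAL_EXCEPTIONS:
        return name
    match = TRAILING_ROMAN.match(name)
    if match is None:
        return name
    return f"{match.group('stem')}-{roman_to_int(match.group('roman'))}"


print(arabic_suffix("caveolin-II"))
print(arabic_suffix("casein kinase II"))
```

With these assumptions, "caveolin-II" is rewritten to "caveolin-2", while "casein kinase II" is returned unchanged because it is listed as a historical case.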

Abbreviations should not be built using the molecular weight, e.g. abbreviations such as "p123", "Gp62" and "p34" are not suitable.

Exception: cases where historically the molecular weight has been consistently and generally applied as part of the accepted name, e.g. "p53".
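A molecular-weight style abbreviation can be recognized with a simple pattern. This is a hedged sketch: the assumption that such abbreviations look like "p" or "gp" followed by digits is inferred from the examples above, and the exception set (a hypothetical name) holds only "p53".

```python
import re

# Assumed shape of a molecular-weight abbreviation: 'p' or 'gp' plus digits.
MW_STYLE = re.compile(r"^(?:p|gp)\d+$", re.IGNORECASE)

# Accepted historical names; only the example from the text is listed here.
ACCEPTED_HISTORICAL = {"p53"}


def is_mw_abbreviation(token: str) -> bool:
    """Return True if the token looks like an unsuitable weight-based abbreviation."""
    return bool(MW_STYLE.match(token)) and token not in ACCEPTED_HISTORICAL


for token in ["p123", "Gp62", "p34", "p53"]:
    print(token, is_mw_abbreviation(token))
```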

For proteins that belong to a multigene family, it is recommended that you choose a coherent nomenclature with numbers to specify the different members of the family.

When naming proteins which can be grouped into a family based on homology or according to a notion of shared function (like the interleukins), the different members should be enumerated with a dash "-" followed by an Arabic number, e.g. "desmoglein-1", "desmoglein-2", etc.

General syntax

Greek letters must be written in full, e.g. "alpha", "omega". Greek letters are written entirely in lower case with the exception of "Delta" in the context of the steroid/fatty acid metabolism nomenclature. If a Greek letter is preceded or followed by a number or letter, then it must be separated by a dash "-", e.g. "unicornase alpha-1", "Myprotease A-beta".
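The two Greek letter rules (spell the letter out, and separate it from an adjacent number or letter with a dash) can be sketched as follows. The mapping is deliberately partial and the function name is hypothetical. Only lower-case Greek symbols are handled, so the capitalized "Delta" exception is left untouched.

```python
import re

# Partial mapping from lower-case Greek symbols to their spelled-out names.
GREEK_NAMES = {
    "α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta",
    "ε": "epsilon", "κ": "kappa", "λ": "lambda", "ω": "omega",
}
GREEK_CLASS = "[" + "".join(GREEK_NAMES) + "]"


def spell_out_greek(name: str) -> str:
    """Spell out Greek symbols and dash-separate them from adjacent letters or digits."""
    # Insert a dash where a letter or digit is directly followed by a Greek symbol.
    name = re.sub(rf"(?<=[A-Za-z0-9])(?={GREEK_CLASS})", "-", name)
    # Insert a dash where a Greek symbol is directly followed by a letter or digit.
    name = re.sub(rf"(?<={GREEK_CLASS})(?=[A-Za-z0-9])", "-", name)
    # Replace each symbol with its lower-case name.
    return re.sub(GREEK_CLASS, lambda m: GREEK_NAMES[m.group()], name)


print(spell_out_greek("unicornase α1"))
print(spell_out_greek("Myprotease Aβ"))
```

Under these assumptions the two calls reproduce the examples given above, "unicornase alpha-1" and "Myprotease A-beta".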

A recommended name should not use diacritics, such as accents, umlauts and so on, e.g. "Krüppel" is not suitable.
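Diacritics can be detected programmatically. The sketch below only flags a name and does not attempt to transliterate it, because the guidelines do not say which replacement spelling to use. The function name is hypothetical, and letters that have no Unicode decomposition (for example "ø") are not detected by this approach.

```python
import unicodedata


def has_diacritics(name: str) -> bool:
    """Return True if any character carries a combining mark after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", name)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return any(unicodedata.combining(ch) for ch in decomposed)


print(has_diacritics("Krüppel"))
print(has_diacritics("zyxin"))
```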

Eponyms should be used in the non-possessive form (a name should not be followed by "'s"). Note: an eponym is a person, whether real or fictitious, whose name has (or is thought to have) given rise to the name of a particular item. There used to be a debate as to whether the possessive form (e.g. Alzheimer's disease) or the non-possessive form (Alzheimer disease) of eponyms is preferred. As a rule, the non-possessive form is now preferred.

e.g. "Alzheimer disease amyloid A4 protein" instead of "Alzheimer's disease amyloid A4 protein".

A recommended name based on the gene symbol should be in the form "Protein gene symbol" instead of "gene symbol protein", e.g. "protein abcD" instead of "abcD protein". When a recommended name includes a gene symbol, the casing of the gene symbol should be the one used for the gene in the nomenclature for that organism. Since we are always dealing with proteins, it will be understood that gene=protein, e.g. "response regulator algR", "Protein HEX23".

Whenever possible, commas should be avoided in a recommended name, e.g. "acyl-CoA dehydrogenase, short-chain specific" should be "short-chain specific acyl-CoA dehydrogenase".

Symbols of chemical elements can be used in abbreviations, e.g. "magnesium/calcium co-transporter" can be abbreviated as "Mg/Ca co-transporter". For ions, chemical element symbols (e.g. Cu(+), Mg(2+), etc.) are preferred to systematic names (copper(I), magnesium ion, etc.) and common names (cupric, ferrous, etc.). When necessary, the valence of an ion should be indicated within parentheses, e.g. "Fe(2+)", "Fe(3+)", "Cl(-)", etc.
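The ion notation can be generated from an element symbol and a signed charge. This is a small illustrative helper with a hypothetical name. It follows the pattern of the examples above, where a charge of magnitude one is written without a digit.

```python
def ion_symbol(element: str, charge: int) -> str:
    """Format an ion as Element(n+) or Element(n-), omitting the digit when n is 1."""
    if charge == 0:
        raise ValueError("an ion must carry a non-zero charge")
    sign = "+" if charge > 0 else "-"
    magnitude = abs(charge)
    digits = "" if magnitude == 1 else str(magnitude)
    return f"{element}({digits}{sign})"


print(ion_symbol("Fe", 2))
print(ion_symbol("Cu", 1))
print(ion_symbol("Cl", -1))
```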

Abbreviations should not appear inside a recommended name, with the exception of:

Deoxyribonucleic acid: DNA, cDNA, dsDNA, ssDNA

Ribonucleic acid: dsRNA, siRNA, snRNA, ssRNA, tmRNA

Mono-, di-, tri- nucleic acid phosphates: d[ACGT][MDT]P, c[AG]MP

Cofactors: FAD, FMN, NAD, NADP

Others: hnRNP
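The exception list above mixes literal abbreviations with bracketed character classes, which translate directly into regular expressions. The following sketch encodes the list exactly as quoted, under hypothetical names. It is not an authoritative whitelist, and abbreviations absent from the list above are simply reported as not listed.

```python
import re

# Literal abbreviations copied from the exception list.
ALLOWED_LITERALS = {
    "DNA", "cDNA", "dsDNA", "ssDNA",
    "dsRNA", "siRNA", "snRNA", "ssRNA", "tmRNA",
    "FAD", "FMN", "NAD", "NADP",
    "hnRNP",
}

# Character-class entries from the list, used as regular expressions.
ALLOWED_PATTERNS = [re.compile(r"d[ACGT][MDT]P"), re.compile(r"c[AG]MP")]


def is_listed_abbreviation(token: str) -> bool:
    """Return True if the token is one of the abbreviations listed as exceptions."""
    if token in ALLOWED_LITERALS:
        return True
    return any(pattern.fullmatch(token) for pattern in ALLOWED_PATTERNS)


for token in ["dATP", "cAMP", "hnRNP", "XYZ"]:
    print(token, is_listed_abbreviation(token))
```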

Charged tRNAs are indicated by "tRNA" followed by the three-letter amino acid code, with the first letter capitalized, in brackets, e.g. "Glu-tRNA(Gln) amidotransferase subunit B".

Hyphens should be used to form compound modifiers (i.e. two or more words that are acting as a single modifier for a noun). For example, a hyphen is used before the following words:

activated, activating, adapting, adding, amplified, anchored, anchoring, antagonizing, associated, associating, attracting, binding, blocking, bound, branching, bridging, bundling, capping, complementing, concentrating, conjugating, containing, controlled, controlling, converting, coupled, coupling, decapping, degrading, dependent, depolymerizing, derepressing, derived, deriving, destabilizing, docking, editing, enhanced, enhancing, enriched, exposed, expressed, flanking, forming, gated, grabbing, harvesting, independent, induced, inducible, inducing, inhibited, inhibiting, insensitive, interacting, laying, like, linked, linking, metabolizing, modifying, modulating, polymerizing, potentiating, preventing, processing, promoting, recognizing, recruited, recruiting, regulated, regulating, related, released, releasing, remodeling, removing, repressing, required, requiring, resistant, responsive, rich, ripening, scaffolding, sensing, sensitive, signaling, specific, splicing, spreading, stabilized, stabilizing, stacking, stimulated, stimulating, structuring, sulfating, suppressing, trafficking, transformed, transforming, transporting

[Note: This list is not complete]

e.g. "secretin-binding protein", "pyrophosphate-dependent phosphofructokinase".

See: http://grammar.uoregon.edu/punctuation/hyphen.html
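The compound modifier rule can be sketched as a pass over the words of a name that joins a listed modifier to the word preceding it. This is a minimal illustration with hypothetical names. Only a handful of modifiers from the list above are included, and the sketch does not try to judge whether a word is really acting as a modifier in context.

```python
# A small subset of the modifier list given in the guidelines.
MODIFIERS = {
    "associated", "binding", "containing", "dependent",
    "interacting", "like", "related",
}


def hyphenate_modifiers(name: str) -> str:
    """Join each listed modifier to the preceding word with a hyphen."""
    result = []
    for word in name.split(" "):
        if word in MODIFIERS and result:
            # Attach the modifier to the word that precedes it.
            result[-1] = f"{result[-1]}-{word}"
        else:
            result.append(word)
    return " ".join(result)


print(hyphenate_modifiers("secretin binding protein"))
print(hyphenate_modifiers("pyrophosphate dependent phosphofructokinase"))
```

Names that are already hyphenated pass through unchanged, because the hyphenated word is treated as a single token.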

Specific rules for enzymes

Enzymes commonly have accepted names ending in "ase", e.g. "aminoacylase", "arginase", "caspase", "elastase", etc.

Transfer enzymes are often indicated with the source and destination substrate separated by a double dash (--), e.g. "formylmethanofuran--tetrahydromethanopterin formyltransferase".

For protein kinases and phosphatases, use the format "X-protein kinase" or "X-protein phosphatase", where X is the target amino acid(s), e.g. "serine/threonine-protein kinase", "tyrosine-protein phosphatase".

In cases where the protein is possibly an inactive version of an enzyme, avoid mentioning the activity in the name unless in expressions such as "X domain-containing protein", e.g. "protease domain-containing protein".

In some cases, the protein is named based on the pathway it is involved in. In such cases the following format is suitable: "X biosynthesis protein Y", where X is the pathway product and Y is the gene symbol, e.g. "thiamine biosynthesis protein thic".

Specific rules for multiprotein complexes

Sometimes a protein is named after a multiprotein complex, which is only suitable for well-defined complexes. Keep in mind that in some cases the complex composition is variable and a protein can belong to several different complexes (e.g. transcription, chromatin remodeling or ubiquitin ligase E3 complexes). In such cases, it may be better not to cite the complex name in the recommended name.

Proteins that belong to well-defined multi-subunit complexes can be named according to the complex, followed by the specific subunit name, e.g. "26S proteasome non-ATPase regulatory subunit 1".

The word "subunit" is preferred to "chain", "component" or "polypeptide". Chain refers to proteolytically processed polypeptides arising from a common precursor protein, e.g. "unicornase heavy chain", "unicornase light chain".

If the name contains a "type" of subunit, then precede the word "subunit" with the "type". The "type" is a controlled vocabulary: ATP-binding, catalytic, ferredoxin, flavoprotein, modulatory, regulatory

[Note: This list is not complete]

e.g. "unicornase regulatory subunit".

Avoid the word "subunit" with a size indicator: e.g. "unicornase large subunit".

If the name contains a "designator" of the subunit, then the "designator" must follow the word "subunit": Numbers - Unicornase subunit 2, Letters - Unicornase subunit A, gene symbol - Unicornase subunit abcD, Greek letters - Unicornase subunit alpha.

The order of preference is: numbers > letters > gene symbol > Greek letters.

A recommended name can include both a "Type" and a "Designator" e.g. "unicornase regulatory subunit 1".
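The ordering rules for subunit names (the "type" precedes the word "subunit", the "designator" follows it) can be summarized in a small builder. This is an illustrative sketch with hypothetical function and parameter names. It does not validate the type against the controlled vocabulary, since that list is stated to be incomplete.

```python
from typing import Optional


def subunit_name(complex_name: str,
                 subunit_type: Optional[str] = None,
                 designator: Optional[str] = None) -> str:
    """Assemble '<complex> [type] subunit [designator]' in the prescribed order."""
    parts = [complex_name]
    if subunit_type:
        parts.append(subunit_type)   # the type goes before the word "subunit"
    parts.append("subunit")
    if designator:
        parts.append(designator)     # the designator goes after the word "subunit"
    return " ".join(parts)


print(subunit_name("unicornase", subunit_type="regulatory"))
print(subunit_name("unicornase", designator="alpha"))
print(subunit_name("unicornase", subunit_type="regulatory", designator="1"))
```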

Additional rules

Unfortunately many existing protein names are based not only on the role or function, but sometimes on the domain structure, or on plenty of other characteristics. In these cases we try to apply the following syntactic rules.

Proteins which are NOT conserved and with no known or predicted function or characteristics should be called "Uncharacterized protein ".

The following words should be avoided in a recommended name: Hypothetical, Possible, Potential, Precursor

e.g. "hypothetical protein Abcd" is not suitable.

Note: these words can be used IF they are 'internal' to the recommended name and do not convey a 'global' meaning, e.g. "high-potential iron-sulfur protein", "thiamine precursor biosynthesis protein".
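The distinction between a 'global' qualifier and an 'internal' use of the same word can be approximated with a positional heuristic. The sketch below assumes, as a simplification not stated in the guidelines, that a discouraged word is 'global' when it is the first or the last word of the name. The function name is hypothetical.

```python
DISCOURAGED = {"hypothetical", "possible", "potential", "precursor"}


def has_discouraged_qualifier(name: str) -> bool:
    """Heuristic: flag a discouraged word only when it opens or closes the name."""
    words = name.lower().split()
    if not words:
        return False
    return words[0] in DISCOURAGED or words[-1] in DISCOURAGED


print(has_discouraged_qualifier("hypothetical protein Abcd"))
print(has_discouraged_qualifier("high-potential iron-sulfur protein"))
print(has_discouraged_qualifier("thiamine precursor biosynthesis protein"))
```

Under this heuristic only the first name is flagged. The other two use the words internally, so they are accepted.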

When a recommended name is based on the predicted activity of the protein, it is allowed to precede the recommended name by 'Probable' or 'Putative', e.g. "probable acetylornithine deacetylase", "putative acetylornithine deacetylase".

Proteins of unknown function which nevertheless contain a defined domain or motif (that itself does not specify a particular function) have been named sometimes according to the domain(s) or repeat(s) present. The name should then be of the following type: "X-containing protein", where X is the domain or repeat, e.g. "PAS domain-containing protein 5".

If there is more than one domain/repeat, only use dash for the last item preceding "containing" even though this violates conventional grammar, e.g. "ankyrin repeat and SAM domain-containing protein 1" is correct, but "ankyrin repeat- and SAM domain-containing protein 1" is wrong.

Do not use plurals, e.g. "ankyrin repeats-containing protein 8" is wrong.
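The rules for domain-based names (a dash only before "containing", no plurals) can be sketched as a builder. The names below are hypothetical. Because the examples in the guidelines show at most two domains or repeats, this sketch refuses longer lists rather than guessing how they should be joined.

```python
from typing import List, Optional


def domain_containing_name(features: List[str], number: Optional[int] = None) -> str:
    """Build '<feature> [and <feature>]-containing protein [n]' from singular features."""
    if not 1 <= len(features) <= 2:
        raise ValueError("this sketch handles one or two features only")
    for feature in features:
        # Plural forms such as 'ankyrin repeats' are not allowed.
        if feature.endswith(("repeats", "domains")):
            raise ValueError(f"use the singular form instead of {feature!r}")
    # Only the last item before 'containing' takes the dash.
    name = " and ".join(features) + "-containing protein"
    if number is not None:
        name += f" {number}"
    return name


print(domain_containing_name(["PAS domain"], 5))
print(domain_containing_name(["ankyrin repeat", "SAM domain"], 1))
```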

Proteins of unknown function which exhibit significant sequence similarity to a defined protein family have been named in accordance with other members of that family, e.g. "Holliday junction resolvase family endonuclease".

It is also possible to use "-like" in the name. Bear in mind that this should only be used for cases that are outliers to a tight homomorphic family, e.g. "Holliday junction resolvase-like protein".

The CD antigen nomenclature defined for surface proteins of human leucocytes is propagated to mammalian orthologs.

Certain proteins have multiple functions. The recommended name could reflect this situation.