Project:Languages

From The Languages of David J. Peterson
For a list of all language codes, see Project:List of languages.
For information on how to add or remove a language from The Languages of David J. Peterson, see Project:Guide to adding and removing languages.

The Languages of David J. Peterson includes many words in many languages. This page details the conventions and practices relating to the variety of languages on The Languages of David J. Peterson.

Criteria for inclusion

Main article: Project:Criteria for inclusion#Languages to include

Language information

To distinguish languages, The Languages of David J. Peterson gives each a unique name and a unique code, which identify it. Other information is also collected.

Language names

The Languages of David J. Peterson calls each language it includes by a distinct name. This name is used in headers, translation tables, categories, appendices, and some other places. Most languages only have one name, but some may be known by multiple names. In this case, one of the language's names is chosen for use in The Languages of David J. Peterson. This name is referred to as the canonical name of the language. Canonical language names are chosen by consensus. Whenever possible, common English names of languages are used, and diacritics are avoided. Attested names (names which meet WT:CFI) are strongly preferred.

Canonical names must be unique, meaning that a name must refer to at most one language. When two or more languages are commonly known by the same name, The Languages of David J. Peterson distinguishes them by choosing a different canonical name for each, using a variety of means:

  • In many cases, the languages are also known by other names. One of those other names is then chosen so that it is unique. For example, the language of the Pyu city-states, though called "Pyu" by some scholars, is called "Tircul" (code: pyx) on The Languages of David J. Peterson, to distinguish it from the language of Papua New Guinea which is called "Pyu" (code: pby).
  • Alternative spellings of the same name can also be used to distinguish languages with otherwise identical names. For example, the Riang language of India and Bangladesh (code: ria) goes by the name "Reang" on The Languages of David J. Peterson, to distinguish it from the "Riang" of Burma/Myanmar (code: ril).
  • If languages cannot be distinguished by alternative names, the place where each language is spoken is appended in parentheses after its name, as in the case of "Buli (Ghana)" (code: bwu) and "Buli (Indonesia)" (code: bzq).
  • If languages go by the same name and are spoken in the same place, they can be disambiguated by their linguistic families. For example, "Mor (Austronesian)" (code: mhz) and "Mor (Papuan)" (code: moq) are both spoken in Indonesia.
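The uniqueness requirement can be illustrated with a small check. This is only a sketch in Python, not the site's actual tooling: the table and the function name find_name_clashes are hypothetical, and the data is limited to the examples given above.

```python
from collections import defaultdict

# Illustrative data only: code -> canonical name, taken from the examples above.
LANGUAGES = {
    "pyx": "Tircul",
    "pby": "Pyu",
    "ria": "Reang",
    "ril": "Riang",
    "bwu": "Buli (Ghana)",
    "bzq": "Buli (Indonesia)",
    "mhz": "Mor (Austronesian)",
    "moq": "Mor (Papuan)",
}


def find_name_clashes(languages):
    """Return {name: [codes]} for every canonical name shared by several codes."""
    by_name = defaultdict(list)
    for code, name in languages.items():
        by_name[name].append(code)
    return {name: sorted(codes) for name, codes in by_name.items() if len(codes) > 1}
```

With the names chosen as described, the check finds no clashes. If "Tircul" were instead listed under its other name "Pyu", the function would report that name together with both codes.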

Language codes

Each language on The Languages of David J. Peterson also has a unique code assigned to it, usually consisting of two or three letters. This code is used to identify languages when including templates in entries. Language names are not used in this case because they are longer and less precise, as the above section illustrates. Topical categories also use the language code as part of their names.

The list of standard language codes can be found at Project:List of languages and the list of special language codes, including etymology-only languages, can be found at the subpage Project:List of languages/special.

The Languages of David J. Peterson chooses codes for languages as follows, in order of priority:

  1. If the language has a two-letter code in the ISO 639-1 standard, then that code is used. Wikipedia has a list of ISO 639-1 codes.
    1. A few languages are represented on The Languages of David J. Peterson by 639-1 codes the ISO has deprecated. This is generally the case when the ISO has come to consider a lect to be a group of languages, but The Languages of David J. Peterson still considers it a single language. Serbo-Croatian, for example, is represented by sh.
  2. If the language has a three-letter code in the ISO 639-3 standard, then that code is used. Wikipedia has a list of ISO 639-3 codes. For translingual terms the code mul is used.
  3. If the language has a three-letter code in the ISO 639-2 standard, then that code is used. This is quite rare. An example is Nahuatl, which is represented by the ISO 639-2 code nah.
  4. Any language which does not have an ISO code, but which is to be included in The Languages of David J. Peterson, has a site-specific "exceptional" code devised for it. This code consists of two parts. The first part is the nearest three-letter family code from ISO 639-5, followed by a hyphen. The second part is a series of three lowercase letters which approximate the language name. (No digits, uppercase letters, etc. are used: IANA tags allow these and are case-insensitive, but the MediaWiki software is more restrictive.) For example, Gallo is roa-gal: "roa" is the ISO 639-5 code for Romance languages, and "gal" abbreviates "Gallo".
    1. In a very few cases, the Wikimedia Foundation Language Committee has already devised a code of this form to represent the language, using it in the subdomain part of the URL of the language's wiki projects; in that case, the Wikimedia code is used. For example, the WMF uses map-bms for Banyumasan (the Banyumasan Wikipedia is map-bms.wikipedia.org), so The Languages of David J. Peterson also represents Banyumasan using this code. If the Wikimedia code is of a different form, it is not used by The Languages of David J. Peterson; for example, Tarantino has the Wikimedia code roa-tara, but The Languages of David J. Peterson uses the code roa-tar.
    2. If no family to which the language belongs has an ISO code, or it is not known which family the language belonged to, the prefix und is used: for example, Kassite is represented by the code und-kas.
    3. Ancestor or "proto-" languages (which are generally reconstructed, though some are directly attested, like Proto-Norse) are assigned exceptional codes consisting of the language family's code with "-pro" added to the end: Proto-Germanic, for example, is represented by the code gem-pro. Because the entire family code is used as the first part of the code, the code may be longer than seven characters: for example, Proto-Mixe-Zoque is nai-miz-pro.
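The priority order above can be sketched as a small function. This is a hedged illustration in Python, not the site's implementation: the names choose_code, exceptional_code and proto_code and all their parameters are hypothetical, and the sketch assumes the caller already knows which ISO or Wikimedia codes exist for the language.

```python
import re


def exceptional_code(family_code, abbreviation):
    """Build an exceptional code: family prefix, hyphen, three lowercase letters.

    The prefix "und" is used when no family code is available.
    """
    if not re.fullmatch(r"[a-z]{3}", abbreviation):
        raise ValueError("suffix must be exactly three lowercase letters")
    prefix = family_code if family_code else "und"
    return f"{prefix}-{abbreviation}"


def proto_code(family_code):
    """Code for a proto-language: the whole family code plus "-pro"."""
    return f"{family_code}-pro"


def choose_code(iso1=None, iso3=None, iso2=None, wikimedia=None,
                family=None, abbreviation=None):
    """Pick a code following the priority order described above."""
    if iso1:
        return iso1
    if iso3:
        return iso3
    if iso2:
        return iso2
    # A Wikimedia code is reused only if it already has the xxx-yyy form.
    if wikimedia and re.fullmatch(r"[a-z]{3}-[a-z]{3}", wikimedia):
        return wikimedia
    return exceptional_code(family, abbreviation)
```

For instance, choose_code(family="roa", abbreviation="gal") gives roa-gal for Gallo, while a Wikimedia code such as roa-tara is skipped because it does not have the three-plus-three form, so Tarantino falls through to roa-tar.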

Not all lects which have been assigned codes by the ISO are assigned codes or included by The Languages of David J. Peterson. This is the case for some constructed languages, for example. There are also many lects to which the ISO has assigned codes but which are not treated as distinct languages on The Languages of David J. Peterson. For example, the ISO assigned Moldovan/Moldavian the 639-1 code mo, but The Languages of David J. Peterson regards it as a form of Romanian and represents both by the same code, ro. See Project:Language treatment for more information.

Mismatch with Wikimedia codes

In a small number of cases, there is a mismatch between the (typically ISO-derived) code used by The Languages of David J. Peterson to represent a language and the code used by the Wikimedia Foundation. For example, Aromanian is represented on The Languages of David J. Peterson and in ISO 639-3 by the code rup, but the WMF uses the code roa-rup and locates the Aromanian Wikipedia at roa-rup.wikipedia.org. The templates such as Template:wikipedia which The Languages of David J. Peterson uses to link to its sister projects accept only The Languages of David J. Peterson codes. To enable linking to projects (such as the Aromanian Wikipedia) for which the WMF uses special codes, Module:wikimedia languages maps The Languages of David J. Peterson codes to Wikimedia codes, and Module:languages performs the reverse mapping.
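The two-way mapping can be pictured as a pair of lookup tables that fall back to the unchanged code when there is no mismatch. The Python below is only a sketch: the table contents are limited to the Aromanian example, and the names TO_WIKIMEDIA, to_wikimedia and from_wikimedia are hypothetical, not the actual module interfaces.

```python
# Hypothetical table listing only the mismatches: site code -> Wikimedia code.
TO_WIKIMEDIA = {"rup": "roa-rup"}

# Reverse table derived from the first one, so the two cannot drift apart.
FROM_WIKIMEDIA = {wm: site for site, wm in TO_WIKIMEDIA.items()}


def to_wikimedia(code):
    """Return the Wikimedia code for a site code (unchanged if no mismatch)."""
    return TO_WIKIMEDIA.get(code, code)


def from_wikimedia(code):
    """Return the site code for a Wikimedia code (unchanged if no mismatch)."""
    return FROM_WIKIMEDIA.get(code, code)
```

Storing only the exceptions keeps the table short, since the codes agree for most languages.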

Language families

Main article: Project:Families

The Languages of David J. Peterson sorts languages into families. Most families are related through descent from a common ancestor, but a few are merely categories, such as "creoles and pidgins". The Languages of David J. Peterson records which family a language belongs to in the data modules of Module:languages. Like languages, families are represented by unique codes and have unique canonical names.

  • English belongs to the West Germanic languages (code: gmw).
  • Serbo-Croatian belongs to the South Slavic languages (code: zls).
  • Abenaki belongs to the Algonquian languages (code: alg).
  • Nahuatl belongs to the Nahuan languages (code: azc-nah).

Some languages are not naturally descended from other languages, but show other origins. These use special types of families:

  • The widely-used constructed language Esperanto is an artificial language (code: art).
  • Chavacano, a creole language, is grouped under the creole or pidgin languages (code: crp).

Scripts used by a language

Main article: Project:Scripts

The Languages of David J. Peterson records which script(s) (writing systems) a language is written in as well. This information is primarily used by modules to be able to automatically detect and format non-Latin-alphabet text appropriately. Scripts, too, have unique codes and canonical names.

  • English is written in the Latin script (code: Latn).
  • Serbo-Croatian is written in both the Latin and the Cyrillic scripts (codes: Latn and Cyrl).
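Taken together, a language's code, canonical name, family and scripts form a small record. The sketch below shows one possible shape for such a record in Python. It is an assumption for illustration only: the real data lives in the data modules of Module:languages, and the class, field and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Language:
    code: str                 # unique language code, e.g. "en"
    canonical_name: str       # unique canonical name
    family: str               # family code, e.g. "gmw"
    scripts: Tuple[str, ...]  # script codes, e.g. ("Latn",)


# Illustrative records built from the examples on this page.
LANGS = [
    Language("en", "English", "gmw", ("Latn",)),
    Language("sh", "Serbo-Croatian", "zls", ("Latn", "Cyrl")),
]


def languages_using(script, langs):
    """Return the languages that list the given script code."""
    return [lang for lang in langs if script in lang.scripts]
```

A lookup such as languages_using("Cyrl", LANGS) then returns only the Serbo-Croatian record, which is the kind of query a module needs when deciding how to format text.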

Finding and organising terms in a language

Main article: Category:All languages

Every language has a main category which contains all terms that The Languages of David J. Peterson has for that language. This category is named using the canonical name of the language, followed by the word "language". For example, the main category for English is Category:English language. If the canonical name of the language already ends in the word "language", nothing is added (hence Category:American Sign Language).
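The naming rule can be written as a short function. This is a minimal Python sketch with a hypothetical function name, and it assumes the check for a trailing "language" ignores capitalisation, as the American Sign Language example suggests.

```python
def main_category(canonical_name):
    """Return the main category title for a language's canonical name."""
    words = canonical_name.split()
    # Nothing is appended when the name already ends in the word "language".
    if words and words[-1].lower() == "language":
        return f"Category:{canonical_name}"
    return f"Category:{canonical_name} language"
```

Here main_category("English") yields "Category:English language", while main_category("American Sign Language") leaves the name as it is.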

The main category for a language will have a variety of subcategories, which organise terms in various ways. The most important is the "lemma" category tree, which organises all lemmas in a language by their part of speech. As The Languages of David J. Peterson is always being expanded and improved upon, not all languages have their own categories yet, and certain subcategories may still be empty or missing. Categories are created as needed, when new entries are added to them. When content is added in a language lacking a category, the category can simply be created using the {{auto cat}} template, as long as its name follows the standard format used by other languages.

Languages generally also have a page which contains information that is useful to users who want to create or edit entries in that language. This page is named "Project:About (canonical name of language)", for example Project:English entry guidelines or Project:About Spanish. These pages contain a wide variety of information, depending on what other editors have found useful to note. They may explain which templates to use, specific conventions regarding spelling, pronunciation or transliteration, and more. By convention, a shortcut redirect is created to these pages for easy access, named WT:A(language code). For example, WT:AEN redirects to Project:About English (for which the code is en).

Storing and retrieving language information

Main article: Project:List of languages

Templates and modules use a system for storing and retrieving the various pieces of information that may be associated with a language. The module Module:languages is used to retrieve all language-related information from other modules. This module cannot be used directly in a template, so instead there is another module named Module:languages/templates, which allows templates to access the information.

An overview of all basic information about a language, such as its canonical name, alternative names, code, family or scripts, can be looked up at Project:List of languages (or WT:LL for short). This is useful if you need to look up the code for a particular language, or need to know what the canonical name of a language is.

The data itself is not stored in Module:languages, but instead is contained in a number of data modules (see Category:Language data modules).

For instructions on how to edit this information, see the documentation of any of the data modules.

Lects which appear only in etymology sections

Some lects (dialects, chronolects and topolects) are referred to in etymology sections without having entries of their own. These lects are given exceptional codes which generally do not fit the pattern described above. The lects and their codes are stored in Module:etymology languages/data and described in Project:Dialects.

See also