Formal language

This article is about the term formal language as it is used in mathematics, logic and computer science. For information about a mode of expression that is more disciplined or precise than everyday speech, see Register (linguistics).

In mathematics, logic, and computer science, a formal language is a language that is defined by precise mathematical or machine processable formulas.

Like languages in linguistics, formal languages generally have two aspects: syntax and semantics. The syntax of a language is what the language looks like. It is defined by the set of possible expressions that are valid utterances in the language. The semantics of a language are what the utterances of the language mean. This is formalized in various ways, depending on the type of language in question.

The branch of mathematics and computer science which studies exclusively the theory of language syntax is known as formal language theory. In formal language theory, a language is nothing more than its syntax; questions of semantics are not addressed in this specialty.

1 Elementary example
2 Definition and basic terminology
3 Examples
4 Language specification formalisms
5 Operations on languages
6 References
7 See also
8 External links

Elementary example

In mathematics, a formal language consists of two parts, an alphabet and rules of syntax. The alphabet is any set of symbols; the rules of syntax are rules that a string of these symbols must follow if it is to be considered part of the formal language.

As a simple example, consider the alphabet {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, =} together with the following rules of syntax:

Every string that does not contain + or = and does not start with 0 is in the language.
The string consisting of 0 is in the language.
A string containing = is in the language if and only if there is exactly one =, and it separates two strings in the language.
Any number of + symbols can appear in a string of the language as long as each of them is surrounded by symbols other than + or =.
No string is in the language other than those implied by the previous rules.

Under these rules, the string "23+4=555" is in the language, the string "=234=+" is not. This formal language expresses whole numbers, well-formed addition statements, and well-formed addition equalities, but it expresses only what they look like (their syntax), not what they mean (semantics). For instance, nowhere in these rules is it defined that 0 means the number zero, or that + means addition.

The rest of this article is devoted to the basic notions of formal language theory.

Definition and basic terminology

In formal language theory, a language is defined as a possibly infinite set of finite-length sequences of elements drawn from a specified finite set, say $\boldsymbol{A}$ .

The set $\boldsymbol{A}$ is called the alphabet of the language; an element of $\boldsymbol{A}$ is called a symbol or character; a finite sequence of symbols/characters is called a string; an element of the language is called a word.

However, the theory is not concerned with applications at all, and therefore, completely neutral to what symbols and words actually stand for.

For instance, in linguistics, formal language theory can be applied at many different levels of language description simultaneously:

in syntax, which describes how words in the lexicon (or vocabulary) combine to form sentences;
in morphology, which describes how parts of words combine to form words;
in orthography (or spelling), which describes how characters (in the alphabet) combine to form words
in phonology, which describes how phonemes combine to form words.

Examples

As an example of a formal language, an alphabet might be $\left \{ a , b \right \}$ , and a string over that alphabet might be $ababba\,$ .

A typical language over that alphabet, containing that string, would be the set of all strings which contain the same number of symbols $a\,$ and $b\,$ .

The empty word (that is, length-zero string) is allowed and is often denoted by $e\,$ , $\epsilon\,$ or $\Lambda\,$ . While the alphabet is a finite set and every string has finite length, a language may very well have infinitely many member strings (because the length of words belonging to it may be unbounded).

Some more examples:

the set of all words over ${a, b}\,$
the set $\left \{ a^{n}\right\}$ , where $n\,$ is a natural number and $a^n\,$ means $a\,$ repeated $n\,$ times
Finite languages, such as $\{a,b,ab,cba\}\,$ or $\{ac, aca, bba\}\,$
the set of syntactically correct programs in a given programming language; or
the set of inputs upon which a certain Turing machine halts; or,
the set of the longest strings of alphanumeric ASCII characters on this line, (i.e., the set {the, set, of, longest, strings, of, alphanumeric, ASCII, characters, on, this, line, i, e})

Language specification formalisms

Formal language theory rarely concerns itself with particular languages (except as examples), but is mainly concerned with the study of various types of formalisms to describe languages. For instance, a language can be given as

the set of strings that can be generated by some grammar (see Chomsky hierarchy);
the set of strings described or matched by a particular regular expression;
the set of strings accepted by some automaton, such as a Turing machine or finite state automaton;
the set of strings for which some decision procedure (an algorithm that asks a sequence of related YES/NO questions) produces the answer YES.

Typical questions asked about such formalisms include:

What is their expressive power? (Can formalism X describe every language that formalism Y can describe? Can it describe other languages?)
What is their recognizability? (How difficult is it to decide whether a given word belongs to a language described by formalism X?)
What is their comparability? (How difficult is it to decide whether two languages, one described in formalism X and one in formalism Y, or in X again, are actually the same language?)

Surprisingly often, the answer to these decision problems is "it cannot be done at all", or "it is extremely expensive" (with a precise characterization of how expensive exactly). Therefore, formal language theory is a major application area of computability theory and complexity theory.

Operations on languages

Certain operations on languages are common. This includes the standard set operations, such as union, intersection, and complementation. Another class of operation is the elementwise application of string operations.

Examples: suppose $L 1$ and $L 2$ are languages over some common alphabet.

The concatenation ${L}_{1}{L}_{2}\,$ consists of all strings of the form $vw\,$ where $v\,$ is a string from ${L}_{1}\,$ and $w\,$ is a string from ${L}_{2}\,$ .
The intersection ${L}_1 \cap {L}_2$ of ${L}_{1}\,$ and ${L}_{2}\,$ consists of all strings which are contained in both languages
The complement $\neg L$ of a language w.r.t. a given alphabet consists of all strings over the alphabet that are not in the language.

Such operations are used to investigate closure properties of classes of languages. A class of languages is closed under a particular operation when the operation, applied to languages in the class, always produces a language in the same class again. For instance, the context-free languages are known to be closed under union, concatenation, and intersection with regular languages, but not closed under intersection or complementation.

References

A. G. Hamilton, Logic for Mathematicians, Cambridge University Press, 1978, ISBN 0 521 21838 1.
Michael A. Harrison: Introduction to Formal Language Theory, Addison-Wesley, 1978.
Grzegorz Rozenberg, Arto Salomaa, Handbook of Formal Languages: Volume I-III, Springer, 1997, ISBN 3 540 61486 9.
Patrick Suppes, Introduction to Logic, D. Van Nostrand, 1957, ISBN 0 442 08072 7.