Communication is the purposeful exchange of information brought about by the production and perception of signals drawn from a shared system of conventional signs. Most animals use signals to convey vital messages: there's food here, there's a predator nearby, approach, back off, let's mate. Communication helps agents succeed in a partially observable world because they can acquire knowledge that others have observed or inferred. Humans are the most talkative of all species, so computer agents will need to master language if they are to be useful. This chapter examines language models for communication. Deep comprehension of a conversation requires more sophisticated models than simple spam-classification techniques. We begin with grammatical models of sentence phrase structure, add semantics, and then apply the model to machine translation and speech recognition.
N-gram language models are built on sequences of words. The main problem with these models is data sparsity: with a vocabulary of 10^5 words, there are 10^15 trigram probabilities to estimate, and even a corpus of a trillion words is not enough to provide accurate estimates for all of them. Generalization can address the sparsity problem. From the fact that "black dog" is more common than "dog black," together with similar observations, we can infer that adjectives in English tend to come before nouns (whereas adjectives in French tend to come after: "chien noir" is more common). There are exceptions, of course; "galore" is an adjective that follows the noun it modifies. Despite the exceptions, the notion of a lexical category (also known as a part of speech), such as noun or adjective, is a useful generalization.
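The sparsity problem can be seen even in a toy corpus. The sketch below estimates bigram probabilities by maximum likelihood from a handful of words; the corpus, vocabulary, and function names are illustrative assumptions, chosen only to show that most word pairs never occur and therefore receive probability zero.

```python
from collections import Counter

# Toy corpus: most of the |V|^2 possible bigrams never appear in it,
# so their maximum-likelihood probability estimate is zero.
corpus = "the black dog saw the black cat the dog ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); zero for unseen pairs."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("black", "dog"))  # seen in the corpus: nonzero
print(bigram_prob("dog", "black"))  # never seen: estimated as 0.0
```

With a realistic vocabulary, almost every cell of this table is zero, which is why smoothing or the lexical-category generalizations discussed above are needed.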
Grammatical formalisms can be classified by their generative capacity: the set of languages they can represent. Chomsky (1957) distinguishes four classes of grammatical formalisms that differ only in the form of their rewrite rules. The classes form a hierarchy, in which each class can describe all the languages that any less powerful class can describe, plus some additional ones. The hierarchy is listed below, with the most powerful class first:
Recursively enumerable grammars use unrestricted rules: both sides of a rewrite rule may contain any number of terminal and nonterminal symbols, as in a rule such as A B C → D E.
Context-sensitive grammars have a single constraint: the right-hand side must contain at least as many symbols as the left-hand side. A rule such as A X B → A Y B is allowed; the name "context-sensitive" reflects the fact that such a rule says X can be rewritten as Y in the context of a preceding A and a following B.
In context-free grammars (CFGs), the left-hand side of every rule consists of a single nonterminal symbol, so each rule allows that nonterminal to be rewritten as the right-hand side in any context. Although it is now widely accepted that at least some natural languages contain constructs that are not context-free (Pullum, 1991), CFGs remain popular for both natural-language and programming-language grammars.
Regular grammars are the most restricted class. Every rule has a single nonterminal on the left-hand side and, on the right-hand side, a terminal symbol optionally followed by a nonterminal. Regular grammars are equivalent in power to finite-state machines. Because they cannot express constructs such as balanced opening and closing parentheses (a variation of the a^n b^n language), they are unsuitable for programming languages.
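The boundary between regular and context-free languages can be made concrete. Recognizing balanced parentheses requires unbounded counting, which a finite-state machine cannot do but a context-free grammar can. The sketch below (an illustration, not tied to any particular formalism) uses a single counter, which plays the role of a pushdown stack restricted to one symbol:

```python
def balanced(s: str) -> bool:
    """Recognize the context-free language of balanced parentheses.

    A finite-state machine has only finitely many states, so it cannot
    track arbitrarily deep nesting; the unbounded counter here is
    exactly the extra power a context-free grammar provides.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' with no matching '('
                return False
    return depth == 0            # every '(' must have been closed

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```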
Higher up the hierarchy, grammars have greater expressive power, but the algorithms for dealing with them are less efficient. Until the 1980s, linguists concentrated on context-free and context-sensitive languages. Since then there has been a resurgence of interest in regular grammars, driven by the need to process and learn from gigabytes or terabytes of online text quickly, even at the cost of a less thorough analysis. As Fernando Pereira put it, "The older I get, the further down the Chomsky hierarchy I go." Compare Pereira and Warren (1980) with Mohri, Pereira, and Riley (2002) to see what he means (and note that all three of the latter authors now work on large text corpora at Google).
Many competing language models are based on the idea of phrase structure; we describe one of the most prominent, the probabilistic context-free grammar (PCFG). A grammar is a collection of rules that defines a language as a set of allowable strings of words; "probabilistic" means that the grammar assigns a probability to every string. A PCFG rule looks like this: VP → Verb [0.70] | VP NP [0.30]. Here the non-terminal symbols VP (verb phrase) and NP (noun phrase) stand for phrasal categories, while the actual words of the language are called terminal symbols. This rule says that a VP consists of a Verb with probability 0.70, or a VP followed by an NP with probability 0.30.
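A PCFG defines a generative process: repeatedly expand each nonterminal by choosing one of its rules with the stated probability until only terminal words remain. The sketch below implements this for a hypothetical toy grammar; the particular rules, words, and probabilities are illustrative assumptions, not a real grammar of English.

```python
import random

# Toy PCFG (illustrative): each nonterminal maps to a list of
# (right-hand side, probability) pairs, and the probabilities for
# any one nonterminal sum to 1.
PCFG = {
    "S":         [(("NP", "VP"), 1.0)],
    "NP":        [(("Noun",), 0.6), (("Adjective", "Noun"), 0.4)],
    "VP":        [(("Verb",), 0.7), (("Verb", "NP"), 0.3)],
    "Noun":      [(("dog",), 0.5), (("cat",), 0.5)],
    "Adjective": [(("black",), 1.0)],
    "Verb":      [(("saw",), 1.0)],
}

def sample(symbol="S"):
    """Expand a symbol top-down, picking each rule with its probability."""
    if symbol not in PCFG:              # terminal: an actual word
        return [symbol]
    rhss, probs = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs)[0]
    return [word for sym in rhs for word in sample(sym)]

print(" ".join(sample()))  # e.g. "black dog saw cat"
```

The probability of a whole string under the grammar is the product of the probabilities of the rules used to derive it, which is what lets a PCFG rank competing strings or parses.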
We have now defined a grammar for a small fragment of English suitable for communicating with agents in the wumpus world. This language is referred to as E0.