AI | Phrase and Grammar structure in Natural Language
Communication is the purposeful exchange of information brought about by the production and perception of signals drawn from a shared system of conventional signs. Most animals use signals to convey important messages: there's food here, there's a predator nearby, approach, recede, let's mate. In a partially observable world, communication can help agents succeed because they can learn information that others have observed or inferred. Humans are the most talkative of all species, and if computer agents are to be useful, they will need to master language. This chapter examines language models for communication. Deep understanding of a conversation requires more sophisticated models than the simple techniques that suffice for spam classification. We begin with grammatical models of the phrase structure of sentences, add semantics, and then apply the models to machine translation and speech recognition.
N-gram language models were built on word sequences. The main problem with these models is data sparsity: with a vocabulary of 10^5 words, there are 10^15 trigram probabilities to estimate, and even a trillion words of training data will not be enough to provide accurate estimates for all of them. We can address the sparsity problem through generalization. From the fact that "black dog" is more common than "dog black", and from similar observations, we can infer that adjectives in English tend to come before nouns (whereas adjectives in French tend to come after nouns: "chien noir" is more common than "noir chien"). There are exceptions, of course; "galore" is an adjective that follows the noun it modifies. Despite the exceptions, the notion of a lexical category (also known as a part of speech) such as noun or adjective is a useful generalization—useful in and of itself, but even more so when we string together lexical categories to form syntactic categories such as noun phrase or verb phrase, and combine these syntactic categories into trees representing the phrase structure of sentences: nested phrases, each marked with a category.
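To make the sparsity point concrete, the sketch below (an illustrative toy, not from the text) counts how many distinct trigrams actually occur in a tiny corpus versus how many trigrams the vocabulary permits in principle:

```python
from collections import Counter

# Toy illustration of n-gram sparsity: even a modest vocabulary
# permits far more trigrams than any corpus will actually attest.
corpus = ("the black dog saw the black cat and "
          "the black cat saw the black dog").split()

vocab = sorted(set(corpus))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

possible = len(vocab) ** 3   # every trigram the vocabulary allows
observed = len(trigrams)     # distinct trigrams actually seen

print(f"vocabulary size:   {len(vocab)}")
print(f"possible trigrams: {possible}")
print(f"observed trigrams: {observed}")
```

With a 6-word vocabulary there are already 216 possible trigrams, of which this corpus attests only 10; at a realistic vocabulary size the gap becomes astronomical, which is exactly why generalization over categories is needed.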
Grammatical formalisms can be classified by their generative capacity: the set of languages they can represent. Chomsky (1957) describes four classes of grammatical formalisms, distinguished only by the form of their rewrite rules. The classes form a hierarchy, with each class able to describe all the languages that a less powerful class can describe, as well as some additional languages. The hierarchy is listed below, most powerful class first:
Recursively enumerable grammars use unrestricted rules: both sides of a rewrite rule can contain any number of terminal and nonterminal symbols, as in the rule A B C → D E. These grammars are equivalent to Turing machines in their expressive power.
Context-sensitive grammars are restricted only in that the right-hand side must contain at least as many symbols as the left-hand side. A rule such as A S B → A X B says that an S can be rewritten as an X in the context of a preceding A and a following B, hence the name "context-sensitive." Context-sensitive grammars can represent languages such as a^n b^n c^n (a sequence of n copies of a followed by the same number of b's and then the same number of c's).
In context-free grammars (or CFGs), the left-hand side consists of a single nonterminal symbol. Thus each rule licenses rewriting the nonterminal as the right-hand side in any context. CFGs are popular for natural-language and programming-language grammars, although it is now widely accepted that at least some natural languages have constructions that are not context-free (Pullum, 1991). Context-free grammars can represent a^n b^n, but not a^n b^n c^n.
Regular grammars are the most restricted class. Every rule has a single nonterminal on the left-hand side and a terminal symbol optionally followed by a nonterminal on the right-hand side. Regular grammars are equivalent in power to finite-state machines. They are poorly suited to programming languages because they cannot represent constructs such as balanced opening and closing parentheses (a variation of the a^n b^n language). The closest they can come is representing a*b*, a sequence of any number of a's followed by any number of b's.
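The differences in power can be made concrete with hand-written recognizers (a sketch written for this purpose, not grammars themselves): a regular expression suffices for a*b*, but recognizing a^n b^n requires an unbounded count, and a^n b^n c^n requires matching two counts, which no context-free grammar can do:

```python
import re

def in_a_star_b_star(s: str) -> bool:
    # Regular: any number of a's followed by any number of b's.
    return re.fullmatch(r"a*b*", s) is not None

def in_an_bn(s: str) -> bool:
    # Context-free but not regular: equal counts, all a's first.
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def in_an_bn_cn(s: str) -> bool:
    # Context-sensitive but not context-free: three equal blocks.
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

print(in_a_star_b_star("aaabb"))   # counts need not match
print(in_an_bn("aaabb"))           # rejected: counts differ
print(in_an_bn_cn("aabbcc"))
```

The point of the hierarchy is that "aaabb" is in a*b* but not in a^n b^n: the regular language ignores the balance that the context-free language enforces.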
The grammars higher up in the hierarchy have more expressive power, but the algorithms for dealing with them are less efficient. Until the 1980s, linguists concentrated on context-free and context-sensitive languages. Since then there has been a resurgence of interest in regular grammars, driven by the need to process and learn from gigabytes or terabytes of online text quickly, even at the cost of a less complete analysis. As Fernando Pereira put it, "The older I get, the further down the Chomsky hierarchy I go." To see what he means, compare Pereira and Warren (1980) with Mohri, Pereira, and Riley (2002) (and note that these three authors all now work on large text corpora at Google).
There have been many competing language models based on the idea of phrase structure; we will describe one of the most popular, the probabilistic context-free grammar (PCFG). A grammar is a collection of rules that defines a language as a set of allowable strings of words; "probabilistic" means that the grammar assigns a probability to every string. Here is a PCFG rule:

VP → Verb [0.70]
   | Verb NP [0.30]
Here VP (verb phrase) and NP (noun phrase) are non-terminal symbols. The grammar also refers to actual words, which are called terminal symbols. This rule says that with probability 0.70 a verb phrase consists solely of a Verb, and with probability 0.30 it is a Verb followed by an NP.
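One way to see what the rule probabilities mean is to sample strings from the grammar. The sketch below assumes a toy grammar built around the verb-phrase rule described above; the NP rule and the lexicon (the words and their probabilities) are invented here purely for illustration:

```python
import random

# A toy PCFG: each nonterminal maps to a list of
# (right-hand side, probability) pairs. The VP rule mirrors the
# one described above; the rest is an illustrative assumption.
grammar = {
    "VP":   [(["Verb"], 0.70), (["Verb", "NP"], 0.30)],
    "NP":   [(["Noun"], 1.00)],
    "Verb": [(["sees"], 0.50), (["smells"], 0.50)],
    "Noun": [(["wumpus"], 0.50), (["gold"], 0.50)],
}

def sample(symbol, rng):
    """Expand a symbol, choosing among its rules by rule probability."""
    if symbol not in grammar:          # terminal: an actual word
        return [symbol]
    rhss, probs = zip(*grammar[symbol])
    rhs = rng.choices(rhss, weights=probs)[0]
    return [word for sym in rhs for word in sample(sym, rng)]

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample("VP", rng)))
```

About 70% of sampled verb phrases will be a bare verb and 30% a verb plus noun phrase; the probability of a whole string is the product of the probabilities of the rules used to derive it.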
We now define a grammar for a tiny fragment of English that is suitable for communicating with agents in the wumpus world. We call this language E0. Later sections improve on E0 to bring it slightly closer to real English. We are unlikely ever to devise a complete grammar of English, if only because no two people would agree entirely on what constitutes valid English.