Natural Language Programming

Last Updated : 19 Sep, 2023

Having programmed for many years in many languages, I often find myself thinking in English pseudo-code, then I translate my thoughts into whatever artificial syntax I’m working with at the time. So one day I thought, “Why not simply code at a natural language level and skip the translation step?” My elder son (also a programmer) and I talked it over, and we decided to test the theory. Specifically, we wanted to know:

1. Is it easier to program when you don’t have to translate your natural-language thoughts into an alternate syntax?

2. Can natural languages be parsed in a relatively “sloppy” manner (as humans apparently parse them) and still provide a stable enough environment for productive programming?

3. Can low-level programs (like compilers) be conveniently and efficiently written in high level languages (like English)?

And so we set about developing a Plain English compiler (in Plain English) in the interest of answering those questions. And we are happy to report that we can now answer each of those three questions, from direct experience, with a resounding, “Yes!”

The Theory

Our parser operates, we believe, something like the parsing centers in the human brain. Consider, for example, a father saying to his baby son…

“Want to suck on this bottle, little guy?”

…and the kid hears…

“blah, blah, SUCK, blah, blah, BOTTLE, blah, blah.”

…but he properly responds because he’s got a “picture” of a bottle in the right side of his head connected to the word “bottle” on the left side, and a pre-existing “skill” near the back of his neck connected to the term “suck.” In other words, the kid matches what he can with the pictures (types) and skills (routines) he’s accumulated, and simply disregards the rest. Our compiler does very much the same thing, with new pictures (types) and skills (routines) being defined — not by us, but — by the programmer, as he writes new application code.

The Practice

A typical type definition looks like this:

A polygon is a thing with some vertices.

Internally, the name “polygon” is now associated with a dynamically-allocated structure that contains a doubly-linked list of vertices. “Vertex” is defined elsewhere (before or after this definition) in a similar fashion; the plural is automatically understood.

A typical routine looks like this:

To append an x coord and a y coord to a polygon:
Create a vertex given the x and the y.
Append the vertex to the polygon’s vertices.

Note that formal names (proper nouns) are not required for parameters and variables. This, we believe, is a major insight. A real-world chair or table is never (in normal conversation) called “c” or “myTable” — we refer to such things simply as “the chair” or “the table”. Likewise here: “the vertex” and “the polygon” are the most natural names for these variables.

Note also that spaces are allowed in routine and variable names (like “x coord”). It’s surprising that all languages don’t support this feature; this is the 21st century, after all. Note also that “nicknames” are also allowed (such as “x” for “x coord”). And that possessives (“polygon’s vertices”) are used in a very natural way to reference fields within records.

Note, as well, that the word “given” could have been “using” or “with” or any other equivalent since our sloppy parsing focuses on the pictures (types) and skills (routines) needed for understanding, and ignores, as much as possible, the rest.

Like a Math Book

At the lowest level, things look like this:

To add a number to another number:
Intel $8B85080000008B008B9D0C0000000103.

Note that in this case we have both the highest and lowest of languages — English and machine code (in hexadecimal) — in a single routine. The insight here is that a program should be written primarily in a natural language, with snippets of code in more appropriate syntax as (and only as) required. Like a typical math book: mostly natural language with formula snippets interspersed.

We hope someday the technology will be extended, at the high end, to include Plain Spanish, and Plain French, and Plain German, etc; and at the low end to include “snippet parsers” for the most useful, domain-specific languages.

An Objection Answered

Now perhaps you’re thinking natural language programming is a silly idea. But have you considered the fact that most of the code in most programs does simple stuff like “move this over there” and “show that on the screen” — things that can be most conveniently and most naturally expressed in a natural language? Let’s consider an example we can examine in detail:

Our compiler — a sophisticated Plain-English-to-Executable-Machine-Code translator — has 3,050 imperative sentences in it.

1,306 of those (about 42%) are conditional statements, and at least half of those are trivial things like these:

If the item is not found, break.
If the compiler's abort flag is set, exit.

The remainder of those conditional statements are slightly more complex, but all of them fit on a single line. Here are a couple of the longer ones:

If the length is 4, attach $FF32 to the fragment's code; exit.
If the rider's token is any numeric literal, compile the literal given the rider; exit.

Of the remaining sentences:

272 (about 9%) are simple assignment statements:

Put the type name into the field's type name.

202 (about 7%) are just the infrastructure for various loops:

Loop.
Get a field from the type's fields.
[ other stuff here]
Repeat.

183 (6%) simply add something to the end of this or that list, like so:

Add the field to the type's fields.

164 (about 5%) are trivial statements used to return boolean results, start and stop various timers, show the program’s current status, and write interesting things to the compiler’s output listing.

Say no.
Say yes.
Set the variable's compiled flag.
Start the compiler's timer.
Stop the compiler's timer.
Show status "Compiling...".
List the globals in the compiler's listing.

119 (about 4%) advance the focus in the source code, sentences like:

Bump the rider.
Move the rider (code rules).

92 (about 3%) are used to create, destroy and keep internal indexes up to date, sentences like:

Create the type index using 7919 for the bucket count.
Index the type given the type's name.
Destroy the type index.

58 (about 2%) are used to find things in various lists:

Find a variable given the name.

37 (about 1%) are calls to various conversion routines:

Convert the rider's token to a ratio.

31 (about 1%) are used to generate actual machine code (plus those that appear in conditional statements, as above):

Attach $E8 and the address to the fragment.

And that accounts for 80% of the code in our compiler.

Only 57 of the remaining sentences (less than 2% of the whole) are mathematical in nature, a line here and there like these:

Add 4 to the routine's parameter size.
Subtract the length from the local's offset.
Multiply the type's scale by the base type's scale.
Calculate the length of the field's type.
Round the address up to the nearest multiple of 4096.

And the rest are not formulaic at all. Stuff like:

Copy the field into another field.
Append the fragment to the current routine's fragments.
Abort with "I was hoping for a definition but all I found was '" then the token.
Initialize the compiler.
Remove any trailing backslashes from the path name.
Reduce the monikette's type to a type for utility use.
Eliminate duplicate nicknames from the type's fields.
Prepend "original " to the term's name.
Extend the name with the rider's token.
Unquote the other string.
Read the source file's path into the source file's buffer.
Generate the literal's name.
Extract a file name from the compiler's abort path.
Write the compiler's exe to the compiler's exe path.
Swap the monikettes with the other monikettes.
Skip any leading noise in the substring.
Scrub the utility index.
Fill the compiler's exe with the null byte given the compiler's exe size.
Position the rider's token on the rider's source.
Pluralize the type's plural name.
Link.
Finalize the compiler.
Check for invalid optional info on the type.

And that’s why we say that most of what most programs do is easy stuff, stuff that can be conveniently expressed in a natural language. And that, in turn, is why we like programming in Plain English: the thoughts in our heads are typed in as Plain English “pseudo code” and, with a tweak here and there, that pseudo code actually compiles and runs. And is self-documenting, to boot.

Another Objection Answered

You may be thinking that natural language is just too verbose for programming. But is it really that bad? Let’s consider a couple of examples. In a traditional programming language, we might draw a box using a statement like this:

substring.draw ( box, color, source.text.font, source.text.alignment ) ;

Which is 10 words and 11 punctuation marks: 21 total elements.

The Plain English equivalent would be:

Draw the substring in the box with the color and the source's text's font and alignment.

Which is 16 words and 3 punctuation marks: 19 total elements.

Admittedly, the Plain English version requires a few more easy-to-type alphabetic characters (it’s difficult to say exactly how many since traditional coders put spaces in different places); but that’s a small price to pay for not having to learn (or think in) an artificial syntax.

Here’s another example:

if ( ! source.colorized ( ) ) color = black ;

Which is 5 words and 8 punctuation marks: 13 total elements.

Compared with the Plain English:

If the source is not colorized, put black into the color.

Which is 11 words and 2 punctuation marks: 13 total elements.

Again, it’s mostly a matter of whether you like to type words or (specialized) punctuation. And whether you like to think in two different syntactical and grammatical forms simultaneously. And whether you want your code to be self-documenting. And whether you want code that’s friendly for beginners. And
whether you want to code in a language (like English) that will still be in common use 100 years
from now. Personally, we think you may have lost some human perspective if you’ve come to think that “(!source.colorized())” is a good way of saying anything!

The Prototype

If you’re interested, you can download the whole shebang here:

It’s a small Windows program, less than a megabyte in size. But it’s a complete development environment, including a unique interface, a simplified file manager, an elegant text editor, a handy hexadecimal dumper, a native-code-generating compiler/linker, and even a wysiwyg page layout facility (that we used to produce the documentation). It is written entirely in Plain English. The source code (about 25,000 sentences) is included in the download. No installation is necessary; just unzip. Start with the “instructions.pdf” in the “documentation” directory and before you go ten pages you won’t just be writing “Hello, World!” to the screen, you’ll be re-compiling the entire thing in itself (in less than three seconds on a bottom-of-the-line machine from Walmart).

Thanks for your time and interest.

Gerry Rzeppa
Grand Negus of the Osmosian Order of Plain English Programmers