Natural Language Processing Done Wrong

Recently, I have been reading novels in different languages in my spare time. After reading a few, one question came to mind: what makes a person understand human language? I believe linguists have already discussed it, and perhaps solved it, but it is worth thinking about when we teach computers those human languages.

When implementing a compiler, how do we translate a programming language into machine language? We split the text into tokens, check for syntax errors, and report any errors to the programmer. Once the source file has been parsed correctly, we assume that the computer “understands” the program. The program can then be translated into machine code and executed on the target machine. This process is much like how humans understand languages, and since programming languages contain only a small number of words, they are easy for both humans and computers to understand.
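To make those steps concrete, here is a minimal sketch of my own, not taken from any real compiler. It assumes a made-up toy language of integer arithmetic and a hypothetical stack machine with PUSH, ADD, SUB, MUL and DIV instructions. The names tokenize, compile_expr and run are invented for this illustration.

```python
import operator
import re


def tokenize(source):
    """Split the source text into (kind, value) tokens."""
    tokens = []
    for match in re.finditer(r"(\d+)|(\S)", source):
        number, other = match.groups()
        if number is not None:
            tokens.append(("NUM", int(number)))
        elif other in "+-*/()":
            tokens.append(("OP", other))
        else:
            raise SyntaxError(f"unexpected character {other!r}")
    return tokens


def compile_expr(tokens):
    """Check the syntax and emit instructions for the hypothetical stack machine."""
    pos = 0
    code = []

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", None)

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        kind, value = peek()
        if kind == "NUM":
            advance()
            code.append(("PUSH", value))
        elif (kind, value) == ("OP", "("):
            advance()
            expr()
            if peek() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            advance()
        else:
            raise SyntaxError(f"unexpected token {value!r}")

    def term():
        factor()
        while peek() in (("OP", "*"), ("OP", "/")):
            op = peek()[1]
            advance()
            factor()
            code.append(("MUL",) if op == "*" else ("DIV",))

    def expr():
        term()
        while peek() in (("OP", "+"), ("OP", "-")):
            op = peek()[1]
            advance()
            term()
            code.append(("ADD",) if op == "+" else ("SUB",))

    expr()
    if peek()[0] != "EOF":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return code


# Integer-only toy machine, so DIV is floor division.
BINARY_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "DIV": operator.floordiv,
}


def run(code):
    """Execute the instructions on a simple stack."""
    stack = []
    for instruction in code:
        if instruction[0] == "PUSH":
            stack.append(instruction[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[instruction[0]](left, right))
    return stack.pop()


program = compile_expr(tokenize("2 * (3 + 4)"))
print(program)
print(run(program))
```

Feeding it something like "2 + * 3" raises a SyntaxError instead of producing instructions, which is the "report it to the programmer" step. The whole language fits in a handful of rules, and that is exactly why the machine can be said to "understand" it.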

But human languages are different. They contain thousands of words describing the wonderful world around us. They should, and only can, be understood with reality in mind. Current NLP methods cover only the parsing step: splitting the text into words and looking those words up in a dictionary. The partly parsed text is then thrown into a black box of pre-trained models. Those models are the result of statistical computations over data. Traditional classifiers, modern neural networks, or even more advanced techniques can only extract features from the text and try to find patterns in them. That is not how we do it!
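As a deliberately crude sketch of what I mean, assume a hypothetical four-word dictionary and the simplest statistical feature there is, a count of how often each word appears. Real systems are far more elaborate than this; VOCAB, split_text, look_up and features are names made up for the example.

```python
from collections import Counter

# Hypothetical toy dictionary mapping known words to integer ids.
VOCAB = {"the": 0, "dog": 1, "man": 2, "bit": 3}
UNKNOWN = -1  # id used for any word that is not in the dictionary


def split_text(text):
    """Split the text into lowercase words on whitespace."""
    return text.lower().split()


def look_up(words):
    """Look each word up in the dictionary."""
    return [VOCAB.get(word, UNKNOWN) for word in words]


def features(text):
    """Bag-of-words features: how many times each word id occurs."""
    return Counter(look_up(split_text(text)))


first = features("The dog bit the man")
second = features("The man bit the dog")
print(first == second)  # True
```

To these features the two sentences are identical, although anyone who has met a dog knows they describe very different events.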

How do we understand a piece of text? We use our knowledge to understand it! We are alive and living on this planet, and we do the things that everyone does. We eat our meals, brush our teeth, walk on the earth, and sleep at night. This knowledge is partly written in our DNA and partly taught to us by others. We store it in our brains and use it when we need it. Do you see it?

That is the hard part of teaching machines human languages. We need to tell them everything we know about this world. Some of that knowledge has never appeared in any book in human history, and some of it is hard to teach to computers! Here comes the loop: they cannot read books because they don’t understand human languages, and without the knowledge in those books, they cannot learn to understand languages! We can only wait for them to learn human languages gradually, like babies, or encode reality into their programs.