Natural Language Processing has turned out to be a very difficult challenge. One of the reasons is the way language has evolved. An advantage the human brain has is the availability of thousands of classifiers (read: neurons) making decisions. We probably understand a sentence written in natural language only when the outputs of these classifiers agree. Some of the neurons are possibly checking whether a sentence makes sense in the currently understood form (context information). These facilities, however, are currently unavailable to a computer.
Maybe we can use the procedures of statistical machine learning here. Here are the details of a system I wish to suggest.
- As everyone understands, the English language (possibly all languages apart from Sanskrit) has some disadvantages: a word can mean different things in different contexts (for example, "table" in a table of contents versus a dining table), multiple words can represent the same meaning (synonyms), and a set of words can represent a single concept ("rear view mirror"), etc. One way to simplify the mess is to use an intermediate language. Some of the properties of this intermediate language should be:
- One word, one meaning.
- One meaning, one word (all synonyms are represented by one word).
- Every set of words that means something is condensed into one word. A word in the intermediate language is a concept and not necessarily an English word, but it makes sense to keep it as close to English as possible to reduce the effort, since we will be using English most of the time.
- Such an intermediate language can then be converted into any form of representation. One such form is an "association graph of concepts": every noun (or a noun with adjectives) forms a node, and verbs form the edges. For example, "Ram is a good cat" translates to a node for "Ram", a node for "good cat", and a node for "cat", with an "is" edge from "Ram" to "good cat" and another "is" edge from "good cat" to "cat" (I guess this last one can be added by default). A base graph can be built in advance, say by reading millions of documents, to provide the base knowledge.
- Once such a graph is built, it can be used for word-sense disambiguation, since it would have a strong set of links between related concepts.
- The graph can form a common base for all human languages, with an "intermediate to particular language" translator built on top of it, so machine translation could be made to work this way.
- When the computer starts reading a new document, the concepts of the document will cluster near a particular part of the graph. This idea could possibly be used for text summarization and sentiment analysis.
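The intermediate-language properties above (one word per meaning, synonyms folded together, multiword concepts condensed into one token) could be sketched as a simple normalizer. The lexicon entries below are invented examples, not a real resource:

```python
# Hypothetical lexicon for the intermediate language; the entries
# here are illustrative assumptions, not real data.
SYNONYMS = {
    "couch": "sofa",
    "settee": "sofa",
}
PHRASES = {
    ("rear", "view", "mirror"): "rear_view_mirror",
}

def normalize(tokens):
    """Map English tokens to canonical concept tokens: collapse
    multiword phrases first, then fold synonyms onto one word."""
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        for phrase, concept in PHRASES.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(concept)   # one concept, one word
                i += len(phrase)
                matched = True
                break
        if not matched:
            word = tokens[i].lower()
            out.append(SYNONYMS.get(word, word))
            i += 1
    return out
```

With these toy entries, `normalize(["the", "rear", "view", "mirror"])` yields `["the", "rear_view_mirror"]`, and `"couch"` and `"settee"` both normalize to `"sofa"`. A real system would need a much larger lexicon (something like WordNet) in place of these hand-written tables.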
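The "association graph of concepts" described above could be sketched as follows, using the "Ram is a good cat" example. The class and method names are my own placeholders, not a fixed design:

```python
from collections import defaultdict

class ConceptGraph:
    """Minimal sketch: nouns (with adjectives) are nodes, verbs are
    labelled edges between them."""

    def __init__(self):
        # node -> set of (verb, target node) pairs
        self.edges = defaultdict(set)

    def add(self, subject, verb, obj):
        self.edges[subject].add((verb, obj))

    def neighbors(self, node):
        return self.edges[node]

g = ConceptGraph()
# "Ram is a good cat" -> Ram --is--> "good cat", plus the default
# "is" link from the adjective-qualified noun to the bare noun.
g.add("Ram", "is", "good cat")
g.add("good cat", "is", "cat")
```

After these two insertions, `g.neighbors("Ram")` contains `("is", "good cat")`. Disambiguation would then amount to asking which part of the graph a new document's concepts link into most densely, though that query is not implemented in this sketch.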
Anyway, this is still at the idea stage. If we ever make any progress on such a system, I will post it here.