
Text Mining

Most of the world's documents, and thus most of its data, are written as text. There has been a lot of work in data mining on numbers and database records, but much less on mining text. Why?
The answer is that text is hard for computers to understand. This seems odd, as almost every human understands language. Perhaps this is one of the chief differences between computers, in their current state, and people. (Turing Test) (Semantics and Context)
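As a small illustration of why word-level processing falls short of understanding, the sketch below counts the words in two sentences while ignoring their order. It is not a method from this course; the function name `bag_of_words` and the example sentences are assumptions made for illustration. The two sentences mean opposite things to a person, yet their word counts are identical.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count each word, ignoring case and word order (hypothetical helper)."""
    return Counter(sentence.lower().split())


# Two sentences with the same words but opposite meanings.
a = "The dog bit the man"
b = "The man bit the dog"

# The counts match exactly, so this representation cannot tell them apart.
same_counts = bag_of_words(a) == bag_of_words(b)
print(same_counts)  # True
```

A program that sees only these counts has lost who did what to whom, which is the kind of information that semantics and context supply.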

Linguistics

Fortunately, linguistics has a long history, and this work has helped us develop systems for natural language processing (both understanding and generation). More recently (over the last 50 years), there has been a lot of work on getting computers to process language. This work often uses linguistics, but it also takes advantage of the strengths of the computer: fast processing and a lot of memory.
This is an active area in both industry and research. To work in it, you need the basic skills, and you also need to know the limits of the technology.