JFK: The Man, the Airport, the Text Analytics Nightmare
Extracting meaning from words is about much more than a statistical numbers game, especially when it comes to Big Text analytics.
Meaning can be communicated in a million different ways. Consider idioms. Idioms alone are estimated to make up 40 to 60 percent of text content. While "Joe kicked the ball" and "Joe kicked the bucket" are structurally identical, they have very, very different meanings. Just because a system understands the first sentence doesn't mean it will automatically understand the second. Far from it.
There are just some things that require human intelligence to decode. That's where linguistics comes in. A linguistics-centric approach to text analytics applies layers upon layers of human-programmed resources to extract actual meaning from unstructured content.
So, what does JFK have to do with all of this? Those three simple letters sit at the center of a few clear examples of both the limitations of traditional, purely statistical platforms and the potential of a hybrid approach that includes a heavy dose of human-powered linguistics.
JFK versus Traditional Text Analytics
Analytics platforms that rely on statistics and machine learning alone would have difficulty properly identifying the entity “JFK” in the phrase “JFK to Dallas.” Without an understanding of the contextual meaning of “to,” the proximity of “JFK” to “Dallas” could lead the platform to assume the entity “JFK” refers to the president, not the airport.
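To make that failure mode concrete, here is a minimal, purely illustrative Python sketch. The sense labels, the COOCCURRENCE table, and every count in it are invented for the example; they stand in for statistics a platform would learn from a corpus.

```python
# Toy sketch (all names and counts are hypothetical): a purely
# co-occurrence-based disambiguator scores each candidate sense of "JFK"
# by how often it appears near the surrounding words in training text.
# Because "Dallas" co-occurs heavily with the president in historical
# coverage, the naive score picks the wrong sense.

# Hypothetical co-occurrence counts from a training corpus.
COOCCURRENCE = {
    ("JFK:president", "Dallas"): 9400,   # assassination coverage
    ("JFK:airport", "Dallas"): 2100,
    ("JFK:president", "to"): 800,        # "to" carries no weight here
    ("JFK:airport", "to"): 700,
}

def naive_disambiguate(context_words):
    """Pick the sense of 'JFK' with the highest total co-occurrence."""
    senses = ("JFK:president", "JFK:airport")
    def score(sense):
        return sum(COOCCURRENCE.get((sense, w), 0) for w in context_words)
    return max(senses, key=score)

print(naive_disambiguate(["to", "Dallas"]))  # -> JFK:president (wrong!)
```

Raw proximity has no way to register that "to" marks a travel route; the statistics simply outvote the context.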
Let's see how applying linguistics can help offset this problem with another example. Take these three sentences that say the same thing:
- John F. Kennedy delivered his speech in Berlin at 0700Z on 26 June 1963.
- JFK gave a speech in Berlin on June 26, 1963 at 8.
- JFK spoke in Berlin on June 26, 1963 at 8.
Some challenges for a purely statistical system? "Gave" is the verb, but only the full phrase "gave a speech" supplies the meaning. If the words aren't analyzed together, the meaning of the sentence is lost. "Gave a speech" is also more or less equivalent to "spoke." A platform needs to know that "gave a speech" is a single unit and analyze it as such.
Additionally, while JFK can be an acronym for everything from a U.S. president to an airport to Justice for Khojaly (and more), the phrase "gave a speech" implies that the subject of the sentence must be a human being.
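Here is one way to picture those two ideas in code. This is a hedged, minimal sketch: the PHRASE_LEXICON, the SENSES table, the resolve_subject helper, and the class names are hypothetical stand-ins for the far larger resources a real platform would carry.

```python
# Illustrative sketch only: a tiny phrase lexicon plus one selectional
# rule. Multiword units are mapped to a canonical predicate together
# with what that predicate implies about its subject.
PHRASE_LEXICON = {
    ("gave", "a", "speech"):        {"predicate": "speak", "subject_class": "HUMAN"},
    ("delivered", "his", "speech"): {"predicate": "speak", "subject_class": "HUMAN"},
    ("spoke",):                     {"predicate": "speak", "subject_class": "HUMAN"},
}

# Candidate senses of the surface form "JFK" with their semantic classes.
SENSES = {"JFK": [("John F. Kennedy", "HUMAN"), ("JFK Airport", "LOCATION")]}

def resolve_subject(subject, verb_phrase):
    """Use the verb phrase's selectional restriction to pick a sense."""
    entry = PHRASE_LEXICON.get(tuple(verb_phrase))
    if entry is None:
        return None
    for name, sem_class in SENSES.get(subject, []):
        if sem_class == entry["subject_class"]:
            return name, entry["predicate"]
    return None

print(resolve_subject("JFK", ["gave", "a", "speech"]))
# -> ('John F. Kennedy', 'speak'): the human-subject constraint rules
#    out the airport reading, and "gave a speech" normalizes to "speak".
```

Treating "gave a speech" as one unit does double duty: it normalizes the paraphrases to a single predicate and brings the human-subject constraint along with it.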
Statistical analysis is essentially unable to capture this kind of knowledge and extend it to new content. Even with machine learning in place, complex meaning has to be annotated in a training set up front, and huge amounts of data must be tagged appropriately.
Remember how much meaning is packed into the word "to" within the phrase "JFK to Dallas"? Few companies have access to the computing power or the people needed to build language models exhaustive enough to drive comprehensive, consistent results. And neural networks are notoriously fussy: if a behavior isn't correct, there is no way to know how a tweak in one area will affect the performance of the system as a whole.
Using Linguistics for Big Text Analytics
Traditional AI and machine learning methodology relies on the "bag of words" approach, which treats letters, words, and phrases as mere symbols, and in doing so strips valuable context from documents.
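That loss is easy to demonstrate. The short Python sketch below builds bag-of-words counts for the two idiom sentences from earlier; it is a toy stand-in for real vectorization, but the blind spot it shows is the real one.

```python
# The two sentences differ by a single token, so their bag-of-words
# vectors are nearly identical even though the meanings are worlds apart.
from collections import Counter

ball   = Counter("joe kicked the ball".split())
bucket = Counter("joe kicked the bucket".split())

shared = sum((ball & bucket).values())   # tokens in common: 3
total  = sum((ball | bucket).values())   # distinct tokens overall: 5
print(f"overlap: {shared}/{total}")      # -> overlap: 3/5

# Nothing in these counts records that "kicked the bucket" is an idiom
# meaning "died"; the symbols carry no context.
```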
A resource-backed linguistics method, however, implements an extensive four-step process (sketched in code after this list), applying:
- a comprehensive dictionary with all possible meanings for millions of words and phrases,
- large semantic dictionaries with hundreds of semantic classes of known entities in all forms,
- additional grammars and dictionaries of synonyms, paraphrases, and idioms (including local grammars such as date/time phrasing), and
- millions of complex parsing rules that identify meaning and relationships syntactically.
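As promised above, here is a deliberately tiny Python sketch of how those layers might combine on the earlier "JFK to Dallas" example. Every table and rule in it is hypothetical: SEMANTIC_CLASSES and parse_route stand in for semantic dictionaries of millions of entries and millions of parsing rules.

```python
# One semantic dictionary and one parsing rule, standing in for the
# full resource stack described in the list above.
SEMANTIC_CLASSES = {
    "JFK":    [("JFK Airport", "AIRPORT"), ("John F. Kennedy", "PERSON")],
    "Dallas": [("Dallas, TX", "CITY")],
}

def parse_route(text):
    """Parsing rule: in '<X> to <Y>', if Y is a place, read X as a place."""
    left, _, right = text.partition(" to ")
    right_senses = SEMANTIC_CLASSES.get(right, [])
    if any(cls in ("CITY", "AIRPORT") for _, cls in right_senses):
        # The contextual meaning of "to" selects a place reading of X.
        for name, cls in SEMANTIC_CLASSES.get(left, []):
            if cls in ("CITY", "AIRPORT"):
                return {"from": name, "to": right_senses[0][0]}
    return None

print(parse_route("JFK to Dallas"))
# -> {'from': 'JFK Airport', 'to': 'Dallas, TX'}
```

Where the statistics-only system had to guess from proximity and guessed wrong, a single syntactic rule reads the contextual meaning of "to" directly and lands on the airport.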
The result? A platform that reads like a human, only faster.