Boulder’s Platform is Not a One-Tech Pony
We have our own AI technology and secret sauce, but to be clear, we are an equal-opportunity user of machine learning and text analytics technologies. We use our own Python programs for gathering data from around the web, Diffbot for extracting metadata from web sources, custom data-cleaning routines, stop-word dictionaries, separate databases indexed to companies and people, entity extraction, sentiment algorithms, our own tone algorithms, and multiple grade-level and psychometric linguistic libraries. Of course, I’m leaving out some of the tools we use to create and organize the attributes that drive our dashboards, but here’s a peek under the hood via some comparisons.
Comparison of Our Tech to Others
Even though we use Natural Language Processing (NLP) and other core language algorithms, I am often asked how our approach differs from Boolean- and NLP-powered solutions, the two that come up most often. Here’s a brief explanation from an application perspective.
Search/Boolean Logic – Boolean operators are very precise in that they’re basically a set of math/logic operations on words. If a word is in a document or paragraph, the search returns a “hit”. As you refine the expression, you can eliminate false positives with exclusions (i.e., if it has “Charlie” then “yes”, but not if it also has “Hof”). This is a time-consuming, iterative process. It also doesn’t match misspelled words, so relevant hits can be missed. And it is subject to the developer’s bias: the Boolean string is built from what the developer thinks is in the corpus, so if a word isn’t in the string, that hit is missed completely. The result is binary, 0 or 1; there is no “almost” capability.
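To make the mechanics concrete, here is a minimal Python sketch of strict Boolean matching with exclusion terms. The function name and example terms are illustrative, not part of any product described above; it simply demonstrates the 0-or-1 behavior and the misspelling blind spot.

```python
import re

def boolean_hit(text, required, excluded):
    """Strict Boolean match: True only if every required term appears
    and no excluded term does. The result is binary -- no 'almost'."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return all(term in words for term in required) and \
        not any(term in words for term in excluded)

# A hit requires "charlie" but is suppressed if "hof" also appears.
print(boolean_hit("Charlie spoke at the conference", ["charlie"], ["hof"]))      # True
print(boolean_hit("Charlie Hof spoke at the conference", ["charlie"], ["hof"]))  # False
# A misspelling is silently missed -- no partial credit:
print(boolean_hit("Charlee spoke at the conference", ["charlie"], ["hof"]))      # False
```

Note how the misspelled “Charlee” fails outright; this is the precision/recall trade-off that makes Boolean strings an iterative chore to maintain.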
Below is a Boolean string supplied by one of our clients for use with his Factiva service. Starting with “virtual personal assistant”, you can see the accumulation of “NOT” words and sources he gradually added to eliminate false positive results.
In contrast, our application returns a similarity score against the object of the search, which lets us capture paragraphs that are similar and gives the analyst the ability to widen or narrow the “almost” aperture when they are getting too many or too few hits. Once we have filtered out the paragraphs/documents that are noise, we can apply a Boolean operator, entity extraction, sentiment, dates, or other metadata to bring precision to the search.
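The idea of an adjustable “almost” aperture can be sketched with a simple bag-of-words cosine similarity and a tunable threshold. This is a toy stand-in, not the actual scoring used by the platform; the query and paragraphs are invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Bag-of-words vector for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two texts, 0.0 (disjoint) to 1.0 (identical)."""
    va, vb = tokens(a), tokens(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "virtual personal assistant software"
paragraphs = [
    "A virtual assistant can schedule meetings automatically.",
    "Quarterly earnings beat analyst expectations.",
]

# Lowering the threshold widens the "almost" aperture (more hits);
# raising it narrows the aperture (fewer, more precise hits).
threshold = 0.2
hits = [p for p in paragraphs if cosine_similarity(query, p) >= threshold]
```

Here the first paragraph scores a partial match (it shares “virtual” and “assistant” with the query) and survives the threshold, while the unrelated paragraph scores zero; a Boolean string requiring the exact phrase would have missed both.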
NLP (natural language processing) has many of the same issues in that it attempts to reverse-engineer language. With large training data sets it has become quite good in many applications, which is why we use it in some of our projects. It has the same text-quality problems as Boolean, so it works great (best?) on professionally written documents. But extend it to a wide variety of writing styles, document types, fragments, OCR’d documents, etc., and it will break.
Like Boolean, it doesn’t tell you whether there is a concept similar to what you wanted that just missed the cut, although it does have a bit more flexibility. It is good for short fragments and bullets, where there isn’t much for our concept agents to understand. So we use it for that purpose, and where there is a mix of large paragraphs and short bullets.
Google and Watson – Quick Take
While Google and IBM Watson are products rather than single technologies, I’m often asked how our approach differs from theirs. Mostly it’s just that they are a few billion dollars bigger… and of course we aren’t even solving the same problem. But here’s a quick (dangerously simple) explanation that ignores the apples-and-oranges issue.
Google – they use all the above tech, in addition to personal search history, geolocation, and lots of machine learning, to create an index that provides great search functionality. While this is great for general web search, it doesn’t work well for our clients for several reasons. First, our clients do not want us to capture search history and index their activity, possibly to the benefit of other parties. Second, a Google search doesn’t tell you what it almost found but didn’t show you. It might list an “almost” hit, but since you don’t have control over the ranking, the result you’re looking for may be on page 102 and you’ll never see it. Finally, it’s a “black box” that can’t be tailored by you as a developer or as a user. What it learns about you might work great in one search and then mislead you when you’re working a different project. For financial search, AlphaSense and Sentieo are companies using NLP technology and link building, with the primary differentiator over Google being their specialization in financial documents.
IBM Watson – at NASA and other clients, the comparison to Watson comes up because of their ads (Bob Dylan, really?) and sales organization. In the right application, it can be optimized and taught to deliver real benefits. This quote from Wikipedia offers a great description of how it differs from search (Google).
“The key difference between QA technology and document search is that document search takes a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking), while QA technology takes a question expressed in natural language, seeks to understand it in much greater detail, and returns a precise answer to the question.”
However, other objections aside, client engineers call it a “Wizard of Oz” solution, i.e., you must be prepared to accept that Watson is the genius. While Watson is better than Google at exposing some of the underlying analytics, expert users demand control over the parameters of the interface (interactive dashboards), the underlying logic, and the ability to drill all the way down to the source content to verify a result. This capability is key to our architecture: the way we index the data and build our database (index). It also means that the application/UI layer is independent of the database and can be added to, changed, or combined in multiple ways as the system is deployed, grows, or is hardened for security. The decision to expose particular data fields, records, or source data is up to the client’s developers for custom applications.
In the end, software is like any other tool: pick the one(s) that do the best job for you. There are no perfect “Swiss Army knives” for research. Every search engine is perfect if you ask the right questions.
Disclaimer – none of the above companies have shared the technical details of their products with me. The above explanations are based entirely on my research without any inside information. This information is offered as a guide to help non-technical users ask questions and understand the answers.
Good Technology, Better Design, Great Database
In addition, our UI/dashboards are independent of the underlying enriched database. This means the same data can be used in new applications or to enhance existing algorithms and statistical analyses. At NASA, an engineer ran the output of our application through clustering algorithms and graphing programs to “see” roadmaps for wireless sensor technology. Combining the attributes we extract with other analyses can filter out noise and improve results.
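The kind of downstream clustering the NASA engineer did can be sketched generically. The code below is a hypothetical example (the document names, attribute sets, and the simple single-pass “leader” clustering scheme are my own, not the engineer’s actual pipeline): it groups documents by overlap of their extracted attributes.

```python
def jaccard(a, b):
    """Overlap between two attribute sets, 0.0-1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def leader_cluster(items, threshold=0.3):
    """Single-pass 'leader' clustering: each item joins the first cluster
    whose leader is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (leader_attrs, [member_names])
    for name, attrs in items:
        for leader, members in clusters:
            if jaccard(leader, attrs) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((attrs, [name]))
    return [members for _, members in clusters]

# Toy "enriched database" output: documents with extracted attributes.
docs = [
    ("doc1", {"wireless", "sensor", "network"}),
    ("doc2", {"wireless", "sensor", "power"}),
    ("doc3", {"earnings", "forecast"}),
]
groups = leader_cluster(docs)  # doc1 and doc2 cluster; doc3 stands alone
```

Because the enriched attributes live independently of any one UI, they can feed a clustering step like this, a graphing program, or an existing statistical model without re-processing the source text.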
An additional advantage is the flexibility of the topic agents/groups/metagroups, which allow multiple combinations of agents for specific tasks. An analyst can use the default groups but can also create a custom group of agents for a particular use case. If you think of the groupings of agents as clusters of concepts in your brain, you can apply them in different combinations depending on the task and context in front of you. The tool lets you create your own context and doesn’t force you to adapt to a rigid taxonomic structure. We believe this is critical when you are working to find insight and patterns hidden in plain sight that could drive a winning trading strategy.
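The agent-grouping idea can be sketched with plain sets. Everything here is hypothetical: the group names, the agents inside them, and the helper function are invented for illustration, not taken from the product; the point is only that default groups compose freely per task instead of forming a fixed taxonomy.

```python
# Hypothetical default groups of concept "agents" (names are illustrative).
DEFAULT_GROUPS = {
    "risk":    {"litigation", "recall", "regulatory"},
    "growth":  {"expansion", "hiring", "new_product"},
    "finance": {"guidance", "margin", "buyback"},
}

def custom_group(*group_names, extra=()):
    """Combine default agent groups, plus any ad-hoc agents, for one task."""
    agents = set()
    for name in group_names:
        agents |= DEFAULT_GROUPS[name]
    agents |= set(extra)
    return agents

# A trading-strategy screen might mix growth and finance agents and add
# a one-off agent for this particular use case:
screen = custom_group("growth", "finance", extra={"insider_buying"})
```

The same default groups can be recombined for a different task tomorrow; nothing about today’s screen constrains tomorrow’s, which is the “create your own context” point above.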
#research #textanalytics #linguistics #finance #AI #investing