Good Concept Detection Requires an "Almost Engine"
How many geese are in this picture? You didn't need to count. Your answer probably contained the words “approximately,” “average,” “almost,” “sort of,” “guesstimate,” or “about.” One of the most powerful features of your brain is that it does not treat language as math, a series of binary, yes-or-no formal constructs. Humans are masters of expressing the same idea in many ways and of understanding what you meant even if you didn't say it perfectly. You also know when someone is being so careful with their words that they're lying (we're all tested on this one daily). Analysts use this critical skill all the time.
Many technical approaches to textual analysis try to convert sentences into math or use the presence of specific words to return a yes-or-no result. Our approach here at Boulder overcomes this limitation with software agents that compare arrays (models) of patterns and return “similarity scores.” These normalized scores let us determine whether one paragraph is about the same thing as another, and by how much.
In my last post, I discussed the nature and definition of concepts and how our solution is built to find them. Whether it's a concept you've created or an example you've found in a document, if you can't describe it, you can't find it.
You can spot a concept when you read one, but learning to describe one isn't as easy as it sounds, and it's rarely exact. Our AI platform enables us to create and teach “intelligence agents” to “read” documents and score their similarity to concepts.
The first step in the process is to provide examples in a simple text box. This is surprisingly difficult the first time, because our brains never read text without injecting our own education, biases, and assumptions into the process. Unfortunately, the software sees exactly what you provide it: no more, no less. When one user asked us how to teach the concept of “fraud” to our agents, we had to take a step back.
Risk and Opportunity in Legal Issues
In the financial filings we focus on, concepts are often expressed indirectly, through a description of the act or its consequences. The objective of the analyst is to find indications of wrongdoing, or its cover-up, in company filings, earnings calls, news, and social media. The critical evidence is never a clear statement like “I just gave my friend inside information on our earnings announcement so they could trade ahead of our disclosure.”
People don't recite the definition of the crime when they talk about it. The language used is always more subtle and disguised. It also varies dramatically from one context to another (an email vs. an interview transcript, for example). Tweets and text messages are full of acronyms, slang, and partial sentences. The guilty party is usually aware of the act and tries to avoid being discovered. He is more likely to say, “Hi Dave, here are some stats you might find interesting.” Is he talking about the company earnings report or his fantasy football team? If he says, “we'll have a big surprise for you tomorrow,” is it a surprise birthday party or a merger announcement?
Kant’s tree concept (again)
To underscore this point, let’s revisit the example of the tree concept from my last post. The concept of a tree abstracted from descriptions of many trees is clear enough. The challenge in language analytics and many (most) language processing problems is that we are looking for the indirect effect or consequences resulting from the existence or actions of the tree.
This is easier to understand with examples: “The fall colors in New England are beautiful this time of year,” or “We need to get some shade for the yard at the nursery.” The concept of the tree is there, but if I were searching for trunks, roots, branches, and leaves, the “criminal” tree would escape detection!
Context is key
The context of a concept is critical, and defining that context for the software can be complex, but in most cases the patterns are there once you clear away the noise.
Capturing context can be as simple as setting filters: for example, documents exchanged between specific individuals during specific time periods in which they had access to the information, resources, and counterparts needed to commit wrongdoing. Document metadata and entity attributes like organization, location, and title are commonly used for this purpose.
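As a sketch of that kind of filter (the field names, people, and dates below are hypothetical, not our schema), scoping a document set to specific individuals and a specific window might look like:

```python
from datetime import date

# Hypothetical document metadata; field names and values are illustrative only.
documents = [
    {"from": "alice", "to": "dave", "sent": date(2021, 3, 1), "channel": "email"},
    {"from": "alice", "to": "bob",  "sent": date(2021, 3, 2), "channel": "email"},
    {"from": "alice", "to": "dave", "sent": date(2020, 1, 5), "channel": "chat"},
]

def in_scope(doc, people, start, end):
    """Keep documents between specific individuals within a specific time window."""
    return doc["from"] in people and doc["to"] in people and start <= doc["sent"] <= end

scoped = [d for d in documents
          if in_scope(d, {"alice", "dave"}, date(2021, 1, 1), date(2021, 6, 30))]
print(len(scoped))  # only the first document matches both the people and the window
```

The point is not the filter itself but what it buys you: the similarity scoring then runs only over documents where the wrongdoing was even possible.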
Where possible, we extract context from the source documents (and the source itself) to ensure that the more complex contextual factors are incorporated automatically into the intelligence agent. For example, the same person in the same period does not use the same language on Twitter as they do in email. Our powerful “almost engine” is what makes our system resilient, and with the user in charge of how tight to set the “almost” meter, it adapts to a range of problems.
We are all guilty of injecting our own bias and filters into understanding language. Good technical solutions capture the richness and subtlety in the context of the communications and ensure consistency of review. This is not a problem solved by more and bigger data.
Want to know how we use "almost" to solve your problems? We believe analysts have more than enough information and not enough time to find what matters. We can help with that.