Situation
A lawyer's work often consists of sifting through a significant number of contracts. In an ideal situation, where every document has been carefully and consistently categorized, the task is an easy one. However, most corporate environments are far from this ideal situation.
Our client, the legal department of a servicing company, had to deal with a large database of unstructured legal documents, and the time needed to extract contractual information had increased significantly in recent years. Once a document was found, it was relatively simple to analyze. However, finding a specific piece of contractual information hidden among thousands of documents was so difficult that our client feared missing strategic information and harming the company in the process.
Approach
Our Open Web Technology team supported our client in designing and developing a solution that could analyze and categorize this large database of contracts in order to allow users to search through them.
Leveraging the benefits our joint venture with Swisscom offers us, we teamed up to deliver an innovative and intelligent solution. This solution profited from the latest advancements in AI algorithms to efficiently analyze and categorize documents, and provided a frontend to intuitively query the generated model of documents.
As a first step, our solution had to read the content of scanned documents. We collected the contracts from different archiving systems and extracted their text using state-of-the-art OCR technologies. With this text content available, we then applied natural language processing techniques for both document categorization and document relation inference.
Learning the Document Category
To perform machine learning tasks, a computer needs to deal with a digital representation of the contract. The transformation of text into a mathematical object is called document embedding.
With each document represented as a mathematical object, a computer can measure distances between the objects and group the closest neighbors together. This step is called clustering. In our case, those groups represent documents of the same type.
Finally, the system could store the categories produced by our clustering algorithm, allowing the jurist to search by document category.
Visualization of the clustering of more than 1000 documents into contract types
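The embedding and clustering steps described above can be illustrated with a minimal sketch. It is not the production pipeline: it assumes a simple TF-IDF bag-of-words embedding and a basic k-means clustering, and the sample texts as well as the function names `embed` and `kmeans` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def embed(documents):
    """Turn each document into an L2-normalized TF-IDF vector."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Number of documents in which each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [
            counts[tok] * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok in vocabulary
        ]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Basic k-means with a deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assign every vector to its closest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        # Move every centroid to the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample texts standing in for OCR output.
documents = [
    "Lease agreement between landlord and tenant for rented premises",
    "Tenant owes monthly rent to landlord under this lease agreement",
    "Employment contract defining salary and duties of each employee",
    "Employer grants each employee a yearly salary per employment contract",
]
labels = kmeans(embed(documents), k=2)
print(labels)
```

In this toy example, the two lease texts should receive one label and the two employment texts the other. A real system would likely rely on richer embeddings and a more robust clustering algorithm, but the principle of grouping the closest neighbors stays the same.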
Inferring Relations between Documents
To detect relations between documents, we relied on the habit of authors to use common patterns when referring to similar documents. These patterns can be detected, which resulted in the following document relation inference workflow:
The algorithm could parse every document trying to detect mentioned documents, using either regular expressions or named entity recognition.
When a reference was found, the algorithm would check the database to see whether the mentioned document had been identified.
Based on document category, the relation between the two documents could be deduced, allowing our program to gradually create a list of documents mentioning each other.
This list could also be stored in a database to be queried later by the legal team.
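The workflow above can be sketched as follows, using the regular-expression variant only. Everything specific here is an assumption made for illustration: the reference format ("Contract No. 2017-042"), the in-memory `DOCUMENTS` dictionary standing in for the database, and the `RELATION_BY_CATEGORY` mapping are hypothetical.

```python
import re

# Hypothetical reference format, e.g. "Contract No. 2017-042".
REFERENCE_PATTERN = re.compile(
    r"\b(?:Contract|Agreement)\s+No\.\s*(\d{4}-\d{3})\b", re.IGNORECASE
)

# Hypothetical in-memory stand-in for the document database.
DOCUMENTS = {
    "2017-042": {
        "category": "framework agreement",
        "text": "Framework terms for all services.",
    },
    "2018-007": {
        "category": "amendment",
        "text": "This amendment modifies Agreement No. 2017-042.",
    },
    "2019-113": {
        "category": "service order",
        "text": "Issued under Contract No. 2017-042 and Contract No. 2016-999.",
    },
}

# Hypothetical mapping from (source category, target category) to a relation.
RELATION_BY_CATEGORY = {
    ("amendment", "framework agreement"): "amends",
    ("service order", "framework agreement"): "is governed by",
}


def infer_relations(documents):
    """Return (source id, relation, target id) triples for known references."""
    relations = []
    for doc_id, doc in documents.items():
        # Step 1: detect mentioned documents in the text.
        for mentioned_id in REFERENCE_PATTERN.findall(doc["text"]):
            # Step 2: check whether the mentioned document is in the database.
            target = documents.get(mentioned_id)
            if target is None or mentioned_id == doc_id:
                continue
            # Step 3: deduce the relation from the two document categories.
            relation = RELATION_BY_CATEGORY.get(
                (doc["category"], target["category"]), "mentions"
            )
            relations.append((doc_id, relation, mentioned_id))
    return relations


relations = infer_relations(DOCUMENTS)
for source_id, relation, target_id in relations:
    print(source_id, relation, target_id)
```

References to documents that are not in the database, such as the hypothetical "2016-999", are simply skipped in this sketch. The resulting list of triples is the kind of structure that could then be stored and queried.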
Having detected both the document category and the relation between documents, the jurist only needed to use the solution to access this vast amount of information.
The solution we developed demonstrates how artificial intelligence can disrupt the corporate world, automating cumbersome and repetitive work to let people spend time on more valuable tasks.
Our client has been able to reclassify an enormous number of documents within only a few months, which would not have been possible with conventional techniques, and the result is an improved legal database.
At Open Web Technology, we believe that Artificial Intelligence can help businesses in multiple and unexpected ways in the near future. It will open new opportunities for both cost reduction and business development.
If the subject of text classification interests you, don't hesitate to read our article dedicated to it!