Search plus Cognitive AI
Modern general-purpose search implementations are good: we can easily and efficiently index and query billions of documents*. By utilising Cognitive AI techniques, however, we can provide users with better results by adding contextual understanding - turning documents into knowledge.
Cognitive Search wraps cognitive AI techniques around web-scale search engines to provide an understanding of what is being indexed and what is being queried. Before they are indexed, documents are classified into corpora of related knowledge and extended with metadata derived by cognitive analysis. Before the user's query is sent to the search engine, it is enhanced with knowledge about the corpus or corpora being searched and with learnings from previous queries. The results get better over time.
*documents in this sense might be files of any kind, web pages, videos, images or records in a database - any container of information is a candidate.
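The flow described above - enrich and classify before indexing, enhance before querying - can be sketched as two hooks around an ordinary search index. This is a minimal illustration only; every class and function name below is an assumption for the sketch, not a real Unearth or Azure Search API.

```python
# Two cognitive hooks around a plain search index. The classifier and the
# query enhancer below are toy stand-ins for the cognitive components.

class Index:
    def __init__(self):
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)

    def query(self, terms):
        # Naive term match, standing in for a real search engine.
        return [d for d in self.docs
                if any(t in d["text"].lower() for t in terms)]

def classify(doc):
    # Toy classifier: route water-related documents to an engineering corpus.
    return "hydraulics" if "water" in doc["text"].lower() else "general"

def enhance(query, glossary):
    # Expand the raw query with context-specific synonyms before searching.
    return [query.lower()] + glossary.get(query.lower(), [])

index = Index()
doc = {"id": 1, "text": "Advanced Water Treatment plant designs"}
doc["corpus"] = classify(doc)                  # classified before indexing
index.add(doc)

glossary = {"awt": ["advanced water treatment"]}
hits = index.query(enhance("AWT", glossary))   # query enhanced before search
```

Here the raw query "AWT" would miss the document; the glossary expansion added before the engine is called is what finds it.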
Cognitive Search on Azure
Unearth is a Cognitive Search implementation built around Microsoft's Azure Search service that utilises Microsoft Cognitive Services, Machine Learning and Cognitive Toolkit to enable cognitive understanding.
Unearth is intended to be used by organizations with private document repositories. It is offered as a shared service with private knowledge stores or may be installed into customer subscriptions. Custom implementations for specific needs are also an option.
Unearth doesn’t store documents; it derives and stores metadata about documents, with references back to the source document libraries (which may be on-premise). Interfaces are already available for several document library types, and the design allows new ingest interfaces and on-premise crawlers to be created easily.
Contextual Understanding: The same terms can mean different things depending on their context. If my query is 'What do we know about AWT?' I might be looking for information about the Abstract Window Toolkit (if I am a programmer) or Advanced Water Treatment (if I am a hydraulic engineer). Unearth supports a hierarchy of contexts: generic, industry-specific, company-specific and domain-specific. So if the target corpus hierarchy were within a civil engineering domain, the search would automatically be expanded to include AWT, Advanced Water Treatment, Reverse Osmosis, Membrane Filtration and Water Oxidation.
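One way to picture the context hierarchy is as a stack of glossaries consulted for whichever contexts a corpus sits in. The hierarchy, glossary entries and function below are illustrative assumptions built from the AWT example, not Unearth's actual data model.

```python
# Each context level carries a glossary of related terms; a query term is
# expanded using only the levels active for the target corpus.

CONTEXT_HIERARCHY = [
    ("domain:water-treatment", {"awt": ["advanced water treatment",
                                        "reverse osmosis",
                                        "membrane filtration",
                                        "water oxidation"]}),
    ("industry:software",      {"awt": ["abstract window toolkit"]}),
    ("generic",                {}),
]

def expand_query(term, active_contexts):
    """Expand a term using glossaries from the contexts the corpus sits in."""
    expansions = [term]
    for name, glossary in CONTEXT_HIERARCHY:
        if name in active_contexts:
            expansions += glossary.get(term.lower(), [])
    return expansions

# A corpus inside a civil engineering hierarchy activates the water-treatment
# domain, so "AWT" expands to the engineering sense, not the programming one.
print(expand_query("AWT", {"domain:water-treatment", "generic"}))
```

The same term queried against a software corpus would activate the industry glossary instead and expand to "abstract window toolkit".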
Unearth comes with generic and industry-specific models which are constantly updated. We provide tools, education and assistance to help customers create company- and domain-specific contextual models. Contextual models created by or for customers are IP that belongs to the customer, and can also empower other customer applications, providing competitive advantage.
Automatic Classification: As documents are ingested, metadata is created, including key phrases, referenced real-world entities and relationships to contextual glossaries. This metadata can be used to automatically classify the new document into a corpus.
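As a rough illustration of classification from metadata, the key phrases derived at ingest can be compared with each corpus's contextual glossary, with the document joining the best match. The glossaries, scoring and threshold here are illustrative assumptions, not Unearth's actual algorithm.

```python
# Assign a document to the corpus whose glossary its key phrases overlap most.

CORPUS_GLOSSARIES = {
    "water-treatment": {"reverse osmosis", "membrane", "filtration", "chlorine"},
    "software":        {"toolkit", "compiler", "widget", "runtime"},
}

def classify(key_phrases, glossaries):
    """Score each corpus by glossary overlap and pick the best non-zero match."""
    scores = {corpus: len(set(key_phrases) & terms)
              for corpus, terms in glossaries.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

phrases = {"membrane", "filtration", "pump"}
print(classify(phrases, CORPUS_GLOSSARIES))  # → water-treatment
```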
More like this: Similarity learning is used to determine how closely the content of documents matches. This allows searches for similar documents across different corpora and the identification of duplicates or near-duplicates.
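The similarity-learning approach Unearth uses is not described in detail, but one common technique for near-duplicate detection is Jaccard similarity over word shingles, sketched below as an assumption rather than the product's implementation.

```python
# Jaccard similarity between the word-shingle sets of two documents:
# 1.0 for identical text, close to 1.0 for near-duplicates.

def shingles(text, n=3):
    """Break text into overlapping n-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def similarity(a, b):
    """|intersection| / |union| of the two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

doc1 = "the plant uses reverse osmosis membranes for water treatment"
doc2 = "the plant uses reverse osmosis membranes for treating water"
print(round(similarity(doc1, doc2), 2))
```

Scores above a chosen threshold would flag the pair as near-duplicates, even across different corpora.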
Know what you know: Analysis and reporting on metadata allow you to see what topics are covered by each corpus of documents, and to what extent. This can help identify duplication and knowledge gaps.
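At its simplest, this kind of coverage reporting aggregates topic metadata per corpus; thin or missing counts reveal the gaps. The data layout below is an illustrative assumption about what a knowledge store might hold.

```python
from collections import Counter

# Stand-in knowledge-store records: one metadata entry per ingested document.
documents = [
    {"corpus": "water-treatment", "topics": ["reverse osmosis", "membranes"]},
    {"corpus": "water-treatment", "topics": ["membranes", "chlorination"]},
    {"corpus": "software",        "topics": ["gui toolkits"]},
]

def coverage(docs):
    """Topic counts per corpus; low or missing counts point at knowledge gaps."""
    report = {}
    for doc in docs:
        report.setdefault(doc["corpus"], Counter()).update(doc["topics"])
    return report

report = coverage(documents)
print(report["water-treatment"].most_common())
```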
Unearth itself does not store documents. You can choose to leave your documents in their current repositories, or we can help to move them to an appropriate secure repository in Azure.
Document metadata at rest is encrypted and stored in Unearth knowledge stores. These are automatically triplicated for fault tolerance and regularly backed up. Access to knowledge stores and backups is secured with Azure authentication and authorization, as are the APIs that update and query them and the user interfaces that call the APIs.
Each customer has their own private single-tenant knowledge stores either in a shared or private Azure subscription. You always own and control your own metadata.
Azure authentication and authorization allow access to be controlled with Azure Active Directory (AAD), which can be set up as an extension of your on-premise Active Directory. Users can be authenticated and authorized in the same way, or via Microsoft, Facebook, Google and Twitter identities. End-user authorization can be per knowledge store or per corpus.
Yes, no question. Unearth depends on Azure Search as its primary processing engine. Out of the box, Azure Search can easily scale to handle billions of documents, and for extremely large-scale requirements Microsoft can provide custom search deployments (think ‘Bing’).
We are unlikely ever to need a custom search deployment, because for any single implementation Unearth supports multiple knowledge stores with multiple search services and multiple indexes, combining the results from all of these in the knowledge discovery API. We can safely say 'no job too big'.
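The fan-out just described - one discovery call querying several stores and merging the scored results - can be sketched as follows. The store interface and scores are illustrative assumptions, not the actual knowledge discovery API.

```python
# Query every knowledge store and merge results into one relevance-ordered list.

def fan_out(query, stores):
    """Collect (score, doc) hits from each store and sort by score."""
    merged = []
    for store in stores:
        merged.extend(store(query))          # each store returns (score, doc)
    return sorted(merged, key=lambda hit: hit[0], reverse=True)

# Two stand-in stores with fixed scored results.
store_a = lambda q: [(0.9, "a1"), (0.4, "a2")]
store_b = lambda q: [(0.7, "b1")]

print(fan_out("awt", [store_a, store_b]))  # [(0.9, 'a1'), (0.7, 'b1'), (0.4, 'a2')]
```

A real merge would also need to reconcile scores from different indexes, but the shape of the aggregation is the same.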
If your needs are small, or start small, we've got that covered too. Using our shared services and our generic and industry models, the only components needed are a private knowledge store and some assistance to integrate with your document store and Active Directory.
We are already working on extensions to Unearth for organisations that want to use the basic Cognitive Search functionality to provide a custom UI with very specific capabilities for their customers. This includes training application-specific contextual models.
Trained industry, company and domain-specific contextual models can be exposed directly to other applications. For example, a bot could use a domain model to help it understand the meaning of human input in a particular context, or a domain model could help train a LUIS model, which can itself provide language understanding to bots or other applications.
Company and domain models are the IP of the organisation that trains them and can be utilized in many ways to achieve competitive advantage.
Our generic and industry models are currently available only in English, but if there is demand we can build contextual models in any language that Azure Search supports - 54 languages currently have analysers in Azure Search. The basic Cognitive Search capability will, however, work in any Azure Search-supported language.
No. Unearth will work fine as a cognitive search tool with only the generic model, and by learning from your users it will improve over time. However, you will see significant advantages with trained industry, company and domain-specific models.
We don’t yet have models for every industry, but as customers adopt Unearth we will create and/or improve the model for that specific industry. Industry models are our responsibility to build and maintain.
We can show you how to create and train your own contextual models which know about your business, belong only to you and can improve your employees' and/or customers' search experiences - or be utilised to give other custom applications (like bots) an understanding of the context of your organisation or application domain. That takes some effort but can provide very rewarding outcomes.
Probably not - not today, anyway. We have tried many OCR engines; each one has strengths and weaknesses. Our preference is to use the Microsoft Computer Vision API because we understand and trust the science behind it and expect its capability to grow over time. We can, however, use multiple OCR engines on each ingested document, evaluate the results from each and choose the best result.
Given that there is currently no one OCR engine to rule them all, this approach gives our customers the best results possible for now. For serious OCR problems, we can identify difficult-to-read scanned documents and stream them to a human for interpretation.
Turning documents into knowledge
Document Ingest: PDFs, OCR, Word, Excel, PowerPoint, video, audio or custom. Simple, adaptable interfaces to document stores using Microsoft Flow services or custom crawlers.
Knowledge Generation: Extract meaning from language and images with Cognitive Services, Machine Learning and Cognitive Toolkit. Generic, industry-specific, company-specific and domain-specific models.
Knowledge Store: Hierarchy of corpora, source references, knowledge metadata: text, topics, entities, key phrases, corpus glossaries. Adaptable JSON schemas for formal document types.
Knowledge Analysis: Know what you know.
Search Engine: Azure Search. Standard indexes and indexes programmatically created from JSON schemas for corpus-specific document types. Customer or corpus-specific search scopes.
Query Tuning: Cognitive Services, Recommendation API: learn over time what search terms mean in relation to specific corpora. “People who searched for x often searched for y” etc.
Knowledge Discovery: Different views for different people/corpora. Natural language search, Glossary suggestions (‘Wikipedia’) and semi-formal (facets based on standard fields and JSON schema) views.
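The query-tuning idea above - "people who searched for x often searched for y" - can be approximated by counting which terms co-occur in the same search session. This is one simple co-occurrence approach, sketched as an assumption; the Recommendation API itself works differently and is not reproduced here.

```python
from collections import Counter
from itertools import combinations

# Stand-in session logs: the distinct terms each user searched in one session.
sessions = [
    ["awt", "reverse osmosis"],
    ["awt", "membrane filtration"],
    ["awt", "reverse osmosis", "chlorination"],
]

# Count how often each pair of terms appears in the same session.
cooccur = Counter()
for session in sessions:
    for a, b in combinations(sorted(set(session)), 2):
        cooccur[(a, b)] += 1

def related(term, counts, top=2):
    """Terms most often searched in the same session as `term`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [t for t, _ in scores.most_common(top)]

print(related("awt", cooccur))
```

Suggestions learned this way improve as the query log grows, which is what makes the results get better over time.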