In this blog I’m going to talk you through a proof of concept we created here at Silversands, exploring the AI capabilities of Azure Search across multiple document types.

What is Azure Search?

Azure Search is not a search of Azure, as the name might suggest. It is in fact a managed search service from Microsoft that allows you to index data from one or more sources and use ‘document cracking’ to extract further information using Cognitive Services. Typically, you would then integrate the search service into a website or bot for users to consume.

Step 1 – Select the Data Source

The first task was to gather some test data. This could be an existing database or just a collection of unstructured data in various formats. With this in mind, there are two indexing methods to consider.

  • Pulling data – Automatically crawls and uploads data into the index from supported Azure data sources, such as Azure SQL Database, Azure Cosmos DB and Azure Blob Storage.
  • Pushing data – Programmatically send documents to Azure Search, either individually or in batches, regardless of where the data lives.
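As a rough illustration of the push method, the sketch below builds the request for the indexing REST endpoint. The service URL, index name and document fields are placeholders for illustration, not values from this proof of concept:

```python
import json

# Placeholder values for illustration only
SERVICE = "https://my-search-service.search.windows.net"
INDEX = "poc-index"
API_VERSION = "2019-05-06"

def build_push_request(documents):
    """Build the URL and JSON body for pushing documents into the index."""
    url = f"{SERVICE}/indexes/{INDEX}/docs/index?api-version={API_VERSION}"
    body = {
        # Each document carries an action: upload, merge, mergeOrUpload or delete
        "value": [dict(doc, **{"@search.action": "mergeOrUpload"})
                  for doc in documents]
    }
    return url, json.dumps(body)

url, body = build_push_request([{"id": "1", "content": "Hello from blob storage"}])
```

In practice the body would be sent as an HTTP POST with an `api-key` header; here we only construct the payload.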

In this proof of concept, we wanted to highlight how Azure Search could index various document types, including images, PDFs, HTML and text, all stored in blob storage. The Import data wizard within the Azure portal provided a simple way to get started with the pull method, so we used that and selected Azure Blob Storage from the list.
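Behind the scenes, the wizard creates a data source definition along these lines. The names and connection string below are placeholders, not the values we used:

```python
# A data source definition of the kind the Import data wizard creates.
# The name, connection string and container name are placeholders.
datasource = {
    "name": "blob-datasource",
    "type": "azureblob",
    "credentials": {"connectionString": "<storage account connection string>"},
    "container": {"name": "documents"}
}
```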

Step 2 – Add cognitive search

With the data source defined, we then needed to consider the AI enrichments. These ‘skills’, as the service calls them, allow the search service to extract key information, detect languages and, most importantly for this example, extract text from images.

To make this happen we needed a Cognitive Services resource in Azure. This can be a free or billable instance, but beware: the free tier is restricted to 20 documents a day, so it is only suitable for testing. To allow for future expansion we created a billable instance and named it skillset.

Phase two of this process was configuring the skillset. Because our data collection includes images, the key part was enabling optical character recognition (OCR) so the Cognitive Services resource could provide insights into the images and map the extracted text to a corresponding field.
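As a sketch of what that configuration produces, an OCR skill inside a skillset definition looks roughly like this; the target field name is our own placeholder:

```python
# A minimal skillset containing only the OCR skill.
# The targetName is a placeholder for illustration.
skillset = {
    "name": "skillset",
    "description": "Extract text from images using OCR",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "defaultLanguageCode": "en",
            "inputs": [
                {"name": "image", "source": "/document/normalized_images/*"}
            ],
            "outputs": [
                # The extracted text is later mapped to an index field
                {"name": "text", "targetName": "imageText"}
            ]
        }
    ]
}
```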

So what exactly do these skills do? The text cognitive skills are largely self-explanatory: they detect patterns and extract key information, enabling a faster, more refined search through the use of filters.

The image cognitive skills look at the images and create tags based on the objects detected or people identified; a built-in example is the recognition of celebrities.

If there’s a requirement for recognising specific information within a document, you do have the option to create custom skills and link those into the enrichment pipeline.
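A custom skill is essentially a web API that the enrichment pipeline calls for each document. A sketch of such a skill definition, with a hypothetical endpoint URL and field names:

```python
# A custom skill definition; the URI and field names are hypothetical.
custom_skill = {
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "uri": "https://example.com/api/extract-entities",
    "context": "/document",
    "inputs": [
        {"name": "text", "source": "/document/content"}
    ],
    "outputs": [
        {"name": "entities", "targetName": "customEntities"}
    ]
}
```

This definition would be appended to the `skills` array of the skillset, linking the custom API into the enrichment pipeline.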

Step 3 – Create the index

The index tells Azure Search which fields are available and the data type each one contains. In this case we knew we wanted to create filters for the image tags, people, organisations and celebrities, so filterable and facetable were enabled for those fields. This would give users a way to drill into the search results using a menu.
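A sketch of what such an index definition looks like; the exact field names and types here are assumptions for illustration:

```python
# Index definition sketch; field names are illustrative placeholders.
index = {
    "name": "poc-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        # The enriched fields we wanted to drill into via the filter menu
        {"name": "imageTags", "type": "Collection(Edm.String)",
         "filterable": True, "facetable": True},
        {"name": "people", "type": "Collection(Edm.String)",
         "filterable": True, "facetable": True},
        {"name": "organizations", "type": "Collection(Edm.String)",
         "filterable": True, "facetable": True},
        {"name": "celebrities", "type": "Collection(Edm.String)",
         "filterable": True, "facetable": True},
    ]
}
```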

Step 4 – Test

At this point enough of the configuration was in place for testing to begin. For this proof of concept we knew we wanted a website, but search results could just as easily be presented to users via a chat bot within Microsoft Teams.

Using the AzSearch tool, we created a simple website to show the search data being returned and the filter menu working for the fields we enabled as filterable and facetable in Step 3.

This proved the search was working; however, it was returning the results as raw JSON, so a little more custom work was needed before we could present a usable interface.

After some more custom changes a search for “Linux” now returns relevant images, documents and highlighted text in a caption.

As you can see from the first result, the only text returned is within the image, proving that Cognitive Services has extracted the text and is working as expected!
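The “Linux” query above corresponds to a REST call along these lines; the service URL, index name and field names are placeholders rather than the ones used in this proof of concept:

```python
from urllib.parse import urlencode

# Placeholder service and index names for illustration
SERVICE = "https://my-search-service.search.windows.net"

params = {
    "api-version": "2019-05-06",
    "search": "Linux",        # free-text query
    "facet": "imageTags",     # populate the filter menu
    "highlight": "content",   # return highlighted caption snippets
}
url = f"{SERVICE}/indexes/poc-index/docs?{urlencode(params)}"
```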

Conclusion

Azure Search offers the ability to gain rich insights into your existing structured or unstructured data, wherever it is located, and could be of real value to organisations that have accumulated large data stores and need a modern way to search them.

With Azure services changing all the time, it’s important to run proofs of concept to test new improvements and features. Here at Silversands we found the Cognitive Search features within Azure Search added significant value to the indexing of unstructured data, especially when it comes to extracting searchable text from images, a process that would traditionally involve tagging each image manually, one by one.

What’s next?

If you’re interested in how we can help you harness the power of Azure for your organisation, please get in touch for a no-obligation chat.