Use the GROK client to make REST calls to the Azure Search Service to create and run the indexing pipeline. The blob client is used to upload the images to blob storage and to download the extracted OCR results from it.
Example usage:
Create a .env file with environment variables that hold the names of the index, indexer, skillset, and datasource to create on the search service. Include the keys to the blob storage that contains the documents you want to index, the keys to the cognitive service, and the keys to your computer vision subscription and search service. To index more than 20 documents, you must have a Cognitive Services subscription. You can find the keys for these services in the Azure Portal. An example of the .env file content is given below:
SEARCH_SERVICE_NAME = "ocr-ner-pipeline"
SKILLSET_NAME = "ocrskillset"
INDEX_NAME = "ocrindex"
INDEXER_NAME = "ocrindexer"
DATASOURCE_NAME = "syntheticimages"
DATASOURCE_CONTAINER_NAME = "ocrimages"
PROJECTIONS_CONTAINER_NAME = "ocrprojection"
BLOB_NAME = "syntheticimages"
BLOB_KEY = "<YOUR BLOB KEY>"
SEARCH_SERVICE_KEY = "<YOUR SEARCH SERVICE KEY>"
COGNITIVE_SERVICE_KEY = "<YOUR COGNITIVE SERVICE KEY>"
from genalog.ocr.blob_client import GrokBlobClient
from dotenv import load_dotenv
load_dotenv(".env")
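Since a missing key usually only surfaces later as a failed request, it can help to check the environment right after loading it. The following is a minimal sketch, not part of the library: the helper name missing_env_vars is hypothetical, and the variable list is simply copied from the example .env file above.

```python
import os

# Variable names copied from the example .env file above.
REQUIRED_VARS = [
    "SEARCH_SERVICE_NAME", "SKILLSET_NAME", "INDEX_NAME", "INDEXER_NAME",
    "DATASOURCE_NAME", "DATASOURCE_CONTAINER_NAME", "PROJECTIONS_CONTAINER_NAME",
    "BLOB_NAME", "BLOB_KEY", "SEARCH_SERVICE_KEY", "COGNITIVE_SERVICE_KEY",
]


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

Calling missing_env_vars() after load_dotenv(".env") returns an empty list when every variable is present, and otherwise the names that still need to be filled in.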
Upload the images to blob storage with the upload_images_to_blob function. This function takes the local and remote paths and an optional parameter that specifies whether to use asyncio asynchronous uploads [https://docs.python.org/3/library/asyncio.html]. Asynchronous uploads are faster; however, some Python setups may not support them. In such cases, synchronous uploads can be made with use_async=False.
local_path = "testimages"
remote_path = "testimages"
destination_folder_name, upload_task = blob_client.upload_images_to_blob(local_path, remote_path, use_async=True)
await upload_task
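A bare await like the one above works in a notebook or an asyncio REPL, but it is a syntax error at the top level of a plain Python script. The sketch below shows one way to drive an awaitable to completion from synchronous code with asyncio.run. It assumes the returned task is a coroutine; fake_upload is a hypothetical stand-in for it, and run_blocking is a hypothetical helper name.

```python
import asyncio


async def fake_upload():
    # Hypothetical stand-in for the awaitable returned by
    # upload_images_to_blob(..., use_async=True).
    await asyncio.sleep(0)
    return "done"


def run_blocking(coro):
    """Run a coroutine to completion from synchronous code and return its result."""
    # asyncio.run creates a fresh event loop, runs the coroutine, and closes the loop.
    return asyncio.run(coro)


result = run_blocking(fake_upload())
```

If the library returns something other than a coroutine (for example a task already bound to a running loop), this approach does not apply, and use_async=False is the simpler fallback.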
poll_indexer_till_complete will block and continuously poll the indexer until it has completely processed all documents.
grok_rest_client = GrokRestClient.
grok_rest_client.create_indexing_pipeline()
grok_rest_client.run_indexer()
indexer_status = grok_rest_client.poll_indexer_till_complete()
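To make the blocking behaviour concrete, here is a generic polling loop. It is an illustrative sketch only and not the implementation of poll_indexer_till_complete: the function name, the get_status callable, the status strings, and the interval and timeout parameters are all assumptions.

```python
import time


def poll_until_complete(get_status, done_states=("success",),
                        interval_s=1.0, timeout_s=60.0, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a state in done_states.

    Raises TimeoutError if no done state is seen within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout_s} seconds")
        # Wait before asking again so the service is not flooded with requests.
        sleep(interval_s)
```

The caller is blocked for the whole loop, which is why a timeout is worth having when the indexer may stall.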
output_folder = "./ocr"
async_download_task = blob_client.get_ocr_json(remote_path, output_folder, use_async=True)
await async_download_task
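Once the download has finished, the results can be read back from disk. The sketch below assumes that the OCR output is written as files with a .json extension directly inside output_folder; that layout is an assumption, and load_json_folder is a hypothetical helper rather than part of the library.

```python
import json
from pathlib import Path


def load_json_folder(folder):
    """Return {file name: parsed JSON} for every *.json file directly inside `folder`."""
    results = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.name] = json.load(f)
    return results
```

For example, load_json_folder("./ocr") would return one dictionary entry per downloaded file, keyed by file name.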