Improve productivity when processing scanned PDFs using Amazon Q Business

Amazon Q Business is a generative AI-powered assistant that can answer questions, provide summaries, generate content, and extract insights directly from the content in digital as well as scanned PDF documents in your enterprise data sources, without needing to extract the text first.

Customers across industries such as finance, insurance, healthcare and life sciences, and more need to derive insights from various document types, such as receipts, healthcare plans, or tax statements, which are frequently in scanned PDF format. These document types often have a semi-structured or unstructured format, which has traditionally required processing to extract text before indexing with Amazon Q Business.

The launch of scanned PDF document support with Amazon Q Business can help you seamlessly process a variety of multi-modal document types through the AWS Management Console and APIs, across all supported Amazon Q Business AWS Regions. You can ingest documents, including scanned PDFs, from your data sources using supported connectors, index them, and then use the documents to answer questions, provide summaries, and generate content securely and accurately from your enterprise systems. This feature eliminates the development effort required to extract text from scanned PDF documents outside of Amazon Q Business, and improves the document processing pipeline for building your generative artificial intelligence (AI) assistant with Amazon Q Business.

In this post, we show how to asynchronously index and run real-time queries with scanned PDF documents using Amazon Q Business.

Solution overview

You can use Amazon Q Business for scanned PDF documents from the console, AWS SDKs, or AWS Command Line Interface (AWS CLI).

Amazon Q Business provides a versatile suite of data connectors that can integrate with a wide range of enterprise data sources, empowering you to develop generative AI solutions with minimal setup and configuration. To learn more, see the post “Amazon Q Business, now generally available, helps boost workforce productivity with generative AI.”

After your Amazon Q Business application is ready to use, you can directly upload the scanned PDFs into an Amazon Q Business index using either the console or the APIs. Amazon Q Business offers multiple data source connectors that can integrate and synchronize data from multiple data repositories into a single index. For this post, we demonstrate two scenarios to use documents: one with the direct document upload option, and another using the Amazon Simple Storage Service (Amazon S3) connector. If you need to ingest documents from other data sources, refer to Supported connectors for details on connecting additional data sources.

Index the documents

In this post, we use three scanned PDF documents as examples: an invoice, a health plan summary, and an employment verification form, along with some text documents.

The first step is to index these documents. Complete the following steps to index documents using the direct upload feature of Amazon Q Business. For this example, we upload the scanned PDFs.

On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
Choose Add data source.
Choose Upload Files.
Upload the scanned PDF files.

You can monitor the uploaded files on the Data sources tab. The Upload status changes from Received to Processing to Indexed or Updated, at which point the file has been successfully indexed into the Amazon Q Business data store. The following screenshot shows the successfully indexed PDFs.

The following steps demonstrate how to integrate and synchronize documents using an Amazon S3 connector with Amazon Q Business. For this example, we index the text documents.

On the Amazon Q Business console, choose Applications in the navigation pane and open your application.
Choose Add data source.
Choose Amazon S3 for the connector.
Enter the information for Name, VPC and security group settings, IAM role, and Sync mode.
To finish connecting your data source to Amazon Q Business, choose Add data source.
In the Data source details section of your connector details page, choose Sync now to allow Amazon Q Business to begin syncing (crawling and ingesting) data from your data source.

When the sync job is complete, your data source is ready to use. The following screenshot shows that all five documents (scanned and digital PDFs, and text files) are successfully indexed.

The following screenshot shows a comprehensive view of the two data sources: the directly uploaded documents and the documents ingested through the Amazon S3 connector.

Now let’s run some queries with Amazon Q Business on our data sources.

Queries on dense, unstructured, scanned PDF documents

Your documents might be dense, unstructured, scanned PDF document types. Amazon Q Business can identify and extract the most salient information-dense text from them. In this example, we use the multi-page health plan summary PDF we indexed earlier. The following screenshot shows an example page.

This is an example of a health plan summary document.

In the Amazon Q Business web UI, we ask “What is the annual total out-of-pocket maximum, mentioned in the health plan summary?”

Amazon Q Business searches the indexed document, retrieves the relevant information, and generates an answer while citing the source for its information. The following screenshot shows the sample output.

Queries on structured, tabular, scanned PDF documents

Documents might also contain structured data elements in tabular format. Amazon Q Business can automatically identify, extract, and linearize structured data from scanned PDFs to accurately resolve user queries. In the following example, we use the invoice PDF we indexed earlier. The following screenshot shows an example.

This is an example of an invoice.

In the Amazon Q Business web UI, we ask “How much were the headphones charged in the invoice?”

Amazon Q Business searches the indexed document and retrieves the answer with reference to the source document. The following screenshot shows that Amazon Q Business is able to extract bill information from the invoice.

Queries on semi-structured forms

Your documents might also contain semi-structured data elements in a form, such as key-value pairs. Amazon Q Business can accurately satisfy queries related to these data elements by extracting specific fields or attributes that are meaningful for the queries. In this example, we use the employment verification PDF. The following screenshot shows an example.

This is an example of an employment verification form.

In the Amazon Q Business web UI, we ask “What is the applicant’s date of employment in the employment verification form?” Amazon Q Business searches the indexed employment verification document and retrieves the answer with reference to the source document.

Index documents using the AWS CLI

In this section, we show you how to use the AWS CLI to ingest structured and unstructured documents stored in an S3 bucket into an Amazon Q Business index. You can quickly retrieve detailed information about your documents, including their statuses and any errors that occurred during indexing. If you’re an existing Amazon Q Business user and have indexed documents in various formats, such as scanned PDFs and other supported types, and you now want to reindex the scanned documents, complete the following steps:

Check the status of each document to filter failed documents according to the status "DOCUMENT_FAILED_TO_INDEX". You can filter the documents based on this error message:

"errorMessage": "Document cannot be indexed since it contains no text to index and search on. Document must contain some text."

If you’re a new user and haven’t indexed any documents, you can skip this step.

The following is an example of using the ListDocuments API to filter documents with a specific status and their error messages:

aws qbusiness list-documents --region <region> \
--application-id <application-id> \
--index-id <index-id> \
--query "documentDetailList[?status=='DOCUMENT_FAILED_TO_INDEX'].{DocumentId:documentId, ErrorMessage:error.errorMessage}" \
--output json
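To reindex the failed documents, you need the bucket and key behind each document ID. The following is a minimal sketch, assuming the IDs follow the s3://<bucket>/<key> form used later in this post; split_s3_id is a hypothetical helper name, not part of the AWS CLI.

```shell
# Hypothetical helper: split an ID of the form s3://<bucket>/<key> into a
# tab-separated "bucket<TAB>key" line. Assumes document IDs are S3 URIs,
# which only holds if the documents were ingested with such IDs.
split_s3_id() {
  local id="$1"
  local rest="${id#s3://}"   # drop the scheme, if present
  # Fail when there was no s3:// prefix or when no key follows the bucket.
  if [ "$rest" = "$id" ] || [ "${rest#*/}" = "$rest" ]; then
    return 1
  fi
  printf '%s\t%s\n' "${rest%%/*}" "${rest#*/}"
}

# Usage sketch (needs AWS credentials; placeholders are left unfilled):
# aws qbusiness list-documents --region <region> \
#   --application-id <application-id> --index-id <index-id> \
#   --query "documentDetailList[?status=='DOCUMENT_FAILED_TO_INDEX'].documentId" \
#   --output text | tr '\t' '\n' | while read -r id; do split_s3_id "$id"; done
```

IDs that are not S3 URIs make the helper return a non-zero status, so they can be skipped or handled separately.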

The following screenshot shows the AWS CLI output with a list of failed documents with error messages.

Now you batch-process the documents. Amazon Q Business supports adding multiple documents to an Amazon Q Business index.

Use the BatchPutDocument API to ingest multiple scanned documents stored in an S3 bucket into the index:

aws qbusiness batch-put-document --region <region> \
--documents '[{ "id":"s3://<your-bucket-path>/<scanned-pdf-document1>","content":{"s3":{"bucket":"<your-bucket>","key":"<scanned-pdf-document1>"}}}, { "id":"s3://<your-bucket-path>/<scanned-pdf-document2>","content":{"s3":{"bucket":"<your-bucket>","key":"<scanned-pdf-document2>"}}}]' \
--application-id <application-id> \
--index-id <index-id> \
--endpoint-url <application-endpoint-url> \
--role-arn <role-arn> \
--no-verify-ssl
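Writing the --documents JSON by hand gets error-prone as the number of files grows. The following sketch generates the same structure from a bucket name and a list of object keys. The helper name build_documents_json is hypothetical, and the sketch assumes that bucket names and keys contain no double quotes or backslashes, because it does no JSON escaping.

```shell
# Hypothetical helper: print the JSON array expected by --documents for one
# bucket and one or more object keys. No JSON escaping is performed, so the
# bucket and keys must not contain double quotes or backslashes.
build_documents_json() {
  local bucket="$1"
  shift
  local sep=""
  local key
  printf '['
  for key in "$@"; do
    printf '%s{"id":"s3://%s/%s","content":{"s3":{"bucket":"%s","key":"%s"}}}' \
      "$sep" "$bucket" "$key" "$bucket" "$key"
    sep=","   # every entry after the first is preceded by a comma
  done
  printf ']\n'
}

# Usage sketch (needs AWS credentials; placeholders are left unfilled):
# aws qbusiness batch-put-document --region <region> \
#   --application-id <application-id> --index-id <index-id> \
#   --role-arn <role-arn> \
#   --documents "$(build_documents_json <your-bucket> <key1> <key2>)"
```

BatchPutDocument limits how many documents a single call can carry (10, to our understanding; check the API reference for the current value), so longer key lists need to be split across several calls.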

The following screenshot shows the AWS CLI output. You should see failed documents as an empty list.

Finally, use the ListDocuments API again to review if all documents were indexed properly:

aws qbusiness list-documents --region <region> \
--application-id <application-id> \
--index-id <index-id> \
--endpoint-url <application-endpoint-url> \
--no-verify-ssl
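For a scripted check rather than a visual one, the status values can be piped into a small helper. The following is a sketch only: all_indexed is a hypothetical name, and it treats INDEXED and UPDATED as success, matching the upload statuses described earlier in this post.

```shell
# Hypothetical helper: read whitespace-separated status values on stdin and
# succeed only if there is at least one value and every value is INDEXED or
# UPDATED. Documents still in RECEIVED or PROCESSING also count as not done.
all_indexed() {
  local seen=0
  local status
  for status in $(cat); do
    seen=1
    case "$status" in
      INDEXED|UPDATED) ;;
      *) return 1 ;;
    esac
  done
  [ "$seen" -eq 1 ]
}

# Usage sketch (needs AWS credentials; placeholders are left unfilled):
# aws qbusiness list-documents --region <region> \
#   --application-id <application-id> --index-id <index-id> \
#   --query 'documentDetailList[].status' --output text \
#   | all_indexed && echo "All documents are indexed"
```

Because indexing is asynchronous, a non-zero result shortly after BatchPutDocument may simply mean that processing has not finished, so it can make sense to rerun the check after a short wait.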

The following screenshot shows that the documents are indexed in the data source.

Clean up

If you created a new Amazon Q Business application and don’t plan to use it further, unsubscribe and remove assigned users from the application and delete it so that your AWS account doesn’t accumulate costs. Moreover, if you don’t need to use the indexed data sources further, refer to Managing Amazon Q Business data sources for instructions to delete your indexed data sources.

Conclusion

This post demonstrated the support for scanned PDF document types with Amazon Q Business. We highlighted the steps to sync, index, and query supported document types, now including scanned PDF documents, using generative AI with Amazon Q Business. We also showed examples of queries on structured, unstructured, or semi-structured multi-modal scanned documents using the Amazon Q Business web UI and AWS CLI.

To learn more about this feature, refer to Supported document formats in Amazon Q Business. Give it a try on the Amazon Q Business console today! For more information, visit Amazon Q Business and the Amazon Q Business User Guide. You can send feedback to AWS re:Post for Amazon Q or through your usual AWS support contacts.

About the Authors

Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Himesh Kumar is a seasoned Senior Software Engineer, currently working at Amazon Q Business in AWS. He is passionate about building distributed systems in the generative AI/ML space. His expertise extends to developing scalable and efficient systems, ensuring high availability, performance, and reliability. Beyond technical skills, he is dedicated to continuous learning and staying at the forefront of technological advancements in AI and machine learning.

Qing Wei is a Senior Software Developer for the Amazon Q Business team in AWS, and is passionate about building modern applications using AWS technologies. He loves community-driven learning and sharing of technology, especially for machine learning hosting and inference related topics. His main focus right now is on building serverless and event-driven architectures for RAG data ingestion.