Organizations are often inundated with video and audio content that contains valuable insights. However, extracting those insights efficiently and with high accuracy remains a challenge. This post explores an innovative solution to accelerate video and audio review workflows through a thoughtfully designed user experience that enables human and AI collaboration. By approaching the problem from the user's point of view, we can create a powerful tool that allows people to quickly find relevant information within long recordings without the risk of AI hallucinations.
Many professionals, from lawyers and journalists to content creators and medical practitioners, need to review hours of recorded content regularly to extract verifiably accurate insights. Traditional methods of manual review or simple keyword searches over transcripts are time-consuming and often miss important context. More advanced AI-powered summarization tools exist, but they risk producing hallucinations or inaccurate information, which can be dangerous in high-stakes environments like healthcare or legal proceedings.
Our solution, the Recorded Voice Insight Extraction Webapp (ReVIEW), addresses these challenges by providing a seamless way for humans to collaborate with AI, accelerating the review process while maintaining accuracy and trust in the results. The application is built on top of Amazon Transcribe and Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
User experience
To accelerate a user's review of long-form audio or video while mitigating the risk of hallucinations, we introduce the concept of timestamped citations. Not only are large language models (LLMs) capable of answering a user's question based on the transcript of the file, they are also capable of identifying the timestamp (or timestamps) of the transcript during which the answer was discussed. By using a combination of transcript preprocessing, prompt engineering, and structured LLM output, we enable the user experience shown in the following screenshot, which demonstrates the conversion of LLM-generated timestamp citations into clickable buttons (shown underlined in red) that navigate to the correct portion of the source video.
The user in this example has uploaded several videos, including some recordings of AWS re:Invent talks. You'll notice that the preceding answer actually contains a hallucination originating from an error in the transcript; the AI assistant replied that "Hyperpaths" was announced, when in reality the service is called Amazon SageMaker HyperPod.
The user in the preceding screenshot had the following journey:
- The user asks the AI assistant "What's new with SageMaker?" The assistant searches the timestamped transcripts of the uploaded re:Invent videos.
- The assistant provides an answer with citations. These citations include both the name of the video and a timestamp, and the frontend displays buttons corresponding to the citations. Each citation can point to a different video, or to different timestamps within the same video.
- The user reads that SageMaker "Hyperpaths" was announced. They proceed to verify the accuracy of the generated answer by selecting the buttons, which autoplay the source video starting at that timestamp.
- The user sees that the product is actually called Amazon SageMaker HyperPod, and can be confident that SageMaker HyperPod was the product announced at re:Invent.
This experience, which is at the heart of the ReVIEW application, enables users to efficiently get answers to questions based on uploaded audio or video files and to verify the accuracy of the answers by rewatching the source media for themselves.
Solution overview
The full code for this application is available on the GitHub repo.
The architecture of the solution is shown in the following diagram, showcasing the flow of data through the application.
The workflow consists of the following steps:
- A user accesses the application through an Amazon CloudFront distribution, which adds a custom header and forwards HTTPS traffic to an Elastic Load Balancing application load balancer. Behind the load balancer is a containerized Streamlit application running on Amazon Elastic Container Service (Amazon ECS).
- Amazon Cognito handles user logins to the frontend application and Amazon API Gateway.
- When a user uploads a media file through the frontend, a pre-signed URL is generated for the frontend to upload the file to Amazon Simple Storage Service (Amazon S3).
- The frontend posts the file to an application S3 bucket, at which point a file processing flow is initiated through a triggered AWS Lambda function. The file is sent to Amazon Transcribe and the resulting transcript is stored in Amazon S3. The transcript is postprocessed into a text form more appropriate for use by an LLM, and an AWS Step Functions state machine syncs the transcript to a knowledge base configured in Amazon Bedrock Knowledge Bases. The knowledge base sync process handles chunking and embedding of the transcript, and stores the embedding vectors and file metadata in an Amazon OpenSearch Serverless vector database.
- If a user asks a question of one specific transcript (designated by the "pick media file" dropdown menu in the UI), the entire transcript is used to generate the response, so a retrieval step using the knowledge base is not required and an LLM is called directly through Amazon Bedrock.
- If the user asks a question whose answer might appear in any number of source videos (by choosing Chat with all media files on the dropdown menu in the UI), the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is used to embed the user query, find semantically similar chunks in the vector database, insert those chunks into an LLM prompt, and generate a specially formatted response (see the sketch after this list).
- Throughout the process, application data for tracking transcription and ingestion status, mapping user names to uploaded files, and caching responses is stored in Amazon DynamoDB.
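As an illustrative sketch of the RetrieveAndGenerate step above (not the repository's exact code), the backend could call the API through boto3 roughly as follows. The knowledge base ID and model ARN are placeholders for the values created by your deployment.
import boto3

# Placeholder identifiers -- substitute the knowledge base and generation model
# configured by your ReVIEW deployment.
KNOWLEDGE_BASE_ID = "YOUR_KB_ID"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0"

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_all_media(question: str) -> str:
    """Embed the question, retrieve similar transcript chunks, and generate an answer."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    # Return the generated text; ReVIEW additionally expects the structured
    # citation format described later in this post.
    return response["output"]["text"]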
One important characteristic of the architecture is the clear separation of frontend and backend logic through an API Gateway deployed REST API. This was a design decision to enable users of this application to replace the Streamlit frontend with a custom frontend. There are instructions for replacing the frontend in the README of the GitHub repository.
Timestamped citations
The key to this solution lies in the prompt engineering and structured output format. When generating a response to a user's question, the LLM is instructed to not only provide an answer to the question (if possible), but also to cite its sources in a specific way.
The full prompt can be seen in the GitHub repository, but a shortened pseudo prompt (for brevity) is shown here:
You are an intelligent AI which attempts to answer questions based on retrieved chunks of automatically generated transcripts.
Below are retrieved chunks of transcript with metadata including the file name. Each chunk includes a <media_name> and lines of a transcript, each line beginning with a timestamp.
$$ retrieved transcript chunks $$
Your answer should be in json format, including a list of partial answers, each of which has a citation. The citation should include the source file name and timestamp. Here is the user's question:
$$ user question $$
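As a minimal sketch of how the assembled prompt could be sent to a model (assuming the Amazon Bedrock Converse API and a placeholder model ID, not necessarily the repository's implementation):
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder model ID; the deployed application reads its generation model
# from the configuration file described later in this post.
MODEL_ID = "amazon.nova-pro-v1:0"

def generate_structured_answer(prompt: str) -> dict:
    """Send the citation prompt to the LLM and parse the JSON it returns."""
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    # Assumes the model returns bare JSON text matching the requested format.
    raw_text = response["output"]["message"]["content"][0]["text"]
    return json.loads(raw_text)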
The frontend then parses the LLM response into a fixed schema data model, described with Pydantic BaseModels:
from typing import List

from pydantic import BaseModel

class Citation(BaseModel):
    """A single citation from a transcript"""
    media_name: str
    timestamp: int

class PartialQAnswer(BaseModel):
    """Part of a complete answer, to be concatenated with other partial answers"""
    partial_answer: str
    citations: List[Citation]

class FullQAnswer(BaseModel):
    """Full user query response including citations and one or more partial answers"""
    answer: List[PartialQAnswer]
This format allows the frontend to parse the response and display buttons for each citation that cue up the relevant media segment for user review.
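For example, assuming Pydantic v2 and the models defined above, validating a hypothetical LLM response might look like this:
# Hypothetical LLM output matching the schema above (illustrative values only).
llm_output = '{"answer": [{"partial_answer": "Amazon SageMaker HyperPod was announced.", "citations": [{"media_name": "reinvent_keynote.mp4", "timestamp": 310}]}]}'

# Validate the raw model output against the FullQAnswer schema (Pydantic v2 API).
parsed = FullQAnswer.model_validate_json(llm_output)

for partial in parsed.answer:
    for citation in partial.citations:
        # In the Streamlit frontend, each citation becomes a button that seeks
        # the source video to citation.timestamp.
        print(partial.partial_answer, citation.media_name, citation.timestamp)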
Deployment details
The solution is deployed in the form of one AWS Cloud Development Kit (AWS CDK) stack, which contains four nested stacks:
- A backend that handles transcribing uploaded media and tracking job statuses
- A Retrieval Augmented Generation (RAG) stack that handles setting up OpenSearch Serverless and Amazon Bedrock Knowledge Bases
- An API stack that stands up an Amazon Cognito-authorized REST API and various Lambda functions to logically separate the frontend from the backend
- A frontend stack that consists of a containerized Streamlit application running as a load-balanced service in an ECS cluster, with a CloudFront distribution connected to the load balancer
Prerequisites
The solution requires the following prerequisites:
- You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don't have an AWS account, see How do I create and activate a new Amazon Web Services account?
- You also need to request access to at least one Amazon Bedrock LLM (to generate answers to questions) and one embedding model (to find transcript chunks that are semantically similar to a user question). The following Amazon Bedrock models are the default, but they can be changed using a configuration file at application deployment time, as described later in this post:
- Amazon Titan Embeddings V2 – Text
- Amazon Nova Pro
- You need a Python environment with AWS CDK dependencies installed. For instructions, see Working with the AWS CDK in Python.
- Docker is required to build the Streamlit frontend container at deployment time.
- The minimal IAM permissions needed to bootstrap and deploy the AWS CDK are described in the ReVIEW/infra/minimal-iam-policy.json file in the GitHub repository. Make sure the IAM user or role deploying the stacks has these permissions.
Clone the repository
Fork the repository, and clone it to the location of your choice. For example:
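The exact clone command was omitted here; assuming your fork lives under your own GitHub account (the URL below is a placeholder), it would look like the following:
git clone https://github.com/<your-github-username>/ReVIEW.git
cd ReVIEW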
Edit the deployment config file
Optionally, edit the infra/config.yaml file to provide a descriptive base name for your stack. This file is also where you can choose specific Amazon Bedrock embedding models for semantic retrieval and LLMs for response generation, and define chunking strategies for the knowledge base that will ingest transcriptions of uploaded media files. This file is also where you can reuse an existing Amazon Cognito user pool if you want to bootstrap your application with an existing user base.
Deploy the AWS CDK stacks
Deploy the AWS CDK stacks with the following code:
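The exact commands were omitted here; assuming a standard AWS CDK workflow run from the repository's infra directory, they would look roughly like the following:
cd infra
cdk bootstrap
cdk deploy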
You only need to run the bootstrap command one time per AWS account. The deploy command deploys the parent stack and four nested stacks. The process takes approximately 20 minutes to complete.
When the deployment is complete, a CloudFront distribution URL of the form xxx.cloudfront.net will be printed on the console screen for accessing the application. This URL can also be found on the AWS CloudFormation console by locating the stack whose name matches the value in the config file, then choosing the Outputs tab and finding the value associated with the key ReVIEWFrontendURL. That URL will lead you to a login screen like the following screenshot.
Create an Amazon Cognito user to access the app
To log in to the running web application, you need to create an Amazon Cognito user. Complete the following steps:
- On the Amazon Cognito console, navigate to the recently created user pool.
- In the Users section under User Management, choose Create user.
- Create a user name and password to log in to the ReVIEW application deployed in the account.
When the application deployment is destroyed (as described in the cleanup section), the Amazon Cognito pool remains to preserve the user base. The pool can be fully removed manually using the Amazon Cognito console.
Test the application
Test the application by uploading one or more audio or video files on the File Upload tab. The application supports media formats supported by Amazon Transcribe. If you're looking for a sample video, consider downloading a TED talk. After uploading, you will see the file appear on the Job Status tab. You can track processing progress through the transcription, postprocessing, and knowledge base syncing steps on this tab. After at least one file is marked Complete, you can chat with it on the Chat With Your Media tab.
The Analyze Your Media tab allows you to create and apply custom LLM template prompts to individual uploaded files. For example, you can create a basic summary template or an extract key information template and apply it to your uploaded files here. This functionality is not described in detail in this post.
Clean up
The deployed application will incur ongoing costs even if it isn't used, for example from OpenSearch Serverless indexing and search OCU minimums. To delete all resources created when deploying the application, run the following command:
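The command was omitted here; for an AWS CDK deployment like this one, it would typically be the following, run from the same directory used to deploy:
cdk destroy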
Conclusion
The solution presented in this post demonstrates a powerful pattern for accelerating video and audio review workflows while maintaining human oversight. By combining the power of AI models in Amazon Bedrock with human expertise, you can create tools that not only boost productivity but also preserve the critical element of human judgment in important decision-making processes.
We encourage you to explore this fully open sourced solution, adapt it to your specific use cases, and provide feedback on your experiences.
For expert assistance, the AWS Generative AI Innovation Center, AWS Professional Services, and our AWS Partners are here to help.
About the Author
David Kaleko is a Senior Applied Scientist in the AWS Generative AI Innovation Center.