Metadata filtering for tabular information with Data Bases for Amazon Bedrock

Amazon Bedrock is a totally managed service that gives a selection of high-performing basis fashions (FMs) from main synthetic intelligence (AI) firms like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon by means of a single API. To equip FMs with up-to-date and proprietary data, organizations use Retrieval Augmented Technology (RAG), a way that fetches information from firm information sources and enriches the immediate to supply extra related and correct responses. Data Bases for Amazon Bedrock is a totally managed functionality that helps you implement all the RAG workflow, from ingestion to retrieval and immediate augmentation. Nevertheless, details about one dataset could be in one other dataset, known as metadata. With out utilizing metadata, your retrieval course of may cause the retrieval of unrelated outcomes, thereby lowering FM accuracy and growing price within the FM immediate token.

On March 27, 2024, Amazon Bedrock introduced a key new characteristic known as metadata filtering and in addition modified the default engine. This transformation permits you to use metadata fields in the course of the retrieval course of. Nevertheless, the metadata fields must be configured in the course of the data base ingestion course of. Usually, you might need tabular information the place particulars about one subject can be found in one other subject. Additionally, you possibly can have a requirement to quote the precise textual content doc or textual content subject to stop hallucination. On this submit, we present you how you can use the brand new metadata filtering characteristic with Data Bases for Amazon Bedrock for such tabular information.

Resolution overview

The answer consists of the next high-level steps:

Put together information for metadata filtering.
Create and ingest information and metadata into the data base.
Retrieve information from the data base utilizing metadata filtering.

Put together information for metadata filtering

As of this writing, Data Bases for Amazon Bedrock helps Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise, and MongoDB Atlas as underlying vector retailer suppliers. On this submit, we create and entry an OpenSearch Serverless vector retailer utilizing the Amazon Bedrock Boto3 SDK. For extra particulars, see Arrange a vector index on your data base in a supported vector retailer.

For this submit, we create a data base utilizing the general public dataset Meals.com – Recipes and Opinions. The next screenshot exhibits an instance of the dataset.

The TotalTime is in ISO 8601 format. You may convert that to minutes utilizing the next logic:

# Operate to transform ISO 8601 length to minutes
def convert_to_minutes(length):
    hours = 0
    minutes = 0
    
    # Discover hours and minutes utilizing regex
    match = re.match(r'PT(?:(d+)H)?(?:(d+)M)?', length)
    
    if match:
        if match.group(1):
            hours = int(match.group(1))
        if match.group(2):
            minutes = int(match.group(2))
    
    # Convert complete time to minutes
    total_minutes = hours * 60 + minutes
    return total_minutes

df['TotalTimeInMinutes'] = df['TotalTime'].apply(convert_to_minutes)

After changing a few of the options like CholesterolContent, SugarContent, and RecipeInstructions, the information body seems like the next screenshot.

To allow the FM to level to a particular menu with a hyperlink (cite the doc), we break up every row of the tabular information in a single textual content file, with every file containing RecipeInstructions as the information subject and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata ought to be saved in a separate JSON file with the identical title as the information file and .metadata.json added to its title. For instance, if the information file title is 100.txt, the metadata file title ought to be 100.txt.metadata.json. For extra particulars, see Add metadata to your information to permit for filtering. Additionally, the content material within the metadata file ought to be within the following format:

{
"metadataAttributes": {
"${attribute1}": "${value1}",
"${attribute2}": "${value2}",
...
}
}

For the sake of simplicity, we solely course of the highest 2,000 rows to create the data base.

After you import the required libraries, create a neighborhood listing utilizing the next Python code:

import pandas as pd
import os, json, tqdm, boto3

metafolder="multi_file_recipe_data"os.mkdir(metafolder)

Iterate excessive 2,000 rows to create information and metadata information to retailer within the native folder:

for i in tqdm.trange(2000):
    desc = str(df['RecipeInstructions'][i])
    meta = {
    "metadataAttributes": {
        "Identify": str(df['Name'][i]),
        "TotalTimeInMinutes": str(df['TotalTimeInMinutes'][i]),
        "CholesterolContent": str(df['CholesterolContent'][i]),
        "SugarContent": str(df['SugarContent'][i]),
    }
    }
    filename = metafolder+'/' + str(i+1)+ '.txt'
    f = open(filename, 'w')
    f.write(desc)
    f.shut()
    metafilename = filename+'.metadata.json'
    with open( metafilename, 'w') as f:
        json.dump(meta, f)

Create an Amazon Easy Storage Service (Amazon S3) bucket named food-kb and add the information:

# Add information to s3
s3_client = boto3.consumer("s3")
bucket_name = "recipe-kb"
data_root = metafolder+'/'
def uploadDirectory(path,bucket_name):
    for root,dirs,information in os.stroll(path):
        for file in tqdm.tqdm(information):
            s3_client.upload_file(os.path.be part of(root,file),bucket_name,file)

uploadDirectory(data_root, bucket_name)

Create and ingest information and metadata into the data base

When the S3 folder is prepared, you possibly can create the data base on the Amazon Bedrock console utilizing the SDK in response to this instance pocket book.

Retrieve information from the data base utilizing metadata filtering

Now let’s retrieve some information from the data base. For this submit, we use Anthropic Claude Sonnet on Amazon Bedrock for our FM, however you possibly can select from a wide range of Amazon Bedrock fashions. First, you might want to set the next variables, the place kb_id is the ID of your data base. The data base ID could be discovered programmatically, as proven within the instance pocket book, or from the Amazon Bedrock console by navigating to the person data base, as proven within the following screenshot.

Set the required Amazon Bedrock parameters utilizing the next code:

import boto3
import pprint
from botocore.consumer import Config
import json

pp = pprint.PrettyPrinter(indent=2)
session = boto3.session.Session()
area = session.region_name
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.consumer('bedrock-runtime', region_name = area)
bedrock_agent_client = boto3.consumer("bedrock-agent-runtime",
                              config=bedrock_config, region_name = area)
kb_id = "EIBBXVFDQP"
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

# retrieve api for fetching solely the related context.

question = " Inform me a recipe that I could make below half-hour and has ldl cholesterol lower than 10 "

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'textual content': question
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 2 
        }
    }
)
pp.pprint(relevant_documents["retrievalResults"])

The next code is the output of the retrieval from the data base with out metadata filtering for the question “Inform me a recipe that I could make below half-hour and has ldl cholesterol lower than 10.” As we are able to see, out of the 2 recipes, the preparation durations are 30 and 480 minutes, respectively, and the ldl cholesterol contents are 86 and 112.4, respectively. Due to this fact, the retrieval isn’t following the question precisely.

The next code demonstrates how you can use the Retrieve API with the metadata filters set to a ldl cholesterol content material lower than 10 and minutes of preparation lower than 30 for a similar question:

def retrieve(question, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery= {
            'textual content': question
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults,
                 "filter": {
                            'andAll':[
                                {
                                "lessThan": {
                                "key": "CholesterolContent",
                                "value": 10
                                }
                            },
                                {
                            "lessThan": {
                                "key": "TotalTimeInMinutes",
                                "value": 30
                            }
                                }
                            ]
                        }
            }
        }
    ) 
question = "Inform me a recipe that I could make below half-hour and has ldl cholesterol lower than 10" 
response = retrieve(question, kb_id, 2)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)

As we are able to see within the following outcomes, out of the 2 recipes, the preparation instances are 27 and 20, respectively, and the ldl cholesterol contents are 0 and 0, respectively. With using metadata filtering, we get extra correct outcomes.

The next code exhibits how you can get correct output utilizing the identical metadata filtering with the retrieve_and_generate API. First, we set the immediate, then we arrange the API with metadata filtering:

immediate = f"""
Human: You've nice data about meals, so present solutions to questions by utilizing reality. 
If you do not know the reply, simply say that you do not know, do not attempt to make up a solution.

Assistant:"""

def retrieve_and_generate(question, kb_id,modelId, numberOfResults=10):
    return bedrock_agent_client.retrieve_and_generate(
        enter= {
            'textual content': question,
        },
        retrieveAndGenerateConfiguration={
        'knowledgeBaseConfiguration': {
            'generationConfiguration': {
                'promptTemplate': {
                    'textPromptTemplate': f"{immediate} $search_results$"
                }
            },
            'knowledgeBaseId': kb_id,
            'modelArn': model_id,
            'retrievalConfiguration': {
                'vectorSearchConfiguration': {
                    'numberOfResults': numberOfResults,
                    'overrideSearchType': 'HYBRID',
                     "filter": {
                            'andAll':[
                                {
                                "lessThan": {
                                "key": "CholesterolContent",
                                "value": 10
                                }
                            },
                                {
                            "lessThan": {
                                "key": "TotalTimeInMinutes",
                                "value": 30
                            }
                                }
                            ]
                        },
                }
        }
                    },
        'sort': 'KNOWLEDGE_BASE'
    }
    )
    
question = "Inform me a recipe that I could make below half-hour and has ldl cholesterol lower than 10"
response = retrieve_and_generate(question, kb_id,modelId, numberOfResults=10)
pp.pprint(response['output']['text'])

As we are able to see within the following output, the mannequin returns an in depth recipe that follows the instructed metadata filtering of lower than half-hour of preparation time and a ldl cholesterol content material lower than 10.

Clear up

Ensure that to remark the next part in case you’re planning to make use of the data base that you just created for constructing your RAG software. For those who solely wished to check out creating the data base utilizing the SDK, be certain to delete all of the sources that had been created as a result of you’ll incur prices for storing paperwork within the OpenSearch Serverless index. See the next code:

bedrock_agent_client.delete_data_source(dataSourceId = ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
oss_client.indices.delete(index=index_name)
aoss_client.delete_collection(id=collection_id)
aoss_client.delete_access_policy(sort="information", title=access_policy['accessPolicyDetail']['name'])
aoss_client.delete_security_policy(sort="community", title=network_policy['securityPolicyDetail']['name'])
aoss_client.delete_security_policy(sort="encryption", title=encryption_policy['securityPolicyDetail']['name'])
# Delete roles and polices 
iam_client.delete_role(RoleName=bedrock_kb_execution_role)
iam_client.delete_policy(PolicyArn=policy_arn)

Conclusion

On this submit, we defined how you can break up a big tabular dataset into rows to arrange a data base with metadata for every of these data, and how you can then retrieve outputs with metadata filtering. We additionally confirmed how retrieving outcomes with metadata is extra correct than retrieving outcomes with out metadata filtering. Lastly, we confirmed how you can use the consequence with an FM to get correct outcomes.

To additional discover the capabilities of Data Bases for Amazon Bedrock, confer with the next sources:

In regards to the Writer

Tanay Chowdhury is a Knowledge Scientist at Generative AI Innovation Heart at Amazon Net Providers. He helps prospects to resolve their enterprise drawback utilizing Generative AI and Machine Studying.