Producing vogue product descriptions by fine-tuning a vision-language mannequin with SageMaker and Amazon Bedrock

On the planet of on-line retail, creating high-quality product descriptions for hundreds of thousands of merchandise is an important, however time-consuming job. Utilizing machine studying (ML) and pure language processing (NLP) to automate product description era has the potential to save lots of handbook effort and remodel the best way ecommerce platforms function. One of many essential benefits of high-quality product descriptions is the advance in searchability. Prospects can extra simply find merchandise which have appropriate descriptions, as a result of it permits the search engine to establish merchandise that match not simply the overall class but additionally the precise attributes talked about within the product description. For instance, a product that has an outline that features phrases resembling “lengthy sleeve” and “cotton neck” will likely be returned if a client is searching for a “lengthy sleeve cotton shirt.” Moreover, having factoid product descriptions can improve buyer satisfaction by enabling a extra customized shopping for expertise and enhancing the algorithms for recommending extra related merchandise to customers, which increase the likelihood that customers will make a purchase order.

With the development of Generative AI, we are able to use vision-language fashions (VLMs) to foretell product attributes straight from pictures. Pre-trained picture captioning or visible query answering (VQA) fashions carry out effectively on describing every-day pictures however can’t to seize the domain-specific nuances of ecommerce merchandise wanted to attain passable efficiency in all product classes. To unravel this drawback, this publish reveals you the way to predict domain-specific product attributes from product pictures by fine-tuning a VLM on a vogue dataset utilizing Amazon SageMaker, after which utilizing Amazon Bedrock to generate product descriptions utilizing the anticipated attributes as enter. So you’ll be able to observe alongside, we’re sharing the code in a GitHub repository.

Amazon Bedrock is a completely managed service that gives a alternative of high-performing basis fashions (FMs) from main AI corporations like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon by means of a single API, together with a broad set of capabilities that you must construct generative AI functions with safety, privateness, and accountable AI.

You should utilize a managed service, resembling Amazon Rekognition, to foretell product attributes as defined in Automating product description era with Amazon Bedrock. Nonetheless, if you happen to’re making an attempt to extract specifics and detailed traits of your product or your area (business), fine-tuning a VLM on Amazon SageMaker is critical.

Imaginative and prescient-language fashions

Since 2021, there was an increase in curiosity in vision-language fashions (VLMs), which led to the discharge of options resembling Contrastive Language-Picture Pre-training (CLIP) and Bootstrapping Language-Picture Pre-training (BLIP). On the subject of duties resembling picture captioning, text-guided picture era, and visible question-answering, VLMs have demonstrated state-of-the artwork efficiency.

On this publish, we use BLIP-2, which was launched in BLIP-2: Bootstrapping Language-Picture Pre-training with Frozen Picture Encoders and Massive Language Fashions, as our VLM. BLIP-2 consists of three fashions: a CLIP-like picture encoder, a Querying Transformer (Q-Former) and a big language mannequin (LLM). We use a model of BLIP-2, that accommodates Flan-T5-XL because the LLM.

The next diagram illustrates the overview of BLIP-2:

Determine 1: BLIP-2 overview

The pre-trained model of the BLIP-2 mannequin has been demonstrated in Construct an image-to-text generative AI utility utilizing multimodality fashions on Amazon SageMaker and Construct a generative AI-based content material moderation resolution on Amazon SageMaker JumpStart. On this publish, we reveal the way to fine-tune BLIP-2 for a domain-specific use case.

Answer overview

The next diagram illustrates the answer structure.

Determine 2: Excessive-level resolution structure

The high-level overview of the answer is:

An ML scientist makes use of Sagemaker notebooks to course of and break up the info into coaching and validation information.
The datasets are uploaded to Amazon Easy Storage Service (Amazon S3) utilizing the S3 shopper (a wrapper round an HTTP name).
Then the Sagemaker shopper is used to launch a Sagemaker Coaching job, once more a wrapper for an HTTP name.
The coaching job manages the copying of the datasets from S3 to the coaching container, the coaching of the mannequin, and the saving of its artifacts to S3.
Then, by means of one other name of the Sagemaker shopper, an endpoint is generated, copying the mannequin artifacts into the endpoint internet hosting container.
The inference workflow is then invoked by means of an AWS Lambda request, which first makes an HTTP request to the Sagemaker endpoint, after which makes use of that to make one other request to Amazon Bedrock.

Within the following sections, we reveal the way to:

Arrange the event setting
Load and put together the dataset
Fantastic-tune the BLIP-2 mannequin to be taught product attributes utilizing SageMaker
Deploy the fine-tuned BLIP-2 mannequin and predict product attributes utilizing SageMaker
Generate product descriptions from predicted product attributes utilizing Amazon Bedrock

Arrange the event setting

An AWS account is required with an AWS Identification and Entry Administration (IAM) position that has permissions to handle sources created as a part of the answer. For particulars, see Creating an AWS account.

We use Amazon SageMaker Studio with the ml.t3.medium occasion and the Information Science 3.0 picture. Nonetheless, it’s also possible to use an Amazon SageMaker pocket book occasion or any built-in improvement setting (IDE) of your alternative.

Word: You should definitely arrange your AWS Command Line Interface (AWS CLI) credentials accurately. For extra info, see Configure the AWS CLI.

An ml.g5.2xlarge occasion is used for SageMaker Coaching jobs, and an ml.g5.2xlarge occasion is used for SageMaker endpoints. Guarantee enough capability for this occasion in your AWS account by requesting a quota improve if required. Additionally test the pricing of the on-demand situations.

It’s essential to clone this GitHub repository for replicating the answer demonstrated on this publish. First, launch the pocket book essential.ipynb in SageMaker Studio by deciding on the Picture as Information Science and Kernel as Python 3. Set up all of the required libraries talked about within the necessities.txt.

Load and put together the dataset

For this publish, we use the Kaggle Style Pictures Dataset, which include 44,000 merchandise with a number of class labels, descriptions, and excessive decision pictures. On this publish we need to reveal the way to fine-tune a mannequin to be taught attributes resembling material, match, collar, sample, and sleeve size of a shirt utilizing the picture and a query as inputs.

Every product is recognized by an ID resembling 38642, and there’s a map to all of the merchandise in kinds.csv. From right here, we are able to fetch the picture for this product from pictures/38642.jpg and the entire metadata from kinds/38642.json. To fine-tune our mannequin, we have to convert our structured examples into a set of query and reply pairs. Our last dataset has the next format after processing for every attribute:

Id | Query | Reply
38642 | What's the material of the clothes on this image? | Material: Cotton

After we course of the dataset, we break up it into coaching and validation units, create CSV recordsdata, and add the dataset to Amazon S3.

Fantastic-tune the BLIP-2 mannequin to be taught product attributes utilizing SageMaker

To launch a SageMaker Coaching job, we want the HuggingFace Estimator. SageMaker begins and manages all the crucial Amazon Elastic Compute Cloud (Amazon EC2) situations for us, provides the suitable Hugging Face container, uploads the desired scripts, and downloads information from our S3 bucket to the container to /decide/ml/enter/information.

We fine-tune BLIP-2 utilizing the Low-Rank Adaptation (LoRA) approach, which provides trainable rank decomposition matrices to each Transformer construction layer whereas maintaining the pre-trained mannequin weights in a static state. This method can improve coaching throughput and cut back the quantity of GPU RAM required by 3 occasions and the variety of trainable parameters by 10,000 occasions. Regardless of utilizing fewer trainable parameters, LoRA has been demonstrated to carry out in addition to or higher than the total fine-tuning approach.

We ready entrypoint_vqa_finetuning.py which implements fine-tuning of BLIP-2 with the LoRA approach utilizing Hugging Face Transformers, Speed up, and Parameter-Environment friendly Fantastic-Tuning (PEFT). The script additionally merges the LoRA weights into the mannequin weights after coaching. Because of this, you’ll be able to deploy the mannequin as a standard mannequin with none extra code.

from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration
 
mannequin = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-flan-t5-xl",
        device_map="auto",
        cache_dir="/tmp",
        load_in_8bit=True,
    )

config = LoraConfig(
    r=8, # Lora consideration dimension.
    lora_alpha=32, # the alpha parameter for Lora scaling.
    lora_dropout=0.05, # the dropout likelihood for Lora layers.
    bias="none", # the bias sort for Lora.
    target_modules=["q", "v"],
)

mannequin = get_peft_model(mannequin, config)

We reference entrypoint_vqa_finetuning.py because the entry_point within the Hugging Face Estimator.

from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'epochs': 10,
    'file-name': "vqa_train.csv",
}

estimator = HuggingFace(
    entry_point="entrypoint_vqa_finetuning.py",
    source_dir="../src",
    position=position,
    instance_count=1,
    instance_type="ml.g5.2xlarge", 
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    hyperparameters = hyperparameters,
    base_job_name="VQA",
    sagemaker_session=sagemaker_session,
    output_path=f"{output_path}/fashions",
    code_location=f"{output_path}/code",
    volume_size=60,
    metric_definitions=[
        {'Name': 'batch_loss', 'Regex': 'Loss: ([0-9.]+)'},
        {'Title': 'epoch_loss', 'Regex': 'Epoch Loss: ([0-9.]+)'}
    ],
)

We are able to begin our coaching job by working with the .match() methodology and passing our Amazon S3 path for pictures and our enter file.

estimator.match({"pictures": images_input, "input_file": input_file})

Deploy the fine-tuned BLIP-2 mannequin and predict product attributes utilizing SageMaker

We deploy the fine-tuned BLIP-2 mannequin to the SageMaker actual time endpoint utilizing the HuggingFace Inference Container. It’s also possible to use the massive mannequin inference (LMI) container, which is described in additional element in Construct a generative AI-based content material moderation resolution on Amazon SageMaker JumpStart, which deploys a pre-trained BLIP-2 mannequin. Right here, we reference our fine-tuned mannequin in Amazon S3 as an alternative of the pre-trained mannequin obtainable within the Hugging Face hub. We first create the mannequin and deploy the endpoint.

from sagemaker.huggingface import HuggingFaceModel

mannequin = HuggingFaceModel(
   model_data=estimator.model_data,
   position=position,
   transformers_version="4.28",
   pytorch_version="2.0",
   py_version="py310",
   model_server_workers=1,
   sagemaker_session=sagemaker_session
)

endpoint_name = "endpoint-finetuned-blip2"
mannequin.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge", endpoint_name=endpoint_name )

When the endpoint standing turns into in service, we are able to invoke the endpoint for the instructed vision-to-language era job with an enter picture and a query as a immediate:

inputs = {
    "immediate": "What's the sleeve size of the shirt on this image?",
    "picture": picture # picture encoded in Base64
}

The output response seems to be like the next:

{"Sleeve Size": "Lengthy Sleeves"}

Generate product descriptions from predicted product attributes utilizing Amazon Bedrock

To get began with Amazon Bedrock, request entry to the foundational fashions (they don’t seem to be enabled by default). You possibly can observe the steps within the documentation to allow mannequin entry. On this publish, we use Anthropic’s Claude in Amazon Bedrock to generate product descriptions. Particularly, we use the mannequin anthropic.claude-3-sonnet-20240229-v1 as a result of it gives good efficiency and pace.

After creating the boto3 shopper for Amazon Bedrock, we create a immediate string that specifies that we need to generate product descriptions utilizing the product attributes.

You might be an professional in writing product descriptions for shirts. Use the info under to create product description for a web site. The product description ought to include all given attributes.
Present some inspirational sentences, for instance, how the material strikes. Take into consideration what a possible buyer desires to know concerning the shirts. Listed below are the information that you must create the product descriptions:
[Here we insert the predicted attributes by the BLIP-2 model]

The immediate and mannequin parameters, together with most variety of tokens used within the response and the temperature, are handed to the physique. The JSON response have to be parsed earlier than the ensuing textual content is printed within the last line.

bedrock = boto3.shopper(service_name="bedrock-runtime", region_name="us-west-2")

model_id = "anthropic.claude-3-sonnet-20240229-v1"

physique = json.dumps(
    {"system": immediate, "messages": attributes_content, "max_tokens": 400, "temperature": 0.1, "anthropic_version": "bedrock-2023-05-31"}
)

response = bedrock.invoke_model(
    physique=physique,
    modelId=model_id,
    settle for="utility/json",
    contentType="utility/json"
)

The generated product description response seems to be like the next:

"Basic Striped Shirt Calm down into comfy informal type with this basic collared striped shirt. With a daily match that's neither too slim nor too unfastened, this versatile high layers completely beneath sweaters or jackets."

Conclusion

We’ve proven you ways the mixture of VLMs on SageMaker and LLMs on Amazon Bedrock current a robust resolution for automating vogue product description era. By fine-tuning the BLIP-2 mannequin on a vogue dataset utilizing Amazon SageMaker, you’ll be able to predict domain-specific and nuanced product attributes straight from pictures. Then, utilizing the capabilities of Amazon Bedrock, you’ll be able to generate product descriptions from the anticipated product attributes, enhancing the searchability and personalization of ecommerce platforms. As we proceed to discover the potential of generative AI, LLMs and VLMs emerge as a promising avenue for revolutionizing content material era within the ever-evolving panorama of on-line retail. As a subsequent step, you’ll be able to strive fine-tuning this mannequin by yourself dataset utilizing the code offered within the GitHub repository to check and benchmark the outcomes in your use instances.

In regards to the Authors

Antonia Wiebeler is a Information Scientist on the AWS Generative AI Innovation Heart, the place she enjoys constructing proofs of idea for patrons. Her ardour is exploring how generative AI can resolve real-world issues and create worth for patrons. Whereas she just isn’t coding, she enjoys working and competing in triathlons.

Daniel Zagyva is a Information Scientist at AWS Skilled Companies. He makes a speciality of growing scalable, production-grade machine studying options for AWS clients. His expertise extends throughout totally different areas, together with pure language processing, generative AI, and machine studying operations.

Lun Yeh is a Machine Studying Engineer at AWS Skilled Companies. She makes a speciality of NLP, forecasting, MLOps, and generative AI and helps clients undertake machine studying of their companies. She graduated from TU Delft with a level in Information Science & Know-how.

Fotinos Kyriakides is an AI/ML Guide at AWS Skilled Companies specializing in growing production-ready ML options and platforms for AWS clients. In his free time Fotinos enjoys working and exploring.