Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.
Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.
In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.
DeepSeek-R1 distilled variants
Building on the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta's Llama and Qwen architectures, ranging from 1.5–70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher, essentially transferring the knowledge and capabilities of the 671 billion parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain performance closer to the original while still offering meaningful efficiency gains.
Solution overview
In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repo, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.
In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.
You can use the Amazon Bedrock console to deploy through the graphical interface by following the instructions in this post, or alternatively use the following notebook to deploy programmatically with the Amazon Bedrock SDK.
Prerequisites
You should have the following prerequisites:
Prepare the model package
Complete the following steps to prepare the model package:
- Download the DeepSeek-R1-Distill-Llama model artifacts from Hugging Face, from one of the following links, depending on the model you want to deploy:
For more information, you can follow Hugging Face's Downloading models or Download files from the hub instructions.
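If you prefer scripting the download, the following is a minimal sketch using the huggingface_hub library; the repo ID and target directory shown are example values, not requirements of the post.

```python
# Sketch: fetch all model artifacts for a DeepSeek-R1 distilled variant.
# Requires `pip install huggingface_hub`; repo_id and local_dir are example values.
def download_artifacts(repo_id: str, local_dir: str) -> str:
    from huggingface_hub import snapshot_download  # lazy import, third-party
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Example usage:
# download_artifacts("deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
#                    "DeepSeek-R1-Distill-Llama-8B")
```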
You typically need the following files:
- Model configuration file: config.json
- Tokenizer files: tokenizer.json, tokenizer_config.json, and tokenizer.model
- Model weights files in .safetensors format
- Upload these files to a folder in your S3 bucket, in the same AWS Region where you plan to use Amazon Bedrock. Take note of the S3 path you're using.
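The upload step can also be scripted with boto3. A minimal sketch; the bucket name, prefix, and local directory below are placeholders.

```python
def model_artifact_keys(filenames, prefix):
    """Map local artifact filenames to S3 keys under a common folder prefix."""
    return {name: f"{prefix.rstrip('/')}/{name}" for name in filenames}

# Upload sketch (requires boto3 and AWS credentials; names are placeholders):
# import os
# import boto3
# s3 = boto3.client("s3")
# local_dir = "DeepSeek-R1-Distill-Llama-8B"
# for name, key in model_artifact_keys(os.listdir(local_dir),
#                                      "models/deepseek-r1-distill-llama-8b-v1").items():
#     s3.upload_file(os.path.join(local_dir, name), "<your-bucket>", key)
```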
Import the model
Complete the following steps to import the model:
- On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.
- Choose Import model.
- For Model name, enter a name for your model (it's recommended to use a versioning scheme in your name, for tracking your imported model).
- For Import job name, enter a name for your import job.
- For Model import settings, select Amazon S3 bucket as your import source, and enter the S3 path you noted earlier (provide the full path in the form s3://<your-bucket>/folder-with-model-artifacts/).
- For Encryption, optionally choose to customize your encryption settings.
- For Service access role, choose to either create a new IAM role or provide your own.
- Choose Import model.
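The same import can be started programmatically with the boto3 create_model_import_job API. A minimal sketch; the job name, model name, role ARN, and S3 URI are placeholders.

```python
def build_import_job_request(job_name, model_name, role_arn, s3_uri):
    """Assemble kwargs for bedrock.create_model_import_job (values are placeholders)."""
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }

# Usage sketch (requires boto3 and an IAM role with read access to the bucket):
# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_model_import_job(**build_import_job_request(
#     "deepseek-8b-import-v1",
#     "deepseek-r1-distill-llama-8b-v1",
#     "arn:aws:iam::<account-id>:role/<your-role>",
#     "s3://<your-bucket>/folder-with-model-artifacts/"))
```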
Importing the model will take several minutes depending on the model being imported (for example, the Distill-Llama-8B model could take 5–20 minutes to complete).
Watch this video demo for a step-by-step guide.
Test the imported model
After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:
- On the Amazon Bedrock console, choose Chat / Text under Playgrounds in the navigation pane.
- From the model selector, choose your imported model name.
- Adjust the inference parameters as needed and write your test prompt. For example:
<|begin▁of▁sentence|><|User|>Given the following financial data: - Company A's revenue grew from $10M to $15M in 2023 - Operating costs increased by 20% - Initial operating costs were $7M Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>
Because we're using an imported model in the playground, we must include the "beginning_of_sentence" and "user/assistant" tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and provide more accurate responses. If you're following the programmatic approach in the following notebook, this is automatically taken care of by configuring the model.
- Review the model response and metrics provided.
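The same tag convention applies when invoking the imported model through the API. The following is a minimal sketch; the request body fields (prompt, max_tokens, temperature) are assumptions based on common Llama-style invocation payloads, not a documented schema.

```python
import json

# Chat tags for the DeepSeek-R1 distilled models, as in the playground example above.
def format_deepseek_prompt(user_message: str) -> str:
    return f"<|begin▁of▁sentence|><|User|>{user_message}<|Assistant|>"

# Invocation sketch (requires boto3; the model ARN comes from your import job,
# and the body fields are illustrative assumptions):
# import boto3
# runtime = boto3.client("bedrock-runtime")
# body = json.dumps({
#     "prompt": format_deepseek_prompt("What is 2 + 2?"),
#     "max_tokens": 512,
#     "temperature": 0.5,
# })
# response = runtime.invoke_model(modelId="<imported-model-arn>", body=body)
# print(json.loads(response["body"].read()))
```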
Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The recovery time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
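A sketch of tuning those retries with the Boto3 Config object; the retry values shown are illustrative, not recommendations.

```python
# Illustrative retry settings for handling ModelNotReadyException on first invocation.
RETRY_SETTINGS = {"total_max_attempts": 10, "mode": "adaptive"}

# Usage sketch (requires boto3/botocore):
# import boto3
# from botocore.config import Config
# runtime = boto3.client("bedrock-runtime", config=Config(retries=RETRY_SETTINGS))
```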
Once you're ready to import the model, use this step-by-step video demo to help you get started.
Pricing
Custom Model Import enables you to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn't charge for model import; you're charged for inference based on two factors: the number of active model copies and their duration of activity.
Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The Custom Model Units required for hosting depend on the model's architecture, parameter count, and context length, with examples ranging from 2 Units for a Llama 3.1 8B 128K model to 8 Units for a Llama 3.1 70B 128K model.
Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this may involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
Consider the following pricing example: An application developer imports a customized Llama 3.1 type model that is 8B parameters in size with a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units. So, the price per minute will be $0.1570 and the model storage costs will be $3.90 for the month.
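The arithmetic in this example can be sketched as follows; the rates come from the example above, while the usage minutes in the comment are hypothetical.

```python
# Rates from the pricing example above: 2 Custom Model Units for a Llama 3.1 8B 128K
# model in us-east-1 -> $0.1570 per model copy per minute, $3.90/month storage.
PRICE_PER_COPY_MINUTE = 0.1570
STORAGE_PER_MONTH = 3.90

def monthly_cost(active_minutes: float, copies: int = 1) -> float:
    """Approximate monthly cost: active inference minutes plus model storage."""
    return round(active_minutes * copies * PRICE_PER_COPY_MINUTE + STORAGE_PER_MONTH, 2)

# e.g. a single copy active one hour per day for 30 days (hypothetical usage):
# monthly_cost(60 * 30)
```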
For more information, see Amazon Bedrock pricing.
Benchmarks
DeepSeek has published benchmarks comparing their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80–90% of the original model's reasoning capabilities, while the 8B version achieves between 59–92% performance with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models on specific reasoning tasks.
Other considerations
When deploying DeepSeek models in Amazon Bedrock, consider the following aspects:
- Model versioning is essential. Because Custom Model Import creates unique models for each import, implement a clear versioning strategy in your model names to track different versions and variations.
- The currently supported model formats focus on Llama-based architectures. Although DeepSeek-R1 distilled versions offer excellent performance, the AI ecosystem continues evolving rapidly. Keep an eye on the Amazon Bedrock model catalog as new architectures and larger models become available through the platform.
- Evaluate your use case requirements carefully. Although larger models like DeepSeek-R1-Distill-Llama-70B provide better performance, the 8B version might offer sufficient capability for many applications at a lower cost.
- Consider implementing monitoring and observability. Amazon CloudWatch provides metrics for your imported models, helping you track usage patterns and performance. You can monitor costs with AWS Cost Explorer.
- Start with a lower concurrency quota and scale up based on actual usage patterns. The default limit of three concurrent model copies per account is suitable for most initial deployments.
Conclusion
Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek's innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.
The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are critical. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI systems that can adapt as their requirements evolve.
For more information, refer to the Amazon Bedrock User Guide.
About the Authors
Raj Pathak is a Principal Solutions Architect and Technical Advisor to Fortune 50 and mid-sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Generative AI, Natural Language Processing, Intelligent Document Processing, and MLOps.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Morgan Rankey is a Solutions Architect based in New York City, specializing in Hedge Funds. He excels in assisting customers to build resilient workloads within the AWS ecosystem. Prior to joining AWS, Morgan led the Sales Engineering team at Riskified through its IPO. He started his career by focusing on AI/ML solutions for machine asset management, serving some of the largest automotive companies globally.
Harsh Patel is an AWS Solutions Architect supporting 200+ SMB customers across the United States to drive digital transformation through cloud-native solutions. As an AI&ML Specialist, he focuses on Generative AI, Computer Vision, Reinforcement Learning, and Anomaly Detection. Outside the tech world, he recharges by hitting the golf course and embarking on scenic hikes with his dog.