Pre-training genomic language fashions utilizing AWS HealthOmics and Amazon SageMaker

Genomic language fashions are a brand new and thrilling area within the software of huge language fashions to challenges in genomics. On this weblog submit and open supply challenge, we present you how one can pre-train a genomics language mannequin, HyenaDNA, utilizing your genomic knowledge within the AWS Cloud. Right here, we use AWS HealthOmics storage as a handy and cost-effective omic knowledge retailer and Amazon Sagemaker as a completely managed machine studying (ML) service to coach and deploy the mannequin.

Genomic language fashions

Genomic language fashions signify a brand new strategy within the area of genomics, providing a method to perceive the language of DNA. These fashions use the transformer structure, a kind of pure language processing (NLP), to interpret the huge quantity of genomic data accessible, permitting researchers and scientists to extract significant insights extra precisely than with present in silico approaches and extra cost-effectively than with present in situ strategies.

By bridging the hole between uncooked genetic knowledge and actionable data, genomic language fashions maintain immense promise for varied industries and analysis areas, together with whole-genome evaluation, delivered care, prescribed drugs, and agriculture. They facilitate the invention of novel gene capabilities, the identification of disease-causing mutations, and the event of customized therapy methods, finally driving innovation and development in genomics-driven fields. The power to successfully analyze and interpret genomic knowledge at scale is the important thing to precision medication, agricultural optimization, and biotechnological breakthroughs, making genomic language fashions a doable new foundational expertise in these industries.

A few of the pioneering genomic language fashions embrace

DNABERT which was one of many first makes an attempt to make use of the transformer structure to be taught the language of DNA. DNABERT used a Bidirectional Encoder Representations from Transformers (BERT, encoder-only) structure pre-trained on a human reference genome and confirmed promising outcomes on downstream supervised duties.
Nucleotide transformer has the same structure to DNABERT and confirmed that pre-training on extra knowledge and rising the context window dimension improves the mannequin’s accuracy on downstream duties.
HyenaDNA makes use of the transformer structure, like different genomic fashions, besides that it replaces every self-attention layer with a Hyena operator. This widens the context window to permit processing of as much as 1 million tokens, considerably greater than prior fashions, permitting it to be taught longer-range interactions in DNA.

In our exploration of cutting-edge fashions that push the boundaries of genetic sequence evaluation, we targeted on HyenaDNA. Pretrained HyenaDNA fashions are readily accessible on Hugging Face. This availability facilitates straightforward integration into present tasks or the start line for brand spanking new explorations in genetic sequence evaluation.

AWS HealthOmics and sequence shops

AWS HealthOmics is a purpose-built service that helps healthcare and life science organizations and their software program companions retailer, question, and analyze genomic, transcriptomic, and different omics knowledge after which generate insights from that knowledge to enhance well being and drive deeper organic understanding. It helps large-scale evaluation and collaborative analysis by means of HealthOmics storage, analytics, and workflow capabilities.

With HealthOmics storage, a managed omics targeted findable accessible, interoperable, and reusable (FAIR) knowledge retailer, customers can affordably retailer, manage, share, and entry petabytes of bioinformatics knowledge effectively at a low value per gigabase. HealthOmics sequence shops ship value financial savings by means of computerized tiering and compression of recordsdata based mostly on utilization, allow sharing and findability by means of the biologically targeted metadata and provenance monitoring, and supply immediate entry to continuously used knowledge by means of low latency Amazon Easy Storage Service (Amazon S3) appropriate APIs or HealthOmics native APIs. All of that is delivered by HealthOmics, eradicating the burden of managing compression, tiering, metadata, and file group from prospects.

Amazon SageMaker

Amazon SageMaker is a completely managed ML service supplied by AWS, designed to cut back the time and price related to coaching and tuning ML fashions at scale.

With SageMaker Coaching, a managed batch ML compute service, customers can effectively practice fashions with out having to handle the underlying infrastructure. SageMaker notably helps widespread deep studying frameworks, together with PyTorch, which is integral to the options supplied right here.

SageMaker additionally supplies a broad choice of ML infrastructure and mannequin deployment choices to assist meet all of your ML inference wants.

Resolution overview

On this weblog submit we tackle pre-training a genomic language mannequin on an assembled genome. This genomic knowledge could possibly be both public (for instance, GenBank) or could possibly be your personal proprietary knowledge. The next diagram illustrates the workflow:

We begin with genomic knowledge. For the needs of this weblog submit, we’re utilizing a public non-reference Mouse genome from GenBank. The dataset is a part of The Mouse Genomes Challenge and represents a consensus genome sequence of inbred mouse strains. The sort of genomic knowledge might readily be interchanged with proprietary datasets that you simply could be working with in your analysis.
We use a SageMaker pocket book to course of the genomic recordsdata and to import these right into a HealthOmics sequence retailer.
A second SageMaker pocket book is used to start out the coaching job on SageMaker.
Contained in the managed coaching job within the SageMaker surroundings, the coaching job first downloads the mouse genome utilizing the S3 URI equipped by HealthOmics.
Then the coaching job retrieves the checkpoint weights of the HyenaDNA mannequin from Huggingface. These weights are pretrained on the human reference genome. This pretraining permits the mannequin to grasp and predict genomic sequences, offering a complete baseline for additional specialised coaching on quite a lot of genomic duties.
Utilizing these assets, the HyenaDNA mannequin is skilled, the place it makes use of the mouse genome to refine its parameters. After pre-training is full and validation outcomes are passable, the skilled mannequin is saved to Amazon S3.
Then we deploy that mannequin as a SageMaker real-time inference endpoint.
Lastly the mannequin is examined towards a set of recognized genome sequences utilizing some inference API calls.

Information preparation and loading into sequence retailer

The preliminary step in our machine studying workflow focuses on making ready the info. We begin by importing the genomic sequences right into a HealthOmics sequence retailer. Though FASTA recordsdata are the usual format for storing reference sequences, we convert these to FASTQ format. This conversion is carried out to raised replicate the format anticipated to retailer the assembled knowledge of a sequenced pattern.

Within the pattern Jupyter pocket book we present the way to obtain FASTA recordsdata from GenBank, convert them into FASTQ recordsdata, after which load them right into a HealthOmics sequence retailer. You may skip this step If you have already got your personal genomic knowledge in a sequence retailer.

Coaching on SageMaker

We use PyTorch and Amazon SageMaker script mode to coach this mannequin. Script mode’s compatibility with PyTorch was essential, permitting us to make use of our present scripts with minimal modifications. For the coaching, we extract the coaching knowledge from the sequence retailer by means of the sequence retailer’s supplied S3 URIs. You may, for instance, use the boto3 library to acquire this S3 URI.

seq_store_id = "4308389581“

seq_store_info = omics.get_sequence_store(id=seq_store_id)
s3_uri = seq_store_info["s3Access"]["s3Uri"]
s3_arn = seq_store_info["s3Access"]["s3AccessPointArn"]
key_arn = seq_store_info["sseConfig"]["keyArn"]
s3_uri, s3_arn, key_arn

S3_DATA_URI = f"{s3_uri}readSet/"
S3_DATA_URI

While you present this to the SageMaker estimator, the coaching job takes care of downloading the info from the sequence retailer by means of its S3 URI. Following Nguyen et al, we practice on chromosomes 2, 4, 6, 8, X, and 14–19; cross-validate on chromosomes 1, 3, 12, and 13; and take a look at on chromosomes 5, 7, and 9–11.

To maximise the coaching effectivity of our HyenaDNA mannequin, we use distributed knowledge parallel (DDP). DDP is a method that facilitates the parallel processing of our coaching duties throughout a number of GPUs. To effectively implement DDP, we used the Hugging Face Speed up library. Speed up simplifies working distributed coaching by abstracting away the complexity usually related to establishing DDP.

After you will have outlined your coaching script, you’ll be able to configure and submit a SageMaker coaching job.

First, let’s outline the hyperparameters, beginning with model_checkpoint. This parameter refers to a HuggingFace mannequin ID for a particular pre-trained mannequin. Notably, the HyenaDNA mannequin lineup contains checkpoints that may deal with as much as 1 million tokens. Nevertheless, for demonstration functions, we’re utilizing the hyenadna-small-32k-seqlen-hf mannequin, which has a context window of 32,000 tokens, indicated by the max_length setting. It’s important to grasp that completely different mannequin IDs and corresponding max_length settings might be chosen to make use of fashions with smaller or bigger context home windows, relying in your computational wants and goals.

The species parameter is about to mouse, specifying the kind of organism the genomic coaching knowledge represents.

hyperparameters = {
    "species" : "mouse",
    "epochs": 150,
    "model_checkpoint": MODEL_ID,
    "max_length": 32_000,
    "batch_size": 4,
    "learning_rate": 6e-4,
    "weight_decay" : 0.1,
    "log_level" : "INFO",
    "log_interval" : 100
}

Subsequent, outline what metrics, particularly the coaching and validation perplexity, to seize from the coaching logs:

metric_definitions = [
    {"Name": "epoch", "Regex": "Epoch: ([0-9.]*)"},
    {"Title": "step", "Regex": "Step: ([0-9.]*)"},
    {"Title": "train_loss", "Regex": "Prepare Loss: ([0-9.e-]*)"},
    {"Title": "train_perplexity", "Regex": "Prepare Perplexity: ([0-9.e-]*)"},
    {"Title": "eval_loss", "Regex": "Eval Common Loss: ([0-9.e-]*)"},
    {"Title": "eval_perplexity", "Regex": "Eval Perplexity: ([0-9.e-]*)"}
]

Lastly, outline a Pytorch estimator and submit a coaching job that refers back to the knowledge location obtained from the HealthOmics sequence retailer.

hyenaDNA_estimator = PyTorch(
    base_job_name=TRAINING_JOB_NAME,
    entry_point="train_hf_accelerate.py",
    source_dir="scripts/",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    image_uri=pytorch_image_uri,
    function=SAGEMAKER_EXECUTION_ROLE,
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    sagemaker_session=sagemaker_session,
    distribution={"torch_distributed": {"enabled": True}},
    tags=[{"Key": "project", "Value": "genomics-model-pretraining"}],
    keep_alive_period_in_seconds=1800,
    tensorboard_output_config=tensorboard_output_config,
)

with Run(
    experiment_name=EXPERIMENT_NAME,
    sagemaker_session=sagemaker_session,
) as run:
    hyenaDNA_estimator.match(
        {
            "knowledge": TrainingInput(
                s3_data=S3_DATA_URI, input_mode="File"
            ),
        },
        wait=True,
    )

Outcomes

In our coaching cycle for the mannequin, we processed a dataset consisting of 1 mouse genome with 10,000 entries. The computational assets included a cluster configured with one ml.g5.12xlarge occasion, which homes 4 Nvidia A10G GPUs. The 32k sequence size mannequin, was skilled utilizing a batch dimension of 4 per GPU (24 gigabit (Gb) of VRAM). With this setup we accomplished 150 epochs to report the outcomes under.

Analysis metrics: The analysis perplexity and loss graphs present a downward development on the outset, which then plateaus. The preliminary steep lower signifies that the mannequin quickly discovered from the coaching knowledge, enhancing its predictive efficiency. As coaching progressed, the speed of enchancment slowed, as evidenced by the plateau, which is typical within the later phases of coaching because the mannequin converges.

Coaching Metrics: Equally, the coaching perplexity and loss graphs point out an preliminary sharp enchancment adopted by a gradual plateau. This reveals that the mannequin successfully discovered from the info. The coaching loss’s slight fluctuations recommend that the mannequin continued to fine-tune its parameters in response to the inherent complexities within the coaching dataset.

Deployment

Upon the completion of coaching, we then deployed the mannequin on a SageMaker real-time endpoint. SageMaker real-time endpoints present an on-demand, scalable method to generate embeddings for genomic sequences.

In our SageMaker real-time endpoint setup, we have to regulate the default configurations to deal with massive payload sizes, particularly 32k context home windows for each requests and responses. As a result of the default payload dimension of 6.5 MB isn’t adequate, we’re rising it to a bit over 50 MB:

hyenaDNAModel = PyTorchModel(
    model_data=model_data,
    function=SAGEMAKER_EXECUTION_ROLE,
    image_uri=pytorch_deployment_uri,
    entry_point="inference.py",
    source_dir="scripts/",
    sagemaker_session=sagemaker_session,
    identify=endpoint_name,
    env = {
        'TS_MAX_RESPONSE_SIZE':'60000000',
        'TS_MAX_REQUEST_SIZE':'60000000',
    }
)

# deploy the endpoint endpoint
realtime_predictor = hyenaDNAModel.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    endpoint_name=endpoint_name,
    env=env,
)

By submitting a sequence to the endpoint, customers can shortly obtain the corresponding embeddings generated by HyenaDNA. These embeddings encapsulate the complicated patterns and relationships discovered throughout coaching, representing the genetic sequences in a kind that’s conducive to additional evaluation and predictive modeling. Right here is an instance of the way to invoke the mannequin.

import json
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

sample_genome_data = []
with open("./sample_mouse_data.json") as file:
    for line in file:
        sample_genome_data.append(json.hundreds(line))
len(sample_genome_data)

knowledge = [sample_genome_data[0]]
realtime_predictor.serializer = JSONSerializer()
realtime_predictor.deserializer = JSONDeserializer()
realtime_predictor.predict(knowledge=knowledge)

While you submit a pattern genomic sequence to the mannequin, it returns the embeddings of that sequence:

{'embeddings': [[-0.50390625, 0.447265625,-1.03125, 0.546875, 0.50390625, -0.53125, 0.59375, 0.71875, 0.349609375, -0.404296875, -4.8125, 0.84375, 0.359375, 1.2265625,………]]}

Conclusion

We’ve proven the way to pre-train a HyenaDNA mannequin with a 32k context window and to supply embeddings that can be utilized for downstream predictive duties. Utilizing the strategies proven right here it’s also possible to pre-train a HyenaDNA mannequin with context home windows of different sizes (for instance, 1 million tokens) and on different genomic knowledge (for instance, proprietary genomic sequence knowledge).

Pre-training genomic fashions on massive, various datasets is a foundational step in making ready them for downstream duties, similar to figuring out genetic variants linked to illnesses or predicting gene expression ranges. On this weblog submit, you’ve discovered how AWS facilitates this pre-training course of by offering a scalable and cost-efficient infrastructure by means of HealthOmics and SageMaker. Trying ahead, researchers can use these pre-trained fashions to fast-track their tasks, fine-tuning them with particular datasets to achieve deeper insights into genetic analysis.

To discover additional particulars and check out your hand at utilizing these assets, we invite you to go to our GitHub repository. Moreover, We encourage you to be taught extra by visiting the Amazon SageMaker documentation and the AWS HealthOmics documentation.

In regards to the authors

Shamika Ariyawansa, serving as a Senior AI/ML Options Architect within the World Healthcare and Life Sciences division at Amazon Net Companies (AWS), makes a speciality of Generative AI. He assists prospects in integrating Generative AI into their tasks, emphasizing the adoption of Massive Language Fashions (LLMs) for healthcare and life sciences domains with a concentrate on distributed coaching. Past his skilled commitments, Shamika passionately pursues snowboarding and off-roading adventures.

Simon Handley, PhD, is a Senior AI/ML Options Architect within the World Healthcare and Life Sciences group at Amazon Net Companies. He has greater than 25 years expertise in biotechnology and machine studying and is enthusiastic about serving to prospects clear up their machine studying and genomic challenges. In his spare time, he enjoys horseback using and enjoying ice hockey.