Construct a Hugging Face textual content classification mannequin in Amazon SageMaker JumpStart

Amazon SageMaker JumpStart supplies a set of built-in algorithms, pre-trained fashions, and pre-built resolution templates to assist knowledge scientists and machine studying (ML) practitioners get began on coaching and deploying ML fashions rapidly. You should use these algorithms and fashions for each supervised and unsupervised studying. They’ll course of varied varieties of enter knowledge, together with picture, textual content, and tabular.

This submit introduces utilizing the textual content classification and fill-mask fashions obtainable on Hugging Face in SageMaker JumpStart for textual content classification on a customized dataset. We additionally show performing real-time and batch inference for these fashions. This supervised studying algorithm helps switch studying for all pre-trained fashions obtainable on Hugging Face. It takes a bit of textual content as enter and outputs the chance for every of the category labels. You possibly can fine-tune these pre-trained fashions utilizing switch studying even when a big corpus of textual content isn’t obtainable. It’s obtainable within the SageMaker JumpStart UI in Amazon SageMaker Studio. You too can use it by means of the SageMaker Python SDK, as demonstrated within the instance pocket book Introduction to SageMaker HuggingFace – Textual content Classification.

Answer overview

Textual content classification with Hugging Face in SageMaker supplies switch studying on all pre-trained fashions obtainable on Hugging Face. In accordance with the variety of class labels within the coaching knowledge, a classification layer is connected to the pre-trained Hugging Face mannequin. Then both the entire community, together with the pre-trained mannequin, or solely the highest classification layer will be fine-tuned on the customized coaching knowledge. On this switch studying mode, coaching will be achieved even with a smaller dataset.

On this submit, we show methods to do the next:

Use the brand new Hugging Face textual content classification algorithm
Carry out inference with the Hugging Face textual content classification algorithm
Fantastic-tune the pre-trained mannequin on a customized dataset
Carry out batch inference with the Hugging Face textual content classification algorithm

Conditions

Earlier than you run the pocket book, it’s essential to full some preliminary setup steps. Let’s arrange the SageMaker execution function so it has permissions to run AWS companies in your behalf:

!pip set up sagemaker --upgrade --quiet

import sagemaker, boto3, json
from sagemaker.session import Session
sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

Run inference on the pre-trained mannequin

SageMaker JumpStart assist inference for any textual content classification mannequin obtainable by means of Hugging Face. The mannequin will be hosted for inference and assist textual content as the appliance/x-text content material sort. This is not going to solely permit you to use a set of pre-trained fashions, but in addition allow you to decide on different classification duties.

The output incorporates the chance values, class labels for all courses, and the expected label similar to the category index with the very best chance encoded in JSON format. The mannequin processes a single string per request and outputs just one line. The next is an instance of a JSON format response:

settle for: software/json;verbose
{"chances": [prob_0, prob_1, prob_2, ...],
"labels": [label_0, label_1, label_2, ...],
"predicted_label": predicted_label}

If settle for is about to software/json, then the mannequin solely outputs chances. For extra particulars on coaching and inference, see the pattern pocket book.

You possibly can run inference on the textual content classification mannequin by passing the model_id within the setting variable whereas creating the article of the Mannequin class. See the next code:

from sagemaker.jumpstart.mannequin import JumpStartModel

hub = {}
HF_MODEL_ID = 'distilbert-base-uncased-finetuned-sst-2-english' # Cross another HF_MODEL_ID from - https://huggingface.co/fashions?pipeline_tag=text-classification&kind=downloads
hub['HF_MODEL_ID'] = HF_MODEL_ID
hub['HF_TASK'] = 'text-classification'

mannequin = JumpStartModel(model_id=infer_model_id, env =hub, enable_network_isolation=False

Fantastic-tune the pre-trained mannequin on a customized dataset

You possibly can fine-tune every of the pre-trained fill-mask or textual content classification fashions to any given dataset made up of textual content sentences with any variety of courses. The pretrained mannequin attaches a classification layer to the textual content embedding mannequin and initializes the layer parameters to random values. The output dimension of the classification layer is decided primarily based on the variety of courses detected within the enter knowledge. The target is to attenuate classification errors on the enter knowledge. Then you’ll be able to deploy the fine-tuned mannequin for inference.

The next are the directions for the way the coaching knowledge must be formatted for enter to the mannequin:

Enter – A listing containing a knowledge.csv file. Every row of the primary column ought to have an integer class label between 0 and the variety of courses. Every row of the second column ought to have the corresponding textual content knowledge.
Output – A fine-tuned mannequin that may be deployed for inference or additional skilled utilizing incremental coaching.

The next is an instance of an enter CSV file. The file should have no header. The file must be hosted in an Amazon Easy Storage Service (Amazon S3) bucket with a path much like the next: s3://bucket_name/input_directory/. The trailing / is required.

|0 |cover new secretions from the parental items|
|0 |incorporates no wit , solely labored gags|
|1 |that loves its characters and communicates one thing quite lovely about human nature|
|...|...|

The algorithm additionally helps switch studying for Hugging Face pre-trained fashions. Every mannequin is recognized by a novel model_id. The next instance reveals methods to fine-tune a BERT base mannequin recognized by model_id=huggingface-tc-bert-base-cased on a customized coaching dataset. The pre-trained mannequin tarballs have been pre-downloaded from Hugging Face and saved with the suitable mannequin signature in S3 buckets, such that the coaching job runs in community isolation.

For switch studying in your customized dataset, you may want to vary the default values of the coaching hyperparameters. You possibly can fetch a Python dictionary of those hyperparameters with their default values by calling hyperparameters.retrieve_default, replace them as wanted, after which go them to the Estimator class. The hyperparameter Train_only_top_layer defines which mannequin parameters change throughout the fine-tuning course of. If train_only_top_layer is True, parameters of the classification layers change and the remainder of the parameters stay fixed throughout the fine-tuning course of. If train_only_top_layer is False, all parameters of the mannequin are fine-tuned. See the next code:

from sagemaker import hyperparameters# Retrieve the default hyper-parameters for fine-tuning the mannequin
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)# [Optional] Override default hyperparameters with customized values
hyperparameters["epochs"] = "5"

For this use case, we offer SST2 as a default dataset for fine-tuning the fashions. The dataset incorporates optimistic and unfavourable film critiques. It has been downloaded from TensorFlow beneath the Apache 2.0 License. The next code supplies the default coaching dataset hosted in S3 buckets:

# Pattern coaching knowledge is accessible on this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

We create an Estimator object by offering the model_id and hyperparameters values as follows:

# Create SageMaker Estimator occasion
tc_estimator = JumpStartEstimator(
hyperparameters=hyperparameters,
model_id=dropdown.worth,
instance_type=training_instance_type,
metric_definitions=training_metric_definitions,
output_path=s3_output_location,
enable_network_isolation=False if model_id == "huggingface-tc-models" else True
)

To launch the SageMaker coaching job for fine-tuning the mannequin, name .match on the article of the Estimator class, whereas passing the S3 location of the coaching dataset:

# Launch a SageMaker Coaching job by passing s3 path of the coaching knowledge
tc_estimator.match({"coaching": training_dataset_s3_path}, logs=True)

You possibly can view efficiency metrics comparable to coaching loss and validation accuracy/loss by means of Amazon CloudWatch whereas coaching. You too can fetch these metrics and analyze them utilizing TrainingJobAnalytics:

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe() #It'll produce a dataframe with completely different metrics
df.head(10)

The next graph reveals completely different metrics collected from the CloudWatch log utilizing TrainingJobAnalytics.

For extra details about methods to use the brand new SageMaker Hugging Face textual content classification algorithm for switch studying on a customized dataset, deploy the fine-tuned mannequin, run inference on the deployed mannequin, and deploy the pre-trained mannequin as is with out first fine-tuning on a customized dataset, see the next instance pocket book.

Fantastic-tune any Hugging Face fill-mask or textual content classification mannequin

SageMaker JumpStart helps the fine-tuning of any pre-trained fill-mask or textual content classification Hugging Face mannequin. You possibly can obtain the required mannequin from the Hugging Face hub and carry out the fine-tuning. To make use of these fashions, the model_id is offered within the hyperparameters as hub_key. See the next code:

HF_MODEL_ID = "distilbert-base-uncased" # Specify the HF_MODEL_ID right here from https://huggingface.co/fashions?pipeline_tag=fill-mask&kind=downloads or https://huggingface.co/fashions?pipeline_tag=text-classification&kind=downloads
hyperparameters["hub_key"] = HF_MODEL_ID

Now you’ll be able to assemble an object of the Estimator class by passing the up to date hyperparameters. You name .match on the article of the Estimator class whereas passing the S3 location of the coaching dataset to carry out the SageMaker coaching job for fine-tuning the mannequin.

Fantastic-tune a mannequin with automated mannequin tuning

SageMaker automated mannequin tuning (ATM), also called hyperparameter tuning, finds the very best model of a mannequin by operating many coaching jobs in your dataset utilizing the algorithm and ranges of hyperparameters that you just specify. It then chooses the hyperparameter values that lead to a mannequin that performs the very best, as measured by a metric that you just select. Within the following code, you employ a HyperparameterTuner object to work together with SageMaker hyperparameter tuning APIs:

from sagemaker.tuner import ContinuousParameter
# Outline goal metric primarily based on which the very best mannequin shall be chosen.
amt_metric_definitions = {
"metrics": [{"Name": "val_accuracy", "Regex": "'eval_accuracy': ([0-9.]+)"}],
"sort": "Maximize",
}
# You possibly can choose from the hyperparameters supported by the mannequin, and configure ranges of values to be looked for coaching the optimum mannequin.(https://docs.aws.amazon.com/sagemaker/newest/dg/automatic-model-tuning-define-ranges.html)
hyperparameter_ranges = {
"learning_rate": ContinuousParameter(0.00001, 0.0001, scaling_type="Logarithmic")
}
# Improve the whole variety of coaching jobs run by AMT, for elevated accuracy (and coaching time).
max_jobs = 6
# Change parallel coaching jobs run by AMT to cut back complete coaching time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 2

After you’ve got outlined the arguments for the HyperparameterTuner object, you go it the Estimator and begin the coaching. It will discover the best-performing mannequin.

Carry out batch inference with the Hugging Face textual content classification algorithm

If the purpose of inference is to generate predictions from a skilled mannequin on a big dataset the place minimizing latency isn’t a priority, then the batch inference performance could also be most easy, extra scalable, and extra acceptable.

Batch inference is beneficial within the following situations:

Preprocess datasets to take away noise or bias that interferes with coaching or inference out of your dataset
Get inference from massive datasets
Run inference once you don’t want a persistent endpoint
Affiliate enter information with inferences to help the interpretation of outcomes

For operating batch inference on this use case, you first obtain the SST2 dataset domestically. Take away the category label from it and add it to Amazon S3 for batch inference. You create the article of Mannequin class with out offering the endpoint and create the batch transformer object from it. You employ this object to supply batch predictions on the enter knowledge. See the next code:

batch_transformer = mannequin.transformer(
instance_count=1,
instance_type=inference_instance_type,
output_path=output_path,
assemble_with="Line",
settle for="textual content/csv"
)

batch_transformer.remodel(
input_path, content_type="textual content/csv", split_type="Line"
)

batch_transformer.wait()

After you run batch inference, you’ll be able to examine the predication accuracy on the SST2 dataset.

Conclusion

On this submit, we mentioned the SageMaker Hugging Face textual content classification algorithm. We offered instance code to carry out switch studying on a customized dataset utilizing a pre-trained mannequin in community isolation utilizing this algorithm. We additionally offered the performance to make use of any Hugging Face fill-mask or textual content classification mannequin for inference and switch studying. Lastly, we used batch inference to run inference on massive datasets. For extra data, try the instance pocket book.

In regards to the authors

Hemant Singh is an Utilized Scientist with expertise in Amazon SageMaker JumpStart. He obtained his grasp’s from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has expertise in engaged on a various vary of machine studying issues inside the area of pure language processing, pc imaginative and prescient, and time collection evaluation.

Rachna Chadha is a Principal Options Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that the moral and accountable use of AI can enhance society sooner or later and convey financial and social prosperity. In her spare time, Rachna likes spending time along with her household, mountain climbing, and listening to music.

Dr. Ashish Khetan is a Senior Utilized Scientist with Amazon SageMaker built-in algorithms and helps develop machine studying algorithms. He obtained his PhD from College of Illinois Urbana-Champaign. He’s an energetic researcher in machine studying and statistical inference, and has revealed many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.