Common use cases of Transformers Library

By Hugging Face | Compiled by VK | Source: GitHub

This chapter describes the most common use cases of the Transformers library. The available models allow for many different configurations and are highly versatile. The simplest approaches are presented here, showing usage for tasks such as question answering, sequence classification, and named entity recognition.

These examples use the AutoModel classes, which instantiate a model according to a given checkpoint and automatically select the correct model architecture. For more information, see the AutoModel documentation. Feel free to modify the code to adapt it to your specific use case.

For a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These checkpoints are usually pre-trained on a large corpus of data and then fine-tuned on a specific task. This means that not all models have been fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can use the run_$TASK.py script in the examples directory.
Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use case and domain. As mentioned earlier, you can use the example scripts to fine-tune a model, or you can write your own training script.

To run inference on a task, the library provides two mechanisms:

Pipelines: very easy-to-use abstractions that require as little as two lines of code.
Using a model directly with a tokenizer (PyTorch/TensorFlow): full inference with the model. This mechanism is slightly more complex, but more powerful.

Both approaches are shown here.

Note that all tasks presented here use models that were pre-trained and then fine-tuned on a specific task. Loading a checkpoint that was not fine-tuned on a specific task loads only the base transformer layers and not the additional head used for the task, so the weights of that head are initialized randomly. This produces random output.

Sequence classification

Sequence classification is the task of classifying sequences according to a given set of classes. An example is the GLUE benchmark, which is entirely based on this task. If you want to fine-tune a model on a GLUE sequence classification task, you can use the run_glue.py or run_tf_glue.py script.

Here is an example of using a pipeline for sentiment analysis: identifying whether a sequence is positive or negative. It uses a model fine-tuned on SST-2, which is a GLUE task.

from transformers import pipeline

nlp = pipeline("sentiment-analysis")

print(nlp("I hate you"))
print(nlp("I love you"))

This returns a label ("POSITIVE" or "NEGATIVE") along with a score, as follows:

[{'label': 'NEGATIVE', 'score': 0.9991129}]
[{'label': 'POSITIVE', 'score': 0.99986565}]

The following is an example of sequence classification using a model to determine whether two sequences are paraphrases of each other. The process is as follows:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loaded with the weights stored in the checkpoint.
Build a sequence from the two sentences, with the correct model-specific separators, token type ids, and attention masks (encode() and encode_plus() take care of this).
Pass this sequence through the model so that it is classified into one of the two available classes: 0 (not a paraphrase) and 1 (is a paraphrase).
Compute the softmax of the result to get probabilities over the classes.
Print the results.

PyTorch code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

print("Should be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")

print("\nShould not be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(not_paraphrase_results[i] * 100)}%")

TensorFlow code

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

print("Should be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")

print("\nShould not be paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(not_paraphrase_results[i] * 100)}%")

This will output the following results:

Should be paraphrase
not paraphrase: 10%
is paraphrase: 90%

Should not be paraphrase
not paraphrase: 94%
is paraphrase: 6%
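
The softmax step is what turns the model's two raw logits into the percentages printed above. The following is a minimal, framework-free sketch of that step, assuming made-up logit values (the numbers in `logits` are hypothetical and are not actual model output):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps math.exp from overflowing on large scores.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
logits = [-1.1, 1.1]  # hypothetical logits, for illustration only

probabilities = softmax(logits)
for name, probability in zip(classes, probabilities):
    print(f"{name}: {round(probability * 100)}%")
```

The class with the larger logit always receives the larger probability; softmax only rescales the scores so that they can be read as percentages.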

Extractive question answering

Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on this task. If you want to fine-tune a model on a SQuAD task, you can use the run_squad.py script.

Here is an example of using a pipeline for question answering: extracting an answer from a text given a question. It uses a model fine-tuned on SQuAD.

from transformers import pipeline

nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))

This returns the answer extracted from the text, a confidence score, and the "start" and "end" values, which are the positions of the extracted answer in the text.

{'score': 0.622232091629833, 'start': 34, 'end': 96, 'answer': 'the task of extracting an answer from a text given a question.'}
{'score': 0.5115299158662765, 'start': 147, 'end': 161, 'answer': 'SQuAD dataset,'}
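
The `start` and `end` values are character offsets into the context string, so slicing the context with them recovers the answer. A small sketch, assuming a shortened one-sentence context and a hand-built `result` dictionary that only mimics the shape of the pipeline output (the offsets are computed here, not produced by a model):

```python
context = (
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)
answer = "the task of extracting an answer from a text given a question."

# Hand-built stand-in for one pipeline result (hypothetical values).
start = context.find(answer)
result = {"answer": answer, "start": start, "end": start + len(answer)}

# Slicing the context with the offsets gives back the answer text.
extracted = context[result["start"]:result["end"]]
print(result["start"], result["end"], extracted)
```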

Here is an example of question answering using a model and a tokenizer. The process is as follows:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loaded with the weights stored in the checkpoint.
Define a text and a few questions.
Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type ids, and attention masks. Pass this sequence through the model. This outputs a range of scores across the entire sequence of tokens (question and text), for both the start and end positions.
Find the most likely start and end positions by taking the argmax of the scores (a softmax over the scores would give the corresponding probabilities).
Fetch the tokens between the identified start and end positions and convert them to a string.
Print the results.

PyTorch code

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
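    # Index the outputs instead of unpacking them: this works whether the model
    # returns a plain tuple or a dict-like output object.
    outputs = model(**inputs)
    answer_start_scores, answer_end_scores = outputs[0], outputs[1]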

    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")

TensorFlow code

from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
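    # Index the outputs instead of unpacking them: this works whether the model
    # returns a plain tuple or a dict-like output object.
    outputs = model(inputs)
    answer_start_scores, answer_end_scores = outputs[0], outputs[1]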

    answer_start = tf.argmax(
        answer_start_scores, axis=1
    ).numpy()[0]  # Get the most likely beginning of answer with the argmax of the score
    answer_end = (
        tf.argmax(answer_end_scores, axis=1) + 1
    ).numpy()[0]  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")

This outputs each question followed by its predicted answer:

Question: How many pretrained models are available in Transformers?
Answer: over 32 +

Question: What does Transformers provide?
Answer: general - purpose architectures

Question: Transformers provides interoperability between which frameworks?
Answer: tensorflow 2 . 0 and pytorch
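
The span selection in the loop above comes down to two argmax operations and a slice. A framework-free sketch, assuming a toy token list and invented start and end scores (none of these values come from the model):

```python
tokens = ["[CLS]", "which", "frameworks", "?", "[SEP]",
          "tensorflow", "and", "pytorch", "[SEP]"]
# Hypothetical per-token scores for the answer start and end positions.
start_scores = [0.1, 0.0, 0.2, 0.0, 0.1, 4.5, 0.3, 1.2, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.2, 0.9, 0.4, 5.1, 0.1]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


answer_start = argmax(start_scores)
answer_end = argmax(end_scores) + 1  # +1 because slice ends are exclusive

answer = " ".join(tokens[answer_start:answer_end])
print(answer)
```

Taking the two argmaxes independently does not guarantee that the end position comes after the start position; a more careful implementation would search over valid (start, end) pairs.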

Language modeling

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling and GPT-2 with causal language modeling.

Language modeling is also useful outside of pre-training, for example to shift the model distribution to a specific domain: taking a language model trained on a very large corpus and then fine-tuning it on a news dataset or on scientific papers, e.g. lysandre/arxiv-nlp (https://huggingface.co/lysandre/arxiv-nlp).

Masked language modeling

Masked language modeling is the task of masking tokens in a sequence with a masking token and prompting the model to fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such training creates a strong basis for downstream tasks that require bidirectional context, such as SQuAD.

Here is an example of using a pipeline to replace a mask in a sequence:

from transformers import pipeline

nlp = pipeline("fill-mask")
print(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

This outputs the sequences with the mask filled in, the confidence score, and the id of the filled-in token in the tokenizer vocabulary:

[
    {'sequence': '<s> HuggingFace is creating a tool that the community uses to solve NLP tasks.</s>', 'score': 0.15627853572368622, 'token': 3944},
    {'sequence': '<s> HuggingFace is creating a framework that the community uses to solve NLP tasks.</s>', 'score': 0.11690319329500198, 'token': 7208},
    {'sequence': '<s> HuggingFace is creating a library that the community uses to solve NLP tasks.</s>', 'score': 0.058063216507434845, 'token': 5560},
    {'sequence': '<s> HuggingFace is creating a database that the community uses to solve NLP tasks.</s>', 'score': 0.04211743175983429, 'token': 8503},
    {'sequence': '<s> HuggingFace is creating a prototype that the community uses to solve NLP tasks.</s>', 'score': 0.024718601256608963, 'token': 17715}
]

The following is an example of masked language modeling using a model and a tokenizer. The process is as follows:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and loaded with the weights stored in the checkpoint.
Define a sequence with a masked token, placing tokenizer.mask_token instead of a word.
Encode the sequence into ids and find the position of the mask token in the list of ids.
Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and its values are the scores attributed to each token. The model gives a higher score to tokens it deems probable in that context.
Retrieve the top five tokens using the PyTorch topk or TensorFlow top_k method.
Replace the mask token with the predicted tokens and print the results.

PyTorch code

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]

token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

TensorFlow code

from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelWithLMHead.from_pretrained("distilbert-base-cased")

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

input = tokenizer.encode(sequence, return_tensors="tf")
mask_token_index = tf.where(input == tokenizer.mask_token_id)[0, 1]

token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

This prints five sequences, each with the mask replaced by one of the top five tokens predicted by the model:

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
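
Selecting the top tokens amounts to ranking the vocabulary by the scores at the mask position. A minimal sketch with a five-word toy vocabulary and invented scores (a real vocabulary has tens of thousands of entries, and `[MASK]` here stands in for `tokenizer.mask_token`):

```python
vocabulary = ["reduce", "increase", "banana", "offset", "the"]
# Hypothetical scores at the mask position, one per vocabulary entry.
mask_token_scores = [5.2, 3.9, -2.0, 3.1, 0.4]


def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


sequence = "Using them would help [MASK] our carbon footprint."
filled = [
    sequence.replace("[MASK]", vocabulary[i])
    for i in top_k(mask_token_scores, 3)
]
for sentence in filled:
    print(sentence)
```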

Causal language modeling

Causal language modeling is the task of predicting the token that follows a sequence of tokens. In this setting, the model attends only to the left context (tokens to the left of the position being predicted). Such training is particularly useful for generation tasks.
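
The difference between masked and causal language modeling can be pictured as a visibility table: which positions each token is allowed to look at. This is an illustrative sketch only; the function names `visibility` and `show` are hypothetical and not part of the library.

```python
def visibility(seq_len, causal):
    """Build a table where row i marks the positions token i can attend to."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


def show(table):
    """Print the table, using 'x' for visible and '.' for hidden positions."""
    for row in table:
        print(" ".join("x" if visible else "." for visible in row))


print("Masked language modeling (left and right context):")
show(visibility(4, causal=False))
print("Causal language modeling (left context only):")
show(visibility(4, causal=True))
```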

At the time of writing, there is no pipeline for causal language modeling or generation. Here is an example using a tokenizer and a model. The PyTorch code uses the generate() method to generate tokens following the initial sequence, while the TensorFlow code uses a simple loop.

PyTorch code

from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and is"

input = tokenizer.encode(sequence, return_tensors="pt")
generated = model.generate(input, max_length=50)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)

TensorFlow code

from transformers import TFAutoModelWithLMHead, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelWithLMHead.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and is"
generated = tokenizer.encode(sequence)

for i in range(50):
    predictions = model(tf.constant([generated]))[0]
    token = tf.argmax(predictions[0], axis=1)[-1].numpy()
    generated += [token]

resulting_string = tokenizer.decode(generated)
print(resulting_string)

This outputs a (hopefully) coherent string that continues the original sequence, since generate() samples from a top_p/top_k distribution:

Hugging Face is based in DUMBO, New York City, and is a live-action TV series based on the novel by John
Carpenter, and its producers, David Kustlin and Steve Pichar. The film is directed by!
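
The TensorFlow loop above is greedy decoding: at each step it appends the highest-scoring next token. The idea can be sketched without a model, assuming a hypothetical lookup table of next-token scores keyed only by the previous token (a real model conditions on the whole sequence, and the `<eos>` end marker is also an assumption of this sketch):

```python
# Hypothetical next-token scores; a stand-in for a language model.
next_token_scores = {
    "Hugging": {"Face": 3.0, "is": 0.1},
    "Face": {"is": 2.5, "Face": 0.2},
    "is": {"based": 2.0, "is": 0.1},
    "based": {"in": 2.2, "<eos>": 0.5},
    "in": {"New": 1.9, "<eos>": 1.0},
    "New": {"York": 2.8, "<eos>": 0.3},
    "York": {"<eos>": 3.0, "New": 0.1},
}


def greedy_generate(prompt, max_new_tokens):
    """Repeatedly append the highest-scoring next token."""
    generated = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores[generated[-1]]
        token = max(scores, key=scores.get)  # argmax over candidate tokens
        if token == "<eos>":
            break
        generated.append(token)
    return generated


print(" ".join(greedy_generate(["Hugging"], max_new_tokens=10)))
```

Greedy decoding is deterministic, whereas sampling from a top_p/top_k distribution can produce a different continuation on each run.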

Named entity recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example identifying a token as a person, an organization, or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on this task. If you want to fine-tune a model on an NER task, you can use the ner/run_ner.py (PyTorch), ner/run_pl_ner.py (PyTorch Lightning), or ner/run_tf_ner.py (TensorFlow) script.

The following is an example of using a pipeline for named entity recognition, trying to identify tokens as belonging to one of nine classes:

O, not a named entity
B-MISC, beginning of a miscellaneous entity
I-MISC, miscellaneous entity
B-PER, beginning of a person's name
I-PER, person's name
B-ORG, beginning of an organization
I-ORG, organization
B-LOC, beginning of a location
I-LOC, location

It uses a model fine-tuned on CoNLL-2003, fine-tuned by @stefan-it from dbmdz.

from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

This outputs a list of all the tokens identified as entities from the nine classes defined above. The expected results are:

[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

Notice how "Hugging Face" is identified as an organization, and how "New York City", "DUMBO" and "Manhattan Bridge" are identified as locations.

Here is an example of named entity recognition using a model and a tokenizer. The process is as follows:

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loaded with the weights stored in the checkpoint.
Define the list of labels the model was trained with.
Define a sequence with known entities, such as "Hugging Face" as an organization and "New York City" as a location.
Split the words into tokens so that they can be mapped to the predictions. A small hack is used: the sequence is first completely encoded and decoded, which leaves a string that contains the special tokens.
Encode the sequence into ids (special tokens are added automatically).
Retrieve the predictions by passing the input to the model and taking the first output. This gives a distribution over the nine possible classes for each token. Take the argmax to retrieve the most likely class for each token.
Zip each token with its prediction and print the result.

PyTorch code

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Not a named entity
    "B-MISC",  # The beginning of a miscellaneous entity
    "I-MISC",  # miscellaneous
    "B-PER",   # The beginning of a person's name
    "I-PER",   # Name
    "B-ORG",   # The beginning of an organization
    "I-ORG",   # organization
    "B-LOC",   # The beginning of a place
    "I-LOC"    # place
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

TensorFlow code

from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Not a named entity
    "B-MISC",  # The beginning of a miscellaneous entity
    "I-MISC",  # miscellaneous
    "B-PER",   # The beginning of a person's name
    "I-PER",   # Name
    "B-ORG",   # The beginning of an organization
    "I-ORG",   # organization
    "B-LOC",   # The beginning of a place
    "I-LOC"    # place
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="tf")

outputs = model(inputs)[0]
predictions = tf.argmax(outputs, axis=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])

This outputs a list of each token mapped to its prediction. Unlike the pipeline, every token has a prediction here, because tokens with the class "O" (no particular entity found) were not removed. The following array should be output:

[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
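
The token-level output can be merged back into whole entities by joining word pieces (tokens starting with `##`) and grouping consecutive tokens that share an entity type. A minimal sketch, using a shortened copy of the output above; `group_entities` is a hypothetical helper written for this illustration, not a library function:

```python
def group_entities(pairs):
    """Merge (token, label) pairs into (text, entity_type) spans."""
    entities = []
    words, current_type = [], None
    for token, label in pairs:
        token_type = None if label == "O" else label.split("-", 1)[1]
        if token.startswith("##") and words:
            # Word piece: glue it onto the previous token.
            words[-1] += token[2:]
            continue
        if token_type != current_type or label.startswith("B-"):
            # Entity boundary: store the span collected so far.
            if words:
                entities.append((" ".join(words), current_type))
            words, current_type = [], token_type
        if token_type is not None:
            words.append(token)
    if words:
        entities.append((" ".join(words), current_type))
    return entities


pairs = [
    ("[CLS]", "O"), ("Hu", "I-ORG"), ("##gging", "I-ORG"), ("Face", "I-ORG"),
    ("Inc", "I-ORG"), (".", "O"), ("is", "O"), ("based", "O"), ("in", "O"),
    ("New", "I-LOC"), ("York", "I-LOC"), ("City", "I-LOC"), (".", "O"),
    ("[SEP]", "O"),
]
print(group_entities(pairs))
```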

Original link: https://huggingface.co/transformers/usage.html

Posted by rel on Mon, 23 Mar 2020 23:54:18 -0700
