6. Text classification
Source:
Datawhale group learning, session 29: NLP with transformers
- erenup (more notes), Peking University, lead
- Zhang Fan, Datawhale, Tianjin University, Chapter 4
- Zhang Xian, Harbin Institute of Technology, Chapter 2
- Li Luoqiu, Zhejiang University, Chapter 3
- Cai Jie, Peking University, Chapter 4
- hlzhang, McGill University, Chapter 4
- Taiyunpeng, Chapter 2
- Zhang Hongxu, Chapter 2
Learning materials:
https://datawhalechina.github.io/learn-nlp-with-transformers/#/
GitHub repository:
https://github.com/datawhalechina/learn-nlp-with-transformers
1.1 Classification tasks
- The GLUE benchmark contains 9 sentence-level classification tasks:
- CoLA (Corpus of Linguistic Acceptability): determine whether a sentence is grammatically acceptable.
- MNLI (Multi-Genre Natural Language Inference): given a hypothesis, judge the relationship between another sentence and the hypothesis: entailment, contradiction, or neutral.
- MRPC (Microsoft Research Paraphrase Corpus): determine whether two sentences are paraphrases of each other.
- QNLI (Question-answering Natural Language Inference): determine whether the second sentence contains the answer to the question in the first sentence.
- QQP (Quora Question Pairs2): determine whether two questions have the same meaning.
- RTE (Recognizing Textual Entailment): determine whether a sentence entails a given hypothesis.
- SST-2 (Stanford Sentiment Treebank): determine whether the sentiment of a sentence is positive or negative.
- STS-B (Semantic Textual Similarity Benchmark): determine the similarity of two sentences (score from 1 to 5).
- WNLI (Winograd Natural Language Inference): determine whether a sentence with an anonymous pronoun and a sentence with the pronoun replaced are entailed.
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
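The later code snippets assume that a task, a model checkpoint, and a batch size have already been chosen. A minimal, illustrative setup (any task name from GLUE_TASKS and any checkpoint usable with a sequence-classification head would work):
# Illustrative choices for the variables used in the rest of this section
task = "cola"
model_checkpoint = "distilbert-base-uncased"  # any checkpoint compatible with AutoModelForSequenceClassification
batch_size = 16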
1.2 Loading the data
- Loading the data and the evaluation metric only requires the load_dataset and load_metric functions.
from datasets import load_dataset, load_metric
- Except for mnli-mm, every task can be loaded directly by its task name. The data is cached automatically after loading.
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)
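To quickly check what was loaded, you can inspect the returned object; this is just a sanity check, and the fields and split names shown depend on the task:
# dataset is a DatasetDict; look at the splits and one training example
print(dataset)
print(dataset["train"][0])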
- The evaluation metric is an instance of datasets.Metric:
metric
Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:
    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}
    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}
    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}
    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
""", stored examples: 0)
- Call the metric's compute method directly with predictions and references to obtain the metric value:
- The computation is done with datasets.Metric.compute(). Predictions and references can be passed directly as arguments, or added beforehand with datasets.Metric.add() or datasets.Metric.add_batch(). Some metrics accept additional arguments that modify their behavior; print(metric) or print(metric.inputs_description) shows the details of the expected inputs.
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)
Each text classification task uses a different metric, as follows:
- CoLA: Matthews Correlation Coefficient
- MNLI (matched or mismatched): Accuracy
- MRPC: Accuracy and F1 score
- QNLI: Accuracy
- QQP: Accuracy and F1 score
- RTE: Accuracy
- SST-2: Accuracy
- STS-B: Pearson Correlation Coefficient and Spearman's Rank Correlation Coefficient
- WNLI: Accuracy
So be sure to use the metric that matches the task.
1.3 Data preprocessing
- The preprocessing tool is the tokenizer. The tokenizer first tokenizes the input, then converts the tokens into the corresponding token IDs in the pretrained model's vocabulary, and finally into the input format required by the model.
- For data preprocessing, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:
- we get a tokenizer that corresponds one-to-one with the pretrained model;
- when using the tokenizer for the specified model checkpoint, we also download the vocabulary (the tokens vocabulary) required by the model.
The downloaded vocabulary is cached, so it will not be downloaded again the next time it is used.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
# use_fast=True requires the tokenizer to be of type transformers.PreTrainedTokenizerFast
- Because preprocessing uses some special features of the fast tokenizer (such as multi-threading), use_fast=True requires the tokenizer to be of type transformers.PreTrainedTokenizerFast. If the corresponding model does not have a fast tokenizer, remove this option.
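A quick way to confirm that a fast tokenizer was actually loaded is to check the is_fast attribute (a minimal sanity check):
# True when the loaded tokenizer is a transformers.PreTrainedTokenizerFast
print(tokenizer.is_fast)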
- The tokenizer can preprocess either a single text or a pair of texts. The output it produces satisfies the input format of the pretrained model:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
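To see what those IDs stand for, they can be mapped back to token strings; a small check (the exact tokens and special tokens, e.g. [CLS]/[SEP] for BERT-style checkpoints, depend on the vocabulary of the chosen model):
# Map the produced input_ids back to tokens to see how the two sentences were joined
encoded = tokenizer("Hello, this one sentence!", "And this sentence goes with it.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))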
- To preprocess our data, we need to know which text fields each task uses, so we define the following dict:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}
# Data format check
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")
Sentence: Our friends won't buy this analysis, let alone the next one we propose.
# Put the preprocessing code into a function
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)
- The preprocessing function can process a single sample or multiple samples. If the input contains multiple samples, it returns a list for each key:
preprocess_function(dataset['train'][:5])
{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
- Preprocess all samples in the dataset by using the map function to apply the preprocessing function to every sample:
encoded_dataset = dataset.map(preprocess_function, batched=True)
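To verify that the tokenizer outputs were added as new columns, the column names can be listed (a sanity check; the exact names depend on the task and tokenizer):
# The mapped dataset keeps the original columns and adds e.g. input_ids and attention_mask
print(encoded_dataset["train"].column_names)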
1.4 Fine-tuning the pretrained model
- Use the AutoModelForSequenceClassification class. Similar to the tokenizer, the from_pretrained method can download and load the model for us.
STS-B is a regression problem and MNLI is a 3-class classification problem; the other tasks are binary classification:
# The model is cached
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
- To build a Trainer, we still need three elements, the most important being the training configuration TrainingArguments, which contains all the attributes that define the training process.
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)
- Define a function that computes the evaluation metric according to the task name, and pass it to the Trainer:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)

validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
- Start training:
trainer.train()
# Evaluation
trainer.evaluate()
1.5 Hyperparameter search
- The Trainer also supports hyperparameter search, using the optuna or Ray Tune libraries.
- During hyperparameter search, the Trainer trains multiple models, so instead of a fixed model it needs a function that returns a freshly initialized model (passed as model_init), which it calls to re-initialize the model for every trial, as sketched below:
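A minimal sketch of such a Trainer, reusing the model_checkpoint, num_labels, args, datasets, and compute_metrics defined earlier:
def model_init():
    # Return a freshly initialized model; the Trainer calls this for every hyperparameter trial
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

trainer = Trainer(
    model_init=model_init,  # passed instead of a fixed model
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)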
- Call the hyperparameter_search method. Note that this process may take a long time; you can first run the search on part of the dataset and then do the full training afterwards, for example using 1/10 of the data for the search (see the sketch below):
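One possible way to restrict the search to roughly 1/10 of the training data is to shard the training set before searching (an illustrative sketch; restore the full training set before the final training run):
# Keep only one of ten shards of the training data while searching for hyperparameters
trainer.train_dataset = encoded_dataset["train"].shard(index=0, num_shards=10)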
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
hyperparameter_search returns the hyperparameters of the best run:
best_run
- Set the trainer's arguments to the best hyperparameters found and train:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()