[PaddleNLP] Qianyan dataset: sentiment analysis with SKEP

Keywords: NLP paddlepaddle

[PaddleNLP] Qianyan dataset: sentiment analysis with SKEP

This project uses the pre-trained SKEP model to complete the "Qianyan dataset: sentiment analysis" competition.
It covers three subtasks: sentence-level sentiment classification, aspect-level (evaluation target) sentiment classification, and evaluation object extraction.
Likes, forks, and follows are welcome!

Major update!!!

2021/6/24   V1.5   NEW 👉 Ensemble of Ernie_gram and SKEP prediction results (evaluation object extraction task) 👈   remarkable improvement!!!

If you need the Ernie_gram opinion extraction tutorial, please refer to Jiang Liu: [PaddleNLP] Qianyan dataset: sentiment analysis.
Don't forget to star and fork that project too!!!

Preface

Hello everyone!
This competition is the final assignment of the NLP check-in camp. The code below implements the sentiment analysis tasks using what was learned in that course.




Previous results:





I am just getting started and have no machine learning background, so there is still a lot of room for better tuning and data processing.

I strongly recommend referring to the official PaddleNLP example source code.

Found a good tool but have not used it yet: VisualDL 2.2, which supports hyperparameter visualization.
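As a rough sketch of how that could be wired in (untested in this project; the metric name, values, and log directory below are placeholders), VisualDL 2.2 exposes an HParams interface through LogWriter:

# untested sketch: log this run's hyperparameters and a tracked dev metric to VisualDL 2.2
from visualdl import LogWriter

with LogWriter(logdir="./visualdl/run1") as writer:
    # register the hyperparameters of this run and the metric tags to track
    writer.add_hparams(
        hparams_dict={"learning_rate": 2e-5, "batch_size": 32, "max_seq_length": 256},
        metrics_list=["hparam/dev_accuracy"])
    # during training, log the tracked metric under the same tag
    writer.add_scalar(tag="hparam/dev_accuracy", value=0.93, step=100)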

Dataset preparation

Place all dataset zip archives under the work folder.

# Unzip the datasets into the data folder. Note: this must be run every time the environment is opened; it is not needed after a simple restart
!unzip -q work/ChnSentiCorp.zip -d data
!unzip -q work/NLPCC14-SC.zip -d data
!unzip -q work/SE-ABSA16_CAME.zip -d data
!unzip -q work/SE-ABSA16_PHNS.zip -d data
!unzip -q work/COTE-BD.zip -d data
!unzip -q work/COTE-DP.zip -d data
!unzip -q work/COTE-MFW.zip -d data
# Update paddlenlp
!pip install --upgrade paddlenlp -i https://pypi.org/simple 

1, Sentence-level sentiment analysis

Classify the sentiment polarity of a given text; this is commonly used in scenarios such as movie review analysis and online forum opinion monitoring.

Data loading – sentence-level data

Contains:

  • load_ds: loads the train/dev sets; a dev set can optionally be split off from the train set
  • load_test: loads the test set
import os
import random
from paddlenlp.datasets import MapDataset

# for train and dev sets
def load_ds(datafiles, split_train=False, dev_size=0):
    '''
    input:
        datafiles -- str or list[str] -- the path of the train or dev sets
        split_train -- bool -- whether to split a dev set from the train set
        dev_size -- int -- how many examples to split from the train set

    output:
        MapDataset
    '''

    datas = []

    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            next(fp)  # Skip header
            for line in fp.readlines():
                data = line[:-1].split('\t')
                if len(data)==2:
                    yield ({'text':data[1], 'label':int(data[0])})
                elif len(data)==3:
                    yield ({'text':data[2], 'label':int(data[1])})
    
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)
    
    # Cut a part from the train to dev
    def spilt_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                datas.append(line)
        datas_tmp=datas[1:] # title line should not shuffle
        random.shuffle(datas_tmp) 
        if not os.path.exists(os.path.dirname(train_ds)+'/tem'):
            os.mkdir(os.path.dirname(train_ds)+'/tem')
        # remember the title line
        write_tsv(os.path.dirname(train_ds)+'/tem/train.tsv', datas[0:1]+datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds)+'/tem/dev.tsv', datas[0:1]+datas_tmp[-dev_size:])
        

    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train, make sure that \'datafiles\' is a train set path str.")
            return None
        if dev_size == 0:
            print("Please set size of dev set, as dev_size=...")
            return None
        spilt_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles)+'/tem/train.tsv', os.path.dirname(datafiles)+'/tem/dev.tsv']
    
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]



def load_test(datafile):
    '''
    input:
        datafile -- str -- the path of the test set

    output:
        MapDataset
    '''
    
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                if i==0:
                    continue
                data = line[:-1].split('\t')
                yield {'text':data[1], 'label':'', 'qid':data[0]}

    return MapDataset(list(read(datafile)))

1. ChnSentiCorp

Only load one dataset at a time!!!
Hyperparameters used for this run: 32 (batch_size), 256 (max_seq_length), 2e-5 (learning_rate), 6 (epochs).

# You can directly use the official load_dataset, which is more convenient
# Here, part of the train set is split off as a dev set in a unified format, so the load functions defined above are used instead

# from paddlenlp.datasets import load_dataset
# train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
train_ds, dev_ds= load_ds(datafiles=['./data/ChnSentiCorp/train.tsv', './data/ChnSentiCorp/dev.tsv'])
print(train_ds[0])
print(dev_ds[0])
print(type(train_ds[0]))

test_ds = load_test(datafile='./data/ChnSentiCorp/test.tsv')
print(test_ds[0])

2. NLPCC14-SC

train_ds, dev_ds = load_ds(datafiles='./data/NLPCC14-SC/train.tsv', split_train=True, dev_size=1000)
print(train_ds[0])
print(dev_ds[0])
test_ds = load_test(datafile='./data/NLPCC14-SC/test.tsv')
print(test_ds[0])

SKEP model construction

PaddleNLP has implemented the SKEP pre training model, and SKEP loading can be realized through one line of code.

The sentence-level sentiment analysis model is a standard SKEP fine-tuning setup for text classification: SKEP first extracts the semantic features of the sentence, and a classification layer then classifies those features.

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# num_classes: two categories, 0 and 1, indicate negative and positive
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=2)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")

SkepForSequenceClassification can be used for both sentence-level and target-level sentiment analysis tasks. It obtains a representation of the input text from the pre-trained SKEP model and then classifies that representation.

  • pretrained_model_name_or_path: model name. "skep_ernie_1.0_large_ch" and "skep_ernie_2.0_large_en" are supported.

    • "skep_ernie_1.0_large_ch": a Chinese SKEP model obtained by continuing pre-training on massive Chinese data, starting from the pre-trained ernie_1.0_large_ch model;
    • "skep_ernie_2.0_large_en": an English SKEP model obtained by continuing pre-training on massive English data, starting from the pre-trained ernie_2.0_large_en model;
  • num_classes: number of classification categories for the dataset.

For details about SKEP model implementation, refer to: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep
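As a quick sanity check (a minimal sketch using the model and tokenizer loaded above; the example sentence is made up), a single forward pass returns logits with shape [batch_size, num_classes]:

import paddle

sample = tokenizer(text="这家酒店的环境很好", max_seq_len=128)  # illustrative sentence only
input_ids = paddle.to_tensor([sample["input_ids"]])
token_type_ids = paddle.to_tensor([sample["token_type_ids"]])
logits = model(input_ids, token_type_ids)
print(logits.shape)  # [1, 2] -> negative / positive scores; apply softmax for probabilities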

import os
from functools import partial


import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has one.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid


batch_size = 32
max_seq_length = 256
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Model training and evaluation (store the results of each data set separately!!!)

After defining the loss function, optimizer and evaluation index, you can start training.

Recommended hyperparameter settings:

  • max_seq_length=256
  • batch_size=32
  • learning_rate=2e-5
  • epochs=10

In practice, batch_size and max_seq_length can be adjusted according to the available GPU memory (a commented learning-rate scheduler sketch is included in the optimizer code below).

import time

from utils import evaluate


epochs = 10
ckpt_dir = "skep_sentence"
num_training_steps = len(train_data_loader) * epochs

# optimizer = paddle.optimizer.AdamW(
#     learning_rate=2e-5,
#     parameters=model.parameters())

decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-6,
    parameters=model.parameters(),
    weight_decay=0.01, # test weight_decay
    apply_decay_param_fun=lambda x: x in decay_params # test weight_decay
    )
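# Optional, untested in this project: PaddleNLP examples often pair AdamW with a linear
# warmup + decay schedule; the num_training_steps computed above would then be used like this:
# from paddlenlp.transformers import LinearDecayWithWarmup
# lr_scheduler = LinearDecayWithWarmup(2e-5, num_training_steps, 0.1)  # 10% warmup steps
# optimizer = paddle.optimizer.AdamW(
#     learning_rate=lr_scheduler,
#     parameters=model.parameters(),
#     weight_decay=0.01,
#     apply_decay_param_fun=lambda x: x in decay_params)
# lr_scheduler.step() would then be called right after optimizer.step() in the loop below.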

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader) 
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)

Prediction results (remember to change the model checkpoint path)

The trained model can then be used to predict the sentiment of new text.

import numpy as np
import paddle

batch_size=24

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack() # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# Change the loaded parameter path according to the actual operation
params_path = 'skep_sentence2t_weight_2e-6/model_1800_83700/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

label_map = {0: '0', 1: '1'}
results = []
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    logits = model(input_ids, token_type_ids)
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend(zip(qids, labels))

res_dir = "./results/2_weight_2e-6/1800_83700"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)

with open(os.path.join(res_dir, "NLPCC14-SC.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for qid, label in results:
        f.write(str(qid[0])+"\t"+label+"\n")

2, Aspect-level (target-level) sentiment analysis

In e-commerce product review scenarios, besides analyzing the overall sentiment toward a product, it is often necessary to go finer-grained and treat specific "aspects" of the product as the subject of sentiment analysis (aspect-level sentiment analysis), for example:

  • These potato chips taste a little salty and too spicy, but the texture is very crisp.

There is a negative evaluation of the flavor of the chips (salty, too spicy), but a positive evaluation of the texture (very crisp).

  • I like Hawaii very much, but the seafood here is too expensive.

There is a positive evaluation of Hawaii (like it very much), but a negative evaluation of Hawaiian seafood (too expensive).
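Concretely, after loading, each aspect-level example pairs the evaluation target with the review text, roughly like the sketch below (the field values are made up for illustration, not taken from the real SE-ABSA16 files):

# hypothetical aspect-level example in the format produced by load_ds below
example = {
    'text': 'phone#design_features',                       # the aspect being evaluated
    'text_pair': '这款手机的外观设计非常漂亮，拿在手里很有质感。',  # the review text
    'label': 1                                              # 1 = positive, 0 = negative
}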

Data loading – aspect-level datasets

Similar to the sentence-level approach.

import os
import random

from paddlenlp.datasets import MapDataset


def load_ds(datafiles, split_train=False, dev_size=0):
    datas = []
    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            # if fp.readline().split('\t')[0] == 'label':
            next(fp)  # Skip header
            for line in fp.readlines():
                data = line[:-1].split('\t')
                yield ({'text':data[1], 'text_pair':data[2], 'label':int(data[0])})
    
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)

    def spilt_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                datas.append(line)
        datas_tmp=datas[1:] # title line should not shuffle
        random.shuffle(datas_tmp) 
        if not os.path.exists(os.path.dirname(train_ds)+'/tem'):
            os.mkdir(os.path.dirname(train_ds)+'/tem')
        # remember the title line
        write_tsv(os.path.dirname(train_ds)+'/tem/train.tsv', datas[0:1]+datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds)+'/tem/dev.tsv', datas[0:1]+datas_tmp[-dev_size:])
        

    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train, make sure that \'datafiles\' is a train set path str.")
            return None
        if dev_size == 0:
            print("Please set size of dev set, as dev_size=...")
            return None
        spilt_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles)+'/tem/train.tsv', os.path.dirname(datafiles)+'/tem/dev.tsv']
    
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

def load_test(datafile):
    
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                if i==0:
                    continue
                data = line[:-1].split('\t')
                yield {'text':data[1], 'text_pair':data[2]}

    return MapDataset(list(read(datafile)))

3. SE-ABSA16_PHNS

Only load one dataset at a time!!!

train_ds, dev_ds = load_ds(datafiles='./data/SE-ABSA16_PHNS/train.tsv', split_train=True, dev_size=100)
print(train_ds[0])
print(dev_ds[0])

test_ds = load_test(datafile='./data/SE-ABSA16_PHNS/test.tsv')
print(test_ds[0])

4. SE-ABSA16_CAME

train_ds, dev_ds = load_ds(datafiles='./data/SE-ABSA16_CAME/train.tsv', split_train=True, dev_size=100)
print(train_ds[0])
print(dev_ds[0])

test_ds = load_test(datafile='./data/SE-ABSA16_CAME/test.tsv')
print(test_ds[0])

SKEP model construction

The aspect-level sentiment analysis model also uses SkepForSequenceClassification, but its input is not a single sentence; it is a sentence pair: one sentence describes the "aspect of the evaluation target" and the other contains the "comment on that aspect".

from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# num_classes: two categories, 0 and 1, indicating negative and positive
model = SkepForSequenceClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=2)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
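# Illustration only (the strings below are made up): for a sentence pair, the tokenizer's
# token_type_ids mark the aspect segment with 0s and the comment segment with 1s.
# pair = tokenizer(text="手机#外观", text_pair="这款手机的外观非常漂亮", max_seq_len=128)
# print(pair["token_type_ids"])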
from functools import partial
import os
import time

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader


def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    
    note: token type ids are not needed for the skep_roberta_large_ch model.


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has one.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
        dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids


max_seq_length=512
batch_size=16
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

Model training and evaluation

After defining the loss function, optimizer and evaluation index, you can start training.

from utils import evaluate


epochs = 12
num_training_steps = len(train_data_loader) * epochs

# optimizer = paddle.optimizer.AdamW(
#     learning_rate=2e-6,
#     parameters=model.parameters())

decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-6,
    parameters=model.parameters(),
    weight_decay=0.01, # test weight_decay
    apply_decay_param_fun=lambda x: x in decay_params # test weight_decay
    )

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

ckpt_dir = "skep_aspect"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 50 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # The evaluate() here is reused directly from the sentence-level task and may not be entirely appropriate
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)

Prediction results

The trained model can then be used to predict the sentiment toward the evaluation target.

@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results

label_map = {0: '0', 1: '1'}
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]

batch_size=16

test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# Change the loaded parameter path according to the actual operation
params_path = 'skep_aspect4_weight_1e-6/model_500_73000/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

results = predict(model, test_data_loader, label_map)

res_dir = "./results/4_weight_1e-6/500_73000"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)

with open(os.path.join(res_dir, "SE-ABSA16_CAME.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, label in enumerate(results):
        f.write(str(idx)+"\t"+label+"\n")

3, Evaluation object extraction

Data loading – opinion target datasets

Each training line contains an entity (the evaluation object) and the text it appears in. The read() function below converts this into a character-level sequence labeling example: characters covered by the entity are tagged B (begin) / I (inside), all other characters are tagged O (outside), and malformed lines where the entity cannot be found in the text are dropped.
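For intuition, here is a minimal sketch of the BIO conversion performed in read() below (the entity and sentence are made up):

# made-up illustration of the character-level BIO labels built in read()
entity = "唐人街"
text = "唐人街的小吃很好吃"
start_idx = text.index(entity)
labels = [2] * len(text)                      # 2 = O (outside)
labels[start_idx] = 0                         # 0 = B (begin of the entity)
for idx in range(start_idx + 1, start_idx + len(entity)):
    labels[idx] = 1                           # 1 = I (inside the entity)
print(labels)                                 # [0, 1, 1, 2, 2, 2, 2, 2, 2]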

import os
import random
from paddlenlp.datasets import MapDataset


def load_ds(datafiles, split_train=False, dev_size=0):
    datas = []

    def read(ds_file):
        with open(ds_file, 'r', encoding='utf-8') as fp:
            # if fp.readline().split('\t')[0] == 'label':
            next(fp)  # Skip header
            for line in fp.readlines():
                # print('1\n')
                line_stripped = line.strip().split('\t')
                if not line_stripped:
                    continue

                # some lines are malformed (the entity and text are not on the same line); drop them
                try:
                    example = [line_stripped[indice] for indice in (0,1)]
                    entity, text = example[0], example[1]
                    start_idx = text.index(entity)
                except (IndexError, ValueError):
                    # drop the dirty data
                    continue

                labels = [2] * len(text)
                labels[start_idx] = 0
                for idx in range(start_idx + 1, start_idx + len(entity)):
                        labels[idx] = 1 
                yield {
                        "tokens": list(text),
                        "labels": labels,
                        "entity": entity
                    }
    
    def write_tsv(tsv, datas):
        with open(tsv, mode='w', encoding='UTF-8') as f:
            for line in datas:
                f.write(line)


    def spilt_train4dev(train_ds, dev_size):
        with open(train_ds, 'r', encoding='UTF-8') as f:
            for i, line in enumerate(f):
                datas.append(line)
        datas_tmp=datas[1:] # title line should not shuffle
        random.shuffle(datas_tmp) 
        if not os.path.exists(os.path.dirname(train_ds)+'/tem'):
            os.mkdir(os.path.dirname(train_ds)+'/tem')
        # remember the title line
        write_tsv(os.path.dirname(train_ds)+'/tem/train.tsv', datas[0:1]+datas_tmp[:-dev_size])
        write_tsv(os.path.dirname(train_ds)+'/tem/dev.tsv', datas[0:1]+datas_tmp[-dev_size:])
        
        
    if split_train:
        if not isinstance(datafiles, str):
            print("If you want to split the train, make sure that \'datafiles\' is a train set path str.")
            return None
        if dev_size == 0:
            print("Please set size of dev set, as dev_size=...")
            return None
        spilt_train4dev(datafiles, dev_size)
        datafiles = [os.path.dirname(datafiles)+'/tem/train.tsv', os.path.dirname(datafiles)+'/tem/dev.tsv']
    
    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, list) or isinstance(datafiles, tuple):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]


def load_test(datafile):
    
    def read(test_file):
        with open(test_file, 'r', encoding='UTF-8') as fp:
             # if fp.readline().split('\t')[0] == 'label':
            next(fp)  # Skip header
            for line in fp.readlines():
                # print('1\n')
                line_stripped = line.strip().split('\t')
                if not line_stripped:
                    continue
                # example = [line_stripped[indice] for indice in field_indices]
                # example = [line_stripped[indice] for indice in (0,1)]

                # some lines are malformed (the entity and text are not on the same line); drop them
                try:
                    example = [line_stripped[indice] for indice in (0,1)]
                    entity, text = example[0], example[1]
                except IndexError:
                    # drop the dirty data
                    continue

                yield {"tokens": list(text)}

    return MapDataset(list(read(datafile)))

5. COTE-DP

train_ds = load_ds(datafiles='./data/COTE-DP/train.tsv')
print(train_ds[0])
test_ds = load_test(datafile='./data/COTE-DP/test.tsv')
print(test_ds[0])

6. COTE-BD

train_ds = load_ds(datafiles='./data/COTE-BD/train.tsv')
print(train_ds[0])
test_ds = load_test(datafile='./data/COTE-BD/test.tsv')
print(test_ds[0])

7. COTE-MFW

train_ds = load_ds(datafiles='./data/COTE-MFW/train.tsv')
print(train_ds[1])
test_ds = load_test(datafile='./data/COTE-MFW/test.tsv')
print(test_ds[0])

SKEP model construction

Opinion target extraction is a sequence labeling (token classification) task, so unlike the classification models above it uses SkepCrfForTokenClassification: a SKEP encoder with a token classification head and a CRF layer on top.

from paddlenlp.transformers import SkepCrfForTokenClassification, SkepTokenizer, SkepModel
# num_classes: three classes, B/I/O
skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch')
model = SkepCrfForTokenClassification(skep, num_classes=3)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
# # The COTE_DP dataset labels with "BIO" schema.
# label_map = {label: idx for idx, label in enumerate(train_ds.label_list)}
# # `no_entity_label` represents that the token isn't an entity.
# # print(type(no_entity_label_idx))
no_entity_label_idx = 2
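# Note (based on how the model is used later in this notebook): when labels are passed,
# SkepCrfForTokenClassification returns the CRF loss; when called without labels, it
# returns the decoded tag ids, which is what the prediction section below relies on.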

from functools import partial
import os
import time

import numpy as np
import paddle

import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader


def convert_example_to_feature(example,
                               tokenizer,
                               max_seq_len=512,
                               no_entity_label="O",
                               is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has one.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        no_entity_label(obj:`str`, defaults to "O"): The label represents that the token isn't an entity. 
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`list[int]`, optional): The input label if not test data.
    """
    
    tokens = example['tokens']
    labels = example['labels']
    tokenized_input = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    input_ids = tokenized_input['input_ids']
    token_type_ids = tokenized_input['token_type_ids']
    seq_len = tokenized_input['seq_len']

    if is_test:
        return input_ids, token_type_ids, seq_len
    else:
        labels = labels[:(max_seq_len - 2)]
        encoded_label = np.array([no_entity_label] + labels + [no_entity_label], dtype="int64")
        return input_ids, token_type_ids, seq_len, encoded_label


max_seq_length=256
batch_size=40
trans_func = partial(
    convert_example_to_feature,
    tokenizer=tokenizer,
    max_seq_len=max_seq_length,
    no_entity_label=no_entity_label_idx,
    is_test=False)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # token type ids
    Stack(dtype='int64'),  # sequence lens
    Pad(axis=0, pad_val=no_entity_label_idx)  # labels
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# dev_data_loader = create_dataloader(
#     dev_ds,
#     mode='dev',
#     batch_size=batch_size,
#     batchify_fn=batchify_fn,
#     trans_fn=trans_func)

Model training

import time
from utils import evaluate
from paddlenlp.metrics import ChunkEvaluator

epochs = 10
num_training_steps = len(train_data_loader) * epochs

# # test weight_decay
# decay_params = [
#         p.name for n, p in model.named_parameters()
#         if not any(nd in n for nd in ["bias", "norm"])
#     ]
# optimizer = paddle.optimizer.AdamW(
#     learning_rate=1e-5,
#     parameters=model.parameters(),
#     weight_decay=0.01, # test weight_decay
#     apply_decay_param_fun=lambda x: x in decay_params # test weight_decay
#     )

optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-6,
    parameters=model.parameters(),
    )


# metric = ChunkEvaluator(label_list=train_ds.label_list, suffix=True)
metric = ChunkEvaluator(label_list=['B', 'I', 'O'], suffix=True)

ckpt_dir = "skep_opinion7_1e-6"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, seq_lens, labels = batch
        loss = model(
                input_ids, token_type_ids, seq_lens=seq_lens, labels=labels)
        avg_loss = paddle.mean(loss)

        global_step += 1
        if global_step % 10 == 0:
            print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, avg_loss,
                       10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 200 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            
            file_name = os.path.join(save_dir, "model_state.pdparam")
            # Need better way to get inner model of DataParallel
            paddle.save(model.state_dict(), file_name)

Prediction results

import re

def convert_example_to_feature(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it have label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask. 
    """
    tokens = example["tokens"]
    encoded_inputs = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    seq_len = encoded_inputs["seq_len"]

    return input_ids, token_type_ids, seq_len


def parse_predict_result(predictions, seq_lens, label_map):
    """
    Parses the prediction results to the label tag.
    """
    pred_tag = []
    for idx, pred in enumerate(predictions):
        seq_len = seq_lens[idx]
        # drop the "[CLS]" and "[SEP]" token
        tag = [label_map[i] for i in pred[1:seq_len - 1]]
        pred_tag.append(tag)
    return pred_tag

@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for input_ids, token_type_ids, seq_lens in data_loader:
        preds = model(input_ids, token_type_ids, seq_lens=seq_lens)
        tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map)
        results.extend(tags)
    return results

# The COTE_DP dataset labels with "BIO" schema.
label_map = {0: "B", 1: "I", 2: "O"}
# `no_entity_label` represents that the token isn't an entity. 
no_entity_label_idx = 2

batch_size=96

trans_func = partial(
    convert_example_to_feature,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
    Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # token type ids
    Stack(dtype='int64'),  # sequence lens
): [data for data in fn(samples)]

test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

# Change the loaded parameter path according to the actual run!!! Check whether loading succeeds!!!
params_path = 'skep_opinion7_1e-6/model_10200_10/model_state.pdparam'
if params_path and os.path.isfile(params_path):
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
else:
    print("MODEL LOAD FAILURE")
    exit()

results = predict(model, test_data_loader, label_map)

# Punctuation symbols to strip from the extracted entities
punc = '~`!#$%^&*()_+-=|\';":/.,?><~·!@#¥%......&*()-+-=": ';,. ,?><{}'

res_dir = "./results/7_1e-6/10200_10"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)

with open(os.path.join(res_dir, "COTE_MFW.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, example in enumerate(test_ds.data):
        tag = []
        # find each "B I I ..." span predicted by the model and recover the entity string
        for idx1, letter in enumerate(results[idx]):
            if letter == 'B':
                i = idx1
                try:
                    # extend the span while the following tags are 'I'
                    while(results[idx][i+1]=='I'):
                        i = i+1
                except IndexError:
                    # the span reaches the end of the sequence
                    pass
                # join the characters of the span and strip punctuation
                tag.append(re.sub(r"[%s]+" %punc, "", "".join(example['tokens'][idx1:i+1])))
        if tag == []:
            # If no entity is found, output a placeholder, otherwise the competition scoring will break
            tag.append('nothing')
        f.write(str(idx)+"\t"+"\x01".join(tag)+"\n")

Submit

Compress the prediction result files into a zip archive and submit it on the Qianyan competition website.
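For example (a sketch only; the folder and file names must match wherever your final prediction files were written and what the competition expects):

# collect all prediction tsv files into one folder first, e.g. ./results/final_submission/
!zip -q -r submission.zip ./results/final_submission/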

Summary

This project is the final assignment of the PaddleNLP check-in camp and has the following shortcomings:

  • No cross-validation; a portion of the training set is simply held out as the validation set
  • The data was not analyzed or preprocessed
  • Little hyperparameter tuning (lack of experience)

This is my first project completed with Paddle and my first deep learning project.
I have learned a lot through hands-on practice and found there is still much more to learn. Please point out any deficiencies in the comment section.
You are also welcome to like, fork, and follow!

I reached silver level on AI Studio and have lit 4 badges; let's follow each other~ https://aistudio.baidu.com/aistudio/personalcenter/thirdview/815060
