Resources

⭐ ⭐ ⭐ A little Star to show your support is very welcome! ⭐ ⭐ ⭐

Open source is not easy; your support is much appreciated~

  • For more Transformer models in CV and NLP (BERT, ERNIE, ViT, DeiT, Swin Transformer, etc.) and more deep learning materials, see: awesome-DeepLearning

  • For more pretrained language models, see PaddleNLP: https://github.com/PaddlePaddle/PaddleNLP

  • For materials on the Paddle framework, see: the PaddlePaddle deep learning platform

1. Competition Introduction

The CCF Big Data & Computing Intelligence Contest (CCF BDCI) was founded by the China Computer Federation in 2013. Guided by the National Natural Science Foundation of China, it is a large-scale challenge covering algorithms, applications, and systems in big data and artificial intelligence. The contest collects problem statements from key industries and application domains, is oriented toward cutting-edge technology and real industry problems, aims to promote industry development and upgrading, and, through crowdsourcing and collective intelligence, brings together expertise from academia, industry, and research at home and abroad, discovering and cultivating a large number of high-quality data talents.
The contest has been held successfully eight times, attracting more than 120,000 participants from over 1,500 universities, 1,800 enterprises and institutions, and 80 research institutes worldwide. It has become one of the most influential events in China's big data and AI field and the leading comprehensive big data competition brand in China.
The ninth edition in 2021, themed "Data-driven innovation, competition for collective intelligence", is based in Yuhang, open to the world, and runs from September to December. It focuses on solving real pain points from government and enterprise scenarios, inviting outstanding teams worldwide to develop and use data resources and broadly soliciting IT application solutions.

1.1 Task

The competition page is at https://www.datafountain.cn/competitions/529

Opinion extraction aims to extract standardized, structured information, such as product names, review aspects, and opinions, from unstructured review text. Here, the goal is to use semantic sentiment analysis to determine the sentiment polarity of a bank product review and, going further, to use semantic analysis and entity recognition to mark the product names, evaluation aspects, and opinion keywords the review discusses.
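As an illustration of the annotation scheme, each comment is labeled character by character with BIO tags over four entity types (BANK, PRODUCT, COMMENTS_N, COMMENTS_ADJ), plus a sentence-level sentiment class. The sentence and tags below are made up for this example and are not taken from the competition data:

# Hypothetical sample in the {"tokens": ..., "labels": ...} form yielded by the dataset reader in section 2.1.
sample = {
    "tokens": list("交行的信用卡额度很低"),
    "labels": ["B-BANK", "I-BANK", "O",
               "B-PRODUCT", "I-PRODUCT", "I-PRODUCT",
               "B-COMMENTS_N", "I-COMMENTS_N",
               "O", "B-COMMENTS_ADJ"],
}
sample_class = 0  # sentence-level sentiment class id, one of {0, 1, 2}
assert len(sample["tokens"]) == len(sample["labels"])  # one tag per character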

We can therefore treat this as two tasks, named entity recognition and sentiment classification, and then merge the results of the two into a single file for submission to the competition site, as sketched below.
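A minimal sketch of that final merge step, assuming one space-joined BIO tag string per test comment from the NER model and one class id from the classifier; the values below are placeholders, and the column names match the submission built at the end of this notebook:

import pandas as pd

# Placeholder outputs for two test comments: a BIO tag string from the NER model
# and a sentiment class id (0/1/2) from the classifier.
bio_label = ["B-BANK I-BANK O O O", "O O B-COMMENTS_N I-COMMENTS_N O"]
cls_label = [1, 0]

# Merge the two task outputs row by row into the submission format (id, BIO_anno, class).
submit = pd.DataFrame({"id": range(len(bio_label)),
                       "BIO_anno": bio_label,
                       "class": cls_label})
submit.to_csv("submission_demo.csv", index=False)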

2. Named Entity Recognition

import argparse
import os
import random
import time
import math
from functools import partial
import inspect
import numpy as np
import collections
import pandas as pd
from tqdm import tqdm 
import json

2.1 Data Processing


# !unzip -o data/data110473/产品评论观点提取.zip -d data
!unzip -o data/data114938/产品评论观点提取-new.zip -d data
Archive:  data/data114938/产品评论观点提取-new.zip
  inflating: data/submit_example.csv  
  inflating: data/__MACOSX/._submit_example.csv  
  inflating: data/test_public.csv    
  inflating: data/__MACOSX/._test_public.csv  
  inflating: data/train_data_public.csv  
  inflating: data/__MACOSX/._train_data_public.csv  
file_name='data/train_data_public.csv'
data=pd.read_csv(file_name,index_col=0)
data.head()
id | text | BIO_anno | class
0 | 交行14年用过,半年准备提额,却直接被降到1K,半年期间只T过一次三千,其它全部真实消费,第... | B-BANK I-BANK O O O O O O O O O O B-COMMENTS_N... | 0
1 | 单标我有了,最近visa双标返现活动好 | B-PRODUCT I-PRODUCT O O O O O O B-PRODUCT I-PR... | 1
2 | 建设银行提额很慢的…… | B-BANK I-BANK I-BANK I-BANK B-COMMENTS_N I-COM... | 0
3 | 我的怎么显示0.25费率,而且不管分多少期都一样费率,可惜只有69k | O O O O O O O O O O B-COMMENTS_N I-COMMENTS_N ... | 2
4 | 利率不错,可以撸 | B-COMMENTS_N I-COMMENTS_N B-COMMENTS_ADJ I-COM... | 1
data.columns
Index(['text', 'BIO_anno', 'class'], dtype='object')
import os

# Import paddle
import paddle
import paddle.nn.functional as F
import paddle.nn as nn
from paddle.io import DataLoader
from paddle.dataset.common import md5file
# Import paddlenlp libraries
import paddlenlp as ppnlp
from paddlenlp.transformers import LinearDecayWithWarmup
from paddlenlp.metrics import ChunkEvaluator

from paddlenlp.data import Stack, Tuple, Pad, Dict
from paddlenlp.datasets import DatasetBuilder,get_path_from_url

class ProductCommentDataset(DatasetBuilder):
    """Reads the competition CSVs and yields character-level tokens with BIO labels."""

    SPLITS = {
        'train': os.path.join('data', 'train_data_public.csv'),
        'test': os.path.join('data', 'test_public.csv'),
    }

    def _get_data(self, mode, **kwargs):
        default_root = '.'
        filename = self.SPLITS[mode]
        fullname = os.path.join(default_root, filename)
        self.mode = mode
        return fullname

    def _read(self, filename, *args):
        df = pd.read_csv(filename)
        for idx, row in df.iterrows():
            text = row['text']
            # Empty text cells are read as NaN (float); skip them.
            if isinstance(text, float):
                print(text)
                continue
            # Chinese text is tokenized character by character.
            tokens = list(text)
            if self.mode == 'test':
                tags = []  # the test set has no BIO annotations
            else:
                tags = row['BIO_anno'].split()

            yield {"tokens": tokens, "labels": tags}

    def get_labels(self):
        # BIO tag set: bank, product, opinion noun, opinion adjective, plus 'O'.
        return ["B-BANK", "I-BANK", "B-PRODUCT", "I-PRODUCT",
                'B-COMMENTS_N', 'I-COMMENTS_N',
                'B-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'O']
def load_dataset(path_or_read_func,
                 name=None,
                 data_files=None,
                 splits=None,
                 lazy=None,
                 **kwargs):
    reader_instance = ProductCommentDataset(lazy=lazy, name=name, **kwargs)
    datasets = reader_instance.read_datasets(data_files=data_files, splits=splits)
    return datasets
# Create dataset, tokenizer and dataloader.
train_ds, test_ds = load_dataset('ProductCommentDataset', splits=('train', 'test'), lazy=False)
INFO:paddle.utils.download:unique_endpoints {''}

def tokenize_and_align_labels(example, tokenizer, no_entity_id,
                              max_seq_len=512):
    labels = example['labels']
    example = example['tokens']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))
    return tokenized_input
max_seq_length=128
batch_size=64
label_list = train_ds.label_list
label_num = len(label_list)
no_entity_id = label_num - 1
# num_train_epochs=3

from paddlenlp.transformers import BertTokenizer,BertPretrainedModel,BertForTokenClassification
from paddlenlp.transformers import ErnieModel,ErnieForTokenClassification,ErnieTokenizer

# model_name_or_path='macbert-base-chinese'
model_name_or_path='bert-base-chinese'
tokenizer = BertTokenizer.from_pretrained(model_name_or_path)
# Define the model netword and its loss
# last_step = num_train_epochs * len(train_data_loader)
model = BertForTokenClassification.from_pretrained(model_name_or_path, num_classes=label_num)

# model_name_or_path='ernie-1.0'
# tokenizer = ErnieTokenizer.from_pretrained(model_name_or_path)
# model = ErnieForTokenClassification.from_pretrained(model_name_or_path, num_classes=label_num)
[2021-11-09 17:49:52,741] [    INFO] - Found /home/aistudio/.paddlenlp/models/bert-base-chinese/bert-base-chinese-vocab.txt
[2021-11-09 17:49:52,757] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/bert-base-chinese/bert-base-chinese.pdparams
W1109 17:49:52.761127  2731 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W1109 17:49:52.765357  2731 device_context.cc:465] device: 0, cuDNN Version: 7.6.
trans_func = partial(
        tokenize_and_align_labels,
        tokenizer=tokenizer,
        no_entity_id=no_entity_id,
        max_seq_len=max_seq_length)
        
train_ds = train_ds.map(trans_func)

ignore_label = -100

batchify_fn = lambda samples, fn=Dict({
        'input_ids': Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype='int32'),  # input
        'token_type_ids': Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype='int32'),  # segment
        'seq_len': Stack(dtype='int64'),  # seq_len
        'labels': Pad(axis=0, pad_val=no_entity_id, dtype='int64')  # label
    }): fn(samples)


train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=batch_size, shuffle=True, drop_last=True)

train_data_loader = DataLoader(
        dataset=train_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_sampler=train_batch_sampler,
        return_list=True)

test_ds = test_ds.map(trans_func)

test_data_loader = DataLoader(
        dataset=test_ds,
        collate_fn=batchify_fn,
        num_workers=0,
        batch_size=batch_size,
        return_list=True)

2.2 Model Construction

class BertForTokenClassification(BertPretrainedModel):

    def __init__(self, bert, num_classes=2, dropout=None):
        super(BertForTokenClassification, self).__init__()
        self.num_classes = num_classes
        self.bert = bert  # allow bert to be config
        self.dropout = nn.Dropout(dropout if dropout is not None else
                                  self.bert.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.bert.config["hidden_size"],
                                    num_classes)
        self.apply(self.init_weights)

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None):
        sequence_output, _ = self.bert(
            input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)
        return logits
num_train_epochs=3
warmup_steps=0

max_steps=-1
learning_rate=5e-5
adam_epsilon=1e-8
weight_decay=0.0
device='gpu'
paddle.set_device(device)


logging_steps=200


save_steps=100
output_dir='checkpoint'
os.makedirs(output_dir,exist_ok=True)


num_training_steps = max_steps if max_steps > 0 else len(
        train_data_loader) * num_train_epochs

last_step = num_train_epochs * len(train_data_loader)

lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps,
                                         warmup_steps)

# Generate parameter names needed to perform weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]

optimizer = paddle.optimizer.AdamW(
        learning_rate=lr_scheduler,
        epsilon=adam_epsilon,
        parameters=model.parameters(),
        weight_decay=weight_decay,
        apply_decay_param_fun=lambda x: x in decay_params)

loss_fct = nn.loss.CrossEntropyLoss(ignore_index=ignore_label)

metric = ChunkEvaluator(label_list=label_list)

2.3 Model Training

def evaluate(model, loss_fct, metric, data_loader, label_num):
    model.eval()
    metric.reset()
    avg_loss, precision, recall, f1_score = 0, 0, 0, 0
    for batch in data_loader:
        input_ids, token_type_ids, length, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = loss_fct(logits, labels)
        avg_loss = paddle.mean(loss)
        preds = logits.argmax(axis=2)
        num_infer_chunks, num_label_chunks, num_correct_chunks = metric.compute(
            length, preds, labels)
        metric.update(num_infer_chunks.numpy(),
                      num_label_chunks.numpy(), num_correct_chunks.numpy())
        precision, recall, f1_score = metric.accumulate()
    print("eval loss: %f, precision: %f, recall: %f, f1: %f" %
          (avg_loss, precision, recall, f1_score))
    model.train()
def do_train(model,train_data_loader):
    global_step = 0
    tic_train = time.time()
    for epoch in range(num_train_epochs):
        for step, batch in enumerate(train_data_loader):
            global_step += 1
            input_ids, token_type_ids, _, labels = batch
            logits = model(input_ids, token_type_ids)
            loss = loss_fct(logits, labels)
            avg_loss = paddle.mean(loss)
            if global_step % logging_steps == 0:
                print("global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s"
                        % (global_step, epoch, step, avg_loss,
                        logging_steps / (time.time() - tic_train)))
                tic_train = time.time()
            avg_loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()
        if global_step % save_steps == 0 or global_step == last_step:
            model_path=os.path.join(output_dir,"model_ner_%d.pdparams" % global_step)
            paddle.save(model.state_dict(),model_path)

# do_train(model,train_data_loader)

2.4 Model Prediction


def parse_decodes(input_words, id2label, decodes, lens):
    decodes = [x for batch in decodes for x in batch]
    lens = [x for batch in lens for x in batch]

    outputs = []
    for idx, end in enumerate(lens):
        sent = "".join(input_words[idx]['tokens'])
        tags = [id2label[x] for x in decodes[idx][1:end-1]]
        outputs.append([sent,tags])
       
    return outputs
state_dict=paddle.load('checkpoint/model_ner_351.pdparams')
print('loading model')
model.load_dict(state_dict)

id2label = dict(enumerate(test_ds.label_list))
raw_data = test_ds.data

model.eval()
pred_list = []
len_list = []
for step, batch in enumerate(test_data_loader):
    input_ids, token_type_ids, length, labels = batch
    logits = model(input_ids, token_type_ids)
    pred = paddle.argmax(logits, axis=-1)
    pred_list.append(pred.numpy())
    len_list.append(length.numpy())
preds = parse_decodes(raw_data, id2label, pred_list, len_list)
loading model
print(preds[:10])
[['共享一个额度,没啥必要,四个卡不要年费吗?你这种人头,银行最喜欢,广发是出了名的风控严,套现就给你封...', ['O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'O', 'B-BANK', 'I-BANK', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'B-COMMENTS_ADJ', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']], ['炸了,就2000.浦发没那么好心,草', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-BANK', 'I-BANK', 'O', 'O', 'O', 'B-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'O', 'O']], ['挂了电话自己打过去分期提额可以少分一点的', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O', 'O', 'O', 'O', 'O', 'O', 'O']], ['比如你首卡10k,二卡也10k,信报上显示邮政总共给你的授信额度是20k', ['O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-BANK', 'I-BANK', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O', 'O', 'O', 'O']], ['3000吗,浦发总是这样', ['O', 'O', 'O', 'O', 'O', 'O', 'B-BANK', 'I-BANK', 'O', 'O', 'O', 'O']], ['直接抛答案,没有最低只有更低,没什么可比性。', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'B-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'B-COMMENTS_ADJ', 'I-COMMENTS_ADJ', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']], ['那个网点激活的,能顺便办储蓄卡吗', ['O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PRODUCT', 'I-PRODUCT', 'I-PRODUCT', 'O']], ['3000,那要养到啥是才有3万额度啊', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O']], ['30周年庆是随便开吗,额度也随便批么', ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-COMMENTS_N', 'I-COMMENTS_N', 'O', 'O', 'O', 'O', 'O']], ['哦,我搞错了?我的是中信的', ['O', 'O', 'O', 'O', 'I-COMMENTS_ADJ', 'O', 'O', 'O', 'O', 'O', 'B-BANK', 'I-BANK', 'O']]]
bio_label=[' '.join(item[1]) for item in preds]

3. Text Classification

3.1 Data Processing

from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from paddlenlp.transformers import LinearDecayWithWarmup
def read(data_path):
    df=pd.read_csv(data_path)
    for idx,row in df.iterrows():
        words=row['text']
        labels=row['class']
        yield {'text': words, 'label': labels}

# data_path is the argument passed through to read()
train_ds = load_dataset(read, data_path='data/train_data_public.csv',lazy=False)

print(train_ds[:4])
[{'text': '交行14年用过,半年准备提额,却直接被降到1K,半年期间只T过一次三千,其它全部真实消费,第六个月的时候为了增加评分提额,还特意分期两万,但降额后电话投诉,申请提...', 'label': 0}, {'text': '单标我有了,最近visa双标返现活动好', 'label': 1}, {'text': '建设银行提额很慢的……', 'label': 0}, {'text': '我的怎么显示0.25费率,而且不管分多少期都一样费率,可惜只有69k', 'label': 2}]
# Convert an example to token ids
def convert_example(example, tokenizer):
    encoded_inputs = tokenizer(text=example["text"], max_seq_len=512, pad_to_max_seq_len=True)
    return tuple([np.array(x, dtype="int64") for x in [
            encoded_inputs["input_ids"], encoded_inputs["token_type_ids"], [example["label"]]]])
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# Convert the training set to ids
train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer))

# Build the dataloader for the training set
train_batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=32, shuffle=True)
train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=train_batch_sampler, return_list=True)

[2021-11-09 17:50:07,143] [    INFO] - Found /home/aistudio/.paddlenlp/models/bert-base-chinese/bert-base-chinese-vocab.txt

3.2 Model Construction

num_classes=3
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=num_classes)
[2021-11-09 17:50:07,172] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/bert-base-chinese/bert-base-chinese.pdparams
class FocalLoss(nn.Layer):
    def __init__(self, alpha=0.5, gamma=2, weight=None, ignore_index=255):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        # Per-class weights to counter class imbalance (tunable)
        self.weight = paddle.to_tensor(np.array([1.063, 4.468, 1.021]))
        self.ignore_index = ignore_index
        self.ce_fn = nn.CrossEntropyLoss(weight=self.weight, soft_label=False) 
 
    def forward(self, preds, labels):
        logpt = -self.ce_fn(preds, labels)
        pt = paddle.exp(logpt)
        loss = -((1 - pt) ** self.gamma) * self.alpha * logpt
        return loss

3.3 Model Training

num_train_epochs=3
num_training_steps = len(train_data_loader) * num_train_epochs

# Define the learning rate scheduler that adjusts lr during training
lr_scheduler = LinearDecayWithWarmup(5E-5, num_training_steps, 0.0)

# Generate parameter names needed to perform weight decay.
# All bias and LayerNorm parameters are excluded.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]

# Define the optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=0.0,
    apply_decay_param_fun=lambda x: x in decay_params)

# Cross-entropy loss or focal loss; the two can be switched here
# criterion = paddle.nn.loss.CrossEntropyLoss()
criterion=FocalLoss()
# Use accuracy as the evaluation metric
metric = paddle.metric.Accuracy()
# Next, train the model; training takes a while, so this part can be commented out
def do_train(model,train_data_loader):
    global_step = 0
    tic_train = time.time()

    for epoch in range(1, num_train_epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):

            input_ids, token_type_ids, labels = batch
            probs = model(input_ids=input_ids, token_type_ids=token_type_ids)

            probs=paddle.to_tensor(probs, dtype="float64")
            loss = criterion(probs, labels)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()

            global_step += 1
            
            # Print training metrics every 100 steps
            if global_step % 100 == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, loss, acc,
                        10 / (time.time() - tic_train)))
                tic_train = time.time()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()

            if global_step % save_steps == 0 or global_step == last_step:
                model_path=os.path.join(output_dir,"model_classfication_%d.pdparams" % global_step)
                paddle.save(model.state_dict(),model_path)
# Standard training
do_train(model,train_data_loader)
global step 100, epoch: 1, batch: 100, loss: 0.23843, accu: 0.55563, speed: 0.29 step/s
global step 200, epoch: 1, batch: 200, loss: 0.22878, accu: 0.56625, speed: 0.28 step/s
global step 300, epoch: 2, batch: 64, loss: 0.21256, accu: 0.56882, speed: 0.28 step/s
global step 400, epoch: 2, batch: 164, loss: 0.18757, accu: 0.56786, speed: 0.27 step/s
global step 500, epoch: 3, batch: 28, loss: 0.23577, accu: 0.56432, speed: 0.28 step/s
global step 600, epoch: 3, batch: 128, loss: 0.19547, accu: 0.56172, speed: 0.28 step/s
global step 700, epoch: 3, batch: 228, loss: 0.17413, accu: 0.56201, speed: 0.28 step/s
class FGM():
    """Fast Gradient Method (FGM): adversarial training that perturbs the embedding layer along the gradient direction."""

    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb'):
        # emb_name must match the name of the embedding parameters in your model
        for name, param in self.model.named_parameters():
            if not param.stop_gradient and emb_name in name:  # only trainable embedding parameters
                self.backup[name] = param.numpy()  # back up the original values
                grad_tensor = paddle.to_tensor(param.grad)  # param.grad is a numpy array here
                norm = paddle.norm(grad_tensor)  # L2 norm of the gradient
                if norm != 0:
                    r_at = epsilon * grad_tensor / norm
                    param.set_value(param + r_at)  # apply the perturbation to the embedding in place

    def restore(self, emb_name='emb'):
        # emb_name must match the name of the embedding parameters in your model
        for name, param in self.model.named_parameters():
            if not param.stop_gradient and emb_name in name:
                assert name in self.backup
                param.set_value(self.backup[name])  # restore the original embedding values
        self.backup = {}
# print(model.named_parameters())
# for k,v in model.named_parameters():
#     print(k)
# Next, train the model; training takes a while, so this part can be commented out
def do_adversarial_train(model,train_data_loader):

    fgm = FGM(model)
    global_step = 0
    tic_train = time.time()

    for epoch in range(1, num_train_epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):

            input_ids, token_type_ids, labels = batch
            probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
            loss = criterion(probs, labels)
            loss.backward()

            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()

            global_step += 1
            
            # Print training metrics every 100 steps
            if global_step % 100 == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, loss, acc,
                        10 / (time.time() - tic_train)))
                tic_train = time.time()
            

            # Adversarial training step
            fgm.attack()  # add an adversarial perturbation to the embedding
            probs_adv = model(input_ids=input_ids, token_type_ids=token_type_ids)
            loss_adv = criterion(probs_adv, labels)
            loss_adv.backward()  # backprop, accumulating the adversarial gradient on top of the normal gradient
            fgm.restore()  # restore the embedding parameters

            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_grad()

            if global_step % save_steps == 0 or global_step == last_step:
                model_path=os.path.join(output_dir,"model_classfication_%d.pdparams" % global_step)
                paddle.save(model.state_dict(),model_path)
# Adversarial training
# do_adversarial_train(model,train_data_loader)

3.4 Model Prediction

def read_text(data_path):
    df=pd.read_csv(data_path)
    for idx,row in df.iterrows():
        words=row['text']
        labels=0
        yield {'text': words, 'label': labels}

test_ds = load_dataset(read_text, data_path='data/test_public.csv',lazy=False)
print(test_ds[:4])
test_ds = test_ds.map(partial(convert_example, tokenizer=tokenizer))
test_batch_sampler = paddle.io.BatchSampler(test_ds, batch_size=8, shuffle=False)

test_data_loader = paddle.io.DataLoader(
        dataset=test_ds,
        batch_sampler=test_batch_sampler,
        return_list=True) 
[{'text': '共享一个额度,没啥必要,四个卡不要年费吗?你这种人头,银行最喜欢,广发是出了名的风控严,套现就给你封...', 'label': 0}, {'text': '炸了,就2000.浦发没那么好心,草', 'label': 0}, {'text': '挂了电话自己打过去分期提额可以少分一点的', 'label': 0}, {'text': '比如你首卡10k,二卡也10k,信报上显示邮政总共给你的授信额度是20k', 'label': 0}]
@paddle.no_grad()
def predict(model,test_data_loader):
    model.eval()
    metric.reset()
    losses=[]
    result=[]
    for step, batch in enumerate(test_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        probs = model(input_ids=input_ids, token_type_ids=token_type_ids)
        # print(probs)
        out2 = paddle.argmax(probs, axis=1)
        result.extend(out2.numpy().tolist())
    return result

static_dict=paddle.load('checkpoint/model_classfication_700.pdparams')
model.load_dict(static_dict)
result=predict(model,test_data_loader)
result_data=[]
for idx,(bio,cls) in enumerate(zip(bio_label,result)):
    result_data.append([idx,bio,cls])

submit=pd.DataFrame(result_data,columns=['id','BIO_anno','class'])
submit.to_csv('submission_v1.csv',index=False)
submit.head(10)
id | BIO_anno | class
0 | O O O O B-COMMENTS_N I-COMMENTS_N O O O O O O ... | 2
1 | O O O O O O O O O B-BANK I-BANK O O O B-COMMEN... | 1
2 | O O O O O O O O O B-PRODUCT I-PRODUCT B-COMMEN... | 2
3 | O O O B-PRODUCT I-PRODUCT O O O O O O O O O O ... | 2
4 | O O O O O O B-BANK I-BANK O O O O | 1
5 | O O O O O O O O B-COMMENTS_ADJ I-COMMENTS_ADJ ... | 1
6 | O O O O B-COMMENTS_N I-COMMENTS_N O O O O O O ... | 2
7 | O O O O O O O O O O O O O O O B-COMMENTS_N I-C... | 2
8 | O O O O O O O O O O O B-COMMENTS_N I-COMMENTS_... | 2
9 | O O O O I-COMMENTS_ADJ O O O O O B-BANK I-BANK O | 1

4. Ideas for Model Improvement

1. Data augmentation: Chinese data augmentation tools, back-translation, etc.

2. Try different pretrained models, hyperparameter tuning, etc.

3. 5-fold cross-validation, ensembling the results of multiple models, etc.

4. If you have the capacity, try continued pre-training on the competition data or modifying the network: for the NER task, add an LSTM or a CRF layer after BERT (a sketch follows this list); for the sentiment task, experiment with a variety of pretrained language models.
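As a sketch of the network-modification idea above, the snippet below stacks a bidirectional LSTM between BERT's token representations and the classification layer for the NER task. The class name, LSTM hidden size, and dropout value are illustrative assumptions rather than part of the original solution; a CRF layer (e.g. paddlenlp.layers.LinearChainCrf) could be added on top of the emitted logits in the same spirit. The logits have the same shape as those from BertForTokenClassification, so the training loop in section 2.3 can be reused.

import paddle.nn as nn
from paddlenlp.transformers import BertModel

class BertBiLSTMForTokenClassification(nn.Layer):
    """Hypothetical BERT + BiLSTM token classification head (illustrative sketch)."""

    def __init__(self, num_classes, lstm_hidden=256, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-chinese')
        hidden_size = self.bert.config['hidden_size']
        # Bidirectional LSTM over the per-token BERT representations.
        self.lstm = nn.LSTM(hidden_size, lstm_hidden, direction='bidirect')
        self.dropout = nn.Dropout(dropout)
        # The BiLSTM output concatenates both directions, hence 2 * lstm_hidden.
        self.classifier = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, input_ids, token_type_ids=None):
        sequence_output, _ = self.bert(input_ids, token_type_ids=token_type_ids)
        lstm_output, _ = self.lstm(sequence_output)
        logits = self.classifier(self.dropout(lstm_output))
        return logits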

About PaddleNLP: when using it in practice, it is worth consulting the PaddleNLP documentation.

PaddleNLP on GitHub: https://github.com/PaddlePaddle/PaddleNLP. If you run into problems, you can open an issue there.

5. More From PaddleEdu

1. The PaddleEdu one-stop deep learning online encyclopedia awesome-DeepLearning has more in store, so stay tuned:

  • Introductory deep learning course
  • Deep learning: 100 questions
  • Featured courses
  • Industry practice

If you run into any problems while using PaddleEdu, feel free to open an issue on awesome-DeepLearning; for more deep learning materials, see the PaddlePaddle deep learning platform.

Remember to give it a Star ⭐ ~~

2. PaddlePaddle PaddleEdu technical discussion group (QQ)

The QQ group already has 2000+ members learning together; you are welcome to scan the code and join.

(QR code image)
