Assignment 3: Fine tuning a multiclass classification BERT model
Description: This assignment covers fine-tuning of a multiclass classification model.
You will compare two different types of solutions using BERT-based models.
You should also be able to develop an intuition for:
Working with BERT
The effects of using different model checkpoints and fine-tuning some
hyperparameters
Different metrics to measure the effectiveness of your model
The effect of partially cleaning/normalizing your training data
The assignment notebook closely follows the lesson notebooks. We will use the
20 newsgroups dataset and will leverage some of the models, or part of the
code, for our current investigation.
You are strongly encouraged to read through the entire notebook before
answering any questions or writing any code.
The initial part of the notebook is purely setup. We will then generate our BERT
model and see if and how we can improve it.
Do not try to run this entire notebook on your GCP instance as the training of
models requires a GPU to work in a timely fashion. This notebook should be run
on a Google Colab leveraging a GPU. By default, when you open the notebook
in Colab it will try to use a GPU. Total runtime of the entire notebook (with
solutions and a Colab GPU) should be about 1h.
Open in Colab
The overall assignment structure is as follows:
1. Setup
1.1 Libraries & Helper Functions
1.2 Data Acquisition
1.3 Training/Test/Validation Sets for BERT-based models
2. Classification with a fine tuned BERT model
2.1 Create the specified BERT model
2.2 Fine tune the BERT model as directed
2.3 Examine the predictions with various metrics
3. Classification with some preprocessed data and the BERT model
3.1 Clean up the data a bit
3.2 Regenerate the data with the appropriate tokenizer
3.3 Regenerate the BERT model
3.4. Rerun the data and examine the predictions
4. Try again with a different mini batch size to see if that improves performance
INSTRUCTIONS:
Questions are always indicated as QUESTION:, so you can search for this
string to make sure you answered all of the questions. You are expected to
fill out, run, and submit this notebook, as well as to answer the questions in
the answers file as you did in a1 and a2.
### YOUR CODE HERE indicates that you are supposed to write code.
If you want to, you can run all of the cells in section 1 in bulk. This is setup
work and no questions are in there. At the end of section 1 we will state all
of the relevant variables that were defined and created in section 1.
1. Setup
Let's get all our libraries and download and process our data.
In [1]: !pip install -q transformers
In [2]: !pip install pydot --quiet
In [3]: from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
In [4]: from collections import Counter
import numpy as np
import tensorflow as tf
from tensorflow import keras
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import pprint
In [5]: from transformers import BertTokenizer, TFBertModel
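As noted above, training in this notebook really needs a Colab GPU. An optional sanity check, not part of the assignment, is to ask TensorFlow which devices it can see before you start:

# optional: confirm that TensorFlow can see a GPU before running the training cells
# in Colab: Runtime -> Change runtime type -> GPU if the list below comes back empty
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus if gpus else "none - training will be very slow")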
In [6]: # 4-window plot. Small modification from matplotlib examples.
def make_plot(axs, history1,
history2,
y_lim_loss_lower=0.4,
y_lim_loss_upper=1.6,
y_lim_accuracy_lower=0.4,
y_lim_accuracy_upper=0.9,
model_1_name='model 1',
model_2_name='model 2',

             ):
    box = dict(facecolor='yellow', pad=5, alpha=0.2)

    ax1 = axs[0, 0]
    ax1.plot(history1.history['loss'])
    ax1.plot(history1.history['val_loss'])
    ax1.set_title('loss - ' + model_1_name)
    ax1.set_ylabel('loss', bbox=box)
    ax1.set_ylim(y_lim_loss_lower, y_lim_loss_upper)

    ax3 = axs[1, 0]
    ax3.set_title('accuracy - ' + model_1_name)
    ax3.plot(history1.history['accuracy'])
    ax3.plot(history1.history['val_accuracy'])
    ax3.set_ylabel('accuracy', bbox=box)
    ax3.set_ylim(y_lim_accuracy_lower, y_lim_accuracy_upper)

    ax2 = axs[0, 1]
    ax2.set_title('loss - ' + model_2_name)
    ax2.plot(history2.history['loss'])
    ax2.plot(history2.history['val_loss'])
    ax2.set_ylim(y_lim_loss_lower, y_lim_loss_upper)

    ax4 = axs[1, 1]
    ax4.set_title('accuracy - ' + model_2_name)
    ax4.plot(history2.history['accuracy'])
    ax4.plot(history2.history['val_accuracy'])
    ax4.set_ylim(y_lim_accuracy_lower, y_lim_accuracy_upper)

In [7]: def read_20newsgroups(test_size=0.1):
    # download & load 20newsgroups dataset from sklearn's repos
    # (the remove= tuple is assumed to drop headers, footers, and quotes)
    dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
    documents = dataset.data
    labels = dataset.target
    # split into training & testing and return the data as well as the label names
    return train_test_split(documents, labels, test_size=test_size), dataset.target_names

# call the function
(train_texts, test_texts, train_labels, test_labels), target_names = read_20newsgroups()

In [8]: train_texts[:2]

Take a look at the records. We basically have a long string of text and an associated label. That label is the Usenet group where the posting occurred. The records are the raw text. They vary significantly in size.

Notice the "labels" are just integers that are an offset into the list of target names.
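As a quick, optional illustration of that offset relationship (nothing here is required by the assignment), each label integer simply indexes into target_names:

# each integer label is an index into target_names
print(train_labels[0], '->', target_names[train_labels[0]])
print(len(train_texts), 'training documents before the validation split')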
In [9]: train_labels[:2]

In [10]: print(target_names)

The variable target_names stores all of the names of the labels.

In [11]: len(train_texts)

We already have a test set and a train set. Let's explicitly set aside part of our training set for validation purposes.

The validation set will always have 961 records.

The training set will always have 16000 records.

valid_texts = train_texts[16000:]
valid_labels = train_labels[16000:]
train_texts = train_texts[:16000]
train_labels = train_labels[:16000]

In [12]: len(valid_texts)

In [13]: len(train_texts)

In [14]: #get the labels in a needed data format for validation
npvalid_labels = np.asarray(valid_labels)

To recap, section 1 created the following variables:

train_texts - an array of text strings for training
test_texts - an array of text strings for testing
valid_texts - an array of text strings for validation
train_labels - an array of integers representing the labels associated with train_texts
test_labels - an array of integers representing the labels associated with test_texts
valid_labels - an array of integers representing the labels associated with valid_texts
target_names - an array of label strings that correspond to the integers in the *_labels arrays
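A quick sanity check on the split (an optional sketch, reusing the Counter import from the setup cells) confirms the sizes quoted above and shows how the classes are spread across the validation slice:

# confirm the sizes stated above and peek at the label balance
print(len(train_texts), len(valid_texts), len(test_texts))   # expect 16000 and 961 for train / validation
print(Counter(valid_labels))                                  # rough class balance in the validation slice
print(npvalid_labels.shape, npvalid_labels.dtype)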
2. Classification with a fine tuned BERT model
Let's pick our BERT model. We'll start with the base BERT model and we'll use the cased version since our data has capital and lower case letters.

In [15]: #make it easier to use a variety of BERT subword models
model_checkpoint = 'bert-base-cased'

In [16]: bert_tokenizer = BertTokenizer.from_pretrained(model_checkpoint)
bert_model = TFBertModel.from_pretrained(model_checkpoint)

We're setting our maximum training record length to 200. BERT models can handle more, and after you've completed the assignment you're welcome to try larger and smaller sized records.

In [17]: max_length = 200

Now we'll tokenize our three data slices. This will take a minute or two.

In [18]: # tokenize the dataset, truncate when passed `max_length`,
# and pad with 0's when less than `max_length` and return a tf Tensor
# (the padding / max_length / return_tensors arguments are assumed here)
train_encodings = bert_tokenizer(train_texts, truncation=True, padding=True,
                                 max_length=max_length, return_tensors="tf")
valid_encodings = bert_tokenizer(valid_texts, truncation=True, padding=True,
                                 max_length=max_length, return_tensors="tf")
test_encodings = bert_tokenizer(test_texts, truncation=True, padding=True,
                                max_length=max_length, return_tensors="tf")

Notice our input_ids for the first training record and their padding. The train_encodings also includes an array of token_type_ids and an attention_mask array.

In [19]: train_encodings.input_ids[:1]

Write a function to create this multiclass BERT model.

Keep in mind the following:

Each record can have one of n labels where n = the size of target_names.
We'll still want a hidden size layer of size 100
We'll also want dropout
Our classification layer will need to be appropriately sized and use the
correct non-linearity for a multi-class problem.
Since we have multiple labels we can no longer use binary cross entropy.
Instead we need to change our loss metric to a categorical cross entropy.
Which of the two categorical cross entropy metrics will work best here?
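For reference, Keras exposes both categorical cross entropy variants. The toy snippet below (purely illustrative, with made-up numbers) shows the practical difference between them: one consumes integer class ids, the other consumes one-hot vectors. It does not answer the question for you, but it shows what each loss expects as input.

# toy example: two examples, three classes, identical predictions either way
probs = tf.constant([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1]])

int_labels = tf.constant([0, 1])                  # integer class ids, like train_labels
onehot_labels = tf.one_hot(int_labels, depth=3)   # the same labels as one-hot rows

sparse_loss = keras.losses.SparseCategoricalCrossentropy()
dense_loss = keras.losses.CategoricalCrossentropy()

print(sparse_loss(int_labels, probs).numpy())     # works directly with integer labels
print(dense_loss(onehot_labels, probs).numpy())   # needs one-hot labels; same value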
QUESTION: 2.1 How many trainable parameters are in your dense hidden layer?
QUESTION: 2.2 How many trainable parameters are in your classification layer?
In [20]: def create_bert_multiclass_model(train_layers=-1,
                                          hidden_size=100,
                                          dropout=0.3,
                                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the Pooled Output for classification purposes.
    """
    ### YOUR CODE HERE

    #restrict training to the train_layers outer transformer layers


    ### END YOUR CODE

    return classification_model
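The skeleton above is left for you to fill in. For orientation only, here is one way a pooled-output BERT classifier is often wired up in Keras. Treat it as a sketch under the assumptions flagged in the comments (the function name, the freezing scheme for train_layers, and the compile settings are assumptions), not as the intended solution to the exercise.

# Illustrative sketch only -- one common shape for a pooled-output BERT classifier.
# Assumes bert_model, max_length, and target_names are defined as above.
def sketch_bert_multiclass_model(train_layers=-1,
                                 hidden_size=100,
                                 dropout=0.3,
                                 learning_rate=0.00005):
    # optionally freeze all but the outermost `train_layers` encoder blocks (assumed scheme)
    if train_layers >= 0:
        n_layers = len(bert_model.bert.encoder.layer)
        for layer in bert_model.bert.encoder.layer[:n_layers - train_layers]:
            layer.trainable = False

    input_ids = keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
    token_type_ids = keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='token_type_ids')
    attention_mask = keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask')

    # pooled output = BERT's [CLS] representation passed through its pooler layer
    pooled = bert_model(input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask).pooler_output

    hidden = keras.layers.Dense(hidden_size, activation='relu')(pooled)
    hidden = keras.layers.Dropout(dropout)(hidden)
    # one output per newsgroup, softmax for a single-label multiclass problem
    outputs = keras.layers.Dense(len(target_names), activation='softmax')(hidden)

    model = keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])
    return model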
In [21]: pooled_bert_model = create_bert_multiclass_model()
In [22]: pooled_bert_model.summary()
In [23]: keras.utils.plot_model(pooled_bert_model, show_shapes=True, dpi=90)
In [24]: #It takes 10 to 14 minutes to complete an epoch when using a GPU
#the token_type_ids / attention_mask inputs and the validation tensors are assumed here
pooled_bert_model_history = pooled_bert_model.fit([train_encodings.input_ids,
                                                   train_encodings.token_type_ids,
                                                   train_encodings.attention_mask],
                                                  train_labels,
                                                  validation_data=([valid_encodings.input_ids,
                                                                    valid_encodings.token_type_ids,
                                                                    valid_encodings.attention_mask],
                                                                   npvalid_labels),
                                                  batch_size=16,
                                                  epochs=1)
Now we need to run evaluate against our fine-tuned model. This will give us an overall accuracy based on the test set.

QUESTION: 2.3 What is the Test accuracy score you get from your model with a batch size of 8? (Just copy and paste the value into the answers sheet and round to five significant digits.)

In [ ]: #batch 8, ML=200
score = pooled_bert_model.evaluate([test_encodings.input_ids,
                                    test_encodings.token_type_ids,
                                    test_encodings.attention_mask],
                                   test_labels)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

There are two ways to see what's going on with our classifier. Overall accuracy is interesting but it can be misleading. We need to make sure that each of our categories' prediction performance is operating at an equal or higher level than the overall.

Here we'll use the classification report from scikit-learn. It expects two inputs as arrays. One is the ground truth (y_true) and the other is the associated prediction (y_pred). This is based on gathering all the predictions from our test set.
In [26]: #run predict for the first three elements in the test data set
predictions = pooled_bert_model.predict([test_encodings.input_ids[:3],
                                         test_encodings.token_type_ids[:3],
                                         test_encodings.attention_mask[:3]])

In [27]: predictions
In [28]: #run and capture all predictions from our test set using model.predict
### YOUR CODE HERE
### END YOUR CODE
#now we need to get the index of the highest probability in the distribution for each prediction
#and store those predicted labels in a tf.Tensor
predictions = tf.argmax(predictions, axis=-1)
predictions
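If the argmax step above feels opaque, here is a tiny self-contained example (made-up numbers, no model involved) of going from a probability matrix to predicted labels and a classification report. It is the same pipeline applied to the real test predictions in the surrounding cells.

# toy illustration: 4 examples, 3 classes
toy_probs = tf.constant([[0.9, 0.05, 0.05],
                         [0.2, 0.70, 0.10],
                         [0.1, 0.30, 0.60],
                         [0.5, 0.40, 0.10]])
toy_true = [0, 1, 2, 1]

toy_pred = tf.argmax(toy_probs, axis=-1)                     # index of the largest probability per row
print(toy_pred.numpy())                                      # [0 1 2 0]
print(classification_report(toy_true, toy_pred.numpy()))     # includes macro and weighted averages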
In [29]: print(classification_report(test_labels, predictions.numpy(), target_names=target_names))

QUESTION: 2.4 What is the macro average f1 score you get from the classification report for batch size 8?

Now we'll generate another very valuable visualization of what's happening with our classifier -- a confusion matrix.

In [30]: cm = tf.math.confusion_matrix(test_labels, predictions)
#normalize each row so the entries are fractions of the true class
cm = cm/cm.numpy().sum(axis=1)[:, tf.newaxis]

And now we'll display it!

In [31]: plt.figure(figsize=(20,7))
sns.heatmap(
    cm, annot=True,
    xticklabels=target_names,
    yticklabels=target_names)
plt.xlabel("Predicted")
plt.ylabel("True")

In [ ]:

3. Classification with some preprocessed data and the BERT model

Okay, not bad. As you saw there are a lot of odd characters in our input, so maybe cleaning some of those out and forcing everything to lower case while running a bert-base-uncased model will give us some improvement in our predictions. Let's give that a shot. First let's clean out our text a bit. Remember, it is critical that we perform identical preprocessing on our training, test, and validation sets.
In [32]: def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('\n', ' ')
    #what other characters or strings might you replace to clean up this data
    #we don't expect a full set. Please enter six of them here.
    ### YOUR CODE HERE

    ### END YOUR CODE
    return sentence

cleantrain_texts = list(map(preprocess, train_texts))

#you need to make sure you apply the same preprocessing to the test and validation sets
### YOUR CODE HERE

### END YOUR CODE
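For a sense of what such replacements might look like, here is an illustrative sketch only; these particular substitutions are assumptions, not the expected answer set, but they are one way to attack the odd characters mentioned above.

# example substitutions one *might* add inside preprocess(); pick your own six above
def example_preprocess(sentence):
    sentence = str(sentence).lower()
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('\t', ' ')    # tabs
    sentence = sentence.replace('>', ' ')     # quoted-reply markers
    sentence = sentence.replace('--', ' ')    # signature separators
    sentence = sentence.replace('"', ' ')
    sentence = sentence.replace('|', ' ')
    sentence = sentence.replace('  ', ' ')    # collapse double spaces
    return sentence

print(example_preprocess("Subject line\t> quoted text -- sig"))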
In [33]: cleantrain_texts[:2]

In [34]: cleantest_texts[:2]

Call the function to recreate our BERT model, only this time it will use the model_checkpoint of bert-base-uncased.

In [35]: model_checkpoint = 'bert-base-uncased'
bert_uctokenizer = BertTokenizer.from_pretrained(model_checkpoint)
bert_model = TFBertModel.from_pretrained(model_checkpoint)

In [36]: # tokenize the dataset, truncate when passed `max_length`,
# and pad with 0's when less than `max_length`
# (the padding / max_length / return_tensors arguments are assumed here)
cleantrain_encodings = bert_uctokenizer(cleantrain_texts, truncation=True, padding=True,
                                        max_length=max_length, return_tensors="tf")
cleanvalid_encodings = bert_uctokenizer(cleanvalid_texts, truncation=True, padding=True,
                                        max_length=max_length, return_tensors="tf")
cleantest_encodings = bert_uctokenizer(cleantest_texts, truncation=True, padding=True,
                                       max_length=max_length, return_tensors="tf")

In [37]: cleanvalid_encodings.input_ids[:2]

In [38]: clean_pooled_bert_model = create_bert_multiclass_model()

In [39]: #the token_type_ids / attention_mask inputs and the validation tensors are assumed here
clean_pooled_bert_model_history = clean_pooled_bert_model.fit([cleantrain_encodings.input_ids,
                                                               cleantrain_encodings.token_type_ids,
                                                               cleantrain_encodings.attention_mask],
                                                              train_labels,
                                                              validation_data=([cleanvalid_encodings.input_ids,
                                                                                cleanvalid_encodings.token_type_ids,
                                                                                cleanvalid_encodings.attention_mask],
                                                                               npvalid_labels),
                                                              batch_size=8,
                                                              epochs=1)

This will only display a plot if we've run for more than one epoch. We're not asking you to run more than one in this assignment, but when you're done you might try running another just to see how much more the model learns.
In [ ]: fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(left=0.2, wspace=0.6)
make_plot(axs,
          pooled_bert_model_history,
          clean_pooled_bert_model_history,
          model_1_name='raw',
          model_2_name='clean',
          y_lim_accuracy_lower=0.42,
          y_lim_accuracy_upper=0.82)
fig.align_ylabels(axs[:, 1])
fig.set_size_inches(18.5, 10.5)
plt.show()

In [ ]:

QUESTION: 3.1 What is the test accuracy you get when you run the cleaned model with batch size 8?
In [40]: #Evaluate the fine tuned clean model against the cleaned test data
### YOUR CODE HERE
### END YOUR CODE
print('Test loss:', score[0])
print('Test accuracy:', score[1])
In [41]: #run and capture all the predictions from the clean test data
### YOUR CODE HERE
### END YOUR CODE
predictions
In [42]: #Generate a confusion matrix using your new clean test predictions
# ccm = ...
### YOUR CODE HERE
### END YOUR CODE
In [43]: #display that new confusion matrix
plt.figure(figsize=(20,7))
sns.heatmap(
    ccm, annot=True,
    xticklabels=target_names,
    yticklabels=target_names)
plt.xlabel("Predicted")
plt.ylabel("True")
QUESTION: 3.2 What is the weighted avg F1 score in the classification report when you run the cleaned model with batch size of 8?

In [44]: # Run the sklearn classification_report again with the new predictions
### YOUR CODE HERE

### END YOUR CODE

4. Try again with a different mini batch size to see if that improves performance
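The rest of section 4 is not captured here, but the experiment the heading describes amounts to rebuilding the model and refitting with a different mini batch size. A minimal sketch follows; the variable names and the example value of 16 are made up, and the input lists simply mirror the earlier fit calls.

# sketch only: same clean data, same model-building function, different mini batch size
bigger_batch_model = create_bert_multiclass_model()
bigger_batch_history = bigger_batch_model.fit([cleantrain_encodings.input_ids,
                                               cleantrain_encodings.token_type_ids,
                                               cleantrain_encodings.attention_mask],
                                              train_labels,
                                              validation_data=([cleanvalid_encodings.input_ids,
                                                                cleanvalid_encodings.token_type_ids,
                                                                cleanvalid_encodings.attention_mask],
                                                               npvalid_labels),
                                              batch_size=16,   # try a different value here
                                              epochs=1)
bigger_batch_model.evaluate([cleantest_encodings.input_ids,
                             cleantest_encodings.token_type_ids,
                             cleantest_encodings.attention_mask],
                            test_labels)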

