Assignment 2: Text Classification with Various Neural Networks
Description: This assignment covers various neural network architectures and components, largely used in the context of classification. You will compare Deep Averaging Networks, Deep Weighted Averaging Networks using Attention, and BERT-based models. You should also be able to develop an intuition for:

The effects of fine-tuning word vectors or starting with random word vectors
How various networks behave when the training set size changes
The effect of shuffling your training data
The benefits of Attention calculations
Working with BERT

The assignment notebook closely follows the lesson notebooks. We will use the IMDB dataset and will leverage some of the models, or part of the code, for our current investigation.
The initial part of the notebook is purely setup. We will then evaluate how Attention can make Deep Averaging Networks better.

Do not try to run this entire notebook on your GCP instance, as training the models requires a GPU to work in a timely fashion. This notebook should be run on Google Colab leveraging a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1h.
The overall assignment structure is as follows:

1. Setup
1.1 Libraries, Embeddings, & Helper Functions
1.2 Data Acquisition
1.3 Data Preparation
1.3.1 Training/Test Sets using Word2Vec
1.3.2 Training/Test Sets for BERT-based models
2. Classification with various Word2Vec-based Models
2.1 The Role of Shuffling of the Training Set
2.2 DAN vs Weighted Averaging Models using Attention
2.2.1 Warm-Up
2.2.2 The WAN Model
2.3 Approaches for Training of Embeddings
3. Classification with BERT
3.1 BERT Basics
3.2 CLS-Token-based Classification
3.3 Averaging of BERT Outputs
3.4 Adding a CNN on top of BERT
INSTRUCTIONS:

Questions are always indicated as QUESTION, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the answers file as you did in a1.

### YOUR CODE HERE indicates that you are supposed to write code.

If you want to, you can run all of the cells in section 1 in bulk. This is setup work and there are no questions in it. At the end of section 1 we state all of the relevant variables that were defined and created in section 1.
1. Setup

1.1 Libraries and Helper Functions

This notebook requires the TensorFlow datasets library and other prerequisites that you must download.

In [1]: #@title Installs
!pip install pydot --quiet
!pip install gensim==3.8.3 --quiet
!pip install tensorflow-datasets --quiet
!pip install -U tensorflow-text==2.8.2 --quiet
!pip install transformers --quiet

Now we are ready to do the imports.

In [2]: #@title Imports
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
from transformers import BertTokenizer, TFBertModel
import sklearn as sk
import os
import nltk
from nltk.corpus import reuters
from nltk.data import find
import matplotlib.pyplot as plt
import re
# This continues to work with gensim 3.8.3. It doesn't yet work with 4.x.
# Make sure your pip install command specifies gensim==3.8.3
import gensim

Below is a helper function to plot histories.
In [3]: #@title Plotting Function
# 4-window plot. Small modification from matplotlib examples.
def make_plot(axs, history1,
              history2,
              y_lim_loss_lower=0.4,
              y_lim_loss_upper=0.6,
              y_lim_accuracy_lower=0.7,
              y_lim_accuracy_upper=0.8,
              model_1_name='model 1',
              model_2_name='model 2',
              ):
    box = dict(facecolor='yellow', pad=5, alpha=0.2)

    ax1 = axs[0, 0]
    ax1.plot(history1.history['loss'])
    ax1.plot(history1.history['val_loss'])
    ax1.set_title('loss - ' + model_1_name)
    ax1.set_ylabel('loss', bbox=box)
    ax1.set_ylim(y_lim_loss_lower, y_lim_loss_upper)

    ax3 = axs[1, 0]
    ax3.set_title('accuracy - ' + model_1_name)
    ax3.plot(history1.history['accuracy'])
    ax3.plot(history1.history['val_accuracy'])
    ax3.set_ylabel('accuracy', bbox=box)
    ax3.set_ylim(y_lim_accuracy_lower, y_lim_accuracy_upper)

    ax2 = axs[0, 1]
    ax2.set_title('loss - ' + model_2_name)
    ax2.plot(history2.history['loss'])
    ax2.plot(history2.history['val_loss'])
    ax2.set_ylim(y_lim_loss_lower, y_lim_loss_upper)

    ax4 = axs[1, 1]
    ax4.set_title('accuracy - ' + model_2_name)
    # small adjustment to account for the 2 accuracy measures in the Weighted Averaging Model with Attention
    if 'classification_accuracy' in history2.history.keys():
        ax4.plot(history2.history['classification_accuracy'])
    else:
        ax4.plot(history2.history['accuracy'])
    if 'val_classification_accuracy' in history2.history.keys():
        ax4.plot(history2.history['val_classification_accuracy'])
    else:
        ax4.plot(history2.history['val_accuracy'])
    ax4.set_ylim(y_lim_accuracy_lower, y_lim_accuracy_upper)

Next, we get the word2vec model from nltk.

In [4]: #@title NLTK & Word2Vec
nltk.download('word2vec_sample')
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample)

Now that we have the embedding model defined, let's see how many words are in the vocabulary:

In [5]: len(model.vocab)

What do the word vectors look like? As expected:

In [6]: model['great'][:20]

We can now build the embedding matrix and a vocabulary dictionary:
In [7]: #@title Embedding Matrix Creation
EMBEDDING_DIM = len(model['university']) # we know... it's 300
# initialize embedding matrix and word-to-id map:
embedding_matrix = np.zeros((len(model.vocab.keys()) + 1, EMBEDDING_DIM))
vocab_dict = {}
# build the embedding matrix and the word-to-id map:
for i, word in enumerate(model.vocab.keys()):
    embedding_vector = model[word]
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
        vocab_dict[word] = i

The last row consists of all zeros. We will use that for the UNK token, the placeholder token for unknown words.

In [8]: embedding_matrix

1.2 Data Acquisition

We will use the IMDB dataset delivered as part of the TensorFlow-datasets library, split into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [9]: train_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:80%]', 'test[80%:]'),
    as_supervised=True)

# NOTE: the batch sizes below were cut off in the source; 20000 and 5000
# cover the 'train[:80%]' and 'test[80%:]' splits of IMDB's 25k examples.
train_examples_batch, train_labels_batch = next(iter(train_data.batch(20000)))
test_examples_batch, test_labels_batch = next(iter(test_data.batch(5000)))

It is always highly recommended to look at the data.

In [10]: train_examples_batch[2:4]

In [11]: train_labels_batch[2:4]

For convenience, in this assignment we will define a maximum length and only keep the examples that are at least that long.
In [12]: MAX_SEQUENCE_LENGTH = 100
For simplicity, we will also limit ourselves to examples that actually have at least MAX_SEQUENCE_LENGTH tokens.

1.3 Data Preparation

1.3.1 Training/Test Sets for Word2Vec-based Models

First, we tokenize the data:

In [13]: tokenizer = tf_text.WhitespaceTokenizer()
train_tokens = tokenizer.tokenize(train_examples_batch)
test_tokens = tokenizer.tokenize(test_examples_batch)

Does this look right?

In [14]: train_tokens[0]

Yup... looks right. Of course we will need to take care of the encoding later.

Next, we define a simple function that converts the tokens above into the appropriate word2vec index values.
In [15]: #@title Definition of sents_to_ids function
def sents_to_ids(token_list_list, label_list, num_examples=100000000):
    """
    Convert a list of token lists to a list of lists of word ids.
    """
    text_ids = []
    text_labels = []
    valid_example_list = []
    example_count = 0
    use_token_list_list = token_list_list[:num_examples]
    for i, token_list in enumerate(use_token_list_list):
        if i < num_examples:
            try:
                example = []
                for token in list(token_list.numpy()):
                    decoded = token.decode('utf-8').replace('.', '')
                    try:
                        example.append(vocab_dict[decoded])
                    except:
                        # unknown words map to the last, all-zeros UNK row
                        example.append(43981)
                if len(example) >= MAX_SEQUENCE_LENGTH:
                    text_ids.append(example[:MAX_SEQUENCE_LENGTH])
                    text_labels.append(label_list[i])
                    if example_count % 5000 == 0:
                        print('Examples processed: ', example_count)
                    valid_example_list.append(i)
                    example_count += 1
                else:
                    pass
            except:
                pass

    print('Number of examples retained: ', example_count)
    return (np.array(text_ids), np.array(text_labels), valid_example_list)

Now we can create training and test data that can be fed into the models of interest.

In [16]: train_input_ids, train_input_labels, train_valid_example_list = sents_to_ids(train_tokens, train_labels_batch)
test_input_ids, test_input_labels, test_valid_example_list = sents_to_ids(test_tokens, test_labels_batch)

The variable 'train_valid_example_list' contains the list of chosen examples that we can use later for the construction of the BERT training and test sets.

In [17]: train_valid_example_list[:5]

Examples 3 and 4 were apparently shorter than our target length.

We will also create a reduced training dataset with only 1000 examples to study the effect of the dataset size.

In [18]: REDUCED_TRAINING_SIZE = 1000
train_input_ids_reduced = train_input_ids[:REDUCED_TRAINING_SIZE]
train_input_labels_reduced = train_input_labels[:REDUCED_TRAINING_SIZE]

Let's convince ourselves that the data looks correct:

In [19]: train_input_ids[:2]

1.3.2 Training/Test Sets for BERT-based Models

We already imported the BERT model and the Tokenizer libraries. Now, we create the tokenizer:
In [20]: bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Since the Tokenizer of BERT is not a whitespace tokenizer, each sentence will almost certainly result in more BERT tokens than whitespace tokens.
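As a quick illustration (a sketch with an arbitrary example sentence, not part of the assignment): WordPiece typically yields more tokens than whitespace splitting, because rarer words are broken into '##'-prefixed subword pieces.

sentence = "An absolutely unforgettable masterpiece"
print(len(sentence.split()), 'whitespace tokens')                  # 4
print(len(bert_tokenizer.tokenize(sentence)), 'WordPiece tokens')  # typically more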
In [21]: #@title Limit BERT data to the set used with word2vec
all_train_examples = [x.decode('utf-8') for x in train_examples_batch
all_test_examples = [x.decode('utf-8') for x in test_examples_batch
bert_valid_train_examples_text = []
bert_valid_train_examples_labels = []
bert_valid_test_examples_text = []
bert_valid_test_examples_labels = []
for valid_example in train_valid_example_list:
bert_valid_train_examples_text.append(all_train_examples[valid_example
bert_valid_train_examples_labels.append(train_labels_batch[valid_example
for valid_example in test_valid_example_list:
bert_valid_test_examples_text.append(all_test_examples[valid_example
bert_valid_test_examples_labels.append(test_labels_batch[valid_example
In [22]: #@title BERT Tokenization of training and test data
num_train_examples = 2500000
num_test_examples = 500000
max_length = MAX_SEQUENCE_LENGTH
x_train = bert_tokenizer(bert_valid_train_examples_text[:num_train_examples],
                         max_length=max_length,
                         truncation=True,
                         padding='max_length',
                         return_tensors='tf')
y_train = bert_valid_train_examples_labels[:num_train_examples]
x_test = bert_tokenizer(bert_valid_test_examples_text[:num_test_examples],
                        max_length=max_length,
                        truncation=True,
                        padding='max_length',
                        return_tensors='tf')
y_test = bert_valid_test_examples_labels[:num_test_examples]

def select_min_length_examples(x_data, y_data):
    x_input_ids = []
    y_labels = []
    # the second zip argument was cut off in the source; the attention mask is
    # the natural companion to 'input_ids' here: if the last mask position is 1,
    # the example was not padded, i.e., it has the full max_length tokens.
    for ((input_ids, masks), label) in zip(zip(x_data['input_ids'], x_data['attention_mask']), y_data):
        if masks[-1] == 1:
            x_input_ids.append(input_ids)
            y_labels.append(label)
    return np.array(x_input_ids), np.array(y_labels)

Next, we will simplify our lives for the purpose of the bulk of the assignment. We know that 1) all inputs have at least MAX_SEQUENCE_LENGTH tokens, and 2) the input has one segment, not two. Therefore, BERT will produce consistent results if we only use the 'input_ids'.

Let us create the corresponding data sets:

In [23]: bert_train_input_ids, bert_train_labels = select_min_length_examples(x_train, y_train)
bert_test_input_ids, bert_test_labels = select_min_length_examples(x_test, y_test)

How many training examples do we have?

In [24]: bert_train_input_ids.shape

Great. Looks like the same size training set that we used for the word2vec-based models.

We also want to create again a reduced set of size REDUCED_TRAINING_SIZE:

In [25]: bert_train_input_ids_reduced = bert_train_input_ids[:REDUCED_TRAINING_SIZE]
bert_train_labels_reduced = bert_train_labels[:REDUCED_TRAINING_SIZE]

Overall, here are the key variables and sets that we created and that may be used moving forward. If the variable naming does not make it obvious, we also state the purpose:
Parameters:

MAX_SEQUENCE_LENGTH (100)
REDUCED_TRAINING_SIZE (1000)

Word2vec-based models:

train(/test)_input_ids: input ids for the training(/test) sets for word2vec models
train(/test)_input_labels: the corresponding labels
train(/test)_input_ids_reduced: input ids for the reduced training(/test) sets for word2vec models
train(/test)_input_labels_reduced: the corresponding labels for the reduced set

BERT:

bert_train(/test)_input_ids: input ids for the training(/test) sets for BERT models
bert_train(/test)_labels: the corresponding labels for BERT
bert_train(/test)_input_ids_reduced: input ids for the reduced training(/test) sets for BERT models
bert_train(/test)_labels_reduced: the corresponding labels for the reduced set for BERT

NOTE: We recommend inspecting these variables if you have not gone through the code.
2. Classification with various Word2Vec-based Models
QUESTION:
2.a. Revisit the dataset. Is it balanced? Find the ratio of positive examples for the training set.

2.b. Find the ratio of positive examples for the test set.
In [26]: ### YOUR CODE HERE
### END YOUR CODE
In [27]: ### YOUR CODE HERE
### END YOUR CODE

2.1 The Role of Shuffling of the Training Set

We will first revisit the DAN model.

1. Reuse the code from the class notebook to build a DAN network with one hidden layer of dimension 100. The optimizer should be Adam. Wrap the model creation in a function according to this API:
In [28]: def create_dan_model(retrain_embeddings=False,
                     max_sequence_length=MAX_SEQUENCE_LENGTH,
                     hidden_dim=100,
                     dropout=0.3,
                     embedding_initializer='word2vec',
                     learning_rate=0.001):
    """
    Construct the DAN model, including compilation, and return it. Parametrize it using the arguments.
    :param retrain_embeddings: boolean indicating whether the word embeddings are trainable
    :param hidden_dim: dimension of the hidden layer
    :param dropout: dropout applied to the hidden layer
    :returns: the compiled model
    """
    if embedding_initializer == 'word2vec':
        embeddings_initializer = tf.keras.initializers.Constant(embedding_matrix)
    else:
        embeddings_initializer = 'uniform'

    ### YOUR CODE HERE
    # start by creating the dan_embedding_layer. Use the embeddings_initializer variable defined above.

    ### END YOUR CODE
    return dan_model

Let us create a sorted dataset to run our simulations:

In [29]: sorted_train_input_data = [(x, y) for (x, y) in zip(list(train_input_ids), list(train_input_labels))]
sorted_train_input_data.sort(key=lambda x: x[1])
sorted_training_input_ids = np.array([x[0] for x in sorted_train_input_data])
sorted_training_labels = np.array([x[1] for x in sorted_train_input_data])
Next, try to create your DAN model using the default parameters and train it by:

1. Using the sorted dataset
2. Using 'shuffle=False' as one of the model.fit parameters.

Make sure you store the history (name it 'dan_sorted_history') as we did in the lesson notebooks.

In [30]: ### YOUR CODE HERE
dan_model_sorted = create_dan_model()
# use dan_sorted_history = ... below

### END YOUR CODE

QUESTION:

2.1.a Which number (in percent) is closest to the highest validation accuracy that you observed?

35
50
65

Next, recreate the same model and train with 'shuffle=True'. (Note that this is also the default.) Use 'dan_shuffled_history' for the history.

In [31]: ### YOUR CODE HERE
dan_model_shuffled = create_dan_model()
# use dan_shuffled_history = ... below

### END YOUR CODE

QUESTION:

2.1.b Which number (in percent) is closest to the highest validation accuracy that you observed for the shuffled run?
6/5/22, 10:47 PM2022-summer-assignment-marciayyl/Text_classification.ipynb at a2-submit · datasci-w266/2022-summer-assignment-marciayyl
Page 14 of 25https://github.com/datasci-w266/2022-summer-assignment-marciayyl/blob/a2-submit/assignment/a2/Text_classification.ipynb
65
75
80
85
Compare the 2 histories in a plot.
2.2 DAN vs Weighted Averaging Models using Attention
2.2.1. Warm-Up: Manual Attention Calculation
QUESTION:
2.2.1.a Calculate the context vector for the following query and key/value
vectors. You can do this manually, or you can use
tf.keras.layers.Attention()
2.2.1.b What are the weights for the key/value vectors?
2.2.2 The 'WAN' Model
In [32]: fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(left=0.2, wspace=0.6)
make_plot(axs,
dan_sorted_history,
dan_suffled_history,
model_1_name='sorted',
model_2_name='shuffled',
y_lim_accuracy_lower=0.40,
y_lim_accuracy_upper=0.82)
fig.align_ylabels(axs[:, 1])
fig.set_size_inches(18.5, 10.5)
plt.show()
In [33]: q = [1, 2., 1]
k1 = v1 = [-1, -1, 3.]
k2 = v2 = [1, 2, -5.]
In [34]: ### YOUR CODE HERE
### END YOUR CODE
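For reference, here is one way (a sketch, not the required solution) to push these vectors through tf.keras.layers.Attention; the layer expects (batch, time, dim)-shaped tensors, and the key defaults to the value when omitted:

q_t = tf.constant([[[1., 2., 1.]]])           # shape (1, 1, 3): one query
kv_t = tf.constant([[[-1., -1., 3.],
                     [1., 2., -5.]]])         # shape (1, 2, 3): two key/value vectors
context, weights = tf.keras.layers.Attention()([q_t, kv_t],
                                               return_attention_scores=True)
print(context.numpy())   # the context vector
print(weights.numpy())   # the attention weights over k1/k2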
2.2.2 The 'WAN' Model
Next, we would like to improve our DAN by attempting to train a neural net that learns to put more weight on some words than others. How could we do that? Attention is the answer!

Here, we will build a model that you could call a "Weighted Averaging Model using Attention" (WAN). You should construct a network that uses attention to weigh the input tokens for a given example.

The core structure is the same as for the DAN network, but there are some critical changes:

1) How do I create a learnable query vector for the attention calculation that is supposed to generate the suitable token probabilities? And what is its size?
2) What are the key vectors for the attention calculation?
3) How does the averaging change?

First, the key vectors should be the incoming word vectors.

The query vector needs to have the size of the word vectors, as it needs to attend to them. A good way to create the query vector is to generate an embedding for it that you then make trainable:
wan_query_embedding_layer = Embedding(1,
                                      embedding_matrix.shape[1],
                                      input_length=1,
                                      trainable=True)

would create an embedding of the proper size for 1 vector.
That sounds great... but how do I use this embedding to have a vector available in my calculation? And how do I make this vector available to all examples in the batch?

What you can use is a 'fake input-like layer' that creates a '0' for each incoming batch example, to which the query embedding layer can then be applied. Assuming that the input layer for your network is wan_input_layer, this could be done with

wan_zero_input = tf.cast((wan_input_layer[:, :1] < -1), 'int64')

You could then have the query vector available for each example through:

wan_query_vector = wan_query_embedding_layer(wan_zero_input)

You will see that this structure is essentially the same as what we did for word vectors, except that we had to replace the input layer with our fake layer, as there is no actual input. We will also have 2 outputs (discussed in a bit).
How does the averaging change? You should use tf.keras.layers.Attention() and make sure you consider the proper inputs and outputs for that calculation.
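One possible call pattern (a sketch; wan_query_vector and wan_word_vectors stand for the hypothetical tensors discussed above, and the key defaults to the value when omitted):

wan_context, wan_attention_weights = tf.keras.layers.Attention()(
    [wan_query_vector, wan_word_vectors],    # [query, value]
    return_attention_scores=True)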
So why 2 outputs, and how do we do that? First off, we need the output that makes the classification, as always. What is the second output? We would also like our model to provide us with the attention weights it calculated. This will tell us how much each word was considered for the context creation.

How can we implement 2 outputs? You need to pass a list of the two outputs. But note that you will also want a list of 2 loss functions and 2 metrics. You can use 'None' both times to account for our new second output, and you can ignore the corresponding values that the model reports. (In general, the total loss will be a sum of the individual losses, so one would rather construct a loss that always returns zero for the second output; but as it is very small we can ignore this here.)
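Putting the two-output idea into a sketch (the layer names are hypothetical; 'None' skips the loss and metric for the attention-weights output, as described above). Note that make_plot looks for a 'classification_accuracy' key, which Keras produces when the first output layer is named 'classification':

wan_model = Model(inputs=wan_input_layer,
                  outputs=[wan_classification_output,   # e.g. a Dense layer named 'classification'
                           wan_attention_weights])
wan_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss=['binary_crossentropy', None],   # no loss on the weights output
                  metrics=[['accuracy'], None])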
In [35]: def create_wan_model(retrain_embeddings=False,
                     max_sequence_length=MAX_SEQUENCE_LENGTH,
                     hidden_dim=100,
                     dropout=0.3,
                     learning_rate=0.001):
    """
    Construct the WAN model, including compilation, and return it. Parametrize it using the arguments.
    :param retrain_embeddings: boolean indicating whether the word embeddings are trainable
    :param hidden_dim: dimension of the hidden layer
    :param dropout: dropout applied to the hidden layer
    :returns: the compiled model
    """
    ### YOUR CODE HERE

    ### END YOUR CODE

    return wan_model

Now run the model on the same dataset as we did for the DAN model and save its history as 'wan_history':

In [36]: ### YOUR CODE HERE
wan_model = create_wan_model()
# use wan_history = ... below

### END YOUR CODE

QUESTION:

2.2.2.a Which number (in percent) is closest to the highest validation accuracy that you observed for the WAN training?

78
81
84

Now compare the results of the initial dan_model training and the wan_model training:
In [37]: fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(left=0.2, wspace=0.6)
make_plot(axs,
          dan_shuffled_history,
          wan_history,
          model_1_name='dan',
          model_2_name='wan',
          y_lim_accuracy_lower=0.70,
          y_lim_accuracy_upper=0.82)
fig.align_ylabels(axs[:, 1])
fig.set_size_inches(18.5, 10.5)
plt.show()

Next, let us see which words mattered for the wan_model's classification and which ones mattered less. How can we tell? We can look at the attention weights!

Consider the first training example:
In [38]: train_examples_batch[0].numpy().decode('utf-8')
The corresponding input ids, suitably formatted (i.e., with max length 100), are these:

In [39]: probe_input_ids = train_input_ids[0]
probe_input_ids
and the first 10 corresponding tokens are:

In [40]: probe_tokens = [x.decode('utf-8') for x in train_tokens[0].numpy()][:MAX_SEQUENCE_LENGTH]
probe_tokens[:10]
Identify the 5 words with the highest impact and the 5 words with the lowest impact on the score, i.e., identify the 5 words with the largest and smallest weights, respectively. (Note that multiple occurrences of the same word count separately for this exercise.)

HINT: You should create a list of (word, weight) pairs, and then sort by the second element. Python's '.sort()' function may come in handy.

In [41]: ### YOUR CODE HERE
# 'pairs' should be the variable that holds the token/weight pairs.

### END YOUR CODE
print('most important tokens:')
print('\t', pairs[:10])
print('\nleast important tokens:')
print('\t', pairs[-10:])
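For the sorting step, a minimal sketch (assuming a hypothetical 'probe_weights' array holding the 100 attention weights the model returned for this example):

pairs = list(zip(probe_tokens, probe_weights))
pairs.sort(key=lambda p: p[1], reverse=True)   # largest weight first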
QUESTION:

2.2.2.b List the 5 most important words, with the most important first. (Again, if a word appears twice, note it twice.)

2.2.2.c List the 5 least important words, with the most important of those first. (Again, if a word appears twice, note it twice.)

2.3 Approaches for Training of Embeddings

Rerun the DAN model in 3 configurations:

1. embedding_initializer = 'word2vec' and retrain_embeddings=False
2. embedding_initializer = 'word2vec' and retrain_embeddings=True
3. embedding_initializer = 'uniform' and retrain_embeddings=True

NOTE: Train the models with static embeddings for 10 epochs and the ones with trainable embeddings for 3 epochs.

What do you observe?

In [42]: ### YOUR CODE HERE

### END YOUR CODE

QUESTION:

2.3.a Which number (in percent) is closest to the highest validation accuracy that you observed for the static model?

70
73
76
In [43]: ### YOUR CODE HERE

### END YOUR CODE

QUESTION:

2.3.b Which number (in percent) is closest to the highest validation accuracy that you observed for the model where you initialized with word2vec vectors but allowed them to retrain?

77
81
84

In [44]: ### YOUR CODE HERE

### END YOUR CODE

QUESTION:

2.3.c Which number (in percent) is closest to the highest validation accuracy that you observed for the model where you initialized randomly and then trained?

75
78
80
83

3. BERT-based Classification Models

Now we turn to classification with BERT. We will perform classifications with various models that are based on pre-trained BERT models.
3.1 Basics

Let us first explore some basics of BERT.

You first need to define the tokenizer and the model for the pre-trained configuration you need.
We already got the tokenizer during setup with:

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Now, we also need to get the model:

In [45]: bert_model = TFBertModel.from_pretrained('bert-base-cased')

(Ignore the warnings.)

Next, consider this input:

In [46]: test_input = ['this bank is closed on Sunday', 'the steepest bank of the river is dangerous']

Now apply the tokenizer to tokenize it:

In [47]: tokenized_input = bert_tokenizer(test_input,
                                 max_length=12,
                                 truncation=True,
                                 padding='max_length',
                                 return_tensors='tf')
tokenized_input

QUESTION:

3.1.a Why do the attention_masks have 4 and 1 zeros, respectively?

For the first example the last four tokens belong to a different segment. For the second one it is only the last token.
For the first example 5 positions are padded while for the second one it is only one.
Next, let us look at the BERT outputs for these 2 sentences:

In [48]: ### YOUR CODE HERE
# bert_output = ...

### END YOUR CODE

QUESTION:

3.1.b How many outputs are there?

3.1.c Which output do we need to use to get token-level embeddings?

the first
the second

3.1.d Which token number corresponds to 'bank' in the first sentence? ('bert_tokenizer.tokenize()' may come in handy... and don't forget the CLS token!)

3.1.e Which token number corresponds to 'bank' in the second sentence?

3.1.f What is the cosine similarity between the BERT outputs for the two occurrences of 'bank' in the two sentences?

3.1.g How does this relate to the cosine similarity of 'this' (sentence 1) and 'the' (sentence 2)? Compute the cosine similarity.

Enter your code below.
In [49]: ### YOUR CODE HERE
#1. -> print it out
#2. -> answer in answer file
#3. -> Look at tokenization
#4. -> Look at tokenization
#5. -> get the vectors and calculate cosine similarity
#6. -> get the vectors and calculate cosine similarity
### END YOUR CODE
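A generic cosine-similarity helper you could use for 3.1.f and 3.1.g (a sketch; any equivalent approach is fine):

def cosine_similarity(a, b):
    # flatten to 1-D vectors and apply cos(a, b) = a.b / (|a| |b|)
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))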
3.2 CLS-Token-based Classification

In the live session we discussed classification with BERT using the pooled token. We will now do the same, but extract the [CLS] token output for each example and use that for classification purposes.

Consult the model from the live session and change it accordingly.

HINT: You will want to extract the output of the [CLS] token from the BERT output, similarly to what we did above to get the output for 'bank', etc.

In [50]: def create_bert_cls_model(hidden_size=100,
                          dropout=0.3,
                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the [CLS] token output for classification purposes.
    """
    ### YOUR CODE HERE

    ### END YOUR CODE

    return classification_model
Now create the model and run it for 2 epochs. Use batch size 8 and the appropriate validation/test set. (We don't make a distinction here.)

In [51]: ### YOUR CODE HERE

### END YOUR CODE

QUESTION:

3.2.a Which number (in percent) is closest to the highest validation accuracy that you observed for the [CLS]-classification model?

78
80
82
84
86

3.3 Classification by Averaging the BERT outputs

Instead of using the [CLS] token, we will now average all of the output positions that correspond to actual tokens, i.e., we ignore the [CLS] and [SEP] tokens. Where are they? First and last, for us.

HINT: You will want to extract all of the relevant tokens and then apply an average across the tokens. You may want to use tf.math.reduce_mean(), but you can also do it in other ways.
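For instance, a minimal sketch (assuming a hypothetical 'sequence_output' tensor of shape (batch, MAX_SEQUENCE_LENGTH, 768) holding BERT's token-level outputs):

# drop position 0 ([CLS]) and the last position ([SEP]) before averaging
avg_embedding = tf.math.reduce_mean(sequence_output[:, 1:-1, :], axis=1)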
In [52]: def create_bert_avg_model(hidden_size=100,
                          dropout=0.3,
                          learning_rate=0.00005):
    """
    Build a simple classification model with BERT. Use the average of the BERT output tokens.
    """
    ### YOUR CODE HERE