A Deep Learning Based Method for Identifying
Protein Carbamylation Sites
Student Name: Hao Man
Supervisor Names: Prof. Martin Cann and Dr. Matteo Degiacomi
Abstract—Carbamylation is a lysine-based post-translational modification (PTM) that plays a critical role in various biological
mechanisms, such as CO2 sensing. It has long been understudied due to the difficulty of detecting carbamate sites in proteins: their
unstable nature has rendered conventional experimental approaches ineffective. While computational methods have been regarded
as an alternative, the lack of training data has limited the development of tools based on popular computational frameworks. Recent
advancements in experimental methodologies have uncovered a number of solvent accessible carbamate sites, and these new
discoveries have made the application of computational models possible. The present work presents a curated dataset composed of
newly-identified carbamate sites. Features were extracted from residue 3-d microenvironments, along with two biologically important
measurements, pKa and solvent accessible surface area (SASA). A deep learning model, a neural network built around a
Self-Attention based architecture, was then trained. On an independent test set, the network achieved a 78.8% Sensitivity, a 70.6%
Specificity, and a 75.0% accuracy. To evaluate the prevalence of Carbamylation in human and cyanobacteria proteins, large-scale
model inferences were carried out. Results revealed that nearly 30% of the lysine residues in human and cyanobacteria proteins were
predicted to be modifiable.
Index Terms—bioinformatics, post-translational modification, carbamylation, deep learning

1 INTRODUCTION
Proteins are chains of amino acids that fold into a specific 3-d structure based on their sequences (Branden
and Tooze 2012). Post-translational modifications (PTMs)
are local modifications of protein structures after their
biosynthesis, in which modifying groups are attached to
amino acid side chains (Ramazi and Zahiri 2021). PTMs
are important because by slightly altering the structure of
a protein, they can modulate its physicochemical properties
and functions (K.-Y. Huang et al. 2019). As one of the most
biologically important PTMs (Jimenez-Morales et al. 2014),
Carbamylation is a modification that involves the covalent
binding of carbon dioxide (CO2) to the lysine side chain
(Blake and Cann 2022). It is recognized as the key to “regu-
lating oxygen-binding in haemoglobin and the activation of
the CO2-fixing enzyme RuBisCO” (Blake and Cann 2022).
The study of this PTM facilitates the understanding of the
biochemical mechanisms by which proteins interact with
CO2 (King et al. 2022).
In order for biologists to properly study such a mod-
ification, as many carbamate sites as possible should be
determined. However, this PTM has been experimentally identified on only a limited number of proteins (King et al.
2022). Carbamylation remains understudied because of the “lability and ready reversibility” of
the modification (Linthwaite and Cann 2021). The lack of ex-
perimental hits partly stems from the spontaneous decom-
position of the carbamylated lysine residues which makes
the identification difficult (King et al. 2022). As pointed
out by Linthwaite and Cann (2021), traditional experimental
methods for identifying PTMs are too aggressive to capture
the labile carbamates. However, in the Protein Data
Bank (PDB), a number of protein structures that contain
carbamylated lysines can be found. These modification sites
are mostly special cases: the vast majority are metal
co-ordinated, with adjacent metal ions in the structures.
In other words, these lysines are the “buried”
ones. Nevertheless, they are the only cases that can be
relatively easily observed at atomic resolution.
Computational methods can potentially make an impor-
tant contribution to the field (Jimenez-Morales et al. 2014).
However, despite the advantages of developing models that
are faster than experimental approaches and more suitable for
large-scale applications, there remains a paucity of
available tools in the literature. The major reason for the lack
of proposed models is simply the scarcity of data, as popular
frameworks in Machine Learning or Deep Learning require
large amounts of training data.
The pivotal study by Linthwaite, Janus, et al. (2018) first
proposed an experimental methodology for identifying
carbamates that are solvent accessible (i.e. not buried);
before that, systematic determination was not
possible. Since then, more data has been experimentally
uncovered. These discoveries, combined with a number of
newly-identified sites in the recent work of King et al. 2022
have formed the basis of the application of computational
methods. In light of the importance of the determination of
this PTM and the accumulation of new data, the present
work attempts to enrich the identification tools in the liter-
ature by proposing a computational method that leverages
Deep Learning.
PTM site identification can be viewed as a Deep Learning
binary classification task in which a computational model
is trained using known carbamate sites (Ground Truth) to
classify a given lysine residue into two classes, i.e. mod-
ifiable or not modifiable. It is important to note that it is
the 3-d structures of proteins that determine their biological functions (Senior et al. 2020). Therefore, a protein 3-d
structure (see Figure 1), specifically the microenvironment
of a lysine residue, is the ideal source of information upon
which a prediction can be made. Unlike past models in the
literature that focused on features in protein sequences, the
model architecture proposed by the present work is the first
one that can exploit protein 3-d structure based features to
a great extent.
Fig. 1: 3-d structure of PDB 2KQK in the cartoon representation
with lysine 124 highlighted.
2 RELATED WORK
The application of computational methods such as Machine Learning or Deep Learning in predicting protein
PTM sites is by no means novel. For simplification, a car-
bamylated site is called a positive and an uncarbamylated
one is called a negative. G. Huang et al. 2013 proposed a
Machine Learning model, one-class K Nearest Neighbour
(KNN) for predicting carbamate sites with protein primary
and secondary structure based features. The model was
trained by 40 positive sites gathered from the Uniprot
database with 70% sequence identity reduction to learn a
discriminative boundary enclosing the positive class; any
new sample that falls inside of the boundary in the feature
space would be classified as a positive. A Leave-One-Out
Cross Validation (LOOCV) was carried out, and the model
obtained an 82.5% Sensitivity (Sn) and a 96.37% Specificity
(Sp). On the hold-out test set, the model achieved a 66.67%
Sn and a 100% Sp. However, the number of positives in this
set was only 6.
Jimenez-Morales et al. 2014 presented the first model that
used features extracted from protein 3-d structures. From
the Protein Data Bank (PDB), they initially gathered 251 pos-
itive sites from proteins with at least one subunit containing
a carboxylated lysine residue and 4259 negative sites which
were the remaining lysine residues from the same proteins.
Two sub-sets with redundancy reduction of the proteins
were constructed using 40% and 90% sequence identity cut-
offs. They designed the feature vector that was composed
of occurrence numbers of amino acids, water molecules,
and metal ions within a 5 Å Euclidean distance of a lysine
residue. Three Naïve Bayes classifiers were trained and
tested using the three datasets. They achieved high per-
formance in the LOOCV, with an 83.72% Sn and a 99.58%
Sp using the set with 90% sequence identity reduction;
performance using the other two sets was even higher. Yet,
as reported by the authors, most of the carbamylated lysines
they obtained were buried metal-stabilized ones. These sites
are not biologically important in some cases, such as in
the study of protein-CO2 binding. Also, as suggested by
Linthwaite and Cann 2021, models trained with mostly
buried carbamates may only capture the characteristics of
a subset of the possible carbamate sites.
Ning et al. 2021 presented another Machine Learning
based method called pQLyCar that used protein sequence
based features. They constructed a dataset which included
723 positive sites and 1197 negative sites with 60% se-
quence identity reduction from Uniprot, PLMD and dbPTM
databases. The idea behind pQLyCar is similar to the stan-
dard two-class KNN: when predicting for a query sam-
ple, the K “nearest” samples are selected to train a Support
Vector Machine (SVM) classifier, which then classifies the query.
The selection of the K samples is based on their Euclidean
distances to the query sample in the feature space. They
also obtained high performance in the LOOCV, characterized by
a 96.49% Sn and a 99.59% Sp, but no independent test set was
used. While their dataset was much bigger in terms of the
size of the positive class, a spreadsheet documenting the
exact sites they used is unavailable. Therefore, the amount
of buried lysines in the set is unclear.
A limitation of these past studies is that they did not
have a benchmark dataset to use. Hence, it is not possible
to directly compare the performance of different predictors
when they were trained and tested with drastically dif-
ferent data. Also, due to the lack of effective experimen-
tal methodologies, previous studies tended to use buried
carbamylated lysines, which could potentially limit their
models’ generalizability. Moreover, far too little attention
was paid to the data imbalance issue. In fact, none of the
past works proposed a suitable strategy to directly deal with
this problem. Furthermore, no previous study discussed
the possible existence of false-positives or false-negatives
in their datasets, which may be owing to the fact that
they mostly used buried carbamate sites that were
clearly observed at atomic resolution. Nevertheless, such
consideration is necessary when the new data is applied.
Finally, a potentially relevant discussion about high model
performance can be found in Section 7.3.
While the existing predictors for identifying Carbamy-
lation sites are all Machine Learning based, many Deep
Learning models have been proposed in predicting other
PTMs. For instance, to detect Succinylation sites, H. Wang
et al. 2021 trained a neural network that applied both
convolutional layers and Attention Mechanisms to make
inferences based on protein sequences, physicochemical
properties, and structures; Ahmed et al. 2021 attempted to
use an LSTM Recurrent Neural Network (RNN) to process
protein sequences to determine Phosphorylation sites. These
Deep Learning methods are characterized by their ability to learn
features directly from the inputs, instead of relying on features
manually extracted by researchers. However, few of these
studies have proposed a network architecture that is suitable
for protein 3-d structure based inputs.
The present work contributes to this area of research by:
• curating a benchmark dataset with new solvent ac-
cessible carbamate sites;
• exploring a neural network architecture that is ideal
for 3-d structure based inputs;
• demonstrating the effectiveness of a method for deal-
ing with data imbalance;
• discussing a possible solution to transform the un-
certainty in the Ground Truth labels to additional
knowledge and integrate it into the network training;
• presenting a trained neural network and its perfor-
mance.
3 METHOD
3.1 Preparation of the Data
The carbamate sites used in the present work come from three sources:
• unpublished experimental hits identified by the re-
search group led by Prof. Martin Cann at Durham
University;
• high confidence Lys-CO2 sites from Supplementary
Table 4 published by King et al. 2022;
• candidate Lys-CO2 sites from Supplementary Data 1
published by King et al. 2022.
It should be pointed out that in the work of King et al.
2022, a statistical test (a one-sample, one-tailed Student's t-test)
was used to partially determine the confidence level of a
candidate site. The high confidence ones were determined
by various other criteria in addition to the outcome of
the t-test. To maximize the true-positive rate, the selection
among the candidate sites was restricted to those
with a p-value less than 0.05. The negatives were
also gathered from the Supplementary Data 1 supplied by
King et al. 2022. To refine the selection, the same criterion was
followed: only lysines with a p-value less
than 0.05, located within a peptide chain that did not contain a
candidate carbamate site, were considered. In total, 202 (with
duplicates) carbamate and 1881 non-carbamate sites could
be gathered from the sources.
The availability of the 3-d structures of these sites is
critical since the structure based feature is one of the focuses
of this study. All these proteins have structures predicted by
a computational model—AlphaFold (Jumper et al. 2021) and
stored in the AlphaFold Database. A number of them also
have one or more structures archived in the Protein Data
Bank (Berman et al. 2000). These structures were observed
using experimental methods, such as X-ray crystallography,
NMR spectroscopy, and electron microscopy. For the struc-
tures found in the PDB, alternate conformations (usually
from NMR ensemble) and alternate side chain rotamers
for each lysine were all considered useful. The reason for
extracting and including multiple structural representatives
is that proteins are flexible in terms of their structures, and
including multiple structural arrangements could capture
this flexibility. Here, a benchmark dataset can be constructed
using these structures (see Table 1).
Data Imbalance—one remaining problem is how to deal
with the non-negligible size gap between the positive and
negative sites. Without a doubt, many more
structures from the PDB and AlphaFold can be gathered
for the negative class, resulting in an imbalanced dataset.
As pointed out by Dou et al. 2021, data imbalance is a
natural difficulty in PTM identification tasks that apply
computational methods, and it usually results in a pre-
diction bias towards the majority class. Methods for dealing
with imbalanced data have long been proposed by the
Machine Learning community (Chawla et al. 2002). In a
data-level method, the minority class can be up-sampled
using duplicates or synthetic samples generated by certain
algorithms, such as SMOTE (Chawla et al. 2002), and the
majority class can be down-sampled by random selection.
One preliminary solution is to down-sample the negative
class, but not randomly. Specifically, for the negative class,
only the structures predicted by AlphaFold were included
into the benchmark set. Although the flexibility in the struc-
tures was not captured, the diversity of the negative class
was still preserved. On the other hand, multiple structural
arrangements for each positive lysine were all included into
the set. In other words, in the benchmark set each positive
lysine may be represented by several structures (an n-to-1
relationship) while each negative lysine is represented by
exactly one (1-to-1). In total, the balanced benchmark set consists
of 1376 positive samples and 1342 negative samples. The
detailed composition of the set is displayed in Table 1. The
training set was randomly selected from the benchmark
set, containing 1107 positive and 1107 negative samples.
The remaining 269 positive and 235 negative samples
were treated as the test set. However, such a training set
should be treated with caution due to the complexity in its
composition, as will be discussed in 3.4.
3.2 Feature Extraction
3.2.1 3-d Structure based Feature
The microenvironment of a lysine residue can be characterized by the amino acids surrounding it within a cut-
off distance. This distance is the Euclidean distance between
the terminal zeta (NZ) position of the target lysine and the
alpha carbon (Cα) of a surrounding amino acid. Empirically,
the cut-off distance of 12 Å was selected though 20 Å was
also considered. Inspired by the inputs of language models,
such as a Transformer (Vaswani et al. 2017), a collection of
amino acids can be mathematically described as a set of
vectors with each vector representing one amino acid, the
same way researchers represent a sequence of words in a
sentence. The amino acid vector can be initially constructed
using a common technique, One-Hot Encoding. Specifically,
denoting $\varepsilon_j$ as a vector with a 1 at the $i$-th entry and 0
elsewhere, and given that there exist 20 common types of amino
acids, $\varepsilon_j$ is a vector in $\mathbb{R}^{20}$. Each entry of $\varepsilon_j$ corresponds
to one amino acid type, and a 1 at the $i$-th entry
indicates that the amino acid is of the $i$-th type. Here, the local environment
of a lysine with $k$ neighbouring amino acids within the cut-
off distance can be represented by the set $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k\}$.
Although the identity information has been encoded, the
equally important positional information of amino acids
in the environment was still missing. To capture it,
a positional encoding method was used. This method has
been empirically tested and proved effective in the domain
of Scene Representation recently by Mildenhall et al. 2020.
According to their implementation (Mildenhall et al. 2020), γ is defined here as a mapping from $\mathbb{R}$ to $\mathbb{R}^{2L}$, where p is one coordinate value and L is a hyperparameter. Formally, the encoding is:

$$\gamma(p) = \big(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\big). \quad (1)$$

                                                                         Positive            Negative
                                                                      Lysine  Structure   Lysine  Structure
unpublished experimental hits by Prof. Martin Cann's group               37      451          3       3
supplementary table 4 (King et al. 2022)                                 24       82          -       -
supplementary data 1 [E.coli (hCit)] (King et al. 2022)                 115      869          -       -
supplementary data 1 [Synecho (hCit)] (King et al. 2022)                 26       44          -       -
supplementary data 1 [E.coli (all) & Synecho (all)] (King et al. 2022)    -        -       1878    1878
Total                                                                   202     1446       1881    1881
Total with reduction                                                    161     1376       1342    1342

TABLE 1: The composition of the benchmark dataset. The set is composed of structures. For each carbamate lysine, multiple structural representations can be included as samples in the set. The 3 unpublished negatives are residues 11, 27, and 29 of PDB 1UBQ. Among the positives, the reduction involves removing duplicated lysines across different sources and removing lysines that do not have a protein structure archived in the PDB and at the same time have a low confidence score in the structure predicted by AlphaFold. In total, 161 distinct positive lysines and 1376 positive samples (structures) were left. Among the negatives, lysines that have a low confidence score in the structure predicted by AlphaFold were removed. In total, 1342 distinct negative lysines and 1342 negative samples (structures) were left.
It should be noted that the origin of the Cartesian space
of a local environment was first set to the location of the
target lysine. Then, the positional encoding function was ap-
plied to each of the three coordinate values. The parameter
L was set to be 4. Accordingly, the positional information
of an amino acid can be represented by a vector in R24.
This vector was then concatenated with the 20-d one-hot
vector. At this point, the representation of an amino acid reflects
both identity and relative position information. However,
one may notice that the number of amino acid vectors
can vary across different structures. In Deep Learning, a
varying input shape is not an unsolvable problem: the recently
developed Self-Attention mechanism is the most
widely applied solution, as will be discussed (see 3.3).
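To make the feature construction concrete, the sketch below (a minimal NumPy illustration under the assumptions stated in this subsection, not the exact project code; the residue list, cut-off of 12 Å, and L = 4 follow the description above) builds the 44-dimensional amino acid vectors for one lysine microenvironment.

```python
import numpy as np

AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY",
               "HIS", "ILE", "LEU", "LYS", "MET", "PHE", "PRO", "SER",
               "THR", "TRP", "TYR", "VAL"]

def one_hot(residue_name):
    """20-d identity vector with a 1 at the entry of the residue type."""
    vec = np.zeros(len(AMINO_ACIDS))
    vec[AMINO_ACIDS.index(residue_name)] = 1.0
    return vec

def positional_encoding(p, L=4):
    """NeRF-style encoding of one coordinate value p into 2L numbers (Eq. 1)."""
    freqs = 2.0 ** np.arange(L) * np.pi            # 2^0 * pi ... 2^(L-1) * pi
    return np.concatenate([np.sin(freqs * p), np.cos(freqs * p)])

def environment_vectors(residues, lysine_nz, cutoff=12.0, L=4):
    """residues: list of (name, ca_xyz) tuples; lysine_nz: NZ coordinates.
    Returns an (N, 20 + 6L) array, one row per neighbouring amino acid."""
    lysine_nz = np.asarray(lysine_nz, dtype=float)
    rows = []
    for name, ca in residues:
        ca = np.asarray(ca, dtype=float)
        if np.linalg.norm(ca - lysine_nz) > cutoff:
            continue                                # outside the microenvironment
        rel = ca - lysine_nz                        # origin moved to the target lysine
        pos = np.concatenate([positional_encoding(c, L) for c in rel])   # 24-d
        rows.append(np.concatenate([one_hot(name), pos]))                # 44-d
    return np.stack(rows)
```

With L = 4 the positional part contributes 3 × 2L = 24 dimensions, so each amino acid vector has 20 + 24 = 44 entries, matching Figure 2.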
3.2.2 pKa and SASA
It is suggested that the pH of the local cellular environment plays a key role in regulating biological
processes (Kilambi and Gray 2012). Predicting pKa values to
determine ionization states is important for understanding
protein functions (Gokcan and Isayev 2022). Specifically, a low
pKa value that allows dissociation to a neutral amine,
and hence CO2 binding, is key to Carbamylation (Linthwaite
and Cann 2021). Hence, the residue pKa value was regarded
as an important feature in predicting carbamate sites. The
calculation of pKa was based on a widely used python
package called PROPKA (Olsson et al. 2011; Søndergaard
et al. 2011). It should also be pointed out that a new Deep
Learning based method for predicting pKa (Gokcan and
Isayev 2022) has been published shortly before the comple-
tion of the present work. Their method has been reported to
have better performance than PROPKA in some ways and
can be applied in future works.
SASA is short for Solvent Accessible Surface Area.
It is suggested that the amino acid side chain that can be
modified by PTMs tends to be more accessible on the surface
of a protein (Lu et al. 2011), in other words, has a higher
SASA. In an empirical study (Vandermarliere and Martens
2013), it was found that 90% of the residues that have
undergone the Phosphorylation modification are solvent ac-
cessible. Solvent accessibility has also been used as a feature
in a discussed past work (G. Huang et al. 2013). Accordingly,
besides pKa, SASA was considered to be another manually-
extracted feature. The value of SASA was computed by
a well-developed python package called Biobox (Rudden
et al. 2022) using the Shrake-Rupley algorithm.
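As an illustration only (this sketch uses Biopython's ShrakeRupley implementation rather than the Biobox call used in the project, and assumes a locally available PDB file; pKa would be computed separately with PROPKA), per-residue SASA for a lysine of interest can be obtained along the following lines.

```python
from Bio.PDB import PDBParser
from Bio.PDB.SASA import ShrakeRupley   # Shrake-Rupley algorithm (Biopython >= 1.79)

def lysine_sasa(pdb_path, chain_id, resseq):
    """Return the solvent accessible surface area of one residue (in square angstroms)."""
    structure = PDBParser(QUIET=True).get_structure("prot", pdb_path)
    sr = ShrakeRupley()
    sr.compute(structure[0], level="R")            # attach a .sasa value to every residue
    return structure[0][chain_id][resseq].sasa

# Hypothetical usage: SASA of lysine 124 in chain A of a local file 2kqk.pdb
# print(lysine_sasa("2kqk.pdb", "A", 124))
```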
However, for each structure, the corresponding pKa and
SASA are two scalars which are not compatible with a
set of amino acid vectors. The exact way in which pKa and
SASA are involved in the training and decision-making will be
discussed in 3.3.3.
3.3 Neural Network Architecture
Inspired by the functioning of human brains, a Neural Network (NN) is a mathematical model with multiple layers
of neurons linked together (Emmert-Streib et al. 2020). One
difficulty in developing an NN is dealing with inputs that
may have different shapes. In classical architectures such
as a Multilayer Perceptron (MLP), the set of parameters
(neuron connections) is predetermined and fixed during
the training. When the input changes in shape, for instance
from L-dimensional to (L+1)-dimensional, new connections
(parameters) between the information in the added dimen-
sion and the existing neurons in the first layer would be required,
which is not possible.
Vaswani et al. 2017 published the paper titled “Attention
is All you Need”, a work that has been recognized
as a milestone in language processing and, more recently,
computer vision. In simplified words, the Self-Attention Mechanism
takes a set of vectors as the input and outputs a set of vectors
with the same cardinality. Global Self-Attention also takes a
set of vectors as the input but only outputs one single vector.
Two things should be noted here: (1) vectors in a set do
not have ordering; (2) the cardinality of the set can change.
These good properties perfectly match the characteristics
of the 3-d structure based features and render the Self-
Attention block the foundation architecture of the proposed
neural network (see Figure 2).
Fig. 2: The detailed input pipeline and the network architecture. For the local environment of a lysine residue i, in total Ni amino
acid vectors are computed as the input of the network. Each vector’s first 20 dimensions represent identity information, and
the remaining 24 dimensions reflect the relative position of the amino acid. Passing through 3 blocks of Attention and Residual Feed
Forward, the Ni output vectors are then converted to one single vector by the Global Attention. The output of one hidden layer
in the following Linear ReLU Stack is then extracted and concatenated with pKa and SASA to form the input of an MLP which
is responsible for generating the final output logits. The Attention Based NN and the MLP are trained separately.
3.3.1 Additive Self-Attention
Assuming there are N amino acids in the local structure of a lysine residue, with each amino acid represented by a vector $\nu^j \in \mathbb{R}^h$, $j = 1, 2, \ldots, N$, the outcome of one Query vector $\nu_q^j$ after Self-Attention is dependent on the N Key vectors, which in this case is the whole set $\{\nu_k^1, \nu_k^2, \ldots, \nu_k^N\}$. The output $\tilde{\nu}_q^j$ can be computed by:

$$\tilde{\nu}_q^j = \sum_{i=1}^{N} \mathrm{softmax}\big(\alpha(\nu_q^j, \nu_k^i)\big)\,\nu_k^i, \qquad \text{in which } \alpha(\nu_q^j, \nu_k^i) = (\nu_q^j)^{T} \tanh\big(W_k \nu_k^i + W_q \nu_q^j\big). \quad (2)$$
Intuitively, what Self-Attention does to a Query is firstly
assessing the relationships (attention scores) between the
Query and all the Keys, then integrating the information
of the Keys based on these relationships (weighted sum-
mation) to the Query. Matrices Wk and Wq are trainable
parameters, and the softmaxed values are the so-called
attention scores of query-key pairs. When there is only one
Query attending to the Keys, this operation is defined as
the Global Attention. Naturally, an N by N (1 by N for
Global Attention) attention score matrix can be obtained
by storing these attention scores in a matrix. What makes
the attention matrix interesting is that it is unique for each
microenvironment. An example of an attention matrix will
be displayed (see Figure 7).
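A minimal PyTorch sketch of Equation (2) is given below (an illustrative reimplementation, not the project code; the module and parameter names are chosen here for readability). It maps a variable-length set of N amino acid vectors to N updated vectors and also returns the N-by-N attention matrix discussed above.

```python
import torch
import torch.nn as nn

class AdditiveSelfAttention(nn.Module):
    """Additive self-attention over a set of h-dimensional vectors (Eq. 2)."""
    def __init__(self, h):
        super().__init__()
        self.Wk = nn.Linear(h, h, bias=False)    # trainable W_k
        self.Wq = nn.Linear(h, h, bias=False)    # trainable W_q

    def forward(self, x):
        # x: (N, h) -- one row per amino acid in the microenvironment
        q = self.Wq(x)                            # (N, h)
        k = self.Wk(x)                            # (N, h)
        # scores[j, i] = x_j^T tanh(W_k x_i + W_q x_j)
        scores = torch.einsum("jh,jih->ji", x, torch.tanh(k[None, :, :] + q[:, None, :]))
        attn = torch.softmax(scores, dim=-1)      # (N, N) attention matrix
        return attn @ x, attn                     # weighted sum of Keys, plus the matrix

# Example: 28 neighbouring amino acids plus the lysine itself, 44-d vectors
env = torch.randn(29, 44)
out, attn_matrix = AdditiveSelfAttention(44)(env)  # out: (29, 44), attn_matrix: (29, 29)
```

Replacing the N Queries with a single learned query vector gives the 1-by-N Global Attention variant used to collapse the set into one output vector.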
3.3.2 Residual Feed Forward
Learning abstract representations usually requires a deep network architecture. Nevertheless, going deep
does not always render a better result. He et al. 2016 first
proposed a network framework called Deep Residual Net
that has been universally applied in deep network archi-
tectures. The problem they attempted to address is that
when building a really deep net, the optimization can be
so difficult that the generalizability after extensive training
can be worse than a shallow one. The solution He et al. 2016
presented is simply changing the original mapping F (x)
to F (x) + x for some hidden layers. Intuitively, the opti-
mization of the deep network in this case should be easier
because if it gets hard, this so-called residual connection
can make the deep net shallow. Specifically, the network
with residual connections can choose to push an actual
mapping of a deep layer F (x) to 0 and simply output x
as if the deep layer does not exist. In the present network,
the residual connections applied in the Residual block and
the Self-Attention block can be described as:
$$\text{Residual Block: } F(x) = \mathrm{ReLU}\big(W_2\,\mathrm{ReLU}(W_1 x) + x\big), \qquad \text{Self-Attention Block: } G(x) = \mathrm{LayerNorm}\big(\mathrm{Attention}(x) + x\big). \quad (3)$$
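The two blocks in Equation (3) can be written compactly in PyTorch as below (again an illustrative sketch with assumed layer sizes; the real network stacks three such Attention and Residual Feed Forward pairs, as in Figure 2).

```python
import torch
import torch.nn as nn

class ResidualFeedForward(nn.Module):
    """F(x) = ReLU(W2 ReLU(W1 x) + x) -- the Residual block of Eq. (3)."""
    def __init__(self, h):
        super().__init__()
        self.W1 = nn.Linear(h, h)
        self.W2 = nn.Linear(h, h)

    def forward(self, x):
        return torch.relu(self.W2(torch.relu(self.W1(x))) + x)

class AttentionBlock(nn.Module):
    """G(x) = LayerNorm(Attention(x) + x) -- the Self-Attention block of Eq. (3)."""
    def __init__(self, h, attention):
        super().__init__()
        self.attention = attention               # e.g. the AdditiveSelfAttention sketched above
        self.norm = nn.LayerNorm(h)

    def forward(self, x):
        attended, _ = self.attention(x)
        return self.norm(attended + x)
```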
3.3.3 Integrating pKa and SASA
Two methods can be used to integrate the two scalars into the Self-Attention based net: (1) after the Global
Attention, directly concatenate the two values to the output
LYS vector; (2) train a new MLP whose input is the concate-
nation of a learned representation extracted from a layer of
the final Linear ReLU Stack and the two values. Preliminary
results demonstrated that when placing pKa and SASA at
the very end of the Attention based net using method (1), the
final Linear ReLU Stack would converge quickly, leaving the
Attention Blocks poorly trained. Hence, the second method
is adopted as shown in Figure 2.
3.4 Weighted Random Sampling
Training a neural network is just a simplified expression for updating the parameters of the net by stochastic
gradient descent (SGD). The network gets updated every
time after seeing just a small batch of data. The composition
of every batch determines how well the net can be opti-
mized. The conventional way of preparing these batches is
by random shuffling and splitting of the dataset. However,
such method can be problematic due to the complexity
of the training data. The problem is that some positive
lysines have contributed too many samples (in the extreme
case, more than 200 for 1 lysine) in the training set while
others have contributed too few (sometimes just 1). One
possible bad outcome is that, after the shuffling, several
successive batches contain too many samples that belong to
just one or two positive lysines. In this case, it is inevitable
that the net will learn too much from these lysines, which
may drastically decrease its generalizability.
To cope with this issue, a technique called weighted ran-
dom sampling can be helpful. According to Efraimidis and
Spirakis 2006, the weighted random sampling algorithm can
be viewed as selecting m random items out of a population
of size n with the probability of each item being selected
determined by the item weight. In the context of the current
task, the goal is to find a set of weights Ω such that when
sampling one sample out of a population of size $N^+ + N^-$
(with replacement), in which $N^+$ and $N^-$ represent the
class sizes of the positives and the negatives in the training
set, two conditions should be satisfied. The first condition
addresses the equal likelihood of selecting a positive or a
negative; letting $\Theta^+$ and $\Theta^-$ represent the positive and negative
classes:

$$(a)\;\; P(x \in \Theta^+ \mid \Omega) = P(x \in \Theta^- \mid \Omega). \quad (4)$$
The second condition ensures the fairness when sampling
a positive sample in a batch. Assuming there are k distinct
lysines in the positive class and each lysine has its own set
$Lys_i$ such that $\sum_{i=1}^{k} |Lys_i| = N^+$, for $i \neq j$ and $i, j \in \{1, 2, \ldots, k\}$:

$$(b)\;\; P(x \in Lys_i \mid x \in \Theta^+, \Omega) = P(x \in Lys_j \mid x \in \Theta^+, \Omega). \quad (5)$$
It can be shown that the following construction of $\Omega$ can satisfy both conditions. For $i = 1, 2, \ldots, k$, denote $n_i = |Lys_i|$; then for $j = 1, 2, \ldots, (N^+ + N^-)$:

$$w_j = \begin{cases} 1, & \text{when sample } j \text{ is a negative,} \\[4pt] \dfrac{N^-}{n_i k}, & \text{when sample } j \text{ is a positive and belongs to } Lys_i. \end{cases} \quad (6)$$
Proof. Let $w_i^+$ and $w^-$ denote the weights assigned to a positive sample (belonging to $Lys_i$) and a negative sample, respectively. $P(x)$ is the unweighted probability of any one sample being selected, and $P(x \mid x \in \Theta^+)$ is the conditional probability of any one positive sample being selected:

$$(a)\;\; P(x \in \Theta^+ \mid \Omega) = \sum_{i=1}^{k} \sum_{m=1}^{n_i} w_i^+ \, P(x) = \sum_{i=1}^{k} \sum_{m=1}^{n_i} \frac{N^-}{n_i k} \cdot \frac{1}{N^+ + N^-} = \sum_{i=1}^{k} \frac{N^-}{k(N^+ + N^-)} = \frac{N^-}{N^+ + N^-},$$

$$P(x \in \Theta^- \mid \Omega) = \sum_{i=1}^{N^-} w^- \, P(x) = \sum_{i=1}^{N^-} 1 \cdot \frac{1}{N^+ + N^-} = \frac{N^-}{N^+ + N^-} = P(x \in \Theta^+ \mid \Omega).$$

To show the conditional probability is independent of $i$, for $i = 1, 2, \ldots, k$:

$$(b)\;\; P(x \in Lys_i \mid x \in \Theta^+, \Omega) = \sum_{m=1}^{n_i} w_i^+ \, P(x \mid x \in \Theta^+) = \sum_{m=1}^{n_i} \frac{N^-}{n_i k} \cdot \frac{1}{N^+} = \frac{N^-}{k N^+}.$$
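A sketch of how such weights could be fed to a sampler is shown below (an illustration using PyTorch's WeightedRandomSampler; the variable names and the way lysine identities are stored are assumptions, not the project code).

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset
from collections import Counter

def build_weights(labels, lysine_ids):
    """labels: list of 0/1 per sample; lysine_ids: distinct lysine id per sample (Eq. 6)."""
    pos_ids = [lid for lab, lid in zip(labels, lysine_ids) if lab == 1]
    counts = Counter(pos_ids)                 # n_i = number of structures per positive lysine
    k = len(counts)                           # number of distinct positive lysines
    n_neg = labels.count(0)                   # N-
    weights = [1.0 if lab == 0 else n_neg / (counts[lid] * k)
               for lab, lid in zip(labels, lysine_ids)]
    return torch.tensor(weights, dtype=torch.double)

# Hypothetical usage with a feature tensor X, labels y, and per-sample lysine ids
# weights = build_weights(y.tolist(), lysine_ids)
# sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(TensorDataset(X, torch.tensor(y)), batch_size=64, sampler=sampler)
```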
3.5 Loss Design
The Cross-Entropy (CE) Loss is widely used in binary classification tasks in Deep Learning. Denoting p as the predicted “probability” for the class with label y = 1, it can be described as:

$$CE(p, y) = \begin{cases} -\log(p), & \text{if } y = 1, \\ -\log(1 - p), & \text{otherwise.} \end{cases} \quad (7)$$
However, even when p is significantly greater than 0.5 and
the prediction is correct, in other words, the prediction is
easy, the loss $-\log(p)$ can still have “non-trivial magnitude” (Lin et al. 2017). In the present work, it is assumed that the
negatives should be easy to predict after some amount of
training because the negative set is more diversified and
contains more information to learn. On the other hand,
the positives are hard to spot due to the lack of diversity.
Another possibility is that some samples can share similar
patterns which make them collectively relatively easier to
predict. The implication is that when the losses generated
by these easy predictions accumulate, the magnitude is non-
negligible, which can downgrade the efficiency in optimiza-
tion. Hence, it is necessary to answer the question—how to
weaken the losses of the easy predictions but amplify the
losses of the hard ones? Focal Loss introduced by Lin et al.
2017 can be the answer. Focal Loss can be described as:
$$FL(p, y) = \begin{cases} -\alpha\,(1 - p)^{\gamma} \log(p), & \text{if } y = 1, \\ -(1 - \alpha)\, p^{\gamma} \log(1 - p), & \text{otherwise.} \end{cases} \quad (8)$$
Taking y = 1 as an example, given that γ is greater than 1, the term
$(1-p)^{\gamma}$ adjusts the loss such that it shrinks drastically when
p is already high (close to 1). α is another hyper-parameter
for weighting the loss by classes. The two hyper-parameters
γ and α are set to be 2 and 0.75, respectively.
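A compact PyTorch version of Equation (8) for binary classification might look like the following (an illustrative sketch; the project applies the loss to two-class logits, and soft labels from Section 3.6 would be combined separately).

```python
import torch

def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss (Lin et al. 2017), Eq. (8).
    p: predicted probability of the positive class, y: 0/1 targets."""
    p = p.clamp(eps, 1.0 - eps)                                      # numerical stability
    pos_term = -alpha * (1.0 - p) ** gamma * torch.log(p)            # y = 1
    neg_term = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)      # y = 0
    return torch.where(y == 1, pos_term, neg_term).mean()

# Example: easy, correct predictions are down-weighted relative to plain cross-entropy
p = torch.tensor([0.95, 0.60, 0.10])
y = torch.tensor([1, 1, 0])
print(focal_loss(p, y))
```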
3.6 Label Smoothing
The final issue to address lies in the uncertain nature of the Ground Truth labels. Even though the
(non)modification sites were gathered with great care, false-
negatives and false-positives are impossible to rule out. The
critical question to ask is: how well can a neural network
perform when its training involves a certain amount of inac-
curate labels?
3.6.1 The Student-Teacher Analogy
The relationship between a neural network and the Ground Truth can be characterized by the student-
teacher analogy. Naturally, the net is the student who learns
from the teacher, the Ground Truth. In an experiment, Guan
et al. 2018 intentionally corrupted 80% of the labels in the
MNIST training set (a standard in handwriting recognition)
and trained a neural network using the corrupted data to
do the classification. The test performance using the correct
labels was remarkable: the error rate was only 8.23%.
Guan et al. 2018 even titled their experiment by “Beating the
Teacher”. They further pointed out the performance of the
network would collapse when the amount of the corrupted
labels increased to a certain percentage. The implication
is that they empirically proved that the performance of
a neural network is actually not bounded by how good
the teacher is, under certain conditions. The collapse point
is closely linked to the mutual information between the
corrupted labels and the correct ones (Guan et al. 2018).
In the current task, perfectly correct labels are
unavailable, and therefore the mutual information cannot be
evaluated. However, the network performance, as will be
shown, may suggest that the training data is still far from the
collapse point.
3.6.2 Soft Labels
As initially proposed by Szegedy et al. 2016, label smoothing is a regularization method and has been
shown to be effective for improving the performance of
networks in image classification and language processing
tasks (Müller, Kornblith, and Hinton 2019). The relevance of
this technique to the current task is that intuitively, all it does
is add a small amount of uncertainty to the Ground Truth
labels. For instance, when computing the Cross-Entropy
loss, instead of using the target label [1, 0], a so-called soft
label [1-ε, ε] now is used, in which ε is a small number. This
setting captures the real uncertainty existing in the Ground
Truth and transfers it to extra information for the network
to learn. Technically, according to Müller, Kornblith, and
Hinton 2019, training a neural network with soft labels
forces the difference between the logit of the correct class
and the logit of the incorrect class to be a constant. In other
words, the network trained with soft labels is more robust and
cannot easily become over-confident (over-fitted) by pushing one
logit too high and the other too low.
network, the small number ε was set to be 0.03. As will be
shown, with such a relatively small level of smoothing, the
generalizability of the net can be improved.
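For a two-class problem, the soft-label cross-entropy used here reduces to a simple interpolation of the targets; a minimal PyTorch sketch (assuming two output logits and ε = 0.03 as in the text) is shown below.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.03):
    """Cross-entropy against soft labels [1-eps, eps] instead of hard [1, 0]."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # hard one-hot targets relaxed towards a uniform distribution
    soft = torch.full_like(log_probs, eps / (n_classes - 1))
    soft.scatter_(-1, targets.unsqueeze(-1), 1.0 - eps)
    return -(soft * log_probs).sum(dim=-1).mean()

# Example: a batch of 4 samples and 2 classes
logits = torch.randn(4, 2)
targets = torch.tensor([1, 0, 1, 1])
print(smoothed_cross_entropy(logits, targets))
```

Recent PyTorch releases also expose the same behaviour directly through the label_smoothing argument of nn.CrossEntropyLoss.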
4 RESULTS AND DISCUSSION
4.1 Contour Plot of SVM Using pKa and SASA
Using pKa and SASA as two features, a Support Vector Machine (SVM) with a Gaussian kernel was trained
and regarded as the baseline model. The prediction values
(probability) of the SVM for a grid of (x, y) coordinates were
used to generate the contour plot as shown in Figure 3.
One key observation is that as expected, a lower pKa is
associated with a higher probability of being a positive.
At the bottom-left region of the scatters, the positives are
more densely distributed than the negatives. Nevertheless,
a residue being more solvent accessible does not actually
render a higher probability of being modifiable since the
positives and the negatives clearly overlap when SASA is
high.
In general, the effectiveness of using an SVM to separate
the two classes with pKa and SASA is not exceptional and
the large overlapping region (see Figure 3) may suggest a
high false-positive rate in the real application. One final
observation that should be noted is that there is a small
cluster of negatives with pKa values between 11 and 12.
pKa values for negatives were computed based on their
structures predicted by AlphaFold. Such a cluster could
potentially indicate a systematic error in the predicted
structures and should be further investigated by researchers.
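The baseline can be reproduced in outline with scikit-learn (a sketch under assumed variable names and placeholder data; in the study the feature matrix is simply the two columns pKa and SASA):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train: (n_samples, 2) array of [pKa, SASA]; y_train: 0/1 labels (assumed names).
# Placeholder random data is used here purely so the sketch runs standalone.
X_train = np.random.rand(100, 2) * [14.0, 200.0]
y_train = np.random.randint(0, 2, 100)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
svm.fit(X_train, y_train)

# Probability surface over a (pKa, SASA) grid, as used to draw the contour plot
xx, yy = np.meshgrid(np.linspace(0, 14, 200), np.linspace(0, 200, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
prob_positive = svm.predict_proba(grid)[:, 1].reshape(xx.shape)
```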
4.2 Effect of Weighted Random Sampling
To visualize exactly what weighted random sampling did, 20 batches of data (batch size 64) were sampled
using the two methods, random shuffling & splitting and
weighted random sampling using the computed set of
weights Ω. A comparison of data composition is depicted
in Figure 4. With the conventional method, a few lysine
residues contributed too many samples in the 20 batches.
On the contrary, when weighted random sampling (with
replacement) was applied, all the residues were equally likely
to be sampled and the diversity of the residues in each batch
should be improved.
Fig. 3: Contour Plot of SVM Prediction Values and Scatter Plot of SASA against pKa. Only the training data is used to generate
the plot particularly to reflect how well the two classes can be separated by the Machine Learning algorithm and the two features.
The blue region represents the area in which any points will be predicted as a positive. Points that fall into the red region will be
predicted as negatives.
Fig. 4: The composition of 20 batches of data using two sam-
pling methods. The x axis values are categories representing
140 distinct positive residues. The y axis indicates how many
times each residue showed up in the sampled data.
4.3 Performance
Including the baseline SVM model, in total 6 models were trained and tested using the same training and test
sets (see Table 2).
pKa and SASA: the first observation is that SVM has
already obtained high performance in terms of Sensitivity
(76.6%). However, a low Specificity combined with a high
false-positive rate suggests that a great number of negative
samples overlapped with the positive samples in the pKa-
SASA feature space. An MCC of 0.44 indicates a mediocre
correlation between predicted labels and true labels. Over-
all, it is evident that pKa and SASA have certain predictive
power, but additional features should be added to better
separate the two classes.
Basic Network: the neural network trained with cross-
entropy loss has obtained the worst Sensitivity. One possi-
ble explanation is over-fitting since no regularization tech-
niques were used except for just a few dropout layers.
However, the extremely high 76.2% Specificity suggests that
the negatives were easier to classify. The net was likely
to have learned more knowledge from negative samples
but less from positive ones, which is not exactly the ideal
outcome. However, it is reasonable because the negative
class contains more distinct lysine residues, and naturally
more knowledge to learn. Targeting the limitations, label
smoothing and focal loss were subsequently implemented
to improve the performance.
Label Smoothing: probably the most significant obser-
vation is that just a small amount of smoothing in the
Ground Truth labels significantly improved the network’s
performance in all metrics by 3 to 4 percent. The overall
accuracy jumped from only 71% to nearly 75% and the im-
proved MCC of 0.5 suggests a stronger correlation between
predictions and the Ground Truth. What can be inferred is
(a) the slight amount of uncertainty injected into the labels
was effectively learned by the net, i.e. the net learned more
knowledge than the basic one; (b) the level of over-fitting
decreased due to the application of label smoothing as a
method of regularization. It can be concluded that label
smoothing is an effective technique in dealing with the
dataset.
Focal Loss: training a net with focal loss improved
Sensitivity substantially, from 69.5% to 75.4%. This makes sense
because spotting a positive can be hard, and focal
loss forces the net to learn harder from these difficult
predictions. However, focal loss is not an ideal trick because
the Specificity decreased from 80.1% to 73.2%. Nevertheless,
the fact that applying focal loss can help researchers control
the trade-off between Sn and Sp can be helpful depending
Model                        Sn (%)   FP (%)   Sp (%)   FN (%)   Acc (%)   MCC
SVM with pKa & SASA           76.6     27.0     67.7     28.4     72.4     0.44
NN CE                         66.5     23.8     76.2     33.5     71.0     0.43
NN CE + LS                    69.5     19.4     80.1     29.4     74.8     0.50
NN CE + LS + pKa & SASA       79.9     25.6     68.5     25.1     74.6     0.49
NN FL + LS                    75.4     23.7     73.2     27.7     74.4     0.49
NN FL + LS + pKa & SASA       78.8     24.6     70.6     25.6     75.0     0.50

TABLE 2: Performance comparison of trained models. Model acronym explanations: SVM—support vector machine; NN—Attention based neural network; CE—cross-entropy loss; LS—label smoothing; FL—focal loss. Performance metric explanations: Sn—sensitivity; FP—false-positive rate; Sp—specificity; FN—false-negative rate; Acc—total accuracy; MCC—Matthews correlation coefficient.
on the researchers’ needs.
Network with pKa and SASA: it can be observed that
when pKa and SASA were added to the learned represen-
tations of the net with cross-entropy loss and soft labels,
both the Sensitivity and the Specificity changed by around
10%. This drastic shift in the performance may suggest that
the information contained in the learned representation was
very different from the information carried by pKa and
SASA. Another key finding is that the test results of the
MLP have now outperformed the baseline model SVM in
all metrics, with a nearly 80% Sensitivity. On the other hand,
the performance shift of the network with focal loss and
soft labels after combining pKa and SASA was more gentle,
and the second MLP also obtained a better performance
than the baseline model. Because the second MLP (last row
in Table 2) earned higher scores in total accuracy and
MCC than the first one (the third row in Table 2) did,
it is regarded as the final optimal model (considering the
Attention based net and the MLP jointly as one model).
4.4 Embedding Visualization
Representations extracted from the Attention based net can be viewed as low-dimensional embeddings
of the network inputs. Specifically, the mapping can be
described as $f: \mathbb{R}^{n \times k} \to \mathbb{R}^{m}$, in which n, k, m represent the
number of amino acids, vector length, and embedding di-
mension, respectively. It is beneficial to visualize the learned
embeddings in the 2-d space to truly understand how well
the neural network abstracted the input information in or-
der to separate the two classes. The embeddings were from
the Attention based net with focal loss and soft labels. t-
SNE (t-distributed Stochastic Neighbor Embedding), a com-
monly used dimensionality reduction algorithm in Machine
Learning was applied to further learn a 2-d representation
of the embeddings.
Training Set: as shown in Figure 5, the positives
and negatives were relatively well-separated. The positives
formed a curvy shape and were distributed at the bottom-
left region while the negatives were more compactly dis-
tributed at the top-right region. The so-called minority
positives are the ones associated with the positive lysine
residues that only have few structure representatives (usu-
ally just 1) in the training set. The fact that they were all sepa-
rated from the negatives indicates the net was not biased by
learning too much from the positive residues with too many
conformations; otherwise, these minority positives would
not be well-separated or could be falsely classified in the
training set. Samples associated with the experimental hits
supplied by the research group at Durham University were
also marked by grey stars to give biologists a clearer picture.
Test Set: separability of the test set is less obvious as
shown in Figure 6. It is undeniable that given the size of
the dataset, even though regularization was applied, over-
fitting could still be an unsolvable issue. In general, the
positives still formed a cluster, but more negatives were
mixed into it. However, observations of these green stars
are encouraging. They are the structure representatives of
the positive lysine residues that do not have conforma-
tions included in the training set. 16 out of 22 so-called
difficult positives were clearly separated and compactly
distributed at the ”right” place (bottom-left). Even though
6 of them were missed, the possibility of them being real
false-positives may not be ruled out.
Although adding pKa and SASA to the learned embed-
dings would bring more noise to the depicted representa-
tions, the general separability of the training and test sets
would be preserved.
4.5 Attention Matrix Visualization
Attention Matrices are collections of Attention Scores stored in matrix form. Taking protein P0ACD4
(Uniprot) as an example, in its structure 2KQK (PDB), 28
amino acids can be identified in the microenvironment of
lysine 124. The Attention Matrix (Figure 7) for this partic-
ular instance is 29 by 29 (the lysine itself is also included)
with each entry being a measurement of the relationship
between two amino acids. The Global Attention Matrix, a
variation of the Attention Matrix, is 1 by 29.
Attention Matrices can tell how the input amino acid
vectors gather information from each other. For instance, in
Figure 7, the information of ASN and ALA (two dark green
columns) was largely gathered by all the amino acids
because these two residues obtained higher attention scores
with respect to all the amino acids. Knowing the flow of information
allows researchers to examine the interpretability of the
network decision-making for lysine residues of particular
interest.
Fig. 5: Training set t-SNE visualization of embeddings learned by the Attention based net with focal loss and soft labels. Minority
positives are positive samples associated with positive lysine residues that only have few conformations in the training set.
Experimental hits are samples related to the carbamate sites identified by the research group at Durham University.
Fig. 6: Test set t-SNE visualization of embeddings learned by the Attention based neural network with focal loss and soft labels.
Difficult positives are positive samples associated with positive lysine residues that do not have conformations in the training set.
Experimental hits are samples related to the carbamate sites identified by the research group at Durham University.
Fig. 7: Visualization of the First Attention Matrix for protein
P0ACD4 (Uniprot), structure 2KQK (PDB), resid 124. Each entry
of the matrix represents a measurement of the relationship
between two amino acids. The darkness of the entry’s color
suggests how strong the relationship is. Each amino acid gath-
ers information mainly from strongly-related amino acids.
4.6 Predicted Sites in Cyanobacteria and Human Pro-
teins
Cyanobacteria are likely the first organisms that treated CO2 as the primary source of carbon (King et al.
2022). Also, several hundred lysine residues were identified
to have heightened reactivity in the human proteome by
global profiling (Hacker et al. 2017, as cited by King et al.
2022). Following these findings, as the final step of the work,
the trained Attention based net (the second-to-last row in
Table 2) was applied to lysine residues in cyanobacteria and
human proteins to evaluate the prevalence of Carbamyla-
tion. The Attention based net was solely used without the
corresponding MLP due to efficiency consideration. All the
proteins were reviewed ones identified from the Uniprot
database. Due to the huge quantity of the existing lysine
residues in these proteins, only structures predicted by
AlphaFold were used to generate the network inputs. In
total, 1169197 lysine residues from 43181 human proteins
and 210466 lysine residues from 13698 cyanobacteria pro-
teins were gathered. Preliminary results demonstrated that
29.75% of the lysine residues in human proteins and 28.9%
of the residues in cyanobacteria proteins were predicted as
potential carbamate sites (see Table 3).
                 Num of proteins    Num of lysines    Predicted (%)
Human                 43181            1169179            29.75
Cyanobacteria         13698             210466            28.9

TABLE 3: Predicted carbamate sites in human and cyanobacteria proteins. The percentage indicates the proportion of predicted carbamate sites among all the lysines.
5 LIMITATIONS AND FUTURE WORKS
5.1 Alternative Positional Encoding Method
One key step in leveraging information in protein structures is to preserve the relative positional information
of each amino acid in the environment. The present work
has only tested one type of positional encoding used in
Scene Representation tasks. Other research (Parmar et al.
2018; Carion et al. 2020) that attempted to apply a Trans-
former based network to image classification or object de-
tection tasks has followed the original positional encoding
method used by Vaswani et al. 2017 and extended it to the
2-d case. As done by Parmar et al. 2018, an easy way is
that supposing the one-hot-encoded (identity) vector is L-
dimensional, one can use d = L/3 of the dimensions to
encode x, y, and z coordinate values, respectively and add
(or concatenate) the position-encoded L-dimensional vector
back to the original L-dimensional one-hot vector. Yet, the
tricky part is developing a way to convert the continuous
3-d coordinates to a discrete coordinate system (the position
of a pixel in an image or a word in a sequence is discretely
represented). One way to achieve this is by putting the
environment into a 3-d box made up of unit boxes. Then,
the position of an amino acid can be instead encoded by the
position of the unit box in which the amino acid stays using
the following formulas, pos represents the position and i
represents the dimension:
$$PE(pos, 2i) = \sin\!\big(pos / 10000^{2i/d}\big), \qquad PE(pos, 2i+1) = \cos\!\big(pos / 10000^{2i/d}\big). \quad (9)$$
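One possible realization of this alternative is sketched below (an illustration only; the unit-box size and the way the three axis encodings are combined are assumptions not fixed by the text).

```python
import numpy as np

def sinusoidal_encoding(pos, d):
    """Original Transformer encoding (Eq. 9) for one integer position."""
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    enc = np.zeros(d)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

def encode_position_3d(xyz, box_size=1.0, d_total=60):
    """Discretize continuous coordinates into unit boxes, then encode each axis
    with d_total/3 dimensions and concatenate (Parmar et al. 2018 style)."""
    d = d_total // 3
    boxes = np.floor(np.asarray(xyz) / box_size).astype(int)   # unit-box indices
    return np.concatenate([sinusoidal_encoding(b, d) for b in boxes])

# Example: a C-alpha at (3.2, -7.9, 11.4) relative to the target lysine
print(encode_position_3d([3.2, -7.9, 11.4]).shape)   # (60,)
```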
5.2 Transfer Learning
Big data is critical to Deep Learning. However, in the present task, it is impossible to obtain thousands or
more carbamate sites. One idea is to use transfer learning
to leverage the possible big data of other PTMs. According
to Weiss, Khoshgoftaar, and D. Wang 2016, transfer learning
is the process of improving the model’s performance in
the target domain with the help of the information learned
in the source domain. Firstly, one can train the same At-
tention based net using the data of other non-enzymatic
lysine PTM sites, such as Glycation. Secondly, one needs
to freeze the parameters in the Attention and Residual Feed
Forward blocks and further train (fine-tune) the remaining
parameters (mostly just Linear ReLU Stack) using the Car-
bamylation data. Depending on the performance, one can
further unfreeze other parameters. In this way, the learned
knowledge (most likely the network’s perception of a 3-d
environment) from the Glycation data can be used to enrich
the knowledge that can be learned from the Carbamylation
data and improve the performance. It should also be pointed
out that using the network architecture to solely study the
Glycation PTM can also be beneficial.
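In PyTorch, the freezing step described above could look roughly like this (an illustrative sketch; model.attention_blocks, model.residual_blocks, and the learning rate are hypothetical names and values for the architecture in Figure 2):

```python
import torch

# model = AttentionBasedNet(...)  # hypothetically pre-trained on another lysine PTM, e.g. Glycation

def freeze_for_finetuning(model):
    """Freeze the Attention and Residual Feed Forward blocks, fine-tune the rest."""
    for module in (model.attention_blocks, model.residual_blocks):
        for param in module.parameters():
            param.requires_grad = False          # excluded from gradient updates

    # Only the remaining (unfrozen) parameters are handed to the optimizer
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)
```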
5.3 Hyperparameter Optimization
Hyperparameters are model configurations that can be tuned to improve model performance. Due to time
and computing power constraints, no grid search of the
optimal collection of hyperparameters was implemented in
the present study. Therefore, the values of these parameters
can possibly be further optimized.
Network Depth—stacking the Attention and Residual Feed
Forward blocks three times would typically yield better performance
than one or two times. However, building a deeper network
to learn more abstract representations is only advisable
when more data is available.
Level of Smoothing—the value ε was set to be 0.03 in the
present study; ε = 0.1 is also acceptable but may require
more epochs of training to yield good performance.
Focal Loss Parameters—the two parameters γ and α were
set to be 2 and 0.75, respectively; tuning the value of α could
impact the balance between Sensitivity and Specificity; the
value of γ is not encouraged to change.
5.4 Alternative Way of Defining Adjacency of Amino
Acids and Cut-off Distance
Adjacency of an amino acid with respect to a lysine residue in its microenvironment is defined by having
a Euclidean distance of less than 12 Å between the NZ
position of the lysine and the alpha carbon (Cα) of the
amino acid. Nevertheless, it could be the case that for an
amino acid, its Cα is outside of the cut-off region while its
side chain falls inside. The amino acid will be excluded in
that case, which may not be the best choice. Therefore, an
alternative way, including an amino acid if any of its atoms
is within the cut-off distance, can be further tested. Due to
the time constraint, preliminary investigations were carried
out as shown in Appendix (7.2).
6 CONCLUSION
Carbamylation is a critical lysine-based Post-
teins, the lysine residues that can be modified are hard
to identify using conventional experimental methods. The
present work has explored the possibility of using a Deep
Learning model to accomplish the identification task with
the help of a number of newly-identified carbamate sites.
This project is original in the following ways. Firstly, this
work has filled a gap in the literature by presenting a
curated benchmark dataset with newly identified solvent
accessible modification sites. Secondly, the present work
has introduced a way to exploit protein 3-d structure
based information by designing the input feature that gath-
ers information of neighbouring amino acids in a lysine
microenvironment. Two biologically important values pKa
and SASA have also been studied and incorporated into
the model-training and decision-making by the proposed
way. The Self-Attention based network architecture has
been implemented to accommodate the input features with
varying shapes. Thirdly, methods to effectively deal with
data imbalance and potential noise in the Ground Truth
labels have been discussed. Finally, in total 6 models, includ-
ing one baseline Machine Learning model and five Deep
Learning models were trained and tested. The final optimal
model obtained a 78.8% Sensitivity, a 70.6% Specificity, and
a 75% total accuracy. Large-scale computations applying
one of the trained models have predicted that nearly 30%
of the lysine residues in human and cyanobacteria proteins
are modifiable. Future works can experiment with an al-
ternative positional encoding method for better encoding
amino acid relative position information. The idea of trans-
fer learning can also be tested in order to cope with data
deficiency. The Attention based network architecture can
also be applied to study other non-enzymatic lysine PTMs,
such as Glycation. An alternative way to define amino
acid adjacency in determining neighbouring amino acids in
lysine microenvironments can be further implemented.
DATA, MODEL, AND CODE AVAILABILITY
Data and models discussed in the present work, along
with the implementations can be found on GitHub: https://github.com/manhao9843/AttentionBasedNetwork.
ACKNOWLEDGMENTS
As the author of this study, I would like to extend my most
special thanks of gratitude to my supervisors Prof. Martin
Cann and Dr. Matteo Degiacomi for their patient teaching
and guidance from the beginning to the end to get me
through the project. At the beginning of my work, I knew
absolutely nothing about biology, whereas now I have the
structure of proteins engraved in my mind. I would also
like to thank the research team led by Prof. Martin Cann
at Durham University, without their important work, the
present study would not exist. I must not forget the teaching
and administration faculty who have given me the skills and
the opportunity to contribute. And of course I am grateful
for the sponsorship of my parents, who looked after me in
their own ways even during difficult times at home. Finally,
I would like to thank myself for not going an easy way, but
instead choosing this challenging project.
REFERENCES
Branden, Carl Ivar and John Tooze (2012). Introduction to
protein structure. Garland Science.
Ramazi, Shahin and Javad Zahiri (2021). “Post-translational
modifications in proteins: resources, tools and prediction
methods”. In: Database 2021.
Huang, Kai-Yao et al. (2019). “dbPTM in 2019: exploring
disease association and cross-talk of post-translational
modifications”. In: Nucleic acids research 47.D1, pp. D298–
D308.
Jimenez-Morales, David et al. (2014). “Lysine carboxyla-
tion: unveiling a spontaneous post-translational modi-
fication”. In: Acta Crystallographica Section D: Biological
Crystallography 70.1, pp. 48–57.
Blake, Lynsay I and Martin J Cann (2022). “Carbon Dioxide
and the Carbamate Post-Translational Modification”. In:
Frontiers in Molecular Biosciences, p. 166.
King, Dustin T et al. (2022). “Chemoproteomic identification
of CO2-dependent lysine carboxylation in proteins”. In:
Nature Chemical Biology 18.7, pp. 782–791.
Linthwaite, Victoria L and Martin J Cann (2021). “A method-
ology for carbamate post-translational modification dis-
covery and its application in Escherichia coli”. In: Inter-
face Focus 11.2, p. 20200028.
Linthwaite, Victoria L, Joanna M Janus, et al. (2018). “The
identification of carbon dioxide mediated protein post-
translational modifications”. In: Nature communications
9.1, pp. 1–11.
Senior, Andrew W et al. (2020). “Improved protein struc-
ture prediction using potentials from deep learning”. In:
Nature 577.7792, pp. 706–710.
Huang, Guohua et al. (2013). “Prediction of carbamylated
lysine sites based on the one-class k-nearest neighbor
method”. In: Molecular BioSystems 9.11, pp. 2729–2740.
Ning, Qiao et al. (2021). “pQLyCar: Peptide-based dynamic
query-driven sample rescaling strategy for identifying
carboxylation sites combined with KNN and SVM”. In:
Analytical Biochemistry 633, p. 114386.
Wang, Huiqing et al. (2021). “MDCAN-Lys: A Model for
Predicting Succinylation Sites Based on Multilane Dense
Convolutional Attention Network”. In: Biomolecules 11.6,
p. 872.
Ahmed, Saeed et al. (2021). “DeepPPSite: a deep learning-
based model for analysis and prediction of phospho-
rylation sites using efficient sequence information”. In:
Analytical biochemistry 612, p. 113955.
Jumper, John et al. (2021). “Highly accurate protein struc-
ture prediction with AlphaFold”. In: Nature 596.7873,
pp. 583–589.
Berman, Helen M et al. (2000). “The protein data bank”. In:
Nucleic acids research 28.1, pp. 235–242.
Dou, Lijun et al. (2021). “A comprehensive review of the im-
balance classification of protein post-translational modi-
fications”. In: Briefings in Bioinformatics 22.5, bbab089.
Chawla, Nitesh V et al. (2002). “SMOTE: synthetic minority
over-sampling technique”. In: Journal of artificial intelli-
gence research 16, pp. 321–357.
Vaswani, Ashish et al. (2017). “Attention is all you need”.
In: Advances in neural information processing systems 30.
Mildenhall, Ben et al. (2020). “Nerf: Representing scenes as
neural radiance fields for view synthesis”. In: European
conference on computer vision. Springer, pp. 405–421.
Kilambi, Krishna Praneeth and Jeffrey J Gray (2012). “Rapid
calculation of protein pKa values using Rosetta”. In:
Biophysical journal 103.3, pp. 587–595.
Gokcan, Hatice and Olexandr Isayev (2022). “Prediction of Protein pKa with Representation Learning”. In: Chemical science 13.8, pp. 2462–2474.
Olsson, Mats HM et al. (2011). “PROPKA3: consistent treatment of internal and surface residues in empirical pKa predictions”. In: Journal of chemical theory and computation 7.2, pp. 525–537.
Søndergaard, Chresten R et al. (2011). “Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pKa values”. In: Journal of chemical theory and computation 7.7, pp. 2284–2295.
Lu, Cheng-Tsung et al. (2011). “Carboxylator: incorporating
solvent-accessible surface area for identifying protein
carboxylation sites”. In: Journal of computer-aided molec-
ular design 25.10, pp. 987–995.
Vandermarliere, Elien and Lennart Martens (2013). “Protein
structure as a means to triage proposed PTM sites”. In:
Proteomics 13.6, pp. 1028–1035.
Rudden, Lucas SP et al. (2022). “Biobox: a toolbox for
biomolecular modelling”. In: Bioinformatics 38.4, p. 1149.
Emmert-Streib, Frank et al. (2020). “An introductory review
of deep learning for prediction models with big data”.
In: Frontiers in Artificial Intelligence 3, p. 4.
He, Kaiming et al. (2016). “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 770–778.
Efraimidis, Pavlos S and Paul G Spirakis (2006). “Weighted
random sampling with a reservoir”. In: Information pro-
cessing letters 97.5, pp. 181–185.
Lin, Tsung-Yi et al. (2017). “Focal loss for dense object detec-
tion”. In: Proceedings of the IEEE international conference on
computer vision, pp. 2980–2988.
Guan, Melody et al. (2018). “Who said what: Modeling in-
dividual labelers improves classification”. In: Proceedings
of the AAAI conference on artificial intelligence. Vol. 32. 1.
Szegedy, Christian et al. (2016). “Rethinking the inception
architecture for computer vision”. In: Proceedings of the
IEEE conference on computer vision and pattern recognition,
pp. 2818–2826.
Müller, Rafael, Simon Kornblith, and Geoffrey E Hinton (2019). “When does label smoothing help?” In: Advances in neural information processing systems 32.
Hacker, Stephan M et al. (2017). “Global profiling of lysine
reactivity and ligandability in the human proteome”. In:
Nature chemistry 9.12, pp. 1181–1190.
Parmar, Niki et al. (2018). “Image transformer”. In: Inter-
national conference on machine learning. PMLR, pp. 4055–
4064.
Carion, Nicolas et al. (2020). “End-to-end object detection
with transformers”. In: European conference on computer
vision. Springer, pp. 213–229.
Weiss, Karl, Taghi M Khoshgoftaar, and DingDing Wang
(2016). “A survey of transfer learning”. In: Journal of Big
data 3.1, pp. 1–40.
7 APPENDIX
7.1 Network Training Details
Batch Size—the optimal batch size is 64; batch sizes of 16, 32, and 64 were tested.
Optimizer—Adam with default settings was chosen; SGD with momentum was also tested but was outperformed by Adam in both convergence speed and final test-set performance.
Learning Rate—for the first 35-40 epochs, the learning rate upper bound of Adam was kept at the default value of 1e-3; after at most 40 epochs, it was reduced to 1e-4.
Number of Epochs—the ideal total number of epochs in our experience is 60; however, further training with a small learning rate such as 1e-4 is encouraged if the test performance is relatively poor. A minimal sketch of this training schedule is given below.
Incorporate pKa and SASA internally or externally?—
it is crucial to train the Attention Based Network without
including pKa and SASA in the network. Initially, the two
values were concatenated to the output LYS vector of Global
Attention. In other words, pKa and SASA were involved
in the training of the Attention Based net. However, after
visualizing the embeddings, it was discovered that the
separability of the two classes was extremely poor. One pos-
sible explanation is that the presence of the two externally-
determined values caused the final Linear ReLU Stack to
converge quickly, leaving the core Attention and Residual
Feed Forward Blocks untrained. Hence, the best way to include the two values is to train a separate MLP after the Attention based network has been trained; the input feature of this MLP can be a concatenation of pKa, SASA, and the learned embeddings (a sketch is given below).
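Below is a minimal sketch of such an external MLP, assuming illustrative dimensions (the embedding size and hidden width are placeholders, not the exact values used in this work); the frozen Attention based network supplies the LYS embedding, which is concatenated with the externally computed pKa and SASA values.

import torch
import torch.nn as nn

class ExternalMLP(nn.Module):
    """Small classifier trained on top of frozen Attention-net embeddings."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim + 2, 64),  # +2 for pKa and SASA
            nn.ReLU(),
            nn.Linear(64, 2),                  # carbamylated vs. not
        )

    def forward(self, embedding, pka, sasa):
        # embedding: (batch, embedding_dim); pka, sasa: (batch,)
        x = torch.cat([embedding, pka.unsqueeze(1), sasa.unsqueeze(1)], dim=1)
        return self.classifier(x)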
7.2 Amino Acid Counts for Analyzing Microenvironments
One way to explore lysine microenvironments is by counting occurrences of amino acids within the 12 Å cut-off distance. Because these counts are highly similar
across conformations of each unique lysine residue, to avoid
duplicates, only structures predicted by AlphaFold in the
training set were used. For each 3-d structure, 20 occurrence
numbers that correspond to 20 types of amino acids can be
obtained. Treating the occurrence numbers as new features,
distributions of each of the 20 features conditioned on class labels (i.e. positive and negative classes) can be plotted as histograms. Two figures were produced: one applying the original definition of amino acid adjacency (Figure 10), and one featuring the alternate definition (Figure 11).
In Figure 10, differences between conditional distributions are subtle. This may be because, under the original definition of adjacency, amino acids whose side chains lie in the microenvironment but whose alpha carbons are far away were not counted. However, Figure 11, based on the alternate definition that includes more of the relevant amino acids, still shows no significant difference between conditional distributions for any of the 20 amino acid types. Nevertheless, with the alternate definition, more amino acids, such as the negatively charged ASP and GLU, were observed in the microenvironments. Other residues, including ALA, ILE, LEU, and VAL, were also observed more frequently.
7.3 Good Performance or Good Hacking
Hacking refers to training a model that can obtain exceptional test performance but may perform poorly
in real applications. Here (Table 4), three Machine Learning
models were trained using the entire training set with the
22-d features—pKa, SASA, and the 20 amino acid counts
discussed in 7.2.
All three models obtained extremely high performance on the test set. The best one, the Random Forest, even scored a 97.4% Specificity and a nearly 90% Sensitivity. Its 2.4% false-positive rate, the lowest of the three, would seemingly make it the ultimate solution for Carbamylation prediction. But why is this hacking?
                          Sn (%)  FP (%)  Sp (%)  FN (%)  Acc (%)
Naive Bayes                83.6    20.2    75.7    19.8    80.0
Support Vector Machine     82.9     8.6    91.0    17.7    86.7
Random Forest              89.6     2.4    97.4    10.1    93.3
TABLE 4: Performance comparison of three Machine Learning models trained using the entire training set. The features were the 22-d vectors composed of pKa, SASA, and amino acid counts. The Random Forest was based on 200 individual trees. The SVM applied a Gaussian kernel with penalty parameter C=100. The Naive Bayes classifier used Gaussian likelihoods.
The key lies in how the training and test data were
constructed and what features were used. The benchmark
dataset was essentially composed of protein conformations,
and a small portion of these conformations were randomly
selected as the test set. The feature vector was composed mainly of amino acid counts that are highly similar across conformations of the same lysine residue. The consequence of this setting is that a great number of test samples became extremely easy to predict, because their alternate conformations had been seen by the model in the training set. Therefore, only the difficult positives introduced in 4.4, which were never involved in the training set, can be used to genuinely evaluate the performance of the three models (see Figure 8). It turns out that only 3 out of 22 of these difficult positives were identified, which clearly contradicts the seemingly high model performance.
The separability of the training set is also poor (Figure 9). Therefore, the three models, possibly along with other models in the literature that did not properly address this issue, are not applicable and are the result of good hacking. It should be pointed out that the present work is not completely immune to the issue; however, great care has been taken. The design of the input feature largely ensures variability across conformations, since not only amino acid identities but also relative positions, which can drastically vary across conformations, were gathered. Also, the fact that 16 out of 22 difficult positives were spotted by the network is a strong indication of its true performance.
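One way to avoid this kind of conformational leakage is a grouping-aware split, in which all conformations of the same lysine residue fall on the same side of the train/test boundary. The sketch below uses scikit-learn's GroupShuffleSplit and is an illustration of the idea, not the procedure used in this work; `samples`, `labels`, and `residue_ids` are placeholders.

from sklearn.model_selection import GroupShuffleSplit

def leakage_free_split(samples, labels, residue_ids, test_size=0.2, seed=0):
    """Split so that no lysine residue contributes conformations to both sets."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(samples, labels, groups=residue_ids))
    return train_idx, test_idx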
Fig. 8: Test set t-SNE visualization with the 22 features. Prediction mistakes of the Random Forest on the test set are marked by red crosses; the difficult positives are marked by green stars.
Fig. 9: Training set t-SNE visualization with the 22 features. Minority positives, marked by red stars, are samples associated with lysine residues that have only a few conformations in the training set. Experimental hits, highlighted by grey stars, are samples related to the carbamate sites supplied by the research group at Durham University.
Fig. 10: Conditional categorical distributions of 20 types of amino acids. Amino acid adjacency is defined as a Euclidean distance of less than 12 Å between the Cα of an amino acid and the lysine NZ position. The x axis represents the occurrence number of an amino acid in the lysine microenvironment; the y axis is the normalized frequency expressed as a percentage (i.e. for each conditional distribution, the y values add up to 100).
Fig. 11: Conditional categorical distributions of 20 types of amino acids. Amino acid adjacency is alternatively defined as a Euclidean distance of less than 12 Å between at least one atom of an amino acid and the lysine NZ position. The x axis represents the occurrence number of an amino acid in the lysine microenvironment; the y axis is the normalized frequency expressed as a percentage (i.e. for each conditional distribution, the y values add up to 100). In comparison to Figure 10, amino acids that were observed more frequently are highlighted by red lines.

