CptS 475/575: Data Science
Project Ideas
Fall 2021
Draft, Last Updated: October 8, 2021
(Note: this document will be updated as needed until project assignments to teams are finalized)
1. Protein Clustering
Protein clustering is an important capability for many biological applications. It can be used for
structure/functional prediction (figuring out what a new protein is likely to do based on similar
known proteins), lineage modelling (figuring out which sequences evolved from which others),
and identification of erroneous sequencing readings, just to name a few. For this project you will
have access to three compiled datasets on the course page (pneumonia genes, multispecies
GroEL proteins and a set of enzymes organized by family and superfamily condensed and
cleaned from this paper), but you are encouraged to collect and use any others that may interest
you from GenBank, which is an excellent repository for genetic data. To handle these datasets,
and the ones provided, you will want to get comfortable with the FASTA format, which is a
standard for sequence data. Read up on that here.
Given labels for your dataset(s) of choice, produce the best clustering(s) of the data that you can. Things
you may want to consider include your choice of distance/similarity measure (do you want to use
something biologically meaningful, like an alignment score, something exact and slow, or an
approximation like BLAST or FASTA, or do you want to develop something on your own?),
clustering approach (hierarchical, network based, centroid based, etc.), choice of data (is there a
particular species or gene of interest?) and experimental design (do you want to find a way to
merge results from multiple approaches to improve accuracy? compare efficacy of some
approach across multiple datasets? design a novel algorithm and benchmark it against existing
approaches? or do something else?) You will be evaluated on how well you demonstrate a
knowledge of and ability to apply existing methods, as well as the soundness of your
experimental design and how you interpret your results (not on the quality of those results
themselves, however).
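As a concrete starting point, here is a minimal Python sketch of one possible pipeline: parse a FASTA file with Biopython, compute pairwise edit distances, and run hierarchical clustering with SciPy. The file name, the plain Levenshtein distance, and the cluster count are illustrative assumptions; in practice you would likely substitute a biologically meaningful alignment score.

# Sketch: hierarchical clustering of protein sequences from a FASTA file.
# Assumptions: Biopython and SciPy are installed, "proteins.fasta" is a
# hypothetical input file, and plain edit distance stands in for a proper
# alignment-based score.
from Bio import SeqIO
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

records = list(SeqIO.parse("proteins.fasta", "fasta"))
seqs = [str(r.seq) for r in records]
n = len(seqs)

# Symmetric pairwise distance matrix (O(n^2) comparisons -- fine for small sets).
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = edit_distance(seqs[i], seqs[j])

# Average-linkage hierarchical clustering, cut into (say) 5 flat clusters.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=5, criterion="maxclust")
for rec, lab in zip(records, labels):
    print(rec.id, lab)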
2. Networks
A network is a common way to represent the relationships among a set of objects (also known as
nodes or vertices). These relationships are represented by an edge, which connects these nodes to
one another. Typically, edges are defined only between a pair of nodes. The set of these edges
and nodes constitutes a graph (or network). Networks have become more prevalent with the rise
of the internet and social media, both of which are frequently represented and analyzed as
networks. Each of the following projects will interact with features of network construction as it
pertains to string data. This class of network is referred to as a sequence similarity network
(SSN), a common tool in bioinformatics. SSNs can be used to show how a set of
protein sequences are related, in terms of their structural similarity. You can read more about the
applications and details of SSNs in this paper. This type of network can also be applied to other
types of string data, such as text. In order to build an SSN, you must choose some sequence
similarity measure to determine which nodes are similar enough to be connected by an edge. You
also need a model for including or excluding edges. The basic approach is to set some threshold
on your similarity measure.
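As an illustration of the basic threshold approach, here is a minimal sketch using the networkx Python package; the toy sequences, the character-match similarity, and the 0.6 threshold are placeholder assumptions, not values taken from the course datasets.

# Sketch: a threshold-based sequence similarity network (SSN) with networkx.
# The sequences, the normalized-match similarity, and the 0.6 threshold are
# illustrative assumptions.
import networkx as nx
from difflib import SequenceMatcher

seqs = {"s1": "MKVLAA", "s2": "MKVLSA", "s3": "QQRPLT", "s4": "QQRPLA"}

def similarity(a, b):
    # Ratio of matching characters (0..1) as a stand-in similarity measure.
    return SequenceMatcher(None, a, b).ratio()

THRESHOLD = 0.6
G = nx.Graph()
G.add_nodes_from(seqs)

names = list(seqs)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        s = similarity(seqs[u], seqs[v])
        if s >= THRESHOLD:              # keep only sufficiently similar pairs
            G.add_edge(u, v, weight=s)

print(G.edges(data=True))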
Project Idea 2.1: Normalizing similarity and distance measures
In order to build edges from a set of sequences, some notion of similarity or distance is required,
where similarity measures the closeness of two sequences (identical sequences receive a high
value) and distance measures the dissimilarity between sequences (identical sequences receive a
value of 0). In many contexts, similarity and distance measures can be treated as interchangeable.
However, this is not always the case. Many applications require normalization on the lengths of
the input sequences in order to weigh longer (more pertinent) similarities more highly. This
becomes especially relevant in a dataset containing sequences of varying lengths. For example, a
dataset with 50 long sequences (say a paragraph each) and 50 short sequences (each a sentence)
using a distance measure with no normalization will tend to connect the shorter sequences more
readily, simply because the number of possible differences between shorter sequences is
smaller, even when the overall level of matching between longer sequences may be larger. (ex.
The distance between "cat" and "rat" is the same as the distance between "willing" and "wilting"
even though the latter example has a far greater overall percentage of matches than the former.)
Similarity measures do not suffer from this problem, since larger matching areas are naturally
weighted higher. However, they suffer from a complementary problem, in that short identical or
near-identical sequences will have a smaller similarity than long sequences with a low
percentage of matches, simply because the longer sequences' matching regions are longer in absolute terms than the shorter sequences' matches. (ex.
"The cat and the dog are both animals." will have a higher similarity with "The tree and the
avocado are both green." than "normlaization" has with "normalisation", though it is not
intuitively clear that one pair is genuinely more similar than the other.)
Normalizing these measures for the lengths of sequence can help alleviate each of these issues.
However, it does create problems of its own. In particular, it can disrupt certain useful properties
of a distance/similarity metric like the triangle inequality. For details on the distinction between a
metric and a measure, read this article. For this project you will conduct a review of the literature
concerning normalized sequence similarity and distance measures or metrics. Implement two or
more (at least one distance and one similarity measure/metric) of these methods for yourself OR
design a normalized similarity/distance measure of your own based on your research. Test the
results using a dataset (or datasets) of your choosing with varying sequence lengths, and report on the
effects of normalization on your resulting distance/similarity matrices. You may also choose to
represent this as a network, if you prefer.
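To make the issue concrete, here is a small sketch contrasting a raw edit distance with one possible length normalization (dividing by the length of the longer sequence), applied to the examples above; the normalization choice is just one common convention, not a required one.

# Sketch: raw vs. length-normalized edit distance on the short/long examples
# from the project description. Normalizing by the longer length is one
# common convention among several.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    return edit_distance(a, b) / max(len(a), len(b))

for a, b in [("cat", "rat"), ("willing", "wilting")]:
    print(a, b, edit_distance(a, b), round(normalized_distance(a, b), 3))
# Both pairs have raw distance 1, but the normalized distances differ
# (0.333 vs. 0.143), reflecting the larger fraction of matches in the
# longer pair.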
You will be evaluated on your ability to describe, in your report, the normalization process for the
methods you deploy and any effects it has on the properties of your methods, as well as your
ability to accurately implement these method(s). You will also need to choose a dataset that
demonstrates the need for normalization, show through experimentation the effect it has on your
results, and express these concepts via numeric and visual summary.
Project Idea 2.2: Sequence similarity network model comparison
For this project, you will be using one of the sequence datasets previously mentioned (or one of
your own) to construct a sequence similarity network. You can use the methods of the given
paper as a starting point. Think about the choices they made in constructing their networks. What
are the weaknesses of their approach? Factors you should consider include which distance
metric(s) to use, what type of network to construct (threshold, nearest neighbor, etc.), how to
verify its accuracy, and how to visualize it effectively (hint, check out the igraph package).
You will be evaluated on your ability to critically evaluate the work presented in the paper
above, use an alternate network construction technique which might improve some of these
weaknesses, evaluate the techniques against one another, and on how you interpret your results
(not on the quality of those results themselves, however).
Project Idea 2.3: Modifying the Directed, Weighted All Nearest Neighbors Model
One way to make building SSNs more efficient is to reduce the number of distance/similarity
computations needed. To this end, the directed, weighted all nearest neighbors (DiWANN)
model was developed to make constructions more efficient. The model eschews a threshold in
favor of connecting each sequence to its nearest neighbor (or neighbors in case of a tie). You can
read about the details of the model and how to build it efficiently here. However, there are several
improvements that could be made to the model. For this project, you will develop one or more of
these improvements in a computationally efficient manner and report on the effects of the
modification.
Option a) One weakness of the DiWANN model in its current form is its tendency to include spurious
connections when an outlier has no valid connections. To alleviate this problem, add an optional
parameter to the construction method such that if all of a sequence's neighbors lie within the given
percentage of the maximum distance (i.e., if it only has very distant neighbors) it may be left
with no edges at all. This percentage will typically be relatively small (5%-25%).
Option b) Add a neighborhood range threshold, such that each node is connected to its nearest
neighbor in addition to any sequence within a given range of that nearest neighbor. Again, this
range should be small (several edits or 5%-10% difference).
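To illustrate the idea behind option a), here is a brute-force sketch that connects each node to its nearest neighbor(s) but leaves a node isolated when even its nearest neighbor lies within a cutoff percentage of the maximum observed distance. The distance matrix, the parameter name, and the 15% cutoff are placeholders, and this is not the efficient construction described in the DiWANN paper.

# Sketch of option (a): nearest-neighbor edges with an outlier cutoff.
import numpy as np

def diwann_like_edges(D, outlier_pct=0.15):
    """D: square distance matrix; returns directed edges (i, j, dist)."""
    n = D.shape[0]
    cutoff = D.max() * (1 - outlier_pct)   # boundary for "very distant"
    edges = []
    for i in range(n):
        row = D[i].copy()
        row[i] = np.inf                    # ignore self-distance
        nearest = row.min()
        if nearest >= cutoff:              # only very distant neighbors: skip
            continue
        for j in np.where(row == nearest)[0]:   # include ties
            edges.append((i, int(j), float(nearest)))
    return edges

# Toy example: node 3 is an outlier that ends up with no outgoing edges.
D = np.array([[0., 2., 3., 90.],
              [2., 0., 4., 95.],
              [3., 4., 0., 92.],
              [90., 95., 92., 0.]])
print(diwann_like_edges(D, outlier_pct=0.15))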
You will be evaluated on the correctness and efficiency of your code and how well you
demonstrate these properties through testing. Your report should include a detailed description of
how you changed/improved the existing algorithm, and how those changes affect result
outcomes and runtimes.
3. Covid-19 EDA and Visualization
Background: As we all know, the Covid-19 pandemic has had massive impacts on every aspect
of our lives, and caused widespread economic hardship across the globe. There are ongoing
efforts among scientists of many disciplines to better understand the virus to find new ways to
control it.
One of the key proteins that is needed by the virus to infect its host is known as the spike protein
(https://www.nature.com/articles/s41401-020-0485-4). This is the larger of the two protrusions
you see on diagrams of the virus. Its job is to bind to certain receptors in the host cell to get
access to its machinery and enable infection and reproduction. These spike proteins are
considered one of the key targets for vaccine and antiviral drug development, so understanding
their structure could provide valuable insight toward fighting the virus.
For this project you will be provided with a set of Covid-19 spike protein data collected between
January 1st 2020 and June 30th 2021. The data include the following attributes: accession ID,
which is a code assigned to each published sequence by the popular data repository GenBank;
protein sequence; collection date; and, for some sequences, collection location. All sequences in
the provided dataset were collected in the United States. If you find this dataset is not adequate for
your project idea, you are encouraged to find your own as part of your project.
Project Idea 3.1. Visualize sequence mutation over time
For this project you will show how the sequence of the Covid-19 spike protein has changed over
the course of the pandemic. To do this, you will focus primarily on the sequence of reported
spike proteins and their collection date. You will notice that at any given time, one or two
sequence variants will be vastly more common than their counterparts. You should keep track of
what those are, and report this. Beyond that, there are several directions you can take with this
project, but here are a few basic ideas.
You may look at particular mutations that appear on the spike protein sequence at various times
and locations, visualizing this as you see fit. You may also look at the number of different
variations within different timeframes to see what this might tell you (though keep in mind the
need to account for bias in how your data were collected). You may also wish to look at how
different mutant sequences are from the dominant sequence at a given time period. If you have
other ideas along these lines, you are encouraged to explore those as well.
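As one possible starting point, the sketch below groups sequences by collection month, finds the dominant variant in each month, and counts distinct variants over time with pandas; the file name and the column names (sequence, collection_date) are assumptions about how you might organize the provided data.

# Sketch: count spike-protein variants per month and track the dominant one.
# "spike_proteins.csv" and its column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("spike_proteins.csv", parse_dates=["collection_date"])
df["month"] = df["collection_date"].dt.to_period("M")

# Frequency of each distinct sequence within each month.
counts = (df.groupby(["month", "sequence"])
            .size()
            .rename("count")
            .reset_index())

# Dominant (most frequent) sequence per month.
dominant = counts.loc[counts.groupby("month")["count"].idxmax()]
print(dominant[["month", "count"]])

# Number of distinct variants observed each month (watch for sampling bias).
variants_per_month = counts.groupby("month")["sequence"].nunique()
variants_per_month.plot(kind="bar", title="Distinct spike variants per month")
plt.tight_layout()
plt.show()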
The key deliverables for this project include: A detailed description of how you parsed the data,
and why you chose this method. An accounting for how you handled bias in your data, and what
effects that might have on your analysis. One or more descriptive visualizations that show how
the spike protein sequences changed over time. Interpretive details on what these visualizations
reveal about the data, and what that might imply about the real world.
Project Idea 3.2: Clustering of SARS-CoV and SARS-CoV-2 proteins
The SARS-CoV-2 virus, which causes Covid-19, is closely genetically related to another virus
called severe acute respiratory syndrome coronavirus (SARS-CoV). You may recall the SARS
pandemic threat of late 2002; this was the result of the SARS-CoV virus. There are slight
variations in SARS-CoV-2 that make it more pathogenic. The aim of this project is to compare
protein sequences of the two coronaviruses. Clustering is a good way to study similarities among
proteins in a network. You will be provided with datasets for both the viruses, which you will
label in a meaningful way using one or more clustering methods. The goal is to group together
proteins with a similar sequence, and identify which proteins are most similar (and most
different) between the two viruses. This is likely to require some background research on your
part, to verify which proteins are homologous between the two species. Be sure to choose a
similarity/distance measure that is meaningful for sequence data.
You will be evaluated on the quality (precision/recall) of the clusters you produce, but also the
interpretation and visualization of your results. Be sure to include background that will make the
interpretation of your results possible. For example, if you find that protein A from SARS-CoV
and protein B from SARS-CoV-2 are consistently in the same cluster together, talk about what (if
anything) those proteins are known/thought to do. There is no need for a deep understanding of
the biology, but a high level explanation will help make your findings meaningful.
4. High-resolution Visualization of Health Data
Data visualization and mapping are powerful ways to communicate information to users. The use
of such tools for communication of health information has grown exponentially during the
COVID-19 pandemic. Online mapping tools more specifically can provide users with the ability
to explore data findings and identify new patterns that otherwise would be difficult to detect.
This is even more challenging when the data we want to visualize is of high resolution and
contains complex spatial patterns.
The aim of this project is to create a tool to visualize high resolution health data. Of special
interest is visualization of Covid-19 data. The tool should be simple and intuitive to operate. The
primary goal of the tool will be to provide health professionals and policy makers with
mechanisms that will enable them to explore the data. To get an idea for the kind of tool the
project aims at, visit the website: http://www.chaselab.net/Reports.html
This tool was developed using hand-tailored JSON coding. You are encouraged to create a simpler
version using Tableau or any other visualization tool you find suitable. An example of a useful
tool here is PyViz (see the video posted under Resources on the Blackboard site of the course).
Additional information about the project and relevant data sets will be provided if you are
interested in picking the project. Professor Ofer Amram at the Department of Nutrition and
Exercise Physiology in the Elson Floyd College of Medicine at WSU will serve as a contact
person and mentor for the project (Dr. Amram’s email address is ofer.amram@wsu.edu).
5. Air pollution exposure and the built environment
Air pollution is one of the biggest environmental risks for morbidity and early death. Outdoor air
pollution is caused by many factors, including proximity to roads and industrial areas among
others. In recent years, several emerging methods and tools have been developed to better
understand how air pollution exposure is impacted by environmental features. The aim of this
project is to assess the relationship between air pollution exposure and the built environment.
You will receive an individual air pollution monitor to assess your air pollution exposure over
several days. Using GIS and spatial analysis techniques, you will develop a model to examine
what environmental features explain variation in air pollution exposure.
Additional information about the project and relevant data sets will be provided if you are
interested in picking the project. Professor Ofer Amram at the Department of Nutrition and
Exercise Physiology in the Elson Floyd College of Medicine at WSU will serve as a contact
person and mentor for the project (Dr. Amram’s email address is ofer.amram@wsu.edu).
6. Physical Activity Classification
Measuring physical activity is very important in health and behavioral research. Using these
measurements, we typically try to understand how physical activity impacts health and how it is
influenced by the built environment. In recent years, several emerging methods have been
developed to assess an individual’s physical activity using accelerometer and Global
Positioning System (GPS) data. However, the algorithms are constantly being refined in order to
increase classification accuracy.
The aim of this project is to quantify physical activity behaviors using GPS and accelerometer
data. The measured physical activity will be compared against a collection of activities the person
has recorded in a diary, allowing you to develop verifiable models to classify types of physical
activity. You will then develop an algorithm that takes GPS data as input. Using timestamps and
geographic coordinates, the algorithm should be able to classify different types of physical
activity (walking, running, biking, etc.) using accelerometer data, GPS data, or a combination
of both.
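As a very crude baseline, the following sketch derives speeds from consecutive GPS points (via the haversine formula) and labels each point by simple speed thresholds; the file layout, column names, and cutoff values are illustrative assumptions, and a serious model would also incorporate accelerometer features.

# Sketch: speed-based baseline for labeling GPS points as walking/running/biking.
# "gps_track.csv" and its columns (timestamp, lat, lon) are hypothetical.
import math
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

df = pd.read_csv("gps_track.csv", parse_dates=["timestamp"]).sort_values("timestamp")
lats, lons = df["lat"].to_numpy(), df["lon"].to_numpy()
dist = [0.0] + [haversine_m(lats[i - 1], lons[i - 1], lats[i], lons[i])
                for i in range(1, len(df))]
dt = df["timestamp"].diff().dt.total_seconds().to_numpy()
df["speed_mps"] = np.array(dist) / dt          # first row is NaN (no previous point)

def label(speed):
    # Illustrative cutoffs in meters/second; tune against diary labels.
    if np.isnan(speed) or speed < 0.3:
        return "stationary"
    if speed < 2.0:
        return "walking"
    if speed < 4.0:
        return "running"
    return "biking"

df["activity"] = df["speed_mps"].apply(label)
print(df[["timestamp", "speed_mps", "activity"]].head())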
Additional information about the project and relevant data sets will be provided if you are
interested in picking the project. Professor Ofer Amram at the Department of Nutrition and
Exercise Physiology in the Elson Floyd College of Medicine at WSU will serve as a contact
person and mentor for the project (Dr. Amram’s email address is ofer.amram@wsu.edu).
7. Cyber Threat Observability Modeling
Common Background:
Computer and network monitoring are necessary strategies to detect cyber-attacks. The number
of tools available, however, is quite large, and depending on which combination of tools are
used, one may either see too little data – leaving no evidence that an attack has occurred – or too
much, which may needlessly take up space and make searching for relevant logs difficult.
Furthermore, there may be significant problems caused by false positives, as some attacks may
share features in common with ordinary user behavior. Accordingly, it would be helpful to have
a model for measuring how effective a monitoring strategy might be for allowing attacks to be
detected.
Project Idea 7.1. Observability Model
A recent paper has suggested one possible solution to this problem. In it, two types of scores are
used to compute a matrix of “observability” scores for each pair of computers on the network
(including pairs that contain the same machine twice, suggesting an on-host attack), and for each
attack tactic that could be conducted between them. The first score employed considers the
likelihood of machines receiving an attack in a simulated environment. The second score
considers how often the features of the given attack tactic are found in ordinary behavior
between the two computers, and is inversely proportional to the likelihood of their occurrence
(that is, if it’s likely that a machine’s ordinary behavior could be considered an attack with some
monitoring strategy, then that strategy has less observability over that machine). A score of 0 is
given to a strategy that is incapable of detecting a particular tactic, however. The strategy with the
higher observability scores is then considered the most effective for monitoring.
As the concept of cyber threat observability is a relatively new topic, it is possible that better
models may exist than what has been proposed in the paper. Your task would be to develop a
model of your own to evaluate monitoring strategies. This would require collecting cyber data of
your own of different types (or finding it through open source datasets), and applying your model
to determine which combination of monitoring strategies is most effective for observing attacks
on the network you collected your data from.
Project Idea 7.2. Observability Model Application
Alternatively, you could apply the original observability model to a different environment. In the
paper, the model is applied to a network of several relay devices connected to a gateway, and a
desktop computer (human machine interface), and uses three types of monitoring tools –
Sysmon, Windows logs, and NetFlow. If this model were applied on a more traditional computer
network, with more (or different) monitoring tools, how might the observability scores differ,
and could the application to a different environment reveal flaws with the model?
8. Social Media Analysis
The internet is a giant source of social data, and one popular application of data science/machine learning is analyzing and making predictions based on social media.
Twitter is a particularly interesting platform, as it is easy to collect data. Collect a dataset, either
posts about a certain topic, or between a group of users. A lot can be studied from such data. You
could create a machine learning model to predict certain aspects of Twitter activity, such as, given data
from a few weeks, predicting the number of posts that a user will make the next week. Or, predict
whether a certain tweet will be retweeted or not, or how many likes it will get. There are many
things you could try to predict. You could also try to come up with a model (doesn’t have to be
machine learning) of how information spreads. For example, how likely is it that, if one user
starts talking about something, another user will also talk about it? This could be used for many
reasons, including to model how disinformation is spread, or how effectively important
information is spread correctly, or on a less serious note, how well a meme spreads.
Starter tips:
• Tweepy is an easy-to-use Python API for Twitter. Make sure you sign up for a developer
account; a minimal collection sketch follows these tips. http://docs.tweepy.org/en/v3.5.0/getting_started.html
o https://developer.twitter.com/en/apply-for-access
• Make sure you learn about the difference between the streaming API (fewer limitations but can
only collect events in real time as they happen) and the regular APIs (you can collect past data
but there are certain limitations).
• You could create a set of users by starting with one user, and then getting a list of their
friends, and the list of friends for all of those friends, etc. This way, you have a set of
users who are connected to each other. You could go back in time and get as many tweets
for each of them as possible.
• You could also come up with a set of keywords/hashtags you are interested in and use the
streaming api to get all instances of these hashtags (this will collect a lot of data)
• If you want to do machine learning, it may be helpful to read about doc2vec, a way to
turn text passages into vectors. https://medium.com/@mishra.thedeepak/doc2vec-simple-
implementation-example-df2afbbfbad5
• Sentiment (overall positivity or negativity) may also be an interesting metric
o https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-
analysis-using-vader-in-python-f9e6ec6fc52f
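The sketch below shows one way to collect a user's recent tweets with the Tweepy 3.x-style API referenced in the first tip; the credential strings and the account name are placeholders you would replace with your own.

# Sketch: collecting a user's recent tweets with Tweepy (3.x-style API, as in
# the linked docs). Credentials come from your Twitter developer account.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull up to 500 recent tweets from one account via the regular (non-streaming) API.
tweets = []
for status in tweepy.Cursor(api.user_timeline,
                            screen_name="SomeAccount",
                            tweet_mode="extended",
                            count=200).items(500):
    tweets.append({"created_at": status.created_at,
                   "text": status.full_text,
                   "retweets": status.retweet_count,
                   "likes": status.favorite_count})

print(len(tweets), "tweets collected")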
9. Geospatial data management
Dr. Jia Yu’s guest lecture on September 17, 2021 on the topic of geospatial data management
featured several examples of potential project ideas. The ideas include the following:
• Project Idea 9.1: Traffic hot spot detection in New York City
This idea will focus on applying spatial statistics to spatio-temporal big data in order to identify
statistically significant hot spots.
More details can be found here: https://sigspatial2016.sigspatial.org/giscup2016/
• Project Idea 9.2: Map visualization for New York City traffic
This idea is to visualize New York City taxi trips (https://www1.nyc.gov/site/tlc/about/tlc-
trip-record-data.page) with different map visualization effects. You could consider using Deck.gl
(https://deck.gl/) or Google Maps
(https://developers.google.com/maps/documentation/javascript/heatmaplayer)
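If you want to stay in Python, a minimal heatmap sketch using pydeck (the Python bindings for Deck.gl) might look like the following; the file name and the pickup_longitude/pickup_latitude columns are assumptions based on the older TLC trip-record schema.

# Sketch: a pickup-location heatmap of NYC taxi trips using pydeck.
# The CSV file name and its pickup_longitude / pickup_latitude columns are
# assumptions; newer TLC files may use zone IDs instead of coordinates.
import pandas as pd
import pydeck as pdk

df = pd.read_csv("yellow_tripdata_sample.csv",
                 usecols=["pickup_longitude", "pickup_latitude"]).dropna()

layer = pdk.Layer(
    "HeatmapLayer",
    data=df,
    get_position="[pickup_longitude, pickup_latitude]",
)
view = pdk.ViewState(latitude=40.75, longitude=-73.98, zoom=10)
pdk.Deck(layers=[layer], initial_view_state=view).to_html("taxi_heatmap.html")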
If you are interested in a project on this topic, a project can be defined in consultation with Dr.
Yu who will serve as contact person and mentor (email address: jia.yu@wsu.edu).
10. Scientific data visualization
The goal of this project is to identify the best-fit error-bounded scientific data reduction software
(including SZ, ZFP, MGARD, and TTHRESH) and configuration (i.e., error-bound mode and
error-bound values) that can help scientists (1) reduce the size of the data while (2) keeping the
fidelity of the compressed data for visualization (hint: use rate-distortion as a metric, rate:
bits/value, distortion: PSNR). Please refer to the last figure in this [website] for ratio/quality
comparison using rate-distortion curves.
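As a hint for computing the two axes of a rate-distortion curve, here is a small sketch; the file names are placeholders, and the compressors themselves would be run from their own command-line tools.

# Sketch: computing one point on a rate-distortion curve for a single field.
# "rate" is bits per value of the compressed file, "distortion" is PSNR of the
# decompressed data against the original. File names are placeholders.
import os
import numpy as np

orig = np.fromfile("PRECIP_field.bin", dtype=np.float32)       # original field
decomp = np.fromfile("PRECIP_field.out", dtype=np.float32)     # after compress/decompress

# Rate: compressed size in bits divided by number of values.
rate = os.path.getsize("PRECIP_field.sz") * 8 / orig.size

# Distortion: PSNR with the field's value range as the peak signal.
mse = np.mean((orig - decomp) ** 2)
value_range = orig.max() - orig.min()
psnr = 20 * np.log10(value_range) - 10 * np.log10(mse)

print(f"rate = {rate:.3f} bits/value, PSNR = {psnr:.2f} dB")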
Candidate datasets for the project are:
1. Hurricane Isabel
Data description: The data set for this contest is a simulation of a hurricane from the
National Center for Atmospheric Research in the United States. The data consists of
several time-varying scalar and vector variables over large dynamic ranges. The sheer
size of the data also presents a challenge for interactive exploration. Please select any
three fields you want (e.g., PRECIP, QVAPOR, QICE). The dataset can be downloaded
using this [link]. More info about the data can be found:
http://vis.computer.org/vis2004contest/data.html.
2. Earthquake TeraShake
Data description: The data is the TeraShake 2.1 earthquake simulation data set. It is time-
varying velocity vector data (values in meters/sec). The velocities created by the
simulation are broken into 3 components (x,y,z) and are divided into files by component
and time step. Thus, there are 3 files (x,y,z) for each time step. Please use the three fields
(x, y and z) of a single time-step at T=80 seconds (i.e., TS21z_X_R2_008000.bin,
TS21z_Y_R2_008000.bin, TS21z_Z_R2_008000.bin). The dataset can be downloaded in
this [website]. More info about the data can be found: http://sciviscontest-
staging.ieeevis.org/2006/data.html
Candidate error-bounded data reduction software packages are:
- SZ (https://github.com/szcompressor) [1]
A prediction based lossy compression framework for scientific data
- ZFP (https://github.com/LLNL/zfp) [2]
A block transform based lossy compressor for scientific data
- MGARD (https://github.com/CODARcode/MGARD) [3]
A multilevel lossy compression of scientific data based on multigrid methods
- TTHRESH (https://github.com/rballester/tthresh) [4]
A tucker tensor decomposition based lossy compressor for scientific data
- Other methods (e.g., PCA) for comparison are acceptable
Note that all of the GitHub pages include instructions on how to install and run (compress/decompress)
1D/2D/3D datasets with a specific error-bound mode and error-bound value.
Dr. Dingwen Tao at the School of EECS at WSU will serve as a contact person and mentor for
the project (email address is dingwen.tao@wsu.edu). Dr. Tao’s guest lecture on September 27
has related information about the project.
References:
[1] Di, Sheng, and Franck Cappello. "Fast error-bounded lossy HPC data compression with SZ." In 2016
IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 730-739. IEEE, 2016.
[2] Lindstrom, Peter. "Fixed-rate compressed floating-point arrays." IEEE Transactions on
Visualization and Computer Graphics 20, no. 12 (2014): 2674-2683.
[3] Ainsworth, Mark, Ozan Tugluk, Ben Whitney, and Scott Klasky. "Multilevel techniques for
compression and reduction of scientific data - quantities of interest." SIAM Journal on Scientific
Computing 41, no. 4 (2019): A2146-A2171.
[4] Ballester-Ripoll, Rafael, Peter Lindstrom, and Renato Pajarola. "TTHRESH: Tensor
compression for multidimensional visual data." IEEE Transactions on Visualization and Computer
Graphics 26, no. 9 (2019): 2891-2903.
11. Data science projects at Chelan PUD
Peter Vanney, Senior Data Analyst at Chelan PUD, in his guest lecture on September 29 outlined
a number of project ideas that are of direct relevance to PUD and among the most active fronts
they are working on. The project ideas include:
Idea 11.1. Substation power usage: key elements
• Clustering algorithms
• Predicting peak power
• Look for historical growth
• Time series analysis
• Spatial component
Idea 11.2. Total Dissolved Gas modeling: key elements:
• Literature review
• Ideal lag time analysis
• Impact of spill on TDG (machine learning model)
• Extra data from the US Army Corps:
https://www.nwd.usace.army.mil/CRWM/Water-Quality/
Idea 11.3. Hydrologic models: key elements:
• Water routing (flow shapes and timing)
• Inflow prediction
• Forecast accuracy
• Extra data from the USGS REST API to analyze streamflow:
https://waterservices.usgs.gov/rest/
If you are interested in any of these project ideas, the idea can be further defined and scoped in
consultation with Peter who also will serve as contact person and mentor for the projects (email:
Peter.vanney@chelanpud.org)
12. Self-Attention
Common Background:
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of
learning order dependence in sequence prediction problems. They are the industry standard for
making predictions with machine learning on time series data. LSTMs were invented in 1997 in
order to better handle very long time series. Here is some more information on how LSTMs work, and
why they were invented https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-
the-problem-of-vanishing-gradients-a6784971a577
Project Idea 12.1: Time Series with Self-Attention
One problem with LSTMs, however, is that each timestep in a time series needs to be processed
sequentially. This limits how fast LSTMs can be processed, and prevents them from leveraging
parallel matrix processing algorithms, which can be a major limitation.
Self-Attention is a newer machine learning technique that is commonly used in natural language
processing and image captioning. It works by creating internal representations of different parts
of the input data and detecting trends/similarities. It has been used recently in many machine
learning areas, but hasn't been used much in time-series classification problems, apart from a few
papers that have shown it is feasible (https://arxiv.org/abs/1711.03905). However, there is
still a lot to be explored. This ability to detect trends is most likely underutilized as a powerful
tool for time series prediction.
One potential project could be to compare the relative time complexities and accuracies of Self-
Attention and LSTMs for time series classifications. Also, self-attention has internal
representations of vectors that are used to calculate attention, so even with a fixed input and
output size, these internal representations can be adjusted. There is a lot with time-series
classifications using Self-Attention that can be adjusted (as compared to LSTMs where the
internals are pretty much always the same), so you can look into what can be done to increase
accuracy. A great explanation of self-attention is available on this website
(https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a). A good way to start
this project could be to implement self-attention yourself in a Python class (I recommend using
the Keras library) to learn how it works and be able to customize it easily.
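For example, a single-head scaled dot-product self-attention layer can be written as a short custom Keras layer along the following lines; the projection size and the toy input are illustrative.

# Sketch: a single-head scaled dot-product self-attention layer as a custom
# Keras layer, in the spirit of "implement it yourself to learn how it works".
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    def __init__(self, d_model, **kwargs):
        super().__init__(**kwargs)
        self.d_model = d_model
        self.wq = layers.Dense(d_model)   # query projection
        self.wk = layers.Dense(d_model)   # key projection
        self.wv = layers.Dense(d_model)   # value projection

    def call(self, x):                    # x: (batch, timesteps, features)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = tf.matmul(q, k, transpose_b=True)            # (batch, T, T)
        scores /= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        weights = tf.nn.softmax(scores, axis=-1)              # attention weights
        return tf.matmul(weights, v)                          # (batch, T, d_model)

# Toy usage: 8 series of 50 timesteps with 3 channels each.
x = np.random.randn(8, 50, 3).astype("float32")
out = SelfAttention(d_model=16)(x)
print(out.shape)   # (8, 50, 16)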
Project Idea 12.2. Image Segmentation with Self-Attention:
Most problems regarding image processing with neural networks have been using convolutional
neural networks for quite some time. One such task is image segmentation, where the input is an
image, and the output is an image of the same size that divides the image into parts. For example,
if you have an image of five people, the network could output an image where the background is
all black, and each of the people are highlighted a different color.
Recently, researchers have been adding in self attention layers into image segmentation and
image classification problems. The conceptual idea is that the self-attention layers help this process
by relating global properties of the image to smaller-scale properties of the image. This is explained
in a bit more detail in this article: https://towardsdatascience.com/self-attention-in-computer-vision-
2782727021f6
One potential project could be to try to show improvement on a segmentation / classification task
using Self-Attention. First, create a baseline of image segmentation performance by using an
image segmentation dataset and a regular, published CNN framework for image segmentation.
Second, modify the CNN framework to include Self-Attention somewhere in the model to
try to improve performance.
13. Generative Adversarial Networks
Common Background: Due to recent advances in mobile devices, data analysis and sensors,
remote health monitoring has grown in prevalence. Many of these approaches rely on supervised
machine learning algorithms, which are trained with labeled data. However, there are a few
issues with this:
○ Labelling data can be time-consuming, expensive and, depending on the domain,
can require expert knowledge.
○ Sensor data often contains sensitive user information, which can limit the large-
scale deployment of such systems.
Thus, there is a need to generate realistic synthetic labeled sensor data. Generative Adversarial
Networks (GANs) have shown promise for doing just this.
In a recent research project (with a paper currently being drafted), we developed a new GAN
framework for generating human activity data (collected from accelerometers), which is guided by
the intelligence of a pre-trained classifier. The intuition is that if an accurate classifier can
properly recognize the class label of the data being generated, then the data is likely realistic.
The assessment of multivariate time-series data can also be challenging, so we also proposed
metrics which are easy to compute and interpret. Although we were able to successfully generate
synthetic data that, based on our metrics, was similar to real data, there are many questions left to
answer / new directions to explore. If you are interested in any of the following projects, we will
provide a current draft of the paper and other materials necessary for working with the GAN.
Project Idea 13.1: Testing the framework in another (time-series) domain
Thus far, we have only generated accelerometer data for different sports activities. It would be
interesting to apply this to another time-series domain in which the entire time-series itself has a
label and thus can be guided by the classifier in order to generate better data.
Side notes:
○ If the time-series data is not labeled, the classifier could be removed and the
statistical regularization term could be included by itself.
○ Or if the classifier could recognize some condition within the data (i.e. patient
demographic information), it could still help us to create a framework where data
can be generated for different states / conditions.
Project Idea 13.2: Increasing Train on Synthetic Test on Real (TSTR) metric:
As of now, we have been able to create fairly realistic synthetic sensor data that has similar
statistical properties to the real data, and can be properly recognized by a classifier that was pre-
trained on real data. However, in past testing, when we would train on the synthetic data and test
on the real data, the accuracy would be very low!
Lately we have added new statistical regularization terms to the generator, but the approach has
not been re-tested since the addition of the regularization terms. So a first step would be to compare
the TSTR score with and without these terms. For more information about the TSTR metric, read
this paper, where the metric was first introduced. If the TSTR score is still not better, it would be
necessary to further explore the differences between the real and synthetic data that could be
causing this issue. Based on these differences, new regularization terms could be developed /
new training methodologies could be tested.
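For reference, the TSTR computation itself is straightforward; the sketch below uses a scikit-learn classifier and random placeholder arrays standing in for the synthetic and real datasets.

# Sketch: the Train-on-Synthetic, Test-on-Real (TSTR) metric with a simple
# scikit-learn classifier. X_synth / y_synth would come from the GAN, and
# X_real / y_real from held-out real data; random data stands in here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def tstr_score(X_synth, y_synth, X_real, y_real):
    """Train a classifier on synthetic data, report accuracy on real data."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_synth, y_synth)
    return accuracy_score(y_real, clf.predict(X_real))

# Placeholder data: 500 synthetic and 200 real windows, flattened to features.
rng = np.random.default_rng(0)
X_synth, y_synth = rng.normal(size=(500, 60)), rng.integers(0, 4, 500)
X_real, y_real = rng.normal(size=(200, 60)), rng.integers(0, 4, 200)
print("TSTR accuracy:", tstr_score(X_synth, y_synth, X_real, y_real))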
Project Idea 13.3: Build some type of visualization GUI for this framework:
During training, there are many metrics to keep track of (discriminator accuracy/loss, STS
similarity, RTS similarity, classifier accuracy, etc.). Moreover, it can be helpful to visually
examine generated samples side by side with real samples to gain further insights.
To allow for better understanding during training as well as more intuitive analysis, a dashboard
that included real-time plots for several of these metrics, showed visual examples of the
generated data, and allowed for different testing parameters (batch size, learning rate, etc.) to be
changed without touching the underlying code would be extremely helpful and interesting.
Project Idea 13.4: Improve the statistical regularizer:
As mentioned in project idea 13.2, our generator uses a statistical regularization term in addition
to the classifier for training. Currently, that statistical regularizer uses several features that are
tailored to one specific dataset, and we are not entirely sure whether the feature set is optimal to
begin with. Given our current configuration, find a more optimal set of statistical features and
demonstrate its performance. A regularizer that performs well on datasets containing different
types of data is ideal.
Speaking of GANs:
Here is a cool video talking about very recent, exciting work on video compression:
https://www.youtube.com/watch?v=NqmMnjJ6GEg&feature=emb_title
14. Kaggle Challenges
Kaggle (http://www.kaggle.com) provides a forum for data scientists to compete in, featuring
challenging real-world problems on fascinating datasets. The competitions typically come with
some kind of scoring to test how good your model is relative to other submissions. The current
challenges are available here: https://www.kaggle.com/competitions
Browse through the list of challenges and see if there is a challenge you want to pick as your
project. I will look myself and provide a few suggestions for projects later.
15. Datasets that might inspire other project ideas
The following websites have lists of datasets or pointers to datasets available on the Internet.
Use these datasets either as use cases for your project or see if some of the datasets inspire a
project idea for you.
https://www.dataquest.io/blog/free-datasets-for-projects/
https://bigml.com/gallery/datasets
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (NYC Taxi trips data)
http://www.aws.amazon.com/publicdatasets
www.openstreetmap.org
http://www.basketball-reference.com/
https://baseball-databank.org/
Global Health Facts (http://www.globalhealthfacts.com/)
UN data (http://data.un.org)
OECD Statistics (http://stats.oecd.org/)
World Bank (http://data.worldbank.org/)