Section 1. Short-answer questions
Question 1.
For each of the following example mutations, select the most appropriate statement from
the options below:
Mutation 1: Single nucleotide mutation in the anticodon loop of a mitochondrial tRNA.
Mutation 2: DNA insertion in the upstream region of a gene encoding a collagen protein.
Mutation 3: Non-conservative amino acid mutation in a conserved domain of a transcription
factor.
A. The mutation is likely to impact the gene transcript level and the cellular
consequence is likely to affect the transcript level or function of only a single gene.
B. The mutation is likely to impact the gene transcript level and the cellular
consequence is likely to affect the transcript level or function of multiple transcripts
or gene products.
C. The mutation is likely to impact the function of the gene product and the cellular
consequence is likely to affect the transcript level or functional product of only a
single gene.
D. The mutation is likely to impact the function of the gene product and the cellular
consequence is likely to affect the transcript level or function of multiple transcripts
or gene products.
You may write your answers in this simple format, e.g. “Mutation 4 answer is E”.
Question 2.
The image below is a view in IGV of a region of the human genome loaded with 3
sequencing data tracks and a gene annotation track.
• For each of the 3 data tracks (A, B and C), state whether it is WGS, WES or
RNA-seq data.
• For each datatype, describe a feature that is apparent in the image that
distinguishes it from each of the other datatypes.

Question 3.
Read the methods section below and write out the bioinformatics analysis workflow as a
bullet point list in the correct order.
For each step, include the following details where available: workflow step, sequencing
technology, tool name, file type for data inputs and outputs.
Total RNA was extracted using the Qiagen RNeasy Plant Mini kit. RNA integrity numbers
of the extracted RNA, measured using an Agilent 2100 Bioanalyzer, were between 8.6 and
10. 400 ng of total RNA from each sample was used for RNAseq library preparation with
the TruSeq Stranded Total RNA with Ribo-Zero Plant kit. 125-base paired-end reads were
generated on an Illumina HiSeq 2500. Quality was assessed using fastqc and reads were
trimmed and adaptors removed using cutadapt. The MSU v7 annotation of the Oryza sativa
ssp. japonica cv. Nipponbare reference genome with STAR in 2-pass mode was used for
mapping reads and counting reads per gene. Lowly expressed genes were filtered and
DESeq2 used for normalisation and differential expression analysis of genes using a DE
threshold of log2 >2.0 and FDR <0.05.
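The final thresholding step in the methods above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "log2 >2.0" refers to the absolute log2 fold change (the methods text does not say whether direction matters), and the gene IDs, values and function name are hypothetical.

```python
def is_differentially_expressed(log2_fold_change, fdr,
                                lfc_cutoff=2.0, fdr_cutoff=0.05):
    """Return True when a gene passes both DE thresholds.

    Assumption: the cutoff applies to the absolute log2 fold change,
    so both up- and down-regulated genes can pass.
    """
    return abs(log2_fold_change) > lfc_cutoff and fdr < fdr_cutoff


# Hypothetical results rows (gene IDs and values are made up for illustration).
results = [
    {"gene": "gene_a", "log2_fold_change": 3.1, "fdr": 0.001},
    {"gene": "gene_b", "log2_fold_change": -2.4, "fdr": 0.020},
    {"gene": "gene_c", "log2_fold_change": 2.6, "fdr": 0.200},
    {"gene": "gene_d", "log2_fold_change": 0.8, "fdr": 0.001},
]

# Keep only the genes that satisfy both the fold-change and FDR thresholds.
de_genes = [
    row["gene"]
    for row in results
    if is_differentially_expressed(row["log2_fold_change"], row["fdr"])
]
print(de_genes)
```

Both comparisons are strict, matching the ">" and "<" in the text, so a gene sitting exactly on either cutoff is not reported.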
Question 4.
Interpret the assembly graph below.

Repeats can cause genome assembly problems. The figure above shows an assembly
graph (left) and some theoretical genome arrangements (right).
• Which genome arrangement (A-D) could have resulted in this assembly graph?
• Explain how either paired-end short reads, or long reads could be used to untangle
this assembly graph.
Question 5.
Bulk RNA-seq and single-cell RNA-seq are often used to measure RNA expression levels
as a proxy for the protein products of coding genes.
State 3 reasons why a researcher would measure RNA levels when they are actually
interested in protein levels.
Question 6.
The mRNA expression levels of genes do not always correlate with the expression levels of
their active protein products in cells and tissues.
Name two cellular processes that could affect the correlation between cellular
expression levels of mRNA and its active protein product, and describe how they
could affect this correlation.
Question 7.

You have been given a draft assembly for a prokaryote genome, along with some summary
information for the assembly, shown in the table above.
You wish to improve the draft assembly by performing more sequencing.
• State which sequencing approach – short or long read – will improve the assembly
most.
• Explain why you have chosen this approach.
Question 8.
For each of the RNA sequencing cases below, describe a different method of assigning
reads to genes for read counting.
Assume a well-annotated reference genome is available.
State which of the two approaches described is more computationally intensive. Assume
the genome sizes of the two species are comparable.
A. RNA-seq readset from tissues of a model organism.
B. RNA-seq readset from tissues of a newly discovered eukaryotic species.
Question 9.
Your task is to predict the function of an open reading frame (ORF) from a recently
discovered worm. Its closest relative is the model organism C. elegans, but the genomes of
these two worms are very different.
• A BLAST search against C. elegans finds that the gene svh-2 has the
best homology to your ORF:
o 10% DNA sequence identity.
o 30% AA sequence identity.
• You use I-TASSER to predict the tertiary structure of the protein product of the ORF,
and the generated model has 1.31 RMSD to svh-2.
• The figure below shows the span of your ORF compared to svh-2, with protein
domains indicated.

A. Is the ORF likely to have the same function as svh-2?
B. Do you believe that your ORF would have the same cellular localisation as svh-2?
Explain your answers by referencing the data provided.
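As a reminder of what the sequence identity figures above measure, the sketch below computes percent identity over a pairwise alignment. It assumes one common definition (identical positions divided by aligned, non-gap positions); alignment tools may use other denominators. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns containing a gap in either sequence are excluded
    from both the numerator and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy amino acid alignment: 5 ungapped columns, 4 of them identical.
identity = percent_identity("MKT-LV", "MRTALV")
print(identity)
```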
Question 10.
Single-cell RNA-sequencing (scRNA-seq) data and its subsequent analysis share some
similarities with bulk RNA-seq analysis, but also differ in a number of ways.
Name and discuss 3 similarities and 3 differences between bulk RNA-seq and
scRNA-seq. (List a total of 6 similarities/differences.)
Section 2. Long-answer question
NB. Some facts in the scenario described have been simplified or altered for the purposes of
examination.
Gene duplication resulting in increased copy number (CN) can be associated with adaptation
to environmental change. Approximately 10,000 years ago the human amylase gene (AMY)
locus underwent a gene duplication event that resulted in several copies of the amylase 1
gene (AMY1).
AMY1 genes produce the protein alpha-amylase, an enzyme that is important for digestion
as it breaks down starch molecules into sugars. It is mainly produced by the salivary glands
in the mouth. It is thought that high AMY1 CN results in increased amylase activity in saliva
and makes starchy foods taste sweeter. This may have resulted in nutritious food choices at
a time when early human populations were transitioning to a more agricultural lifestyle and
adapting to a starch-rich diet, thus benefiting individuals with high AMY1 CN.
There are a number of different AMY haplotypes in humans today. The locus also encodes
AMY1 paralogues, the pancreatic alpha-amylases AMY2A and AMY2B. See figure below.

Human AMY locus on chr1. Human amylase haplotypes have one copy each of AMY2A and
AMY2B and an odd number of AMY1 copies. Increased AMY1 CN arises from the presence of a
genomic segment containing 2 copies of AMY1 (transcribed on opposite strands).
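The copy-number arithmetic implied by this caption can be sketched as below. This is an illustration under stated assumptions: the smallest haplotype is taken to carry a single AMY1 copy (the caption only says the number is odd), each additional duplicated segment adds 2 copies, and the function names are hypothetical.

```python
def haplotype_amy1_copies(duplicated_segments, base_copies=1):
    """AMY1 copies on one haplotype.

    Assumption: a haplotype has `base_copies` AMY1 genes plus 2 more
    for every duplicated two-copy segment it carries.
    """
    if duplicated_segments < 0:
        raise ValueError("duplicated_segments must be non-negative")
    return base_copies + 2 * duplicated_segments


def diploid_amy1_copies(segments_hap1, segments_hap2):
    """Total AMY1 copies in a diploid cell: the sum over both haplotypes."""
    return (haplotype_amy1_copies(segments_hap1)
            + haplotype_amy1_copies(segments_hap2))


# One haplotype with no duplicated segment, one with a single segment.
print(haplotype_amy1_copies(0), haplotype_amy1_copies(1))
print(diploid_amy1_copies(0, 1))
```

Under these assumptions each haplotype count is odd, so the diploid total is always even.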
All AMY1 gene copies encode the same amino acid sequences, but there are some amino
acid differences between the alpha amylase produced by AMY1 and that produced by
AMY2A and AMY2B. This is reflected in the selected region from a multiple sequence
alignment of the human amylase genes shown below. Genomic copies of AMY1 are labelled
A, B, and C to differentiate them.

Published studies have shown that low AMY1 CN is associated with risk of both obesity and
type 2 diabetes. However, there is little understanding of the influence of AMY1 CN on diet
and metabolic health.
Part A:
The human reference genome hg38 depicts the AH3 amylase haplotype. If a person is
homozygous for AH3 how many copies of the AMY1 gene would they have in somatic cells?
You are part of a research team investigating diet and genetic risk of metabolic diseases.
Your job is to determine the amylase genotype for a large study group selected from
students on a University campus.
Describe an appropriate sequencing technology, and an associated bioinformatics analysis
workflow to accurately measure the AMY1 CN and determine the AMY genotype for all
participants in the study group. Your answer should include the important steps in sample
collection, analysis, and any factors you might need to consider when designing the study,
analysing the data and interpreting the results.
Describe (or illustrate) an appropriate data structure for presenting your results and any
other relevant QC and metadata, to the research team.
Explain how the AMY1 CN results you generated might be useful to the research team
investigating genetic risk of metabolic diseases.
Part B:
Bulk RNA-seq studies examining gene expression in the 3 main human salivary glands show
that expression of all AMY paralogues is detectable (see table below). In addition, protein
studies have determined that there could be more than 20 different amylase proteoforms
present in saliva.

Table of normalised gene expression values for amylase genes in human salivary
glands, determined by RNA-seq.
The results of your genotyping experiment above revealed that the student study
population includes all possible AMY genotypes. Design an experiment to test the
hypothesis that AMY1 CN correlates with levels of the AMY1 protein product in saliva, using
the same study population.
Your answer should include the important steps in sample preparation, choice of analysis
method, tools and databases used for analysing the results. Also describe any factors you
might need to consider when designing the experiment and interpreting the resulting data.
Part C:
Saliva is essential for maintaining oral health. It is made up of salivary gland products as
well as products originating from other tissues, including blood. Saliva also includes
products arising from the oral microbiome. Blood and saliva proteomes overlap
significantly, and saliva is currently under investigation as a potential source of diagnostic
markers for monitoring human health, disease and pathogens.
Discuss the potential benefits and challenges relating to developing a clinical test based on
saliva, for disease detection or health monitoring. Your answer should include discussion of
aspects relating to personalised medicine and population health.
You are encouraged to include examples of existing diagnostic saliva tests you are aware of
and/or any potential tests you can think of, to highlight points in your discussion.