
Date: 2021-11-19

Project Idea 1: Protein Clustering

The data can be found at:
https://drive.google.com/drive/folders/1uGITYXDyDguoB68C-mzOOnYga3ukno2d?usp=sharing

The first two files are from the paper mentioned in the project description. This is a large dataset,
so we recommend working with a subset of the data. You can select a few protein families and
extract their sequences from the FASTA file.
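
As a rough illustration of the subsetting step, the Python sketch below reads FASTA records and keeps only those whose header mentions a chosen family label. It assumes the family label appears somewhere in the header line, which may not be true for the actual files; the `family=...` tags, the helper names, and the toy sequences are all hypothetical.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_families(records, families):
    """Keep records whose header contains one of the family labels.

    Assumption: the family label is part of the header text.
    """
    return [(h, s) for h, s in records if any(f in h for f in families)]


def write_fasta(records, handle, width=60):
    """Write records back out in FASTA format, wrapping sequence lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Toy input standing in for the real file (hypothetical headers and sequences).
example = StringIO(
    ">p1 family=kinase\nMKTAYIAK\nQRQISFVK\n"
    ">p2 family=globin\nMVLSPADK\n"
    ">p3 family=kinase\nMKTAYLAK\n"
)
subset = select_families(read_fasta(example), ["family=kinase"])
```

For a real file, replace the `StringIO` object with `open(path)` and pass the result of `select_families` to `write_fasta` to save the subset. If the family assignments live in a separate annotation file rather than in the headers, filter on sequence identifiers instead.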

The Amarg FASTA file contains short sequence repeats from the species A. marginale. You can
read more about the strains and the dataset in the paper:
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2686-2

You can also select your own dataset, as discussed in the meeting. Make sure the sequences you
select are fundamentally similar, so that the resulting clusters group proteins with similar functions.
You are free to select and compare any clustering methods.

If you are using a network model (DiWANN) to visualize the sequences, you will find the code at
https://drive.google.com/drive/folders/10UY-6xfx4W8cmf7X1dEujXbbe7aYQj39?usp=sharing

You can read more about the model in the paper below:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2453-2
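
To give a feel for what a nearest-neighbor sequence network looks like, here is a simplified Python sketch that links each sequence to its closest neighbor(s) by edit distance. This is not the DiWANN code from the shared folder and makes no attempt to reproduce the paper's algorithm or its optimizations; it computes every pairwise distance by brute force, and the function names and toy sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def nearest_neighbor_edges(seqs):
    """Return directed edges (source, target, distance).

    Each sequence points to every sequence at its minimum distance,
    so ties produce more than one outgoing edge.
    """
    names = list(seqs)
    edges = []
    for u in names:
        dists = {v: edit_distance(seqs[u], seqs[v]) for v in names if v != u}
        if not dists:
            continue
        best = min(dists.values())
        edges.extend((u, v, best) for v in names if v != u and dists[v] == best)
    return edges


# Toy sequences (hypothetical); real input would come from the FASTA file.
toy = {"a": "ACGT", "b": "ACGA", "c": "TTTT", "d": "TTTA"}
edges = nearest_neighbor_edges(toy)
```

The resulting edge list can be written to a file and loaded into a graph library for visualization. The brute-force loop is quadratic in the number of sequences, so it is only practical for a small subset.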

If you are not using networks, you can implement other sequence clustering methods such as
hierarchical clustering. You are free to choose the methods and tools.

You can read more on hierarchical clustering here:
https://www.datacamp.com/community/tutorials/hierarchical-clustering-R
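
The idea behind hierarchical clustering of sequences can be sketched in a few lines of Python. The example below is a minimal illustration, not a recommended pipeline: it assumes an alignment-free dissimilarity (Jaccard distance on k-mer sets) and average linkage, both of which are choices made here for brevity. The function names and toy sequences are hypothetical. In practice you would more likely use a library routine such as `hclust` in R or `scipy.cluster.hierarchy.linkage` in Python.

```python
from itertools import combinations


def kmers(seq, k=2):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=2):
    """1 - |shared k-mers| / |all k-mers|; 0 means identical k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def average_linkage(seqs, n_clusters, k=2):
    """Agglomerative clustering: repeatedly merge the two clusters with
    the smallest mean pairwise distance until n_clusters remain."""
    names = list(seqs)
    dist = {frozenset((u, v)): jaccard_distance(seqs[u], seqs[v], k)
            for u, v in combinations(names, 2)}
    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [dist[frozenset((u, v))]
                     for u in clusters[i] for v in clusters[j]]
            d = sum(pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Toy protein fragments (hypothetical), two per made-up family.
toy = {
    "k1": "MKTAYIAKQR",
    "k2": "MKTAYLAKQR",
    "g1": "MVLSPADKTN",
    "g2": "MVLSPEDKTN",
}
clusters = average_linkage(toy, n_clusters=2)
```

Swapping `jaccard_distance` for an alignment-based distance, or average linkage for single or complete linkage, changes the resulting clusters, which makes this a convenient place to compare methods.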

Additional resources:

FaBox (https://birc.au.dk/~palle/php/fabox/) is a good tool for working with FASTA sequences.

If you're using networks for analysis, the igraph package (https://igraph.org/) is useful for
analyzing and visualizing the network.
