Date: 2021-11-18

BIOL4292: Assignment 2
Dr Rónán Daly
Ronan.Daly@glasgow.ac.uk
WWCRC, Garscube Campus
Specification for Assignment 2
This, the second task in a portfolio of three, assesses the following programming fundamentals:
• File I/O
• Complex objects besides lists
• Functions
• Exceptions
• Object-based Programming
as well as those things covered by Assignment 1.
When implementing this specification, make sure that the only output generated by your program is the output you have
been asked to generate. When completing this assignment, you cannot use any libraries to complete the task, other than the
standard libraries included with Python itself.
Your task is to write a Python program that implements a FASTQ parser that does the following:
Open a series of FASTQ files given on the command line. There should be an option called --quality_filter that takes a
number, whose usage will be explained later. The format of the command line should be something like this – note that there
may be many FASTQ files, indicated by ...:
python <GUID>.py file1.fastq file2.fastq ... --quality_filter=45
Make sure <GUID> is given as your GUID.
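As an illustration of the command line described above, here is a minimal sketch using the standard library's argparse module. The function name parse_arguments, the argument name fastq_paths, and the choices to make --quality_filter required and to read it as a float are assumptions made for this example; the specification does not fix them.

```python
import argparse


def parse_arguments(argument_list=None):
    """Parse the FASTQ paths and the quality threshold from the command line."""
    parser = argparse.ArgumentParser(description="Parse FASTQ files.")
    # One or more FASTQ files may be given as positional arguments.
    parser.add_argument("fastq_paths", nargs="+", help="FASTQ files to parse")
    # Assumption: the threshold is mandatory and may be fractional.
    parser.add_argument("--quality_filter", type=float, required=True,
                        help="minimum mean quality for FASTA output")
    return parser.parse_args(argument_list)


# Passing a list stands in for sys.argv[1:] so the sketch can be run directly.
arguments = parse_arguments(["file1.fastq", "file2.fastq", "--quality_filter=45"])
print(arguments.fastq_paths, arguments.quality_filter)
```

Calling parse_arguments() with no argument reads the real command line instead.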
For each FASTQ file:
• Create an object from a class representing that file
• For each FASTQ sequence in the file
– Create an object from a class representing that sequence. The object should store all the important information from
the FASTQ entry. Store the sequence object in the file object.
– Print out the following information on a single line, separated by spaces:
* The name of the file
* The FASTQ identifier
* The number of missing bases
* The percentage GC content (without the % sign)
• Output a FASTA file corresponding to the FASTQ file. The name of the FASTA file should be the same as the FASTQ file,
except the extension should be .fasta. The sequences in the FASTA file should be those sequences in the FASTQ file that
have an average (arithmetic mean) quality that is greater than the number given by --quality_filter. The identifier of
a FASTA sequence should be the identifier from the FASTQ sequence.
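The per-sequence statistics in the list above could be computed by the sequence object itself. The following is a rough sketch, not a reference solution: the class name FastqSequence and its method names are hypothetical, and it assumes that GC content is measured against every letter in the sequence, including N, which the specification does not state.

```python
class FastqSequence:
    """Hypothetical class holding one FASTQ entry."""

    def __init__(self, identifier, bases, quality_string):
        self._identifier = identifier
        # The format allows upper- or lowercase letters, so normalise once.
        self._bases = bases.upper()
        self._quality_string = quality_string

    def count_missing_bases(self):
        """Return the number of N (missing call) letters."""
        return self._bases.count("N")

    def gc_percentage(self):
        """Return the percentage of letters that are G or C.

        Assumption: the denominator is the full sequence length, N included.
        """
        if not self._bases:
            return 0.0
        gc_count = self._bases.count("G") + self._bases.count("C")
        return 100 * gc_count / len(self._bases)

    def summary_line(self, file_name):
        """Build the space-separated report line for this sequence."""
        return (f"{file_name} {self._identifier} "
                f"{self.count_missing_bases()} {self.gc_percentage()}")


example_sequence = FastqSequence("read1", "gattNCCA", "IIIIIIII")
print(example_sequence.summary_line("reads.fastq"))
```

Keeping the arithmetic inside the class means that the calling code never needs to read the stored attributes directly.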
Make sure your code works correctly if bad input (e.g. a bad FASTQ file) is used. If a single FASTQ entry is bad, instead of
outputting the entry to the FASTA file and instead of printing out the information above, the entry should be discarded and the
following information should be printed on a single line, separated by spaces:
• The name of the file
• The FASTQ identifier if available
• A message saying specifically what is wrong
You should continue parsing the file from the next entry (i.e. the next id line beginning with @).
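One possible way to discard a bad entry and resume at the next id line is sketched below. The names FastqFormatError, check_entry and scan_entries are hypothetical, and the checks shown are only two of those a full parser would need (validating the sequence letters is omitted, for example). The sketch also assumes that any line beginning with @ outside a valid entry is an id line; a quality line that happens to start with @ inside a damaged entry would be misread, so treat this as a starting point only.

```python
class FastqFormatError(Exception):
    """Raised when a FASTQ entry breaks the four-line layout."""


def check_entry(block):
    """Validate the three lines that follow an id line."""
    if len(block) < 3:
        raise FastqFormatError("entry is truncated")
    bases, separator, quality = block
    if not separator.startswith("+"):
        raise FastqFormatError("third line does not begin with +")
    if len(bases) != len(quality):
        raise FastqFormatError("quality length differs from sequence length")


def scan_entries(lines, file_name):
    """Return (valid identifiers, error messages) for a list of stripped lines."""
    valid_identifiers = []
    error_messages = []
    position = 0
    while position < len(lines):
        header = lines[position]
        if not header.startswith("@"):
            position += 1  # not an id line: keep looking for the next one
            continue
        fields = header[1:].split()
        identifier = fields[0] if fields else ""
        try:
            check_entry(lines[position + 1:position + 4])
        except FastqFormatError as error:
            parts = (file_name, identifier, str(error))
            error_messages.append(" ".join(part for part in parts if part))
            position += 1  # discard the entry and resume at the next id line
        else:
            valid_identifiers.append(identifier)
            position += 4
    return valid_identifiers, error_messages


# read2 is missing its '+' line, so it is reported and skipped.
example_lines = ["@read1 first read", "ACGT", "+", "IIII",
                 "@read2", "ACGT", "IIII",
                 "@read3", "AC", "+", "II"]
valid, errors = scan_entries(example_lines, "reads.fastq")
print(valid)
print(errors)
```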
Remember to write “good code” in the sense of what was discussed in Lecture 3. The code should be correct, efficient and
maintainable, with descriptive variables, consistent formatting, good spacing and useful comments. Because you are using
object-based programming, you should think about how you will implement your code in terms of the object – i.e. you should let the
object do computation on the data it is storing. Do not directly access attributes from outside the class. You should also break
down your code into small self-contained methods and functions and try not to put all your code into one block.
Addendum – FASTQ format
There are FASTQ files that have slightly different formats. Here is the format of the files that you will need to parse, taken from
Wikipedia.
A FASTQ file uses four lines per sequence. A file can have multiple sequences, each in a block of four lines. These are the
only lines in the file.
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters, A, G, C, T and N (for a missing call). The letters can be upper- or lowercase.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the
sequence.
The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '~' in ASCII).
Here are the quality value characters in left-to-right increasing order of quality (ASCII):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
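To make the quality encoding concrete, the sketch below turns a quality string into an arithmetic mean. It assumes the common Sanger-style convention in which the score is the byte value minus 33, so that '!' maps to 0 and '~' maps to 93. The text above gives only the byte range, so the offset is kept as a parameter; the names PHRED_OFFSET, mean_quality and passes_filter are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: '!' (0x21) corresponds to a score of 0


def mean_quality(quality_string, offset=PHRED_OFFSET):
    """Return the arithmetic mean of the scores encoded in quality_string."""
    if not quality_string:
        raise ValueError("quality string is empty")
    for symbol in quality_string:
        # Valid symbols run from '!' (0x21) to '~' (0x7e) inclusive.
        if not "!" <= symbol <= "~":
            raise ValueError(f"invalid quality symbol {symbol!r}")
    scores = [ord(symbol) - offset for symbol in quality_string]
    return sum(scores) / len(scores)


def passes_filter(quality_string, threshold):
    """Return True when the mean quality is strictly greater than threshold."""
    return mean_quality(quality_string) > threshold


print(mean_quality("II"))        # 'I' is byte 73, so each score is 40
print(passes_filter("!~", 45))   # scores 0 and 93 give a mean of 46.5
```

If the raw byte values are intended instead, passing offset=0 to mean_quality gives that behaviour.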
Addendum – FASTA format
FASTA format is the basic format. A file can have multiple sequences, each in a block of two lines. These are the only lines in the
file.
Line 1 begins with a '>' character and is followed by a sequence identifier and an optional description.
Line 2 is the raw sequence letters.
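The two-line FASTA layout, together with the rule that the output file keeps the FASTQ name with a .fasta extension, can be sketched as follows. The function names fasta_path_for and write_fasta are hypothetical, and records are assumed to arrive as (identifier, bases) pairs that have already passed the quality filter.

```python
import io
import os


def fasta_path_for(fastq_path):
    """Return the FASTQ path with its extension replaced by .fasta."""
    root, _extension = os.path.splitext(fastq_path)
    return root + ".fasta"


def write_fasta(records, output_handle):
    """Write each (identifier, bases) pair as a two-line FASTA block."""
    for identifier, bases in records:
        output_handle.write(f">{identifier}\n{bases}\n")


# An in-memory buffer stands in for a real output file in this example.
buffer = io.StringIO()
write_fasta([("read1", "ACGT"), ("read3", "AC")], buffer)
print(fasta_path_for("file1.fastq"))
print(buffer.getvalue(), end="")
```

In a real program the handle would come from open(fasta_path_for(path), "w") inside a with statement.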