Due date: 10/16/18 at 10 am
Follow the template shown in class for writing your code. The “main” segment of the module should prompt the user for the arguments, pass them to the functions, and print the return values to the terminal.
Write a function, fastq_fasta(input_file, output_file), that converts a fastq file to a new fasta-formatted file.
The function should have the following attributes:
The input and output files should have the following formats (excluding the comments):
Input: a fastq file
@NS500697:12:HN75WBGXX:1:11101:19826:1052 1:N:0:1 # line 1: sequence identifier
GCGGGNTGGAAGGTGGAGCACGATCTCGAGTGGGTTGACGTCGTGAGCGA # line 2: sequence
+ # line 3: optional identifier
@AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE # line 4: quality values
Output: a fasta file with header and sequence
>NS500697:12:HN75WBGXX:1:11101:19826:1052 1:N:0:1 # sequence identifier
GCGGGNTGGAAGGTGGAGCACGATCTCGAGTGGGTTGACGTCGTGAGCGA # sequence
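To make the format change concrete, here is a minimal sketch of converting one read's header and sequence lines into a fasta record. The helper name record_to_fasta is hypothetical and not part of the required interface. It assumes the trailing newline has already been stripped from both lines.

```python
def record_to_fasta(header, sequence):
    """Convert one fastq header/sequence pair into a two-line fasta record.

    Hypothetical helper for illustration; both arguments are assumed to have
    had their trailing newline removed already.
    """
    if not header.startswith("@"):
        raise ValueError("fastq header lines start with '@'")
    # Replace the leading '@' with '>' and keep the rest of the identifier.
    return ">" + header[1:] + "\n" + sequence + "\n"


fastq_header = "@NS500697:12:HN75WBGXX:1:11101:19826:1052 1:N:0:1"
fastq_sequence = "GCGGGNTGGAAGGTGGAGCACGATCTCGAGTGGGTTGACGTCGTGAGCGA"
print(record_to_fasta(fastq_header, fastq_sequence), end="")
```

Only the first character of the header changes. The sequence line is copied as is, and the third and fourth lines of the read are dropped.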
A sample fastq dataset can be downloaded here.
A description of fastq files is here.
A description of fasta files is here.
Note that '@' and nucleotide characters sometimes appear in the quality score line, so you will need to use line numbers to identify the header and sequence lines. Each read is represented by 4 lines within a fastq file. The first and second lines contain the ID and sequence, respectively.
Hint: use the modulus operator (%) to determine if a line is a header or sequence line.
For example, with line numbers starting at 1: if (line_number + 3) % 4 == 0: # header line
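The hint can be sketched as a small helper that labels each line by its position within the 4-line read. The name classify_line and the toy reads are made up for illustration, and the sketch assumes line numbers start at 1 (for example from enumerate(handle, start=1)).

```python
def classify_line(line_number):
    """Return the role of a 1-based line number within a 4-line fastq read.

    Hypothetical helper for illustration only.
    """
    position = (line_number - 1) % 4  # 0, 1, 2 or 3 within the read
    return ("header", "sequence", "separator", "quality")[position]


# Two toy reads; the first quality line starts with '@' on purpose.
sample_lines = ["@read1", "ACGT", "+", "@#EE", "@read2", "TTGA", "+", "EEEE"]

for line_number, line in enumerate(sample_lines, start=1):
    print(line_number, classify_line(line_number), line)
```

Line 4 starts with '@' but is still labeled as a quality line, because the label depends only on the line number and never on the line's content.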
Write a function, fastq_trimmer(input_file, output_file, trim_5p, trim_3p), that trims any number of nucleotides from the 5' end of each read and any number of nucleotides from the 3' end of each read in a fastq file and writes the results to a new fastq-formatted file. The quality score lines should be trimmed exactly like the sequence lines.
The function should have the following attributes:
trim_5p and trim_3p arguments.
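As a rough sketch of the trimming step for a single read, the hypothetical helper below (trim_read is not a required name) applies the same slice to the sequence and quality strings. It assumes trim_5p and trim_3p are non-negative integers and that newlines have already been stripped.

```python
def trim_read(sequence, quality, trim_5p, trim_3p):
    """Trim trim_5p characters from the start and trim_3p from the end.

    Hypothetical helper for illustration; the same slice is applied to the
    sequence and the quality string so they stay the same length.
    """
    # Compute an explicit end index rather than slicing with -trim_3p,
    # because sequence[a:-0] would return an empty string when trim_3p is 0.
    end = max(trim_5p, len(sequence) - trim_3p)
    return sequence[trim_5p:end], quality[trim_5p:end]


trimmed_seq, trimmed_qual = trim_read("GCGGGNTGGA", "@AAAA#EEEE", 2, 3)
print(trimmed_seq)   # GGGNT
print(trimmed_qual)  # AAA#E
```

If the two trim values together are longer than the read, this sketch returns empty strings. How such reads should be handled is a design choice the assignment leaves to you.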
Combine your functions into a single module and submit via Canvas for grading.