grep

grep is a command-line tool that searches a file for a pattern and prints every line that contains it.

$ grep "pattern" file
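As a minimal sketch of the usage above, the following builds a small file and searches it. The file name example.txt and its contents are made up for illustration.

```shell
# Build a tiny example file (hypothetical contents, for illustration only).
printf 'gene DNA\ntranscript RNA\nprotein\n' > example.txt

# Print every line of example.txt that contains "RNA".
grep "RNA" example.txt

# -c prints the number of matching lines instead of the lines themselves.
grep -c "NA" example.txt
```

The first command prints the single line containing RNA; the second counts the lines containing NA.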

Regular Expressions

A regular expression is a sequence of characters that describes a search pattern.

Expression Meaning
\t tab
\n new line
\s any whitespace character (space, tab, new line, etc.)
\d any digit 0-9
\w any alphanumeric character or underscore
. a wildcard character (pretty much anything but a new line)
* match the preceding character or pattern zero or more times (greedy and will match as much as possible; use *? to match as little as possible)
+ match the preceding character or pattern one or more times
\b word boundary
^ matches beginning of line or string
$ matches end of line or string
[0-9] a character class, any digit
[a-z] a character class, any lowercase letter
| use to match something on the left or something on the right e.g. DNA|RNA
\\n the first backslash escapes the second, so the pattern matches a literal \n (a backslash followed by n) rather than a new line.
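A few of the expressions in the table can be tried on a scratch file, as sketched below. The file name regex_demo.txt and its lines are hypothetical. The sketch sticks to anchors, character classes, and alternation, because some of the shorthand in the table (\d, for example) is Perl-style and may need grep -P on GNU grep.

```shell
# Hypothetical sample lines, written to a scratch file for illustration.
printf 'DNA 42\nRNA\nprotein 7\ncDNA x\n' > regex_demo.txt

# ^ anchors the match to the start of the line, so "cDNA x" is not reported.
grep '^DNA' regex_demo.txt

# [0-9]+ followed by $ : lines that end in one or more digits.
# -E switches to extended regular expressions, which enables + and |.
grep -E '[0-9]+$' regex_demo.txt

# | matches either alternative anywhere in the line.
grep -E 'DNA|RNA' regex_demo.txt
```

Single quotes around the pattern keep the shell from interpreting characters such as $ and | before grep sees them.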

Prepare Reference Files for RNA-seq Exercise

1. Navigate to the Ensembl website: https://uswest.ensembl.org/Homo_sapiens/Info/Index

2. Click on the following link: Download GTF or GFF3 files for genes, cDNAs, ncRNA, proteins. This will open an FTP portal. Log in as guest. A Finder window should launch. The file you should download is called Homo_sapiens.GRCh38.94.gtf.gz

3. Extract lines corresponding to chromosome 22 features using grep. Redirect the grep output to a file called GRCh38.94.chr22.gtf.

4. Extract exon features from the GRCh38.94.chr22.gtf file from step 3 using grep. Redirect the grep output to a file called GRCh38.94.chr22.CDS.gtf (the CDS features from step 5 will be appended to this same file).

5. Extract CDS features from the GRCh38.94.chr22.gtf file from step 3 using grep. Redirect the grep output to the file called GRCh38.94.chr22.CDS.gtf from step 4 (be sure to append the data rather than overwrite the file).

6. Download chromosome 22 DNA sequence from Ensembl by clicking on the link: Download DNA sequence (FASTA). From the Finder window, download Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz.

7. Examine the file using more. Do the chromosome IDs match the chromosome IDs in the gtf file?
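Steps 3 to 5 can be rehearsed on a toy file before running them on the real annotation. This is one possible sketch, not the only solution. It assumes the GTF is tab-separated with the chromosome name in column 1 and the feature type in column 3. The file toy.gtf and every record in it are made up. The real download would first need decompressing, for example with gunzip.

```shell
# Hypothetical miniature GTF: one header line plus five tab-separated records.
printf '#!genome-build GRCh38.p12\n' > toy.gtf
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";\n' >> toy.gtf
printf '22\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G2";\n' >> toy.gtf
printf '22\thavana\texon\t100\t300\t.\t+\t.\tgene_id "G2"; exon_number "1";\n' >> toy.gtf
printf '22\thavana\tCDS\t150\t300\t.\t+\t0\tgene_id "G2"; exon_number "1";\n' >> toy.gtf
printf '2\thavana\texon\t22\t220\t.\t+\t.\tgene_id "G3";\n' >> toy.gtf

# Store a literal tab so the patterns work without grep -P.
TAB=$(printf '\t')

# Step 3: keep records whose first column is exactly 22.
# Anchoring with ^ and a trailing tab skips the chromosome 2 record,
# which merely contains "22" in its coordinates.
grep "^22${TAB}" toy.gtf > toy.chr22.gtf

# Step 4: keep exon features. The surrounding tabs stop the pattern from
# matching the exon_number attribute on the CDS record.
grep "${TAB}exon${TAB}" toy.chr22.gtf > toy.chr22.CDS.gtf

# Step 5: append (>>) the CDS features rather than overwrite (>) the file.
grep "${TAB}CDS${TAB}" toy.chr22.gtf >> toy.chr22.CDS.gtf

cat toy.chr22.CDS.gtf
```

A plain grep 22 or grep exon would return extra lines on this toy file, which is why the patterns are anchored and tab-delimited.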

Alternative Approach

1. Open Cyberduck or Filezilla.

2. Download the genome annotations for human genome build GRCh38 (hg38) from Ensembl:

path: ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/
file: Homo_sapiens.GRCh38.90.gtf.gz

3. Extract lines containing chromosome 22 features using grep.

4. Extract exon entries using grep.

5. Download chromosome 22 DNA sequence from Ensembl:

path: ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/
file: Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz

6. Rename the sequence ID to 22 using sed.
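The renaming in step 6 might look like the sketch below, shown on a toy FASTA file. The file names and the header line are hypothetical (the header only imitates the Ensembl style), and the sketch assumes the file holds a single sequence, since the substitution rewrites every header line it finds.

```shell
# Hypothetical single-sequence FASTA with a long, Ensembl-style header.
printf '>22 dna:chromosome chromosome:GRCh38:22:1:16:1 REF\nNNNNACGT\nACGTNNNN\n' > toy.fa

# Show the header before renaming (FASTA headers start with ">").
grep '^>' toy.fa

# Replace the whole header line with ">22"; sequence lines are untouched.
sed 's/^>.*/>22/' toy.fa > toy.renamed.fa

# Show the header after renaming.
grep '^>' toy.renamed.fa
```

Writing to a new file keeps the original intact, so the result can be checked with grep before the original is discarded.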

assignments/grep.txt · Last modified: 2018/11/15 13:09 by dokuroot