Homework 4
Assignment 4.
In this assignment, you have to piece together a few different items. To be turned in via GitHub by 9/17 at 11:59PM. Create a program called ensg2hugo.py that takes a comma-delimited file as an argument and a column number as an input, and prints a file where the Ensembl gene name has been replaced by the HUGO name.
Key hints. You need to read Homo_sapiens.GRCh37.75.gtf to create a dictionary in which you look up the Ensembl name and replace it with the HUGO name.
The location of this file is: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
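One possible way to build that dictionary is sketched below. It assumes each relevant GTF line carries `gene_id "..."` and `gene_name "..."` entries in its attribute column. The function name `build_ensg_to_hugo` is hypothetical, and the sample lines are illustrative only: their coordinates and source fields are placeholders, not real annotation.

```python
import re

# Attribute patterns. The gene_id pattern stops at the last digit, so any
# ".version" suffix would be left out of the dictionary key.
GENE_ID_RE = re.compile(r'gene_id "(ENSG\d+)')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def build_ensg_to_hugo(gtf_lines):
    """Map unversioned Ensembl gene IDs to gene names from an iterable of GTF lines."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        gene_id = GENE_ID_RE.search(line)
        gene_name = GENE_NAME_RE.search(line)
        if gene_id and gene_name:
            mapping[gene_id.group(1)] = gene_name.group(1)
    return mapping


# Illustrative lines only: every field except the two attributes is a placeholder.
sample = [
    "#!placeholder header line",
    'chrN\tsrc\tgene\t1\t2\t.\t+\t.\tgene_id "ENSG00000248546"; gene_name "ANP32C";',
    'chrN\tsrc\tgene\t3\t4\t.\t+\t.\tgene_id "ENSG00000201050"; gene_name "RNU6-668P";',
]
ensg_to_hugo = build_ensg_to_hugo(sample)
print(ensg_to_hugo)
```

With the real file, the same function could be fed an open file handle for the decompressed Homo_sapiens.GRCh37.75.gtf instead of the `sample` list.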
Notes:
- Create a GitHub repository with a README that has installation and usage instructions. The instructions will have to say how to get Homo_sapiens.GRCh37.75.gtf, as it will be too large to include (e.g. tell them to use curl, and give the path).
- Your program must use a dictionary to look up substitutions.
- Your program must use a regular expression.
- Your program must be installable using git clone and following the README (easy), and have a license file etc.
- Ensembl gene names need to be matched only up to the "." in ENSG00000248546.3, since the part after the dot is specific to the build. Thus “ENSG00000248546.3”, “ENSG00000248546.31”, and ENSG00000248546 should all yield ANP32C. Hint: just store the part you need in the dictionary. The match should work whether or not the input file uses quotes. ** Important: if you don't find matches, this may be the problem. **
- The unit test for you is here: https://github.com/davcraig75/unit
- You have to allow an option “-f [0-9]” where -f2 would pick the 2nd column. If there is no “-f” then the first column is used.
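The option handling and the matching rule from the notes above could be sketched roughly as follows. This is a minimal illustration, not a reference solution. The names `parse_args` and `convert_line` are hypothetical, the comma split assumes no field contains an embedded comma, and leaving unknown IDs unchanged is an assumption rather than something the assignment specifies.

```python
import re

# "-f2" (value attached) or a bare "-f" whose value is the next argument.
COLUMN_FLAG_RE = re.compile(r"^-f(\d+)?$")
# Optional quote, unversioned ID, optional ".version", optional quote.
ENSG_FIELD_RE = re.compile(r'^("?)(ENSG\d+)(?:\.\d+)?("?)$')


def parse_args(argv):
    """Return (column, path): a 1-based column (default 1) and the input path."""
    column, path = 1, None
    args = iter(argv)
    for arg in args:
        match = COLUMN_FLAG_RE.match(arg)
        if match is None:
            path = arg                    # anything else is taken as the input file
        elif match.group(1):
            column = int(match.group(1))  # "-f2" form
        else:
            column = int(next(args))      # "-f 2" form
    return column, path


def convert_line(line, column, mapping):
    """Replace the Ensembl ID in the 1-based column of one comma-delimited line."""
    fields = line.rstrip("\n").split(",")  # naive split: assumes no commas inside fields
    index = column - 1
    if 0 <= index < len(fields):
        match = ENSG_FIELD_RE.match(fields[index])
        if match and match.group(2) in mapping:
            # Keep whatever quoting the input used around the ID.
            fields[index] = match.group(1) + mapping[match.group(2)] + match.group(3)
    return ",".join(fields)


# Stand-in for the dictionary built from the GTF file (pair taken from the notes above).
mapping = {"ENSG00000248546": "ANP32C"}
column, path = parse_args(["-f2", "expression_analysis.tsv"])
for field in ('"ENSG00000248546.3"', "ENSG00000248546.31", "ENSG00000248546"):
    print(convert_line('"14541",' + field, column, mapping))
```

All three spellings of the ID come out as ANP32C, and the quotes are retained only where the input had them. The standard argparse module would also accept both the -f2 and -f 2 spellings. The hand-written flag pattern is used here only because the assignment asks for a regular expression.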
ensg2hugo.py -f2 expression_analysis.tsv >expression_analysis.hugo.tsv
will turn this file from

    "","gene_id","gene_name","gene_type","logFC","AveExpr","t","P.Value","adj.P.Val"
    "14541","ENSG00000248546.3","processed_pseudogene",0.449817926522256,0.0739725408539951,3.47895145072996,0.000284302244388779,0.999999999912779
    "14546","ENSG00000201050.1","snRNA",0.380944080200912,0.169836608364135,2.92569531023051,0.00183380737252742,0.999999999912779

into:

    "","gene_id","gene_name","gene_type","logFC","AveExpr","t","P.Value","adj.P.Val"
    "14541","ANP32C","processed_pseudogene",0.449817926522256,0.0739725408539951
    "14546","RNU6-668P","snRNA",0.380944080200912,0.169836608364135,2.92569531023051