Translational Biomedical Informatics

Assignment 4.

In this assignment, you will piece together a few different items. It is to be turned in via GitHub by 9/17 at 11:59 PM. Create a program called ensg2hugo.py that takes a comma-delimited file as an argument and a column number as an option, and prints the file with the Ensembl gene IDs in that column replaced by HUGO gene names.

Key hint: read Homo_sapiens.GRCh37.75.gtf to build a dictionary, then look up each Ensembl ID in it and replace the ID with the corresponding HUGO name.

The location of this file is: ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
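
The dictionary-building step could be sketched roughly as follows. This is a minimal sketch, not a reference solution: the function names build_lookup and load_gtf are hypothetical, and it assumes that each GTF line carries gene_id "..." and gene_name "..." attributes in its last column. Any version suffix on the gene ID is dropped so that only the unversioned accession is stored.

```python
import gzip
import re

# Pull gene_id and gene_name out of the GTF attribute column.
# Assumption: attributes look like gene_id "ENSG..."; gene_name "...";
GENE_ID_RE = re.compile(r'gene_id "(ENSG\d+)(?:\.\d+)?"')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def build_lookup(lines):
    """Map unversioned Ensembl gene IDs to gene names from GTF lines."""
    lookup = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        gene_id = GENE_ID_RE.search(line)
        gene_name = GENE_NAME_RE.search(line)
        if gene_id and gene_name:
            lookup[gene_id.group(1)] = gene_name.group(1)
    return lookup


def load_gtf(path):
    """Read a plain or gzip-compressed GTF file into a lookup dictionary."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return build_lookup(handle)
```

With the file downloaded from the location above, load_gtf("Homo_sapiens.GRCh37.75.gtf") would return a dictionary in which, for example, the key ENSG00000248546 is expected to map to ANP32C.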

Notes:

Create a GitHub repository with a README that has installation and usage instructions. The instructions must explain how to get Homo_sapiens.GRCh37.75.gtf, since it is too large to include in the repository (e.g. give the curl command and the path).
Your program must use a dictionary as the lookup table for substitutions.
Your program must use a regular expression.
Your program must be installable by running git clone and following the README (easy), and the repository must include a license file, etc.
Match Ensembl gene IDs only up to the "." in ENSG00000248546.3, since the suffix after the "." is a version number that depends on the build. Thus “ENSG00000248546.3”, “ENSG00000248546.31”, and ENSG00000248546 should all yield ANP32C. Hint: store only the part you need in the dictionary. The match should work whether or not the input file uses quotes. ** Important: if you don’t find matches, this may be the problem. **
The unit test for you is here: https://github.com/davcraig75/unit
You have to allow an option “-f [0-9]” where -f2 would pick the 2nd column. If there is no “-f” then the first column is used.
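
The matching and column-selection notes above could be sketched as follows. This is a hedged illustration under stated assumptions: parse_args and convert_line are hypothetical names, fields are assumed to contain no embedded commas (so a plain split on "," is enough), and IDs missing from the dictionary are left unchanged, which the assignment does not specify. The regular expression replaces only the accession and its optional version suffix, so surrounding quotes are preserved whether or not they are present.

```python
import argparse
import re

# The ENSG accession followed by an optional ".version" suffix.
ENSG_RE = re.compile(r"(ENSG\d+)(?:\.\d+)?")


def parse_args(argv=None):
    """Parse the -f option (1-based column, default 1) and the input file."""
    parser = argparse.ArgumentParser(prog="ensg2hugo.py")
    parser.add_argument("-f", dest="column", type=int, default=1,
                        help="1-based column holding Ensembl IDs (default: 1)")
    parser.add_argument("infile", help="comma-delimited input file")
    return parser.parse_args(argv)


def convert_line(line, lookup, column=1):
    """Replace the Ensembl ID in one column of a comma-delimited line."""
    # Assumption: no quoted field contains a comma.
    fields = line.rstrip("\n").split(",")
    index = column - 1
    if 0 <= index < len(fields):
        # Assumption: IDs absent from the lookup are left as they are.
        fields[index] = ENSG_RE.sub(
            lambda m: lookup.get(m.group(1), m.group(0)), fields[index])
    return ",".join(fields)
```

Because argparse accepts a short option with its value attached, parse_args(["-f2", "input.csv"]) sets column to 2, and omitting -f falls back to the first column.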

ensg2hugo.py -f2 expression_analysis.tsv >expression_analysis.hugo.tsv

will turn this file from

"","gene_id","gene_name","gene_type","logFC","AveExpr","t","P.Value","adj.P.Val"
"14541","ENSG00000248546.3","processed_pseudogene",0.449817926522256,0.0739725408539951,3.47895145072996,0.000284302244388779,0.999999999912779
"14546","ENSG00000201050.1","snRNA",0.380944080200912,0.169836608364135,2.92569531023051,0.00183380737252742,0.999999999912779

into:

"","gene_id","gene_name","gene_type","logFC","AveExpr","t","P.Value","adj.P.Val"
"14541","ANP32C","processed_pseudogene",0.449817926522256,0.0739725408539951,3.47895145072996,0.000284302244388779,0.999999999912779
"14546","RNU6-668P","snRNA",0.380944080200912,0.169836608364135,2.92569531023051,0.00183380737252742,0.999999999912779