Assignment 4.

In this assignment, you have to piece together a few different items. It is to be turned in via GitHub by 9/17 at 11:59 PM. Create a program called ensg2hugo.py that takes a comma-delimited file as an argument and, optionally, a column number, and prints the file with the Ensembl gene name in that column replaced by its HUGO name.

Key hint: read Homo_sapiens.GRCh37.75.gtf to build a dictionary, then look up each Ensembl name in it and replace it with the corresponding HUGO name.
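As a rough sketch of this hint, the function below builds the dictionary from GTF lines with two regular expressions. It assumes each relevant line carries attributes of the form gene_id "..." and gene_name "..." in its last column; the names build_lookup, GENE_ID_RE and GENE_NAME_RE are made up for illustration and are not part of the assignment.

```python
import re

# Assumed GTF attribute layout: gene_id "ENSG..."; ... gene_name "SYMBOL";
# The gene_id pattern stops before any ".version" suffix.
GENE_ID_RE = re.compile(r'gene_id "(ENSG\d+)')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def build_lookup(gtf_lines):
    """Map unversioned Ensembl gene IDs to HUGO names (hypothetical helper)."""
    lookup = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        id_match = GENE_ID_RE.search(line)
        name_match = GENE_NAME_RE.search(line)
        if id_match and name_match:
            lookup[id_match.group(1)] = name_match.group(1)
    return lookup
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, e.g. `with open("Homo_sapiens.GRCh37.75.gtf") as handle: lookup = build_lookup(handle)`.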

Notes:

  1. Create a GitHub repository with a README that has installation and usage instructions. The instructions must say how to get Homo_sapiens.GRCh37.75.gtf, as it is too large to include in the repository (e.g. give the curl command and the path).
  2. Your program must use a dictionary as the lookup table for substitutions.
  3. Your program must use a regular expression.
  4. Your program must be installable by running git clone and following the README (easy), and the repository must include a license file, etc.
  5. Match ENSEMBL gene names only up to the “.” (the ENSG00000248546 part of ENSG00000248546.3), since the number after the “.” is specific to the build. Thus “ENSG00000248546.3”, “ENSG00000248546.31”, and “ENSG00000248546” should all yield ANP32C. Hint: store only the part you need in the dictionary. The match should work whether or not the input file uses quotes. ** Important: if you don’t find matches, this may be the problem. **
  6. The unit test for you is here: https://github.com/davcraig75/unit
  7. You have to allow an option “-f [0-9]” where -f2 would pick the 2nd column. If there is no “-f” then the first column is used.
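The version-stripping rule and the -f option from the notes above could be sketched roughly as follows. This is only one possible approach: the helper names strip_version and parse_args and the positional argument name csv_file are hypothetical, and argparse is used here as a convenience rather than a requirement.

```python
import argparse
import re

# Optional surrounding quotes, the ENSG ID, and an optional ".version" suffix.
VERSION_RE = re.compile(r'^"?(ENSG\d+)(?:\.\d+)?"?$')


def strip_version(field):
    """Return the unversioned, unquoted gene ID, or None if the field is not one."""
    match = VERSION_RE.match(field.strip())
    return match.group(1) if match else None


def parse_args(argv):
    """Parse '-f N' (or '-fN'); the column defaults to 1 when -f is absent."""
    parser = argparse.ArgumentParser(description="Replace Ensembl IDs with HUGO names")
    parser.add_argument("-f", dest="column", type=int, default=1,
                        help="1-based column holding the Ensembl ID")
    parser.add_argument("csv_file", help="comma-delimited input file")
    return parser.parse_args(argv)
```

With this sketch, strip_version returns ENSG00000248546 for all three spellings listed in the notes, quoted or not, and parse_args(["-f2", "expression_analysis.tsv"]).column is 2.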

ensg2hugo.py -f2 expression_analysis.tsv >expression_analysis.hugo.tsv

will turn this file from

"","gene_id","gene_name","gene_type","logFC","AveExpr","t","P.Value","adj.P.Val"
"14541","ENSG00000248546.3","processed_pseudogene",0.449817926522256,0.0739725408539951,3.47895145072996,0.000284302244388779,0.999999999912779
"14546","ENSG00000201050.1","snRNA",0.380944080200912,0.169836608364135,2.92569531023051,0.00183380737252742,0.999999999912779

into:

"","gene_id","gene_name","gene_type","logFC","AveExpr","t","P.Value","adj.P.Val"
"14541","ANP32C","processed_pseudogene",0.449817926522256,0.0739725408539951,3.47895145072996,0.000284302244388779,0.999999999912779
"14546","RNU6-668P","snRNA",0.380944080200912,0.169836608364135,2.92569531023051,0.00183380737252742,0.999999999912779
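
The transformation shown above can be sketched per line as follows. This is a minimal illustration, not the required implementation: the name convert_line is hypothetical, and the naive split on commas assumes that no quoted field contains a comma, which holds for the sample rows here but may not hold in general.

```python
import re

# Capture an optional opening quote, the unversioned ID, and an optional
# closing quote, so the original quoting can be put back around the HUGO name.
ENSG_RE = re.compile(r'^("?)(ENSG\d+)(?:\.\d+)?("?)$')


def convert_line(line, lookup, column=1):
    """Replace the Ensembl ID in the given 1-based column, if it is in lookup."""
    fields = line.rstrip("\n").split(",")  # assumes no commas inside fields
    index = column - 1
    if 0 <= index < len(fields):
        match = ENSG_RE.match(fields[index].strip())
        if match and match.group(2) in lookup:
            fields[index] = match.group(1) + lookup[match.group(2)] + match.group(3)
    return ",".join(fields)
```

Lines whose chosen column does not hold a known Ensembl ID, such as the header row, pass through unchanged.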