Homework 1B.

While this assignment provides instructions, there is a lot of background material you really must have read to follow each step successfully. If you haven’t gone through that material, STOP and do that first. I do not spell out each step, and there are purposeful gaps meant to challenge you and make you think. So fill in the gaps. You may mess up your directory tree. In that case, you can either start over or use Unix commands to fix it. Your homework is graded automatically, so pay attention to names. A single typo can make a program fail, so typos count against you, and you do not get credit for being close. Similarly, your programs won’t work if they are only close.

The homework is due August 27 at 8 AM. Submit your trgn_assignment1 folder to Blackboard as a tar.gz file after completing the steps below.  A tar.gz file is typically created with the program tar, such as tar -cvzf myfile.tar.gz path/to/mydirectory.  Note that the f flag comes last because the archive name must follow it directly.  You’ll need to use rsync to transfer the file from the trgn.usc.edu server to your computer for uploading.
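Here is a minimal sketch of packing a folder with tar and then checking what went into the archive. The folder demo_assignment and the file notes.txt are made-up names for illustration only, and everything happens in a throwaway directory.

```shell
set -eu

# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical stand-in for an assignment folder.
mkdir -p demo_assignment/problem1
echo "placeholder" > demo_assignment/problem1/notes.txt

# c = create, v = verbose, z = gzip, f = the archive name comes next.
# f goes last in the cluster because the following word is read as the file name.
tar -cvzf demo_assignment.tar.gz demo_assignment

# t = list the archive contents without extracting anything.
listing=$(tar -tzf demo_assignment.tar.gz)
echo "$listing"
```

Listing the archive before you upload it is a cheap way to confirm that the folder names inside are spelled the way you intended.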

Create New GitHub Repository

Click create repository.

Setup Assignment Folder

  • Make a folder called assignments in your home directory using mkdir.
  • Within assignments, make a directory called trgn_assignment1 and cd into that directory.
  • Initialize your repository
    echo "#trgn510_assignment1" >> README.md
    git init
    git add README.md
    git commit -m "first commit"
    git remote add origin https://github.com/davcraig75/trgn510_assignment1.git
    git push -u origin master
  • With vim, edit the README.md file within your trgn_assignment1 directory so that it has the following contents:

    This directory contains my first assignment in Fall 2020 TRGN510
  • Make your first commit of the directory tree to git. First add the files: git add . -A
  • Type: git commit -m "First commit"
  • Type: git push origin master
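The add and commit cycle above can be tried safely in a scratch repository first. This is only a sketch: the name and email are placeholders set for the scratch repository alone, and there is no remote, so nothing is pushed.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

git init --quiet
# Placeholder identity, stored only in this scratch repository.
git config user.name "Demo Student"
git config user.email "demo@example.com"

echo "# demo_assignment" > README.md
git add README.md
git commit --quiet -m "first commit"

# Change the file, then look at what git has noticed.
echo "This directory contains a demo." >> README.md
git status --short

# Stage everything that changed and commit again.
git add -A
git commit --quiet -m "Update README"

commit_count=$(git rev-list --count HEAD)
git log --oneline
```

Running git status before each commit is a good habit, because it shows exactly which files will and will not be included.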

Problem 1

  • Create a directory called problem1 within the ~/assignments/trgn_assignment1 folder. Type: history.
  • Create a file called myhistory.txt within the ~/assignments/trgn_assignment1/problem1 folder using the redirect symbol >, such as history > myhistory.txt. Make sure you are in the correct folder when you do this.
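If redirection is new to you, here is a small sketch of how > and >> differ. It uses echo rather than history, because history prints nothing inside a script (command history is only kept in interactive shells). The file name log.txt is made up.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

echo "first run"  > log.txt     # > creates log.txt
echo "second run" > log.txt     # > again: the old contents are overwritten
echo "third run" >> log.txt     # >> appends to the end instead

line_count=$(wc -l < log.txt)
cat log.txt
```

Only the second and third lines survive, which is why running a > redirect twice replaces your earlier output rather than adding to it.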

Problem 2

  • Create a folder called problem2 within your ~/assignments/trgn_assignment1/ folder.  Copy a file from elsewhere on the server into this directory using cp /data/bashrc ~/assignments/trgn_assignment1/problem2/.
  • Change directories into ~/assignments/trgn_assignment1/problem2/ and type ls -la, where the -la flags give you more information about permissions and also show hidden files.
  • You want to edit the bashrc file, which is your settings file.  Right now it’s not active because it’s not in the right place and doesn’t have the right name. It must be in your home directory and have the name .bashrc. Note that the leading . character hides it from a standard ls.  This file is a series of bash commands that run whenever you start a new bash session.  Let’s add a welcome message.  Use vim to add a line at the very end so that we always know when we have logged in to the server.  Go to the end of the file (type G after pressing Escape), and add this line to the bashrc file: echo "Welcome. You are $USER on $HOSTNAME"
  • It’s always good to know what operating system you are in.  Let’s add that too.  Type cat /etc/*release, which should give:

    CentOS Linux release 7.7.1908 (Core)
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"
    CentOS Linux release 7.7.1908 (Core)
    CentOS Linux release 7.7.1908 (Core)
  • That is too much info to provide every time we log in, so let us print just the first line by piping into the program head.  Test out cat /etc/*release | head, which by default gives you the first 10 lines.  I would like you to figure out how to change this command so that it prints just 1 line. Please place that command as the last line in your file ~/assignments/trgn_assignment1/problem2/bashrc
      1. Test out your new bashrc file by typing source ~/assignments/trgn_assignment1/problem2/bashrc, and your prompt should change.  This simply runs every line in that file using bash.  You should also see your welcome message and the first line of the release information.
      2. Now let’s install it so that it runs every time we log in to this server by giving it the right name and the right location: cp ~/assignments/trgn_assignment1/problem2/bashrc ~/.bashrc
      3. Let’s now add a setup file for vim.  You’ll want to use the example provided at /home/data/vimrc.  Please copy this file into your assignment directory: cp /home/data/vimrc ~/assignments/trgn_assignment1/problem2/ and inspect it by opening the file with vim.  It should look pretty bland.  Now copy it to your home directory and give it the correct name by typing cp vimrc ~/.vimrc.  Now examine the file again using vi.  You should see a lot of coloring.  This is your new setup.
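To see what sourcing a file and hiding it with a leading dot actually do, here is a small sketch using a made-up settings file called demo_rc in a throwaway directory. It does not touch your real .bashrc. It uses the . command, which is the portable spelling of bash's source.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

# A hypothetical two-line settings file.
cat > demo_rc <<'EOF'
GREETING="Welcome. You are ${USER:-unknown} on ${HOSTNAME:-unknown}"
echo "$GREETING"
EOF

# Sourcing runs each line in the current shell,
# so GREETING is still defined after the file has finished.
. ./demo_rc

# Copy it to a name starting with a dot, which hides it from a plain ls.
cp demo_rc .demo_rc
visible=$(ls)
all=$(ls -a)

echo "ls shows:    $visible"
echo "ls -a shows: $all"
```

The variable surviving after the file runs is the difference between sourcing a file and running it as a separate script.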

Problem 3

      1. Create a new directory called ~/assignments/trgn_assignment1/problem 3/ and cd into the directory.  Note that there is a space in the directory name, and you need to reproduce that space exactly.  If you type ls -l in ~/assignments/trgn_assignment1, you should see problem 3 listed alongside your other problem directories.
      2. Within problem 3, create a file called My History.txt that contains the output of the command history.  Again, no credit will be given if you forget the space, and you may have to try this step multiple times.
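Spaces matter because the shell normally splits a command into arguments at each space. The sketch below shows two common ways of keeping a space inside a single name. The names demo dir and My Notes.txt are made up for illustration.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

mkdir "demo dir"                 # quotes keep the space inside one argument
cd demo\ dir                     # a backslash before the space works too
echo "hello" > "My Notes.txt"    # the same quoting applies to redirect targets

# Without quotes the shell would split the name into two arguments:
#   mkdir demo dir   creates two directories, "demo" and "dir"

file_count=$(ls | wc -l)
ls -l
```

If you do end up with two stray directories, rmdir removes empty ones, and then you can try again.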

Problem 4

      1. Make a directory called problem4 and change into it.  No space!  You can see all the programs running right now using top, so try it out!  Now sometimes we need to know what is running and whether our program has finished.  Let’s list all of the processes under our username by typing ps -ef | grep $USER.  OK, now put that output into a file called myprocesses.txt within your problem4 directory.  You’ll need to use redirection.
      2. Let’s give myprocesses.txt a header.  Create a header using echo: echo "# My processes" > header.txt. Now let’s put the two files together: cat header.txt myprocesses.txt > myprocesses.header.txt, and the new file should appear in your problem4 directory.
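The pattern here is a pipe to filter output, a redirect to save it, and cat to join files in the order given. Here is a sketch of that pattern with made-up rows standing in for real ps output, so it runs the same way anywhere.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

# Made-up rows; the pipe passes them to grep, which keeps only matching lines.
printf 'alice 101 bash\nbob 202 vim\nalice 303 top\n' | grep alice > mine.txt

echo "# My rows" > header.txt

# cat writes its input files one after another, in the order they are listed.
cat header.txt mine.txt > mine.header.txt

read -r first_line < mine.header.txt
total_lines=$(wc -l < mine.header.txt)
cat mine.header.txt
```

The order of the file names given to cat decides the order of the contents, so the header file has to come first.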

Problem 5

      1. Create and cd into a problem5 directory.  For the next problem, we will need to download or scrape a webpage.  We can use wget or curl for this.  Get the following file which is a list of the genes in the genome and their details. Please download the file using:
        wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.basic.annotation.gtf.gz
      2. That was a large file, and we can check that it is there using ls -l.  How big is the file? It is 24290522 bytes.  That is about 24 Megabytes.  However, how do we know that it’s not corrupted?  We can calculate an md5sum on the file contents, which gives a short string of characters that acts as a fingerprint of the file: if the contents differ, the value differs. We can do this by typing md5sum gencode.v29.basic.annotation.gtf.gz, and we see the value 00d1c11098c15e8d79fec541afd1dff0.  If you see the same value, your download is intact.
      3. Let’s unzip the file by typing gunzip gencode.v29.basic.annotation.gtf.gz.  Now how big is the file?  Type ls -l. It is 768007075 bytes.  That is about 768 Megabytes.  Inspect the file first with head: head gencode.v29.basic.annotation.gtf shows the first lines of its contents.  We can also see other things.  Inspect the end of the file using tail.  Count the number of lines by typing wc -l gencode.v29.basic.annotation.gtf.  Now we notice that tabs separate the columns, and that the third column holds the term gene on lines that describe a gene.  How many lines have the word gene in the third column?  We can find out using grep, which searches for lines matching a pattern.  This is tricky, though, because we need to search for <tab>gene<tab>.  One way to type the <tab> character on the Linux command line is to press control-v and then the tab key.  You can use that trick with any character, including the Enter key.  So now type grep " gene " gencode.v29.basic.annotation.gtf | wc -l, but type a tab character instead of the space before and after the word gene.  You are successful if you see that there are 58721 genes. Let’s put just those lines into a file: grep " gene " gencode.v29.basic.annotation.gtf > genes.tmp.txt, again with tabs around gene.
      4. Let’s put just the gene names into a file.  We can use grep and regular expressions.  We will learn more about regular expressions later, but you can explore the concept at https://www.regex.com.  For now, let’s use this command: grep -oP 'gene_name "(.*?)"' genes.tmp.txt, where the -o flag prints only the part of each line that matches the pattern.  Redirect that result into a file called genes.tmp2.txt.  If you type head genes.tmp2.txt, you should see something like:
        gene_name "DDX11L1"
        gene_name "WASH7P"
        gene_name "MIR6859-1"
        gene_name "MIR1302-2HG"
        gene_name "MIR1302-2"
        gene_name "FAM138A"
      5. The next step I’d like you to do in vim. Using vim you can do substitutions, and I’d like you to remove gene_name " from the start of each line and the trailing " from the end (hint: %s).  Now rename this file genes.final.txt.  Typing head genes.final.txt should then show just the bare gene names, such as DDX11L1 and WASH7P, one per line.
      6. Finally, I’d like you to clean up the directory by removing the temporary files and the very large starting file. These are too big for git.  Please type: rm *.gtf *tmp.txt genes.txt *.gz.  If one of those names matches nothing, rm reports that the file does not exist and still removes the others, so that message is harmless.
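To see why the tabs around gene matter, and how a checksum confirms that a file is unchanged, here is a sketch on a tiny made-up file. The file demo.gtf and its four rows are invented for illustration and only imitate the tab-separated layout of the real annotation file. The tab is produced with printf here, which is an alternative to the control-v trick. In bash you can also write $'\t'.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

# Four made-up, tab-separated rows with the feature type in column 3.
printf 'chr1\tDEMO\tgene\t100\t200\tgene_name "AAA"\n'        >  demo.gtf
printf 'chr1\tDEMO\ttranscript\t100\t200\tgene_name "AAA"\n'  >> demo.gtf
printf 'chr1\tDEMO\texon\t100\t150\tgene_name "AAA"\n'        >> demo.gtf
printf 'chr2\tDEMO\tgene\t300\t400\tgene_name "BBB"\n'        >> demo.gtf

# A checksum taken before compressing and after decompressing should be identical.
sum_before=$(md5sum demo.gtf | cut -d' ' -f1)
gzip demo.gtf
gunzip demo.gtf.gz
sum_after=$(md5sum demo.gtf | cut -d' ' -f1)

# Store a literal tab character in a variable.
tab=$(printf '\t')

# Searching for the bare word also matches gene_name on every line...
loose_lines=$(grep -c 'gene' demo.gtf)
# ...while requiring a tab on both sides matches only the column-3 entries.
gene_lines=$(grep -c "${tab}gene${tab}" demo.gtf)

echo "loose match: $loose_lines lines, tab-delimited match: $gene_lines lines"
```

On the real file the same effect is much larger, because gene_name and gene_id appear on nearly every line.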


    1. Create a gzipped tar archive of your homework directory trgn_assignment1 using the command tar -cvzf, writing to a file named assignment1.$USER.tar.gz, where $USER is your username.
    2. Using the program rsync on your personal laptop, transfer the tar.gz file from the trgn.usc.edu server onto your personal laptop.  Upload the tar.gz file to Blackboard.
    3. Finally, commit all your updates and push them to GitHub, so that you can see your repository on github.com.