Homework Assignment 2

Due 8AM PST September 3rd

  • Create an application that, once per day, builds a webpage with a wordcloud summarizing a series of two or more webpages.
    • Retrieves the web pages listed in a file called my_webpages.txt
    • Extracts the text from them
    • Generates a wordcloud image from that text.
    • Places the image into a directory with an HTML page that references it.
    • Runs on a schedule using cron.

Component 1: generate_wordcloud_from_file.sh

Initialize a repository

Initialize a new GitHub repository named trgn_wordcloud.
Create a README.md file with information about the assignment. Write it in Markdown format; you can learn about Markdown here: https://guides.github.com/features/mastering-markdown/.

In this README.md, I’m going to want you to provide the following major sections:

  • About the app
    • 1-2 sentences about the app
  • Installation & Usage
    • How to install the app on a server from scratch, starting with git clone.
  • Dependencies
    • A list of programs needed for this to work
  • Contact

In addition, you’ll need to create a license file. Create a license.txt file with an MIT style license.

Step 1. Create a main script.

Let’s start by creating a new script called generate_wordcloud_from_file.sh

We start with a shebang line:

#!/usr/bin/bash

echo "Successfully run for $USER";

We can then test the script. Next, we want the script to retrieve each webpage using wget, by iterating through the URLs in my_webpages.txt and placing the downloads in a directory called current_pages. Thus we need a unit test version of my_webpages.txt.

Step 2. Create a resource file

Create a unit test my_webpages.txt containing:

https://en.wikipedia.org/wiki/Translational_bioinformatics
https://en.wikipedia.org/wiki/Genomics

For your application DO NOT USE THESE. USE YOUR OWN web pages, preferably ones that change nightly.

For example, running the script on the two URLs above should yield two files, giving the following directory tree.

.
├── current_pages
│   ├── Genomics
│   └── Translational_bioinformatics
├── generate_wordcloud_from_file.sh
├── license.txt
├── my_webpages.txt
└── README.md

The names come from the website, which is a bit problematic, so let’s use a wget option that lets us specify the filename, such as file1.html and so forth. Specifically, I’ve used wget -O. Look up how to use it.

.
├── current_pages
│   ├── file1.html
│   └── file2.html
├── generate_wordcloud_from_file.sh
├── license.txt
├── my_webpages.txt
└── README.md

Now you must be thinking: how did I get file1.html and file2.html? In my script, I iterated through each URL, giving its file a number based on a counter.
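As a minimal sketch of that counting loop (one possible approach, not the only one), the function below reads URLs line by line and saves each one under a numbered name with wget -O. The function name fetch_pages and its two arguments are hypothetical names chosen for illustration.

```shell
#!/usr/bin/bash
# Sketch only: fetch_pages is a hypothetical helper name, not part of wget.
# It downloads each URL listed in a file into numbered HTML files.

fetch_pages() {
    local url_file="$1"   # text file with one URL per line
    local out_dir="$2"    # directory that receives file1.html, file2.html, ...
    local count=0
    local url

    mkdir -p "$out_dir"

    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue            # skip blank lines
        count=$((count + 1))
        # -q keeps wget quiet; -O sets the output filename explicitly.
        wget -q -O "$out_dir/file${count}.html" "$url"
    done < "$url_file"

    echo "$count"   # report how many pages were requested
}
```

Inside generate_wordcloud_from_file.sh this could be called as: fetch_pages my_webpages.txt current_pages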

Inside these files, we should see the HTML used to generate the webpage. There is a lot of text we don’t need, so let’s add a program that converts these files to plain text. We will use the Python script html2text: https://github.com/aaronsw/html2text.git. Install it with:

cd ~/bin
git clone https://github.com/aaronsw/html2text
cd html2text
cp html2text.py ~/bin/
chmod +x ~/bin/html2text.py

As long as ~/bin is in your PATH, this lets the program run without typing its whole path. We have created a dependency that we should highlight in our README.md. We can test the program by running:

html2text.py file1.html > my_current.txt
html2text.py file2.html >> my_current.txt 
head my_current.txt

Note that I have used something new: the “>>”. This special redirect appends, so I don’t overwrite my_current.txt when I convert the second file. The alternative would be to write two separately named files and then concatenate them into a single file.

# Genomics

From Wikipedia, the free encyclopedia

Jump to navigation Jump to search

This article is about the scientific field. For the journal, see [Genomics
(journal)](/wiki/Genomics_\(journal\)).

"Genome biology" redirects here. For the journal with the same name, see

If we count the lines with wc -l my_current.txt, we should see 1777 lines (the exact count will change as the pages are edited).
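The two html2text.py commands above generalize to a loop over every downloaded page. The sketch below is one way to do it, assuming html2text.py is on your PATH; the function name convert_pages and the HTML2TEXT override variable are hypothetical, added only so the converter can be swapped out when testing.

```shell
#!/usr/bin/bash
# Sketch only: convert_pages and HTML2TEXT are hypothetical names.
# Converts every .html file in a directory and appends the text to one file.

convert_pages() {
    local pages_dir="$1"   # directory holding file1.html, file2.html, ...
    local out_file="$2"    # combined text output, e.g. my_current.txt
    local converter="${HTML2TEXT:-html2text.py}"
    local page

    : > "$out_file"        # start from an empty file on every run

    for page in "$pages_dir"/*.html; do
        [ -e "$page" ] || continue           # no matches: nothing to do
        "$converter" "$page" >> "$out_file"  # ">>" appends, so nothing is lost
    done
}
```

For example: convert_pages current_pages my_current.txt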

We can put these in a new file called /my_current.txt.

We can use a program to extract the text.  Now, we should add some detail about dependencies in the README. Our first dependency is wget.

Step 3. Add in Dependency 1

Step 4. Add in Dependency 2


Dependency 1: Generate Wordcloud Image

We are going to install a program called wordcloud, a Python package that creates a wordcloud image. You can get info about it at https://github.com/amueller/word_cloud. Let’s install it in our ~/bin/ directory.

cd ~/bin 
git clone https://github.com/amueller/word_cloud.git
cd word_cloud

Now read the directions, and create a unit test.

You might get permission errors during installation. This is because it’s trying to install into directories you don’t have permission for. You need to install locally, so you’ll have to modify their directions to

python -m pip install --user wordcloud

You will often find yourself needing to install things locally. Once it’s installed, you’ll notice that it has placed the script in a directory that is already in your path. You’ll want to test things out following their directions.

For the unit test you can use anything, but the text below, for example, would work. What you should find interesting is that, for many users, this program is automatically installed into a directory in their path.

Text for a unit test

Though heredity had been observed for millennia, Gregor Mendel, a scientist and Augustinian friar working in the 19th century, was the first to study genetics scientifically. Mendel studied “trait inheritance”, patterns in the way traits are handed down from parents to offspring. He observed that organisms (pea plants) inherit traits by way of discrete “units of inheritance”. This term, still used today, is a somewhat ambiguous definition of what is referred to as a gene. Trait inheritance and molecular inheritance mechanisms of genes are still primary principles of genetics in the 21st century, but modern genetics has expanded beyond inheritance to studying the function and behavior of genes. Gene structure and function, variation, and distribution are studied within the context of the cell, the organism (e.g. dominance), and within the context of a population. Genetics has given rise to a number of subfields, including molecular genetics, epigenetics and population genetics. Organisms studied within the broad field span the domains of life (archaea, bacteria, and eukarya). Genetic processes work in combination with an organism’s environment and experiences to influence development and behavior, often referred to as nature versus nurture. The intracellular or extracellular environment of a living cell or organism may switch gene transcription on or off. A classic example is two seeds of genetically identical corn, one placed in a temperate climate and one in an arid climate (lacking sufficient waterfall or rain). While the average height of the two corn stalks may be genetically determined to be equal, the one in the arid climate only grows to half the height of the one in the temperate climate due to lack of water and nutrients in its environment.

Example Unit Test

Now let’s test this with some text from the previous component, specifically what’s in my_current.txt.

The next step is to add a line to your generate_wordcloud_from_file.sh script from above to create the image.
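A minimal sketch of that image step follows. It assumes the pip install above provided the wordcloud_cli command with its --text and --imagefile options, as described in the word_cloud README; the function name make_cloud and the WORDCLOUD_CLI override variable are hypothetical.

```shell
#!/usr/bin/bash
# Sketch only: make_cloud and WORDCLOUD_CLI are hypothetical names.
# Turns a text file into a wordcloud PNG using the wordcloud_cli command.

make_cloud() {
    local text_file="$1"    # input text, e.g. my_current.txt
    local image_file="$2"   # output image, e.g. myimage.png
    local cli="${WORDCLOUD_CLI:-wordcloud_cli}"

    # Refuse to build an image from a missing or empty text file.
    if [ ! -s "$text_file" ]; then
        echo "make_cloud: $text_file is missing or empty" >&2
        return 1
    fi

    "$cli" --text "$text_file" --imagefile "$image_file"
}
```

For example: make_cloud my_current.txt myimage.png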

Now we need to figure out how to get an image onto a web page.

Dependency 2: Web Server

First, we need to check that we have a place to create a webpage. That location is https://trgn.usc.edu/user/~yourusername, and it serves files located in your ~/public_html directory that are world readable. It should be empty, and look something like this:

If we create an index.html file in the ~/public_html directory, it will replace this page.

Let’s make sure we can add a document. Create a unit test HTML file called unittest.html, containing just a generic HTML page, in the public_html folder.

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>My Unit Test</title>
</head>
<body>
<h1> Unit Test</h1>
<p>Unit Test</p>
</body>
</html>

This should yield:

If this test works, we can modify the HTML above to reference an image within the same directory. You’ll need to look that up, but essentially it’s adding something like:

<img src="myimage.png" alt="Wordcloud">

at the place you want the image to appear within your HTML file, called index.html, in your public_html directory. Make sure the image is located in the same directory!
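If you would rather have the script write index.html than edit it by hand, a sketch like the following could work. The function name write_index, the page title, and the heading text are hypothetical choices; only the img tag and the file location come from the steps above.

```shell
#!/usr/bin/bash
# Sketch only: write_index is a hypothetical helper name.
# Writes a small index.html that references an image in the same directory.

write_index() {
    local web_dir="$1"      # e.g. "$HOME/public_html"
    local image_name="$2"   # image filename inside web_dir, e.g. myimage.png

    mkdir -p "$web_dir"

    # Unquoted EOF so that $image_name is expanded inside the page.
    cat > "$web_dir/index.html" <<EOF
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>My Wordcloud</title>
</head>
<body>
<h1>Wordcloud</h1>
<img src="$image_name" alt="Wordcloud">
</body>
</html>
EOF

    chmod 644 "$web_dir/index.html"   # world readable so the server can show it
}
```

For example: write_index "$HOME/public_html" myimage.png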

Lastly, go back to your generate_wordcloud_from_file.sh script and have it automatically put the image in this directory, replacing the old one each time it runs. Finally, use cron to run it once per day at your favorite time of day.
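These last two steps can be sketched as follows. The function names publish_cloud and cron_line, the fixed filename myimage.png, and the example script path are all assumptions for illustration; adjust them to your own account and layout.

```shell
#!/usr/bin/bash
# Sketch only: publish_cloud and cron_line are hypothetical helper names.

# Copy the freshly generated image into the web directory, replacing the old one.
publish_cloud() {
    local image_file="$1"   # newly generated image
    local web_dir="$2"      # e.g. "$HOME/public_html"

    mkdir -p "$web_dir"
    cp -f "$image_file" "$web_dir/myimage.png"
    chmod 644 "$web_dir/myimage.png"   # world readable for the web server
}

# Print a crontab entry that runs a command once per day.
# Crontab fields: minute hour day-of-month month day-of-week command
cron_line() {
    local minute="$1"
    local hour="$2"
    local command="$3"
    printf '%s %s * * * %s\n' "$minute" "$hour" "$command"
}
```

For example, cron_line 0 6 /home/yourusername/trgn_wordcloud/generate_wordcloud_from_file.sh prints an entry for 6:00 AM every day; run crontab -e and paste that line in to schedule it. Note that cron runs with a minimal environment, so using full paths inside the script is the safer choice.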