Building Meaning From Data

Modern Tools for Working with Data

The term Data Science is ubiquitous, and definitions vary. This is similar for terms such as informaticsbioinformatics, computational biology, among others. The terms are largely generic, but the underlying concepts are largely the same and do not change fully.  It’s important to give a few areas, we will focus on:

  • JupyterLab – web-based interactive development environment for Jupyter notebooks, code, and data
  • BASH/Command-line
  • Python. A general-purpose scripting/programming language that emphasizes code readability.
  • R. A scripting language with its routes in statistics
  • R Studio. A company that has developed graphic user interfaces (GUI) and libraries for R, making R more accessible and usable.
  • Databases (SQL). A standard language for accessing and manipulating databases.

It’s important to start with the idea that we have different types of data. Data can be numbers, letters, dates, and so forth. Within a computer, these are all represented differently and in very defined ways.

Understanding the Types of Data

What’s the difference between the number 2, the character 2, two, 2.0, 200%? Quite a bit, and the impact is significant to scientific data and data analysis. Recently, in fact, there was an article:

How can this arise?  Open Excel and type the gene SEPT6and save it as a csv.  For example, if we take a file and put:

Gene, Value KRAS, 2.323 SEPT6, 2321 TP53, 2.1.2 Then we open it excel and see that the data has changed. This can have far-reaching effects, and is something to be careful:

We then save it, and look at the raw data we see that the underlying data has changed. Gene, Value KRAS,2.323 6-Sep,2321 TP53, 2.1.2

What’s happening is that Excel is interpreting and determining the type of the data, and decides that SEPT6 should be a date. In Excel, date is a type of data.  Accordingly, it then stores it in a way whereby fundamentally it’s changed. Excel is what’s called a weak type language, and guesses what type of data is underlying. Other languages are strong typed. What does this mean, and what are some fundamental types?  We review a few, but its important to understand each language has types and these are some of the most primordial parts of data science we need to understand.

Primitive or Simple Data types

We first describe the basis for primitive datatypes.

Bits & boolean types of data

A bit is the basic unit of data that is either 0 or 1. Everything builds from there. A common representation is true/false (boolean).

Boolean is the first type of data to remember, and it can be encoded as a 0 or 1.

We can describe more complex data by using bits together. Two bits give us access to four slots:

00 Slot 1
01 Slot 2
11 Slot 3
10 Slot 4

We can extend this further to 8 bits and beyond

8 Bits to a Byte & a character or char type of data.

Modern computing really began when we started using 8 bits to represent a byte of data. 2^8 gives 256 categories. This is convenient because it can store what can be typed on a keyboard. The ASCII standard is the encoding of characters using 7 bits to these keys (below), reserving the last bit, and another 128 characters for foreign characters using the last 8th bit. Below we can see that A is 011 0001.

The second type of data to know is character or sometimes abbreviated as char.  It is a single letter.  Character data types are typically strings of ASCII characters

Stringing characters together & string types of data

If we take several characters and bundle (or string) them together we have the next important type of data: Stings.  A string data type is a series of characters stored together. We store these together in variables and we’ll discuss this more later. An example of string is:

data

which underneath the hood is composed of for characters in order: d,a,t,a.

Using bits to represent whole numbers, integers as a type of data

Storing data has always been limiting – from floppy disks to IPads, nothing has changed – we need more storage. For example, consider the year 2020.  We can store it as 4 char string:2020 .  We can of course just abbreviate it as a year: ’20. For those old enough, they would immediately remember the year 2000 bug and the impact of such a decision. Essentially, for years prior to the year 2000, dates were stored using the last two digits, such as 99. As a result, the year 00 could either mean the year 1900 or 2000, and all downstream code had an inherent default prediction. This caused all sorts of challenges.

We can be smart about it though, and we can store it as a 2-bytes using 16 bits giving us access to 2^16=65,536 numbers. What if we want to store a larger number? We need to use a double integer that uses 4 bytes or 32 bits: 2^32=4,294,967,296. Building computers off of 32-bit computing was good throughout the 90’s when there were barely 5 billion people, but if we want the ability to go higher. Likewise, a genome is over 3 billion bytes, and if we store the location of a variant using 32 bits, we would lose precision and become ambiguous. We really need 64 bits:  2^64 gives us 18,446,744,073,709,551,616 numbers.

Using bits to represent numbers with decimals.

What about decimals? Well, we obviously would need more bits. A floating-point number is a limited-precision that is not whole and typically has a decimal. These numbers are stored internally as scientific notation. Still, floating-point numbers have limited precision, only a subset of real or rational numbers can be represented.

Float uses 4 bytes (or 64 bits) and gives us access to 3.4^10-38 to 3.4^10-38. Need more, well, you need more bits.

Summary

Statistical Datatypes

In many biomedical and data science applications, one is often interested in determining statistical meaning, it’s helpful to characterize the underlying data more and it can be sub-divided. By statistical meaning, we mean that we may wish to graph data, group data, or conduct statistical analysis on data. The first two major types of statistical data are categorical and numerical data (recognizing a few others as well, shown below).

Definitions vary by programming or scripting language, but generally, these are considered valid:

Categorical Data

Categorical data can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group. Categorical data can then be typically divided into Ordinal/ordered data and un-ordered named or nominal data. Additionally, there is also boolean or binary data, that can be true or false.  As many healthcare professionals now accept, gender is no longer binary.

Nominal Data

“Arizona”,”Nevada”, “California”

Ordered Data

“Monday”,”Tuesday”,”Wednesday”,”Thursday”

Numerical Data

Numerical data is just that – data that is largely numbers.  These take three primary forms: discrete, continuous, and ratio. Discrete data are usually considered whole numbers or integers, (such as 1,2,3,4), whereas continuous data can contain decimals.  There is also a data-type called ratio that includes fractions and percentages, or types of data that really shouldn’t have some operations on them such as dividing. An example in the health-care setting would be temperature – going from 2C to 4C is not doubling in temperature. Other examples include enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, and survival time.

Can be computed Nominal Ordinal   Ratio
 Frequency distribution Yes Yes   Yes
 Median and percentiles No Yes   Yes
 Add or subtract No No   Yes
 Mean, standard deviation, standard error of the mean No No   Yes
 Ratios, coefficient of variation No No   Yes

Introducing variables & composite datatypes:

We are going to talk a lot about keeping data organized. We do this the same way we do in life – we give names to represents folders (both physical or otherwise), files, and just about everything else. Likewise, we put some data – whether an integer, character, or otherwise, we do want it back. We give them a name and box to be stored. They may change, and thus we use the term variable. Technically, there are also constants, but for our purposes, these are boxes where we store data and can retrieve them.  We often type these variables so that the programs can optimally store them. An int variable may be able to store a single integer at a time. What about a set of data, such as a list of holidays?  These can be stored as well in composite data types, such as arrays, dictionaries, or lists.

Composite Datatypes

Collections of data can also be considered a single object and essentially a type of data.  We review a few major types, but let’s start with considering a form of data as tables.

Tables can be described through a spreadsheet Worksheet which most people are familiar with.  Below, we have the table hospital-data with multiple column headers. This actually comes from a csv file (below) that is publicly available, and you can download it (link).

Arrays, Vectors, Lists, or Ordered Arrays.

When we think of ways to store data, one analogy is by the mailbox. Basically, numbered places where we store data. If we think about a table from above, we could also consider an array to be a column.

Likewise, we can make an ordered list of data such as by A1 is ‘hello’ and A2 is ‘goodbye’. One might declare it (typically) by brackets. The key is that they are numbered sequentially.  You can place things out of order, but in general, the expectation is that you push one thing onto a stack growing the size of the array by one.

A[1]=0.234234 A[2]=0.3234 A[3]=23.23

Arrays can, of course, be multi-dimensional, but generally, they are presumed to be all the same type of data, and thus you can write:

A[1,2]=0.234234

To emphasize, arrays are often indicated by brackets, and so when we see code or other types of data listed within brackets {}, then we should think of those as un-ordered arrays.  For example:

workdays=[“Monday”,”Tuesday”,”Wednesday”,”Thursday”]

We can then retrieve the first item in the list workdays[0] is Monday in some program languages and workday[1].  Indeed, some languages such as C start counting at 0 and some others such as R start at 1.

Associative arrays, Dictionaries, objects in javascript, named arrays, or hashes.

Number storage vehicles have limitations, and thus there is another type of storage that is much similar to an address, and those are termed Associative arrays. Instead of a number, we use a name.

GeneNames={“PTEN”:”phosphatase and tensin homolog”,”KRAS”:”K-Ras”,”TP53″:”p53″}

I could have a variable called GeneNames.  I could store GeneNames{'PTEN'}, and then store all sorts of information in a way that is logically retrievable.  This comes in handy a lot.  Again, historically, you do have the same type of data in unordered lists or hashes.

We can get much more complex and mix these quite a bit into data structures.  In R, we use dataframes, which include Hashes of arrays, etc., and so forth.  For example, let us load up some data!

These can be mixed where we have arrays of dictionaries and vice versa.  We’ll get back to that point with JSON stores below.

JSON and Document Stores

JSON is a language-independent data format that allows for embedded data types of data, in a record.  A collection of records is often called a document. At the heart of JSON is the Key:Value approach, where the value can strings, booleans, numbers, arrays, associative arrays, and null.  Strings are encapsulated in quotes ("), boolean is unquoted true or false , arrays are surrounded by brackets, [and ], Associative arrays are surrounded by curly brackets, { and }.

Text-based Data-interchange Formats

Comma-separated values (CSV) and Tab-Separated Files are plain text files, where the first line is typically a header and the following lines are rows.  These are typically in ASCII or plain text.

CSV Files

We can represent the same data as a CSV.  The first row is typically the header.

newID,PC1,PC2,PC3,PC4 PPMI.Pilot.HA_ITG001_296597,0.108006766,0.015726028,0.093262438,0.060885093 PPMI.Pilot.HA_ITG002_PP0018.3690,0.104529423,0.017393778,0.117255688,0.07702544 PPMI.Pilot.HA_ITG003_PP0018.3682,0.107576826,0.002806186,0.171024597,0.071616656 PPMI.Pilot.HA_ITG004_3119622,0.074604492,0.049329752,-0.115794448,0.099060234 PPMI.Pilot.HA_ITG005_3145413,0.110742871,0.061465364,0.231489894,-0.070450761 PPMI.Pilot.HA_ITG006_953306,0.113388178,0.013073106,0.232587341,-0.002473437 PPMI.Pilot.HA_ITG007_1176618,0.065665593,0.080236696,-0.218573769,0.071228383 PPMI.Pilot.HA_ITG008_PP0015.9868,0.088975015,0.026523725,-0.02374357,0.10829439

There is a big problem here in that often “,” is used within documents.  For this reason, csv is really not ideal.  Some tools like R offer to use ” to help, but it’s still prone to problems. There are a few other issues, such as the use the unicode.  More often than not, it’s best to have quotes (“) in them.

Another example of a CSV is below:

TSV (Tab Separated Files)

Tabs are /t in most tools such as Unix. In BASH you need to press control-v then press tab after letting go of control-v.  We will use these a fair amount

newID PC1 PC2 PC3 PC4 PPMI.Pilot.HA_ITG001_296597 0.108006766 0.015726028 0.093262438 0.060885093PPMI.Pilot.HA_ITG001_296597 0.108006766 0.015726028 0.093262438 0.060885093 PPMI.Pilot.HA_ITG002_PP0018.3690 0.104529423 0.017393778 0.117255688 0.07702544PPMI.Pilot.HA_ITG002_PP0018.3690 0.104529423 0.017393778 0.117255688 0.07702544 PPMI.Pilot.HA_ITG003_PP0018.3682 0.107576826 0.002806186 0.171024597 0.071616656PPMI.Pilot.HA_ITG003_PP0018.3682 0.107576826 0.002806186 0.171024597 0.071616656

All of these are considered flat files.  In the future, we will talk about structured tables and databases such as through SQL or PostGRSQL.

 

XML

XML consists of a start-tag such as <start> and an end-tag </start> or can be an empty-element tag such as <nothing />.  You can mark these up with key value encoding, such as <start mykey="val">Just a start</start.    This format is most often thought of in the context of HTML, so we’ll use that for our purposes here.  HTML follows generally XML.

Example simple HTML:

<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> <p>This is a paragraph.</p> <p>This is another paragraph.</p> </body> </html>

In practice, many more tags are defined, but we can provide some examples below for a few more common ones

HTML

HTML Rendering
<h1>My First Heading</h1>
My first paragraph.
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<a href=”https://dtg.usc.edu”>This is a link</a>
Try it Yourself »
<button>My button</button>

THIS IS HEADING 1

This is heading 2

This is heading 3

This is a link
Try it Yourself »

HTML defines the content of every web page on the Internet. By “marking up” your raw content with HTML tags, you’re able to tell web browsers how you want different parts of your content to be displayed. Creating an HTML document with properly marked up content is the first step of developing a web page.

Diagram: raw content turning into HTML markup turning into a web page

Aggregating, Pivoting, and Summarizing

One of the first major goals of understanding a dataset is to summarize it. Its easiest to think of in the context of table. For example, one may wish to know what the range or possible values/factors/categories is of a particular column, sorting by a given column, filtering it to remove certain rows, or perhaps create different ways of viewing that data by groupingit together. In part from the Excel nomenclature, this is referred to as pivoting or aggregating data.

We can actually learn most of these using Google Sheets or Excel, though let’s use Google Sheets form the above data. I’ve gone already into the Data menu and selected create a filter.  We will test out some of these ideas using a clinical sheet describing samples from within The Cancer Genome Atlas (TCGA), shown below. Later you can explore this dataset using the exercise link below.

If we click on a header, we can see that we can both sort and add filters on the data. We can also see different categories of the data, such as the different genders.

Graphing Exercise

In the graphing exercise we below, start by clicking “load data”, or put in some tab/csv delimited data. As we change columns the applications changes to different types of graphs. Filtering gets interesting because we can effectively add complex conditions to our filtering, including using includes, greater than, and many other operations.  We can join these together as well using and or or.

If we highlight all the data and click pivot table under the Data menu, we can do something really amazing, and create new tools that summarize the prior data by defining columns, rows, and fields of our new table.

For example, we can count the number of times a certain group is found. We can do more than count, we can sum, take the mean, and many others.  The key to do this is to identify which fields will be our new columns and rows, and what field will make up our values. Under the Pivot table editor in the right, click Add, then select the type of variable.  Very important, we can count, and do many other operations as well.  For example, we can example individuals who are part of the Tumor Cancer Genome Atlas (TCGA), and create a new graph of Average Number of Cigarettes per Day by Race, by adding rows and columns appropriately.

Exercises

Exercise 1:

Make a copy of the breast cancer TCGA Breast Cancer Data.  Using the pivot functions, explore different aggregations of the data.

Exercise 2:

Download a table of TCGA Breast Cancer Clinical data:

Open this data in a text viewer.  Look at the data by scrolling up and down.  Now, paste the data into the text window below.  Change the x-axis and explore different parameters to see how graphs for different types of data.