Data Science
Building Meaning From Data
Modern Tools for Working with Data
The term Data Science is ubiquitous, and definitions vary. This is similar for terms such as informatics, bioinformatics, computational biology, among others. The terms are largely generic, but the underlying concepts are largely the same and do not change fully. It’s important to give a few areas, we will focus on:
- JupyterLab – web-based interactive development environment for Jupyter notebooks, code, and data
- BASH/Command-line
- Python. A general-purpose scripting/programming language that emphasizes code readability.
- R. A scripting language with its routes in statistics
- R Studio. A company that has developed graphic user interfaces (GUI) and libraries for R, making R more accessible and usable.
- Databases (SQL). A standard language for accessing and manipulating databases.
It’s important to start with the idea that we have different types of data. Data can be numbers, letters, dates, and so forth. Within a computer, these are all represented differently and in very defined ways.
Understanding the Types of Data
What’s the difference between the number 2, the character 2, two, 2.0, 200%? Quite a bit, and the impact is significant to scientific data and data analysis. Recently, in fact, there was an article:
How can this arise? Open Excel and type the gene SEPT6
and save it as a csv
. For example, if we take a file and put:
Gene, Value KRAS, 2.323 SEPT6, 2321 TP53, 2.1.2 Then we open it excel and see that the data has changed. This can have far-reaching effects, and is something to be careful:
We then save it, and look at the raw data we see that the underlying data has changed. Gene, Value KRAS,2.323 6-Sep,2321 TP53, 2.1.2
What’s happening is that Excel is interpreting and determining the type
of the data, and decides that SEPT6 should be a date
. In Excel, date
is a type of data. Accordingly, it then stores it in a way whereby fundamentally it’s changed. Excel is what’s called a weak type language, and guesses what type of data is underlying. Other languages are strong
typed. What does this mean, and what are some fundamental types? We review a few, but its important to understand each language has types
and these are some of the most primordial parts of data science we need to understand.
Primitive or Simple Data types
We first describe the basis for primitive datatypes.
Bits & boolean types of data
A bit is the basic unit of data that is either 0 or 1. Everything builds from there. A common representation is true/false (boolean).
Boolean
is the first type of data to remember, and it can be encoded as a 0
or 1
.
We can describe more complex data by using bits together. Two bits give us access to four slots:
00 | Slot 1 |
01 | Slot 2 |
11 | Slot 3 |
10 | Slot 4 |
We can extend this further to 8 bits and beyond
8 Bits to a Byte & a character or char
type of data.
Modern computing really began when we started using 8 bits to represent a byte of data. 2^8 gives 256 categories. This is convenient because it can store what can be typed on a keyboard. The ASCII standard is the encoding of characters using 7 bits to these keys (below), reserving the last bit, and another 128 characters for foreign characters using the last 8th bit. Below we can see that A
is 011 0001
.
The second type of data to know is character
or sometimes abbreviated as char
. It is a single letter. Character data types are typically strings of ASCII characters
Stringing characters together & string types of data
If we take several characters and bundle (or string) them together we have the next important type of data: Stings
. A string data type is a series of characters stored together. We store these together in variables
and we’ll discuss this more later. An example of string
is:
data
which underneath the hood is composed of for characters in order: d
,a
,t
,a
.
Using bits to represent whole numbers, integers as a type of data
Storing data has always been limiting – from floppy disks to IPads, nothing has changed – we need more storage. For example, consider the year 2020
. We can store it as 4 char
string:2020
. We can of course just abbreviate it as a year: ’20. For those old enough, they would immediately remember the year 2000 bug and the impact of such a decision. Essentially, for years prior to the year 2000, dates were stored using the last two digits, such as 99. As a result, the year 00 could either mean the year 1900 or 2000, and all downstream code had an inherent default prediction. This caused all sorts of challenges.
We can be smart about it though, and we can store it as a 2-bytes using 16 bits giving us access to 2^16=65,536 numbers. What if we want to store a larger number? We need to use a double integer that uses 4 bytes or 32 bits: 2^32=4,294,967,296
. Building computers off of 32-bit computing was good throughout the 90’s when there were barely 5 billion people, but if we want the ability to go higher. Likewise, a genome is over 3 billion bytes, and if we store the location of a variant using 32 bits, we would lose precision and become ambiguous. We really need 64 bits: 2^64
gives us 18,446,744,073,709,551,616
numbers.
Using bits to represent numbers with decimals.
What about decimals? Well, we obviously would need more bits. A floating-point number is a limited-precision that is not whole and typically has a decimal. These numbers are stored internally as scientific notation. Still, floating-point numbers have limited precision, only a subset of real or rational numbers can be represented.
Float
uses 4 bytes (or 64 bits) and gives us access to 3.4^10-38 to 3.4^10-38. Need more, well, you need more bits.
Summary
Statistical Datatypes
In many biomedical and data science applications, one is often interested in determining statistical meaning, it’s helpful to characterize the underlying data more and it can be sub-divided. By statistical meaning, we mean that we may wish to graph data, group data, or conduct statistical analysis on data. The first two major types of statistical data are categorical
and numerical data
(recognizing a few others as well, shown below).
Definitions vary by programming or scripting language, but generally, these are considered valid:
Categorical Data
Categorical data can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group. Categorical data can then be typically divided into Ordinal/ordered
data and un-ordered named or nominal
data. Additionally, there is also boolean
or binary data, that can be true
or false
. As many healthcare professionals now accept, gender is no longer binary.
Nominal Data
“Arizona”,”Nevada”, “California”
Ordered Data
“Monday”,”Tuesday”,”Wednesday”,”Thursday”
Numerical Data
Numerical data is just that – data that is largely numbers. These take three primary forms: discrete
, continuous
, and ratio
. Discrete data are usually considered whole numbers or integers, (such as 1,2,3,4), whereas continuous data can contain decimals. There is also a data-type called ratio
that includes fractions and percentages, or types of data that really shouldn’t have some operations on them such as dividing. An example in the health-care setting would be temperature – going from 2C to 4C is not doubling in temperature. Other examples include enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, and survival time.
Can be computed | Nominal | Ordinal | Ratio | |
Frequency distribution | Yes | Yes | Yes | |
Median and percentiles | No | Yes | Yes | |
Add or subtract | No | No | Yes | |
Mean, standard deviation, standard error of the mean | No | No | Yes | |
Ratios, coefficient of variation | No | No | Yes |
Introducing variables & composite datatypes:
We are going to talk a lot about keeping data organized. We do this the same way we do in life – we give names to represents folders (both physical or otherwise), files, and just about everything else. Likewise, we put some data – whether an integer, character, or otherwise, we do want it back. We give them a name and box to be stored. They may change, and thus we use the term variable. Technically, there are also constants, but for our purposes, these are boxes where we store data and can retrieve them. We often type
these variables so that the programs can optimally store them. An int
variable may be able to store a single integer at a time. What about a set of data, such as a list of holidays? These can be stored as well in composite data types, such as arrays, dictionaries, or lists.
Composite Datatypes
Collections of data can also be considered a single object
and essentially a type
of data. We review a few major types, but let’s start with considering a form of data as tables
.
Tables can be described through a spreadsheet Worksheet which most people are familiar with. Below, we have the table hospital-data
with multiple column headers. This actually comes from a csv
file (below) that is publicly available, and you can download it (link).
Arrays, Vectors, Lists, or Ordered Arrays.
When we think of ways to store data, one analogy is by the mailbox. Basically, numbered places where we store data. If we think about a table from above, we could also consider an array to be a column.
Likewise, we can make an ordered list of data such as by A1 is ‘hello’ and A2 is ‘goodbye’. One might declare it (typically) by brackets. The key is that they are numbered sequentially. You can place things out of order, but in general, the expectation is that you push one thing onto a stack growing the size of the array by one.
A[1]=0.234234 A[2]=0.3234 A[3]=23.23
Arrays can, of course, be multi-dimensional, but generally, they are presumed to be all the same type of data, and thus you can write:
A[1,2]=0.234234
To emphasize, arrays are often indicated by brackets, and so when we see code or other types of data listed within brackets {}, then we should think of those as un-ordered arrays. For example:
workdays=[“Monday”,”Tuesday”,”Wednesday”,”Thursday”]
We can then retrieve the first item in the list workdays[0]
is Monday
in some program languages and workday[1]
. Indeed, some languages such as C
start counting at 0
and some others such as R
start at 1
.
Associative arrays, Dictionaries, objects in javascript, named arrays, or hashes.
Number storage vehicles have limitations, and thus there is another type of storage that is much similar to an address, and those are termed Associative arrays. Instead of a number, we use a name.
GeneNames={“PTEN”:”phosphatase and tensin homolog”,”KRAS”:”K-Ras”,”TP53″:”p53″}
I could have a variable called GeneNames. I could store GeneNames{'PTEN'}
, and then store all sorts of information in a way that is logically retrievable. This comes in handy a lot. Again, historically, you do have the same type of data in unordered lists or hashes.
We can get much more complex and mix these quite a bit into data structures. In R, we use dataframes, which include Hashes of arrays, etc., and so forth. For example, let us load up some data!
These can be mixed where we have arrays of dictionaries and vice versa. We’ll get back to that point with JSON
stores below.
JSON and Document Stores
JSON is a language-independent data format that allows for embedded data types of data, in a record. A collection of records is often called a document. At the heart of JSON is the Key
:Value
approach, where the value can strings, booleans, numbers, arrays, associative arrays, and null. Strings are encapsulated in quotes ("
), boolean is unquoted true
or false
, arrays are surrounded by brackets, [
and ]
, Associative arrays are surrounded by curly brackets, {
and }
.
Text-based Data-interchange Formats
Comma-separated values (CSV) and Tab-Separated Files are plain text files, where the first line is typically a header and the following lines are rows. These are typically in ASCII or plain text.
CSV Files
We can represent the same data as a CSV. The first row is typically the header.
newID,PC1,PC2,PC3,PC4 PPMI.Pilot.HA_ITG001_296597,0.108006766,0.015726028,0.093262438,0.060885093 PPMI.Pilot.HA_ITG002_PP0018.3690,0.104529423,0.017393778,0.117255688,0.07702544 PPMI.Pilot.HA_ITG003_PP0018.3682,0.107576826,0.002806186,0.171024597,0.071616656 PPMI.Pilot.HA_ITG004_3119622,0.074604492,0.049329752,-0.115794448,0.099060234 PPMI.Pilot.HA_ITG005_3145413,0.110742871,0.061465364,0.231489894,-0.070450761 PPMI.Pilot.HA_ITG006_953306,0.113388178,0.013073106,0.232587341,-0.002473437 PPMI.Pilot.HA_ITG007_1176618,0.065665593,0.080236696,-0.218573769,0.071228383 PPMI.Pilot.HA_ITG008_PP0015.9868,0.088975015,0.026523725,-0.02374357,0.10829439
There is a big problem here in that often “,” is used within documents. For this reason, csv
is really not ideal. Some tools like R
offer to use ” to help, but it’s still prone to problems. There are a few other issues, such as the use the unicode
. More often than not, it’s best to have quotes (“) in them.
Another example of a CSV is below:
TSV (Tab Separated Files)
Tabs are /t in most tools such as Unix. In BASH you need to press control-v then press tab after letting go of control-v. We will use these a fair amount
newID PC1 PC2 PC3 PC4 PPMI.Pilot.HA_ITG001_296597 0.108006766 0.015726028 0.093262438 0.060885093PPMI.Pilot.HA_ITG001_296597 0.108006766 0.015726028 0.093262438 0.060885093 PPMI.Pilot.HA_ITG002_PP0018.3690 0.104529423 0.017393778 0.117255688 0.07702544PPMI.Pilot.HA_ITG002_PP0018.3690 0.104529423 0.017393778 0.117255688 0.07702544 PPMI.Pilot.HA_ITG003_PP0018.3682 0.107576826 0.002806186 0.171024597 0.071616656PPMI.Pilot.HA_ITG003_PP0018.3682 0.107576826 0.002806186 0.171024597 0.071616656
All of these are considered flat files. In the future, we will talk about structured tables and databases such as through SQL or PostGRSQL.
XML
XML consists of a start-tag such as <start>
and an end-tag </start>
or can be an empty-element tag such as <nothing />
. You can mark these up with key value encoding, such as <start mykey="val">
Just a start</start. This format is most often thought of in the context of HTML, so we’ll use that for our purposes here. HTML follows generally XML.
Example simple HTML:
<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> <p>This is a paragraph.</p> <p>This is another paragraph.</p> </body> </html>
In practice, many more tags are defined, but we can provide some examples below for a few more common ones
HTML
HTML | Rendering |
<h1>My First Heading</h1> My first paragraph. <h1>This is heading 1</h1> <h2>This is heading 2</h2> <h3>This is heading 3</h3> <a href=”https://dtg.usc.edu”>This is a link</a> Try it Yourself » <button>My button</button> |
THIS IS HEADING 1This is heading 2This is heading 3This is a linkTry it Yourself » |
HTML defines the content of every web page on the Internet. By “marking up” your raw content with HTML tags, you’re able to tell web browsers how you want different parts of your content to be displayed. Creating an HTML document with properly marked up content is the first step of developing a web page.
Aggregating, Pivoting, and Summarizing
One of the first major goals of understanding a dataset is to summarize
it. Its easiest to think of in the context of table. For example, one may wish to know what the range
or possible values/factors/categories
is of a particular column, sorting
by a given column, filtering
it to remove certain rows, or perhaps create different ways of viewing that data by grouping
it together. In part from the Excel nomenclature, this is referred to as pivoting
or aggregating
data.
We can actually learn most of these using Google Sheets or Excel, though let’s use Google Sheets form the above data. I’ve gone already into the Data
menu and selected create a filter
. We will test out some of these ideas using a clinical sheet describing samples from within The Cancer Genome Atlas (TCGA), shown below. Later you can explore this dataset using the exercise link below.
If we click on a header, we can see that we can both sort
and add filters
on the data. We can also see different categories of the data, such as the different genders.
Graphing Exercise
In the graphing exercise we below, start by clicking “load data”, or put in some tab/csv delimited data. As we change columns the applications changes to different types of graphs. Filtering gets interesting because we can effectively add complex conditions to our filtering, including using includes, greater than, and many other operations
. We can join these together as well using and
or or
.
If we highlight all the data and click pivot table
under the Data
menu, we can do something really amazing, and create new tools that summarize the prior data by defining columns, rows, and fields of our new table.
For example, we can count the number of times a certain group is found. We can do more than count
, we can sum
, take the mean
, and many others. The key to do this is to identify which fields will be our new columns and rows, and what field will make up our values. Under the Pivot table editor
in the right, click Add
, then select the type of variable. Very important, we can count
, and do many other operations as well. For example, we can example individuals who are part of the Tumor Cancer Genome Atlas (TCGA), and create a new graph of Average Number of Cigarettes per Day by Race, by adding rows and columns appropriately.
Exercises
Exercise 1:
Make a copy of the breast cancer TCGA Breast Cancer Data. Using the pivot functions, explore different aggregations of the data.
Exercise 2:
Download a table of TCGA Breast Cancer Clinical data:
Open this data in a text viewer. Look at the data by scrolling up and down. Now, paste the data into the text window below. Change the x-axis
and explore different parameters to see how graphs for different types
of data.