Data Primer
Building Meaning From Data
The term Data Science is ubiquitous, and definitions vary. The same is true of terms such as informatics, bioinformatics, and computational biology, among others. The terms are broad, but the underlying concepts are largely the same. We will focus on a few areas:
- Jupyter Lab – web-based interactive development environment for Jupyter notebooks, code, and data
- BASH/Command-line
- Python. A general-purpose scripting/programming language that emphasizes code readability.
- R. A scripting language with its roots in statistics.
- RStudio. A company that has developed graphical user interfaces (GUI) and libraries for R, making R more accessible and usable.
- Databases (SQL). A standard language for accessing and manipulating databases.
It’s important to start with the idea that we have different types of data. Data can be numbers, letters, dates, and so forth. Within a computer, these are all represented differently and in very defined ways.
Primitive or Simple Data types
We first describe the basis for primitive datatypes.
Bits & boolean types of data
A bit is the basic unit of data that is either 0 or 1. Everything builds from there. A common representation is true/false (boolean).
Boolean is the first type of data to remember, and it can be encoded as a 0 or 1.
We can describe more complex data by using bits together. Two bits give us access to four categories:
00 -> Category 1, 01 -> Category 2, 10 -> Category 3, 11 -> Category 4
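As a minimal sketch of how bits combine into categories, the Python snippet below lists every pattern for a given number of bits. The helper name `bit_patterns` is made up for this illustration.

```python
from itertools import product

def bit_patterns(n_bits):
    """Return every distinct pattern of n_bits bits as a string, e.g. '01'."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

two_bit = bit_patterns(2)
print(two_bit)        # ['00', '01', '10', '11']
print(len(two_bit))   # 4, which is 2**2

# A boolean is just one bit's worth of information.
print(int(True), int(False))   # 1 0
```

Each extra bit doubles the number of patterns, so n bits give 2^n categories.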
8 Bits to a Byte & a character type of data.
Modern computing really began when we started using 8 bits to represent a byte of data. 2^8 gives 256 categories. This is convenient because it can store what can be typed on a keyboard. The ASCII standard encodes these characters using 7 bits (128 codes), leaving the 8th bit free; extended encodings use that 8th bit for another 128 characters, such as accented letters used in other languages. In ASCII, A is 100 0001 (decimal 65).
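To check an ASCII code yourself, Python's built-in `ord` and `chr` convert between a character and its number. This is only a small illustration; the variable names are arbitrary.

```python
letter = "A"
code = ord(letter)             # the character's numeric code
bits7 = format(code, "07b")    # the same code written as 7 bits

print(letter, code, bits7)     # A 65 1000001
print(chr(0b1000001))          # A (going back from bits to the character)
```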
The second type of data to know is character, sometimes abbreviated as char. It is a single letter, digit, or symbol, typically encoded as one ASCII character.
Stringing characters together & string types of data
If we take several characters and bundle (or string) them together we have the next important type of data: Strings. A string data type is a series of characters stored together. We store these together in variables, and we’ll discuss this more later.
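A short Python sketch of the idea that a string is a series of characters, each with its own code. The example value `"PTEN"` is just a convenient string.

```python
word = "PTEN"

chars = list(word)                # the individual characters
codes = [ord(c) for c in word]    # one ASCII code per character
raw = word.encode("ascii")        # the bytes actually stored

print(chars)      # ['P', 'T', 'E', 'N']
print(codes)      # [80, 84, 69, 78]
print(len(raw))   # 4, one byte per character
```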
Using bits to represent whole numbers, integers as a type of data
Storing data has always been limiting. From floppy disks to iPads, nothing has changed: we need more storage. We can be smart about it though, and instead of storing a number like 65,535 as a string using 5 bytes, we can store it as an integer in 2 bytes (16 bits), which gives us access to 2^16=65,536 distinct numbers (0 through 65,535). What if we want to store a larger number? We need a longer integer that uses 4 bytes or 32 bits: 2^32=4,294,967,296. Building computers off of 32 bit computing was good throughout the 90’s when there were barely 5 billion people, but if we want the ability to go higher, we need 64 bits. 2^64 gives us 18,446,744,073,709,551,616 numbers.
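The arithmetic can be checked in Python. The sketch below compares storing a number as text against storing it as a 2-byte integer; the variable names are illustrative only.

```python
# How many distinct values fit in 16, 32, and 64 bits.
for n_bits in (16, 32, 64):
    print(n_bits, 2 ** n_bits)

largest_16 = 2 ** 16 - 1                      # 65535, the largest unsigned 16-bit value
as_bytes = largest_16.to_bytes(2, "big")      # stored as an integer
as_text = str(largest_16).encode("ascii")     # stored as the characters '65535'

print(len(as_bytes))   # 2
print(len(as_text))    # 5
```

Trying to fit 65,536 itself into 2 bytes fails, because the 65,536 values run from 0 to 65,535.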
Using bits to represent numbers with decimals.
What about decimals? Well, we obviously would need more bits. A floating-point number is a limited-precision number that need not be whole and typically has a decimal point. These numbers are stored internally in a form of scientific notation (a sign, a fraction, and an exponent). Because precision is limited, only a subset of the real or rational numbers can be represented exactly. Float uses 4 bytes (or 32 bits) and covers magnitudes from roughly 1.2 x 10^-38 to 3.4 x 10^38. Need more? Well, you need more bits: a double uses 8 bytes (or 64 bits).
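Limited precision is easy to see in practice. The sketch below uses Python's standard `struct` module to pack the same value as a 4-byte float and an 8-byte double. Python's own `float` is a double, so the 4-byte version loses a little detail on the round trip.

```python
import struct

single = struct.pack("<f", 0.1)   # 4-byte (32-bit) float
double = struct.pack("<d", 0.1)   # 8-byte (64-bit) double

roundtrip = struct.unpack("<f", single)[0]

print(len(single), len(double))   # 4 8
print(roundtrip == 0.1)           # False: 32 bits cannot hold 0.1 as precisely
print(0.1 + 0.2 == 0.3)           # False: even 64 bits only approximates decimals
```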
Introducing variables.
We are going to talk a lot about keeping data organized. We do this the same way we do in life: we give names to folders (physical or otherwise), files, and just about everything else. Likewise, when we put some data away, whether an integer, character, or otherwise, we want to be able to get it back. So we give it a name and a box to be stored in. The contents may change, and thus we use the term variable. Technically, there are also constants, but for our purposes, these are boxes where we store data and can retrieve it.
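In Python, a variable is created simply by giving a value a name. The names and values below are invented for illustration.

```python
patient_count = 42      # an integer
gene_symbol = "PTEN"    # a string
is_control = False      # a boolean

# The contents of the box can change, which is why it is called a variable.
patient_count = patient_count + 1

print(patient_count)                 # 43
print(type(gene_symbol).__name__)    # str
```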
Summary
Composite Datatypes
Tables
Tables can be thought of as a worksheet. Below, we have the table hospital-data with multiple column headers. This actually comes from a csv file (below) that is publicly available, and you can download it (link).
CSV (Comma Separated Files)
We can represent the same data as a CSV. The first row is typically the header.
newID,PC1,PC2,PC3,PC4
PPMI.Pilot.HA_ITG001_296597,0.108006766,0.015726028,0.093262438,0.060885093
PPMI.Pilot.HA_ITG002_PP0018.3690,0.104529423,0.017393778,0.117255688,0.07702544
PPMI.Pilot.HA_ITG003_PP0018.3682,0.107576826,0.002806186,0.171024597,0.071616656
PPMI.Pilot.HA_ITG004_3119622,0.074604492,0.049329752,-0.115794448,0.099060234
PPMI.Pilot.HA_ITG005_3145413,0.110742871,0.061465364,0.231489894,-0.070450761
PPMI.Pilot.HA_ITG006_953306,0.113388178,0.013073106,0.232587341,-0.002473437
PPMI.Pilot.HA_ITG007_1176618,0.065665593,0.080236696,-0.218573769,0.071228383
PPMI.Pilot.HA_ITG008_PP0015.9868,0.088975015,0.026523725,-0.02374357,0.10829439
There is a big problem here in that “,” is often used within the data itself. For this reason, csv is really not ideal. Some tools like R offer to wrap fields in quotes (") to help, but it’s still prone to problems. There are a few other issues, such as the use of Unicode. More often than not, it’s best to wrap the fields in quotes (").
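The comma problem can be sketched with Python's standard `csv` module. The two-column data below is made up for illustration: one field contains a comma and the other contains quotes.

```python
import csv
import io

# Hypothetical data: the name contains a comma, the notes contain quotes.
raw = 'name,notes\n"Smith, Jane","said ""hello"""\n'

# Splitting on every comma breaks the name into two pieces.
naive = raw.splitlines()[1].split(",")
print(len(naive))   # 3 pieces instead of 2 fields

# A CSV-aware reader respects the quotes around each field.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])      # ['Smith, Jane', 'said "hello"']
```

Quoting keeps the field intact, but only if every tool in the chain parses quotes the same way, which is where problems tend to creep in.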
TSV (Tab Separated Files)
Tabs are written as \t in most tools, such as those in Unix. In BASH, to type a literal tab you press control-v, let go, and then press tab. We will use these files a fair amount.
newID	PC1	PC2	PC3	PC4
PPMI.Pilot.HA_ITG001_296597	0.108006766	0.015726028	0.093262438	0.060885093
PPMI.Pilot.HA_ITG002_PP0018.3690	0.104529423	0.017393778	0.117255688	0.07702544
PPMI.Pilot.HA_ITG003_PP0018.3682	0.107576826	0.002806186	0.171024597	0.071616656
All of these are considered flat files. In the future we will talk about structured tables and databases, such as those accessed through SQL, for example PostgreSQL.
CSV & TSV Tables
Comma-separated values (CSV) and Tab-Separated Files are plain text files, where the first line is typically a header and the following lines are rows. These are typically in ASCII or plain text.
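Reading such a file takes only a few lines. The sketch below parses two rows of the tab-separated example with Python's standard `csv` module, using the header line for column names. Converting the PC columns to numbers is a choice made for this illustration.

```python
import csv
import io

# Two rows of the example table, with \t marking each tab.
tsv_text = (
    "newID\tPC1\tPC2\tPC3\tPC4\n"
    "PPMI.Pilot.HA_ITG001_296597\t0.108006766\t0.015726028\t0.093262438\t0.060885093\n"
    "PPMI.Pilot.HA_ITG002_PP0018.3690\t0.104529423\t0.017393778\t0.117255688\t0.07702544\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")

# Keep the ID as text and turn every other column into a float.
records = [
    {key: (value if key == "newID" else float(value)) for key, value in row.items()}
    for row in reader
]

print(len(records))         # 2
print(records[0]["PC1"])    # 0.108006766
```

For a file on disk, the same reader works on an opened file handle instead of `io.StringIO`.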
Arrays, Vectors, Lists, or Ordered Arrays.
One way to think about storing data for later retrieval is the mailbox: basically, numbered places where we store data.
Likewise, we can make an ordered list of data, such as A1 is ‘hello’ and A2 is ‘goodbye’. One typically indexes it with brackets. The key is that the places are numbered sequentially. You can place things out of order, but in general, the expectation is that you push one thing onto the end, growing the size of the array by one.
A[1]=0.234234
A[2]=0.3234
A[3]=23.23
Arrays can, of course, be multi-dimensional, but generally, they are presumed to be all the same type of data, and thus you can write:
A[1,2]=0.234234
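In Python, the closest built-in is the list. One difference from the notation above is that Python numbers positions from 0, so the first element is `A[0]`. The nested list `grid` is a hypothetical example of a two-dimensional array.

```python
A = []
A.append(0.234234)   # each append grows the list by one
A.append(0.3234)
A.append(23.23)

print(A[0])     # 0.234234, the first element
print(len(A))   # 3

# A two-dimensional array can be sketched as a list of lists.
grid = [[0.0, 0.234234], [1.5, 2.5]]
print(grid[0][1])   # 0.234234: first row, second column
```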
Associative arrays, objects in JavaScript, named arrays, or hashes.
Numbered storage has limitations, and thus there is another type of storage that works much more like an address; these are termed associative arrays. Instead of a number, we use a name.
GeneName={"PTEN":"phosphatase and tensin homolog"}
I could have a variable called GeneInfo. I could store GeneInfo{'PTEN'}{'Chr'}='chromosome10', and then store all sorts of information in a way that is logically retrievable. This comes in handy a lot. Again, historically, unordered lists or hashes are expected to hold the same type of data throughout.
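In Python, an associative array is a `dict`. The sketch below mirrors the GeneInfo idea with a nested dictionary; the inner key `"Name"` is an assumption added for illustration.

```python
GeneInfo = {}

# Store values by name rather than by position.
GeneInfo["PTEN"] = {
    "Name": "phosphatase and tensin homolog",
    "Chr": "chromosome10",
}

print(GeneInfo["PTEN"]["Chr"])   # chromosome10

# Looking up a name that was never stored; .get returns None instead of raising an error.
print(GeneInfo.get("NOT_A_GENE"))   # None
```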
We can get much more complex and mix these quite a bit into data structures. In R, we use dataframes, which combine hashes of arrays and so forth. For example, let us load up some data!
JSON and Document Stores
JSON is a language-independent data format that allows different types of data to be embedded in a record. In document stores, a record is often called a document. At the heart of JSON is the Key:Value approach, where the value can be a string, boolean, number, array, associative array, or null. Strings are encapsulated in quotes ("), booleans are unquoted true or false, arrays are surrounded by brackets, [ and ], and associative arrays are surrounded by curly brackets, { and }.
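A minimal sketch of these value types, parsed with Python's standard `json` module. Apart from the PTEN name and chromosome used earlier, the record's fields and values are invented purely to show each type.

```python
import json

# A hypothetical record showing each kind of JSON value.
text = """
{
  "symbol": "PTEN",
  "name": "phosphatase and tensin homolog",
  "chromosome": 10,
  "reviewed": false,
  "tags": ["example", "gene"],
  "location": {"chr": "chromosome10"},
  "source": null
}
"""

record = json.loads(text)

print(type(record["symbol"]).__name__)     # str
print(type(record["chromosome"]).__name__) # int
print(record["reviewed"])                  # False (JSON false)
print(record["tags"][0])                   # example
print(record["location"]["chr"])           # chromosome10
print(record["source"])                    # None (JSON null)

# Going the other way: write the record back out as JSON text.
print(json.dumps(record, indent=2))
```

Python maps JSON arrays to lists and JSON associative arrays to dictionaries, so the nested data can be retrieved with the same bracket notation used for arrays and hashes.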