Building Meaning From Data

The term Data Science is ubiquitous, and definitions vary. The same is true for terms such as informatics, bioinformatics, and computational biology, among others. The labels are largely generic, but the underlying concepts are much the same and change little from one field to the next. We will focus on a few areas:

  • Jupyter Lab – web-based interactive development environment for Jupyter notebooks, code, and data
  • BASH/Command-line
  • Python. A general-purpose scripting/programming language that emphasizes code readability.
  • R. A scripting language with its roots in statistics.
  • RStudio. A company that has developed graphical user interfaces (GUIs) and libraries for R, making R more accessible and usable.
  • Databases (SQL). A standard language for accessing and manipulating databases.

It’s important to start with the idea that we have different types of data. Data can be numbers, letters, dates, and so forth. Within a computer, these are all represented differently and in very defined ways.

Primitive or Simple Data types

We first describe the basis for primitive datatypes.

Bits & boolean types of data

A bit is the basic unit of data that is either 0 or 1. Everything builds from there. A common representation is true/false (boolean).

Boolean is the first type of data to remember, and it can be encoded as a 0 or 1.

We can describe more complex data by using bits together. Two bits give us access to four categories:

00 -> Category 1, 01 -> Category 2, 11 -> Category 3, 10 -> Category 4
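As a small sketch of how the number of categories grows with the number of bits, the Python snippet below lists every pattern for a given width. The helper name bit_patterns is made up for this illustration.

```python
from itertools import product

def bit_patterns(n):
    """Return every combination of n bits as a string, e.g. '01'."""
    return ["".join(bits) for bits in product("01", repeat=n)]

two_bit = bit_patterns(2)
print(two_bit)               # ['00', '01', '10', '11']
print(len(bit_patterns(8)))  # 256
```

Each extra bit doubles the number of patterns, so n bits give 2^n categories.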

8 Bits to a Byte & a character type of data.

Modern computing really took off when we settled on 8 bits to represent a byte of data. 2^8 gives 256 categories. This is convenient because one byte can store anything that can be typed on a keyboard. The ASCII standard encodes these characters (below) using 7 bits, which gives 128 codes, and leaves the eighth bit free. Extended encodings use that eighth bit for another 128 characters, such as the accented letters of other languages. Below we can see that A is 100 0001 (decimal 65).
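The mapping between a character and its bits can be checked directly in Python. This is only a sketch using the built-in ord, chr, and format functions; the variable names are chosen for illustration.

```python
# ord() returns the numeric code of a character.
code_A = ord("A")

# format(..., "07b") writes that code as 7 binary digits.
bits_A = format(code_A, "07b")
print(code_A, bits_A)        # 65 1000001

# chr() goes the other way, from a code back to a character.
char_back = chr(0b1000001)
print(char_back)             # A
```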

The second type of data to know is character, sometimes abbreviated as char. It is a single letter, digit, or symbol, typically stored as one ASCII character.

Stringing characters together & string types of data

If we take several characters and bundle (or string) them together, we have the next important type of data: strings. A string data type is a series of characters stored together. We store these together in variables, and we'll discuss this more later.
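To make the idea of a string as a series of characters concrete, here is a brief Python sketch. The example word is arbitrary.

```python
word = "PTEN"

# A string can be taken apart into its individual characters...
chars = list(word)
print(chars)                 # ['P', 'T', 'E', 'N']

# ...and each character has its own numeric code.
codes = [ord(c) for c in word]
print(codes)                 # [80, 84, 69, 78]

# Stored as ASCII, the string takes one byte per character.
n_bytes = len(word.encode("ascii"))
print(n_bytes)               # 4
```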

Using bits to represent whole numbers, integers as a type of data

Storage has always been limiting. From floppy disks to iPads, nothing has changed: we need more storage. We can be smart about it, though. Instead of storing a number like 65,535 as a string using 5 bytes (one per digit), we can store it as an integer in 2 bytes, since 16 bits give us 2^16 = 65,536 distinct values (0 through 65,535). What if we want to store a larger number? We need a longer integer that uses 4 bytes, or 32 bits: 2^32 = 4,294,967,296. Building computers around 32 bits was good enough through the 90's, but 32 bits cannot even count the more than 5 billion people alive at the time, so if we want the ability to go higher, we need 64 bits. 2^64 gives us 18,446,744,073,709,551,616 values.
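The saving from storing a number as an integer rather than as text can be sketched in Python with the built-in int.to_bytes method. Note that 16 unsigned bits cover 0 through 65,535, so the example uses 65,535 as the largest value that fits.

```python
# Compare storing a number as text versus as a fixed-width integer.
number = 65535

as_text = str(number).encode("ascii")   # one byte per digit
as_int16 = number.to_bytes(2, "big")    # two bytes, unsigned

print(len(as_text), len(as_int16))      # 5 2

# How many distinct values each width can hold.
capacity = {bits: 2 ** bits for bits in (16, 32, 64)}
print(capacity[16])                     # 65536

# One more than the 16-bit maximum no longer fits in two bytes.
try:
    (number + 1).to_bytes(2, "big")
    fits = True
except OverflowError:
    fits = False
print(fits)                             # False
```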

Using bits to represent numbers with decimals.

What about decimals? Here we need to spend some of the bits on the position of the decimal point. A floating-point number is a limited-precision number that need not be whole and typically has a decimal point. These numbers are stored internally in a binary form of scientific notation: a sign, a significand, and an exponent. Because the number of bits is fixed, only a subset of the real (or even rational) numbers can be represented exactly.

A float uses 4 bytes (32 bits) and covers magnitudes from roughly 1.2 × 10^-38 to 3.4 × 10^38, with about 7 significant decimal digits. Need more? You need more bits: a double uses 8 bytes (64 bits).
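The sizes and the limited precision of floating-point numbers can be seen with Python's standard struct module. This is a minimal sketch; Python's own float type is a 64-bit double, so the 4-byte version is produced by packing.

```python
import struct

# Pack the same value as a 4-byte float and as an 8-byte double.
single = struct.pack("<f", 0.1)
double = struct.pack("<d", 0.1)
print(len(single), len(double))     # 4 8

# Reading the 4-byte version back does not return exactly 0.1,
# because 0.1 has no exact binary representation at that precision.
roundtrip = struct.unpack("<f", single)[0]
print(roundtrip == 0.1)             # False

# Even 64-bit doubles show small rounding effects.
print(0.1 + 0.2 == 0.3)             # False
```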

Introducing variables.

We are going to talk a lot about keeping data organized. We do this the same way we do in life: we give names to folders (physical or otherwise), files, and just about everything else. Likewise, when we put some data away, whether an integer, a character, or something else, we want to be able to get it back. We give each piece of data a name and a box to be stored in. The contents may change, and thus we use the term variable. Technically, there are also constants, but for our purposes, these are all boxes where we store data and can retrieve it.
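In Python, the named-box idea might look like the sketch below. All names and values here are hypothetical and chosen only for illustration.

```python
# Hypothetical names and values, chosen only for illustration.
gene = "PTEN"            # a string
sample_count = 12        # an integer
is_reviewed = False      # a boolean

# The box keeps its name, but its contents can change.
sample_count = sample_count + 1

kinds = [type(v).__name__ for v in (gene, sample_count, is_reviewed)]
print(sample_count, kinds)   # 13 ['str', 'int', 'bool']
```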

Summary

Composite Datatypes

Tables

Tables can be thought of as a worksheet in a spreadsheet. Below, we have the table hospital-data with multiple column headers. This actually comes from a CSV file (below) that is publicly available, and you can download it (link).

CSV & TSV Tables

Comma-separated values (CSV) and tab-separated values (TSV) files are plain text files, where the first line is typically a header and each following line is a row. Within a line, fields are separated by commas (CSV) or tabs (TSV). The text is typically ASCII.
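As a rough sketch of how such a file is read, the snippet below parses a tiny CSV with Python's standard csv module. The column names and rows are invented for this example and are not taken from the hospital-data file.

```python
import csv
import io

# Made-up table: a header line followed by two rows.
csv_text = "hospital,city,beds\nGeneral,Phoenix,120\nMercy,Tucson,85\n"

# DictReader uses the header line to name the fields of each row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)
print(reader.fieldnames)     # ['hospital', 'city', 'beds']
print(rows[0]["city"])       # Phoenix

# The same table as TSV only needs a different delimiter.
tsv_text = csv_text.replace(",", "\t")
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(tsv_rows == rows)      # True
```

Every value comes back as a string (for example "120"), so numeric columns have to be converted explicitly.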

Arrays, Vectors, Lists, or Ordered Arrays.

One way to think about storing data for later retrieval is a row of mailboxes: numbered places where we store data.

Likewise, we can make an ordered list of data, such as A1 is ‘hello’ and A2 is ‘goodbye’. One typically refers to an element by putting its number in brackets. The key is that the elements are numbered sequentially. You can place things out of order, but in general, the expectation is that you push one thing onto the end, growing the size of the array by one.

A[1]=0.234234
A[2]=0.3234
A[3]=23.23
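In Python, a comparable structure is the list. The sketch below builds the same three values by appending one at a time. One difference to keep in mind: Python numbers positions from 0, whereas the notation above (like R) starts at 1.

```python
A = []                 # start with an empty list

# Each append grows the list by one element.
A.append(0.234234)
A.append(0.3234)
A.append(23.23)

first = A[0]           # Python counts from 0, so this is A[1] above
length = len(A)
print(first, length)   # 0.234234 3
```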

Arrays can, of course, be multi-dimensional, but generally, they are presumed to be all the same type of data, and thus you can write:

A[1,2]=0.234234
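A two-dimensional array can be sketched in plain Python as a list of rows. The 2 x 3 size and the name grid are arbitrary choices for the example, and positions again start at 0 rather than 1.

```python
# Build a 2-row by 3-column grid filled with zeros.
grid = [[0.0 for _ in range(3)] for _ in range(2)]

# Row 1, column 2 in 1-based notation is [0][1] in Python.
grid[0][1] = 0.234234

print(grid)   # [[0.0, 0.234234, 0.0], [0.0, 0.0, 0.0]]
```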

Associative arrays, objects in javascript, named arrays, or hashes.

Numbered storage has limitations, and thus there is another type of storage that works more like a street address. These are termed associative arrays. Instead of a number, we use a name.

GeneName={"PTEN":"phosphatase and tensin homolog"}

I could have a variable called GeneInfo. I could store GeneInfo{'PTEN'}{'Chr'}='chromosome10', and then store all sorts of information in a way that is logically retrievable. This comes in handy a lot. Again, historically, the values in a hash are expected to be the same type of data, and a hash is unordered.
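In Python, associative arrays are called dictionaries, and the nested GeneInfo idea might be sketched as follows. The "Name" key and the "UNKNOWN" lookup are added here purely for illustration.

```python
GeneInfo = {}

# Each gene name maps to another dictionary of named fields.
GeneInfo["PTEN"] = {
    "Name": "phosphatase and tensin homolog",
    "Chr": "chromosome10",
}

chrom = GeneInfo["PTEN"]["Chr"]
print(chrom)                       # chromosome10

# .get() returns None instead of raising an error for a missing name.
missing = GeneInfo.get("UNKNOWN")
print(missing)                     # None
```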

We can get much more complex and mix these quite a bit into data structures. In R, we use data frames, which can be thought of as hashes of arrays: each named column is an array of values. For example, let us load up some data!

JSON and Document Stores

JSON is a language-independent data format that allows different types of data to be embedded within a single record. In document stores, each record is often called a document. At the heart of JSON is the key:value approach, where the value can be a string, boolean, number, array, associative array, or null. Strings are enclosed in double quotes ("), booleans are the unquoted words true or false, arrays are surrounded by square brackets, [ and ], and associative arrays are surrounded by curly brackets, { and }.
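A short Python sketch with the standard json module shows each of these value types in one record. The field names and values below are hypothetical and exist only to exercise each type.

```python
import json

# Hypothetical record using every JSON value type.
record = {
    "gene": "PTEN",                          # string
    "reviewed": True,                        # boolean
    "scores": [0.234234, 0.3234],            # array of numbers
    "location": {"chr": "chromosome10"},     # associative array
    "notes": None,                           # null
}

# Python's True and None are written as JSON's true and null.
text = json.dumps(record)
print(text)

# Parsing the text gives back an equal Python structure.
restored = json.loads(text)
print(restored == record)                    # True
```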