Composite Data Types
Arrays, Vectors, Lists, or Ordered Arrays.
One way to think about storing data for later retrieval is a row of mailboxes: numbered places where we store data.
Likewise, we can make an ordered list of data, for example where A1 is 'hello' and A2 is 'goodbye'. An array is typically declared with brackets. The key is that the elements are numbered sequentially. You can place things out of order, but in general the expectation is that you push one item onto the end, growing the size of the array by one.
A[1]=0.234234
A[2]=0.3234
A[3]=23.23
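As a minimal sketch, assuming Python (the notation above is language-neutral), the same array can be built by pushing one value at a time onto the end. Note that Python lists count from 0 rather than 1, so A[1] above corresponds to A[0] here.

```python
# Build the array by pushing one value at a time onto the end.
A = []
A.append(0.234234)
A.append(0.3234)
A.append(23.23)

print(len(A))   # the list grew by one with each append
print(A[0])     # first element (index 0 in Python)
print(A[-1])    # last element
```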
Arrays can, of course, be multi-dimensional, but generally, they are presumed to be all the same type of data, and thus you can write:
A[1,2]=0.234234
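One possible way to sketch a two-dimensional array in Python is a list of rows, where every cell holds the same type of value. The name grid and its 2 x 3 size are assumptions for illustration. Python counts from 0, so the assignment A[1,2] above (row 1, column 2, counting from 1) becomes grid[0][1].

```python
# A 2 x 3 grid stored as a list of rows; every cell holds a float.
grid = [[0.0 for _ in range(3)] for _ in range(2)]

# Row 1, column 2 (counting from 1) is row 0, column 1 (counting from 0).
grid[0][1] = 0.234234

for row in grid:
    print(row)
```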
Associative arrays, objects in JavaScript, named arrays, or hashes.
Numbered storage has limitations, so there is another type of storage that works more like an address: the associative array. Instead of a number, we use a name.
GeneName={"PTEN":"phosphatase and tensin homolog"}
I could have a variable called GeneInfo. I could store GeneInfo{'PTEN'}{'Chr'}='chromosome10', and then store all sorts of information in a way that is logically retrievable. This comes in handy a lot. Again, historically, the values in an unordered list or hash are all of the same type.
We can get much more complex and mix these quite a bit into data structures. In R, we use data frames, which combine these ideas, for example as hashes of arrays. For example, let us load up some data!
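The nested lookup described here could be sketched in Python as a dictionary of dictionaries. The 'Name' key and the 'UNKNOWN' lookup are hypothetical additions for illustration; the values come from the examples in the text.

```python
# Outer keys are gene symbols; inner keys are attribute names.
GeneInfo = {}
GeneInfo["PTEN"] = {}
GeneInfo["PTEN"]["Chr"] = "chromosome10"
GeneInfo["PTEN"]["Name"] = "phosphatase and tensin homolog"  # hypothetical key

print(GeneInfo["PTEN"]["Chr"])

# .get returns None for a missing key instead of raising an error.
print(GeneInfo.get("UNKNOWN"))
```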
Tables
Tables can be thought of as a worksheet. Below, we have the table hospital-data with multiple column headers. This actually comes from a csv file (below) that is publicly available, and you can download it (link).
CSV (Comma Separated Files)
We can represent the same data as a CSV. The first row is typically the header.
newID,PC1,PC2,PC3,PC4
PPMI.Pilot.HA_ITG001_296597,0.108006766,0.015726028,0.093262438,0.060885093
PPMI.Pilot.HA_ITG002_PP0018.3690,0.104529423,0.017393778,0.117255688,0.07702544
PPMI.Pilot.HA_ITG003_PP0018.3682,0.107576826,0.002806186,0.171024597,0.071616656
PPMI.Pilot.HA_ITG004_3119622,0.074604492,0.049329752,-0.115794448,0.099060234
PPMI.Pilot.HA_ITG005_3145413,0.110742871,0.061465364,0.231489894,-0.070450761
PPMI.Pilot.HA_ITG006_953306,0.113388178,0.013073106,0.232587341,-0.002473437
PPMI.Pilot.HA_ITG007_1176618,0.065665593,0.080236696,-0.218573769,0.071228383
PPMI.Pilot.HA_ITG008_PP0015.9868,0.088975015,0.026523725,-0.02374357,0.10829439
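As a rough sketch, Python's standard csv module can read data like this into a hash of arrays, with one list per column header. The first two data rows are copied into a string so the example is self-contained; the names text, columns, and pc1 are assumptions for illustration.

```python
import csv
import io

# Header plus the first two data rows from the example above.
text = """newID,PC1,PC2,PC3,PC4
PPMI.Pilot.HA_ITG001_296597,0.108006766,0.015726028,0.093262438,0.060885093
PPMI.Pilot.HA_ITG002_PP0018.3690,0.104529423,0.017393778,0.117255688,0.07702544
"""

reader = csv.DictReader(io.StringIO(text))  # the first row becomes the header
columns = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        columns[name].append(value)

# Every value arrives as a string; convert numeric columns explicitly.
pc1 = [float(v) for v in columns["PC1"]]
print(columns["newID"])
print(pc1)
```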
There is a big problem here: a comma (",") is often used within the data itself. For this reason, csv is really not ideal. Some tools like R offer to wrap fields in quotes (") to help, but it's still prone to problems. There are a few other issues, such as the use of Unicode. More often than not, it's best to have quotes (") around the fields.
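The comma problem can be illustrated with a short Python sketch. The row below is hypothetical: its middle field contains a comma and is wrapped in quotes. Splitting naively on commas breaks that field apart, while the standard csv module respects the quotes.

```python
import csv
import io

# Hypothetical row whose second field contains a comma inside quotes.
line = 'PTEN,"phosphatase, tensin homolog",10\n'

naive = line.strip().split(",")                # ignores the quotes
parsed = next(csv.reader(io.StringIO(line)))   # honors the quotes

print(len(naive))    # 4 pieces: the quoted field was split in two
print(len(parsed))   # 3 fields, as intended

# Writing with every field quoted keeps embedded commas safe.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL).writerow(parsed)
print(out.getvalue())
```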
TSV (Tab Separated Files)
Tabs are written as \t in most tools, such as on Unix. In Bash, to type a literal tab you press Control-V, let go, and then press Tab. We will use these a fair amount.
newID	PC1	PC2	PC3	PC4
PPMI.Pilot.HA_ITG001_296597	0.108006766	0.015726028	0.093262438	0.060885093
PPMI.Pilot.HA_ITG002_PP0018.3690	0.104529423	0.017393778	0.117255688	0.07702544
PPMI.Pilot.HA_ITG003_PP0018.3682	0.107576826	0.002806186	0.171024597	0.071616656
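Reading tab-separated data in Python can be sketched with the same standard csv module by setting the delimiter to a tab. The string below holds the header and the first data row, with \t standing for each tab; the variable names are assumptions for illustration.

```python
import csv
import io

# Header and first data row, with \t marking each tab.
text = (
    "newID\tPC1\tPC2\tPC3\tPC4\n"
    "PPMI.Pilot.HA_ITG001_296597\t0.108006766\t0.015726028\t0.093262438\t0.060885093\n"
)

rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
header, data = rows[0], rows[1:]

print(header)
print(data[0][0], float(data[0][1]))
```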
All of these are considered flat files. In the future we will talk about structured tables and databases, such as those accessed through SQL or PostgreSQL.
CSV & TSV Tables
Comma-separated values (CSV) and tab-separated values (TSV) files are plain text files, where the first line is typically a header and the following lines are rows. They are typically encoded in ASCII.