Databases
DATABASES
A database is an organized collection of data at its simplest level. Data is usually organized by rows or tables that contain compartmentalized sets of data. Examples could include addresses, bank transactions, medical records, and just about anything you can think about. Arguably, a database could be a simple textfile, an Excel workbook, or a set of tables within a database management system. Popular database options including MySQL, PostgreSQL, Microsoft Access, Microsoft SQL, and Oracle are relational. They use SQL
(Structured Query Language), which is a programming language just for managing data in a relational database. Non-relational databases, by contrast, do not have a strict column and row schema. They are viewed by proponents as “more flexible” and are increasingly used by organizations like Facebook, Google, and the NSA for big data situations like bioinformatics
History
The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards. The Oxford English Dictionary cites a 1962 report by the System Development Corporation of California as the first to use the term “database” in a specific technical sense.
Modern databases are often built as Relational Databases, as outlined in the 1970 article by Dr. EF Codd “A Relational Model o fData for Large Shared Data Banks”. In this paper, he described a new system for storing and working with large databases. Instead of records being stored in some sort of linked list of free-form records as in CODASYL, Codd’s idea was to use a “table” of fixed-length records, with each table used for a different type of entity. A linked-list system would be very inefficient when storing “sparse” databases where some of the data for any one record could be left empty. The relational model solved this by splitting the data into a series of normalized tables (or relations), with optional elements being moved out of the main table to where they would take up room only if needed. Data may be freely inserted, deleted and edited in these tables, with the DBMS doing whatever maintenance needed to present a table view to the application/user.
Linking the information back together is the key to this system. In the relational model, some bit of information was used as a “key”, uniquely defining a particular record. When information was being collected about a user, information stored in the optional tables would be found by searching for this key. For instance, if the login name of a user is unique, addresses and phone numbers for that user would be recorded with the login name as its key. This simple “re-linking” of related data back into a single collection is something that traditional computer languages are not designed for.
Codd’s paper was picked up by two people at Berkeley, Eugene Wong, and Michael Stonebraker. They started a project known as INGRES using funding that had already been allocated for a geographical database project and student programmers to produce code.IBM started building the first database based on this model, and then in 1979 Oracle built Relational Technology began commercializing the first modern database. The impact of Oracle on modern computing in the workforce is often underestimated, but in fact, IBM and Oracle developed many of the tools that organized the foundational data within many industries. Larry Ellison’s Oracle Database started from a different chain, based on IBM’s papers on System R. The dominance of Oracle began when they beat IBM to market in 1979.
The 1980’s saw the emergence of desktop computing, empowering users with Lotus 1-2-3 and dBase – two tools that remained dominant through 1990s. The 90s also saw a rise in object-oriented programming and growth in how data in various databases were handled. Programmers and designers began to treat the data in their databases as objects. That is to say that if a person’s data were in a database, that person’s attributes, such as their address, phone number, and age, were now considered to belong to that person instead of being extraneous data. This allows for relations between data to be related to objects and their attributes and not to individual fields. Examples include: Name.address would give the address within the Name object.
In the 2000’s the rise of XML and later JSON documents occurred. Also began the rise of no-SQL databases (non-relational) which were very quick, and leveraged cheap data storage.
Tables
A table in a database is a collection of rows
and columns
. A table has defined the number of columns.
It can have any number of rows. The rows can be identified by one or more values within the column subset, though the uniquely identifying column is recommended – termed a UID. A specific choice of columns which uniquely identify rows is called the primary key
. Views
also function as relational tables, but their content changes as the table data changes.Rows
A row contains data pertaining to a single item or record in a table – Rows are also known as records or tuples
. This term tuple
we see used in many programming languages such as Python.
Columns
A column contains data representing a specific feature of the data, and can also be described as fields or attributes.
Databases typically are founded with a few core functions or capabilities. One can create an entry, read an entry, update an entry, or delete an entry. The acronym is CRUD.
Database management system (DBMS)
A system that interacts users, software or other parts of the database, allowing viewing, editing, and analyzing data within the database.
CRUDdy Databases
The acronym CRUD refers to all of the major functions that are implemented in relational database applications. Each letter in the acronym can map to a standard Structured Query Language (SQL) statement, Hypertext Transfer Protocol (HTTP) method (this is typically used to build RESTful APIs):
Operation | SQL | HTTP |
---|---|---|
Create | INSERT | PUT / POST |
Read (Retrieve) | SELECT | GET |
Update (Modify) | UPDATE | PUT / POST / PATCH |
Delete (Destroy) | DELETE | DELETE |
The comparison of the database-oriented CRUD operations to the HTTP methods has some flaws. Strictly speaking, both PUT and POST can create resources; the key difference is that POST leaves it up to the server to decide at what URI to make the new resource available, while PUT dictates what URI to use; URIs as a concept do not align neatly with CRUD. The significant point about PUT is that it will replace whatever resource the URI was previously referring to with a brand new version, hence the PUT method being listed for Update as well. PUT is a ‘replace’ operation, which one could argue is not ‘update’.
In computer programming, create, read, update, and delete (CRUD) are the four basic functions of databases. Think of it as building a table.
SQL
SQL (Structured Query Language) is a computer language aimed to store, manipulate, and query data stored in relational databases. The first incarnation of SQL appeared in 1974, when a group in IBM developed the first prototype of a relational database. The first commercial relational database was released by Relational Software (later becoming Oracle).
Read/Select:
SELECT column-names FROM table-name WHERE condition ORDER BY sort-order
Example:
SELECT FirstName, LastName, City, Country FROM Customer WHERE City = 'Paris' ORDER BY LastName
Insert/Create
INSERT table-name (column-names) VALUES (column-values)
Example
INSERT Supplier (Name, ContactName, City, Country) VALUES ('Oxford Trading', 'Ian Smith', 'Oxford', 'UK')
Update
UPDATE table-name SET column-name = column-value WHERE condition
Example
UPDATE OrderItem SET Quantity = 2 WHERE Id = 388
Delete
DELETE table-name WHERE condition
Example:
DELETE Customer WHERE Email = 'alex@gmail.com'
noSQL
NoSQL (referring to “non SQL” or “non-relational”) databases provide a mechanism for storage and retrieval of data that is modeled typically as key:value, and are document stores. JSONs are great examples. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called “Not only SQL” to emphasize that they may support SQL-like query languages, or sit alongside SQL database in a polyglot persistence architecture. Motivations for this approach include: simplicity of design, simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases),and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as “more flexible” than relational database tables.
A critical key is that they do not natively allowing joining.
ACID
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc. In the context of databases, a sequence of database operations that satisfies the ACID properties (and these can be perceived as a single logical operation on the data) is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.
Normalization
As needs change over time, a database will undergo frequent updates to its table and column layout. Redundant data can often occur and cause multiple problems:
- inefficient – the database engine will need to process more data for each query or update
- bloated – storage requirements increase due to redundant data
- errors – redundant data must be manually inputted, which is error-prone
Normalization is the formal term for the process of eliminating redundant data from database tables.
Joins
A SQL Join statement is used to combine data or rows from two or more tables based on a common field between them. Different types of Joins are:
- INNER JOIN
- LEFT JOIN
- RIGHT JOIN
- FULL JOIN
Joins in R
Referring back to R, dplyr is a great tool for joining two tables.
How would we join? First Load in data:
samples <- read.csv('sample_info.csv',header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, row.names = 1) genes <- read.csv('expression_results.csv',header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, row.names = 1)
Lets get sum summary level information. What and how does this work?
genesummary<-data.frame(lapply( genes , function(x) rbind( mean = mean(x) , sd = sd(x) , median = median(x) , minimum = min(x) , maximum = max(x) , s.size = length(x) ) ))
Lets manipulate the dataframe:
genesummary_t=data.frame(t(genesummary))
Lets subset the dataframe:
filteredsamples <- subset(genesummary_t, sd < 1000 & maximum < 200000)
Lets create a common column to join on:
filteredsamples$names=rownames(filteredsamples) samples$names=rownames(samples)
Lets join:
pen_samples<-inner_join(samples,filteredsamples,by = "names")