What Goes Around Comes Around

Today I read the "What Goes Around Comes Around" chapter from the "Red Book" by Michael Stonebraker and Joseph M. Hellerstein. The chapter (or paper, if you will) is a summary of 35 years of data model proposals, grouped into 9 different eras. This post is a kind of cheat sheet to the lessons learned in the chapter.

The paper surveyed three decades of data model thinking. It is clear that we have come "full circle". We started off with a complex data model (Hierarchical/Network model), which was followed by a great debate between a complex model and a much simpler one (Relational model). The simpler one was shown to be advantageous in terms of understandability and its ability to support data independence.

Then a substantial collection of additions was proposed, none of which gained much market traction, largely because they failed to offer enough leverage in exchange for the increased complexity. The only ideas that did get traction were user-defined functions and user-defined access methods (both from the Object-Relational model), and these were performance constructs, not data model constructs. The current proposal is now a superset of the union of all previous proposals, i.e. we have navigated a full circle.

Hierarchical Data Model (IMS)

Late 1960's and 1970's

  • Lesson 1: Physical and logical data independence are highly desirable
  • Lesson 2: Tree structured data models are very restrictive
  • Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data
  • Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard. (Key-Value stores anyone?)
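
To make Lesson 4 concrete, here is a minimal sketch (plain Python dictionaries standing in for IMS segments, not actual DL/1) of what a record-at-a-time, tree-structured interface pushes onto the programmer: the traversal order is chosen by hand, and that choice effectively is the query plan.

```python
# Hypothetical in-memory "hierarchy": suppliers are parent records,
# shipments hang underneath them as child records.
suppliers = {
    "S16": {"name": "Acme", "city": "Boston"},
    "S22": {"name": "Ajax", "city": "Denver"},
}
shipments_of = {
    "S16": [{"part": "P1", "qty": 100}, {"part": "P2", "qty": 5}],
    "S22": [{"part": "P1", "qty": 30}],
}

# "Which suppliers ship part P1?" -- the programmer decides to scan every
# supplier and then every child shipment, one record at a time. If the data
# were laid out differently, a different loop order would be faster, and
# only the programmer can make that call.
for sno, supplier in suppliers.items():          # get-next over parents
    for shipment in shipments_of.get(sno, []):   # get-next-within-parent
        if shipment["part"] == "P1":
            print(sno, supplier["name"])
```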

Network Data Model (CODASYL)

1970's

  • Lesson 5: Networks are more flexible than hierarchies but more complex
  • Lesson 6: Loading and recovering networks is more complex than hierarchies

Relational Data Model

1970's and early 1980's

  • Lesson 7: Set-a-time languages are good, regardless of the data model, since they offer much improved physical data independence
  • Lesson 8: Logical data independence is easier with a simple data model than with a complex one
  • Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology (Key-Value stores anyone?)
  • Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers (Key-Value stores anyone?)
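
As a hedged counterpart to the record-at-a-time sketch in the IMS section above, here is the same question asked set-at-a-time, using Python's built-in sqlite3 module as a stand-in DBMS. The point of Lessons 7 and 10 is that the programmer only states what is wanted; the optimizer picks the access path.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE supplier (sno TEXT PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE shipment (sno TEXT, part TEXT, qty INTEGER);
    INSERT INTO supplier VALUES ('S16', 'Acme', 'Boston'), ('S22', 'Ajax', 'Denver');
    INSERT INTO shipment VALUES ('S16', 'P1', 100), ('S16', 'P2', 5), ('S22', 'P1', 30);
""")

# One declarative, set-at-a-time statement replaces the hand-written nested
# loops; join order and index use are the optimizer's problem, not ours.
query = """
    SELECT s.sno, s.name
    FROM supplier AS s JOIN shipment AS sh ON sh.sno = s.sno
    WHERE sh.part = 'P1'
"""
for sno, name in con.execute(query):
    print(sno, name)
```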

Entity-Relationship Data Model

1970's

  • Lesson 11: Functional dependencies are too difficult for mere mortals to understand

Extended Relational Data Model

1980's

  • Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere

Semantic Data Model

Late 1970's and 1980's

The main innovations were classes and multiple inheritance. No new lessons were learned; the model failed for the same reasons as the Extended Relational Data Model.

Object-Oriented Data Model

Late 1980's and early 1990's

Beginning in the mid 1980's there was a "tidal wave" of interest in object-oriented DBMSs (OODBs). Basically, this community pointed to an "impedance mismatch" between relational data bases and languages like C++.

Impedance mismatch: In practice, relational data bases had their own naming systems, their own data type systems, and their own conventions for returning data as a result of a query. Whatever programming language was used alongside a relational data base also had its own version of all of these facilities. Hence, to bind an application to the data base required a conversion from "programming language speak" to "data base speak" and back. This
was like "gluing an apple onto a pancake", and was the reason for the so-called impedance mismatch.

  • Lesson 13: Packages will not sell to users unless they are in "major pain"
  • Lesson 14: Persistent languages will go nowhere without the support of the programming language community

Object-Relational Data Model

Late 1980's and early 1990's

The Object-Relational (OR) era was motivated by the need to index and query geographical data (using e.g. an R-tree access method), since two-dimensional search is not well supported by B-tree access methods.

As a result, the OR proposal added:

  • user-defined data types
  • user-defined operators
  • user-defined functions (a small sketch follows below)
  • user-defined access methods

  • Lesson 14: The major benefits of OR are two-fold: putting code in the data base (and thereby blurring the distinction between code and data) and user-defined access methods
  • Lesson 15: Widespread adoption of new technology requires either standards and/or an elephant pushing hard
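
As a taste of what "putting code in the data base" buys, here is a minimal sketch of a user-defined function, using the create_function hook of Python's built-in sqlite3 module as a stand-in (a full OR DBMS would also let you add types and access methods such as R-trees, which is not shown here).

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE site (name TEXT, x REAL, y REAL)")
con.executemany("INSERT INTO site VALUES (?, ?, ?)",
                [("depot", 0.0, 0.0), ("store", 3.0, 4.0), ("plant", 30.0, 40.0)])

def distance(x1, y1, x2, y2):
    """Euclidean distance -- application code that will run inside the query."""
    return math.hypot(x1 - x2, y1 - y2)

# Register the function with the engine so SQL can call it during the scan,
# instead of shipping every row back to the application and filtering there.
con.create_function("distance", 4, distance)

# "All sites within 10 units of the depot" as ordinary SQL.
for (name,) in con.execute("SELECT name FROM site WHERE distance(x, y, 0.0, 0.0) < 10"):
    print(name)
```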

Semi-structured Data Model (XML)

Late 1990's to the present

There are two basic points that this class of work exemplifies: (1) schema last and (2) complex network-oriented data model.
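
Here is a hedged sketch of the "schema last" point, using the xml.etree module from the Python standard library (the document below is made up): nothing about the records is declared up front, and two elements of the same kind need not carry the same fields.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<people>
  <person><name>Smith</name><salary>1000</salary></person>
  <person><name>Jones</name><phone>555-1234</phone></person>
</people>
""")

for person in doc:
    # The "schema" of each record is discovered on read: it is simply
    # whatever tags happen to be present in that element.
    print({child.tag: child.text for child in person})
```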

  • Lesson 16: Schema-last is probably a niche market
  • Lesson 17: XQuery is pretty much OR SQL with a different syntax
  • Lesson 18: XML will not solve the semantic heterogeneity problem, either inside or outside the enterprise
