Choosing Data Models for Big Data

In this article, we discuss the things to consider when choosing a data model to represent a particular concept.

About the author: Ayse Kok is a resident writer for Data Driven Investor.

At a high level, a data model defines the structure and meaning of data in a system. This shapes how you represent concepts of interest or entities and their relationships to each other. It also influences how you end up thinking about these entities. For example, one USER can own many email ACCOUNTs, and each account can, in turn, own many EMAILs.

There often isn’t one true answer for which data model should be used to represent a particular concept. It is up to the system architects to consider the aspects of the application they are trying to build and to select the data model that they think best represents the system’s needs and allows for growth.

For example, instead of the user->accounts->emails representation, an architect could model email differently by storing messages themselves as first-class objects and linking the accounts that exchanged a message to that particular message (this is more similar to the Drive model of how documents are represented and connected between collaborators). Instead of storing one copy of an email per account, this model would deduplicate or normalize the email messages, possibly saving storage space at the cost of making certain queries more expensive.
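To make the two alternatives concrete, here is a minimal sketch of both models in Python. The class and function names (`Account`, `Message`, `deliver_model_a`, `inbox_model_b`) are assumptions made for illustration; they are not taken from any real mail system.

```python
from dataclasses import dataclass, field


# Model A: user -> accounts -> emails. Each account holds its own copy.
@dataclass
class Email:
    subject: str
    body: str


@dataclass
class Account:
    address: str
    emails: list[Email] = field(default_factory=list)


@dataclass
class User:
    name: str
    accounts: list[Account] = field(default_factory=list)


# Model B: messages are first-class objects linked to the accounts involved.
@dataclass
class Message:
    message_id: int
    subject: str
    body: str
    participants: set[str] = field(default_factory=set)  # account addresses


def deliver_model_a(accounts: list[Account], subject: str, body: str) -> None:
    """Store one copy of the email in every receiving account."""
    for account in accounts:
        account.emails.append(Email(subject, body))


def inbox_model_b(messages: list[Message], address: str) -> list[Message]:
    """Listing one account's mail means scanning (or indexing) all messages."""
    return [m for m in messages if address in m.participants]


alice = Account("alice@example.com")
bob = Account("bob@example.com")
deliver_model_a([alice, bob], "Hi", "Lunch?")
copies_a = len(alice.emails) + len(bob.emails)  # one copy per account

store_b = [Message(1, "Hi", "Lunch?", {"alice@example.com", "bob@example.com"})]
copies_b = len(store_b)  # a single shared copy
```

Model A makes "show me this account's mail" trivial but duplicates the message, while Model B stores it once and pays for that with a search when listing an inbox.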

This example also emphasizes that choosing a data model is not the same as choosing a data representation; it is a genuine modeling step. The choice of data model has real implications for how an application will behave and for which operations are natural or feasible for it to perform.

Layers of Data Models

Data models help developers manage and transform the information their applications work with by specifying how that data should be represented and interrelated. This has an additional effect: the model developers use to represent the data in their systems governs the way they think about and share that data.

Data models can be found both vertically, throughout the application stack, and horizontally, where the application interacts with other systems. There are often multiple layers of data models that make up any one application. Indeed, one layer’s data model may be another layer’s implementation.

As mentioned before, at a given layer, the data model is different from the actual representation of the data. Depending on their needs, a developer might want to store the data in a structured JSON or XML representation. An application’s data model also shouldn’t be confused with the implementation of the database, which is the mechanics behind how the information is stored and retrieved (a separate topic).

Data Models

Perhaps the simplest data model is that of the key/value pair, where information is organized as pairs of keys and arbitrary opaque values. What is meant by opaque is that the system doesn’t understand anything about the values stored; they are simply unparsed and uninterpreted bytes.
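As a rough illustration of what "opaque" means, the sketch below uses a toy in-memory store. `KeyValueStore` is a hypothetical class written for this example, not the API of any real product; the point is that only the client knows the stored bytes happen to be JSON.

```python
import json


class KeyValueStore:
    """A toy in-memory key/value store whose values are opaque bytes."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value  # stored as-is, never parsed

    def get(self, key: str) -> bytes:
        return self._data[key]


store = KeyValueStore()
store.put("user:42", json.dumps({"name": "Ada"}).encode("utf-8"))

# The store hands back raw bytes; interpreting them is the client's job.
raw = store.get("user:42")
profile = json.loads(raw.decode("utf-8"))
```

Because the store never looks inside the value, it cannot answer a question such as "find all users named Ada" without the client fetching and decoding every value.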

Non-relational database systems are generally known as NoSQL, which may not be a great name. They aren’t necessarily anti-relational; rather, they are simply alternatives that can be used when the relational model isn’t the best fit for the data. In terms of adding structure to the data being stored, non-relational databases such as document stores can be considered the next step up from a plain key-value store.

One-to-many Relationships

The ability to naturally represent the implicit tree structure of one-to-many relationships is one advantage of the document model. Consider a character sheet for a game, which lists a character and the items that character carries. Instead of spreading the items and character information across multiple tables in the database, the document model stores the entire character sheet as a single document.

The drawback shows up when a change must be made across all character sheets, for example modifying every shield. With the document model approach, a developer would have to request each character sheet from the database, iterate over all of the items in each sheet to find the shields, and then modify each shield found. This can be tricky: there is a lot of overhead involved in requesting all the characters and iterating over their items, and the process runs the risk of data inconsistency.
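The shield update can be sketched with plain Python dictionaries standing in for documents. The character names, item fields, and the `raise_shield_defense` helper are all made up for illustration.

```python
# Each character sheet is one self-contained document (hypothetical fields).
character_sheets = [
    {"name": "Brienne", "items": [{"type": "shield", "defense": 2},
                                  {"type": "sword", "attack": 5}]},
    {"name": "Merlin", "items": [{"type": "staff", "attack": 3}]},
    {"name": "Gimli", "items": [{"type": "shield", "defense": 2}]},
]


def raise_shield_defense(sheets: list[dict], bonus: int) -> int:
    """Visit every sheet and every item; return how many shields changed."""
    changed = 0
    for sheet in sheets:  # in a real datastore, one fetch per document
        for item in sheet["items"]:
            if item["type"] == "shield":
                item["defense"] += bonus
                changed += 1
    return changed


updated = raise_shield_defense(character_sheets, 1)
```

Every sheet is touched even though only some contain a shield, and if the loop stops halfway the sheets already visited disagree with the rest.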

While these kinds of operations can be performed more naturally in a normalized database, the tradeoff is being bound to a strict schema which defines the relationships between the entities.

Many-to-one Relationships

Many-to-one relationships differ from one-to-many relationships mostly in perspective: which kind of relationship it is depends on which side the entity is on.

In a traditional relational database, a many-to-one relationship will have a single row in the ‘one’ table (e.g., the user table), which is associated with many rows in the ‘many’ table (e.g., the articles table). This manifests generally in the database schema by adding an additional column to the ‘many’ table, which references the ID of the ‘one’ row (called a foreign key).
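A minimal sketch of this layout, using Python's built-in `sqlite3` module with an in-memory database. The table and column names follow the users/articles example above; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE articles (
        id      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        user_id INTEGER NOT NULL REFERENCES users(id)  -- the foreign key
    );
""")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "alice"))
conn.executemany(
    "INSERT INTO articles (title, user_id) VALUES (?, ?)",
    [("Choosing Data Models", 1), ("Graphs in Practice", 1)],
)

# The database resolves the reference for us with a JOIN.
rows = conn.execute("""
    SELECT users.name, articles.title
    FROM articles
    JOIN users ON articles.user_id = users.id
    ORDER BY articles.id
""").fetchall()
```

The single `users` row is stored once, and each of the many `articles` rows points back to it through `user_id`.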

This kind of relationship isn’t a natural fit for document-style data models. There are no tables related by foreign keys in a document-model datastore, which means that if the data contains many-to-one relationships, one can’t do a JOIN on the tables in question to return all the relevant data. In a document model, the developer would need to rely on either application-level joins or denormalization.

Application-level joins mean that the developer needs to manually resolve references from one record to another. This often involves performing multiple queries against the database to retrieve the parts of the data wanted, then letting the application decide how to stitch those parts together. This shifts the work of the JOIN from the database to the application code, which not everyone finds desirable: databases are often optimized at a low level to perform JOIN operations, and the application may not be.
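Here is one way an application-level join might look, with in-memory collections standing in for a document datastore. The collections and the `fetch_users` helper are hypothetical; in a real system each would be a round trip to the database.

```python
# Two hypothetical document collections.
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
articles = [
    {"title": "Choosing Data Models", "user_id": 1},
    {"title": "Graphs in Practice", "user_id": 2},
    {"title": "Schema on Read", "user_id": 1},
]


def fetch_users(user_ids: set[int]) -> dict[int, dict]:
    """Stand-in for a second round trip to the datastore."""
    return {uid: users[uid] for uid in user_ids}


def articles_with_authors() -> list[dict]:
    # Query 1: fetch the articles.
    found = list(articles)
    # Query 2: resolve all referenced users in one batch.
    authors = fetch_users({a["user_id"] for a in found})
    # Stitch the two result sets together in application code.
    return [
        {"title": a["title"], "author": authors[a["user_id"]]["name"]}
        for a in found
    ]


joined = articles_with_authors()
```

Batching the second lookup keeps this to two queries; resolving each reference one at a time would instead issue one query per article.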

Denormalization is an alternative to, or an enhancement of, application-level joins, in which the developer duplicates some of the data to avoid having to make multiple queries. The tradeoff is that updating denormalized data carries overhead and risks inconsistent data if, for some reason, not all copies of the denormalized data are updated (recall the shield update scenario).

Many-to-many Relationships

Many-to-many relationships are connections between lists of entities. For example, given a collection of books and authors, it is entirely possible that each book may be written by multiple authors, and each author may have written one or more books.

If the document model were used, some denormalization could be undertaken by storing book titles in the author records:

{"author": "balint", "book_titles": ["Email in 500 Easy Steps", "Delivery delays; Feature or Bug?"]}

{"author": "igorgr", "book_titles": ["Compensation and Competition; Motivating Your Employees", "Delivery delays; Feature or Bug?"]}

This would let one obtain all of the books a given author has written relatively easily by querying the author document records. However, if one wanted information from the book document other than the title, one would need to query the book document records as well and perform the joins in the application.
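The sketch below works through both cases with the author records shown above. The separate `books` collection, its page counts, and the helper functions are invented purely for illustration.

```python
authors = [
    {"author": "balint",
     "book_titles": ["Email in 500 Easy Steps",
                     "Delivery delays; Feature or Bug?"]},
    {"author": "igorgr",
     "book_titles": ["Compensation and Competition; Motivating Your Employees",
                     "Delivery delays; Feature or Bug?"]},
]

# Hypothetical book documents; the page counts are made up.
books = {
    "Email in 500 Easy Steps": {"pages": 500},
    "Delivery delays; Feature or Bug?": {"pages": 120},
    "Compensation and Competition; Motivating Your Employees": {"pages": 310},
}


def titles_by(name: str) -> list[str]:
    """Answered from the author record alone, thanks to denormalization."""
    for record in authors:
        if record["author"] == name:
            return record["book_titles"]
    return []


def total_pages_by(name: str) -> int:
    """Needs a lookup per title in `books`: an application-level join."""
    return sum(books[title]["pages"] for title in titles_by(name))


def rename_book(old: str, new: str) -> None:
    """Every copy of the title must change, or the records drift apart."""
    books[new] = books.pop(old)
    for record in authors:
        record["book_titles"] = [new if t == old else t
                                 for t in record["book_titles"]]
```

Titles come back from a single record, page counts need the extra lookup, and a rename has to touch every author record that carries a copy of the title.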

The relational model can represent many-to-many relationships up to a point, but when the interconnections between entities are complex, there is a more natural alternative referred to as the graph model.
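For comparison, a common relational way to represent the books-and-authors case is a junction table holding one row per pairing. This is a sketch using Python's built-in `sqlite3` module; the table name `authorship` and the schema are assumptions, not something prescribed by the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Junction table: one row per (author, book) pairing.
    CREATE TABLE authorship (
        author_id INTEGER NOT NULL REFERENCES authors(id),
        book_id   INTEGER NOT NULL REFERENCES books(id),
        PRIMARY KEY (author_id, book_id)
    );
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "balint"), (2, "igorgr")])
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [(1, "Email in 500 Easy Steps"),
                  (2, "Delivery delays; Feature or Bug?")])
conn.executemany("INSERT INTO authorship VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

# Who wrote a given book? Two joins walk author -> pairing -> book.
co_authors = [name for (name,) in conn.execute("""
    SELECT authors.name
    FROM authors
    JOIN authorship ON authorship.author_id = authors.id
    JOIN books ON books.id = authorship.book_id
    WHERE books.title = ?
    ORDER BY authors.name
""", ("Delivery delays; Feature or Bug?",))]
```

Each title is stored once, but every extra hop between entities adds another join, which is where heavily interconnected data starts to strain this approach.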

Data Model: the Relational Model

Within a relational database, data is organized into relations (generally called tables), where each record in a relation is called a row. One can read a particular row by designating some columns as keys and matching those in the query. Cross-table relationships can be constructed by adding keys that refer to rows in other tables.

In the relational model, the predefined type of each field in a given row is specified by the schema. The schema governs what kinds of data can go where, what tables exist, and how they are related.

Relational databases often enforce data normalization via their tabular nature to minimize redundancy (the opposite of the document model approach) and enforce data consistency. Normalization can be considered as the reduction of data to a canonical form: e.g., instead of allowing a free-form text field for a state of residence, only allow a choice from a predefined set of values. Rows in tables can be related to one another via foreign keys. This provides a natural way to represent many-to-one relationships and allows the database to natively support JOIN operations to retrieve related data.
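The state-of-residence example can be sketched with a `CHECK` constraint in SQLite, via Python's built-in `sqlite3` module. The `residents` table and the three allowed state codes are assumptions chosen to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE residents (
        name  TEXT NOT NULL,
        state TEXT NOT NULL CHECK (state IN ('CA', 'NY', 'WA'))
    )
""")
conn.execute("INSERT INTO residents VALUES (?, ?)", ("alice", "CA"))

# A free-form spelling is refused at write time.
try:
    conn.execute("INSERT INTO residents VALUES (?, ?)", ("bob", "Calif."))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT name, state FROM residents").fetchall()
```

A lookup table referenced by a foreign key would achieve the same effect and is easier to extend when the set of allowed values changes.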

Data Models: the Graph Model

As mentioned before, highly interconnected data isn’t a great fit for the document model, which most naturally works with tree-structured data (data with one-to-many relationships). The relational model can handle some simple cases of interconnected data (many-to-many relationships), but if the data is very interconnected, then the graph data model can be considered.

In the graph model, data is represented by vertices (entities or nodes) and edges (the relationships between those entities). There are many different kinds of graphs, but a few more common ones are social graphs, web graphs, and road graphs.

In a social graph, the entities are people, and the edges indicate which people know one another. Web graphs, on the other hand, represent individual web pages as entities and the links between them as edges. Algorithms such as Google’s PageRank can rank the importance of web pages by using the edges (links) pointing to a page as a proxy for its popularity.
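As a deliberately simplified illustration of using inbound links as a popularity proxy, the sketch below just counts them. This is not PageRank itself, which also weighs where each link comes from; the page names are invented.

```python
from collections import Counter

# Hypothetical web graph: each pair is a link (source page, target page).
links = [
    ("a.html", "home.html"),
    ("b.html", "home.html"),
    ("c.html", "home.html"),
    ("home.html", "a.html"),
    ("c.html", "a.html"),
]

# Count the inbound edges of every page that is linked to at least once.
inbound = Counter(target for _, target in links)

# Pages ordered from most to least linked-to.
ranked = [page for page, _ in inbound.most_common()]
```

Pages that nobody links to, such as `b.html` and `c.html` here, never appear in `inbound` at all.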

In road graphs, vertices represent locations (perhaps coordinate pairs) and edges model the roads between them. You might immediately think of Google Maps. The data represented in all of these graphs is highly interconnected.

The graph model can also be used to store completely different types of objects and their relationships in a single datastore. In a graph data model, the relationships between entities are stored directly. Instead of requiring a lookup across two tables, the entry for a user holds a pointer to the email address. This eliminates the need for a potentially costly JOIN, whose cost increases with the complexity of the query.

Imagine that you wanted to perform the following query: “who is the comedian host guy on that British baking show with Paul Hollywood?”. In a relational database, you can imagine there are going to be a lot of separate searches through separate people and TV-show tables. By contrast, in a graph database, you can find the answer in one query. You’d start with Paul Hollywood, gather the links to the TV shows he’s been on, search that list for “baking”, follow that to the Great British Bake Off, gather all the links for hosts on that show, then search that list of hosts for people tagged with ‘comedian’ and ‘male’.

The important thing is that any vertex can be connected to another via an edge (there are no restrictions enforced by a schema on what can be connected), which makes the model flexible. Also, you should be able to traverse the graph simply; given any vertex, you should be able to efficiently find both its incoming and outgoing edges. You can also use labels or other metadata to differentiate between relationships, making it possible to store different kinds of information in a single graph while keeping the data model clean.
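The traversal described for the baking-show question can be sketched with a small in-memory graph that indexes both incoming and outgoing edges. The `Graph` class, the edge labels (`appears_on`, `hosts`), and the tags are assumptions for this example, and the data is a tiny illustrative sample rather than a complete record of the show.

```python
from collections import defaultdict


class Graph:
    """A tiny labeled graph indexed in both directions."""

    def __init__(self) -> None:
        self.tags = defaultdict(set)        # vertex -> set of tags
        self.out_edges = defaultdict(list)  # vertex -> [(label, target)]
        self.in_edges = defaultdict(list)   # vertex -> [(label, source)]

    def add_vertex(self, name: str, *tags: str) -> None:
        self.tags[name].update(tags)

    def add_edge(self, source: str, label: str, target: str) -> None:
        self.out_edges[source].append((label, target))
        self.in_edges[target].append((label, source))

    def outgoing(self, vertex: str, label: str) -> list[str]:
        return [t for edge_label, t in self.out_edges[vertex] if edge_label == label]

    def incoming(self, vertex: str, label: str) -> list[str]:
        return [s for edge_label, s in self.in_edges[vertex] if edge_label == label]


g = Graph()
show = "The Great British Bake Off"
g.add_vertex("Paul Hollywood", "person")
g.add_vertex("Noel Fielding", "person", "comedian", "male")
g.add_vertex("Sandi Toksvig", "person", "comedian", "female")
g.add_vertex(show, "tv_show", "baking")
g.add_edge("Paul Hollywood", "appears_on", show)
g.add_edge("Noel Fielding", "hosts", show)
g.add_edge("Sandi Toksvig", "hosts", show)

# Start at Paul Hollywood, follow his shows, keep the baking ones,
# then walk the incoming "hosts" edges and filter by tag.
baking_shows = [s for s in g.outgoing("Paul Hollywood", "appears_on")
                if "baking" in g.tags[s]]
answer = [h for s in baking_shows for h in g.incoming(s, "hosts")
          if {"comedian", "male"} <= g.tags[h]]
```

People and TV shows live in the same structure, and the edge labels keep the different kinds of relationship apart without a schema dictating what may connect to what.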

Choosing a Data Model

Given all the different kinds of data models available, one might be wondering how to pick between them. As mentioned before, there are no hard and fast rules, and every choice has some costs and some benefits.

When choosing a data model, it is helpful to keep a few things in mind: which model will lead to simpler code at the application level, how much flexibility to anticipate for the future use of the application, and what kind of access patterns the application will require (read, write, mutate).

Code Simplicity

The model that will lead to simpler application code depends on the kind of data the application will be writing. For instance, if the data is representable by a tree-like model of one-to-many relationships, the document model is likely a good fit for the data. However, one of its limitations is that one can’t refer to nested elements in the stored document directly, requiring one to use the implicit structure of the document to access the data.

The document model also has some trouble naturally representing many-to-one and many-to-many relationships, which generally results in poor support for database-powered JOIN operations. One can reduce the need for joins somewhat by denormalizing the data (spreading copies of data around); otherwise, the application will need to handle the joins itself by making multiple queries and stitching the results together, which can be complicated and slow.

The relational model can be helpful in the case of many-to-many relationships, yet in the case of highly interconnected data, the graph model might be more useful. The graph model is flexible enough to handle these connections and supports natural querying of the data without the need for complicated cross-table joins.

Schema

Although usually lacking an explicit schema definition, document databases aren’t really without a schema. The documents written to them contain an implicit schema provided by the structure of the document itself. Arbitrary keys and values can be added to the document when written; when the document is read it is then up to the client to interpret the structure. This is often referred to as schema-on-read. This is opposed to the traditional approach of relational databases, where the constraints of a strict, predefined schema are enforced when the application writes to the database (referred to as schema-on-write).

These two approaches to schema enforcement are similar to the dynamic vs static type-checking used in programming languages. Schema-on-read is similar to runtime (or dynamic) type checking used by languages like Python, while schema-on-write is more akin to the static (or compile-time) type checking used by C++.

The difference between these two approaches can be seen in the context of a data format change. Suppose dates are stored in Month-Day-Year format and one would like to change that to Year-Month-Day. In a document database, where the schema is enforced on read, one would add some code to the application to handle the old format whenever old documents are read.
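A minimal sketch of such read-time handling, assuming both formats use hyphens as separators (for example `03-15-2019` and `2019-03-15`). The `read_date` helper is hypothetical.

```python
from datetime import date, datetime


def read_date(raw: str) -> date:
    """Accept new Year-Month-Day values and old Month-Day-Year ones."""
    for fmt in ("%Y-%m-%d", "%m-%d-%Y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")


old_style = read_date("03-15-2019")
new_style = read_date("2019-03-15")
```

No stored document has to be rewritten, but every reader of the data now carries this compatibility code for as long as old documents exist.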

In a relational model, where the schema is enforced on write and requires the Month-Day-Year format, one would need to perform a schema change. This would mean adding a column to the table to handle the new format, then copying the old data into the new format, and finally cleaning up the old column. This kind of change can be tricky, dangerous, and time-consuming on large datasets.
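The first two steps of such a migration might look like the following in SQLite, using Python's built-in `sqlite3` module. The `events` table, its column names, and the sample dates are assumptions; the final clean-up step is left as a comment because support for dropping a column depends on the SQLite version in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Old schema: dates stored as MM-DD-YYYY text.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, happened_on TEXT)")
conn.executemany("INSERT INTO events (happened_on) VALUES (?)",
                 [("03-15-2019",), ("12-01-2020",)])

# Step 1: add a column for the new format.
conn.execute("ALTER TABLE events ADD COLUMN happened_on_iso TEXT")

# Step 2: backfill by rearranging MM-DD-YYYY into YYYY-MM-DD.
conn.execute("""
    UPDATE events
    SET happened_on_iso = substr(happened_on, 7, 4) || '-' ||
                          substr(happened_on, 1, 2) || '-' ||
                          substr(happened_on, 4, 2)
""")

# Step 3 (not run here): remove the old column once every reader and
# writer has switched to happened_on_iso.

migrated = [row[0] for row in conn.execute(
    "SELECT happened_on_iso FROM events ORDER BY id")]
```

On a large table the backfill in step 2 rewrites every row, which is where most of the time and risk of the migration tends to sit.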

On the other hand, if one would like to modify the data contents rather than the format, doing so in a normalized relational database might be easier. In a document model, if the email address of a given user changes, the existing mail messages sent by that user do not reflect that change. To perform such an update, one would need to update every document individually, which involves its own overhead and consistency concerns.

So, the question is: is there homogeneous data with the same structure which would benefit from a schema to document and enforce that structure? Or is there heterogeneous, unstructured data which would benefit from more flexibility when storing that data? Will the application be making modifications to the schema frequently? Will the strictness of the schema impede development?

These are the kinds of questions that should be asked when considering data model options.

Conclusions

Each model has strengths and weaknesses, depending on the use case of the application. Also, most applications are made up of several layers of data models, each one attempting to present a clean API to the layer above it.

This overview of different kinds of data models will hopefully be enough to get a developer started in thinking about the choices behind the data models used by their applications.

This article was originally published on Data Driven Investor.
