The Data Hive

11. November 2008 00:19

I've had some ideas bouncing around in the back of my head for a few days now. One is that for an application, specifically a large-scale web application, a traditional RDBMS backend becomes a limiting factor for scale. Relational databases are really good at aggregating data and supporting reporting interfaces, but they aren't so great at serving structured object data to the front end. I'm talking about going beyond ORMs, and even beyond an object database, here. What I see coming in the future is a Data Hive.

Think about being able to request a serialized object view. You request the object by its type and an identifier. As far as the front end is concerned, you don't care where it was stored, or even much about how it was stored; you just want your data. On the backend, the hive client makes a request to the hive, the request spreads through the hive members, and one of those members responds that it has the object. Then a more static connection is made to retrieve that piece of data. The hive traffic could possibly run on a separate network, with very low-level broadcast network calls for the request/response. I realize this may be very chatty in terms of traffic, especially when more than one node actually has the data being requested.
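To make the request flow a bit more concrete, here is a minimal in-process sketch in Python. It is only an illustration under my own assumptions: `HiveNode`, `HiveClient`, `locate` and `get` are made-up names, the "broadcast" is just a loop over a list of nodes, and the "connection" is a plain method call rather than real network traffic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (object type, identifier)


@dataclass
class HiveNode:
    """Hypothetical hive member holding some subset of the objects."""
    name: str
    store: Dict[Key, Any] = field(default_factory=dict)

    def has(self, key: Key) -> bool:
        # Stands in for answering a broadcast "who has this?" request.
        return key in self.store

    def fetch(self, key: Key) -> Any:
        # Stands in for the follow-up connection that transfers the data.
        return self.store[key]


class HiveClient:
    """Hypothetical client: ask every member, then fetch from one holder."""

    def __init__(self, nodes: List[HiveNode]) -> None:
        self.nodes = nodes

    def locate(self, key: Key) -> List[HiveNode]:
        # Simulated broadcast: every node that holds the key answers.
        return [node for node in self.nodes if node.has(key)]

    def get(self, obj_type: str, obj_id: str) -> Optional[Any]:
        key = (obj_type, obj_id)
        holders = self.locate(key)
        if not holders:
            return None  # nobody in the hive has the object
        # More than one node may answer; this sketch takes the first.
        return holders[0].fetch(key)


nodes = [HiveNode("a"), HiveNode("b"), HiveNode("c")]
nodes[1].store[("User", "42")] = {"name": "Ada"}
nodes[2].store[("User", "42")] = {"name": "Ada"}
client = HiveClient(nodes)
user = client.get("User", "42")
```

Note that two nodes answer for the same key here, which is exactly the chattiness concern: a real hive would need some rule for picking a responder, and this sketch simply takes the first one.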

The hard part will be distributing the most-used data in such a way that it is both widely available and spreads load across all the hive nodes. Also worth thinking about is how to replicate data to different networks that reside in different locations. In essence, what is needed is something that is fast, reliable and scalable. Something like Memcached, with redundancy and persistence added to it.
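One way to think about where copies of an object might live is rendezvous (highest-random-weight) hashing. To be clear, this is a technique I'm swapping in for illustration, not a design for the hive itself. The function names, the node names and the `copies` parameter in this Python sketch are all assumptions.

```python
import hashlib
from typing import List


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key: str, nodes: List[str], copies: int = 2) -> List[str]:
    """Pick the `copies` nodes with the highest weight for this key.

    Every client computes the same answer without asking anyone, and
    dropping a node that was not chosen leaves the placement unchanged.
    """
    ranked = sorted(nodes, key=lambda node: _score(node, key), reverse=True)
    return ranked[:copies]


hive_nodes = ["n1", "n2", "n3", "n4", "n5"]
# A heavily used object could simply be given more copies.
hot_placement = replicas_for("User:42", hive_nodes, copies=3)
cold_placement = replicas_for("User:42", hive_nodes, copies=1)
```

Raising `copies` for popular objects spreads their read load over more nodes, at the cost of more replication traffic on writes. It does not address replicating across networks in different locations.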

Some things to consider: how to search for specific data, and how to maintain and update lists (indexes) within the hive. How to manage scaling deeper as well as wider, with relays from one hive to another extending the data storage to deeper levels, in addition to wide hives. How to segment which data gets replicated and/or passed to which layered hives. X-Tree indexing of paths, perhaps, with X being the unknown, not a cool "version X". Whatever is used needs to have some dynamic redundancy to allow for multiple storage and query paths.

Most of these thoughts come from the fact that many large-scale sites are using things like memcached to store rendered content because the backend is too sluggish to keep up, instead of rethinking how to store things on the backend. You can add additional read-only databases for replication, but then you are replicating data in excess of what is needed. You can separate data into paired nodes, but then it's more difficult to get related data without a worse performance hit, and you lose the biggest benefits of RDBMS/SQL databases. You can pre-render content or objects into a caching layer, but then you lose the persistence and have to create fallbacks. I think the future is a better data interface that simply scales. When you need to do logging or write reports, have the data cache out to the RDBMS, instead of the other way around.
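Here is a rough Python sketch of that last idea, where the object store is the primary interface and the relational database only receives copies for reporting. The `WriteBehindStore` class, the `objects` table and its columns are all hypothetical, and SQLite (from the standard library) just stands in for whatever RDBMS you would actually report from.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (object type, identifier)


class WriteBehindStore:
    """Hypothetical object store that flushes changes out to SQL later."""

    def __init__(self, db: sqlite3.Connection) -> None:
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS objects ("
            "type TEXT, id TEXT, body TEXT, PRIMARY KEY (type, id))"
        )
        self.live: Dict[Key, Any] = {}  # what the front end reads and writes
        self.dirty: List[Key] = []      # keys changed since the last flush

    def put(self, obj_type: str, obj_id: str, value: Any) -> None:
        key = (obj_type, obj_id)
        self.live[key] = value
        if key not in self.dirty:
            self.dirty.append(key)

    def get(self, obj_type: str, obj_id: str) -> Optional[Any]:
        return self.live.get((obj_type, obj_id))

    def flush(self) -> int:
        """Copy changed objects to the RDBMS; return how many rows were written."""
        rows = [(t, i, json.dumps(self.live[(t, i)])) for t, i in self.dirty]
        with self.db:  # commits on success, rolls back on error
            self.db.executemany(
                "INSERT OR REPLACE INTO objects (type, id, body) VALUES (?, ?, ?)",
                rows,
            )
        self.dirty.clear()
        return len(rows)


store = WriteBehindStore(sqlite3.connect(":memory:"))
store.put("User", "42", {"name": "Ada"})
store.put("User", "42", {"name": "Ada", "visits": 2})
written = store.flush()
```

Reads and writes never wait on SQL here; only `flush` touches the database, and it could run on a timer or in a background worker. The obvious gap is that anything not yet flushed is lost if the process dies, which is where the hive's own redundancy and persistence would have to come in.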



Michael J. Ryan aka Tracker1

My name is Michael J. Ryan and I've been developing web-based applications since the mid-'90s.

I am an advanced Web UX developer with a near expert knowledge of JavaScript.