[sf-lug] Dynamically Scalable Data?

Ernest De Leon edeleonjr at gmail.com
Fri Apr 18 11:14:51 PDT 2008

Let me preface this question by saying that I often have these insane
ideas which I then attempt to architect and flesh out in my mind, despite the
fact that I may never actually implement them (and may never have intended
to, even before thinking them through).  This is probably one of those...

So, I was thinking about what it would take to build a new search engine (a
la Google's) from the ground up.  I know there are open-source
projects out there designed to do exactly that, like Nutch and Hadoop, but I
was more concerned with the data warehousing and mining part of it.

Now, I have never been much of a programmer aside from Python scripts and
small applications here and there, so forgive me if I fumble something.
Let's assume that you have your main web server software designed and ready
to deploy on some sort of massively scalable infrastructure.  Let's then
assume that you take a snapshot of (loosely) all internet content, and that
becomes your data warehouse.  Your only task at this point is to take search
queries from your web servers, do the magic, so to speak, and return
results.  Here is where I have the questions.

We all know that Google uses commodity hardware and a proprietary
filesystem (GFS) to power their business.  Let's say, however, that I wanted
to take a different approach.  Let's say I was able to get a serious HPC
cluster (something from Sun or IBM) to do all of the crunching for this
system.  On the data side, let's say I got hold of several NAS units, each
able to hold all of the data independently (thus providing redundancy), and
that array-based replication kept these in sync.

I guess my question is more theoretical than anything.  How would the
dataset lie across the storage array?  Would certain pieces be requested
more than others (say, a viral picture or article, or new vs. old content)?
Would requests against the dataset flatten out over time (as you might
expect)?  How could you optimize the data storage, both at its initial
storage time and on the fly thereafter, based on each piece of data's
request frequency?  I would think these problems face anyone (including
Google), because redundancy does not necessarily equal availability, or
rather, may not scale to the level of availability necessary.  Something in
the data storage schema needs to be dynamically scalable; how would you
approach that?
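One common answer to the "how does the dataset lie across the array" part
is consistent hashing: each storage node owns many points on a hash ring,
and each document lands on the node owning the next point clockwise.
Adding or removing a node then only remaps a small slice of the keys,
which is exactly the dynamic-scalability property in question.  Here's a
rough Python sketch (the node names and the virtual-node count are made up
for illustration, not taken from any real system):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Map document keys to storage nodes so that adding or removing a
    node only remaps a small fraction of the keys."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes     # virtual points per physical node
        self._keys = []          # sorted hash points on the ring
        self._nodes = []         # node owning each point (parallel list)
        for node in nodes:
            self.add_node(node)

    def _hash(self, s):
        # stable, well-spread hash for placement (not for security)
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        # scatter the node's virtual points around the ring
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            idx = bisect(self._keys, h)
            self._keys.insert(idx, h)
            self._nodes.insert(idx, node)

    def node_for(self, key):
        # the first ring point at or after the key's hash owns it
        if not self._keys:
            raise ValueError("empty ring")
        idx = bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._nodes[idx]
```

With 100 virtual points per node, adding a fourth node should steal
roughly a quarter of the keys from the original three rather than
reshuffling everything, which is the property that makes rebalancing
across NAS units (or anything else) cheap.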

I believe this is primarily a software problem, and that the software must
tie intimately into the file system to solve it.

Ernest de Leon

"They who can give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety." - A common 18th Century sentiment
voiced by Benjamin Franklin

"A patriot must always be ready to defend his country against his
government." - Edward Abbey

"All that is necessary for evil to triumph is for good men to do nothing." -
Edmund Burke, English statesman and political philosopher (1729-1797)