2007/01/15

no moore (part 2)

I was going to call it "the sequel"... :-)
Part 2 it is, then.


OK, so let's make a few things clear upfront:

1- Jim Gray's "Personal Petabyte" was used here as a metaphor to facilitate the discussion and increase awareness of the coming problem: the ELDB - as in Extremely Large Data Base. It doesn't mean that everyone will store all their videos somewhere! Think blurfl RFID data here for a concrete example of what I'm talking about - I'm amazed wikipedia doesn't know what a blurfl is...

2- The problem is not the random access speed of disks! The problem is that even with bulk transfer rates of 1GB/s - which you MIGHT get in another 10 years! - you will still need around 11 days to full scan 1PB (the back-of-the-envelope sum is sketched just after point 3). Therefore, any suggestion that brute force "full scans" will replace indexing and good design ignores the real problem. Which is one of speed of BULK data access. Not speed of random data access!

3- SSD addresses one problem only: the speed of random access. In other words: the seek times of traditional disks. No SSD so far has addressed the problem of bulk data transfer: they are all in the same ballpark as current disks when it comes to bulk. Yes, folks: 100MB/s and 10K IOPS are perfectly possible with any low cost SAN right now; SSD has improved nothing in that department! As such, thinking SSD when you're dealing with a problem of ELDB and the replacement of indexes by full scans is a perfect example of shooting oneself in the foot!
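And for the record, here is the back-of-the-envelope sum behind those 11 days in point 2, written as a query you can run against dual (assuming decimal units: 1PB = 10^15 bytes, 1GB/s = 10^9 bytes/s):

-- days needed to read 1PB sequentially at a sustained 1GB/s
select power(10,15) / power(10,9) / 86400 as days_to_full_scan
from dual;
-- roughly 11.6 days - and that assumes the channel never drops below 1GB/s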


And now I'll go even more extreme!


The problem of handling ELDB is not one of indexing or design either! Think about it: if you are struggling to full scan 1PB in 11 days at data transfer speeds that you MAY be achieving in another ten years, do you seriously believe you can properly index that sort of volume? And maintain that index?

Dream on, bro!

Just the CREATE INDEX will take you a full month. Throw in 15 indexes on the same table and the task of inserting a record becomes herculean. "Parallelize it", you may well say. So what? You still have to TRANSFER all that data somewhere else to create the index, dude! That is THE problem: the actual transfer!
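To make the point concrete, here is roughly what that parallel attempt would look like - a sketch only, with made-up table and column names:

create index blurfl_rfid_ix on blurfl_table (rfid_date)
  nologging parallel 64;
-- all 64 parallel slaves still have to read their slice of the 1PB table:
-- parallelism spreads the transfer across processes, it does not remove it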

No, indexing is not the solution either. Not at these volumes.



So, what is the solution?



Well, let's backtrack for a second here. How did we locate very large volumes of data when computers didn't exist?

Perfect example: libraries. How did they manage to organize themselves so they could find the precise location of a book amidst tens of thousands, without the help of so much as a calculator, let alone a computer?

They used ranking and catalogue systems to do so. Storage ranking as well: the physical placement of books onto specific shelves.

That is, they didn't just throw every book onto any old shelf and expect some magical "brute force" mechanism to find it next year.

Well, some of us do so... :-(

But, I digress...

Books were catalogued and sorted into precise locations according to precise rules when they were added to the library, and were therefore very easy to locate thereafter: all the client had to know was an author and/or the subject domain of the book and, bingo, up came the shelf or shelf unit where the book was. The "brute force" scan was reduced to a simple matter of looking up one book amid 300 or so, rather than one in the middle of 50000!

And if the librarian was a really good one, the books would be sorted by title within the shelf. Which made locating the book kid's play.


I know: I was one of the kids playing!


Anyone notice a parallel (pun intended!) here between ranking/cataloguing and what we now know as "partitioning"? Anyone also notice the similarity of the sorted shelf titles to local indexes in a partition?
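In today's terms, the shelf-and-sorted-titles picture looks roughly like this - a sketch only, with made-up names; each partition plays the part of a shelf unit and the local index is the per-shelf sort by title:

create table library_books (
  subject_area varchar2(30),
  author_name  varchar2(100),
  title        varchar2(200)
)
partition by list (subject_area) (
  partition p_history   values ('HISTORY'),
  partition p_databases values ('DATABASES'),
  partition p_fiction   values ('FICTION')
);

-- one small, sorted "title list" per shelf, not one giant one for the whole library
create index library_books_title_ix on library_books (title) local;

A query that names the subject only ever touches its own shelf - partition pruning - and the local index does the within-shelf lookup.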


Yes, folks. THAT is the solution I see. Sure, we need to evolve the SQL design and syntax itself, as well as the storage architecture, to make this possible. But that is a LOT easier than trying to subvert the limitations of physics, believe me!


There is no way with current or future disk or SSD hardware that we can handle PB databases unless we reduce the problem to small portions. Simple divide and conquer, really. Apply it to very large volumes and we've got it. Provided, of course, that we adjust our way of thinking to this brave new world.


How? Read on.


"OK, so we partition everything, Nuno. Have you thought of how long the CREATE TABLE statement will now become? Have you looked at the syntax for partitioning? Man, the CREATE TABLE section of the SQL manual is now half the size of the bloody book and you want to make it MORE complex? What you been smoking, dude?"


Well, no. Not really, folks. Notice that I said we need to evolve SQL design and syntax. Just like we need to evolve the hardware itself: it ain't gonna happen with what we know now!


Two-pronged attack. Disks first:


As far as making disks faster goes, be they SSD or conventional, the solution is very simple: if you can't bring Mohamed to the CPU/memory "mountain" in a usable time, then bring the mountain to Mohamed! Slap a Core 2 Duo - or whatever it will be in 10 years' time - with 100GB of memory on EACH disk!

Then, make it so that Oracle or DB2 or SQL Server can download a simplified meta-SQL interpreter - the output of the optimizer - into each of these very-large-capacity disks.


Now, what was that access problem again for bulk data? Yup, as an RDBMS kernel, you can now download an entire partition management subsystem into a single disk. And have that disk take care of managing the local indexes that handle the partition(s) residing on that disk! And all the other disk space paraphernalia needed to manage a single device with a capacity of 500TB. Anyone got the "Oracle ASM" bell ringing now? Yes: it's the start of that, folks.

With PMR, in another 10 years you'll be looking at a single disk of that capacity. Do you seriously think that the SAN and NAS makers out there - and their users and potential market - are gonna listen to BAARF and/or BAHD? Particularly when, as Kevin so well points out, the vast majority of data storage is going to be unstructured and outside databases?

No, folks: BAARF and BAHD have as much chance of getting anywhere as I have of becoming the president. That's why I am not a member of either.

Don't get me wrong! Both address a very real problem. It's the solution they propose that is, IMO, completely inadequate: the problem is not speed of random access. The problem is speed of bulk access and how to master the ELDB with the h/w and s/w that we will realistically get in the next 10 years!


OK, that was disks. Now, software.


As in: RDBMS. Because the alternatives - OODB and other such flights of fancy - are not really there, and have never even proven to be aware of the problem! Forget a working solution from those...

Clearly, what we need is a MAJOR simplification of the entire Oracle partitioning concept. No more 50 pages to describe all the nuances and subtleties of the partitioning syntax!

I want to type:
CREATE TABLE blurfl_table (
  dept        "datatype spec",
  blurfl_type "datatype spec",
  rfid_date   "datatype spec",
  etcetc...
)
aggregate on dept, blurfl_type, rfid_date;

Boom. Period. Sure: later on, I want to be able to indicate to the RDBMS engine that I want the physical aggregation for dept to go to my "wonder disks" 44, 45 and 46. And within that, I want the blurfl_type aggregation to be spread by individual value across as many partitions as needed to cope with the data spread. And so on.
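For contrast, the nearest thing today's (10g-era) syntax offers for that kind of placement is something like the sketch below - made-up tablespace names, and note that composite partitioning only gives you two aggregation levels, so blurfl_type misses out entirely:

create table blurfl_table (
  dept        varchar2(10),
  blurfl_type varchar2(10),
  rfid_date   date
)
partition by range (rfid_date)
subpartition by list (dept) (
  partition p_2019 values less than (to_date('2020-01-01','YYYY-MM-DD')) (
    subpartition p_2019_sales    values ('SALES')    tablespace ts_disk44,
    subpartition p_2019_shipping values ('SHIPPING') tablespace ts_disk45,
    subpartition p_2019_returns  values ('RETURNS')  tablespace ts_disk46
  )
);

Multiply that by a few hundred partitions and a dozen departments and you see why I want the one-line "aggregate on" instead.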

Or maybe not: I just give all those to ASM and it knows how to physically aggregate - the library shelf metaphor - among all the storage units it knows about. And automatically have a local index on that aggregation key stored in meta-data somewhere else, plus local indexes in those devices for all PKs, RIs and any other grouping constraints I might also specify with the table.

Now, I write my humble end-user yearly "mother-of-all-queries":

select
  dept,
  blurfl_type,
  trunc(rfid_date,'MONTH'),
  count(*)
from blurfl_table
where rfid_date >= to_date('2019-01-01','YYYY-MM-DD')
  and rfid_date <  to_date('2020-01-01','YYYY-MM-DD')
group by dept, blurfl_type, trunc(rfid_date,'MONTH');

or words to that effect...

and instead of it becoming the legacy query whose results only my grandchildren will see, the following happens:

1- The optimizer - FINALLY doing something smart! - will recognize from the syntax and the meta-data in the dictionary that I'm grouping on the exact aggregation keys of my data. It then reads the locations of the aggregations for those keys off the dictionary and hands them to the ASM layer, along with the model SELECT for each one of them.

2- The ASM layer, having previously loaded the SQL meta-interpreter into each of those intelligent 500TB storage units, then fires off the modified model SELECT to each one of them with instructions to aggregate all that data and come back with the result.

3- When the results reach the ASM, it further aggregates them, à la Materialized View, into a result set for the end-user query (a sketch of steps 2 and 3 follows this list).

4- The end user then gets the result within a reasonable time. Of course, end-users being what they are, the speed will still not be satisfactory. And they'll still loudly complain they could do it much faster in Excel and "Windows Panorama" on their MacTel desktop!
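Purely illustrative, here is what steps 2 and 3 might boil down to - none of this plumbing exists today, and the names follow the earlier sketches:

-- what each intelligent storage unit might be handed: aggregate its own slice only
select dept, blurfl_type, trunc(rfid_date,'MONTH') mth, count(*) cnt
from blurfl_table subpartition (p_2019_sales)   -- the slice living on this unit
where rfid_date >= to_date('2019-01-01','YYYY-MM-DD')
  and rfid_date <  to_date('2020-01-01','YYYY-MM-DD')
group by dept, blurfl_type, trunc(rfid_date,'MONTH');

-- what the ASM/kernel layer then does with the partial results sent back:
-- roll them up, à la materialized view (partial_results is hypothetical)
select dept, blurfl_type, mth, sum(cnt)
from partial_results
group by dept, blurfl_type, mth;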

But, what the heck:

We tried!...




Yes, yes: I know. There is a lot of "Steve Jobs presentation" in the above scenario.

But:

1- Have you had a close look at how 10gR2 manages the AWR partitioning, totally transparently to you? And how it partitions the performance data? (A quick way to see it for yourself is sketched after this list.)

2- Have you had a look at what ASM keeps as meta-data in its own instance?

3- Have you noticed the 11g "wish list" for range-range composite partitioning?
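For point 1, a quick way to see it with your own eyes on a 10gR2 instance is to list the partitions behind the WRH$ history tables (needs access to the DBA views, obviously):

select table_name, partition_name
from dba_tab_partitions
where table_owner = 'SYS'
  and table_name like 'WRH$%'
order by table_name, partition_name;
-- the kernel creates and splits these partitions all by itself:
-- not one line of partitioning DDL ever typed by the DBA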


Dudes, Dudettes: get ready! Because it's coming.


At light speed. And I can't wait to get my mitts on it!




(end of crystal ball gazing for another decade...)

6 Comments:

Anonymous Anonymous said...

Good stuff. Interesting, entertaining and contentious. I wish I could write blog postings like this!

Cheers,

Doug

Monday, January 15, 2007 8:58:00 pm  
Blogger Pete Scott said...

Noons,
You have spurred my Blog muse… I will tackle a random thought on this later.

Of course the idea of minimising IO channel use by moving some processing to the disk is already there with some DW appliance vendors. But I am not convinced that moving dual (or more) cores to the disk is the way to go. Maybe cool-running processors from the gaming market may be the thing - especially if they are predicated on graphics manipulation; after all, there are big similarities between manipulating a texture map in graphics rendering and mining a raw data set.

Monday, January 15, 2007 10:20:00 pm  
Anonymous Anonymous said...

"Anyone notice a parallel (pun intended!) here between ranking/cataloguing and what we now know as "partitioning"?"

Ah, libraries are a great analogy for partitioning. I think they're a case of composite range-list partitioning with dynamic partition pruning (partitioning on a key that is looked up in an external catalog).

"1- The optimizer - FINALLY doing something smart! - will recognize from the syntax and the meta-data in the dictionary that I'm grouping on the exact aggregation keys of my data."

Do-able with query rewrite at the moment, and with DBMS_Advanced_Rewrite you can make the CBO recognise the granularity of a table and avoid an unnecessary aggregation and group by.

http://oraclesponge.wordpress.com/2006/08/14/more-on-redundant-materialized-views/
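Something along these lines - just a sketch with made-up names, positional arguments, and check the PL/SQL Packages manual for the exact signature:

begin
  sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
    'blurfl_monthly_rollup',                                  -- equivalence name
    'select dept, count(*) from blurfl_table group by dept',  -- what users write
    'select dept, cnt from blurfl_dept_summary',              -- what actually runs
    false,                                                    -- skip validation for now
    'TEXT_MATCH');                                            -- rewrite mode
end;
/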

Good posting, Noons. Hope the boy Sam recovers well.

Tuesday, January 16, 2007 3:15:00 am  
Anonymous Anonymous said...

I was going to submit this to reddit and noticed that it already has been. Here's the reddit link. Upvote it if you're a member (or sign up) as this is some significant original opinion on an important matter.

Tuesday, January 16, 2007 3:58:00 am  
Blogger Noons said...

Doug:
but you do, mate. Wish I had time to share more and learn a lot more from the stuff you folks blog about!

Pete:
yeah, what I've done is examine trends based on existing technology, where real progress is happening, how, and what are the expectations. And extrapolated from there.

Core 2 will be very cheap in 10 years' time, almost replacing what nowadays is done with embedded 486 and Pentium systems. A few dozen GB of memory will be small change.

I can see the current specialized DW h/w evolving to "pluggable storage modules" by then, with what we call a SAN or NAS replaced by a single- or dual-spindle high-capacity disk and its associated "intelligence".

With a twist: we'll be able to download db "firmware" into it. Rather than rely on Jurassic disk firmware that dates back to cylinder/head/sector allocation: totally inadequate.

Dave: thanks for the stuff you sent. And thanks: Sam is fine and up and about now. Kids have an amazing capacity for recovery: I'd have been decked for a week with his injury!

Yeah, your blog and Doug's with all the parallel and DW stuff plus Kevin's is where I got a lot of the inspiration for these rantings. And my slowly increasing knowledge of the intricacies of partitioning.

IMHO, I was quite mild in some of the possible consequences of a db-aware disk and a smart optimizer.

I've tried to put as much into it as I can see at the moment but I fear it was not enough: the actual future in these areas will probably be much wilder than I can glimpse. Look at the progress in CBO with partitioning in the last 5 years, for example!

Give it another 10? No, I'm being too mild! Get Jonathan and Kevin together at this and they'd come up with stuff that'd blow these wussy rants out of the water.

Paul: not that significant, really. Just a bit of crystal gazing which I like to do every once in a while. It was due. I joined Oracle in 86: it should have been done in 2006 to commemorate the 20th. Too busy.

Tuesday, January 16, 2007 7:47:00 pm  
Anonymous Anonymous said...

This comment has been removed by a blog administrator.

Wednesday, April 11, 2007 4:49:00 am  
