2007/01/28

something new, something old

Sorry for the long break, folks. Been quite busy with family matters. Nothing major, just routine stuff that needed doing.


Well, there are going to be some changes in my professional life. I've thought long and hard about where I want to be in the next 5 years and reached a few conclusions.


One of them is that the managed services market in NSW is very slow, if not dying. For some reason that partly escapes me, companies who can afford this type of service insist on staying with very old versions of their software and systems. We've now got a spread of clients with everything from release 7 to release 9r2 of Oracle, and only a sprinkling of 10gr2.

This makes any kind of organized and structured work to address the management of these sites almost a waste of time. Have you ever tried to write storage monitoring code that can cope with both dictionary managed tablespaces and LMTs? And stats gathering and management routines that can span release 7 to 10?


Thought so...
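
Just to give you an idea of what I mean, here's a rough sketch of the kind of branching every monitoring script ends up with (this is from memory, so treat the column availability per release as an assumption and check your own data dictionary):

-- 8i and later: extent management is exposed directly
select tablespace_name, extent_management
from dba_tablespaces;

-- Oracle 7: no such column, so the same monitoring script has to branch
-- and fall back to the old storage parameters instead
select tablespace_name, initial_extent, next_extent, pct_increase
from dba_tablespaces;

Multiply that little branch by every storage, stats and backup routine you own and you'll see why it gets old very quickly.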


Yet clients on release 7 seem to expect us to extract the same type of info and do the same type of work that we can do with release 10g. As if there was no difference whatsoever between Oracle releases!


The general attitude seems to be: if Oracle says they can do all these things, how come you guys can't?

Well, duh? Because Oracle never says it outright, but they mean 10g. Whereas you Jurassics are still running 8iR3!

Well, I don't really say it that way, but I certainly mean it! Unbelievable how completely out of tune with reality some damagers can be...


Basically, after having done managed services on my own for over 13 years, then for someone else for over 5 years and finally for the current company for 7 months, I think it's time for a change.


I'd already tested the waters with the job I took two years ago at 24/7 Realmedia. That was one heck of a team, with one heck of a group of real IT geeks to work with! If it hadn't caused so much stress in my personal life with the uncanny hours required by a single-person worldwide support position, I'd still be there. A great company, with a great future. Pity they didn't want to invest in it properly.


I am now going back to a single-site environment. A very large Australian company contacted me late last year to lead their DBA team. They are in the process of bringing all their databases into Oracle, with 10gr2 and the latest application software. Some SQL Server in there as well, starting to get big. And they are expanding their sites into a worldwide capability with a central admin facility here in Sydney. Right up my alley, if you get my drift. And this time I'm being given the team, the environment and the tools to do the job. As well as more than adequate financial support.

So, as of the end of February I'll be starting in a new position. Which will be a challenge. But those were never a problem for me. Staleness is a much bigger problem, and that's what's happening to managed services in NSW. That's the main reason for this change now.



Anyways...



Thought I'd entertain you with some old stuff as well. Been scanning some of my old Kodachrome: the Coolscan 9000 ED is the first non-drum scanner I've seen that does a reasonable job with this film.

This is something you folks have probably never seen:



No it's not a trick, it's a black mushroom! We used to go fishing in a place called Catherine Hill Bay in the mid-80s. It was a disused coal mine and some of the coal seams showed right up against the sea edge. Anywhere the rain water could concentrate along these seams, these mushrooms would appear. It's obvious to me where they get that colour from. But I've never been able to properly identify these little fellas.

And this is the rock platform where we fished from.



Mostly spinning off the rocks, with a bit of bait fishing when the season was right. Young Gus - who is now a fine young man in his 20s - is holding some of the bonito we caught that fine morning. These were then cut into cubes and sent back out under a balloon with a big hook. When the big yellowfin tuna caught on to them, it was "hang on for your life" time!
Ah, good times...


catchyalata, folks

2007/01/15

no moore (part 2)

I was going to call it "the sequel"... :-)
Part 2 it is, then.


OK, so let's make a few things clear upfront:

1- Jim Gray's "Personal Petabyte" was used here as a metaphor to facilitate the discussion and increase awareness of the oncoming problem: the ELDB - as in Extremely Large Data Base. It doesn't mean that everyone is going to store all your videos somewhere! Think blurfl RFID data here for a concrete example of what I'm talking about - I'm amazed wikipedia doesn't know what a blurfl is...

2- The problem is not the random access speed of disks! The problem is that even with bulk transfer rates of 1GB/s - which you MIGHT get in another 10 years! - you will still need 11 days to full scan 1PB (see the quick arithmetic after this list)! Therefore, any suggestion that brute force "full scans" will replace indexing and good design ignores the real problem. Which is one of speed of BULK data access. Not speed of random data access!

3- SSD addresses one problem only: the speed of random access. In other words: the seek times of traditional disks. No SSD so far has addressed the problem of bulk data transfer: they are all in the same ballpark as current disks when it comes to bulk. Yes, folks: 100MB/s and 10K IOPS is perfectly possible with any low-cost SAN right now; SSD has improved nothing in that department! As such, thinking SSD when you're dealing with a problem of ELDB and replacement of indexes by full scans is actually a perfect example of shooting oneself in the foot!
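
For those who like to see the arithmetic rather than take my word for it, here's the back-of-the-envelope version (assuming decimal units, 1PB = 10**15 bytes):

-- seconds and days to pump 1PB through a sustained 1GB/s pipe
select 1e15/1e9 as seconds,
       (1e15/1e9)/86400 as days
from dual;

Roughly one million seconds, or about 11.6 days of non-stop transfer.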


And now I'll go even more extreme!


The problem of handling ELDB is not one of indexing or design either! Think about it: if you are struggling to full scan 1PB in 11 days at data transfer speeds that you MAY be achieving in another ten years, do you seriously believe you can properly index that sort of volume? And maintain that index?

Dream on, bro!

Just the CREATE INDEX will take you a full month. Throw in 15 indexes on the same table and the task of inserting a record becomes herculean. Yes, parallelize it, you may well say. So what? You still have to TRANSFER all that data somewhere else to create the index, dude! That is THE problem: the actual transfer!
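
Rough numbers, just to show I'm not exaggerating. Assume the build has to read the whole table once and then make, say, two more full read/write passes over the data to sort and write the index segments - a pure assumption on my part, real sort mechanics will vary:

-- one full scan plus two sort/merge passes over 1PB at 1GB/s, in days
select 3 * (1e15/1e9)/86400 as days
from dual;

Call it 35-odd days. A month, give or take, for ONE index.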

No, indexing is not the solution either. Not at these volumes.



So, what is the solution?



Well, let's backtrack for a second here. How did we locate very large volumes of data when computers didn't exist?

Perfect example: libraries. How did they manage to organize themselves so they could find the precise location of a book amidst tens of thousands, without the help of so much as a calculator let alone a computer?

They used ranking and catalogue systems to do so. Storage ranking as well.

Ie, they didn't just throw every book into any shelf and expect some magical "brute force" mechanism to find it, next year.

Well, some of us do so... :-(

But, I digress...

Books were catalogued and sorted into precise locations according to precise rules when they were added to the library and therefore were very easy to locate thereafter: all the client had to know was an author and/or the subject domain of the book, and bingo, up came the shelf or shelf unit where the book was. The "brute force" scan was reduced to a simple matter of looking up one book amid 300 or so, rather than one in the middle of 50000!

And if the librarian was a really good one, the books would be sorted out by title within the shelf. Which made locating the book kid's play.


I know: I was one of the kids playing!


Has anyone noticed a parallel (pun intended!) here between ranking/cataloguing and what we now know as "partitioning"? And the similarity of the sorted shelf titles to local indexes within a partition?
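
For the non-Oracle folks, here's a minimal sketch of that parallel in today's terms - the table and column names are made up, and this is nowhere near a complete example:

create table rfid_reads (
  dept        number,
  blurfl_type varchar2(30),
  rfid_date   date
)
partition by range (rfid_date)
(
  partition p2006 values less than (to_date('2007-01-01','YYYY-MM-DD')),
  partition p2007 values less than (to_date('2008-01-01','YYYY-MM-DD'))
);

-- the sorted "shelf": a local index, maintained per partition
create index rfid_reads_i on rfid_reads (dept, blurfl_type) local;

-- a query with the partition key in the predicate only ever touches
-- the one "shelf" it needs - the one-book-in-300 effect
select count(*)
from rfid_reads
where rfid_date >= to_date('2007-06-01','YYYY-MM-DD')
and   dept = 42;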


Yes, folks. THAT is the solution I see. Sure, we need to evolve the SQL design and syntax itself as well as the storage architecture to make this possible. But that is a LOT easier than trying to subvert limitations of physics, believe me!


There is no way with current or future disk or SSD hardware that we can handle PB databases unless we reduce the problem to small portions. Simple divide and conquer, really. Apply it to very large volumes and we've got it. Provided of course that we adjust our way of thinking to this brave new world.


How? Read on.


"OK, so we partition everything, Nuno. Have you thought of how long the CREATE TABLE statement will now become? Have you looked at the syntax for partitioning? Man, the CREATE TABLE section of the SQL manual is now half the size of the bloody book and you want to make it MORE complex? What you been smoking, dude?"


Well, no. Not really, folks. Notice that I said we need to evolve SQL design and syntax. Just like we need to evolve the hardware itself: it ain't gonna happen with what we know now!


Two-pronged attack. Disks first:


As far as making disks faster, be they SSD or conventional, the solution is very simple: if you can't bring Mohamed to the CPU/memory "mountain" in a usable time, then bring the mountain to Mohamed! Slap a Core 2 Duo - or whatever it will be in 10 years' time - with 100GB of memory, on EACH disk!

Then, make it so that Oracle or db2 or SQL Server can download a simplified meta-SQL interpreter - the output of the optimizer - into each of these very large capacity disks.


Now, what was that access problem again for bulk data? Yup, as an rdbms kernel, you can now download an entire partition management subsystem into a single disk. And have that disk take care of managing the local indexes that handle the partition(s) residing in that disk! And all the other disk space paraphernalia needed to manage a single device capacity of 500TB. Anyone got the "Oracle ASM" bell ringing now? Yes: it's the start of that, folks.

With PMR, in another 10 years you'll be looking at a single disk of that capacity. Do you seriously think that the SAN and NAS makers - and their users and potential market - out there are gonna listen to BAARF and/or BAHD? Particularly when, as Kevin so well points out, the vast majority of data storage is going to be unstructured and outside databases?

No, folks: BAARF and BAHD have as much chance of getting anywhere as I have of becoming the president. That's why I am not a member of either.

Don't get me wrong! Both address a very real problem. It's the solution they propose that is, IMO, completely inadequate: the problem is not speed of random access. The problem is speed of bulk access and how to master the ELDB with the h/w and s/w that realistically we will get in the next 10 years!


OK, that was disks. Now, software.


As in: RDBMS. Because the alternatives - OODB and other such flights of fancy - are not really there and have never even proven to be aware of the problem! Forget a working solution from these...

Clearly, what we need is a MAJOR simplification of the entire Oracle partitioning concept. No more 50 pages to describe all the nuances and subtleties of the partitioning syntax!

I want to type:
CREATE TABLE blurfl_table (
dept "datatype spec",
blurfl_type "datatype spec",
rfid_date "datatype spec",
etcetc...
.
.
.
)
aggregate on dept, blurfl_type, rfid_date;

Boom. Period. Sure: later on, I want to be able to indicate to the rdbms engine that I want the physical aggregation for dept to go to my "wonder disks" 44,45 and 46. And within that, I want the aggregation blurfl_type to be spread by individual value in as many partitions as needed to cope with the data spread. And so on.

Or maybe not: I just give all of those to ASM and it knows how to physically aggregate - the library shelf metaphor - among all the storage units it knows about. And automatically have a local index on that aggregation key stored in meta-data somewhere else, plus local indexes in those devices for all PKs, RIs and any other grouping constraints I might also specify with the table.

Now, I write my humble end-user yearly "mother-of-all-queries":

select
dept,
blurfl_type,
trunc(rfid_date,'MONTH'),
count(*)
from blurfl_table
where rfid_date >= to_date('2019-01-01','YYYY-MM-DD')
and rfid_date < to_date('2020-01-01','YYYY-MM-DD')
group by dept,blurfl_type,trunc(rfid_date,'MONTH');

or words to that effect...

and instead of it becoming the legacy query that my grandchildren will see the results of, the following happens:

1- The optimizer - FINALLY doing something smart! - will recognize from the syntax and the meta-data in the dictionary that I'm grouping on the exact aggregation keys of my data. It then reads the locations of the aggregations for those keys off the dictionary, slaps them on the ASM layer with the model SELECT for each one of them.

2- The ASM layer, having previously loaded the SQL meta-interpreter into each of those intelligent 500TB storage units, then fires off the modified model SELECT to each one of them with instructions to aggregate all that data and come back with the result (a sketch of what that might look like follows below).

3- When the results reach the ASM, it now further aggregates that, a-la Materialized View, into a result set for the end-user query.

4- Who then gets the result within a reasonable time. Of course, end-users being what they are, the speed will still not be satisfactory. And they'll still loudly complain they could do it much faster in Excel and "Windows Panorama" on their MacTel desktop!

But, what the heck:

We tried!...
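
To make steps 2 and 3 a bit more concrete, here's roughly what I imagine the two levels would look like - the object names are pure fantasy, obviously, since none of this exists yet:

-- the "model SELECT" each intelligent storage unit would run against
-- only the aggregation partitions it holds locally
select dept, blurfl_type, trunc(rfid_date,'MONTH') as mon, count(*) as cnt
from blurfl_table_local_slice
where rfid_date >= to_date('2019-01-01','YYYY-MM-DD')
and   rfid_date <  to_date('2020-01-01','YYYY-MM-DD')
group by dept, blurfl_type, trunc(rfid_date,'MONTH');

-- the ASM/coordinator layer then only has to roll up the partial counts
-- coming back from each unit, a-la materialized view
select dept, blurfl_type, mon, sum(cnt) as cnt
from all_unit_partials
group by dept, blurfl_type, mon;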




Yes, yes: I know. There is a lot of "Steve Jobs presentation" in the above scenario.

But:

1- have you had a close look at how 10gr2 manages the AWR partitioning totally transparently to you? And how it partitions the performance data? (A quick way to see it for yourself is sketched after this list.)

2- Have you had a look at what ASM keeps in its own instance of meta-data?

3- Have you noticed the 11g "wish list" for range-range composite partitioning?
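
On the first point, this is all it takes to have a peek (on 10gr2, logged in with access to the DBA views - the WRH$% name pattern is what I see on my own systems, so as usual verify on yours):

-- the AWR history tables SYS quietly range-partitions for you
select table_name, partition_name
from dba_tab_partitions
where table_owner = 'SYS'
and table_name like 'WRH$%'
order by table_name, partition_position;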


Dudes, Dudettes: get ready! Because it's coming.


At light speed. And I can't wait to get my mitts on it!




(end of crystal ball gazing for another decade...)

2007/01/11

no moore (part 1)...

A recent string of comments in Alex's blog at Pythian got me thinking about this whole subject of "where is database technology heading".

Sure: it's true that Moore's law has been valid for a long time.

Fact is: it ain't no more! I don't often use wikipedia as a reference as there is far too much plain incorrect content there. But in this case, it's spot-on. Read the whole entry, not just the definition!

There are huge consequences for IT from this. But before anyone looks into that, what of data management and databases?

Moore's law dealt with density of fabrication. Not speed, not capacity. And no, the three are not inter-dependent although they are related!

The undeniable fact is that processing speed - directly dependent on density of packaging due to propagation and temperature constraints - has been increasing exponentially.

The other undeniable fact is that disk storage access speed has NOT been increasing at anywhere near the same rate. We now have disk storage systems that can sustain 100MB/s IO rates at best. Ignore buffer cache speeds - those are not sustainable.

That is hardly fantastically more than the 10MB/s of 1996 and nowhere near the exponential growth rates of other electronic devices.

Those with an electrical engineering background will know perfectly well why this is so: the physics of transmission lines and their limits in speed have been understood for many, many decades. In fact the main reason one needs to reduce the size of CPU components in order to jack up the clock frequency is PRECISELY these limits!



So, for databases and more generally data management, where to then?


Because undeniably, there is a need for this technology to progress at a rate that can keep all that speed-hungry CPU and processing capacity, created by the consequences of Moore's law in the last two decades, supplied with data.

There are those who propose that with the improvements in CPU speed, there will be less need for adequate indexing and optimisation of access methods: an approach based on simple brute force "full scans" of solid state disk - SSD - storage will suffice; the hardware will be capable of returning results in usable time.

Anyone interested in this should check out the research carried out by Jim Gray of Microsoft. In one of his papers he proposes that the entire life history of a human being can be contained in a Petabyte (PB) of data - he calls it the "Personal Petabyte". Read it, it's very interesting and I believe he is 100% right. We are heading fast into a world where it will be possible to store that PB about a person in a finite storage element!

Any future marketing exercises wanting to address a given population had better be prepared to digest this sort of volume of information, in a usable time! Because it will happen, very soon, scaled out to the size of their audience. Yes, Virginia: we're talking Exabytes - EB - here!

OK, so let's be incredibly optimistic and assume for a second that we'll see the same rate of growth for data access speed in the next 10 years. Yeah, that brings us to 1GB/sec by 2017, doesn't it?

Hang on, didn't good old Jim say 1PB? Well, that's 1 million (10**6) of these GB, boys and girls! Think you can fast scan 1PB at these rates? Yeah, that's 1 million seconds, or approximately 11 days of solid data transfer to go through it. Don't forget as well that your marketing campaign will be remarkably ineffective if all you can look at is the data of one client... and that you might as well be prepared to multiply all those times by 1000, as you may well be dealing with EB volumes, not PB!
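
For the arithmetically inclined, that factor of 1000 looks like this (decimal units, 1EB = 10**18 bytes, and a sustained 1GB/s assumed throughout):

-- 1EB through a 1GB/s pipe, in years
select (1e18/1e9)/86400/365 as years
from dual;

Call it 31-odd years. Good luck with that marketing campaign.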


'scuse the French folks, but "fast scan" my arse!


Of course: one can argue that disks will soon become solid state and much faster. That, is so not the problem. The problem was NEVER the speed of the disk but how fast that data can travel to the CPU and its memory!

That path is what runs at 100MB/s now, ran at 10MB/s 10 years ago, and 10 years from now - with luck - will run at 1GB/s!

No, clearly we'll need MUCH MORE sophisticated indexing and cataloguing techniques to manage and derive any value out of these huge amounts of data.

"fast scans" and/or "full scans" are so not there, it's not even worth thinking about them!


So, what will these techniques be, you ask?


I'll blog about what I think they'll be very soon.


catchyalata, folks