2009/06/01

size sometimes does matter...

Not really. Well... You know what I mean. ;)

Anyways, sorry for the "catchy" title. This entry is prompted by some of the ongoing exercise in fine tuning our AIX db servers.

AIX is a great OS but the information needed to eek out the last ounce of performance for Oracle dbs is sparse, or across many documents. This blog entry is for my future reference on how to make Oracle use large pages in AIX. If it helps you, then even better!


One of the ways modern servers differ from older models is in the use of 64-bit memory and very large physical memory sizes.

However, traditional virtual memory uses 4KB physical page sizes in Unix. These pages have to be managed by the OS memory management code.

This is usually achieved through a mechanism referred to as address translation. In as simple terms as I can put it, the OS needs to know the status of every page making up its physical addressable memory in order to provide the virtual memory we all know and love.

Now, when we are talking about physical memory sizes of hundreds of Gigabytes, it is not hard to comprehend that managing those in terms of "chunks" - pages - of 4K bytes leads to a LOT of overhead.


Think of it this way: how many 4K pages fit into an addressable space of, say, 64GB?
I'll save you the time: in excess of 15 million entries.


It's not hard then to reason that some overhead will result from managing such large tables of memory pages.

This is typically indicated by excess CPU usage being noticed in kernel mode. Anything over 5% is usually a good indication that something may be amiss with memory management. Note that I said: "may be". It's not the only cause, but it's worth checking.

The problem is made worse by the simple fact all this activity is pre-emptive of any other processing. The OS simply cannot make CPU available to user processes until it has finished managing memory! A good indication of this type of event is when a relatively high kernel mode CPU usage is noted, together with unexplained wait time.

The important thing to note here is that very large numbers of pages can pose serious loads to an OS in terms of memory management. Add-in the occasional paging activity and/or the need to keep database buffer structures - such as the Oracle SGA - locked in memory and you have the recipe for serious memory management overhead.

So, what is the solution? We NEED those Gigabytes!

Many have been tried over the years. The most common is to increase the page size from the default of 4K to a much larger number. The bigger the page size, the less of them needed to represent very large physical memory, the less overhead in maintaining and managing said pages.

Simple arithmetic.

I like simple.


With AIX, the common use bigger page size is 16MB. These are commonly called "largepages". Large pages can coexist in memory with smaller ones: you don't need to re-define the entire adressable memory because not every process is a gigantic memory hog like Oracle - nothing wrong with that, it's the way dbs are supposed to operate, DB2 and others do exactly the same.

The number of largepages available in a system is a tunable parameter. In a nutshell, one reserves a large chunk of physical memory to be managed with large page algorithms. Then certain programs use that memory.

A simple yardstick for how much physical memory to reserve for large pages is to set aside 75% for their use. Say for example you have a 64GB system? Then reserve around 45GB for large pages.

If you are using 16MB page sizes, that is around 2800 pages. Instead of over 11 million 4KB pages. This is where the overhead of memory management gets a MAJOR reduction!


Another advantage of using large pages for databases is that by definition, they are not pageable. With Oracle, the SGA can be allocated using largepages and it will never get paged out. That has a distinct advantage in loaded systems: very large numbers of hard page faults in a busy db server are simply a no-no.

OK, so how do we make all this happen?

There are various documents produced by Oracle and IBM on this subject, at various times and various releases of the software. It took a while to distill from them what exactly needs to be done.

As usual, the level of information from Oracle on this subject can only be described as "sparse"...
Let's just say it is resumed to the usual: "lock sga in memory".

Yeah! Right...

Exactly HOW?

Well, setting the corresponding initialization parameter is a good start:

alter system set lock_sga=true scope=spfile;


But there is more. Much more...

Here is what you need to do for the following conditions:

AIX release: 5L, release 5.3
Oracle release: 10.2.0.3

(things might be different for other combinations of major releases, do some research on that. Metalink is a good starting point)

First, check out how the virtual memory is configured in your system:

$sudo vmo -a
among the resulting text, you will see something like this:
lgpg_regions = 0
lgpg_size = 0
lru_file_repage = 1
v_pinshm = 0


OK, we need to change these. Assuming the sizes I mentioned above, you need to tell AIX to set aside a certain amount of physical memory for large pages.

Now, BE CAREFUL!!!!

If you ask AIX to reserve those pages on a system that has a current instance of Oracle running, AIX will happily page out the 4K-page SGA while making up space for the large pages. With dire consequences for anything running in said SGA!

So: do NOT do this on a system where you are currently running a database!

Shut it down FIRST, to free up the memory for large pages!

Got it? Don't say I didn't warn you!

OK, let's then set the memory aside:
$ sudo vmo -p -o v_pinshm = 1 -o lru_file_repage=0
$ sudo vmo -p -o lgpg_size=16777216 -o lgpg_regions=2800

(you don't have to do all the changes in one line, multiple calls to vmo are OK)

Is that all you need to do? Narh, it couldn't be that simple, could it?

Unfortunately, a few other things are necessary...


First of all, we have to give the Oracle dba login the capabilities to lock memory and use large pages. This is how you do it, assuming your login is "oracle":

$sudo chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE oracle
and you can check it with:
$sudo lsuser -a capabilities oracle


Next, we enable the Oracle executable to use large pages. If you have re-linked Oracle in a system with largepages enabled, this is already done. If not, then just do it. AFAIK this cannot be checked, simply do it every time to be sure:

$cd $ORACLE_HOME/bin
$sudo ldedit -b lpdata oracle

After this, check that the oracle executable still has the required protection flags set. If not, then set them:

$chmod 6751 oracle


Next, we need to set an environment variable in the .profile of the oracle login:

export LDR_CNTRL=LARGE_PAGE_TEXT=Y@LARGE_PAGE_DATA=M


the above indicates to AIX that processes started by login "oracle" do indeed use largepages if they are available and usable. Do not use quotes around the string, just type it like above in the .profile file.
Yeah, I know it looks weird. Hey: blame IBM!

One final bit of tuning with vmo. If you are using asynchronous I/O - and who isn't? - you may want to make sure AIX reserves very little of its memory for file system cache and allows for very large sizes of pinned memory. This is how you do it:

$sudo vmo -p -o minperm%=5 -o maxperm%=90 -o maxpin%=80


And that's about it! (whew!)

You should see something similar to this on vmstat once you re-start oracle:

$ vmstat -l 30
System configuration: lcpu=8 mem=65536MB ent=2.00
kthr memory page faults cpu large-page
----- ----------- ------------------------ ------------ ----------------------- -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec alp flp
1 1 10873944 232145 0 0 0 0 0 0 610 2889 1997 5 2 84 9 0.14 7.1 2274 226


See those last two columns? In this case , I am using <40GB of large pages, with a little bit in reserve for the odd Oracle process I need to start locally - SQL*Plus, rman, whatever. The "flp" column (free large pages) will hover between 200 and 100 pages free all day.

That's fine, I like a little bit of margin left. Just in case...


Now before you take off doing this on all your AIX systems, some "sanity" disclaimers:

1- This is NOT gospel, I don't do religion. Examine your particular case, see if it makes sense to use large pages, then test all this first!

2- Large pages can be reserved dynamically, but they come out of total physical memory: you have to use the material between your ears to come up with a reasonable and workable number to start with. Don't take all of physical memory with large pages! And do it when not much else is running.

3- This is for AIX5L release 5.3, Oracle 10.2.0.3. Don't come back to me and whinge that it ain't working with Oracle 8.1.7/AIX 4.4 or Oracle11.1.0.7/AIX 6.2! Use your brains and do some research before going off on tangents!

In simple terms: the above is guideline information on how to go about this exercise.
It is NOT an exhaustive description of all nuances and combinations possible. You MUST use your knowledge of your systems and adapt it for your case.



Anyways, enjoy.

I'll get into some photos later, this entry is for my own future reference, I want to keep it technical.

Hopefully, it might save someone else a lot of searching for the solutions or a lot of external consultancy hours.

It's a pity this type of information is not easily available from Oracle and/or IBM, in a condensed form. I guess their professional services have to make a living?...


(Addendum from the comments section)


I made a reference to how Oracle uses AIX's Concurrent I/O (CIO) in one of the replies in the comments.

CIO is an option of JFS, available in the latest releases of AIX. It allows Oracle to bypass AIX buffer cache and use asynchronous I/O as well as avoiding single-threading on the Unix file system lock. Google it and you'll find plenty of details on what it does.

CIO is normally enabled at file system mount time, with the "cio" option.

With JFS2 - available from AIX 5 onwards - Oracle 10g and later releases can actually use a file-open flag that turns on CIO even if the file system was mounted without it. The flag symbolic name is O_CIO.

This is of course highly desirable: instead of having a whole file system in CIO mode, Oracle will only use it for its files, where it is of immediate benefit. Other non-Oracle files in that file system will benefit from buffered AIX I/O.

To signal AIX Oracle 10g onwards to use the O_CIO flag, you have to set the initialization parameter:

filesystemio_options = SETALL

Because this is a new feature, it is possible that other "features" (work with me here!...) may be present in the specific point release or patch level you are using.
10.2.0.3 doesn't have a problem but as Patty C. pointed out in the comments, 10.2.0.4 does - there is now a patch, check the reply.

How do you find out if Oracle is using the flag or not? Easy: trace the dbwr process.

How? Once again, easy:

(Warning: dba1.0 command line stuff coming up.
If you only use dba 2.0 GUI tools to manage your db, I'm afraid you're out of this one: keep clicking - and hoping all is well...)

Start the database in mount mode in a terminal session.

Then get the process id (with ps -ef|grep dbwr) of the dbwr process.

Use the process id to "truss" the dbwr process in a separate terminal session, while in the original you mount the database and then open it. Check the flags in the output of truss for each of the db files.

This is a typical output:

$truss -a -t open -p 12390536
open("/u31/oracle/oradata/dgprod/control01.ctl", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 14
...
open("/u31/oracle/oradata/dgprod/system01.dbf", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 16
...
open("/u31/oracle/oradata/dgprod/undotbs01.dbf", O_RDWR|O_CIO|O_DSYNC|O_LARGEFILE) = 17
...
etcetc.

As you can see, the O_CIO option is there in the open call and that's what opens files in CIO mode.

If you try this and notice (and become curious as to why) Oracle first opens all files in read-only mode before opening in read-write (O_RDWR) mode, the answer is:
it is how it checks the files are there! Any missing file will cause the open to fail and Oracle will not open the entire database.

Have fun.