Christian Bilien’s Oracle performance and tuning blog

March 25, 2007

HP-UX vpar memory: granule size matters

Filed under: HP-UX — christianbilien @ 6:55 pm

Memory is normally assigned to vPars in units called granules (although the vparcreate/vparmodify commands specify memory in multiples of 1MB, the vPar monitor will round up to the next multiple of granule size). As the granule size is specified when the vPar database is created and can not be changed without recreating the virtual partitions database, care must be taken to choose an appropriate granule size when the first vpar is created. Since this is a fairly complex subject, I thought the rules would deserve a note.


Each vPar will require one ILM granule below 2GB to load its kernel. vpmon uses one granule below 2GB. Therefore (2GB ÷ granule size) -1 = maximum number of vPars. For example, 7 is the maximum number of vPars for an ILM granule size of 256MB (2GB ÷256MB -1 = 7).

Integrity (Itanium)

There is a platform dependent maximum to the number of granules of CLM/cell and of ILM per nPar. These values can be displayed using the vparenv command. Remember that Memory Size ÷ Granule size <= maximum # of granules.



# vparenv

vparenv: The next boot mode setting is “vPars”.
vparenv: The ILM granule size setting is 128.
vparenv: The CLM granule size setting is 128.
vparenv: Note: Any changes in the above settings will become effective only after the next system reboot.
vparenv: Note: The maximum possible CLM granules per cell is 512.
vparenv: Note: The maximum possible ILM granules for this system is 1024

Given the values above the total amount of CLM per cell must be less than 64GB (512 * 128MB) and the total amount of ILM in the nPar must be less than 128GB (1024*128 MB) .

Matching firmware and vpar granule size (Integrity only)

On Integrity systems the memory is divided into granules by the firmware. It is critical that the firmware value for the granule size matches the size in the vPars database. You can examine and modify the firmware setting using the vparenv command. For PA-RISC systems the memory is divided by granules by the monitor and there is no firmware setting. . You can ensure the firmware is updated with the same size as the database by specifying the y option: : vparcreate -g ilm:Mbytes:y –g clm:Mbytes:y. I am not sure what use can be made of diverging granule sizes.


Memory partitioning strategy: avoiding design traps on high end HP-UX systems. CLM and ILM (2/2)

Filed under: HP-UX — christianbilien @ 6:12 pm



As seen in the first post on this topic , since HP-UXiv2, and only when cells are dual-core capable (PA-RISC or Itanium 2), it is possible to identify memory on a cell or across an nPar as noninterleaved. This is called Cell-Local Memory, or CLM. CLM can be configured as a quantity or percentage of an individual cell’s memory, or a quantity or percentage of the memory across the entire nPar. Interleaved memory (ILM) is used when a portion of memory is taken from cells of the system and is mixed together in a round robin fashion. With processors on various cells accessing interleaved memory the average access time will be uniform. In 11i v1 all memory is designated as ILM.

The designation of memory as ILM vs. CLM is done at the nPar level (parcreate or parmodify). You can then allocate it to one or more of your vPars (vparcreate or vparmodify).

Cell local memory (CLM) can still be accessed by any processor, but processors on the same cell will have the lowest access latency. Access by processors in other cells will have higher latencies. It is always better to use ILM than accessing CLM configured in another cell.Note that CLM can be used to handle the case when there is an uneven amount of memory in the cells: the delta would be configured as CLM.


CLM and Processor Sets (Psets) can be used together to avoid the inconsistencies of ccNUMA almost entirely. In this context, locality domain (ldom) is defined as the CPUs and memory required to run a thread. A Pset is a logical grouping of CPUs, a CPU partition so to speak. Oracle processes bound to a given Pset get thread run time only on the CPUs assigned to the given Pset. ccNUMA is eliminated because the data and CPUs are on the same cell or ldom.

Memory partitioning strategy: avoiding design traps on high end HP-UX systems (1/2)

Filed under: HP-UX — christianbilien @ 5:41 pm

I already mention how important logical I/O (see “Why you should not underestimate the LIO impact on CPU load” ), knowing that most data base systems need much more CPU to access memory than to execute actual code.

Like most high end servers, HP-UX servers use cells (domains in the Sun Solaris world), where CPU access to local memory access is must faster than to pages outside the cell memory scope. This is the behaviour known as Cache Coherent Non-Uniform Memory Access or ccNUMA.

To reduce wait time in run queues of busy CPUs (see “HP-UX processor load balancing on SMPs”), the system scheduler can decide to move threads to other CPUs on any cell in the same nPar; data interleaved memory can be fragmented among different cells; therefore, a thread has about the same chance of its CPU and data being on the same cell as it does of being on different cells. Different threads of the same process could have different memory reference times to the same portion of a data object, and different parts of a data object can have different memory reference times for the same thread.

Starting in HP-UX 11i v2, memory on a cell or across an entire nPar can be identified as interleaved (the default) or cell-local (non-interleaved). Both can be identified as quantities or percentages at nPar creation time or after creation with a modification and reboot.

Crossbar latency is really what ccNUMA is about on HP servers. When a CPU and memory are on the same cell, crossbar latency is null. Crossbar latency is at its lowest when the CPU and the memory being accessed are on different cells that share the same crossbar port. There is additional latency between cells in the same quad but different cell ports. The worst case is being between cell cabinets on a Superdome.

According to HP figures, memory latency (transfer time between memory and CPU), is 185ns on an sx2000 chipset running Itanium 2 Montecito CPU when memory access is local, or when interleaved with 4 or 8 cores on a single cell. The worst case (crossing cabinets) brings memory latency down to a whopping 397ns (64 cores interleaved).

The second post will consider Cell-local vs Interleaved memory.


Blog at