Christian Bilien's Oracle performance and tuning blog

Memory partitioning strategy: avoiding design traps on high end HP-UX systems (1/2)


I already mention how important logical I/O (see “Why you should not underestimate the LIO impact on CPU load” ), knowing that most data base systems need much more CPU to access memory than to execute actual code.

Like most high end servers, HP-UX servers use cells (domains in the Sun Solaris world), where CPU access to local memory access is must faster than to pages outside the cell memory scope. This is the behaviour known as Cache Coherent Non-Uniform Memory Access or ccNUMA.

To reduce wait time in run queues of busy CPUs (see “HP-UX processor load balancing on SMPs”), the system scheduler can decide to move threads to other CPUs on any cell in the same nPar; data interleaved memory can be fragmented among different cells; therefore, a thread has about the same chance of its CPU and data being on the same cell as it does of being on different cells. Different threads of the same process could have different memory reference times to the same portion of a data object, and different parts of a data object can have different memory reference times for the same thread.

Starting in HP-UX 11i v2, memory on a cell or across an entire nPar can be identified as interleaved (the default) or cell-local (non-interleaved). Both can be identified as quantities or percentages at nPar creation time or after creation with a modification and reboot.

Crossbar latency is really what ccNUMA is about on HP servers. When a CPU and memory are on the same cell, crossbar latency is null. Crossbar latency is at its lowest when the CPU and the memory being accessed are on different cells that share the same crossbar port. There is additional latency between cells in the same quad but different cell ports. The worst case is being between cell cabinets on a Superdome.

According to HP figures, memory latency (transfer time between memory and CPU), is 185ns on an sx2000 chipset running Itanium 2 Montecito CPU when memory access is local, or when interleaved with 4 or 8 cores on a single cell. The worst case (crossing cabinets) brings memory latency down to a whopping 397ns (64 cores interleaved).

The second post will consider Cell-local vs Interleaved memory.