Christian Bilien’s Oracle performance and tuning blog

March 25, 2007

Memory partitioning strategy: avoiding design traps on high end HP-UX systems (1/2)

Filed under: HP-UX — christianbilien @ 5:41 pm

I have already mentioned how important logical I/O is (see “Why you should not underestimate the LIO impact on CPU load”), knowing that most database systems need much more CPU to access memory than to execute actual code.

Like most high end servers, HP-UX servers use cells (domains in the Sun Solaris world), where CPU access to local memory is much faster than access to pages outside the cell memory scope. This behaviour is known as Cache Coherent Non-Uniform Memory Access, or ccNUMA.

To reduce wait time in the run queues of busy CPUs (see “HP-UX processor load balancing on SMPs”), the system scheduler can decide to move threads to other CPUs on any cell in the same nPar. Data held in interleaved memory can also be spread among different cells. A thread therefore has about the same chance of its CPU and data being on the same cell as it does of them being on different cells. Different threads of the same process can have different memory reference times to the same portion of a data object, and different parts of a data object can have different memory reference times for the same thread.

Starting in HP-UX 11i v2, memory on a cell or across an entire nPar can be identified as interleaved (the default) or cell-local (non-interleaved). Both can be specified as quantities or percentages, either at nPar creation time or afterwards with a modification and a reboot.

Crossbar latency is really what ccNUMA is about on HP servers. When a CPU and the memory it accesses are on the same cell, there is no crossbar latency. When the crossbar is involved, latency is lowest if the CPU and the memory being accessed are on different cells that share the same crossbar port. There is additional latency between cells in the same quad but on different cell ports. The worst case is an access that crosses cabinets on a Superdome.

According to HP figures, memory latency (transfer time between memory and CPU) is 185ns on an sx2000 chipset running Itanium 2 Montecito CPUs when memory access is local, or when memory is interleaved with 4 or 8 cores on a single cell. The worst case (crossing cabinets) brings memory latency up to a whopping 397ns (64 cores interleaved).
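To get a feel for what these two figures mean for a workload, here is a minimal sketch that blends them into an average latency. It assumes a deliberately crude two-point model (every access is either local at 185ns or worst case at 397ns), which ignores the intermediate crossbar hops described above. The function names and the `local_fraction` parameter are mine, not HP's.

```python
# Two-point latency model built on the HP figures quoted in the post.
# Assumption: an access is either fully local or worst case (cross-cabinet);
# real systems have intermediate latencies that this sketch ignores.

LOCAL_NS = 185.0   # local access, sx2000 chipset with Montecito CPUs
WORST_NS = 397.0   # 64 cores interleaved, crossing cabinets


def average_latency_ns(local_fraction, local_ns=LOCAL_NS, remote_ns=WORST_NS):
    """Weighted average latency for a given fraction of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def slowdown(local_fraction):
    """How many times slower than an all-local workload the average access is."""
    return average_latency_ns(local_fraction) / LOCAL_NS


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, average_latency_ns(fraction), round(slowdown(fraction), 2))
```

Under this model the worst case is a little more than twice as slow as a local access, which is why the placement of memory relative to the CPU matters so much.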

The second post will consider Cell-local vs Interleaved memory.

 

March 5, 2007

HP-UX Processor Load Balancing on SMPs (2/2)

Filed under: HP-UX — christianbilien @ 3:18 pm

The first post I wrote on CPU scheduling described the circumstances in which thread stealing would be considered. This post goes further down the road: once there is enough CPU idleness, or all CPUs have starving threads, context switches may occur. It describes the rules enforced by the HP-UX scheduler in versions 11.11 and later.

Before digging into the subject, locality domains (LDOM) should be explained. A locality domain is basically a cell. A cell is made of four single or dual core processors, as well as its own memory. Locality domains exist because of the inter-cell bus latency, which may greatly impact memory access time. I’ll write a post one day about HP-UX partitioning which will go into a bit more detail.

  • A mundane_balance() iteration is run within each LDOM. Each processor is assigned a score based on load average AND starvation (remember from post 1 that starvation occurs when a thread assigned to a given processor has not been running for ‘a long time’). According to the HP system internals course, starvation is given more importance than load average, which makes sense as a cpu hog will be able to run 80 to 100ms before giving up the processor to another thread. In any case, an idle processor is always one of the best processors.
  • In 11.22, the locality domain balancer is called to potentially move a thread from one domain to another.

The outcome is a pair of “best” and “worst” processors. If both processors in the pair are lightly loaded, with a load average of less than 0.2, the system is considered to be well balanced and nothing is done. Otherwise, a thread from the “worst” processor is selected (it must not be a real-time or locked thread), removed from its run queue and inserted into the run queue of the “best” processor.
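The pass described above can be sketched roughly as follows. This is only an illustration of the rules as stated in this post, not HP-UX code: the `Cpu` class, the `score()` formula and the `STARVATION_WEIGHT` constant are hypothetical, and the thread picked is simply the first movable one on the run queue.

```python
from dataclasses import dataclass, field

# Hypothetical weight: the post only says starvation counts more than load average.
STARVATION_WEIGHT = 10.0


@dataclass
class Cpu:
    name: str
    load_average: float
    starving: bool = False
    # Each entry is (thread_name, movable); real-time or locked threads are not movable.
    run_queue: list = field(default_factory=list)


def score(cpu):
    """Higher score means a worse processor; an idle CPU scores 0."""
    return cpu.load_average + (STARVATION_WEIGHT if cpu.starving else 0.0)


def balance_once(cpus, threshold=0.2):
    """Move one movable thread from the worst CPU to the best, if warranted."""
    best = min(cpus, key=score)
    worst = max(cpus, key=score)
    if best is worst:
        return None
    # Both processors lightly loaded: the system is considered balanced.
    if best.load_average < threshold and worst.load_average < threshold:
        return None
    for index, (thread, movable) in enumerate(worst.run_queue):
        if movable:
            worst.run_queue.pop(index)
            best.run_queue.append((thread, movable))
            return thread, worst.name, best.name
    return None  # only real-time or locked threads were queued


if __name__ == "__main__":
    busy = Cpu("cpu0", 1.5, run_queue=[("rt", False), ("t1", True)])
    idle = Cpu("cpu1", 0.0)
    print(balance_once([busy, idle]))
```

Keeping the score as a single number makes it easy to pick the pair with `min` and `max`. The real scheduler almost certainly weighs load and starvation differently.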

How is this “next” thread selected?

The selection is based on the virtual address of the kthread structure: the purpose of the algorithm is to ensure each thread is cycled through.
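One way to read this is as a round robin over addresses. The sketch below is my assumption of how such a cycle could work, not the actual HP-UX algorithm. It remembers the address of the last thread moved and picks the next higher address, wrapping around at the end. The name `pick_next` and the idea of a remembered `last_picked` address are hypothetical.

```python
def pick_next(addresses, last_picked=None):
    """Pick the lowest address above last_picked, wrapping to the lowest one.

    addresses stands in for the virtual addresses of the kthread structures
    queued on the "worst" processor (hypothetical representation).
    """
    if not addresses:
        return None
    ordered = sorted(addresses)
    if last_picked is not None:
        for address in ordered:
            if address > last_picked:
                return address
    # Nothing above last_picked (or first call): start again from the bottom.
    return ordered[0]


if __name__ == "__main__":
    queue = [0x3000, 0x1000, 0x2000]
    last = None
    for _ in range(4):
        last = pick_next(queue, last)
        print(hex(last))
```

Because the order depends only on addresses and not on thread behaviour, every thread eventually gets its turn, which matches the stated purpose.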

March 4, 2007

HP-UX Processor Load Balancing on SMPs (1/2)

Filed under: HP-UX — christianbilien @ 9:33 pm

Processor cache hits (data and instruction) are extremely important for performance. The TLB (translation lookaside buffer) is one of the most important CPU components as far as performance is concerned: it is a cache for the virtual to physical address translation process (the HP-UX Page Directory). As often shown, CPU accounts for a large part of SQL call response times, and most of that CPU time is memory access time. As threads move from one processor to the other, cache lines must be invalidated (if processor N°2 has to update a line loaded in the data cache of processor N°1), or at best reloaded when read access is required by both processors. I’ll write a post in the future about the ins and outs of processor affinity. For now, I am interested in the rules that govern CPU switches, and in understanding what triggers them.

I’ll only consider versions more recent than HP-UX 11.11.

The routine that does the load balancing is named mundane_balance(). This routine schedules itself into the timeout mechanism to be awakened once a second. It runs as an interrupt service routine rather than within the context of another process. Thus it cannot be interrupted by some other event. Nor can it be starved by other real-time threads (it was called from stat_daemon() before 11.11).

A processor is in a state of starvation if it has one or more threads on its run queue that haven’t executed for a long time (this “long time” varies with the CPU load). Only if no processor is suffering from starvation, or if all processors have starving threads (or could be forced into that condition), does HP-UX consider balancing (see next post: HP-UX Processor Load Balancing on SMPs).
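The precondition can be written as a one-line predicate. This is a minimal sketch of the rule as stated here. The function name is mine, and the “could be forced into that condition” case is not modelled.

```python
def balancing_considered(starving_flags):
    """Return True when load balancing would be considered.

    starving_flags holds one boolean per processor, True if that processor
    has starving threads. Balancing is considered only when no processor is
    starving or when all of them are.
    """
    flags = list(starving_flags)
    return not any(flags) or all(flags)


if __name__ == "__main__":
    print(balancing_considered([False, False, False]))  # no starvation anywhere
    print(balancing_considered([True, True, True]))     # every CPU is starving
    print(balancing_considered([True, False, False]))   # mixed state
```

In the mixed case the scheduler first has to deal with the starving processors, which is the thread stealing situation rather than ordinary balancing.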
