[Beowulf] Re: dual core (latency)

Stuart Midgley sdm900 at gmail.com
Mon Jul 18 20:17:17 EDT 2005

The numactl tools won't generally help latency.  Latency isn't the  
issue with Opteron based systems (or any system with multiply  
connected distributed memory controllers).

The real issue is page locality (which is the case with most numa  
based systems).

If you run 2 processes on a dual cpu (single core) system and they
both happen to allocate their pages on the same memory controller,
they will each only see 1/2 the memory bandwidth and 1 controller
sits idle.  That's the real issue (and the extreme pathological case).
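
To put numbers on that, here is a minimal sketch of the arithmetic in C.
It assumes ideal, even sharing of a controller between the processes
using it and ignores the cost of remote accesses; the function names and
the 6.4 GB/s figure in the note below are illustrative assumptions, not
measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth seen by each process when `procs` processes stream from the
 * same memory controller. Assumes ideal, even sharing (an idealisation). */
double per_process_bandwidth(double controller_bw, int procs)
{
    assert(procs > 0);
    return controller_bw / procs;
}

/* Aggregate bandwidth in use: a controller contributes only if at least
 * one process has its pages there; an unused controller sits idle. */
double aggregate_bandwidth(double controller_bw,
                           const int *procs_per_controller,
                           size_t n_controllers)
{
    double total = 0.0;
    for (size_t i = 0; i < n_controllers; i++) {
        if (procs_per_controller[i] > 0)
            total += controller_bw;
    }
    return total;
}
```

With a hypothetical 6.4 GB/s per controller, two processes on the same
controller get 3.2 GB/s each and the box delivers 6.4 GB/s in total.
With one process per controller each gets 6.4 GB/s and the total is
12.8 GB/s.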

Linux 2.6 generally does a good job of putting the pages on the memory
controller attached to the cpu that the process is running on.  However,
it can't get it perfect.  There is always more than 1 process/cpu on
a system, so there is always a little noise and always the
chance that some pages get spread around.  Also, the system buffer
cache will get spread around, affecting everyone.

Add into the mix the possibility of suspending processes and you can
end up with a process's pages all over the place.  Since Linux
doesn't yet have page migration, once a page is allocated it won't be
moved to a different memory controller unless it is swapped out.
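
One way to see how scattered a process's pages are, on kernels that
provide /proc/<pid>/numa_maps, is to add up the per-node page counts
that file reports as N<node>=<pages> fields.  The sketch below does that
in C; the helper names and the MAX_NODES limit are assumptions made for
the example, and it is not something the numactl package ships.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 64   /* arbitrary upper bound chosen for this sketch */

/* Add the per-node page counts found in one numa_maps line to pages[].
 * Fields look like "N0=123". Returns the number of such fields parsed. */
int add_node_pages(const char *line, unsigned long pages[MAX_NODES])
{
    int found = 0;
    const char *p = line;

    assert(line != NULL && pages != NULL);
    while ((p = strchr(p, 'N')) != NULL) {
        /* Only accept an 'N' that starts a space-separated field. */
        if (p == line || p[-1] == ' ') {
            unsigned node;
            unsigned long count;
            if (sscanf(p, "N%u=%lu", &node, &count) == 2 &&
                node < MAX_NODES) {
                pages[node] += count;
                found++;
            }
        }
        p++;
    }
    return found;
}

/* Sum the page counts per node over a whole numa_maps-style file.
 * Returns 0 on success, -1 if the file cannot be opened. Lines longer
 * than the buffer are read in pieces, so a field split across two
 * pieces would be missed. */
int count_pages_by_node(const char *path, unsigned long pages[MAX_NODES])
{
    char line[4096];
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    memset(pages, 0, MAX_NODES * sizeof pages[0]);
    while (fgets(line, sizeof line, f) != NULL)
        add_node_pages(line, pages);
    fclose(f);
    return 0;
}
```

Calling count_pages_by_node("/proc/self/numa_maps", pages) from inside
the code being checked fills pages[] with one total per node.  A process
that is well placed should have nearly all of its pages on one node.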

With the numactl tools you can force the pages to be allocated on the
right memory/cpu.  The process's buffer cache will also be locked
down (which is another VERY important issue)...
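
As a rough illustration of forcing placement, here is a sketch of a
small C launcher that re-executes a program under numactl with both its
cpus and its memory bound to one node.  The --cpunodebind and --membind
options are those of current numactl releases (older releases may spell
the cpu option differently); the function names, MAX_ARGS and the
buffer sizes are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NODE_ARG_LEN 32   /* room for "--cpunodebind=" plus an int */
#define MAX_ARGS     60   /* arbitrary limit for this sketch */

/* Build argv for:
 *   numactl --cpunodebind=<node> --membind=<node> <prog> [args...]
 * `out` needs room for prog_argc + 4 pointers. cpu_arg and mem_arg are
 * caller-owned buffers that must outlive `out`. Returns the number of
 * entries written, not counting the terminating NULL. */
int build_numactl_argv(int node, char *const prog_argv[], int prog_argc,
                       char cpu_arg[NODE_ARG_LEN],
                       char mem_arg[NODE_ARG_LEN], char *out[])
{
    int n = 0;

    assert(node >= 0 && prog_argc >= 0);
    snprintf(cpu_arg, NODE_ARG_LEN, "--cpunodebind=%d", node);
    snprintf(mem_arg, NODE_ARG_LEN, "--membind=%d", node);
    out[n++] = "numactl";
    out[n++] = cpu_arg;
    out[n++] = mem_arg;
    for (int i = 0; i < prog_argc; i++)
        out[n++] = prog_argv[i];
    out[n] = NULL;
    return n;
}

/* Replace the current process with the program bound to `node`.
 * Returns -1 on bad arguments or if numactl cannot be executed;
 * does not return on success. */
int exec_on_node(int node, char *const prog_argv[], int prog_argc)
{
    char cpu_arg[NODE_ARG_LEN], mem_arg[NODE_ARG_LEN];
    char *argv[MAX_ARGS + 4];

    if (node < 0 || prog_argc < 1 || prog_argc > MAX_ARGS)
        return -1;
    if (prog_argv[0] == NULL || strlen(prog_argv[0]) == 0)
        return -1;
    build_numactl_argv(node, prog_argv, prog_argc, cpu_arg, mem_arg, argv);
    execvp("numactl", argv);
    return -1;
}
```

A job launcher could call exec_on_node() with a different node for each
process it starts, so that two processes never end up sharing one
memory controller while another sits idle.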

I have used the numa tools to double the performance of some codes (or
perhaps it's more correct to say to get back to the correct performance).


On 18/07/2005, at 22:38, Vincent Diepeveen wrote:

> I've been toying some with the numactl at dual core and it doesn't
> really seem to help much. It helps 0.00
> System: Ubuntu at a quad opteron dual core 1.8Ghz  2.6.10-5 smp  
> kernel.
> Latencies as measured by my own program (TLB trashing read of 8 bytes,
> each cpu 250MB buffer):
> #cpu latency
> 1   144-147 ns
> 2   174 ns
> 4   206 ns
> 8   234 ns
> That single cpu figure is pretty ugly bad if i may say so.
> All kind of numa calls just didn't help a thing. I've tried for  
> example:
>   if(numa_available() < 0 ) {
>     setitnuma = 0;
>   }
>   else {
>     int i,back;
>     nodemask_t nt,n2,rnm;
>     maxnodes = numa_max_node()+1; // () returns 3 when 4 controllers
>     printf("numa=%i maxnodes=%i\n",setitnuma,maxnodes);
>     nt = numa_get_interleave_mask();
>     for( i = 0 ; i < maxnodes ; i++ ) {
>       printf("node = %i mask = %i\n",i,nt.n[i]);
>       nt.n[i] = 0;
>       n2.n[i] = 0;
>     }
>     numa_set_interleave_mask(&nt);
>     nt = numa_get_interleave_mask();
>     for( i = 0 ; i < maxnodes ; i++ )
>       printf("checking memory interleave node = %i mask = %i\n",
>              i,nt.n[i]);
>     rnm = numa_get_run_node_mask();
>     printf("numa get run node mask = %lx\n",rnm.n[0]);
>     back = numa_run_on_node(0);
>     if( !back )
>       printf("set to run on node 0\n");
>     else
>       printf("failed to set run on node 0\n");
>   }
> Whatever i try, single cpu latency keeps 144-147 ns.
> A dual opteron dual core with 2.2Ghz dual core controllers shows  
> similar
> latencies. 200 ns for example when running 4 processes with the same
> testprogram.
> This single cpu latency behaviour of dual core opteron is ugly bad
> compared to other dual opterons which are not dual core.
> Nearly identical Tyan mainboard with dual opteron 2.2Ghz gives  
> single cpu
> with SAME kernel, with SAME program 115 ns latency. When turning  
> off ECC at
> that dual opteron it gets down to 113 ns even.
> The frustrating thing is, the dual opteron 2.2Ghz has pc2700,
> whereas the quad opteron dual core has all banks filled
> with pc3200 registered ram, a-brand.
> Vincent

Dr Stuart Midgley
sdm900 at gmail.com
