Intel Xeon E5 chips are clocked down to improve performance per watt
Performance scales poorly beyond 3GHz
By Lawrence Latif
Thu Jan 31 2013, 13:49
BOLOGNA: HIGH PERFORMANCE COMPUTING (HPC) vendor Eurotech said it scaled down the clock speed of Intel's Xeon E5 processors in order to meet its performance per watt goals.
Eurotech, which took the wraps off its Aurora Tigon cluster at the Cineca HPC datacentre and is in pole position to top the Green 500 list in May, said it scaled back the clock frequencies of the Xeon E5 processors to improve their performance per watt. The firm also showed results indicating that CPU performance gains hit a plateau after 3GHz.
Giampietro Tecchiolli, CTO of Eurotech, presented results from the HPL benchmark showing that beyond 3GHz CPU performance gains plateaued while power consumption continued to rise. Tecchiolli said the upshot was a measured drop in performance per watt as CPU frequency increased.
Tecchiolli confirmed that the Eurora cluster, an Aurora Tigon machine, uses Intel Xeon E5-2687W processors running at scaled-down frequencies, though he wouldn't give exact figures. He added that most of the compute power, especially in the publicised Linpack figures, comes from Nvidia's Tesla K20 GPGPU accelerator boards rather than Intel's Xeon chips, so running the CPUs at a lower frequency carries a relatively small peak performance penalty.
Geoff Ballew, senior manager of Nvidia's Tesla Compute business unit, told The INQUIRER that its Tesla GPGPU accelerator boards can also have their clock speeds tuned to specific applications within certain parameters, though he wouldn't say whether customers are having to do so right now to attain high performance per watt figures. Ballew said Tesla clock speeds can be altered to meet overall system power budgets put forth by datacentres.
That Eurotech, and presumably other HPC vendors, are having to underclock CPUs to improve performance per watt is deeply worrying for the future ability of x86 CPUs to push HPC performance into the exascale region. However, Eurotech's figures suggest that, since accelerators are doing the vast majority of the heavy lifting, lower power ARM cores could find their way into HPC clusters acting as little more than job schedulers, making a Tegra and Tesla HPC cluster seem all the more feasible once Nvidia adopts the ARMv8 architecture.
ARM chips can't even reach 3GHz, and no production mobile chips even reach 2GHz yet. The fact that Xeon holds its performance per watt up to 3GHz is actually impressive and bodes well for future mobile Haswell chips with turbo speeds up to 3GHz. You are too stupid to realise when you are actually posting good news about Intel chips! LOL
The HPC benchmark used to rank these machines is Linpack, a double precision floating point matrix computation. There are broadly two ways to arrange the hardware.
1. SMP: where you have a CPU with a lot of cores.
2. Accelerators: where you add boards to the system to speed up execution of the code (commonly a GPU solution, as in Eurotech's case).
When you add accelerators, there are two ways to run the benchmark.
1. Run entirely on the accelerators (Eurotech's approach, since the accelerator architecture differs from the host's).
2. Run on both the host CPU and the accelerators.
For the Eurotech configured machine, the host CPU is mostly shuffling data. Lowering the host CPU's frequency, and therefore the power it consumes, makes sense because the benchmark runs on the Tesla accelerators.
They displaced a Xeon + Phi machine from the top. The Xeon + Phi configuration can run code on the host Xeon only, on the Phi only, or on both Xeon and Phi using the OpenMP API. Intel released new C and Fortran compilers last January that support these execution modes.
For floating point code, a big difference comes from FMA instruction support, where a floating point multiply and add execute together as one fused operation. FMA makes a big difference in FP code. For the Green 500, Haswell will not make much of a difference, since most of the work is done on the accelerators; putting FMA on the Phi would make the bigger difference.
This piece is filled with broad speculation but very little fact. In HPC, performance is hugely dependent on the code. Since most real world code is rarely perfectly parallel, CPU performance on the non-parallel portions of an algorithm still greatly influences the shortest time to completion for any task. And, as the author suggests, if the majority of the work is done by accelerators rather than the CPU performing administrative functions, there is little to gain in power efficiency by using an ARM chip.