Iowa State University made a splash late last month when it rolled out its latest high-performance computer, Cyence. News releases and stories touted the $2.6 million machine’s speed: just over 183 teraflops (trillion scientific calculations per second). It would take a single human 5 million to 6 million years to do as many calculations as Cyence can do in a second, the press said.
For ISU, it’s a terrific machine – although it’s a shadow of the world-class supercomputers at U.S. Department of Energy, Chinese and European laboratories. Cyence definitely will let ISU researchers do cool things, leading to insights that will advance science.
But the releases and the stories aren’t telling the full story. And they may paint a somewhat inaccurate picture of Cyence’s capabilities.
What the PR doesn’t say, but ISU engineers readily admit, is that Cyence is unlikely to ever run at the touted 183.043 teraflops.
That’s the theoretical peak speed. Basically, it’s the total speed of all the hundreds or thousands of processors added together. If they all ran perfectly, constantly received data and instructions to work on, and shared and transmitted data as fast as they’re produced, Cyence could hit its rated speed.
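The arithmetic behind a theoretical peak is simple enough to sketch. The 248-node, 16-core figures below come from the specs discussed later in this piece; the 2.0 GHz clock and 8 floating-point operations per cycle are illustrative guesses, not Cyence’s actual numbers:

```python
# Back-of-the-envelope arithmetic behind a "theoretical peak" rating.
def peak_teraflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    # peak assumes every core retires its maximum FLOPs on every cycle,
    # with no waiting for data -- the best case that never happens
    gigaflops = nodes * cores_per_node * clock_ghz * flops_per_cycle
    return gigaflops / 1000.0  # gigaflops -> teraflops

# 248 nodes x 16 cores (from the article); clock speed and FLOPs per
# cycle here are hypothetical stand-ins, not Cyence's real specs
print(peak_teraflops(248, 16, 2.0, 8))  # 63.488
```

Accelerator chips, covered below, would add their own peak numbers on top of a CPU-only figure like this one.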
In practice, that never happens, as Arun Somani, the ISU associate engineering dean who led Cyence’s acquisition, noted. “That is almost a given,” he told me. “Nobody gets the theoretical peak speed.”
Like all giant computers, Cyence is a parallel processing machine. It breaks up big problems and parcels the pieces out to individual processors. So instead of having one person do 183 trillion calculations over millions of years (working in serial), thousands of processors do trillions of calculations simultaneously (in parallel). The computer then combines the pieces into a complete answer.
Actually, rather than an answer, high-performance computers typically produce something like a picture. They run models, using fundamental physics equations to portray (and, if they’re really good, predict) what happens in things like the climate, a flame or a group of molecules. Each processor calculates the physics in a small chunk of the model. The pieces are assembled into a complete picture, like the pixels that make up a digital photo.
But in reality, not all processors work simultaneously. Some may finish their work before others and have to wait. Processors also must pause to send and receive information. If it’s a lot of data, it can overwhelm the pipelines connecting processors to each other, to working memory or to long-term storage, causing more delays.
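The split-work-combine pattern, and the waiting it causes, can be sketched in a few lines. This is a toy model, not how a real supercomputer schedules work; the combine step can’t finish until the slowest worker does, which is one source of the delays described above:

```python
# Toy version of parallel decomposition: parcel a big sum out to
# several workers, then combine the partial results at the end.
from concurrent.futures import ThreadPoolExecutor

def work_on_chunk(chunk):
    # stand-in for the physics each processor computes on its piece
    return sum(x * x for x in chunk)

data = range(100_000)
chunks = [data[i::8] for i in range(8)]  # deal the data out to 8 workers

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() waits for every worker -- fast ones idle until the slowest is done
    partials = list(pool.map(work_on_chunk, chunks))

total = sum(partials)  # the combine step
print(total == sum(x * x for x in data))  # True
```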
So how fast is Cyence really? Typically, computer scientists run benchmarks to get an idea of actual performance. The benchmarks put demands on the computers similar to those they may face in real calculations. Using a standard benchmark also lets scientists see how different computers and computer architectures compare, like using the same ruler to see who has the longest … fingers. The LINPACK benchmark, for instance, is the basis for the famous TOP500 ranking of most powerful computers.
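The idea behind LINPACK can be shown in miniature: time a dense linear solve and convert the operation count to a rate. This is only a sketch of the principle, using NumPy on one machine, not the real TOP500 harness:

```python
# A miniature LINPACK-style measurement: time a dense solve of Ax = b
# and convert it to floating-point operations per second.
import time
import numpy as np

n = 1000
rng = np.random.default_rng(0)
A = rng.random((n, n))
b = rng.random(n)

start = time.perf_counter()
x = np.linalg.solve(A, b)          # LU factorization + triangular solves
elapsed = time.perf_counter() - start

flops = (2 / 3) * n**3 + 2 * n**2  # standard LINPACK operation count
print(f"{flops / elapsed / 1e9:.2f} gigaflops on one {n}x{n} solve")
```

Run on bigger and bigger problems across many nodes, a measurement like this also stresses the communication links between processors, which is why benchmarked speeds land well below theoretical peaks.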
Somani, who also is a professor of electrical and computer engineering, said ISU has benchmarked Cyence, but the results aren’t yet available.
There’s another reason Cyence’s 183 teraflops speed is a bit misleading: To get anywhere near that speed, an application would have to run on every part and processor.
But few applications run on the entire machine. Instead, like most big machines, Cyence will be partitioned to run multiple applications simultaneously, each on a piece of the computer.
Somani said Cyence typically will have at least four simultaneous jobs. It has to be that way: When ISU applied for the $1.8 million National Science Foundation grant that helped pay for the machine, it promised it would be used to support 17 research projects from eight departments, ranging from bioscience to energy systems. (Read the release for more on how ISU paid for Cyence.) If each project had to wait for another to finish using the entire machine, researchers would be unhappy.
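Partitioning a machine among simultaneous jobs is essentially an allocation problem. Here is a toy first-come-first-served allocator; the job names and sizes are made up for illustration, and real schedulers are far more sophisticated:

```python
# Toy first-fit allocator: several jobs share the machine at once,
# each getting its own partition of nodes. Job sizes are invented.
TOTAL_NODES = 248

def allocate(jobs, total=TOTAL_NODES):
    """Give each job a contiguous slice of nodes, first come first served."""
    placements, next_free = {}, 0
    for name, size in jobs:
        if next_free + size > total:
            placements[name] = None  # not enough nodes left -- must wait
        else:
            placements[name] = range(next_free, next_free + size)
            next_free += size
    return placements

jobs = [("climate", 100), ("flame", 80), ("molecules", 50), ("bio", 40)]
placements = allocate(jobs)
# the first three jobs run side by side; "bio" comes back None
# because only 18 nodes remain, so it waits its turn
```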
As it is, using even part of the machine will produce results faster for each researcher than using all of Cystorm, the 28.16 teraflops (but benchmarked at 15.44 teraflops) system Cyence succeeded as BCOC (Big Computer On Campus).
And “if somebody wants to run a single job on the full machine, they can do it but they will have to make a reservation,” Somani said. Engineers and computer scientists also will use the full machine for benchmarking and other tests.
I’m not saying ISU was intentionally deceptive. When Cystorm was installed, the university noted its benchmarked speed as well as its theoretical peak.
And I’m not denigrating Cyence. It’s a powerful machine with some unusual features that represent the future of high-performance computing.
First, there’s Cyence’s architecture – its innards.
The machine has 248 nodes (a node has multiple processors sharing memory and network connections), each comprising two Intel chips with eight processor cores apiece, for a total of 16 cores per node. (The most powerful home PCs may have six cores per chip.)
Those are paired with 128 gigabytes of working memory – where data and programs are loaded for crunching and results are held for outputting. It’s like the 8 or so gigabytes of random access memory (RAM) on your laptop or desktop computer.
Cyence also has 1,000 terabytes (a petabyte) of long-term storage, similar to the disk drive on your computer.
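Adding up the published per-node figures gives a sense of the machine’s scale. One assumption below: the 128 gigabytes is read as per-node memory, which the numbers above don’t state explicitly:

```python
# Totals implied by the per-node figures quoted above.
nodes = 248
cores_per_node = 16   # two 8-core chips
ram_per_node_gb = 128  # assumed to be per node

total_cores = nodes * cores_per_node
total_ram_tb = nodes * ram_per_node_gb / 1024  # gigabytes -> terabytes

print(total_cores)   # 3968
print(total_ram_tb)  # 31.0
```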
That part of Cyence looks like most standard high-performance computers, such as CyBlue, the Blue Gene/L ISU installed in the mid-2000s: lots of standard processor cores linked together.
But another 48 Cyence nodes have a heterogeneous architecture: they combine standard multicore processors with other kinds to accelerate calculations.
So in 24 Cyence nodes the 16-core chips are matched with two NVIDIA graphics processing units (GPUs). GPUs are descended from the chips first used to quickly redraw images for computer and video games. They perform the same operation on lots of data, boosting overall speed while consuming less power than standard chips.
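The “same operation on lots of data” style that GPUs excel at can be sketched with NumPy, whose whole-array operations are a rough CPU-side analogue of a GPU applying one kernel to many elements at once:

```python
# Data-parallel style versus one-at-a-time style. A GPU pushes the
# array form much further, running one operation across thousands
# of elements simultaneously.
import numpy as np

pixels = np.arange(12, dtype=np.float64)

# serial style: visit one element at a time
serial = [p * 0.5 + 1.0 for p in pixels]

# data-parallel style: one operation over the whole array
parallel = pixels * 0.5 + 1.0

print(np.allclose(serial, parallel))  # True
```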
Another 24 nodes pair the standard processors with another kind of accelerator: Intel’s Xeon Phi coprocessor, built on the company’s Many Integrated Core (MIC) architecture. The idea is the same as with the NVIDIA GPUs; in fact, one software company recently compared how the two performed on a typical finance problem.
Heterogeneous architectures are becoming common on the world’s largest computers, like Titan at Oak Ridge National Laboratory. They help address a major obstacle: big computers consume lots of power. In fact, unless power consumption is addressed, just one of the next-generation exascale machines expected in the next decade will eat as much electricity as a city.
Accelerators give big machines a speed boost with little additional electric consumption. But heterogeneous architectures can give programmers headaches.
ISU wanted the mixed configuration to provide added research options. Some researchers “like to use GPUs. Some want to run on a different set of chips sometimes, like to compare” how codes run on different configurations, Somani added.
ISU has been silent on the system’s manufacturer. CyBlue was an IBM. Cystorm was made by Sun Microsystems, something ISU announced when it was turned on. But I had to ask to learn that Atipa Technologies assembled Cyence.
Atipa builds clusters kind of like how Dell or Toshiba builds PCs: assembling mostly off-the-shelf parts. It’s another trend in high-performance computing, Somani said: “It’s becoming assembly by unit now. You pick different parts” and a company delivers it. “Even if we bought it from IBM, they would have done this the same way.”
Cystorm will continue operating for now. It’s only four years old, which makes it ancient in supercomputer terms. (Los Alamos National Laboratory recently retired Roadrunner, the first computer to hit a quadrillion calculations per second. Machines 20 or 30 times faster had superseded it.)
Why did ISU think it needed a computer more than six times faster than what it had? “The size of the science we want to do is going up,” Somani says. “The models we want to run are bigger” – more molecules, more climate, more everything, captured in greater detail over longer periods of time.
We’ll be watching to see what comes out.