|
About 5 Tbytes of SATA hard drives, a combination of 400-Gbyte and 500-Gbyte drives from Western Digital, are attached directly to the super node, Daninger said. They are controlled by a high-performance Areca RAID controller from Brea, Calif.-based Tekram Systems. One feature of the Areca that Daninger likes is its error lights, which indicate which drive has failed.
The cluster nodes run the CentOS Linux operating system, an open source derivative of Red Hat Enterprise Linux. Daninger said this helped keep costs low by avoiding license fees on all the nodes. "It works for a fairly savvy end-user like the University of Minnesota," he said.
To tie the cluster nodes together, Reason used Warewulf, Lawrence Berkeley National Laboratory's software for managing Linux clusters. "It works great with diskless nodes and works hand-in-glove with CentOS," Daninger said. Also included was the Sun Grid Engine, which allocates processors to jobs according to priority. Reason also brought in Ganglia, an open source application that provides a graphical view of how busy the nodes are.
Reason initially built an eight-node test cluster, but found that eight nodes was the maximum that could be tied together with Ethernet. So Reason brought in Myrinet, a proprietary networking solution from Myricom, Arcadia, Calif. Latency for each transaction using Myrinet is one-fifth that of Ethernet, Daninger said. "In these parallel environments, code is written so that one node can do its part of an operation and hand it off to another node," he said. "This all takes time."
Before installing the laboratory's application, Reason downloaded a computational fluid flow application from NASA to benchmark the cluster for the laboratory. "Its benchmarks are well known in the fluid dynamic market," Daninger said. "It let us prove the cluster before the university added their software."
Fortunately, the configuration and deployment of the cluster went well, because the contract called for it to be up and running 30 days from the day the purchase order was signed. "We delivered it on day 30," Daninger said. "My understanding is there was a time limit on the grant. If it was not done on time, they could lose the money. So there was some stress." But there were minor problems, including the occasional bad driver and a few bad memory modules, as well as some driver issues with the Myrinet cards, and Daninger recalls a lot of late-night lunches. Ge, at least, had some fun with the memory modules, some of which caused a compute node to crash at random. It took some time to realize that the memory modules were the problem. "During that time, our IT guys and users would play a game to predict on which day the next crash would occur and on which node," he said. "I won both times." The pressure was double for Reason, which was moving to a larger facility at the same time it was building the cluster. The move required installing enough power in the facility to test clusters the size of the one being built, Daninger said. "I told the electrician we needed power connects of 14 kilowatts for a computer system in the new office. He said, 'What the heck kind of a computer you putting in?' " Dr. Fotis Sotiropoulos, director of the St. Anthony Falls Laboratory, said his laboratory does a wide range of projects related to fluid dynamics. In fact, the cluster built by Reason is also used to study the water flow in rivers and streams to aid in river restoration projects and to see how the flow of fish is affected by hydroelectric facilities, Sotiropoulos said. "Biology is just one part, but a big part," he said. For scientific applications, there are never enough computing resources, and more grants could mean adding more. That is likely to keep Reason and its competitors busy for a very long time. |
