As Engineering Branch Chief for NASA's Advanced Supercomputing Division, Bill Thigpen led the team that built and deployed the 10,240-processor Columbia supercomputer in just 120 days. Listed as one of the world's fastest and most powerful supercomputers, Columbia is just part of the computing resources currently being managed by Mr. Thigpen.
NASA Tech Briefs: As chief of the engineering branch for NASA's Advanced Supercomputing Division, what are your primary responsibilities?
Bill Thigpen: My primary responsibilities are to make sure that the systems are up and running, and that they're being used effectively by the scientists and engineers at NASA.
NTB: One of your most significant achievements to date at NASA has been leading the team that built and deployed the Columbia supercomputer. Tell us about that project and some of the challenges you faced managing it.
Thigpen: Well, the first thing is that it was done in 120 days, which is far faster than any other system of its size has been deployed. It normally takes several years to put in a system this big. From the time that we gave the order to SGI (Silicon Graphics Inc.) until we were fully operational was 120 days. That was done on an operational floor, so we didn't bring the existing systems down until we had enough of Columbia built to provide the users with more capability than they had prior to Columbia.
We say 120 days, but the first users actually went on the system in July – the first week of July – so the order went to SGI in the middle of June 2004, and by the first week of July we started putting users on the first nodes of Columbia that came in. Then we built the system as it went across.
As for challenges, there were several. One was keeping the floor operational while we were bringing in that much of a system. The system actually filled our entire floor. We were building this system basically from one side of the computer floor to the other side, and we were taking out all of the existing systems that were there. In one 10-day period we actually got nine 512-processor nodes in. Each of those nodes is more than 3 teraflops, which is more compute capability than a lot of high-end computing centers have. So, the electrical upgrades had to happen and all the networking had to happen. We also had plumbing that went into some parts of the floor because, of the 20 nodes that went in, 8 needed liquid cooling. It was a very intense time.
Basically, we had a standup meeting every day to make sure that everything was on track and everything was going well.
NTB: What types of projects are typically run on Columbia?
Thigpen: Oh man! It's a high-end computing resource for the agency overall, so all four mission directorates are running on this system. And the mission directorates are all doing different types of work on it.
If you look at the Science Mission Directorate, which is our largest user, they're doing both earth science and space science. In the earth science arena they're looking at things like climate change, ocean modeling, and earthquake modeling. A lot of the processing of the data that the NASA satellites are gathering is also occurring on this system.
The Exploration Systems Mission Directorate is our second largest user, and they're doing a lot of work on the next-generation spacecraft. With the Constellation program and the Ares rockets, they're looking at both the safety and design aspects of these systems, and how we're going to get to the Moon, how we're going to get to Mars.
I didn't talk about the Space Science Division. Space Science is looking at things like solar weather, and there was a lot of work done on colliding black holes on the system. Then, in the aeronautics arena, they're looking at how to make engines quieter, how to fly planes on Mars, and also research into fundamental aeronautics is going on.
And then the Space Operations Mission Directorate does a lot of work on safety for the Space Shuttle.
NTB: Have any significant updates or upgrades been made to the system since it first became operational?
Thigpen: Yes, as a matter of fact, there have. We have increased the system by 40 percent. That was done in two steps. First, the Exploration Systems Mission Directorate (ESMD) needed more processing capability, so they paid for an addition to the system and they get all of that addition. That was 2,048 processors that went in as two 1,024-core nodes.
Then there was also an addition that we made to the system in looking at the next generation of systems for the NAS (NASA Advanced Supercomputing) Division, and that was a 2,048-core node that is now being used by all four mission directorates.
NTB: So how many total processors does the system now have?
Thigpen: The total is over 14,000.
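The "over 14,000" figure can be checked against the numbers quoted earlier in the interview. The following is a minimal sketch of that arithmetic, assuming that "processors" and "cores" are counted the same way throughout; the constant and function names are illustrative only.

```python
# Figures quoted in the interview (names are illustrative, not NASA's).
ORIGINAL_PROCESSORS = 10_240   # Columbia as first deployed
ESMD_ADDITION = 2 * 1_024      # two 1,024-core nodes paid for by ESMD
NAS_ADDITION = 2_048           # one 2,048-core node shared by all directorates


def columbia_totals():
    """Return (total processors, percent growth over the original system)."""
    added = ESMD_ADDITION + NAS_ADDITION
    total = ORIGINAL_PROCESSORS + added
    growth_percent = 100 * added / ORIGINAL_PROCESSORS
    return total, growth_percent


total, growth = columbia_totals()
print(f"{total:,} processors, {growth:.0f} percent larger than the original")
```

Under that assumption the two additions account for both the 40 percent increase and the total of just over 14,000.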
NTB: What are Columbia's current performance characteristics in terms of speed, storage capacity, etc., and how does it compare to some of the world's other top supercomputers?
Thigpen: We just ran a LINPACK benchmark on 23 of the system's 24 nodes (we didn't run it on all 24 because we didn't want to take the system down long enough to bring the 24th node in) and we got 66.5 teraflops. If the numbers held, that would place the system 13th in the world as far as speed of the computer goes, but they won't hold because there are other new systems coming in. There's over one-and-a-half petabytes of disk storage that's part of the Columbia system.
The other thing that's happening now that I think is pretty significant is, we're delivering about 1.9 million hours to NASA every week from Columbia.
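To put the 1.9 million hours in context, it can be compared with the most the machine could deliver in a week. This is a rough sketch under two assumptions that the interview does not state outright: that "hours" means processor-hours, and that the system has 14,336 processors (the original 10,240 plus the two additions described above).

```python
HOURS_PER_WEEK = 7 * 24

# Assumption: total processor count derived from figures quoted in the interview.
TOTAL_PROCESSORS = 10_240 + 2 * 1_024 + 2_048

# Assumption: the quoted 1.9 million hours are processor-hours.
DELIVERED_HOURS_PER_WEEK = 1_900_000


def weekly_utilization(delivered_hours, processors):
    """Fraction of the theoretically available processor-hours delivered."""
    available_hours = processors * HOURS_PER_WEEK
    return delivered_hours / available_hours


utilization = weekly_utilization(DELIVERED_HOURS_PER_WEEK, TOTAL_PROCESSORS)
print(f"Delivered share of available processor-hours: {utilization:.0%}")
```

If both assumptions hold, the system is delivering a little under four-fifths of its theoretical weekly capacity to users.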
NTB: That actually relates to my next question. I assume there's a lot of demand for computer time on Columbia. How do you prioritize the projects and decide which ones will run and which ones won't?
Thigpen: Each of the mission directorates gets a certain allocation on the system and that's set at headquarters. That can vary, and we can change priorities based on what the agency requirements are. So there's an allocation that is sort of handed out to the mission directorates, and there's an agency priority allocation that can be given as the agency deems appropriate whenever needed. Then, within each of the mission directorates they prioritize the work that gets done, and then we just implement whatever it is that they prioritize.
NTB: Computer technology becomes obsolete very quickly these days. What is NASA doing to ensure that Columbia stays up-to-date and state-of-the-art?
Thigpen: We're right in the middle of a NAS technology refresh, where we're looking at what the next generation of system is going to be. Basically there are three technologies that are very promising that we are looking at right now. Within the next month, probably, we should make a decision on where that is going to go.
We're looking at a follow-on, basically, to Columbia, with the Columbia technology. We're also looking at IBM's next technology, which is the POWER Series computer. We're looking at the POWER6, which is coming out this year. And then we're also looking at a more standard cluster based on the Xeon processor. That would be a Quad-Core processor, and those look real promising right now.
NTB: What about security? I would think one of the world's most powerful supercomputers would be a very attractive target for hackers and spies. How do you protect a system like that from unwanted intruders?
Thigpen: We're pretty much attacked 24 hours a day, 7 days a week. We have multiple layers of security that protect us, and we have a team of people whose job it is to make sure we're protected.
In front of the system there are secure front ends. The secure front ends are very pared-down Unix machines, and that's all that faces the world. You can't get into Columbia without going through those secure front ends, so they're like walls around the system. It requires dual-factor authentication to get into the front end, and the passwords change at least once a minute. People have a code that they enter and they have a fob that is basically generating a new password (mine generates them every 30 seconds), so you put in a combination of those two things. Then you have a standard password that also has to meet strict criteria: how many letters it has, that it can't have words in it, and that it has a combination of special characters, regular characters, and numbers.
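The general idea of a fob that produces a new code every 30 seconds can be illustrated with a time-based one-time password. The sketch below uses the openly published HOTP/TOTP construction (RFC 4226 and RFC 6238); that is an assumption made purely for illustration, since the interview does not say which algorithm the NAS fobs use, and commercial fobs often use proprietary schemes. The function names and the shared secret are hypothetical.

```python
import hashlib
import hmac
import struct


def one_time_code(secret: bytes, at_time: float, step: int = 30, digits: int = 6) -> str:
    """Time-based code in the style of RFC 6238 (illustrative, not NASA's scheme)."""
    counter = int(at_time // step)  # changes once per `step` seconds
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F      # dynamic truncation from RFC 4226
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify_login(pin: str, code: str, stored_pin: str, secret: bytes, now: float) -> bool:
    """Both factors must match: the memorized PIN and the fob's current code."""
    pin_ok = hmac.compare_digest(pin, stored_pin)
    code_ok = hmac.compare_digest(code, one_time_code(secret, now))
    return pin_ok and code_ok


# Hypothetical shared secret; a real deployment would provision one per fob.
secret = b"12345678901234567890"
code = one_time_code(secret, at_time=59)
print(verify_login("4321", code, "4321", secret, now=45))  # same 30-second window
print(verify_login("4321", code, "4321", secret, now=61))  # window has rolled over
```

Because the code depends on the current 30-second window, a captured code stops working once that window has passed, which is the property described in the answer above.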
NTB: Aside from Columbia, what other resources does NASA's Supercomputing Division have to offer?
Thigpen: We have a new IBM POWER5 (it's called a 575+) and it has 40 compute nodes, each a 16-way SMP built from 8 dual-core chips. It's got a peak rating of 4.9 teraflops and it has 180 terabytes online. That is one of the next-generation systems we are looking at in the NASA technology refresh; the follow-on to that is the POWER6 that I alluded to earlier.
There's also a cluster that we'd bought for ARMD, so it's not available to the entire agency. Just like ESMD, ARMD wanted to have more compute capability than came in under the regular SCAP (Shared Capability Assets Program) effort.
NTB: What does ARMD stand for?
Thigpen: Aeronautics Research Mission Directorate. That is an Altix ICE machine. It has 4,096 cores, it has 4 terabytes of memory, and it has 43 teraflops peak capability. There's 240 terabytes of disk connected to it, and that's based on the Xeon Quad-Core. We also have a new data analysis and visualization system going in.
NTB: What will that do?
Thigpen: This will provide the analysts with a way of looking at this huge amount of data that's generated. This will actually have 245 million pixels; it's an 8-by-16 tiled LCD panel display. It's a fairly significant computer system all on its own. It will have 1,024 Opteron cores, which is an AMD Quad-Core processor. It will have 128 graphics processing units (GPUs) that are NVIDIA 8800 GTX cards, and it will have about 74 teraflops of peak processing and 350 terabytes of storage. We're in the process of building that right now, and that will be directly connected to our high-end computers through InfiniBand. It allows for visualization to be embedded into the simulations that are running, so you get higher temporal resolution on it, and it just works a lot quicker and provides the analyst with a lot better visualization of the data. It can also be piped out to any of the other centers as the visualization is occurring.
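The display figures quoted above fit together in a simple way. The sketch below works only from numbers in the interview (an 8 by 16 grid, 245 million pixels, 128 GPUs); the per-panel resolution is not stated in the interview, so the result is an inferred average, and the names are illustrative.

```python
# Figures quoted in the interview.
ROWS, COLUMNS = 8, 16
TOTAL_PIXELS = 245_000_000   # "245 million", presumably rounded
GPUS = 128


def wall_breakdown():
    """Return (panel count, average pixels per panel, GPUs per panel)."""
    panels = ROWS * COLUMNS
    pixels_per_panel = TOTAL_PIXELS / panels
    gpus_per_panel = GPUS / panels
    return panels, pixels_per_panel, gpus_per_panel


panels, per_panel, gpus_each = wall_breakdown()
print(f"{panels} panels, about {per_panel / 1e6:.2f} million pixels each, "
      f"{gpus_each:g} GPU per panel")
```

The grid works out to 128 panels, which matches the GPU count one-for-one, at a little over 1.9 million pixels per panel. That would be consistent with, for example, 1,600 by 1,200 panels, though the interview does not give the panel resolution.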
NTB: Looking ahead, what do you see as being some of the biggest technological challenges facing NASA's Advanced Supercomputing Division in the future?
Thigpen: Being able to harness the multiple cores that are going to be in these chips. What the vendors are doing is getting more and more cores per chip. The challenge is really going to be extracting the full capability of those cores.