After a 23-year career at Bell Labs, Dr. Gerard Holzmann joined NASA’s Jet Propulsion Laboratory in 2003 to help create the Laboratory for Reliable Software (LaRS), which he currently manages. Dr. Holzmann is credited with inventing the SPIN model checker for distributed software systems and a Method and Apparatus for Testing Event Driven Software, as well as authoring “The Power of 10: Rules for Developing Safety-Critical Code” and the groundbreaking book Beyond Photography – The Digital Darkroom.
NASA Tech Briefs: After completing your education at Delft University of Technology in the Netherlands, you spent the first 20-plus years of your career at Bell Labs working in their Computing Science Research Center. What made you decide to leave Bell Labs in 2003 and work for NASA?
Dr. Gerard Holzmann: That’s a good question. I consider myself fortunate to have spent such a long time at Bell Labs during what many people call “the golden days of research at Bell Labs.”
NTB: It was arguably one of the most famous think tanks in the world.
Holzmann: Yes, that’s right. A lot of really important inventions came out of that lab. Also, at that time – the 1980s and 90s – Bell Labs had a lot of really talented people working there. The company that owned Bell Labs, which was AT&T at that time, was quite powerful and rich, so it was a very special research environment with almost no restrictions and virtually complete freedom for researchers.
But 23 years is a very long time and life is too short to spend it all in one place. When many of my colleagues started leaving Bell Labs around 2001 and 2002, I also started looking. In some sense, most of us tried to find a research environment as close as possible to the old Bell Labs. Many of my friends – about 20 of my colleagues from the Computer Science Research Center – found it in the somewhat unconventional culture at Google, which indeed is very similar to Bell Labs in the early eighties. I was more attracted to the passion that I saw in the scientists and engineers working on space missions at JPL. I’d had a lot of interaction with JPL, working on software verification problems, so that really attracted me to JPL. I was sort of partially recreating that spark from the early eighties of being part of a research agenda that people really felt passionate about, like the exploration of the solar system and the universe. I’m really happy I made that change.
NTB: You currently head up NASA/JPL’s Laboratory for Reliable Software. What is the Laboratory for Reliable Software, and what types of projects does it typically get involved with?
Holzmann: We started LaRS more or less as an experiment, to see if we could help figure out some of the more difficult software-related problems that can complicate the execution of ever more complex space missions. We’ve seen our share of anomalies on missions that we’ve flown in the past, some of them caused by software malfunctions. If you look at the trends, the amount of software that we use to control spacecraft is increasing very rapidly. The next mission to Mars, for instance — the Mars Science Laboratory (MSL) mission — will fly more code than all earlier missions to Mars combined. We want to make sure that this software is as robust as it can be. We want to avoid having to debug a crashed computer that sits on the surface of another planet, many millions of miles away.
As one first step in this process, we introduced new static source code analysis tools at JPL, and they are now pretty much a standard part of the software development life cycle here. We also introduced a strong new coding standard that has meanwhile become the JPL Institutional Standard for flight code. Within LaRS we hired some strong researchers in software verification and testing techniques, and their influence is starting to show. We’re working closely with missions such as MSL to help figure out ways to reduce risk and to improve the tools we use in developing really robust software.
NTB: You are probably best known in the field of software verification for inventing the SPIN (Simple Promela Interpreter) model checker for distributed software systems back in the 1980s. That tool is still in widespread use today. What sets SPIN apart from other verification systems, and to what do you attribute its incredible longevity?
Holzmann: I would also not have predicted that SPIN would last so long. In the best tradition of Bell Labs, it was meant to be simple to use, efficient, and based on good theory. A lot of work went into making the performance of that tool as good as it could be. The dictum at Bell Labs used to be, “Theory into tools and tools into practice.” SPIN sort of illustrates that trend. Of course, I developed it with a lot of help from my Bell Labs colleagues. There were some really, really good programmers there, people who developed Unix and C and C++, so I really benefited from that. A lot of the tool’s performance comes from their insights.
What amused me in the beginning was that a lot of people thought and said it was unfair to compare their verification tools to SPIN because the difference in performance was so large. Meanwhile, the field has become a lot richer. Today there are many more tools to choose from, so I am also surprised that SPIN can still hold its own and perform as well as, if not better than, many of the tools that were developed by people who weren’t even born yet when SPIN was created. That’s very rewarding.
NTB: That has to give you a tremendous amount of pride.
Holzmann: Yes. That is very nice to see. At some point it will end, but I’m still waiting for that to happen.
NTB: Shortly after joining the Jet Propulsion Laboratory in 2003, the Research and Development Council of New Jersey awarded you a Thomas Alva Edison patent award in the information technology category for another invention you came up with at Bell Labs called “Method and Apparatus for Testing Event Driven Software.” Tell us about that invention and how it was applied to the Mars Pathfinder, Deep Space 1, and Deep Impact missions.
Holzmann: Excellent question. In the late 1990s we had an opportunity to work with the people who were designing a new switch for Lucent Technologies. That’s something that doesn’t happen a lot because basically the technology that goes into telephone switches is fairly stable. There’s not a lot of renewal. So, they were starting a new development and they invited me to participate by trying to do formal verification on the software.
That patent covered a method that we came up with to convert ordinary C code into SPIN verification models that we could then mechanically analyze. I think that was the first time that anyone had dared to try to automate the verification process end to end: starting with implementation-level code (rather than a formal design model of an algorithm), through a mechanical verification process, back to error scenarios that trace an execution at the source level. We’ve applied some of the same techniques to recent space missions, but we also want to push the technology further and make the conversion step unnecessary. Within LaRS we’re developing what we call a ‘model-driven’ verification technique where we use the compiled executable from an application and drive it from a test harness that is controlled by the SPIN model checker. The results we’re getting with this approach are very encouraging.
NTB: In 2005, the Laboratory for Reliable Software, together with collaborators from NASA Ames, Kestrel, Northrop-Grumman, Princeton University, and your alma mater, Bell Labs, conducted a research project called Reliable Software Systems Development (RSSD). Tell us a little about that project and what it was designed to accomplish.
Holzmann: That was a very ambitious attempt to redesign the software development process from beginning to end. It was part of the Code-T funding stream. We participated in a fairly large competition with a proposal to basically redefine the software development lifecycle. It started with the observation that in every phase of the software development process — requirements capture, high-level design, low-level design, coding, testing, and operations — mistakes can be made. By developing the right tools, one can reduce the number of mistakes made, and improve the early detection of others. The plan was to develop a strong tool-based process to reduce the current reliance on much less rigorous human-based manual inspection processes.
We assembled a really great team and won the proposal, which initially was awarded at about $15 million over 5 years, which was really exceptional. And we made a heroic effort to deliver on the ambition. We started by taking a close look at the requirements process, and basically our premise was that the design process is human, it requires human beings to make assessments and decisions, and human beings, as we know, are not perfect. They do, occasionally, make mistakes. So, the goal was catching those mistakes, making it less likely that they could enter into the development lifecycle, and making it more likely that they could be intercepted early on. We decided that a tool-based process – increasing the tool component in every step of the software development process – could dramatically help to reduce the number of defects.
So, we started with requirements and built a requirements capture tool (RCAT) that was linked with an analysis capability based on the SPIN model checker. If you can capture design requirements in such a way that they don’t just live in a PowerPoint view graph or in an Excel spreadsheet, but they actually have semantics where you can analyze them for consistency and completeness, then you can plug those requirements into a design verification tool like SPIN and use them to generate test cases, etc.
The next step was design verification based on model checking techniques, improvements in coding through static source code analyzers, and improvements in testing through randomized test methods, simulation methods, prototyping methods, and so on, with the final deliverable using fault containment methods to mitigate the effects of any residual defects. So it was fairly ambitious, and about one-and-a-half years into the project we were really starting to produce very good results. Then, as usual, the Code-T funding stream was redefined and we lost our funding under that project because of the nature of the reorganization. Our project was one of the best-evaluated in the set, so we still have the same ambition and we’re continuing to work on it, but at a much slower pace right now. We’re still dedicated to getting those results sooner or later.
NTB: Several years ago, you created quite a stir in the computer world with a groundbreaking article called “The Power of 10: Rules for Developing Safety-Critical Code.” Please explain the Power of 10 concept and how you came up with it.
Holzmann: Well, again, that’s an excellent question. What I noticed when I came here was that every mission and project within JPL – and within NASA in general – would begin by defining a new project-specific coding standard for their software development effort. That coding standard would typically be based on the one that was used on the previous project by other people, but it would then be adjusted to the personal preferences of the new lead on the project. All of these standards differed in mostly insignificant ways, but one thing they all had in common was that they contained hundreds and hundreds of rules that no developer ever seemed to read or take seriously. To make things worse, most of these hundreds of rules would be completely unverifiable. You couldn’t even check mechanically if a developer was complying with the rules.
So, I did a quick scan of the types of software-related anomalies that we have seen in our missions in the last few decades and came up with a short list of problems that seem to be plaguing just about every mission that we fly. This led to the idea of just defining a very small set of rules that could easily be remembered, that clearly related to risk, and for which compliance could mechanically be verified. I published “The Power of 10: Rules for Developing Safety-Critical Code” in IEEE Computer in 2006, and you’re right, the reaction was rather impressive. The most interesting thing for me was that almost everybody who responded said, “Oh, excellent idea. That’s a great set of rules. There’s just one of the ten rules that I disagree with, and here’s why.” Then they’d give an exception, and every single person who responded would pick a different rule. That made me feel that the choices were probably right.
There is an exception to every rule, no matter what rule you come up with, even a traffic rule like stopping for a red light. You can think of exceptions, like you’re on your way to the hospital, or you’re a fire engine on your way to a fire. There are always exceptions, but as a general principle it is good to have these rules, and in most cases you try to live by them. If there’s an exception, you have to think about it really hard and justify it.
NTB: It provides traceability. At least when you take an exception to one of the rules, you know where the deviation is and if there’s a problem, you can track it back, correct?
Holzmann: Exactly right! And you can read the justification for the exception as well. It’s not just a hidden thought process; it’s now explicit.
So, over the next two years we encapsulated these rules in a slightly broader standard that was formally adopted as an Institutional Coding Standard for the Development of Flight Software at JPL. We now have a compact, single standard that covers all of our projects and missions, and we have the tools to check compliance with that standard. The MSL mission is the first large project to enforce the standard, and I’m convinced that it will help make that code more robust.
NTB: Some of the rules just seem like common sense, so why do you think the article generated so much interest when it was published?
Holzmann: Well, people do get excited about rules, especially if you want to restrict what a developer can do. If you are a developer, of course, you want no part of that. So I was actually surprised that there was such a positive response to the rules. Of course, the fact that we chose ten rules, when another very popular set of ten rules already exists, may have sparked people’s interest as well.
NTB: Some people might argue that you’re trying to do the impossible – take very complex computer code written by humans and make it 100 percent error free. How do you respond to that?
Holzmann: That’s a very good point. That’s actually not what we’re trying to do. Writing software requires human ingenuity, and human beings are not perfect; no matter how careful and precise we try to be, we do occasionally make mistakes. We can certainly reduce the number of uncaught mistakes in software by a very wide margin, and we’re working hard to make that happen, but I don’t think it is reasonable to expect 100% perfection 100% of the time. An important part of our strategy is, therefore, to develop stronger fault containment techniques for flight software. When one software module gets the hiccups, for instance, there is no need for another software module to sneeze. More to the point, if one module experiences a fault, other modules that perform independent functions should be able to continue safely while the misbehaving module is restarted or replaced.
This is not how most of our current flight systems work. When one module faults, typically the entire system reboots abruptly. That can be fine on the surface of Mars when nothing much exciting is happening, but it would be unsurvivable during mission-critical events such as launch, orbit insertion, or the entry, descent, and landing phase. We have to learn to develop software systems with stronger firewalls between modules. The trick in building reliable systems is, of course, not in producing perfection. One doesn’t build a good skyscraper by using only perfect beams. Reliability is a system property, not the property of any one component of a system.
We have to assume that every component of a system can break, so good system design means building reliable software systems from potentially unreliable software parts – just like we already know how to do by using redundancy in spacecraft hardware. Now, what works in hardware does not easily translate into software. You don’t make a software system more reliable, for instance, by running the same identical software on two computers in parallel, so part of our research agenda here at JPL is to come up with software architectures that allow individual components to fail without bringing down the whole system. Those are fault containment techniques that let a failing module be restarted or replaced in isolation.
NTB: One of your other interests in life is photography. Back in 1988, years before digital photography caught on, you authored a book called “Beyond Photography – The Digital Darkroom,” published by Prentice Hall. In addition to coining the term “digital darkroom,” the book also boldly predicted the invention and development of digital cameras and the resulting demise of film cameras. How did you see that coming?
Holzmann: That book was based on some fun work I had done on developing digital darkroom software at Bell Labs starting in 1984. When I saw some of my colleagues at Bell Labs who were into graphics working with synthesized images on a computer, I got this idea. Being a photography enthusiast, my idea was to use digitized photos and try to replicate anything you can do in the darkroom with a computer.
We had a lot of fun building that software. Part of the fun was that we had a fairly large database of portraits that a colleague of mine at Bell Labs, Rob Pike, and I had taken. We had taken 4” x 5” Polaroid portraits of all of our colleagues – most of them quite famous people – so we digitized that database and then it was just begging for us to do something with it, like changing people’s faces and things like that.
So, we built this digital darkroom software, and initially it led to this thought that it should be unnecessary to first take a conventional picture, digitize it on an expensive scanner, and then manipulate it on a computer and print it back out. It should be possible to just take the picture digitally in the first place, so I tried for a while to get people excited about building a digital camera, but my colleagues told me that it was impossible to get large enough CCDs for such a camera with the existing technology. CCDs were invented at Bell Labs, so I was in the right place to pursue it, but at that time, in 1984, the resolution of the CCDs was not yet sufficient. That, of course, meant that it was just a matter of time. It would happen, and once it did happen, since the digital photography software was so powerful and so much better than the conventional darkroom process with chemicals and everything, it was inevitable that the change would happen. So I made that prediction in the book that I published in 1988.
Actually, that book was published under unusual conditions. Prentice-Hall had asked me to write a textbook about software verification techniques, and I told them I really didn’t feel like it, but that I’d consider it if they let me publish the photography book first. The digital photography book was much more fun to create than the software verification book.
NTB: And very much ahead of its time.
Holzmann: It was ahead of its time. I still occasionally exchange emails with my publisher, John Wait, sending him snippets from the news about Kodak shutting down its conventional film business and things of that nature. I tell him, “See, I told you so,” because he, of course, did not believe it at the time.
Today, every amateur can do more with Photoshop on a computer than he or she could ever have dreamt of doing in a standard darkroom. I still have all my old film cameras, but they’ve been joined by a growing family of digital siblings. Like most people I haven’t taken a picture with a film camera in a long time, but I think I’ll only really be happy when I can mount a full-frame sensor on the back of my 8” x 10” Cambo view camera. At the old film resolution, that should produce images of about 150 gigapixels. I don’t think we’re quite ready for that yet, but very likely that, too, will come. Give it ten years.
To listen to this interview as a podcast: