Monday, October 31, 2016

R-17 VS Patriot: a Rounding Issue

This is another piece in our series of articles about the importance of high-quality code in computer systems whose failure can cause huge expenses or casualties. This time we will talk about the reliability of embedded software in military equipment.
Picture 1
February 11, 1991, the Israeli forces inform the Patriot Project Office about a defect found in the Patriot surface-to-air missile defense system. They discovered that running the system for 8 consecutive hours resulted in a 20% loss of targeting precision, and estimated that after 20 hours of continuous operation the inaccuracy would grow so large that the Patriot would no longer be able to lock on to, track, and intercept ballistic missiles. The U.S. commanders underestimated the importance of the discovery, presuming that the system would never run for more than 8 hours, as it had been designed as a mobile system for short-term defense operations.
February 16, a bug fix is issued, but applying it to every unit requires some time due to the ongoing war.
February 21, the commanders issue a directive that the system should not run "for a very long time". It is not specified how much exactly "a very long time" is.
February 25, an R-17 ballistic missile (also known as Scud) strikes a U.S. Army barracks in Dhahran, Saudi Arabia, killing 28 soldiers and injuring 96. The Patriot battery failed to intercept the missile due to a software error.
February 26, the bug fix is delivered to Dhahran.
Picture 2
Picture 6
R-17 (NATO reporting name SS-1C Scud-B; exported under the name R-300) is a Soviet single-stage ballistic missile propelled by storable liquid fuel.
Picture 7
Picture 12
Officers examining an R-17 missile shot down by a Patriot MIM-104 SAM system in the desert during Operation Desert Storm
The MIM-104 Patriot is a U.S. surface-to-air missile (SAM) defense system used by the USA and several allied nations.
Picture 13
Picture 17
Picture 19
A detailed view of an AN/MPQ-53 radar set. The circular pattern on the front of the vertical component is the system's main phased array, consisting of over 5,000 individual elements, each about 39 millimeters (1.535 in) in diameter.
Picture 32
PAC-3 missile launcher, note four missiles in each canister
An investigation discovered a bug in the Patriot's tracking software that caused the system's internal clock to drift gradually from the real time.
The time was stored as an integer count of tenths of a second. To calculate a target's location, this count had to be converted to a real number by multiplying it by 1/10, and the constant 1/10 was held in a 24-bit fixed-point register, so a small portion of its value was lost [source].
1/10 is 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ... In other words, the binary expansion of 1/10 is 0.0001100110011001100110011001100.... Stored in a 24-bit register, this value was chopped to 0.00011001100110011001100, leaving an error of 0.0000000000000000000000011001100... in binary, or about 0.000000095 in decimal. Over 100 hours of continuous operation, this error builds up to 0.000000095 × 100 × 60 × 60 × 10 = 0.34 seconds.
An R-17's velocity is 1676 m/s, so it covers over half a kilometer in 0.34 seconds, which is more than enough for the missile to slip out of the Patriot's intercept range. Ironically, this time-calculation bug had been fixed in some parts of the software, but not in all of them.
The software had been written in assembly language 15-20 years earlier and was modified a number of times by different teams of programmers over the following years.
The slides shown below are taken from the report on the Patriot system's failure:
Picture 33
Picture 35
Golden rules for programmers:
  • Choose adequate sizes for your variables. Always check twice how many bits each of them requires for storing values (long, int, double, float, etc.) in a given language and a given operating system.
  • Use integer numbers instead of floating-point ones wherever possible. Measure money in cents, not in dollars. If you can't do without float, use double-precision format.
  • Never use floating-point numbers as loop counters.
  • Avoid mixing types (signed -- unsigned; integer -- floating-point; single precision -- double precision). Be careful with type casts.
  • Check for possible overflows and division-by-zero operations.
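Two of these rules (integers for money, no floating-point loop counters) can be illustrated with a minimal sketch. All function names below are made up for the example, and the iteration cap is only there to keep the faulty loop from running forever.

```python
def sum_as_float(price_dollars, count):
    """Add a price in dollars `count` times using binary floating point."""
    total = 0.0
    for _ in range(count):
        total += price_dollars
    return total


def sum_as_cents(price_cents, count):
    """The same sum in integer cents; integer arithmetic is exact."""
    return price_cents * count


def steps_with_float_counter(limit=1000):
    """Loop 'from 0 to 1 in steps of 0.1' with a float counter.

    The counter never becomes exactly 1.0, so only the safety
    limit stops the loop.
    """
    x = 0.0
    steps = 0
    while x != 1.0 and steps < limit:
        x += 0.1
        steps += 1
    return steps


def steps_with_integer_counter():
    """The same loop driven by an integer count of tenths."""
    steps = 0
    for _tenth in range(10):
        steps += 1
    return steps


if __name__ == "__main__":
    print(sum_as_float(0.10, 10) == 1.0)   # False: ten dimes are not a dollar
    print(sum_as_cents(10, 10) == 100)     # True
    print(steps_with_float_counter())      # stops only at the safety limit
    print(steps_with_integer_counter())    # 10
```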

More on Patriot

Conclusion

Our goal is to attract the community's attention to the issues of software reliability. The times when computer programs were all about some strange, obscure scientific calculations in Fortran or video games are long over. Now they surround us and permeate every area of our activity.
In earlier times, critical software bugs affected narrow, specific areas, for example the civil (Ariane 5) and military rocket industries. Nowadays, you may encounter them not only when working on your computer, but also when driving a car (Toyota) or undergoing medical treatment (Therac-25). We are among those who support programmers in their fight against bugs. PVS-Studio, the static code analyzer developed by our team, helps detect many errors in C, C++, and C# programs as early as the coding stage. Taking this opportunity, I'd also like to remind you that since October 25, 2016, a Linux version of PVS-Studio has been available in addition to the existing Windows version.

This article was originally published (in Russian) on habrahabr.ru. The original and translated versions were posted on our blog with the permission of the author.

Tuesday, October 25, 2016

PVS-Studio for Linux

Finally! Today we released the first version of the PVS-Studio analyzer for Linux. Linux developers now have a powerful new tool to fight bugs in their code. Please help us spread the news: tell your colleagues, post it on Twitter and Facebook! Let programs be more stable and safe!
Picture 1
Starting with version 6.10, the PVS-Studio analyzer supports not only Windows, but Linux too.
PVS-Studio performs static code analysis and generates a report that helps a programmer find and fix bugs. It performs a wide range of code checks and is also useful for finding misprints and Copy-Paste errors. Demonstrative examples of such errors: V501, V517, V522, V523, V571, V611.
The Windows version of the analyzer is still available here. The analyzer integrates with Visual Studio 2010-2015 or can be used separately in the Standalone mode.
The new Linux version (.deb, .rpm, .tgz) is available for download on the page:
We also recommend reading the documentation section "How to run PVS-Studio on Linux". If anything is unclear or does not work, we will gladly help; feel free to email us your questions.
If you want a registration key to try out the tool, contact us. Over time the process of getting the trial version may change, but for now it is important for us to understand who downloads the analyzer, how they use it, and which issues come up.
P.S. Right after the release of the Linux version, we may be a little overwhelmed by the amount of feedback and questions, so we ask for your understanding if our answers are slightly delayed.

Wednesday, October 19, 2016

"Historical bugs” section. Black Monday

October 19, 1987, Monday. On this day, something happened that hit the stock markets of the USA, Australia, Canada, Hong Kong, and several other countries. Twenty-nine years ago, the Dow Jones Industrial Average suffered the biggest fall in its history: 22.6%. This event became known as Black Monday.
The reason was the “rise of the machines”. A bug in program-trading software caused the electronic assistants to dump securities that were losing value instead of correcting the situation at the moment of the market crash. This affected other players' programs, and a chain reaction started. That’s how computers, not people, became the main actors of the financial market.
Black Monday is another fact proving that a software bug can have extremely bad consequences. Fortunately, programmers now have great helpers: static code analyzers.
We recommend reading articles on the topic of bugs and their consequences:

«Toyota: 81 514 issues in the code»
http://www.viva64.com/en/b/0439/
«Killer Bug. Therac-25: Quick-and-Dirty»
http://www.viva64.com/en/b/0438/
«A space error: 370.000.000 $ for an integer overflow»
http://www.viva64.com/en/b/0426/ 

Thursday, October 13, 2016

Toyota: 81 514 issues in the code

This is a story about software penetrating ever deeper into our daily lives. With comfort and usefulness, however, come new dangers: now we deal with bugs not only while sitting at a computer, but also while driving on the road.
Picture 1
  • People: Hey, Toyota, we counted that 89 people died from 2000 to 2010 because of your screwed-up electronics and software.
  • Toyota: Yes, but the people themselves are to blame; they confuse the pedals.
  • People: Houston, we have a problem.
  • NASA: Wait a little, we'll sort it out. We'll need 10 months and 3 million dollars.
  • People: Here, take it.
  • Toyota: 3 million isn't enough, here's some more cash.
(10 months later)
  • NASA: Hey, Toyota, we found a couple of bugs in your code, namely 7134 MISRA violations, recursion, a 740-line function, and 9000 global variables.
  • Toyota: We have our own standards. Have you guys been to the Moon?
  • NASA (publicly): Toyota is not to blame.
  • (Toyota shares went up by 4.6%)
  • People: What was that?
(3 years later)
  • Two American testers (whose grandfathers died at Pearl Harbor): No bugs you say? What if we find them?
The National Highway Traffic Safety Administration (NHTSA) estimated that in the 10 years from 2000 to 2010, 89 people died and 57 were seriously injured in accidents caused by defective electronics.
Toyota denies fault and states, based on its own research, that the causes are "sticking" accelerator pedals and a design flaw that lets floor mats trap the pedals, yet it recalls nearly 8 million vehicles around the world because of these two defects.
Still, there are more complaints coming.
We recommend that those with a nervous disposition not watch this video.
NHTSA started their own investigation, asking NASA to help.
During the 10-month investigation, NASA specialists found that the software does not comply with MISRA (Motor Industry Software Reliability Association) standards and contains 7134 violations. Toyota responded that it has its own standards.
December 20, 2010, Toyota rejects all the accusations, but pays 16 billion dollars in pre-trial actions, releases software updates for some car models, and recalls 5.5 million vehicles.
After the announcement of the results of NASA's research, Toyota shares on the Tokyo Stock Exchange went up by 4.6%.
In 2013, a lawsuit goes to trial in an Oklahoma court over a 2007 accident involving two women in a 2005 Toyota Camry. One of them died; the other spent five months in a hospital with back and head injuries. Toyota has not admitted its guilt. It said that the cause of the accident was the driver confusing the gas and brake pedals; by the time she realized her mistake and started braking, it was too late.
Picture 22
Two engineers, Michael Barr and Philip Koopman, took up the investigation. It took them 20 months to review 280 000 lines of code and write an 800-page report. Each.
The address was kept secret. The hotel room where the engineers worked was guarded 24 hours a day: security ensured that nobody brought in or took out any papers. All phones and internet connections were disabled.
Toyota recalled more than 10 million vehicles worldwide. Still, they have never admitted their guilt.
According to Michael Barr, their report was classified as secret. The same thing was done with the contract which gave them access to Toyota's source code. Barr recommends Googling the transcript of the hearing material.
Here is where the analysts worked:
Picture 4
Here is the report they wrote:
Picture 6

What they looked for and what they found

The main program in the dock is the electronic throttle control system (ETCS).
Picture 8
Picture 10
NASA experts scanned the chips with x-rays.
Picture 12
Cosmic rays were also considered as a possible cause of errors.
They checked the C code:
Picture 14
And then they finally got to the code.

Violations of MISRA (and NASA) standards

According to estimates, every 30 MISRA standard violations lead to one "serious bug".
  • MISRA-C:1998 contains 127 rules (93 required and 34 advisory).
  • MISRA-C:2004 contains 141 rules (121 required and 20 advisory), divided into 21 categories.
  • MISRA-C:2012 contains 143 rules (each of which can be checked by a static code analyzer) and 16 directives (whose compliance is more open to interpretation, or relates to process or procedural matters). The rules are classified as mandatory, required, or advisory, apply either to individual units or to the entire system, and are further divided into Decidable and Undecidable.
Toyota took only 11 rules from MISRA.
Picture 15
Picture 17
NASA analysis tools were able to check 35 MISRA rules, and 14 of them were violated.
Picture 19
[The source - NASA report, appendix A: Software, page 28]
Total: 7134 (NASA's estimate), or 81 514 (Michael Barr's estimate).
10 rules of NASA
The Power of Ten - 10 Rules for Writing Safety Critical Code
  • Restrict to simple control flow constructs.
  • Give all loops a fixed upper-bound.
  • Do not use dynamic memory allocation after initialization.
  • Limit functions to no more than 60 lines of text.
  • Use minimally two assertions per function on average.
  • Declare data objects at the smallest possible level of scope.
  • Check the return value of non-void functions, and check the validity of function parameters.
  • Limit the use of the preprocessor to file inclusion and simple macros.
  • Limit the use of pointers. Use no more than two levels of dereferencing per expression.
  • Compile with all warnings enabled, and use one or more source code analyzers.
[The source - spinroot.com/p10]
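A few of these rules (a fixed loop bound, assertions, checked return values and parameters) can be shown in a minimal sketch. The rules were written with C in mind; Python is used here only for brevity, and the sensor-reading scenario, names, and limits are invented for the illustration.

```python
MAX_SAMPLES = 8  # fixed upper bound for the loop (hypothetical value)


def read_sample(samples, index):
    """Hypothetical sensor read: returns (ok, value) instead of raising."""
    if index < 0 or index >= len(samples):
        return False, 0
    return True, samples[index]


def average_pedal_position(samples):
    """Average up to MAX_SAMPLES readings in the range 0..100 percent."""
    # Check the validity of the parameter.
    assert isinstance(samples, (list, tuple)), "samples must be a sequence"

    total = 0
    count = 0
    for index in range(MAX_SAMPLES):          # loop bound is fixed
        ok, value = read_sample(samples, index)
        if not ok:                            # return value is always checked
            break
        if 0 <= value <= 100:                 # skip out-of-range readings
            total += value
            count += 1

    # Check an invariant before using the result.
    assert 0 <= count <= MAX_SAMPLES
    if count == 0:
        return None
    return total // count


if __name__ == "__main__":
    print(average_pedal_position([10, 20, 30]))
    print(average_pedal_position([]))
```

The function stays well under the 60-line limit, and no input can make the loop run more than MAX_SAMPLES times.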
Picture 21
The function length is limited to 60-75 lines of code, not counting blank lines and comments. More than 200 functions in the Camry05 code exceeded that limit. One of the functions was 740 lines long.

Variables

31 names were declared several times in different scopes. The most frequent name is sts_flags1, which appeared in 57 different scopes.
Picture 42
Picture 26
This is worth a closer look.
Picture 28
Picture 30

Misleading code

Picture 31
A control flow graph of a simple program.
A cyclomatic complexity above 50 is an indicator that the program cannot be tested.
Picture 33
The Toyota ETCS code has:
  • 67 functions with complexity above 50
  • A throttle angle function with a complexity of 146: 1300 lines of code and no plan for unit testing.
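To give a feel for what the metric counts, here is a rough sketch that approximates cyclomatic complexity as one plus the number of decision points. It analyzes Python source with the standard ast module purely as an illustration; it is not the tool used in the investigation, and real analyzers for C count a few more constructs.

```python
import ast

# Node types treated as decision points in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity of a piece of Python source."""
    complexity = 1  # one path through straight-line code
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit branches
            complexity += len(node.values) - 1
    return complexity


SAMPLE = """
def clamp(angle, low, high):
    if angle < low:
        return low
    if angle > high and high >= low:
        return high
    return angle
"""

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))  # 1 + two ifs + one `and` = 4
```

Even this small function needs four test cases to cover every independent path, which gives a sense of scale for a function scoring 146.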

Recursion

Picture 35
Programmers used recursion in the Toyota code, and every issue related to its use led to a restart of the processor (CPU reset).
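Why safety-critical guidelines discourage recursion can be shown with a small sketch: stack use of a recursive function grows with the input, while an iterative version with an explicit bound fails in a controlled way. The linked-chain data structure and the bound are made up for the example, and Python's RecursionError stands in for a real stack overflow.

```python
MAX_NODES = 100_000  # agreed upper bound on chain length (hypothetical)


def make_chain(length):
    """Build a singly linked chain of `length` nodes."""
    head = None
    for _ in range(length):
        head = {"next": head}
    return head


def length_recursive(node):
    """Recursive count: stack depth grows with the length of the chain."""
    if node is None:
        return 0
    return 1 + length_recursive(node["next"])


def length_bounded(node, limit=MAX_NODES):
    """Iterative count with a fixed bound and constant stack use."""
    count = 0
    while node is not None:
        if count >= limit:
            raise ValueError("chain is longer than the agreed bound")
        node = node["next"]
        count += 1
    return count


if __name__ == "__main__":
    chain = make_chain(50_000)
    print(length_bounded(chain))
    try:
        length_recursive(chain)
    except RecursionError:
        print("recursive version ran out of stack")
```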

And so?

The amount of shitty code that people's lives depend on keeps growing. Toyota's example shows that system developers can botch the code at the most elementary level, never mind at the level of ethical decisions made by artificial intelligence. The main trouble, though, is not that errors exist, but that the owners obstruct the process of finding and fixing them. These people are powerful enough to put pressure on NASA.
Picture 37
"Applications programming is a race between software engineers, who strive to produce idiot-proof programs, and the universe which strives to produce bigger idiots. So far the Universe is winning."
- Rick Cook, writer

Media

Investigation report

An exhaustive presentation by Philip Koopman:
NASA Report on Toyota Unintended Acceleration Investigation
NHTSA Report on Toyota Unintended Acceleration Investigation
Picture 39
Four years before that
Picture 40
Wherever I'm going, I'll be there to apply the formula. I'll keep the secret intact.
It's simple arithmetic.
It's a story problem.
If a new car built by my company leaves Chicago traveling west at 60 miles per hour, and the rear differential locks up, and the car crashes and burns with everyone trapped inside, does my company initiate a recall?
You take the population of vehicles in the field (A) and multiply it by the probable rate of failure (B), then multiply the result by the average cost of an out-of-court settlement (C). A times B times C equals X. This is what it will cost if we don't initiate a recall.
If X is greater than the cost of a recall, we recall the cars and no one gets hurt.
If X is less than the cost of a recall, then we don't recall.
- Chuck Palahniuk "Fight club", 1996
- How often do such accidents happen?
- You won't believe it.
- Which company do you work for?
- Oh, it's a very large one.
- "Fight club", film, 1999.
This article was originally published (in Russian) on habrahabr.ru. The original and translated versions were posted on our blog with the permission of the author.