Tuesday, September 12, 2017

Give my Best Regards to Yandex Developers

by Andrey Karpov 

Approximately every six months someone writes to us from the employees of Yandex company, asks about licensing of PVS-Studio, downloads the trial and disappears. It's normal, we got used to a slow processes of selling our analyzer to large companies. However, once I have an opportunity, it wouldn't be an extra thing to say hello to Yandex developers and remind about PVS-Studio tool.
Picture 1
Honestly, the article turned out to be random in many respects. We have already been offered to check ClickHouse, but somehow this idea was forgotten. The other day, surfing the Internet, I met again the mention of ClickHouse and got interested in the project. This time I decided not to postpone and check out this project.

ClickHouse

ClickHouse is a column database for OLAP (online analytical requests processing). ClickHouse was designed in Yandex to meet the challenges of Yandex.Metrica. ClickHouse allows you to perform analytical requests on updated data in real time. The linearly scalable system is able to work both with trillions of records and petabytes of data. In June of 2016 ClickHouse was posted in open-source under the Apache license 2.0.

Analysis of project using PVS-Studio

I checked the ClickHouse source code taken from repository of August 14, 2017. To test, I used the beta version of PVS-Studio v6.17. By the time we published the article, this version have already been released.
The following directories were excluded from the check:
  • ClickHouse/contrib
  • ClickHouse/libs
  • ClickHouse/build
  • various tests were also excluded, for example, ClickHouse/dbms/src/Common/tests
The size of the rest of the source code in C++ is 213 KLOC. At the same time, 7.9% of lines are comments. It turns out that the size of code itself that has been checked is not that great: about 196 KLOC.
As you can see, the ClickHouse project has a small size. Besides that, the quality of code is uniquely high and I won't be able to write a shocking article. In total the analyzer issued 130 warnings (General analysis, High and Medium warnings).
I'm not sure about the number of false positives. There are many warnings, that formally cannot be named as false, but at the same time there is no practical use in them. The easiest way to explain it, is to give an example.
int format_version;
....
if (format_version < 1 || format_version > 4)
  throw Exception("Bad checksums format version: " + ....);
if (format_version == 1) return false;
if (format_version == 2) return read_v2(in);
if (format_version == 3) return read_v3(in);
if (format_version == 4) return read_v4(in);
return false;
Analyzer draws attention to the fact that if the expression (format_version == 4)starts evaluating, then it will always be true. As you can see, there is a check above, that if a value format_version goes beyond [1..4], then an exception is thrown. The operator return false will never get executed.
Formally, the analyzer is right and it is not clear how to prove that it is a false positive. On the other hand, it is obvious that this code is correct and is simply written with a "safety margin".
In such cases, a programmer can suppress the analyzer warnings in various ways or rewrite the code. For example, you can write as follows:
switch(format_version)
{
  case 1: return false;
  case 2: return read_v2(in);
  case 3: return read_v3(in);
  case 4: return read_v4(in);
  default: 
    throw Exception("Bad checksums format version: " + ....);
}
There are some warnings about that I just can't say if they point out an error or not. I'm not familiar with the project and have no idea how some code fragments have to be run. Let's consider such a case.
There is some scope with 3 functions:
namespace CurrentMemoryTracker
{
    void alloc(Int64 size);
    void realloc(Int64 old_size, Int64 new_size);
    void free(Int64 size);
}
The names of formal arguments of functions suggest that some sizes are passed into the functions. Some cases are suspicious for the analyzer. For instance, when the size of a pointer, but not the size of a structure, is passed to the alloc function.
using Large = HyperLogLogCounter<K, Hash, UInt32, DenominatorType>;
Large * large = nullptr;
....
CurrentMemoryTracker::alloc(sizeof(large));
The analyzer does not know if it is an error or not. I also do not know, but in my opinion, this code is suspicious.
Well, I will not write about such cases. If ClickHouse developers are interested, they can check the project themselves and explore the list of warnings in more detail. I'll review in the article only those code fragments that seemed to me the most interesting.

Interesting code fragments

1. CWE-476: NULL Pointer Dereference (3 errors)
bool executeForNullThenElse(....)
{
  ....
  const ColumnUInt8 * cond_col =
    typeid_cast<const ColumnUInt8 *>(arg_cond.column.get());
  ....
  if (cond_col)
  {
    ....
  }
  else if (cond_const_col)
  {
    ....
  }
  else
    throw Exception(
      "Illegal column " + cond_col->getName() +            // <=
      " of first argument of function " + getName() +
      ". Must be ColumnUInt8 or ColumnConstUInt8.",
      ErrorCodes::ILLEGAL_COLUMN);
  ....
}
PVS-Studio warning: V522 Dereferencing of the null pointer 'cond_col' might take place. FunctionsConditional.h 765
Here the situation is handled incorrectly when an error occurs. Instead of throwing an exception, null pointer dereference will occur.
To create an error message, the function call occurs: cond_col->getName(). You cannot do this, because the cond_col pointer will be null.
A similar error is found here: V522 Dereferencing of the null pointer 'cond_col' might take place. FunctionsConditional.h 1061
Let's consider another variant on the issue of using a null pointer:
void processHigherOrderFunction(....)
{
  ....
  const DataTypeExpression * lambda_type =
    typeid_cast<const DataTypeExpression *>(types[i].get());

  const DataTypes & lambda_argument_types =
    lambda_type->getArgumentTypes();

  if (!lambda_type)
    throw Exception("Logical error: .....",
                    ErrorCodes::LOGICAL_ERROR);
  ....
}
PVS-Studio warning: V595 The 'lambda_type' pointer was utilized before it was verified against nullptr. Check lines: 359, 361. TypeAndConstantInference.cpp 359
In the beginning the lambda_type pointer is dereferenced, and only then it is checking. To fix the code, you need to move the pointer checking higher:
if (!lambda_type)
  throw Exception("Logical error: .....",
  ErrorCodes::LOGICAL_ERROR);
const DataTypes & lambda_argument_types =
  lambda_type->getArgumentTypes();
2. CWE-665: Improper Initialization (1 errors)
struct TryResult
{
  ....
  explicit TryResult(Entry entry_)
      : entry(std::move(entry))        // <=
      , is_usable(true)
      , is_up_to_date(true)
  {
  }
  ....
  Entry entry;
  ....
}
V546 Member of a class is initialized by itself: 'entry(entry)'. PoolWithFailoverBase.h 74
Due to typos, entry member is initializing itself and as a result it actually remains uninitialized. To fix the code, you must add the underscore symbol correctly:
: entry(std::move(entry_))
3. CWE-672: Operation on a Resource after Expiration or Release (1 error)
using Strings = std::vector<std::string>;
....
int mainEntryClickhousePerformanceTest(int argc, char ** argv)
{
  ....
  Strings input_files;
  ....
  for (const String filename : input_files)   // <= 
  {
    FS::path file(filename);

    if (!FS::exists(file))
      throw DB::Exception(....);

    if (FS::is_directory(file))
    {
      input_files.erase(                      // <=
        std::remove(input_files.begin(),      // <=
                    input_files.end(),        // <=
                    filename) ,               // <=
        input_files.end() );                  // <=

      getFilesFromDir(file, input_files, recursive);
    }
    else
    {
      if (file.extension().string() != ".xml")
        throw DB::Exception(....);
    }
  }
  ....
}
PVS-Studio warning: V789 Iterators for the 'input_files' container, used in the range-based for loop, become invalid upon the call of the 'erase' function. PerformanceTest.cpp 1471
Input_files container is used in range-based for loop. At the same time, inside the loop, the container may vary due to the removal of some elements. If it is not very clear to a reader why you cannot do so, I suggest to read the description of the diagnostics V789.
4. CWE-563: Assignment to Variable without Use ('Unused Variable') (1 error)
struct StringRange
{
  const char * first;
  const char * second;

  ....

  StringRange(TokenIterator token_begin, TokenIterator token_end)
  {
    if (token_begin == token_end)
    {
      first = token_begin->begin;                // <=
      second = token_begin->begin;               // <=
    }

    TokenIterator token_last = token_end;
    --token_last;

    first = token_begin->begin;                  // <=
    second = token_last->end;                    // <=
  }
};
The analyzer issues two warnings:
  • V519 The 'first' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 26, 33. StringRange.h 33
  • V519 The 'second' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 27, 34. StringRange.h 34
When a certain condition in the beginning first and second variables are assigned to the token_begin->begin value. Further on, the value of these variables in any case is changing again. Most likely this code contains a logical error or something is missing. For example, the return operator may be forgotten:
if (token_begin == token_end)
{
  first = token_begin->begin;
  second = token_begin->begin;
  return;
}
5. CWE-570: Expression is Always False (2 errors)
DataTypePtr
getReturnTypeImpl(const DataTypes & arguments) const override
{
  ....
  if (!((.....))
      || ((left_is_string || left_is_fixed_string) && (.....))
      || (left_is_date && right_is_date)
      || (left_is_date && right_is_string)
      || (left_is_string && right_is_date)
      || (left_is_date_time && right_is_date_time)         // 1
      || (left_is_date_time && right_is_string)            // 1
      || (left_is_string && right_is_date_time)            // 1
      || (left_is_date_time && right_is_date_time)         // 2
      || (left_is_date_time && right_is_string)            // 2
      || (left_is_string && right_is_date_time)            // 2
      || (left_is_uuid && right_is_uuid)
      || (left_is_uuid && right_is_string)
      || (left_is_string && right_is_uuid)
      || (left_is_enum && right_is_enum && .....)
      || (left_is_enum && right_is_string)
      || (left_is_string && right_is_enum)
      || (left_tuple && right_tuple && .....)
      || (arguments[0]->equals(*arguments[1]))))
      throw Exception(....);
  ....
}
In this condition three sub-expressions are repeated twice. PVS-Studio warnings:
  • V501 Instantiate FunctionComparison < EqualsOp, NameEquals >: There are identical sub-expressions '(left_is_date_time && right_is_date_time)' to the left and to the right of the '||' operator. FunctionsComparison.h 1057
  • V501 Instantiate FunctionComparison < EqualsOp, NameEquals >: There are identical sub-expressions '(left_is_date_time && right_is_string)' to the left and to the right of the '||' operator. FunctionsComparison.h 1057
  • V501 Instantiate FunctionComparison < EqualsOp, NameEquals >: There are identical sub-expressions '(left_is_string && right_is_date_time)' to the left and to the right of the '||' operator. FunctionsComparison.h 1057
There are two options. First, there is no error, the condition is simply superfluous and can be simplified. The second - there is an error here and some conditions are not checked. In any case, the authors should check this code fragment.
Let's look at another case where a condition is always false.
static void ipv6_scan(const char *  src, unsigned char * dst)
{
  ....
  uint16_t val{};
  unsigned char * colonp = nullptr;

  while (const auto ch = *src++)
  {
    const auto num = unhex(ch);

    if (num != -1)
    {
      val <<= 4;
      val |= num;
      if (val > 0xffffu)         // <=
        return clear_dst();

      saw_xdigit = 1;
      continue;
    }
    ....
}
PVS-Studio warning: V547 Expression 'val > 0xffffu' is always false. The value range of unsigned short type: [0, 65535]. FunctionsCoding.h 339
When parsing a string containing a IPv6 address, some invalid IPv6 addresses will be taken as correct. It is expected that numbers can be recorded between the separators in hexadecimal format, with a value of less than FFFF. If the number is greater, then the address must be considered incorrect. To identify this situation in code there is a test "if (val > 0xffffu)". But it does not work. Val variable is of uint16_t type, which means it cannot be greater than 0xFFFF. As a result, the function will "swallow" the incorrect address. As a regular part of the address, 4 last hexadecimal numbers before the separator, will be represented.
6. CWE-571. Expression is Always True (1 error)
static void formatIP(UInt32 ip, char *& out)
{
  char * begin = out;
  for (auto i = 0; i < 3; ++i)
    *(out++) = 'x';

  for (size_t offset = 8; offset <= 24; offset += 8)
  {
    if (offset > 0)                     // <=
      *(out++) = '.';

    /// Get the next byte.
    UInt32 value = (ip >> offset) & static_cast<UInt32>(255);

    /// Faster than sprintf.
    if (value == 0)
    {
      *(out++) = '0';
    }
    else
    {
      while (value > 0)
      {
        *(out++) = '0' + value % 10;
        value /= 10;
      }
    }
  }
  /// And reverse.
  std::reverse(begin, out);
  *(out++) = '\0';
}
PVS-Studio warning: V547 Expression 'offset > 0' is always true. FunctionsCoding.h 649
"offset > 0" condition is always executed, therefore the point is always added. It seems to me there is no error and a check is just superfluous. Although, of course, I'm not sure. If it wasn't an error, a check should be deleted, so that it wouldn't confuse other programmers and static code analyzers.

Conclusion

Perhaps, project developers will also be able to find a number of errors, looking through the analyzer warnings, which were reflected in the article. I'd like to finish a storytelling especially since I had enough of material to "give regards".
In general, I would like to note the high quality of the code of ClickHouse project developers. However, even highly skilled developers are not immune to having errors and this article proves it again. PVS-Studio static code analyzer will help to prevent many errors. The greatest effect from static analysis developers get when writing new code. It makes no sense to spend time debugging errors that can be detected by the analyzer immediately after checking new code.
I invite you all to downloadhttps://www.viva64.com/en/pvs-studio-download/?win and try PVS-Studio.

Thursday, August 31, 2017

How to Step Over Legacy and Start Using Static Code Analysis

by Svyatoslav Razmyslov 

Legacy code problems are familiar to the majority of software developers. The process of transforming code in legacy is inevitable, because progress in programming moves on. Projects either "die" forever, or require constant support and writing new functions. Thus, in any project in any programming language legacy code appears and brings different inconveniences when the process of further development. On the example of PVS-Studio, in this article I'll show you how to immediately start using the static code analyzer in your project.
Picture 1

Introduction


To start, let us deal with the definition of legacy code and get acquainted with PVS-Studio static analyzer.
A term "legacy" is not always related to the age of code. Here are some of the most common definitions:
  • Code is not covered by tests - it is usually a very serious reason to make less changes in such code;
  • Legacy code - it was forgotten how it works long time ago. In addition, if it is not covered by tests, the situation is very sad, because the code is still used and requires support.
  • Third-party code - is similar to the previous definition, but even in the new third-party code these problems may occur;
  • The code is saved for compatibility - this code is usually old and, probably, is not tested with compatible programs;
  • Other similar definitions.
In any of these situations, developers are trying to modify code as carefully as possible when designing a project not to crash something. Additionally, a project can also use techniques and tools for writing more effective code of high quality. In this case, the old code may need to make much more changes. A using of modern standard language, conformity to a style guide and implementation of the static code analyzer are a few examples. Unfortunately, any of these ideas faces such an obstacle as legacy-code. It can be everywhere: it is impossible to separate it like test files or third-library files. Almost all tools, offering modifications of code, have the ability to mark up the code to disable the warnings, but in practice, this method cannot be applied. The first time you use such a tool, no one will let you mark up hundreds of thousands of comments in source code files - it is a very large amount of changed code because of which a view of changes in the systems of SCM (Source Control Management) will become very complicated. So, for PVS-Studio static analyzer a system of mass warnings suppression has been developed so that there would be no need to modify the project source files. This feature allows you to easily embed the analyzer in any project and IMMEDIATELY start to benefit from this, i.e. to find new bugs.
PVS-Studio is a static analyzer for bug detection in the source code of programs, written in C, C++ and C#. It can be run on Windows and Linux, integrate into build systems, various IDE systems and continuous integration systems.

Operation principles

Suppression mechanism is based on the use of special files "bases" of analyzer messages that are added next to the project. These files contain messages, tagged for this project as "unnecessary". During the next files checks of this type by the analyzer, it will automatically verify the existence of such files next to the project, and in case of their detecting will not show the messages, which are included in such a "base". We should note that a modifying of the source file that contains the tagged messages, and, in particular, lines shift, would not lead to the re-emergence of these messages. However, the edit of the line containing this message can lead to its repeated occurrence, since this message has already become "new".

The Usage of Suppression Mechanism in Visual Studio (Windows)

For Microsoft Visual Studio PVS-Studio plugin is available, which is conveniently integrates in IDE. It allows you to run a check of the entire solution, specific projects or files, and it supports incremental analysis.
In PVS-Studio menu (Figure 1) a point "Suppress Messages ..." is available, which opens the window for working with suppressed analyzer warnings (Figure 2).
Figure 1 - PVS-Studio Menu in Visual Studio
Figure 1 - PVS-Studio Menu in Visual Studio
Figure 2 - Window for analyzer warnings suppression
Figure 2 - Window for analyzer warnings suppression
In the opening window, several actions are available:
  • Suppress Current Messages - suppressing of all analyzer warnings ;
  • Clear Selected Files - restoring of hidden warnings for all or some of the projects;
  • Display Suppressed Messages - a displaying of hidden analyzer warnings in a window (PVS-Studio Output Window) with the rest of the warnings. In this mode, you can return to correcting previously suppressed warnings. Such messages will be marked, strikethrough in a special way, so they cannot be confused with others.
To view analysis results in Visual Studio, there is a special window (Figure 3).
Figure 3 - PVS-Studio Output Window
Figure 3 - PVS-Studio Output Window
Special window allows navigating along found warnings and jump to the code to fix them. PVS-Studio window provides a wide range of options for filtering and sorting the results. It is also possible to quickly navigate to the documentation of the selected diagnostic.
Additional opportunities of working with each message are available in the context menu by clicking the right mouse button on the message (Figure 4). Here the command for suppressing a selected warning is available. When menu entry at the already suppressed warning, false alarm mark will be available for reverting.
Figure 4 - A context menu of analyzer warning
Figure 4 - A context menu of analyzer warning

The usage of the Suppression Mechanism in Standalone (Windows)

PVS-Studio can be used regardless of the integrated Visual Studio development environment. Standalone provides opportunities to test the code regardless of the used compiler or build system, and then lets you work with analysis results providing a user interface similar to PVS-Studio plugin for Visual Studio (Figure 5).
Figure 5 - PVS-Studio Standalone
Figure 5 - PVS-Studio Standalone
Standalone also allows you to work with the analyzer report, obtained by its direct integration into the build system, when a user does not have the Visual Studio environment.
Standalone menu for the analysis run and warnings suppression looks as follows (Figure 6).
Figure 6- Menu Tools of Standalone utility
Figure 6- Menu Tools of Standalone utility
When choosing a point in menu for analyzer run, the window "Compiler Monitoring (C/C++)" will appear (Figure 7).
Figure 7 - Dialog of build monitoring run
Figure 7 - Dialog of build monitoring run
For analyzer warnings filtration, before the analysis a programmer has to specify the file with warnings, suppressed earlier. You can create and maintain such a file through the menu "Message Suppression...", which is the same as the one presented in a section about Visual Studio at Figure 2. After the completion of the analysis in a PVS-Studio window, only new errors will be shown. Without specifying the file, the analyzer will show all the results.

The Usage of Suppression Mechanism in Linux

In Linux commands of suppression and filtration of analyzer warnings are performed only in console, but when their automating on server, the mechanism of suppression becomes very convenient.
There are several ways of using this mechanism, depending on the integration of the analyzer.

Analysis using pvs-studio-analyzer Utility

To suppress all analyzer warnings (first time and in subsequent occasions) you need to execute the command:
pvs-studio-analyzer suppress /path/to/report.log
Analysis of the project can be run as before. At the same time suppressed warnings will be filtered:
pvs-studio-analyzer analyze ... -o /path/to/report.log
plog-converter ...
With this run suppressed warnings will be saved in the current directory in a file named uppress_base.json, which should be stored with the project. New suppressed warnings will be recorded to the file. If there is a need to specify a different name or location of the file, than the commands above may be supplemented by specifying the path to the file with suppressed warnings.

Direct Integration of the Analyzer in the Build System

Direct integration might look as follows:
.cpp.o:
  $(CXX) $(CFLAGS) $(DFLAGS) $(INCLUDES) $< -o $@
  pvs-studio --cfg $(CFG_PATH) --source-file $< --language C++
     --cl-params $(CFLAGS) $(DFLAGS) $(INCLUDES) $<
In this mode, the analyzer cannot verify source files and filter them simultaneously. So, filtration and warnings suppression would require additional commands.
To suppress all the warnings, you must also run the command:
pvs-studio-analyzer suppress /path/to/report.log
To filter a new log, you must use the following commands:
pvs-studio-analyzer filter-suppressed /path/to/report.log
plog-converter ...
File with suppressed warnings also has the default name suppress_base.json, for which you can optionally specify an arbitrary name.

The use of the Suppression Mechanism in SonarQube

SonarQube (formerly Sonar) is an open source platform for continuous inspection of code quality.
PVS-Studio plugin is available to users of this system. SonarQube combines the results of the analysis to a single dashboard, taking the history of runs, allowing to see the overall trend of software quality during a development. An additional advantage is the ability to combine the results of different analyzers.
So, after receiving the results of the analysis of one or more analyzers, you must go to the list of warnings and click on "Bulk Change", which opens the next menu (Figure 8).
Figure 8- A massive change of errors status
Figure 8- A massive change of errors status
In this window, you can mark up all warnings of the analyzer as "won't fix" and further work only with new errors.

What can we do after suppressing all warnings?

Configure static analysis on the build server and developers' computers. Correct new warnings of the analyzer and not let them accumulate. It is also worth to plan search and correction of errors among the suppressed warnings.
Additional control over code quality will help to ensure sending results by mail. It is possible to send warnings to only those programmers who made erroneous code using BlameNotifier utility, which is included in PVS-Studio distribution package.
For some users it may be convenient to load results in Jenkins or TeamCity using PVS-Studio plugin, and send a link to this page.

Conclusion

Missing of analyzer warnings on legacy-code is not the only scenario of presented mechanisms of using. Mass warnings suppression allows you to easily embed the analyzer in any project and immediately start to benefit from this, i.e. to find new bugs. Such implementation allows you to plan correcting of missed warnings in future, without distracting developers from performing their current tasks. This mechanism can also be used in future to suppress false alarms and warnings, which will not be corrected for various reasons.
This article discusses all the possible ways of suppressing analyzer warnings now. The collected material is based on documentation for the PVS-Studio analyzer, but the details on that topic were considered more than in documentation. General information may not be very informative for new users, so you should be familiar with the documentation below.
An alternative way of nullifying the number of analyzer warnings is a correction of all the analyzer warnings at the initial stage. Epic Games company used this way in Unreal Engine project. Details on this can be found in the article: How the PVS-Studio Team Improved Unreal Engine's Code