Originally introduced by Nicholas Metropolis during the Manhattan Project, Monte Carlo methods today have very broad and extensive use in various areas of science and technology. Random number generators (RNGs) are a core part of any Monte Carlo method, having a significant impact on the overall quality and performance of Monte Carlo simulations. Libraries of RNGs can make implementing Monte Carlo simulations much easier and faster. The most important role of high-performance libraries is to provide facilities that can make writing programs simpler, substantially speed up development, and improve program efficiency in terms of performance.

What are the requirements for such a library? On the user interface side, the library should support natural use in different programming languages. There should also be support for different integer and floating point data types and their arbitrary mixture in the program. In addition, the library should have a consistent API to simplify integration in the user's application.

The uniform distribution RNG — called a basic random number generator or BRNG — plays a key role in generating random sequences of other distributions. Non-uniform distributions are usually obtained by transforming uniformly distributed sequences. Properties of distribution generators are essentially dependent on properties of the underlying BRNG — its period length, multidimensional uniformity, and lattice structure. Therefore, the library should provide a selection of BRNGs and different transformation methods.

Many Monte Carlo methods require substantial computational resources. To improve the program performance, the library should effectively utilize processor and system architectural features, including parallel environments with shared and distributed memory. The ability to work in parallel is therefore an important feature of RNGs. From a statistical point of view, this means that the RNG should be able to generate independent sequences.

Toward that end, there are a number of approaches, each of which has pluses and minuses. For example, one approach is to select different BRNGs for each processor, outcomes of which are proven to be independent in some sense. Another way is to pick non-overlapping subsequences of the original sequence for each processor. This idea could be implemented effectively for some BRNGs using, for example, such known methods as leapfrog and skip-ahead. On the single process side, most computer architectures allow substantial speedup for vector operations vs. scalar operations.

Use of vector-type RNGs is quite natural for Monte Carlo applications since they require many random numbers. On the other hand, a vector API can be inconvenient for some algorithms despite the fact that a large quantity of random numbers is needed. Hence, a library with vector RNGs should provide mechanisms to make vector programming easier.

Situations requiring several distribution generators to be used in combination are typical in Monte Carlo methods. Consider the following flowchart fragment repeated in the loop:

The pseudo-code and picture below illustrate how this fragment could be implemented with a library, which has one scalar BRNG and scalar implementations of distribution generators:

If the same library has vector generators, the above example could be implemented as follows:

Assume that each vector generator has two parameters: the first is a vector length and the second is a buffer for storing the generated numbers. Such an implementation might lead to better overall performance, but it is important to consider the following:

- The order of use of uniform random numbers coming from the underlying BRNG to distribution generators has been changed. Sometimes it might be undesirable.
- When N is large, improvements in the speed from using vector routines might not lead to an overall speedup due to an increasing number of cache misses when values x and y are being read from buffers.

One of the possible solutions is to process data by smaller portions as shown in the figure below:

For simplicity, it is assumed that N is a multiple of BLOCK_SIZE. BLOCK_SIZE can be set to assure that the buffers all fit in cache and still assure good performance, since the vector generators exceed the scalar generators for quite short vector lengths. In the examples below, it is assumed that such modification is done and will not show it explicitly. Further, if the library supports several BRNGs, the following implementation is possible.

One additional parameter appeared in calls to subroutines: some identifier specifying the underlying BRNG to use, although this is not the only way to support the use of different BRNGs. For example, a service function can be provided to switch between basic generators. Explicit specification is a good way to minimize switching overheads.

Using different basic generators makes calls to distribution generators independent in terms of data and control flow. In addition, the appropriate selection of BRNGs allows achieving statistical independence as well.

Vector generators can preserve the same order of use of uniform numbers coming from BRNG. The leapfrog method is needed to do that. Regardless of whether one or several BRNGs is used, the leapfrog method provides a means of supporting program parallelization in addition to the possibility of using several sequences, each of which is based on different BRNGs. Finally, using a mechanism of random streams, the skip-ahead service could be implemented similar to the leapfrog method. This is one more way of supporting effective use of random number generators in parallel environments.

While the examples included here are simple, the ideas behind them reflect the development of numerical libraries over the years. These examples should help address today's growing user demands for functionality to support Monte Carlo simulations.

*This work was done by Sergey A. Maidanov and Andrey Naraikin of Intel Corporation.*