OpenMP
OpenMP (Open Multi-Processing) is an application programming interface that supports shared memory multi-threaded programming on most platforms, including AIX, Solaris, Linux, OS X, and Windows. This wiki article will explain how to build the library with OpenMP support and provide an example program.
OpenMP pragmas surface in various source files, including nbtheory.cpp
and scrypt.cpp
. Crypto++ supports OpenMP when it is profitable to do so, like modern Key Derivation Functions with parallelism. Other classes, like Rabin-Williams signatures, has OpenMP but it is not profitable because it does not speed up calculations.
If you are using OpenMP on Linux then see GCC Issue 58378, Protect libgomp against child process hanging after a Unix fork().
Compiler Options
Here is a list of compiler options for the compilers the project regularly tests. The option should be added to CXXFLAGS
. Crypto++ uses the compiler to drive link, so the compiler will add the correct OpenMP libraries, too.
SunCC requires -O3
or above. SunCC will change the optimization level automatically if it is below -O3
.
Compiler | Option |
---|---|
MSVC | /openmp
|
GCC | -fopenmp
|
Clang | -fopenmp=libomp
|
ICC | -fopenmp
|
ICX | -qopenmp
|
SunCC | -xopenmp=parallel
|
XLC | -qsmp=omp
|
PGI | -mp
|
Tru64 | -omp
|
Building the Library
You must use -fopenmp
on Linux to build the library with OpenMP support using GCC and Clang. Windows should use /openmp
option. Below is an example of building with OpenMP on Fedora. The makefile detects -fopenmp
on Linux platforms and adds -lgomp
if it is missing.
$ CXXFLAGS="-DNDEBUG -g2 -O3 -fopenmp" make -j 4 g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cryptlib.cpp g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cpu.cpp g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c integer.cpp ... g++ -o cryptest.exe -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe adhoc.o test.o bench1.o bench2.o validat0.o validat1.o validat2.o validat3.o validat4.o datatest.o regtest1.o regtest2.o regtest3.o dlltest.o fipsalgt.o ./libcryptopp.a -lgomp
After building the library you can test it with cryptest.exe v
. Notice the message "OpenMP version 201511, 4 threads" when OpenMP is in effect. The version number comes directly from the _OPENMP
macro.
$ ./cryptest.exe v Using seed: 1548150602 OpenMP version 201511, 4 threads Testing Settings... passed: Your machine is little endian. passed: sizeof(byte) == 1 passed: sizeof(word16) == 2 passed: sizeof(word32) == 4 passed: sizeof(word64) == 8 passed: sizeof(word128) == 16 passed: sizeof(hword) == 4, sizeof(word) == 8, sizeof(dword) == 16 passed: cacheLineSize == 64 hasSSE2 == 1, hasSSSE3 == 1, hasSSE4.1 == 1, hasSSE4.2 == 1, hasAESNI == 1, hasCLMUL == 1, hasRDRAND == 1, hasRDSEED == 1, hasSHA == 0, isP4 == 0 ...
The "OpenMP version..." message is due to the following code in test.cpp
:
void PrintSeedAndThreads(const std::string& seed) { std::cout << "Using seed: " << seed << std::endl; #ifdef _OPENMP int tc = 0; #pragma omp parallel { tc = omp_get_num_threads(); } std::cout << "OpenMP version " << (int)_OPENMP << ", "; std::cout << tc << (tc == 1 ? " thread" : " threads") << std::endl; #endif }
Sample Program
The following is a sample program from the Scrypt wiki page. Scrypt benefits using OpenMP in its Smix
or ROmix
function. In the example below, the security parameters cost=(1<<20), blockSize=12, paralellization=4
were selected to demonstrate OpenMP profitability. 1<<20
is a cost factor of approximately 1 million.
$ cat test.cxx #include "cryptlib.h" #include "secblock.h" #include "scrypt.h" #include "osrng.h" #include "files.h" #include "hex.h" #include <iostream> #include <omp.h> int main() { int threads = 1; #pragma omp parallel { threads = omp_get_num_threads(); } using namespace CryptoPP; AutoSeededRandomPool prng; SecByteBlock key(64), salt(16*1024); prng.GenerateBlock(key, key.size()); prng.GenerateBlock(salt, salt.size()); Scrypt scrypt; scrypt.DeriveKey(key, key.size(), key, key.size(), salt, salt.size(), 1<<20, 12, 4); std::cout << "Threads: " << threads << std::endl; std::cout << "Key: "; StringSource(key, 16, true, new HexEncoder(new FileSink(std::cout))); std::cout << "..." << std::endl; return 0; }
The program should be compiled with the same options used to build the library (or, the library should be built with the same options that will be used for the program). Below the two important ones, -O3 -fopenmp
, are used for both the library and the program.
$ g++ -I. -O3 -fopenmp test.cxx ./libcryptopp.a -o test.exe
Below are the results from a 6-core Ice Lake machine with a Core i5-1035G1 CPU @ 3.5 GHz. Recall that the test program uses paralellization=4. The multithreaded OpenMP version of the program is about 3x faster than the single threaded version:
$ time OMP_NUM_THREADS=1 ./test.exe Threads: 1 Key: 4A433D4B73F5EA27400D6EC001CA3C9E... real 0m13.516s user 0m13.061s sys 0m0.438s $ time OMP_NUM_THREADS=2 ./test.exe Threads: 2 Key: 9EF3B82A54E57ADA188E0CD83B80C24A... real 0m7.889s user 0m14.684s sys 0m0.966s $ time OMP_NUM_THREADS=4 ./test.exe Threads: 4 Key: 512B6C8C44CCCA1A1E67C990178EB2CC... real 0m4.869s user 0m16.618s sys 0m2.136s $ time OMP_NUM_THREADS=6 ./test.exe Threads: 6 Key: 0A23D30D90C4E2212A7DF2FCA8A67412... real 0m4.831s user 0m16.634s sys 0m2.022s $ time OMP_NUM_THREADS=8 ./test.exe Threads: 8 Key: 033BE7CE45D127234261905E36761BB9... real 0m4.894s user 0m16.705s sys 0m2.206s $ time OMP_NUM_THREADS=12 ./test.exe Threads: 12 Key: A8BC687006F55B67D2F283FBA242736A... real 0m4.848s user 0m16.652s sys 0m2.059s
Performance
The Scrypt example showed that OpenMP is profitable for programs that are naturally parallelizable. However, there are costs associated with OpenMP that non-omp algorithms will pay.
The table below shows GMAC and VMAC performance in both configurations on a Skylake Core-i5 6400 @ 2.7 GHz. Below, bigger GiB/second is better; and smaller Cycle/byte is better.
Algorithm | GiB/second | Cycle/byte |
---|---|---|
GMAC(AES) | 8.009 GiB/s | 0.32 cpb |
VMAC(AES)-64 | 9.676 GiB/s | 0.27 cpb |
VMAC(AES)-128 | 5.110 GiB/s | 0.50 cpb |
Algorithm | GiB/second | Cycle/byte |
---|---|---|
GMAC(AES) | 8.445 GiB/s | 0.30 cpb |
VMAC(AES)-64 | 10.964 GiB/s | 0.23 cpb |
VMAC(AES)-128 | 6.237 GiB/s | 0.41 cpb |
Geometric Average
Geometric Average is the way Crypto++ aggregates benchmark results for different algorithms. In the tables blow, bigger Throughput is better. A variance of 3 to 5 is typical, and we consider it noise. But a change of 100 or 180 is not, and it indicates a non-trivial drop in performance.
Here are the results of running the full Benchmark suite on a Skylake Core i5-6400 CPU @ 2.70GHz
Configuration | Throughput |
---|---|
With OpenMP | 1100.504107 |
Without OpenMP | 1286.652288 |
Here are the results of running the full Benchmark suite on a Coffee Lake Core i7-8700 @ 3.2 GHz.
Configuration | Throughput |
---|---|
With OpenMP | 1414.931019 |
Without OpenMP | 1583.779185 |
Here are the results of running the full Benchmark suite on a Ice Lake Core i5-1035G1 @ 3.5 GHz.
Configuration | Throughput |
---|---|
With OpenMP | 1311.111786 |
Without OpenMP | 1451.539969 |
Here are the results of running the full Benchmark suite on an ARM ODROID-XU4 @ 2.0 GHz.
Configuration | Throughput |
---|---|
With OpenMP | 172.617632 |
Without OpenMP | 188.200157 |