Link Time Optimization
Link Time Optimization, or LTO, is a GCC-compatible feature that allows the compiler to retain its internal representation of a program or module and use it later with different compilation units to perform optimizations during linking. Also see Link Time Optimization on the GCC wiki.
Generally speaking, you should not use Link Time Optimization for Crypto++. There are three reasons for the recommendation. First, we don't want the linker changing object files or the executables produced during link. The linker's job is to combine object files, not attempt to peephole optimize them.
Second, the tooling does not handle extended inline ASM properly when using Link Time Optimization. It appears the tooling does not track register usage properly. This is unexpected since GCC inline assembly requires a program to declare input operands, output operands and clobbers in the ASM template.
Third, Link Time Optimization causes the library to slow down. Based on our Benchmark results, and with all other things being equal, library performance gets worse. But be sure to benchmark your own program to determine if it is profitable to use LTO.
If you choose to use LTO then you must add -DCRYPTOPP_DISABLE_ASM to CXXFLAGS.
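As a minimal sketch, the define can be appended to the same flags used in the GCC example later on this page. The variable name LTO_CXXFLAGS is illustrative, and the job counts (-flto=6, -j 4) are carried over from that example rather than recommendations. The GNUmakefile check is only a guard so the snippet does nothing when run outside a Crypto++ source tree.

```shell
# Sketch: LTO build with the library's ASM code paths disabled.
# Flags mirror the GCC example on this page, plus -DCRYPTOPP_DISABLE_ASM.
LTO_CXXFLAGS="-DNDEBUG -O2 -flto=6 -g -fPIC -pthread -DCRYPTOPP_DISABLE_ASM"
echo "CXXFLAGS: $LTO_CXXFLAGS"

# Build only when run from a Crypto++ source tree. GNUmakefile is the
# makefile name that appears in the build logs on this page.
if [ -f GNUmakefile ]; then
    AR=gcc-ar RANLIB=gcc-ranlib CXXFLAGS="$LTO_CXXFLAGS" make -j 4
fi
```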
Also see the following:
- Issue 865, LTO build fails due to missing "-m" flags in linker command, showed we had gaps in our testing procedures because we were not testing under that configuration.
- Issue 993, Problem with LTO build, showed LTO does not work with extended inline assembly (or does not work with our inline assembly).
- Issue 1031, Broken module found on Clang 7, showed Clang still cannot utilize LTO. (This may have changed in newer releases of Clang.)
- Issue 1038, Segmentation fault when initializing AutoSeededX917RNG, where a simple test program crashes on startup.
- Segfaults on FreeBSD, where a simple test program crashes on FreeBSD.
- LTO on Windows is broken (C++), a GCC bug report that has existed for 11 years.
- Can LTO minor version be updated in backward compatible way? on the GCC mailing list for some potential versioning troubles.
A related topic is Bitcode, where the library is compiled to an intermediate representation that will eventually change.
Performance
We are not aware of any GCC documentation on performance benefits with benchmark numbers to substantiate the claims. GCC's Link Time Optimization page does not provide them. We suspect LTO does not benefit most programs in a measurable way.
With respect to Crypto++, running the full Benchmark suite on a Skylake machine at 2.7 GHz results in an overall drop in performance. The numbers are provided in the table below, where a bigger Geometric Average Throughput is better.
A variance of 0 to 3 is typical for Geometric Average Throughput. We consider 3 or less simply noise due to interrupts and task switching. However, a difference of 25 indicates a problem, and it is usually something we investigate to determine the cause of the drastic drop in performance. In this case, there's nothing to investigate since we know LTO is causing the performance loss.
Configuration | Geometric Average Throughput |
---|---|
With LTO | 1261.012811 |
Without LTO | 1286.652288 |
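The size of the gap can be checked with a few lines of shell using the two figures from the table. This is only a sketch that assumes awk is available; throughput_drop is a hypothetical helper name, not part of the Crypto++ benchmark tooling.

```shell
# throughput_drop LTO BASE: print the absolute drop and the drop relative
# to the non-LTO (BASE) Geometric Average Throughput.
throughput_drop() {
    awk -v lto="$1" -v base="$2" 'BEGIN {
        printf "drop: %.2f (%.2f%%)\n", base - lto, (base - lto) / base * 100
    }'
}

# Figures copied from the table above.
throughput_drop 1261.012811 1286.652288
# prints "drop: 25.64 (1.99%)"
```

The absolute drop of about 25 is well outside the 0 to 3 range treated as noise.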
GCC Options
You can build the library with LTO using the following GCC options. In addition to the GCC options, you must change AR to gcc-ar and RANLIB to gcc-ranlib. Crypto++ drives link through the compiler so you don't need to do anything with LD. The same CXXFLAGS are used for compile and link.
    $ AR=gcc-ar RANLIB=gcc-ranlib \
      CXXFLAGS="-DNDEBUG -O2 -flto=6 -g -fPIC -pthread" make -j 4
    Using testing flags: -DNDEBUG -O2 -flto=6 -g -fPIC -pthread
    g++ -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe -c cryptlib.cpp
    g++ -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe -c cpu.cpp
    g++ -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe -c integer.cpp
    ...
    gcc-ar r libcryptopp.a cryptlib.o cpu.o integer.o ...
    gcc-ranlib libcryptopp.a
    ...
    g++ -o cryptest.exe -DNDEBUG -O2 -flto=6 -g -fPIC -pthread -pipe adhoc.o test.o bench1.o bench2.o bench3.o datatest.o dlltest.o fipsalgt.o validat0.o validat1.o validat2.o validat3.o validat4.o validat5.o validat6.o validat7.o validat8.o validat9.o validat10.o regtest1.o regtest2.o regtest3.o regtest4.o ./libcryptopp.a
Clang Options
You can build the library with LTO using the following Clang options. In addition to the Clang options, you must change AR to llvm-ar and RANLIB to llvm-ranlib. Crypto++ drives link through the compiler so you don't need to do anything with LD. The same CXXFLAGS are used for compile and link.
    $ AR=llvm-ar RANLIB=llvm-ranlib \
      CXX=clang++ CXXFLAGS="-DNDEBUG -O2 -flto -g -fPIC -pthread" make -j 4
    Using testing flags: -DNDEBUG -O2 -flto -g -fPIC -pthread
    clang++ -pipe -DNDEBUG -O2 -flto -g -fPIC -pthread -c cryptlib.cpp
    clang++ -pipe -DNDEBUG -O2 -flto -g -fPIC -pthread -c cpu.cpp
    clang++ -pipe -DNDEBUG -O2 -flto -g -fPIC -pthread -c integer.cpp
    ...
    llvm-ar r libcryptopp.a cryptlib.o cpu.o integer.o 3way.o adler32.o algebra.o algparam.o allocate.o arc4.o aria.o aria_simd.o ariatab.o asn.o authenc.o base32.o base64.o ...
    make: llvm-ar: Command not found
    make: *** [GNUmakefile:1438: libcryptopp.a] Error 127
    make: *** Waiting for unfinished jobs....
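The log above ends with "llvm-ar: Command not found", which means the archiver was not on PATH when make ran. A pre-flight check along the following lines, assuming a POSIX shell, can catch that before the build starts. require_tools is a hypothetical helper written for this page, not something provided by the Crypto++ makefile.

```shell
# require_tools TOOL...: return non-zero if any named tool is not on PATH.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Check the Clang LTO toolchain before starting the build.
if require_tools clang++ llvm-ar llvm-ranlib; then
    echo "LTO tools found"
else
    echo "install the missing tools or adjust PATH before building" >&2
fi
```

The same helper can be pointed at g++, gcc-ar and gcc-ranlib for the GCC build.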
GCC ARM Platform
Here is what the GCC LTO error looks like on ARM platforms.
    g++ -o cryptest.exe -DNDEBUG -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -flto=6 -g -fpic -fPIC -pthread -fopenmp adhoc.o test.o bench1.o bench2.o bench3.o datatest.o dlltest.o fipsalgt.o validat0.o validat1.o validat2.o validat3.o validat4.o validat5.o validat6.o validat7.o validat8.o validat9.o validat10.o regtest1.o regtest2.o regtest3.o regtest4.o ./libcryptopp.a -lgomp
    pubkey.h:640:26: warning: type ‘struct TF_ObjectImpl’ violates the C++ One Definition Rule [-Wodr]
     class CRYPTOPP_NO_VTABLE TF_ObjectImpl : public TF_ObjectImplBase<BASE, SCHEME_OPTIONS, KEY_CLASS>
                              ^
    pubkey.h:640:26: note: a different type is defined in another translation unit
     class CRYPTOPP_NO_VTABLE TF_ObjectImpl : public TF_ObjectImplBase<BASE, SCHEME_OPTIONS, KEY_CLASS>
                              ^
    pubkey.h:651:11: note: a different type is defined in another translation unit
    ...
    make[1]: *** [/tmp/cc1QfZK2.ltrans17.ltrans.o] Error 1
    /usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h: In function ‘BLAKE2_Compress32_NEON’:
    /usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h:10401:47: fatal error: You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use these intrinsics.
       return (uint8x16_t)__builtin_neon_vld1v16qi ((const __builtin_neon_qi *) __a);
    ...
    ^
    compilation terminated.