BASE+SIMD
BASE+SIMD is a compilation model where algorithms with special architectural needs are placed in a separate source file that receives extra compilation options to enable instruction set architectures (ISAs). The separate source file avoids cross-pollinating CPU features into a straight C++ implementation intended to run on a minimally featured machine. The BASE file uses standard C++ flags, while the SIMD file provides hardware acceleration like Altivec, SSE, NEON, CRC, AES, CLMUL and SHA.
The library switched to BASE+SIMD at version Crypto++ 6.0 to better support distros. Also see Issue 380, PR 461 and Commit 5272744410d0. The change was necessary for two reasons. First, the library stopped using -march=native by default. Second, GCC, Clang and ARM toolchains have slightly different behavior. GCC i686 and x86_64 ISA features were always available, even without options like -msse4, -maes and -msha. ARM and Clang had different behavior, and the ISA features were only available if the options to enable them were present on the command line.
Crypto++ honors a user's CXXFLAGS, but it always adds the required arch flags when compiling a SIMD file because the file will not compile without them. Additionally, Autotools and CMake project files also add the required architectural options. Also see the GNU Coding Standards, Section 7.2.3 Variables for Specifying Commands.
In addition to supporting Clang and GCC on Linux, the library must also support unusual compilers like SunCC and different platforms like AIX, Solaris and Windows from the same source files. The mash-up of requirements makes the support tricky because things must work with multiple platforms and compilers.
Some related pages are GNUmakefile and Linux (Command Line).
BASE+SIMD
In the BASE+SIMD model there are two files for an algorithm. There is a BASE file like crc.cpp that provides a standard C++ implementation. It uses the default CXXFLAGS and nothing more. There is also a SIMD file like crc-simd.cpp which provides architecture dependent acceleration, like SSE4.2 or CRC instructions. The SIMD file requires additional compiler options for the platform.
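The split can be sketched in a single listing. This is only an illustration under stated assumptions, not the library's actual code: the names CRC32C_Base, CRC32C_SSE42, HasSSE42 and CRC32C are hypothetical, and the two halves would normally live in separate source files so that only the SIMD half is compiled with -msse4.2. The __SSE4_2__ guard stands in for that separation so the listing builds with or without the option; a real BASE file would more likely test a macro supplied by the build system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE4_2__)
# include <nmmintrin.h>   // _mm_crc32_u8
#endif

// BASE side (think crc.cpp): plain C++, default CXXFLAGS only.
// Bitwise CRC-32C using the reflected Castagnoli polynomial.
inline std::uint32_t CRC32C_Base(const unsigned char* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#if defined(__SSE4_2__)
// SIMD side (think crc-simd.cpp): only built when -msse4.2 is in effect.
inline std::uint32_t CRC32C_SSE42(const unsigned char* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, data[i]);
    return ~crc;
}
#endif

// Hypothetical runtime probe; uses a GCC/Clang builtin when it is usable
// and reports "not available" everywhere else.
inline bool HasSSE42()
{
#if defined(__SSE4_2__) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return false;
#endif
}

// Dispatcher on the BASE side: take the accelerated path only when it was
// compiled in and the CPU reports support; otherwise use straight C++.
inline std::uint32_t CRC32C(const unsigned char* data, std::size_t len)
{
#if defined(__SSE4_2__)
    if (HasSSE42())
        return CRC32C_SSE42(data, len);
#endif
    return CRC32C_Base(data, len);
}
```

Both paths compute the same checksum, so a caller only ever uses CRC32C and never needs to know which file supplied the implementation.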
As an example, below is the compilation of the CRC source files on an x86_64 platform. Notice crc-simd.cpp requires -msse4.2 on i686 and x86_64 platforms.
    $ make
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
    ...
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -msse4.2 -c crc-simd.cpp
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c crc.cpp
    ...
Other platforms may require different options. For example, Aarch64 provides CRC32 and CRC-32C acceleration, and the architectural flag of interest is -march=armv8-a+crc as shown below.
    $ make
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
    ...
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -march=armv8-a+crc -c crc-simd.cpp
    g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c crc.cpp
    ...
SIMD Files
The following is a list of SIMD files as of Crypto++ 8.2. Each file listed below requires an architectural option.
    $ ls -1 *_simd.cpp
    aria_simd.cpp
    blake2b_simd.cpp
    blake2s_simd.cpp
    chacha_simd.cpp
    cham_simd.cpp
    crc_simd.cpp
    gcm_simd.cpp
    gf2n_simd.cpp
    keccak_simd.cpp
    lea_simd.cpp
    neon_simd.cpp
    ppc_simd.cpp
    rijndael_simd.cpp
    shacal2_simd.cpp
    sha_simd.cpp
    simeck_simd.cpp
    simon128_simd.cpp
    simon64_simd.cpp
    sm4_simd.cpp
    speck128_simd.cpp
    speck64_simd.cpp
    sse_simd.cpp
GNUmakefile
As stated earlier, the GNUmakefile always supplies the required architecture option for a SIMD file. Additionally, the Autotools and CMake project files also provide the architectural options when compiling a SIMD file. The makefile uses a feature test to provide the architectural options.
The feature test is a little unusual, but it simply looks for diagnostic messages from the compiler. The way it works is, all whitespace in a compiler message is converted to newlines, and then the number of newlines is counted. If the line count is greater than 0, then the feature test fails.
The reason for the pattern is, compiler return codes are not standard. Some compilers issue a warning but return success when a feature is not available. Instead of relying on the return code, we simply look for compiler diagnostic messages.
The feature test was crafted after "dark and silent cockpits", meaning no messages are good, and any message is bad. Airplane cockpits are similar: no warning lights and no buzzers are good; and warning lights and buzzers are bad.
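The counting rule can be expressed in a few lines of C++. This is a minimal sketch of the idea rather than anything the build system runs, and the function names CountDiagnosticLines and FeatureTestPassed are hypothetical. It assumes the compiler's combined stdout and stderr have already been captured into a string, and it mirrors a `tr ' ' '\n' | wc -l` pipeline by converting spaces to newlines and counting newline characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mimics `tr ' ' '\n' | wc -l`: turn each space into a newline, then count
// the newline characters in the result.
inline std::size_t CountDiagnosticLines(std::string compilerOutput)
{
    std::size_t lines = 0;
    for (char& ch : compilerOutput)
    {
        if (ch == ' ')
            ch = '\n';
        if (ch == '\n')
            ++lines;
    }
    return lines;
}

// "Dark and silent cockpit": a silent compiler means the feature is usable;
// any chatter at all means the test fails.
inline bool FeatureTestPassed(const std::string& compilerOutput)
{
    return CountDiagnosticLines(compilerOutput) == 0;
}
```

For example, an empty capture passes, while a one-line warning such as "warning: unknown option" followed by a newline counts as three lines and fails, regardless of the compiler's exit code.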
Below is from the makefile and its handling of the CRC flag for Intel-based machines. Other architectures, like Aarch64, have similar code.
    SSE42_FLAG = -msse4.2
    ...
    TPROG = TestPrograms/test_x86_sse42.cxx
    TOPT = $(SSE42_FLAG)
    HAVE_OPT = $(shell $(CXX) $(TCXXFLAGS) $(ZOPT) $(TOPT) $(TPROG) -o $(TOUT) 2>&1 | tr ' ' '\n' | wc -l)
    ifeq ($(strip $(HAVE_OPT)),0)
      CRC_FLAG = $(SSE42_FLAG)
    else
      SSE42_FLAG =
    endif
    ...
    ifeq ($(SSE42_FLAG),)
      CRYPTOPP_CXXFLAGS += -DCRYPTOPP_DISABLE_SSE4
    endif
    ...
    crc-simd.o : crc-simd.cpp
    	$(CXX) $(strip $(CRYPTOPP_CXXFLAGS) $(CXXFLAGS) $(CRC_FLAG) -c) $<
    ...
    %.o : %.cpp
    	$(CXX) $(strip $(CRYPTOPP_CXXFLAGS) $(CXXFLAGS) -c) $<
Both Autotools and CMake fall victim to "... some compilers issue a warning but return success when a feature is not available." The Crypto++ Autotools and CMake projects try to work around the build system failures by supplying their own function to try a compile.
Arch Options
The following table lists the architectural flags required for the SIMD files. The flags are GCC's style of options, but LLVM's Clang and Intel's ICC will consume the flags.
The list of files and options is current as of Crypto++ 8.2. It may become stale over time as additional files are added. If a source file is missing from the list then just run make and see what the GNUmakefile uses for the file.
SIMD File | i686 & x86_64 | ARM NEON | AArch64 | PowerPC |
---|---|---|---|---|
aria_simd.cpp | -mssse3 | -march=armv7-a -mfpu=neon† | | -mcpu=power4 -maltivec‡ |
blake2_simd.cpp | -msse4.1 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power4 -maltivec‡ |
chacha_simd.cpp | -msse2 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power4 -maltivec‡ |
chacha_avx.cpp | -mavx2 | | | |
cham_simd.cpp | -mssse3 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power4 -maltivec‡ |
crc_simd.cpp | -msse4.2 | | -march=armv8-a+crc | |
gf2n_simd.cpp | -mpclmul | -march=armv7-a -mfpu=neon† | -march=armv8-a+crypto | -mcpu=power8 -maltivec‡ |
keccak_simd.cpp | -mssse3 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power8 -maltivec‡ |
lea_simd.cpp | -mssse3 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power4 -maltivec‡ |
neon_simd.cpp | | -march=armv7-a -mfpu=neon† | -march=armv8-a | |
ppc_simd.cpp | | | | -mcpu=power4 -maltivec‡ |
rijndael_simd.cpp | -msse4.1 -maes | -march=armv7-a -mfpu=neon† | -march=armv8-a+crypto | -mcpu=power8 -maltivec‡ |
sha_simd.cpp | -msse4.2 -msha | | -march=armv8-a+crypto | -mcpu=power8 -maltivec‡ |
shacal2_simd.cpp | -msse4.2 -msha | | -march=armv8-a+crypto | -mcpu=power8 -maltivec‡ |
simon64_simd.cpp | -msse4.1 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power7 -maltivec‡ |
simon128_simd.cpp | -mssse3 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power8 -maltivec‡ |
speck64_simd.cpp | -msse4.1 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power7 -maltivec‡ |
speck128_simd.cpp | -mssse3 | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power8 -maltivec‡ |
sm4_simd.cpp | -msse4.1 -maes | -march=armv7-a -mfpu=neon† | -march=armv8-a | -mcpu=power8 -maltivec‡ |
sse_simd.cpp | -msse2* | | | |
* i686 requires the -msse2 option. x86_64 does not need the flag because SSE2 is part of amd64's core instruction set.
† ARM NEON also requires a floating point ABI like -mfloat-abi=hard or -mfloat-abi=softfp.
‡ If compiling with IBM XL C/C++ use -qarch=pwr4 -qaltivec and -qarch=pwr8 -qaltivec.
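One way to see what an option from the table does to a translation unit is to check the macros the compiler predefines once the option is in effect. The sketch below is illustrative only: the function name EnabledIsaMacros is hypothetical, and it assumes a GCC-style compiler that defines the usual x86, ACLE and PowerPC feature macros. Other compilers may define different macros or none at all.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the names of the ISA-related macros that the compiler predefined
// for this translation unit. The result depends on the options used to
// compile the file, not on the CPU the program later runs on.
inline std::vector<std::string> EnabledIsaMacros()
{
    std::vector<std::string> names;
#if defined(__SSE2__)
    names.push_back("__SSE2__");              // -msse2, or any x86_64 build
#endif
#if defined(__SSSE3__)
    names.push_back("__SSSE3__");             // -mssse3
#endif
#if defined(__SSE4_1__)
    names.push_back("__SSE4_1__");            // -msse4.1
#endif
#if defined(__SSE4_2__)
    names.push_back("__SSE4_2__");            // -msse4.2
#endif
#if defined(__AES__)
    names.push_back("__AES__");               // -maes
#endif
#if defined(__PCLMUL__)
    names.push_back("__PCLMUL__");            // -mpclmul
#endif
#if defined(__SHA__)
    names.push_back("__SHA__");               // -msha
#endif
#if defined(__AVX2__)
    names.push_back("__AVX2__");              // -mavx2
#endif
#if defined(__ARM_NEON)
    names.push_back("__ARM_NEON");            // -mfpu=neon or AArch64
#endif
#if defined(__ARM_FEATURE_CRC32)
    names.push_back("__ARM_FEATURE_CRC32");   // -march=armv8-a+crc
#endif
#if defined(__ARM_FEATURE_CRYPTO)
    names.push_back("__ARM_FEATURE_CRYPTO");  // -march=armv8-a+crypto
#endif
#if defined(__ALTIVEC__)
    names.push_back("__ALTIVEC__");           // -maltivec
#endif
#if defined(_ARCH_PWR8)
    names.push_back("_ARCH_PWR8");            // -mcpu=power8
#endif
    return names;
}
```

Compiling the same file once with only the default CXXFLAGS and once with an option such as -msse4.2 or -march=armv8-a+crc, then printing the returned names, should show the difference between what a BASE file and a SIMD file can see.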
Multiversioning
GCC has a feature called Function Multiversioning, which allows software to provide different versions of a function based on an Instruction Set Architecture (ISA). Function multiversioning first appeared in GCC for x86_64 in GCC 5, while Aarch64 multiversioning appeared in GCC 6. An example from the GCC online manual is shown below.
Multiversioning does not work well for the Crypto++ library for several reasons. First, multiversioning is too new and does not provide enough coverage in the field. GCC 4 is still common in the wild, especially on ARM and MIPS boards. In fact many of the machines at the GCC compile farm provide GCC 4.8.5 or 4.9.2 as the default compiler.
Second, GCC lacks function multiversioning completely for some platforms, like PowerPC, MIPS and SPARC. PowerPC, MIPS and SPARC would require BASE+SIMD.
Third, some versions of Clang do not support function multiversioning well, and other compilers don't support it at all. Support for Clang and some other compilers like SunCC would require BASE+SIMD.
Fourth, multiversioning is too incomplete, even with the latest compilers. For example, the code below will fail to compile with GCC 10 and Clang 12:

    __attribute__ ((target ("default")))
    template <class T>
    int foo ()
    {
      // The default version of foo.
      return 0;
    }

    __attribute__ ((target ("sse4.2")))
    template <class T>
    int foo ()
    {
      // foo version for SSE4.2
      return 1;
    }

Finally, multiversioning is too buggy. We tried the experiment and a round of testing. It failed miserably.
Crypto++ 5.6.2
Crypto++ versions prior to 6.0 used a different method to make instructions available. The method is detailed below, and it shows why we had to switch to BASE+SIMD. Effectively Crypto++ source code used the following pattern to provide hardware accelerated algorithms (the real code is a little messier):
    #if defined(__AES__) || (_MSC_FULL_VER >= 150030729)
    // Include standard Intel SSE headers for declarations
    # define CRYPTOPP_AESNI_AVAILABLE 1
    # include <wmmintrin.h>
    #elif (GCC_VERSION >= 40400) || (__INTEL_COMPILER >= 1110)
    // Mirror the Intel functions but use inline ASM for missing pieces
    # define CRYPTOPP_AESNI_AVAILABLE 1
    ...
    __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_aesimc_si128 (__m128i a)
    {
        __m128i r;
        asm ("aesimc %1, %0" : "=x"(r) : "xm"(a));
        return r;
    }
    ...
    #else
    // AESNI not available. Straight C++ will be used
    #endif
The pattern above worked great on i686 or x86_64 for GNU GCC, Intel ICC and Microsoft's MSVC++. GCC, ICC and MSVC++ always made the instructions available with as little as -march=native. GCC, ICC and MSVC++ were the original compilers the library supported from the 1990s and early 2000s.
The pattern failed miserably on other i686 or x86_64 compilers, including Clang and SunCC. Clang required -maes to be explicitly passed on the command line. Clang would not accept -march=native and enable AESNI even on a machine with AESNI. The result was many compilers, including Clang and SunCC, used the C++ implementation. Specialized support for compilers like Clang and SunCC did not arrive until about 2016.
Other architectures, like ARM and NEON, ARMv8, and POWER8 do not make the architectures available unless the appropriate architecture switch is present, even with -march=native. So the additional architectures were broken out of the box using the original x86 pattern. Specialized support for architectures like ARM and NEON, ARMv8, and POWER8 did not arrive until about 2016.
All of this trouble would have been avoided if the compilers simply made the instructions available out of the box for user code. It is one thing for GCC to use a particular instruction set when generating its own code from C++; but it's an entirely different story when a programmer asks for a specific instruction to be generated, like aesimc.
You can see an example of the old pattern in Crypto++ 5.6.2 cpu.h.
Clang and OS X
We are aware of one problem when using BASE+SIMD. On an old OS X machine with an updated Clang, Clang allows higher ISAs to cross-pollinate into BASE code by way of global constructors. The old OS X machine is an early Core2 Duo a.k.a. MacBook Pro from 2011 or so. The updated Clang comes from MacPorts and includes versions 5.0 and 6.0.
For example, instead of compiling only the function ChaCha_OperateKeystream_AVX2 with AVX2, Clang compiles the entire source file chacha_avx.cpp using AVX2. The translation unit includes global constructors so AVX and AVX2 will be used outside of our guarded functions. If the binary is run on a down-level machine then the program will crash with a SIGILL.
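The reason a runtime check cannot protect a global constructor can be shown with a small, portable sketch. Nothing here uses AVX2 and none of the names come from the library; LookupTable, s_table, SimdOperate and Dispatch are hypothetical stand-ins. The point is only the ordering: the constructor of a global object runs at program start, before any dispatcher has had a chance to ask whether the CPU supports the ISA the file was compiled for.

```cpp
#include <cassert>

// Flags used to observe what ran. Both are constant-initialized, so they
// hold their values before any constructor below executes.
static bool g_globalCtorRan = false;
static bool g_guardedCodeRan = false;

// Stand-in for a global object defined in a SIMD source file. If the file
// is built with -mavx2, the compiler is free to use AVX2 in this code too.
struct LookupTable
{
    LookupTable() { g_globalCtorRan = true; }
};
static LookupTable s_table;

// Stand-in for a guarded function such as ChaCha_OperateKeystream_AVX2.
inline void SimdOperate()
{
    g_guardedCodeRan = true;
}

// Stand-in for the BASE-side dispatcher: the SIMD function is only reached
// when the runtime CPU probe says the feature is present.
inline void Dispatch(bool cpuHasAvx2)
{
    if (cpuHasAvx2)
        SimdOperate();
}
```

Calling Dispatch(false) leaves g_guardedCodeRan false, yet g_globalCtorRan is already true by the time main starts. A possible mitigation, which is an assumption here and not something this page prescribes, is to keep objects with non-trivial constructors out of SIMD source files.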
Testing with GCC 6 does not reveal the problem. GCC was tested on OS X and Debian. Both machines were early Core2 Duo machines.
Also see Issue 751, SIGILL on older OS X with new Clang compiler due to global ctor ISA and Restrict global constructors to base ISA on the LLVM-dev mailing list.