A C/C++ header file that converts Intel SSE intrinsics to Arm/AArch64 NEON intrinsics.
Introduction
sse2neon translates Intel SSE (Streaming SIMD Extensions) intrinsics to Arm NEON,
enabling rapid porting of x86 SIMD code to Arm platforms.
The header file sse2neon.h provides NEON-based implementations of functions from Intel intrinsic headers (e.g., <xmmintrin.h>),
preserving the original semantics.
Some SSE intrinsics map directly to a single NEON intrinsic (e.g., _mm_loadu_si128 → vld1q_s32),
while others require multiple NEON instructions (e.g., _mm_maddubs_epi16 uses 13 instructions).
See perf-tier.md for detailed performance tier classification of all intrinsics.
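To make the multi-instruction case concrete, here is a minimal scalar sketch of what _mm_maddubs_epi16 computes per Intel's definition: each unsigned byte of the first operand is multiplied by the matching signed byte of the second, and adjacent products are added with signed 16-bit saturation. The helper names saturate_i16 and maddubs_epi16_ref are hypothetical and are not part of sse2neon.h. The mixed signedness, widening multiply, pairwise add, and saturation are the pieces a NEON version has to assemble from several instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t) v;
}

/* Hypothetical scalar reference for _mm_maddubs_epi16 (not sse2neon code).
 * a holds 16 unsigned bytes, b holds 16 signed bytes, out receives 8 lanes. */
static void maddubs_epi16_ref(const uint8_t a[16],
                              const int8_t b[16],
                              int16_t out[8])
{
    assert(a && b && out);
    for (int i = 0; i < 8; i++) {
        /* Widen before multiplying so the products cannot overflow. */
        int32_t even = (int32_t) a[2 * i] * b[2 * i];
        int32_t odd = (int32_t) a[2 * i + 1] * b[2 * i + 1];
        out[i] = saturate_i16(even + odd);
    }
}
```

With a pair of bytes (255, 255) against (127, 127), the unsaturated sum would be 64770, so the lane saturates to 32767.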
Floating-point Compatibility
Some conversions produce different results than SSE due to IEEE-754 handling differences.
For example, _mm_rsqrt_ps:
__m128 _mm_rsqrt_ps(__m128 in)
{
    // Initial reciprocal square root estimate
    float32x4_t out = vrsqrteq_f32(vreinterpretq_f32_m128(in));
    // One Newton-Raphson refinement step
    out = vmulq_f32(
        out, vrsqrtsq_f32(vmulq_f32(vreinterpretq_f32_m128(in), out), out));
    return vreinterpretq_m128_f32(out);
}
This NEON conversion returns NaN for input 0.0 because rsqrt(0) = INF, and the subsequent INF × 0 = NaN.
The SSE intrinsic returns INF instead.
Enable the compile-time precision flags below when exact SSE compatibility is required.
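The failure mode can be reproduced with a scalar model of the listing above. This is a sketch under stated assumptions: rsqrt_step stands in for vrsqrtsq_f32 using the Newton-Raphson step (3 - a * b) / 2, the estimate is passed in by the caller instead of being computed by vrsqrteq_f32, and both function names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Scalar stand-in for vrsqrtsq_f32: the Newton-Raphson step (3 - a * b) / 2. */
static float rsqrt_step(float a, float b)
{
    return (3.0f - a * b) * 0.5f;
}

/* Hypothetical scalar model of the refinement in _mm_rsqrt_ps.
 * `estimate` plays the role of the vrsqrteq_f32 result for `in`. */
static float refine_rsqrt(float in, float estimate)
{
    /* For in = 0 and estimate = INF, in * estimate is 0 * INF = NaN,
     * and the NaN propagates through the remaining arithmetic. */
    return estimate * rsqrt_step(in * estimate, estimate);
}
```

Calling refine_rsqrt(0.0f, INFINITY) yields NaN, while an ordinary input such as refine_rsqrt(4.0f, 0.5f) keeps the correct value 0.5f.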
Requirements
Architecture: Little-endian ARM only. Big-endian ARM is not supported and will produce a compile-time error.
GCC versions before 10 and Clang versions before 11 contain bugs in vector instruction generation that cause incorrect assembly output for certain NEON intrinsics (e.g., rev16, rev32 with invalid operand combinations).
ARM64EC is a hybrid ABI allowing ARM64 code to interoperate with x64 code.
Usage
Copy sse2neon.h into your source directory.
Replace SSE headers with sse2neon:
// Before
#include <xmmintrin.h>
#include <emmintrin.h>
// After
#include "sse2neon.h"
This also replaces {p,t,s,n,w}mmintrin.h.
Add the appropriate compiler flags:
| Target | Compiler Flag |
|--------|---------------|
| ARMv8-A AArch64 | -march=armv8-a+fp+simd+crypto+crc |
| ARMv8-A AArch32 | -mfpu=neon-fp-armv8 |
| ARMv7-A | -mfpu=neon |
Remove +crypto and/or +crc if unsupported by your target.
For Windows ARM64EC (hybrid x64/ARM64 mode):
sse2neon.h automatically skips <intrin.h> when _M_ARM64EC is detected.
This avoids type conflicts between MSVC’s SSE union types and sse2neon’s NEON types.
No manual configuration is needed when using MSVC with an ARM64EC target.
Note: sse2neon’s __m128 type is not ABI-compatible with x64 code; users needing cross-ABI SIMD interop should use MSVC’s softintrin instead.
(Optional) To reduce the header file size for your specific target architecture and accelerate compilation,
you can use the unifdef tool to remove unused conditional compilation paths.
For example, to generate a reduced version for AArch64:
Compile-time Configurations
NEON trades IEEE-754 compliance for performance when handling denormals and NaNs.
Define these macros as 1 before including sse2neon.h to enable precise (but slower) implementations:
| Macro | Effect |
|-------|--------|
| SSE2NEON_PRECISE_MINMAX | Correct NaN handling in _mm_min_{ps,pd} and _mm_max_{ps,pd} |
| SSE2NEON_PRECISE_DIV | Extra Newton-Raphson iteration for _mm_rcp_ps and _mm_div_ps |
| SSE2NEON_PRECISE_SQRT | Extra Newton-Raphson iteration for _mm_sqrt_ps and _mm_rsqrt_ps |
| SSE2NEON_PRECISE_DP | Conditional multiplication in _mm_dp_pd |
| SSE2NEON_UNDEFINED_ZERO | Force zero for _mm_undefined_{ps,pd,si128} (MSVC already does this) |
All precision flags are disabled by default to maximize performance.
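As an illustration of the kind of divergence SSE2NEON_PRECISE_MINMAX addresses, the scalar sketch below contrasts the two NaN conventions. On x86, MINPS behaves per lane like a < b ? a : b, so any NaN input returns the second operand; NEON's plain minimum (FMIN, as used by vminq_f32) returns NaN when either input is NaN. Both functions are hypothetical models rather than sse2neon code, and they ignore signed-zero ordering.

```c
#include <assert.h>
#include <math.h>

/* x86 MINPS semantics per lane: a < b ? a : b, so any NaN input yields b. */
static float sse_min_model(float a, float b)
{
    return a < b ? a : b;
}

/* NEON FMIN semantics per lane: any NaN input yields NaN. */
static float neon_min_model(float a, float b)
{
    if (isnan(a) || isnan(b))
        return NAN;
    return a < b ? a : b;
}
```

The two models agree on ordinary inputs but differ for sse_min_model(NAN, 1.0f), which returns 1.0f, versus neon_min_model(NAN, 1.0f), which returns NaN.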
Recommended Flag Combinations by Use Case
| Use Case | Flags | Rationale |
|----------|-------|-----------|
| Graphics/Rendering | MINMAX, SQRT (+DIV if using rcp) | NaN handling and normalization; Blender enables all three |
| DSP/Audio | None (defaults) | Throughput over precision; inaudible differences |
| Cryptography | None (defaults) | Integer-focused; FP precision irrelevant |
| Scientific/Numerical | MINMAX, SQRT, DP | Reduces x86 divergence; see caveats below |
| Game Physics | MINMAX | Prevents NaN propagation in collision detection |
| Machine Learning | MINMAX for inference | NaN handling for determinism; training tolerates defaults |
Flag names in the table omit the SSE2NEON_PRECISE_ prefix (e.g., MINMAX means SSE2NEON_PRECISE_MINMAX).
Architecture notes:
The DIV flag affects _mm_rcp* (reciprocal approximation), not _mm_div*, which uses native IEEE-754 division on ARMv8. Enable DIV on ARMv7 or when using reciprocal intrinsics.
For strict determinism, also define SSE2NEON_UNDEFINED_ZERO=1. Some divergences (FTZ/DAZ, NaN payloads) cannot be fully eliminated.
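Example configuration for graphics applications, following the Graphics/Rendering row of the table above. This is a configuration fragment and not a complete program; it assumes sse2neon.h is on the include path and uses only the macros already listed in this section.

```c
/* Enable the precise paths recommended for graphics/rendering.
 * The macros must be defined before the header is included. */
#define SSE2NEON_PRECISE_MINMAX 1
#define SSE2NEON_PRECISE_SQRT 1
#define SSE2NEON_PRECISE_DIV 1 /* only needed if the code uses reciprocal intrinsics */
#include "sse2neon.h"
```

The same definitions can be passed on the compiler command line (for example -DSSE2NEON_PRECISE_MINMAX=1) so that every translation unit sees a consistent configuration.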
Memory Allocation
Memory from _mm_malloc() must be freed with _mm_free(), not free(). Mixing allocators causes heap corruption on Windows.
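One way to see why the pairing matters: an aligned allocator often returns a pointer that is not the one the underlying allocator handed out, so passing it to free() is invalid. The sketch below shows one common technique (over-allocate, then stash the raw pointer just before the aligned block). It is a hypothetical illustration only; demo_aligned_malloc and demo_aligned_free are invented names and do not reflect how sse2neon or any particular C runtime implements _mm_malloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical aligned allocator: align must be a power of two and at
 * least sizeof(void *). Returns NULL on failure. */
static void *demo_aligned_malloc(size_t size, size_t align)
{
    assert(align >= sizeof(void *) && (align & (align - 1)) == 0);
    if (size > SIZE_MAX - align - sizeof(void *))
        return NULL; /* request too large to pad safely */

    void *raw = malloc(size + align + sizeof(void *));
    if (!raw)
        return NULL;

    /* Leave room for the stashed pointer, then round up to the alignment. */
    uintptr_t start = (uintptr_t) raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t) (align - 1);
    ((void **) aligned)[-1] = raw;
    return (void *) aligned;
}

/* Matching release: recover the raw pointer before calling free().
 * Calling free() directly on the aligned pointer would be invalid. */
static void demo_aligned_free(void *p)
{
    if (p)
        free(((void **) p)[-1]);
}
```

A block obtained with demo_aligned_malloc(100, 64) is 64-byte aligned and must go back through demo_aligned_free, mirroring the _mm_malloc()/_mm_free() rule.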
MONITOR/MWAIT Policy
ARM has no userspace equivalent for x86 address-range monitoring.
_mm_monitor is a no-op; _mm_mwait behavior is controlled by SSE2NEON_MWAIT_POLICY:
| Value | Behavior |
|-------|----------|
| 0 (default) | yield - Safe everywhere, never blocks |
| 1 | wfe - Event wait, may trap in EL0 |
| 2 | wfi - Interrupt wait, may trap in EL0 |
Policies 1/2 do not provide “wake on store” semantics and may trap on Linux/iOS/macOS.
See sse2neon.h for detailed usage guidance.
Run Built-in Test Suite
Test cases are in the tests directory with runtime-specified input data.
# Basic test run
make check
# Enable crypto and CRC features
make FEATURE=crypto+crc check
# Target specific CPU
make ARCH_CFLAGS="-mcpu=cortex-a53 -mfpu=neon-vfpv4" check
Optimization Caveats
Compiler optimizations (-O1, -O2, etc.) may cause unexpected behavior with frequent rounding mode changes or repeated _MM_SET_DENORMALS_ZERO_MODE() calls.
The project prioritizes performance over these edge cases; developers should handle them explicitly when needed.
Adoptions
Open source projects using sse2neon for Arm/AArch64 support (partial list):
Aaru Data Preservation Suite is a fully-featured software package to preserve all storage media from the very old to the cutting edge, as well as to give detailed information about any supported image file (whether from Aaru or not) and to extract the files from those images.
aether-game-utils is a collection of cross platform utilities for quickly creating small game prototypes in C++.
ALE, aka Assembly Likelihood Evaluation, is a tool for evaluating accuracy of assemblies without the need of a reference genome.
AnchorWave, Anchored Wavefront Alignment, identifies collinear regions via conserved anchors (full-length CDS and full-length exon have been implemented currently) and breaks collinear regions into shorter fragments, i.e., anchor and inter-anchor intervals.
ATAK-CIV, Android Tactical Assault Kit for Civilian Use, is the official geospatial-temporal and situational awareness tool used by the US Government.
Apache Doris is a Massively Parallel Processing (MPP) based interactive SQL data warehousing for reporting and analysis.
Apache Impala provides lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
Apache Kudu completes Hadoop’s storage layer to enable fast analytics on fast data.
apollo is a high performance, flexible architecture which accelerates the development of Autonomous Vehicles.
ares is a cross-platform, open source, multi-system emulator, focusing on accuracy and preservation.
Async is a set of C++ primitives that allows efficient and rapid development in C++17 on GNU/Linux systems.
avec is a little library for using SIMD instructions on both x86 and Arm.
BARCH is a low-memory, dynamically configurable, constant access time ordered cache similar to Valkey and Redis.
BEAGLE is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages.
BitMagic implements compressed bit-vectors and containers (vectors) based on ideas of bit-slicing transform and Rank-Select compression, offering sets of methods to architect your applications to use HPC techniques to save memory (thus be able to fit more data in one compute unit) and improve storage and traffic patterns when storing data vectors and models in files or object stores.
bipartite_motif_finder, also known as BMF (Bipartite Motif Finder), is an open source tool for finding co-occurrences of sequence motifs in genomic sequences.
Blender is the free and open source 3D creation suite, supporting the entirety of the 3D pipeline.
Boo is a cross-platform windowing and event manager similar to SDL or SFML, with additional 3D rendering functionality.
Brickworks is a music DSP toolkit that supplies the fundamental building blocks for creating and enhancing audio engines on any platform.
CARTA is a new visualization tool designed for viewing radio astronomy images in CASA, FITS, MIRIAD, and HDF5 formats (using the IDIA custom schema for HDF5).
compute-runtime, the Intel Graphics Compute Runtime for oneAPI Level Zero and OpenCL Driver, provides compute API support (Level Zero, OpenCL) for Intel graphics hardware architectures (HD Graphics, Xe).
contour is a modern and actually fast virtual terminal emulator.
Cog is a free and open source audio player for macOS.
dab-cmdline provides entries for the functionality to handle Digital audio broadcasting (DAB)/DAB+ through some simple calls.
DISTRHO is an open-source project for Cross-Platform Audio Plugins.
Dragonfly is a modern in-memory datastore, fully compatible with Redis and Memcached APIs.
EDGE is an advanced OpenGL source port spawned from the DOOM engine, with focus on easy development and expansion for modders and end-users.
Embree is a collection of high-performance ray tracing kernels. Its target users are graphics application engineers who want to improve the performance of their photo-realistic rendering application by leveraging Embree’s performance-optimized ray tracing kernels.
emp-tool aims to provide a benchmark for secure computation and to allow other researchers to experiment and extend.
Exudyn is a C++ based Python library for efficient simulation of flexible multibody dynamics systems.
FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers.
fsrc is capable of searching large codebases for text snippets.
GDAL is a translator library for raster and vector geospatial data formats that comes with a variety of useful command line utilities for data translation and processing.
gmmlib is the Intel Graphics Memory Management Library that provides device specific and buffer management for the Intel Graphics Compute Runtime for OpenCL and the Intel Media Driver for VAAPI.
hashcat is the world’s fastest and most advanced password recovery utility.
HISE is a cross-platform open source audio application for building virtual instruments, emphasizing sampling, but including some basic synthesis features for making hybrid instruments as well as audio effects.
iqtree2 is an efficient and versatile stochastic implementation to infer phylogenetic trees by maximum likelihood.
indelPost is a Python library for indel processing via realignment and read-based phasing to resolve alignment ambiguities.
IResearch is a cross-platform, high-performance document oriented search engine library written entirely in C++ with the focus on a pluggability of different ranking/similarity models.
jak aims to port the original Jak and Daxter and Jak II to PC.
Kraken is a 3D animation platform redefining animation composition, collaborative workflows, simulation engines, skeletal rigging systems, and look development from storyboard to final render.
Krita is a cross-platform application that offers an end-to-end solution for creating digital art files from scratch built on the KDE and Qt frameworks.
libCML is a SLAM library and scientific tool, which includes a novel fast thread-safe graph map implementation.
libhdfs3 is implemented based on native Hadoop RPC protocol and Hadoop Distributed File System (HDFS), a highly fault-tolerant distributed fs, data transfer protocol.
libpll-2 is a C library for Phylogenetic Likelihood.
libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.
libscapi stands for the “Secure Computation API”, providing reliable, efficient, and highly flexible cryptographic infrastructure.
libmatoya is a cross-platform application development library, providing various features such as common cryptography tasks.
Loosejaw provides deep hybrid CPU/GPU digital signal processing.
Madronalib enables efficient audio DSP on SIMD processors with readable and brief C++ code.
MaxMath is an extensive SIMD math library available to Unity developers.
minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database.
mixed-fem is an open source reference implementation of Mixed Variational Finite Elements for Implicit Simulation of Deformables.
MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets.
MRIcroGL is a cross-platform tool for viewing NIfTI, DICOM, MGH, MHD, NRRD, AFNI format medical images.
N2 is an approximate nearest neighborhoods algorithm library written in C++, providing a much faster search speed than other implementations when modeling large datasets.
nanors is a tiny, performant implementation of Reed-Solomon codes, capable of reaching multi-gigabit speeds on a single core.
niimath is a general image calculator with superior performance.
NVIDIA GameWorks has already been used in a lot of games. These repositories are public on GitHub.
NVIDIA MathLib is a cross-platform header-only SSE/AVX/NEON-accelerated math library, designed for computer graphics.
ofxNDI is an openFrameworks addon to allow sending and receiving images over a network using the NewTek Network Device Protocol.
OGRE is a scene-oriented, flexible 3D engine written in C++ designed to make it easier and more intuitive for developers to produce games and demos utilising 3D hardware.
Olive is a free non-linear video editor for Windows, macOS, and Linux.
OpenColorIO is a complete color management solution geared towards motion picture production with an emphasis on visual effects and computer animation.
OpenXRay is an improved version of the X-Ray engine, used in world famous S.T.A.L.K.E.R. game series by GSC Game World.
Orkid is a C++ flexible media presentation engine.
parallel-n64 is an optimized/rewritten Nintendo 64 emulator made specifically for Libretro.
Pathfinder C++ is a fast, practical, GPU-based rasterizer for fonts and vector graphics using Vulkan and C++.
PFFFT does 1D Fast Fourier Transforms, of single precision real and complex vectors.
PhyML uses modern statistical approaches to analyse alignments of nucleotide or amino acid sequences in a phylogenetic framework.
pixaccess provides the abstractions for integer and float bitmaps, pixels, and aliased (nearest neighbor) and anti-aliased (bi-linearly interpolated) pixel access.
PlutoSDR Firmware is the customized firmware for the PlutoSDR that can be used to introduce fundamentals of Software Defined Radio (SDR) or Radio Frequency (RF) or Communications as advanced topics in electrical engineering in a self-led or instructor-led setting.
PowerToys is a set of utilities for power users to tune and streamline their Windows experience for greater productivity.
PVFMM is a library for solving certain types of elliptic partial differential equations.
Pygame is cross-platform and designed to make it easy to write multimedia software, such as games, in Python.
R:RandomFieldsUtils provides various utilities that might be used in spatial statistics and elsewhere. (CRAN)
RAxML is a tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies.
REDUCE is an interactive system for general algebraic computations, of interest to mathematicians, scientists, and engineers.
ReHLDS is fully compatible with latest Half-Life Dedicated Server (HLDS) with a lot of defects and (potential) bugs fixed.
The Forge is a cross-platform rendering framework, providing building blocks to write your own game engine.
Typesense is a fast, typo-tolerant search engine for building delightful search experiences.
Vcpkg is a C++ Library Manager for Windows, Linux, and macOS.
VelocyPack is a fast and compact format for serialization and storage.
VOLK, Vector-Optimized Library of Kernel, is a sub-project of GNU Radio.
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
Winter is the top rated chess engine from Switzerland and has competed at top invite only computer chess events.
XEVE (eXtra-fast Essential Video Encoder) is an open sourced and fast MPEG-5 EVC encoder.
XMRig is an open source CPU miner for Monero cryptocurrency.
xpar is an error/erasure code system guarding data integrity.
xsimd provides a unified means for using SIMD intrinsics and parallelized, optimized mathematical functions.
YACL is a C++ library that contains modules and utilities which SecretFlow code depends on.
Related Projects
SIMDe: fast and portable implementations of SIMD intrinsics on hardware which doesn’t natively support them, such as calling SSE functions on ARM.
sse2rvv: A C header file that converts Intel SSE intrinsics to RISC-V Vector intrinsics.
sse2msa: A C/C++ header file that converts Intel SSE intrinsics to MIPS/MIPS64 MSA intrinsics.
sse2zig: Intel SSE intrinsics mapped to Zig vector extensions.
POWER/PowerPC support for GCC contains a series of headers simplifying porting x86_64 code that makes explicit use of Intel intrinsics to powerpc64le (pure little-endian mode that has been introduced with the POWER8).
Mapping and Coverage
Intel intrinsic headers covered: <mmintrin.h>, <xmmintrin.h>, <emmintrin.h>, <pmmintrin.h>, <tmmintrin.h>, <smmintrin.h>, <nmmintrin.h>, <wmmintrin.h>.
sse2neon supports SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and AES extensions.
Compilers:
| Compiler | Minimum Version | Notes |
|----------|-----------------|-------|
| GCC | 10+ | Earlier versions have vector instruction bugs |
| Clang | 11+ | Earlier versions have vector instruction bugs |
| MSVC | 2019+ (v142) | ARM64 and ARM64EC targets supported |
| Apple Clang | 12+ | macOS ARM64 (Apple Silicon) |
Cross-compilation Testing
Requires QEMU for non-Arm hosts.
See tests/README.md for details.
Licensing
sse2neon is freely redistributable under the MIT License.