Floating Point Math
Historically, ARM CPUs lacked a Floating Point Unit (FPU) for hardware-accelerated floating-point calculations. Modern ARM CPUs have FPUs and, in some cases, other hardware blocks capable of accelerating floating-point calculations. Using these blocks is typically referred to as 'hardware floating point' or 'hardfloat', as opposed to emulating floating point in software, which is known as 'softfloat'.
The GCC compiler can produce binaries with several options regarding floating point:
- soft - suitable for running on CPUs with no FPU - calculations are done in software by compiler-generated library calls
- softfp - uses the same calling convention as soft, so it can be linked with soft binaries - the compiler may emit FPU instructions (for the FPU selected at build time), so that code needs an FPU wherever it runs
- hard - suitable for running on CPUs with an FPU only - the most efficient but also the most restrictive as far as binary compatibility goes
Gateworks Product Family CPU details
Most of the Gateworks product families have hardware floating point acceleration capabilities:
- Ventana: Freescale IMX6 SoC
- ARM Cortex-A9 core (1 to 4 core depending on board options)
- armv7a CPU instruction set
- vfpv3-d16 Vector Floating Point unit (3rd generation, 16 64bit FPU registers)
- NEON general-purpose SIMD (Single Instruction, Multiple Data) engine
FPU Notes:
- NEON is a SIMD engine (vector math operations): it can operate on up to 4 single-precision floating-point values in parallel. There are pros and cons to using NEON as the FPU (see the NEON vs VFPv3 section)
- VFP is an FPU (floating point unit) - the one in the Cortex-A9 is the VFPv3
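To make the "4 single-precision values in parallel" point concrete, here is a minimal sketch of a 4-wide multiply. The function name mul4 is hypothetical. The NEON path uses the standard arm_neon.h intrinsics and is only compiled when the compiler reports NEON support (for example with -mfpu=neon); otherwise a plain scalar loop is used, so the same source builds everywhere.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Multiply four pairs of single-precision floats: out[i] = a[i] * b[i].
 * mul4 is an illustrative name, not part of any library. */
void mul4(const float *a, const float *b, float *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One load per operand, one multiply for all four lanes, one store. */
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vmulq_f32(va, vb));
#else
    /* Scalar fallback when NEON is not enabled at build time. */
    size_t i;
    for (i = 0; i < 4; i++)
        out[i] = a[i] * b[i];
#endif
}
```

Both paths produce the same results for ordinary values; they can differ for denormals, which NEON treats as zero.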
GCC options and binary compatibility
The GCC compiler has several options that tell it how to deal with floating point in C code for ARM CPUs:
- -mfloat-abi=<name>:
- soft - causes GCC to generate output containing library calls (to the GCC software floating point library) for floating-point operations.
- softfp - allows generation of code using hardware floating-point instructions (based on the -mfpu option) but still uses the soft-float calling conventions.
- hard - allows generation of floating-point instructions and uses FPU-specific calling conventions.
- -mfpu=<name> - specifies the floating point hardware that is available on the target
Note that some other arguments are aliases of the above. To avoid confusion, this document only refers to the -mfloat-abi forms:
- -mhard-float - equivalent to -mfloat-abi=hard
- -msoft-float - equivalent to -mfloat-abi=soft
The default setting (if not specified) for the above is dictated by the configuration options used to build the gcc cross-toolchain. See the GCC configure documentation in the references below for more details.
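Because the default depends on how the toolchain was configured, it can be useful to ask the compiler which ABI it is actually using. The sketch below relies on macros that GCC predefines for ARM targets (__ARM_PCS_VFP for the hard-float calling convention, __SOFTFP__ for software floating point); the function name float_abi_name is hypothetical, and the mapping assumes a GCC-compatible compiler.

```c
/* Report the floating-point ABI this translation unit was compiled for.
 * Assumes GCC's predefined macros for ARM; float_abi_name is an
 * illustrative name. */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not-arm";   /* built for some other architecture */
#elif defined(__ARM_PCS_VFP)
    return "hard";      /* floats are passed in VFP registers */
#elif defined(__SOFTFP__)
    return "soft";      /* floating point done by library calls */
#else
    return "softfp";    /* FPU instructions, soft-float calling convention */
#endif
}
```

Compiling this with the cross-compiler and no -mfloat-abi option, then printing the result on the target, shows the toolchain default.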
References:
- https://wiki.debian.org/ArmHardFloatPort/VfpComparison - probably the best explanation of the details of soft/softfp/hard I've seen
- https://wiki.linaro.org/Linaro-arm-hardfloat
- http://www.arm.com/products/processors/technologies/vector-floating-point.php
- http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
- http://gcc.gnu.org/install/configure.html
soft
When gcc is used with -mfloat-abi=soft, GCC generates output containing library calls for floating-point operations.
These calls go to the GCC software floating point library, which is intended for machines that do not have hardware support for floating point. This library provides addition, subtraction, multiplication, division, and conversion functions for floats and doubles.
softfp
When gcc is used with -mfloat-abi=softfp, it may generate code using hardware floating-point instructions (based on the -mfpu option) but still uses the soft-float calling conventions. This keeps parameter passing and function linking compatible between binaries built with hardware floating-point instructions and binaries built with library calls, as both pass floats using standard non-float instructions and registers. In other words, this will use hardware floating-point where available (as specified at build time with the -mfpu argument) but can also link with binary/lib objects built with soft floating-point.
The downside is the performance hit in the function prologue/epilogue when passing floats around (assuming your code does so).
The upside is the binary/lib/kernel compatibility which allows you to support more system types (FPU or no FPU) with the same OS distribution.
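To illustrate what "passing floats using standard non-float registers" means: under the soft and softfp conventions a float argument travels as its raw 32-bit IEEE 754 bit pattern in a core register such as r0. The sketch below shows that bit pattern from C; the helper names float_bits and float_from_bits are hypothetical, and it assumes a 32-bit float.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float, i.e. the value that would be
 * placed in a core register when the float is passed under the soft-float
 * calling convention. Assumes sizeof(float) == 4. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Inverse operation: rebuild the float from its bit pattern, which is
 * conceptually what the callee does before it can use an FPU instruction. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, float_bits(1.0f) is 0x3f800000. With softfp the callee has to move such a value from the core register into a VFP register before it can operate on it, which is where the extra cost comes from.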
hard
When gcc is used with -mfloat-abi=hard this allows generation of floating-point instructions and uses FPU-specific calling conventions. This means FPU specific instructions/registers are used and thus a binary built with hard floating-point cannot link with a binary built with soft/softfp. Note that this also requires CONFIG_VFP=y in the kernel.
The upside is better performance, as the function prologue/epilogue does not have to spend time moving floats to and from standard registers - instead the compiler passes arguments in dedicated FPU registers.
The downside is the lack of binary/lib compatibility across the system and kernel compatibility (FPU vs no FPU) - which means you reduce the number of systems that your OS can run on.
NEON vs VFPv3
The IMX6 SoC used on the Ventana product family has two choices for hardware floating point:
- NEON - Note that NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular, denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
- VFPv3
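One place where the two units can differ is the handling of denormal (subnormal) values, which NEON flushes to zero. A minimal sketch of a check is shown below; the function name denormals_preserved is hypothetical. Note that this scalar code only exercises NEON if the compiler actually chooses NEON instructions for it (GCC documents that its auto-vectorizer does not generate NEON floating-point code unless -funsafe-math-optimizations is also given), so treat it as a probe of whatever unit the compiler selected rather than a definitive NEON test.

```c
#include <float.h>

/* Return 1 if halving the smallest normal float yields a non-zero
 * (denormal) result, 0 if the result was flushed to zero.
 * volatile stops the compiler from folding the division at build time. */
int denormals_preserved(void)
{
    volatile float smallest_normal = FLT_MIN;
    volatile float half = smallest_normal / 2.0f;

    return half != 0.0f;
}
```

On an IEEE 754 compliant unit in its default mode this returns 1.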
References:
- http://www.arm.com/products/processors/technologies/vector-floating-point.php
- http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
- https://wiki.debian.org/ArmHardFloatPort/VfpComparison (information regarding gcc neon support is likely dated)
Exploring an Example
Consider the simple C application below built with the OpenWrt GCC4.6 compiler (based on Linaro GCC 4.6):
#include <stdio.h>
#include <stdlib.h>

float foo(float f1, float f2)
{
    return f1 * f2;
}

int main(int argc, char **argv)
{
    float a1, a2, r;

    if (argc != 3) {
        printf("usage: %s <float1> <float2>\n", argv[0]);
        exit(1);
    }
    a1 = strtof(argv[1], NULL);
    a2 = strtof(argv[2], NULL);
    r = foo(a1, a2);
    printf("%f * %f = %f\n", a1, a2, r);
}
Compiling the application to assembly with each option lets you compare the results (below). In the soft disassembly, note the branch to __aeabi_fmul for the multiplication and to __aeabi_f2d for the float-to-double conversions (both are provided by gcc). In the softfp disassembly, note the flds/fmuls/fcvtds instructions and the use of s14/s15: these are all VFPv3 instructions and registers. In the hard disassembly, note that floats are also passed to and returned from functions in floating-point registers (s0/s1) instead of the standard ARM registers.
- soft:
$ staging_dir/toolchain-arm_cortex-a9+vfpv3_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/bin/arm-openwrt-linux-gcc \
    -pipe -march=armv7-a -mtune=cortex-a9 \
    -mfloat-abi=soft -S -o a.out-soft.s floattest.c
$ cat a.out-soft.s
    .arch armv7-a
    .fpu softvfp
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 2
    .eabi_attribute 30, 6
    .eabi_attribute 34, 1
    .eabi_attribute 18, 4
    .file "floattest.c"
    .global __aeabi_fmul
    .text
    .align 2
    .global foo
    .type foo, %function
foo:
    @ args = 0, pretend = 0, frame = 8
    @ frame_needed = 1, uses_anonymous_args = 0
    stmfd sp!, {fp, lr}
    add fp, sp, #4
    sub sp, sp, #8
    str r0, [fp, #-8] @ float
    str r1, [fp, #-12] @ float
    ldr r0, [fp, #-8] @ float
    ldr r1, [fp, #-12] @ float
    bl __aeabi_fmul
    mov r3, r0
    mov r0, r3
    sub sp, fp, #4
    ldmfd sp!, {fp, pc}
    .size foo, .-foo
    .section .rodata
    .align 2
.LC0:
    .ascii "usage: %s <float1> <float2>\012\000"
    .align 2
.LC1:
    .ascii "%f * %f = %f\012\000"
    .global __aeabi_f2d
    .text
    .align 2
    .global main
    .type main, %function
main:
    @ args = 0, pretend = 0, frame = 24
    @ frame_needed = 1, uses_anonymous_args = 0
    stmfd sp!, {r4, r6, r7, r8, r9, fp, lr}
    add fp, sp, #24
    sub sp, sp, #44
    str r0, [fp, #-48]
    str r1, [fp, #-52]
    ldr r3, [fp, #-48]
    cmp r3, #3
    beq .L3
    movw r3, #:lower16:.LC0
    movt r3, #:upper16:.LC0
    ldr r2, [fp, #-52]
    ldr r2, [r2, #0]
    mov r0, r3
    mov r1, r2
    bl printf
    mov r0, #1
    bl exit
.L3:
    ldr r3, [fp, #-52]
    add r3, r3, #4
    ldr r3, [r3, #0]
    mov r0, r3
    mov r1, #0
    bl strtof
    str r0, [fp, #-32] @ float
    ldr r3, [fp, #-52]
    add r3, r3, #8
    ldr r3, [r3, #0]
    mov r0, r3
    mov r1, #0
    bl strtof
    str r0, [fp, #-36] @ float
    ldr r0, [fp, #-32] @ float
    ldr r1, [fp, #-36] @ float
    bl foo
    str r0, [fp, #-40] @ float
    movw r4, #:lower16:.LC1
    movt r4, #:upper16:.LC1
    ldr r0, [fp, #-32] @ float
    bl __aeabi_f2d
    mov r6, r0
    mov r7, r1
    ldr r0, [fp, #-36] @ float
    bl __aeabi_f2d
    mov r8, r0
    mov r9, r1
    ldr r0, [fp, #-40] @ float
    bl __aeabi_f2d
    mov r2, r0
    mov r3, r1
    strd r8, [sp]
    strd r2, [sp, #8]
    mov r0, r4
    mov r2, r6
    mov r3, r7
    bl printf
    mov r0, r3
    sub sp, fp, #24
    ldmfd sp!, {r4, r6, r7, r8, r9, fp, pc}
    .size main, .-main
    .ident "GCC: (OpenWrt/Linaro GCC 4.6-2013.05 r39638) 4.6.4"
    .section .note.GNU-stack,"",%progbits
- Notes:
- note the '.global __aeabi_fmul' - this is the routine gcc uses for software float multiply. The floating point emulation functions are automatically inserted by GCC when the target has no native instructions to perform the operation. There are several floating point emulation functions: __aeabi_fadd and __aeabi_fmul, for example
- note the use of standard ARM instructions and registers when operating on the floats (str/ldr r0,r1,r3)
- softfp:
$ staging_dir/toolchain-arm_cortex-a9+vfpv3_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/bin/arm-openwrt-linux-gcc \
    -pipe -march=armv7-a -mtune=cortex-a9 \
    -mfloat-abi=softfp -mfpu=vfpv3-d16 -S -o a.out-softfp.s floattest.c
$ cat a.out-softfp.s
    .arch armv7-a
    .eabi_attribute 27, 3
    .fpu vfpv3-d16
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 2
    .eabi_attribute 30, 6
    .eabi_attribute 34, 1
    .eabi_attribute 18, 4
    .file "floattest.c"
    .text
    .align 2
    .global foo
    .type foo, %function
foo:
    @ args = 0, pretend = 0, frame = 8
    @ frame_needed = 1, uses_anonymous_args = 0
    @ link register save eliminated.
    str fp, [sp, #-4]!
    add fp, sp, #0
    sub sp, sp, #12
    str r0, [fp, #-8] @ float
    str r1, [fp, #-12] @ float
    flds s14, [fp, #-8]
    flds s15, [fp, #-12]
    fmuls s15, s14, s15
    fmrs r3, s15
    mov r0, r3 @ float
    add sp, fp, #0
    ldmfd sp!, {fp}
    bx lr
    .size foo, .-foo
    .section .rodata
    .align 2
.LC0:
    .ascii "usage: %s <float1> <float2>\012\000"
    .align 2
.LC1:
    .ascii "%f * %f = %f\012\000"
    .text
    .align 2
    .global main
    .type main, %function
main:
    @ args = 0, pretend = 0, frame = 24
    @ frame_needed = 1, uses_anonymous_args = 0
    stmfd sp!, {fp, lr}
    add fp, sp, #4
    sub sp, sp, #40
    str r0, [fp, #-24]
    str r1, [fp, #-28]
    ldr r3, [fp, #-24]
    cmp r3, #3
    beq .L3
    movw r3, #:lower16:.LC0
    movt r3, #:upper16:.LC0
    ldr r2, [fp, #-28]
    ldr r2, [r2, #0]
    mov r0, r3
    mov r1, r2
    bl printf
    mov r0, #1
    bl exit
.L3:
    ldr r3, [fp, #-28]
    add r3, r3, #4
    ldr r3, [r3, #0]
    mov r0, r3
    mov r1, #0
    bl strtof
    str r0, [fp, #-8] @ float
    ldr r3, [fp, #-28]
    add r3, r3, #8
    ldr r3, [r3, #0]
    mov r0, r3
    mov r1, #0
    bl strtof
    str r0, [fp, #-12] @ float
    ldr r0, [fp, #-8] @ float
    ldr r1, [fp, #-12] @ float
    bl foo
    str r0, [fp, #-16] @ float
    movw r3, #:lower16:.LC1
    movt r3, #:upper16:.LC1
    flds s15, [fp, #-8]
    fcvtds d5, s15
    flds s15, [fp, #-12]
    fcvtds d6, s15
    flds s15, [fp, #-16]
    fcvtds d7, s15
    fstd d6, [sp, #0]
    fstd d7, [sp, #8]
    mov r0, r3
    fmrrd r2, r3, d5
    bl printf
    mov r0, r3
    sub sp, fp, #4
    ldmfd sp!, {fp, pc}
    .size main, .-main
    .ident "GCC: (OpenWrt/Linaro GCC 4.6-2013.05 r39638) 4.6.4"
    .section .note.GNU-stack,"",%progbits
- Notes:
- the flds/fmuls s14/s15 VFP instructions/registers are used here because the compiler was told to use the vfpv3-d16 floating point unit
- however, when floats are passed as arguments to functions they are stored in the standard ARM CPU registers (str/mov r0/r1/r3). This incurs a pipeline stall for each register passed, so a lot of time is spent in the function prologue/epilogue copying data back and forth to the FPU registers (which could be about 20 cycles or more for each float). The performance hit is therefore taken whenever floats are passed to or returned from functions. If the sample code inlined the foo function, the code would look the same as the hard float below
- binaries built with soft or softfp can be intermixed on a target at the expense of less performance when passing and returning floats to/from functions
- hard:
$ staging_dir/toolchain-arm_cortex-a9+vfpv3_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/bin/arm-openwrt-linux-gcc \
    -pipe -march=armv7-a -mtune=cortex-a9 \
    -mfloat-abi=hard -mfpu=vfpv3-d16 -S -o a.out-hard.s floattest.c
$ cat a.out-hard.s
    .arch armv7-a
    .eabi_attribute 27, 3
    .eabi_attribute 28, 1
    .fpu vfpv3-d16
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 2
    .eabi_attribute 30, 6
    .eabi_attribute 34, 1
    .eabi_attribute 18, 4
    .file "floattest.c"
    .text
    .align 2
    .global foo
    .type foo, %function
foo:
    @ args = 0, pretend = 0, frame = 8
    @ frame_needed = 1, uses_anonymous_args = 0
    @ link register save eliminated.
    str fp, [sp, #-4]!
    add fp, sp, #0
    sub sp, sp, #12
    fsts s0, [fp, #-8]
    fsts s1, [fp, #-12]
    flds s14, [fp, #-8]
    flds s15, [fp, #-12]
    fmuls s15, s14, s15
    fcpys s0, s15
    add sp, fp, #0
    ldmfd sp!, {fp}
    bx lr
    .size foo, .-foo
    .section .rodata
    .align 2
.LC0:
    .ascii "usage: %s <float1> <float2>\012\000"
    .align 2
.LC1:
    .ascii "%f * %f = %f\012\000"
    .text
    .align 2
    .global main
    .type main, %function
main:
    @ args = 0, pretend = 0, frame = 24
    @ frame_needed = 1, uses_anonymous_args = 0
    stmfd sp!, {fp, lr}
    add fp, sp, #4
    sub sp, sp, #40
    str r0, [fp, #-24]
    str r1, [fp, #-28]
    ldr r3, [fp, #-24]
    cmp r3, #3
    beq .L3
    movw r3, #:lower16:.LC0
    movt r3, #:upper16:.LC0
    ldr r2, [fp, #-28]
    ldr r2, [r2, #0]
    mov r0, r3
    mov r1, r2
    bl printf
    mov r0, #1
    bl exit
.L3:
    ldr r3, [fp, #-28]
    add r3, r3, #4
    ldr r3, [r3, #0]
    mov r0, r3
    mov r1, #0
    bl strtof
    fsts s0, [fp, #-8]
    ldr r3, [fp, #-28]
    add r3, r3, #8
    ldr r3, [r3, #0]
    mov r0, r3
    mov r1, #0
    bl strtof
    fsts s0, [fp, #-12]
    flds s0, [fp, #-8]
    flds s1, [fp, #-12]
    bl foo
    fsts s0, [fp, #-16]
    movw r3, #:lower16:.LC1
    movt r3, #:upper16:.LC1
    flds s15, [fp, #-8]
    fcvtds d5, s15
    flds s15, [fp, #-12]
    fcvtds d6, s15
    flds s15, [fp, #-16]
    fcvtds d7, s15
    fstd d6, [sp, #0]
    fstd d7, [sp, #8]
    mov r0, r3
    fmrrd r2, r3, d5
    bl printf
    mov r0, r3
    sub sp, fp, #4
    ldmfd sp!, {fp, pc}
    .size main, .-main
    .ident "GCC: (OpenWrt/Linaro GCC 4.6-2013.05 r39638) 4.6.4"
    .section .note.GNU-stack,"",%progbits
- Notes:
- like softfp, the flds/fmuls s14/s15 VFP instructions/registers are used here because the compiler was told to use the vfpv3-d16 floating point unit
- however, when floats are passed as arguments to functions they are placed directly in VFP registers (s0/s1 above) and don't need to be copied to and from the standard ARM registers, saving the pipeline hit of 20 or so cycles per float.
- binaries built with hard float can yield better float performance but cannot be intermixed with binaries built with soft/softfp
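The softfp notes above observe that the argument-passing cost disappears if foo is inlined. A minimal sketch of that variant follows, assuming GCC: the always_inline attribute is a GCC extension used here so that the call is inlined even without optimization, and the caller scale3 is a hypothetical name added for illustration.

```c
/* Same multiply as the sample program, but forced inline so no float
 * crosses a function-call boundary (GCC-specific attribute). */
static inline __attribute__((always_inline)) float foo(float f1, float f2)
{
    return f1 * f2;
}

/* Hypothetical caller: both multiplies are expanded in place, so the
 * intermediate result can stay in an FPU register under softfp too. */
float scale3(float x, float y, float z)
{
    return foo(foo(x, y), z);
}
```

The float arguments to scale3 itself are still passed according to the selected ABI; only the inner calls to foo avoid the overhead.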
Floating Point Support in the Kernel (via exception handling)
If the hardware floating point ABI (-mfloat-abi=hard) is going to be used, the Linux kernel must be built with CONFIG_VFP=y to install the necessary exception handlers. You can optionally leave this out if you know you're not going to use hard float - the only place I see this as beneficial is if you want to build a multi-arch kernel that runs on CPUs with and without an FPU, in which case userspace must also be using either -mfloat-abi=soft or -mfloat-abi=softfp. Furthermore, if you don't have hardware floating point, you will need to configure software float emulation in the kernel if you have any userspace apps/libs that use hardware floating point.
This is available in the kernel config under 'Enable Floating point emulation' (CONFIG_VFP=y)
Depending on the architecture and SoC you can select emulation using various hardware floating-point options. For example, for the IMX6 you can select either VFPv3 (CONFIG_VFPv3=y) or NEON (CONFIG_NEON=y) emulation. The kernel floating-point emulation can only be used for architectures that have hardware floating-point.
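To check at run time what the running kernel reports, one option is to look at the "Features" line of /proc/cpuinfo, which on 32-bit ARM kernels lists tokens such as vfp, vfpv3 and neon. The sketch below assumes that line format; the function names features_has and cpu_has_feature are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `name` appears as a whole whitespace-delimited token in
 * `line`, so that "vfp" does not match inside "vfpv3". */
int features_has(const char *line, const char *name)
{
    size_t n = strlen(name);
    const char *p = line;

    if (n == 0)
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        int start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

        if (start_ok && end_ok)
            return 1;
        p += 1;
    }
    return 0;
}

/* Return 1 if the "Features" line of /proc/cpuinfo lists `name`.
 * Returns 0 if the file or the line is missing (e.g. on non-ARM hosts). */
int cpu_has_feature(const char *name)
{
    char line[1024];
    int found = 0;
    FILE *fp = fopen("/proc/cpuinfo", "r");

    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "Features", 8) == 0 && features_has(line, name)) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

For example, cpu_has_feature("vfpv3") and cpu_has_feature("neon") can be called on the target before deciding which build of a library to load.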
References:
- http://www.linux-arm.org/LinuxKernel/LinuxVFP
- http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm/VFP/release-notes.txt
- http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/Kconfig
OpenWrt
There is a menuconfig option in OpenWrt, found in the toolchain options under Advanced configuration options->Target options->Use software floating point, which configures the following:
- CONFIG_SOFT_FLOAT=y:
- apps are built with -mfloat-abi=softfp
- uClibc is built with UCLIBC_HAS_FLOATS=y, UCLIBC_HAS_SOFT_FLOAT=y
- toolchain is configured to default to -mfloat-abi=soft (if not specified)
- CONFIG_SOFT_FLOAT undefined
- apps are built with -mfloat-abi=hard
- uClibc is built with UCLIBC_HAS_FLOATS=y, UCLIBC_HAS_FPU=y
- toolchain is configured to default to -mfloat-abi=hard (if not specified)
When CONFIG_SOFT_FLOAT is undefined, and thus -mfloat-abi=hard is used, you must have kernel support for VFP; otherwise any VFP instructions (used when passing floats to functions) will cause an exception that is not handled and crash your system.
OpenEmbedded / Yocto
The configuration of the OpenEmbedded build system used to build the Yocto BSP for the Gateworks boards varies by Yocto version:
- Yocto 1.4:
- toolchain: gcc v4.7.2 with no default for -mfloat-abi
- libc: eglibc-2.16
- apps: -mfloat-abi=softfp
- Yocto 1.5:
- toolchain: gcc v4.8.1 defaulted to -mfloat-abi=hard
- libc: eglibc-2.18
- apps: -mfloat-abi=hard
Binary distributions
Certain OS distributions provide separate builds based on the floating point ABI. It is common for the architecture suffix to indicate this: for example, Debian uses 'armhf' for its hard floating point port and 'armel' for its soft-float ABI port.
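When it is unclear which kind of distribution a given binary came from, the ELF header can offer a hint. The sketch below classifies the e_flags field of a 32-bit ARM ELF header using the float-ABI bits defined for EABI version 5 (0x200 for soft-float, 0x400 for hard-float). The function name is hypothetical, and binaries produced by older toolchains may set neither bit, in which case the build attributes (such as the '.eabi_attribute 28, 1' line seen in the hard listing earlier) are the more reliable indicator.

```c
#include <stdint.h>

/* Float-ABI bits in e_flags for ARM EABI version 5 binaries. */
#define ARM_ABI_FLOAT_SOFT 0x200u
#define ARM_ABI_FLOAT_HARD 0x400u

/* Classify an ARM ELF e_flags value; the name is illustrative only. */
const char *arm_float_abi_from_eflags(uint32_t e_flags)
{
    if (e_flags & ARM_ABI_FLOAT_HARD)
        return "hard";
    if (e_flags & ARM_ABI_FLOAT_SOFT)
        return "soft";
    return "unknown";   /* neither bit set, e.g. older toolchains */
}
```

The e_flags value is the one shown in the "Flags:" line of readelf -h output; for instance 0x5000400 classifies as hard.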