Floating Point Math

Historically, ARM CPUs lacked a Floating Point Unit (FPU) for hardware-accelerated floating-point calculations. Modern ARM CPUs have FPUs and, in some cases, other hardware blocks capable of accelerating floating-point calculations. Using these blocks is typically referred to as 'hardware floating point' or 'hardfloat', as opposed to emulating floating point in software, which is known as 'softfloat'.

The GCC compiler can produce binaries with several options regarding floating point:

  • soft - suitable for running on CPUs with no FPU - calculations are done in software by compiler-generated code
  • softfp - uses the soft-float calling conventions, so it can be mixed with soft binaries and libraries - calculations use the FPU selected at build time
  • hard - suitable for running on CPUs with an FPU only - the most efficient but also the most restrictive as far as binary compatibility goes

Gateworks Product Family CPU details

Most of the Gateworks product families have hardware floating point acceleration capabilities:

FPU Notes:

  • NEON is a SIMD engine (vector math operations). It can operate on up to 4 single-precision floating-point values in parallel. There are pros and cons to using NEON as the FPU (see NEON vs VFPv3 below)
  • VFP is an FPU (floating point unit) - the one in the Cortex-A9 is the VFPv3

GCC options and binary compatibility

The GCC compiler has several options that tell it how to deal with floating point in C code for ARM CPUs:

  • -mfloat-abi=<name>:
    • soft - causes GCC to generate output containing library calls (to GCC's software floating-point library, libgcc) for floating-point operations.
    • softfp - allows generation of code using hardware floating-point instructions (based on the -mfpu option) but still uses the soft-float calling conventions.
    • hard - allows generation of floating-point instructions and uses FPU-specific calling conventions.
  • -mfpu=<name> - specifies the floating point hardware that is available on the target

Note that some arguments act as aliases for the options above. To avoid confusion, this document only refers to the options above:

  • -mhard-float - equivalent to -mfloat-abi=hard
  • -msoft-float - equivalent to -mfloat-abi=soft

The default setting (if not specified) for the above is dictated by the configuration options used to build the gcc cross-toolchain. See here for more details.
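
One way to see which setting a given toolchain actually applies is to let the preprocessor report it. The sketch below relies on macros that GCC predefines for ARM targets (__ARM_PCS_VFP for the hard-float calling convention, __SOFTFP__ for pure software floating point); the function name float_abi_name is hypothetical and chosen only for illustration.

```c
/*
 * Report the floating-point ABI this file was compiled for, based on
 * macros GCC predefines on ARM targets. float_abi_name is an
 * illustrative name, not part of any toolchain or library.
 */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not an ARM target";
#elif defined(__ARM_PCS_VFP)
    return "hard";    /* -mfloat-abi=hard: floats passed in VFP registers */
#elif defined(__SOFTFP__)
    return "soft";    /* -mfloat-abi=soft: no FPU instructions generated */
#else
    return "softfp";  /* FPU instructions, soft-float calling conventions */
#endif
}
```

Building this with no -mfloat-abi argument and printing the result from main shows the toolchain's configured default. Building it again with an explicit -mfloat-abi shows the override taking effect.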

soft

When GCC is used with -mfloat-abi=soft, it generates output containing library calls for floating-point operations. These calls resolve to GCC's own runtime support library (libgcc) rather than to your libc.

The GCC software floating point library is used when -mfloat-abi=soft (for use on machines which do not have hardware support for floating point). This library provides addition, subtraction, multiplication, division, and conversion functions for floats and doubles.
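
To give a feel for the kind of work such a library routine does, here is a deliberately simplified sketch of a single-precision multiply built only from integer operations. It is not the libgcc implementation: the name soft_fmul_sketch is hypothetical, and the sketch assumes both inputs and the result are normal numbers (zero, subnormals, infinities, NaNs, overflow and underflow are not handled).

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret float bits as an integer and back (memcpy avoids aliasing issues). */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/*
 * Hypothetical, simplified IEEE 754 single-precision multiply.
 * Assumption: inputs and result are normal numbers; special values,
 * overflow and underflow are NOT handled.
 */
float soft_fmul_sketch(float fa, float fb)
{
    uint32_t a = float_bits(fa), b = float_bits(fb);
    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t exp = (int32_t)((a >> 23) & 0xff) + (int32_t)((b >> 23) & 0xff) - 127;
    uint64_t ma = (a & 0x7fffffu) | 0x800000u;   /* restore the implicit leading 1 */
    uint64_t mb = (b & 0x7fffffu) | 0x800000u;
    uint64_t prod = ma * mb;                     /* 48-bit product in [2^46, 2^48) */
    int shift = 23;

    if (prod & (1ULL << 47)) {                   /* significand product >= 2.0 */
        shift = 24;
        exp += 1;
    }

    uint64_t mant = prod >> shift;               /* 24-bit significand */
    uint64_t rem = prod & ((1ULL << shift) - 1); /* discarded low bits */
    uint64_t half = 1ULL << (shift - 1);

    /* round to nearest, ties to even */
    if (rem > half || (rem == half && (mant & 1)))
        mant += 1;
    if (mant == (1ULL << 24)) {                  /* rounding carried into bit 24 */
        mant >>= 1;
        exp += 1;
    }

    return bits_float(sign | ((uint32_t)exp << 23) | ((uint32_t)mant & 0x7fffffu));
}
```

Even this reduced version needs a dozen or so integer operations for one multiply, which is why hardware floating point is so much faster when it is available.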

softfp

When GCC is used with -mfloat-abi=softfp, it can generate code using hardware floating-point instructions (based on the -mfpu option) but still uses the soft-float calling conventions. This keeps parameter passing and function linking compatible between binaries built with hardware floating-point instructions and binaries built with library calls, as both pass floats using standard non-float instructions and registers. In other words, this uses the hardware floating-point unit specified at build time with the -mfpu argument but can also link with binary/lib objects built with soft floating-point.

The downside is the performance hit in the function prologue/epilogue whenever floats are passed to or returned from functions.

The upside is the binary/lib/kernel compatibility which allows you to support more system types (FPU or no FPU) with the same OS distribution.

hard

When gcc is used with -mfloat-abi=hard this allows generation of floating-point instructions and uses FPU-specific calling conventions. This means FPU specific instructions/registers are used and thus a binary built with hard floating-point cannot link with a binary built with soft/softfp. Note that this also requires CONFIG_VFP=y in the kernel.

The upside is better performance: the function prologue/epilogue doesn't have to spend time moving floats to standard registers because the compiler passes arguments in dedicated FPU registers.

The downside is the lack of binary/lib compatibility across the system and kernel compatibility (FPU vs no FPU) - which means you reduce the number of systems that your OS can run on.
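
Because hard and soft/softfp objects cannot be mixed, it helps to be able to tell them apart. Newer ARM toolchains record the calling convention in the e_flags field of the ELF header; the sketch below decodes those bits. The constant values match EF_ARM_ABI_FLOAT_SOFT and EF_ARM_ABI_FLOAT_HARD from glibc's <elf.h>, while the enum and function names are hypothetical. Older toolchains may leave both bits clear, so treat an unmarked result as "unknown" rather than as proof of anything.

```c
#include <stdint.h>

/* Float-ABI bits in e_flags for ARM EABI v5 objects
 * (same values as EF_ARM_ABI_FLOAT_SOFT/HARD in glibc's <elf.h>). */
#define ARM_ABI_FLOAT_SOFT 0x200u
#define ARM_ABI_FLOAT_HARD 0x400u

/* Illustrative result type; not part of any library. */
enum float_abi {
    FLOAT_ABI_UNMARKED = 0,  /* neither bit set (e.g. older toolchain) */
    FLOAT_ABI_SOFT = 1,      /* soft-float calling conventions (soft or softfp) */
    FLOAT_ABI_HARD = 2       /* FPU-specific calling conventions */
};

/* Classify an ELF header e_flags value by its float-ABI marking. */
enum float_abi elf_float_abi(uint32_t e_flags)
{
    if (e_flags & ARM_ABI_FLOAT_HARD)
        return FLOAT_ABI_HARD;
    if (e_flags & ARM_ABI_FLOAT_SOFT)
        return FLOAT_ABI_SOFT;
    return FLOAT_ABI_UNMARKED;
}
```

The e_flags value can be read from the ELF header of a binary or library, for example from the "Flags:" line that readelf -h prints.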

NEON vs VFPv3

The IMX6 SoC used on the Ventana product family has two choices for hardware floating point:

  • NEON - Note that NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic, so the use of NEON instructions may lead to a loss of precision.
  • VFPv3
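
One concrete source of the difference is the handling of subnormal (denormal) values: ARMv7 NEON arithmetic flushes them to zero, whereas an IEEE 754 compliant unit can represent them. The sketch below produces such a value in plain C so the two behaviours can be compared; the function names are hypothetical, and whether the compiler actually emits NEON or VFP instructions for this code depends on the build flags.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if x is a subnormal (denormal) value, 0 otherwise. */
int is_subnormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/*
 * Half of the smallest normal float. Under full IEEE 754 arithmetic the
 * result is a subnormal; a flush-to-zero unit returns 0.0f instead.
 * The volatile keeps the compiler from folding the division at build time.
 */
float half_of_smallest_normal(void)
{
    volatile float tiny = FLT_MIN;
    return tiny / 2.0f;
}
```

If is_subnormal(half_of_smallest_normal()) returns 1, the arithmetic kept the subnormal; a result of exactly zero indicates flush-to-zero behaviour.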

Exploring an Example

Consider the simple C application below, built with the OpenWrt GCC 4.6 compiler (based on Linaro GCC 4.6):

#include <stdio.h>
#include <stdlib.h>

float foo(float f1, float f2) {
        return f1 * f2;
}

int main(int argc, char **argv)
{
        float a1, a2, r;

        if (argc != 3) {
                printf("usage: %s <float1> <float2>\n", argv[0]);
                exit(1);
        }
        a1 = strtof(argv[1], NULL);
        a2 = strtof(argv[2], NULL);

        r = foo(a1, a2);
        printf("%f * %f = %f\n", a1, a2, r);
}

Compiling the application to assembly lets you compare the results (below). In the soft output, note the branch to __aeabi_fmul for the multiplication (this routine is provided by GCC). In the softfp output, note the flds/fmuls/fcvtds/fstd instructions and the use of s14 and s15 - these are all VFPv3 instructions/registers. In the hard output, note that floats are passed to and returned from functions in floating-point registers (s0/s1) instead of the standard ARM registers.

  • soft:
    $ staging_dir/toolchain-arm_cortex-a9+vfpv3_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/bin/arm-openwrt-linux-gcc \
    -pipe -march=armv7-a -mtune=cortex-a9 \
    -mfloat-abi=soft -S -o a.out-soft.s floattest.c
    
    $ cat a.out-soft.s
            .arch armv7-a
            .fpu softvfp
            .eabi_attribute 20, 1
            .eabi_attribute 21, 1
            .eabi_attribute 23, 3
            .eabi_attribute 24, 1
            .eabi_attribute 25, 1
            .eabi_attribute 26, 2
            .eabi_attribute 30, 6
            .eabi_attribute 34, 1
            .eabi_attribute 18, 4
            .file   "floattest.c"
            .global __aeabi_fmul
            .text
            .align  2
            .global foo
            .type   foo, %function
    foo:
            @ args = 0, pretend = 0, frame = 8
            @ frame_needed = 1, uses_anonymous_args = 0
            stmfd   sp!, {fp, lr}
            add     fp, sp, #4
            sub     sp, sp, #8
            str     r0, [fp, #-8]   @ float
            str     r1, [fp, #-12]  @ float
            ldr     r0, [fp, #-8]   @ float
            ldr     r1, [fp, #-12]  @ float
            bl      __aeabi_fmul
            mov     r3, r0
            mov     r0, r3
            sub     sp, fp, #4
            ldmfd   sp!, {fp, pc}
            .size   foo, .-foo
            .section        .rodata
            .align  2
    .LC0:
            .ascii  "usage: %s <float1> <float2>\012\000"
            .align  2
    .LC1:
            .ascii  "%f * %f = %f\012\000"
            .global __aeabi_f2d
            .text
            .align  2
            .global main
            .type   main, %function
    main:
            @ args = 0, pretend = 0, frame = 24
            @ frame_needed = 1, uses_anonymous_args = 0
            stmfd   sp!, {r4, r6, r7, r8, r9, fp, lr}
            add     fp, sp, #24
            sub     sp, sp, #44
            str     r0, [fp, #-48]
            str     r1, [fp, #-52]
            ldr     r3, [fp, #-48]
            cmp     r3, #3
            beq     .L3
            movw    r3, #:lower16:.LC0
            movt    r3, #:upper16:.LC0
            ldr     r2, [fp, #-52]
            ldr     r2, [r2, #0]
            mov     r0, r3
            mov     r1, r2
            bl      printf
            mov     r0, #1
            bl      exit
    .L3:
            ldr     r3, [fp, #-52]
            add     r3, r3, #4
            ldr     r3, [r3, #0]
            mov     r0, r3
            mov     r1, #0
            bl      strtof
            str     r0, [fp, #-32]  @ float
            ldr     r3, [fp, #-52]
            add     r3, r3, #8
            ldr     r3, [r3, #0]
            mov     r0, r3
            mov     r1, #0
            bl      strtof
            str     r0, [fp, #-36]  @ float
            ldr     r0, [fp, #-32]  @ float
            ldr     r1, [fp, #-36]  @ float
            bl      foo
            str     r0, [fp, #-40]  @ float
            movw    r4, #:lower16:.LC1
            movt    r4, #:upper16:.LC1
            ldr     r0, [fp, #-32]  @ float
            bl      __aeabi_f2d
            mov     r6, r0
            mov     r7, r1
            ldr     r0, [fp, #-36]  @ float
            bl      __aeabi_f2d
            mov     r8, r0
            mov     r9, r1
            ldr     r0, [fp, #-40]  @ float
            bl      __aeabi_f2d
            mov     r2, r0
            mov     r3, r1
            strd    r8, [sp]
            strd    r2, [sp, #8]
            mov     r0, r4
            mov     r2, r6
            mov     r3, r7
            bl      printf
            mov     r0, r3
            sub     sp, fp, #24
            ldmfd   sp!, {r4, r6, r7, r8, r9, fp, pc}
            .size   main, .-main
            .ident  "GCC: (OpenWrt/Linaro GCC 4.6-2013.05 r39638) 4.6.4"
            .section        .note.GNU-stack,"",%progbits
    
    • Notes:
      • note the '.global __aeabi_fmul' - this is the routine GCC uses for software float multiply. GCC automatically inserts calls to floating point emulation functions when the target has no native instructions for the operation. There are several of these functions: __aeabi_fadd and __aeabi_fmul, for example
      • note the use of standard ARM instructions and registers when operating on the floats (str/ldr r0,r1,r3)
  • softfp:
    $ staging_dir/toolchain-arm_cortex-a9+vfpv3_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/bin/arm-openwrt-linux-gcc \
    -pipe -march=armv7-a -mtune=cortex-a9 \
    -mfloat-abi=softfp -mfpu=vfpv3-d16 -S -o a.out-softfp.s floattest.c
    
    $ cat a.out-softfp.s
            .arch armv7-a
            .eabi_attribute 27, 3
            .fpu vfpv3-d16
            .eabi_attribute 20, 1
            .eabi_attribute 21, 1
            .eabi_attribute 23, 3
            .eabi_attribute 24, 1
            .eabi_attribute 25, 1
            .eabi_attribute 26, 2
            .eabi_attribute 30, 6
            .eabi_attribute 34, 1
            .eabi_attribute 18, 4
            .file   "floattest.c"
            .text
            .align  2
            .global foo
            .type   foo, %function
    foo:
            @ args = 0, pretend = 0, frame = 8
            @ frame_needed = 1, uses_anonymous_args = 0
            @ link register save eliminated.
            str     fp, [sp, #-4]!
            add     fp, sp, #0
            sub     sp, sp, #12
            str     r0, [fp, #-8]   @ float
            str     r1, [fp, #-12]  @ float
            flds    s14, [fp, #-8]
            flds    s15, [fp, #-12]
            fmuls   s15, s14, s15
            fmrs    r3, s15
            mov     r0, r3  @ float
            add     sp, fp, #0
            ldmfd   sp!, {fp}
            bx      lr
            .size   foo, .-foo
            .section        .rodata
            .align  2
    .LC0:
            .ascii  "usage: %s <float1> <float2>\012\000"
            .align  2
    .LC1:
            .ascii  "%f * %f = %f\012\000"
            .text
            .align  2
            .global main
            .type   main, %function
    main:
            @ args = 0, pretend = 0, frame = 24
            @ frame_needed = 1, uses_anonymous_args = 0
            stmfd   sp!, {fp, lr}
            add     fp, sp, #4
            sub     sp, sp, #40
            str     r0, [fp, #-24]
            str     r1, [fp, #-28]
            ldr     r3, [fp, #-24]
            cmp     r3, #3
            beq     .L3
            movw    r3, #:lower16:.LC0
            movt    r3, #:upper16:.LC0
            ldr     r2, [fp, #-28]
            ldr     r2, [r2, #0]
            mov     r0, r3
            mov     r1, r2
            bl      printf
            mov     r0, #1
            bl      exit
    .L3:
            ldr     r3, [fp, #-28]
            add     r3, r3, #4
            ldr     r3, [r3, #0]
            mov     r0, r3
            mov     r1, #0
            bl      strtof
            str     r0, [fp, #-8]   @ float
            ldr     r3, [fp, #-28]
            add     r3, r3, #8
            ldr     r3, [r3, #0]
            mov     r0, r3
            mov     r1, #0
            bl      strtof
            str     r0, [fp, #-12]  @ float
            ldr     r0, [fp, #-8]   @ float
            ldr     r1, [fp, #-12]  @ float
            bl      foo
            str     r0, [fp, #-16]  @ float
            movw    r3, #:lower16:.LC1
            movt    r3, #:upper16:.LC1
            flds    s15, [fp, #-8]
            fcvtds  d5, s15
            flds    s15, [fp, #-12]
            fcvtds  d6, s15
            flds    s15, [fp, #-16]
            fcvtds  d7, s15
            fstd    d6, [sp, #0]
            fstd    d7, [sp, #8]
            mov     r0, r3
            fmrrd   r2, r3, d5
            bl      printf
            mov     r0, r3
            sub     sp, fp, #4
            ldmfd   sp!, {fp, pc}
            .size   main, .-main
            .ident  "GCC: (OpenWrt/Linaro GCC 4.6-2013.05 r39638) 4.6.4"
            .section        .note.GNU-stack,"",%progbits
    
    • Notes:
      • the flds/fmuls s14/s15 VFP instructions/registers are used here because the compiler was told to use the vfpv3-d16 floating point unit
      • however, when floats are passed as arguments to functions they travel in the standard ARM CPU registers (str/mov r0/r1/r3). Each transfer between the CPU and FPU registers incurs a pipeline stall, so time is spent in the function prologue/epilogue copying data back and forth (which could be about 20 cycles or more for each float). Passing floats to, or returning floats from, functions is where the performance hit occurs. If the compiler inlined the foo function, the code would look the same as the hard float case below
      • binaries built with soft or softfp can be intermixed on a target at the expense of less performance when passing and returning floats to/from functions
  • hard:
    $ staging_dir/toolchain-arm_cortex-a9+vfpv3_gcc-4.6-linaro_uClibc-0.9.33.2_eabi/bin/arm-openwrt-linux-gcc \
    -pipe -march=armv7-a -mtune=cortex-a9 \
    -mfloat-abi=hard -mfpu=vfpv3-d16 -S -o a.out-hard.s floattest.c
    
    $ cat a.out-hard.s
            .arch armv7-a
            .eabi_attribute 27, 3
            .eabi_attribute 28, 1
            .fpu vfpv3-d16
            .eabi_attribute 20, 1
            .eabi_attribute 21, 1
            .eabi_attribute 23, 3
            .eabi_attribute 24, 1
            .eabi_attribute 25, 1
            .eabi_attribute 26, 2
            .eabi_attribute 30, 6
            .eabi_attribute 34, 1
            .eabi_attribute 18, 4
            .file   "floattest.c"
            .text
            .align  2
            .global foo
            .type   foo, %function
    foo:
            @ args = 0, pretend = 0, frame = 8
            @ frame_needed = 1, uses_anonymous_args = 0
            @ link register save eliminated.
            str     fp, [sp, #-4]!
            add     fp, sp, #0
            sub     sp, sp, #12
            fsts    s0, [fp, #-8]
            fsts    s1, [fp, #-12]
            flds    s14, [fp, #-8]
            flds    s15, [fp, #-12]
            fmuls   s15, s14, s15
            fcpys   s0, s15
            add     sp, fp, #0
            ldmfd   sp!, {fp}
            bx      lr
            .size   foo, .-foo
            .section        .rodata
            .align  2
    .LC0:
            .ascii  "usage: %s <float1> <float2>\012\000"
            .align  2
    .LC1:
            .ascii  "%f * %f = %f\012\000"
            .text
            .align  2
            .global main
            .type   main, %function
    main:
            @ args = 0, pretend = 0, frame = 24
            @ frame_needed = 1, uses_anonymous_args = 0
            stmfd   sp!, {fp, lr}
            add     fp, sp, #4
            sub     sp, sp, #40
            str     r0, [fp, #-24]
            str     r1, [fp, #-28]
            ldr     r3, [fp, #-24]
            cmp     r3, #3
            beq     .L3
            movw    r3, #:lower16:.LC0
            movt    r3, #:upper16:.LC0
            ldr     r2, [fp, #-28]
            ldr     r2, [r2, #0]
            mov     r0, r3
            mov     r1, r2
            bl      printf
            mov     r0, #1
            bl      exit
    .L3:
            ldr     r3, [fp, #-28]
            add     r3, r3, #4
            ldr     r3, [r3, #0]
            mov     r0, r3
            mov     r1, #0
            bl      strtof
            fsts    s0, [fp, #-8]
            ldr     r3, [fp, #-28]
            add     r3, r3, #8
            ldr     r3, [r3, #0]
            mov     r0, r3
            mov     r1, #0
            bl      strtof
            fsts    s0, [fp, #-12]
            flds    s0, [fp, #-8]
            flds    s1, [fp, #-12]
            bl      foo
            fsts    s0, [fp, #-16]
            movw    r3, #:lower16:.LC1
            movt    r3, #:upper16:.LC1
            flds    s15, [fp, #-8]
            fcvtds  d5, s15
            flds    s15, [fp, #-12]
            fcvtds  d6, s15
            flds    s15, [fp, #-16]
            fcvtds  d7, s15
            fstd    d6, [sp, #0]
            fstd    d7, [sp, #8]
            mov     r0, r3
            fmrrd   r2, r3, d5
            bl      printf
            mov     r0, r3
            sub     sp, fp, #4
            ldmfd   sp!, {fp, pc}
            .size   main, .-main
            .ident  "GCC: (OpenWrt/Linaro GCC 4.6-2013.05 r39638) 4.6.4"
            .section        .note.GNU-stack,"",%progbits
    
    • Notes:
      • like softfp, the flds/fmuls s14/s15 VFP instructions/registers are used here because the compiler was told to use the vfpv3-d16 floating point unit
      • however, when floats are passed as arguments to functions they stay in the VFP registers (s0/s1 above) and don't need to be copied around, which avoids the pipeline stalls and the 20 or so cycles per float described for softfp above.
      • binaries built with hard float can yield better float performance but can not be intermixed with binaries built with soft/softfp

Floating Point Support in the Kernel (via exception handling)

If the hardware floating point ABI (-mfloat-abi=hard) is going to be used, the Linux kernel must be built with CONFIG_VFP=y to install the necessary exception handlers. You can optionally leave this out if you know you are not going to use hard float. The only case where this seems beneficial is a multi-arch kernel that runs on CPUs with and without an FPU, in which case userspace must also be built with either -mfloat-abi=soft or -mfloat-abi=softfp. Furthermore, if you don't have hardware floating point, you will need to configure software float emulation in the kernel if you have any userspace apps/libs that use hardware floating point.

This is available in the kernel config under 'Enable Floating point emulation' (CONFIG_VFP=y)

Depending on the architecture and SoC you can select emulation using various hardware floating-point options. For example, for the IMX6 you can select either VFPv3 (CONFIG_VFPv3=y) or NEON (CONFIG_NEON=y) emulation. The kernel floating-point emulation can only be used for architectures that have hardware floating-point.
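
At runtime, the floating-point features the kernel exposes to userspace appear as tokens on the "Features" line of /proc/cpuinfo (for example vfp, vfpv3 and neon). The sketch below checks such a line for a given token. The function name has_cpu_feature is hypothetical, and reading the line out of /proc/cpuinfo is left to the caller.

```c
#include <string.h>

/*
 * Returns 1 if `name` appears as a whole whitespace-separated token in
 * `features`, 0 otherwise. Matching whole tokens matters because "vfp"
 * is also a prefix of "vfpv3". has_cpu_feature is an illustrative name.
 */
int has_cpu_feature(const char *features, const char *name)
{
    size_t len = strlen(name);
    const char *p = features;

    if (len == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == features) || p[-1] == ' ' || p[-1] == '\t';
        int ends_token = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts_token && ends_token)
            return 1;
        p += len;  /* keep searching after this partial match */
    }
    return 0;
}
```

For example, given the illustrative line "swp half thumb fastmult vfp edsp neon vfpv3 tls", the function reports vfp, vfpv3 and neon as present and vfpv4 as absent.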

OpenWrt

OpenWrt's menuconfig has a toolchain option, under Advanced configuration options->Target options->Use software floating point, which configures the following:

  • CONFIG_SOFT_FLOAT=y:
    • apps are built with -mfloat-abi=softfp
    • uClibc is built with UCLIBC_HAS_FLOATS=y, UCLIBC_HAS_SOFT_FLOAT=y
    • toolchain is configured to default to -mfloat-abi=soft (if not specified)
  • CONFIG_SOFT_FLOAT undefined
    • apps are built with -mfloat-abi=hard
    • uClibc is built with UCLIBC_HAS_FLOATS=y, UCLIBC_HAS_FPU=y
    • toolchain is configured to default to -mfloat-abi=hard (if not specified)

When CONFIG_SOFT_FLOAT is undefined, and thus -mfloat-abi=hard is used, you must have kernel support for VFP. Otherwise any VFP instructions (such as those used when passing floats to functions) will cause an exception that is not handled and crash your system.

OpenEmbedded / Yocto

The floating point configuration of the OpenEmbedded build system used to build the Yocto BSP for the Gateworks boards varies by Yocto version:

  • Yocto 1.4:
    • toolchain: gcc v4.7.2 with no default for -mfloat-abi
    • libc: eglibc-2.16
    • apps: -mfloat-abi=softfp
  • Yocto 1.5:
    • toolchain: gcc v4.8.1 defaulted to -mfloat-abi=hard
    • libc: eglibc-2.18
    • apps: -mfloat-abi=hard

Binary distributions

Certain OS distributions may provide multiple builds based on floating point support. It is common for the suffix 'armhf' to be used for hard floating point.

Last modified on 02/27/2023 09:07:05 PM