| 1 | [[PageOutline]] |
| 2 | |
| 3 | = Multi-Core Processing = |
| 4 | Gateworks has single board computers with single, dual, and quad core processors. |
| 5 | |
| 6 | This page covers only the boards that use dual and quad core processors. |
| 7 | |
| 8 | We encourage customers with a Ventana board to make use of the i.MX community at Freescale: [https://community.freescale.com/community/imx] |
| 9 | |
| 10 | References: '''PLEASE UTILIZE''': |
| 11 | * [https://github.com/torvalds/linux/blob/master/Documentation/networking/scaling.txt] |
| 12 | * [http://www.embedded.com/design/embedded/4236957/2/Multicore-networking-in-L] |
| 13 | * See also our wiki page for [wiki:performance_tuning Performance Tuning] |
| 14 | |
| 15 | Sample output of the 'top' command, which shows processor usage: |
| 16 | {{{ |
| 17 | root@OpenWrt:/#top |
| 18 | |
| 19 | Mem: 37212K used, 728856K free, 0K shrd, 1212K buff, 8100K cached |
| 20 | CPU: 0% usr 0% sys 0% nic 100% idle 0% io 0% irq 0% sirq |
| 21 | Load average: 0.01 0.02 0.05 1/78 23125 |
| 22 | PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND |
| 23 | 23125 1174 root R 1216 0% 1 0% top |
| 24 | 2575 1 root S 8228 1% 0 0% /usr/sbin/collectd |
| 25 | 2617 1 root S 2880 0% 2 0% batmand ath0 |
| 26 | 1423 1 root S 1224 0% 3 0% /sbin/syslogd -C16 |
| 27 | 1174 1 root S 1224 0% 3 0% /bin/ash --login |
| 28 | 1 0 root S 1220 0% 0 0% init |
| 29 | 2630 1 root S 1216 0% 0 0% /usr/sbin/ntpd -n -p 0.openwrt.po |
| 30 | 2491 1 root S 1216 0% 3 0% /sbin/watchdog -t 5 /dev/watchdog |
| 31 | 2468 1 root S 1208 0% 1 0% /usr/sbin/telnetd -l /bin/login.s |
| 32 | 1425 1 root S 1204 0% 2 0% /sbin/klogd |
| 33 | 1441 1 root S 1120 0% 2 0% /sbin/netifd |
| 34 | 1434 1 root S 944 0% 3 0% /sbin/procd |
| 35 | 2335 1434 root S 908 0% 1 0% /usr/sbin/dropbear -F -P /var/run |
| 36 | 2476 1 root S 868 0% 2 0% /usr/sbin/uhttpd -f -h /www -r Op |
| 37 | 1427 1 root S 836 0% 2 0% /sbin/hotplug2 --override --persi |
| 38 | 2558 1 nobody S 768 0% 1 0% /usr/sbin/dnsmasq -C /var/etc/dns |
| 39 | 2640 1 root S 748 0% 3 0% /usr/sbin/vnstatd -d |
| 40 | 1437 1434 root S < 668 0% 2 0% ubusd |
| 41 | 528 2 root SW 0 0% 0 0% [kworker/0:1] |
| 42 | 620 2 root SW 0 0% 0 0% [kworker/u:3] |
| 43 | }}} |
| 44 | |
| 45 | == SMP Affinity (interrupt steering) == |
| 46 | On a symmetric multiprocessing (SMP) system, interrupt handlers can be steered to specific CPU cores. |
| 47 | |
| 48 | The 'affinity' of an interrupt handler can be read or set via /proc/irq/<interrupt>/smp_affinity, which is a hexadecimal bitmask of the CPU cores the handler is allowed to run on. By default the affinity of each handler allows all available cores (e.g. on a dual-core system a value of 3 means bit0 (CPU0) and bit1 (CPU1) are both set). If you want a particular interrupt handler to always run on a specific CPU, change that bitmask. To see which interrupt handlers are registered and which interrupt numbers they use, look at /proc/interrupts. |
| 49 | |
| 50 | For example to set the MMC interrupt handler on a dual-core Laguna (cns3xxx) board to only run on CPU1 (and never CPU0): |
| 51 | {{{ |
| 52 | cat /proc/interrupts ;# see interrupt mapping |
| 53 | echo 2 > /proc/irq/33/smp_affinity ;# set MMC irq to CPU1 |
| 54 | }}} |
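
The bitmask arithmetic described above can be sketched as a small shell helper. The function name `cpu_mask` is hypothetical and not part of any Gateworks BSP; it only computes the hex value, and the write to /proc is left as a comment because the IRQ number depends on your board.

```shell
# Hypothetical helper: print the hex affinity mask for a list of CPU numbers.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}

cpu_mask 1      # CPU1 only        -> 2
cpu_mask 0 1    # CPU0 and CPU1    -> 3

# On a target, the result can be written to an IRQ and read back
# (33 is the MMC IRQ from the Laguna example above):
#   echo "$(cpu_mask 1)" > /proc/irq/33/smp_affinity
#   cat /proc/irq/33/smp_affinity
```

Reading the file back after writing is a quick way to confirm that the kernel accepted the new mask.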
| 55 | |
| 56 | References '''PLEASE UTILIZE''': |
| 57 | * http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux |
| 58 | * http://elinux.org/images/4/43/Understanding_And_Using_SMP_Multicore_Processors_Anderson.pdf |
| 59 | |
| 60 | Sample output of 'cat /proc/interrupts': |
| 61 | {{{ |
| 62 | root@OpenWrt:/# cat /proc/interrupts |
| 63 | CPU0 CPU1 CPU2 CPU3 |
| 64 | 29: 6498 6713 4601 11306 GIC twd |
| 65 | 34: 1 0 0 0 GIC sdma |
| 66 | 45: 6173 0 0 0 GIC mxs-dma |
| 67 | 47: 1 0 0 0 GIC bch |
| 68 | 56: 3070 0 0 0 GIC mmc0 |
| 69 | 59: 80 0 0 0 GIC IMX-uart |
| 70 | 68: 32 0 0 0 GIC 21a0000.i2c |
| 71 | 69: 0 0 0 0 GIC 21a4000.i2c |
| 72 | 70: 91 0 0 0 GIC 21a8000.i2c |
| 73 | 72: 29 0 0 0 GIC ci13xxx_imx |
| 74 | 78: 0 0 0 0 GIC ssi@02028000 |
| 75 | 87: 9 0 0 0 GIC i.MX Timer Tick |
| 76 | 150: 3435 0 0 0 GIC 2188000.ethernet |
| 77 | 151: 0 0 0 0 GIC 2188000.ethernet |
| 78 | 153: 0 0 0 0 GIC ath9k |
| 79 | 154: 0 0 0 0 GIC ath9k, ath9k |
| 80 | 155: 0 0 0 0 GIC ath9k |
| 81 | 352: 0 0 0 0 gpio-mxc mmc0 |
| 82 | 407: 0 0 0 0 IPU imx_drm |
| 83 | 412: 0 0 0 0 IPU imx_drm |
| 84 | 567: 0 0 0 0 IPU imx_drm |
| 85 | 572: 0 0 0 0 IPU imx_drm |
| 86 | IPI0: 0 0 0 0 CPU wakeup interrupts |
| 87 | IPI1: 0 0 0 0 Timer broadcast interrupts |
| 88 | IPI2: 5027 5875 5953 5026 Rescheduling interrupts |
| 89 | IPI3: 5 4 4 5 Function call interrupts |
| 90 | IPI4: 3 6 6 3 Single function call interrupts |
| 91 | IPI5: 0 0 0 0 CPU stop interrupts |
| 92 | Err: 0 |
| 93 | }}} |
| 94 | |
| 95 | |
| 96 | === PCI Interrupt steering === |
| 97 | |
| 98 | The PCI specification calls out 4 interrupts (INTA/INTB/INTC/INTD) that are routed to PCI slots. Each slot gets two interrupts, and these are shared with other slots depending on board layout (a technique called swizzling or barber-poling). This means that on a board with 4 PCI slots each slot can have its own unique interrupt, but with 5 or more slots the extra slots will share an interrupt with another slot. If you can populate your slots such that each device has a unique interrupt, you can use SMP affinity (above) to assign the interrupt handlers of those slots to different CPU cores. This can greatly help performance when interrupt processing is the bottleneck, which the Linux 'top' command will usually help you determine. |
| 99 | |
| 100 | Note that performance gains are difficult to quantify because many factors are at play. In general, you can 'tune' your system by running 'top', which shows CPU utilization (per core if you press the '1' key while it is running), and moving interrupt handlers and processes between cores to balance the load. If one core is underutilized while another is busy, try to spread the load. |
| 101 | |
| 102 | ==== Ventana ==== |
| 103 | The i.MX6 SoC used on the Gateworks Ventana product family has 4 'legacy' interrupts to support the PCI INTA/INTB/INTC/INTD interrupts: |
| 104 | * 152 pin1/INTD (also used as the MSI int) |
| 105 | * 153 pin2/INTC |
| 106 | * 154 pin3/INTB |
| 107 | * 155 pin4/INTA |
| 108 | |
| 109 | Which slot is routed to each interrupt depends on the baseboard and expansion mezzanine board stackup. The best way to determine the mapping for your particular stackup is to populate one slot at a time and check /proc/interrupts. |
| 110 | |
| 111 | Depending on your interrupt routing (board stackup), device slot placement (what device is in what slot), and CPU (number of cores) you can then choose to spread interrupts according to your application needs. |
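
As a rough sketch of spreading interrupts, the helper below assigns each listed IRQ to its own core in round-robin order. The function name `spread_irqs` and the `IRQ_DIR` override are hypothetical; the override exists only so the function can be tried against a scratch directory instead of /proc. Which IRQ numbers are actually in use depends on your stackup, so check /proc/interrupts first.

```shell
# Sketch: give each listed IRQ its own core, round-robin over ncpus cores.
# Usage: spread_irqs <ncpus> <irq> [<irq> ...]
spread_irqs() {
    ncpus=$1; shift
    i=0
    for irq in "$@"; do
        # One bit per IRQ: core 0, 1, 2, ... wrapping at ncpus
        mask=$(printf '%x' $(( 1 << (i % ncpus) )))
        echo "$mask" > "${IRQ_DIR:-/proc/irq}/$irq/smp_affinity"
        i=$(( i + 1 ))
    done
}

# Example for a quad-core Ventana with devices on legacy IRQs 153, 154 and 155:
#   spread_irqs 4 153 154 155
```

This is only a starting point: a strict one-IRQ-per-core layout is not always the best balance, and 'top' remains the way to check the result.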
| 112 | |
| 113 | ==== Laguna ==== |
| 114 | The cns3xxx uses irq61 for pcie0_intr which in the case of a PCIe-to-PCI bridge ends up combining INTA/B/C/D on a single ARM CPU interrupt. This is not optimal when you have multiple cores. To overcome this limitation an enhancement was made on the GW2388-4-H (the model of your GW2388 is displayed by the bootloader on bootup) by additionally routing the INTA/B/C/D signals to unique external ARM CPU interrupts: |
| 115 | * J5: irq95 |
| 116 | * J7: irq94 |
| 117 | * J4: irq93 |
| 118 | * J6: irq154 |
| 119 | |
| 120 | To determine if you have a Laguna board supporting isolated PCI interrupts, check the PCB part number and revision under the 6-digit bar-code label on the top of the board: |
| 121 | * GW2388-4 RevH will have PCB 02210082-07. The 07 indicates the PCB revision; revision 07 or later supports isolated interrupts |
| 122 | |
| 123 | A linux kernel patch is necessary to detect boards that support the isolated PCI interrupts and configure them to be used for the PCI host controller's interrupts. This is supported in: |
| 124 | * OpenWrt trunk BSP r553 |
| 125 | * OpenWrt 13.06 BSP branch r554 |
| 126 | |
| 127 | == Specifying and determining CPU for a userspace process == |
| 128 | By default, userspace processes can be scheduled on any CPU core. |
| 129 | |
| 130 | The 'taskset' application can be used to specify the CPU affinity of a specific task. As above, the affinity is a bitmask specifying which CPUs the task can run on (e.g. 0x3 for CPU0|CPU1, 0x1 for CPU0, 0x2 for CPU1): |
| 131 | * set the affinity for an existing PID (e.g. PID 1): |
| 132 | {{{ |
| 133 | taskset -p 0x1 1 ;# set PID1 to only run on CPU0 |
| 134 | taskset -p 0x2 1 ;# set PID1 to only run on CPU1 |
| 135 | taskset -p 0x3 1 ;# set PID1 to run on either CPU0 or CPU1 |
| 136 | }}} |
| 137 | * launch a process with a specific affinity: |
| 138 | {{{ |
| 139 | taskset 0x1 top ;# run top on CPU0 |
| 140 | }}} |
| 141 | |
| 142 | You can obtain details about which CPUs a process is allowed to run on via /proc or tools like top: |
| 143 | * using /proc/<pid>/status to see which CPUs PID1 is allowed on: |
| 144 | {{{ |
| 145 | # grep Cpus /proc/1/status ;# see current affinity |
| 146 | Cpus_allowed: 3 |
| 147 | Cpus_allowed_list: 0-1 |
| 148 | # taskset -p 0x1 1 ;# set affinity to CPU0 |
| 149 | pid 1's current affinity mask: 3 |
| 150 | pid 1's new affinity mask: 1 |
| 151 | # grep Cpus /proc/1/status ;# confirm new affinity |
| 152 | Cpus_allowed: 1 |
| 153 | Cpus_allowed_list: 0 |
| 154 | }}} |
| 155 | * using top (see the CPU column) |
| 156 | {{{ |
| 157 | # top -n1 ;# 1 iteration of top |
| 158 | Mem: 19704K used, 236320K free, 0K shrd, 2892K buff, 4460K cached |
| 159 | CPU: 0% usr 6% sys 0% nic 11% idle 0% io 0% irq 81% sirq |
| 160 | Load average: 0.05 0.11 0.13 1/44 3590 |
| 161 | PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND |
| 162 | 3588 1423 root R 1204 0% 1 25% top -n1 |
| 163 | 3073 2 root SW 0 0% 1 5% [kworker/u:2] |
| 164 | 695 1 root S 1212 0% 1 0% /sbin/syslogd -C16 |
| 165 | 1 0 root S 1208 0% 0 0% init |
| 166 | 1423 1345 root S 1208 0% 1 0% /bin/ash --login |
| 167 | 1493 1345 root S 1208 0% 0 0% /bin/ash --login |
| 168 | 627 1 root S 1208 0% 0 0% /bin/ash --login |
| 169 | 1345 1 root S 1204 0% 0 0% /usr/sbin/telnetd -l /bin/login.s |
| 170 | 1816 627 root S 1200 0% 0 0% {doup} /bin/sh ./doup |
| 171 | 697 1 root S 1192 0% 0 0% /sbin/klogd |
| 172 | 3590 1816 root S 1192 0% 0 0% sleep 1 |
| 173 | 1126 1 root S 1140 0% 0 0% hostapd -P /var/run/wifi-phy0.pid |
| 174 | 719 1 root S 1116 0% 1 0% /sbin/netifd |
| 175 | 706 1 root S 940 0% 0 0% /sbin/procd |
| 176 | 699 1 root S 700 0% 0 0% /sbin/hotplug2 --override --persi |
| 177 | 712 706 root S < 664 0% 0 0% ubusd |
| 178 | 6 2 root SW 0 0% 0 0% [kworker/u:0] |
| 179 | 16 2 root SW 0 0% 1 0% [kworker/u:1] |
| 180 | 8 2 root SW 0 0% 0 0% [migration/0] |
| 181 | 390 2 root SW 0 0% 1 0% [kworker/1:1] |
| 182 | }}} |
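
The Cpus_allowed value shown above is the same kind of hex bitmask, so it can be checked from a script. The helper name `mask_allows_cpu` is hypothetical, and the sketch assumes a small system (up to 32 cores) where the mask is a single hex word without commas.

```shell
# Hypothetical helper: succeed if hex mask $1 permits CPU number $2.
mask_allows_cpu() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

mask_allows_cpu 3 1 && echo "mask 3 allows CPU1"
mask_allows_cpu 1 1 || echo "mask 1 does not allow CPU1"

# On a target, the mask can be taken from /proc/<pid>/status, e.g. for PID 1:
#   mask=$(awk '/^Cpus_allowed:/ {print $2}' /proc/1/status)
#   mask_allows_cpu "$mask" 0 && echo "PID 1 may run on CPU0"
```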
| 183 | |
| 184 | == Network Packet Steering == |
| 185 | Receive Packet Steering (RPS) hashes the IP address and port of each incoming packet to produce a hash index, then looks that index up in a table that maps it to a CPU number. As a result, the same CPU is always used for a given IP address/port combination. This is by design, so that cache hits increase while processing packets of the same network stream. To realize the benefits of RPS you therefore need multiple streams. You can use the '-P' option of iperf to run such a test. |
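
A minimal sketch of enabling RPS follows. It relies on the per-queue rps_cpus files under /sys/class/net/<dev>/queues/, which take a hex CPU bitmask as described in the kernel scaling document referenced at the top of this page. The function name `enable_rps`, the `SYSFS_NET` override (for dry runs only), the interface name eth0 and the mask f are all example assumptions.

```shell
# Sketch: write a CPU bitmask to every receive queue of an interface.
# Usage: enable_rps <dev> <hexmask>
enable_rps() {
    dev=$1; mask=$2
    for q in "${SYSFS_NET:-/sys/class/net}/$dev"/queues/rx-*; do
        [ -e "$q/rps_cpus" ] || continue   # skip if the queue has no RPS control
        echo "$mask" > "$q/rps_cpus"
    done
}

# On a quad-core target, allow all four cores and test with parallel streams:
#   enable_rps eth0 f
#   iperf -c <server> -P 4
```

With a single iperf stream all packets hash to the same CPU, so a difference should only be visible once several streams run in parallel.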
| 186 | |
| 187 | = Single Core Processing = |
| 188 | |
| 189 | There may be times when a developer needs to use only one CPU, for example because of driver issues. We have found that many workloads, especially wireless applications, are interrupt intensive: the CPU runs at very low utilization while the interrupt controller is saturated. Since the wireless drivers can only operate on a single core, any reduction in overall interrupt traffic helps performance. In dual core mode we use FIQs for inter-processor communication, so running in single core mode reduces the load on the interrupt controller, which in some cases we have seen provide a 5-10% performance improvement. If you are running many user applications, dual core mode can still provide a benefit; it depends on your application. '''NOTE: This mainly applies to the Laguna family of ARM11 processors. The ARM Cortex-A9 based Ventana has better interrupt capabilities.''' |
| 190 | |
| 191 | To do this, we will modify the bootargs in the bootloader. |
| 192 | |
| 193 | For CNS3xxx Laguna boards, break into the bootloader by pressing a key at bootup: |
| 194 | |
| 195 | Modify the variable bootargs to include maxcpus=1 at the end of the line as shown below: |
| 196 | {{{ |
| 197 | Laguna > setenv bootargs console=ttyS0,115200 root=/dev/mtdblock3 rootfstype=squashfs,jffs2 noinitrd init=/etc/preinit maxcpus=1 |
| 198 | }}} |
| 199 | |
| 200 | Then save the bootargs with: |
| 201 | {{{ |
| 202 | Laguna > saveenv |
| 203 | }}} |
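
After rebooting, it can be worth confirming that the kernel actually received the argument. The sketch below pulls the maxcpus value out of the kernel command line; the helper name `maxcpus_arg` is hypothetical, and the optional file argument exists only so it can be tried on a saved copy of the command line.

```shell
# Hypothetical helper: print the maxcpus= value from a kernel command line file.
# Prints nothing if the argument is absent.
maxcpus_arg() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^maxcpus=//p'
}

# On the target after reboot:
#   maxcpus_arg                          # expect 1
#   cat /sys/devices/system/cpu/online   # expect 0 (only CPU0 online)
```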
| 204 | |
| 205 | /proc/interrupts with two cores: |
| 206 | {{{ |
| 207 | root@OpenWrt:/# cat /proc/interrupts |
| 208 | CPU0 CPU1 |
| 209 | 29: 7231 7367 GIC twd |
| 210 | 33: 884 0 GIC mmc0 |
| 211 | 39: 206 0 GIC cns3xxx-i2c |
| 212 | 45: 168 0 GIC serial |
| 213 | 49: 3 0 GIC gig_stat |
| 214 | 51: 82 168 GIC gig_switch |
| 215 | 63: 0 0 GIC dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1 |
| 216 | 64: 0 0 GIC ehci_hcd:usb2 |
| 217 | 89: 28 0 GIC timer |
| 218 | 91: 1 0 GIC ohci_hcd:usb3 |
| 219 | FIQ: 353 441 cns3xxx-fiq |
| 220 | IPI0: 0 0 CPU wakeup interrupts |
| 221 | IPI1: 0 1 Timer broadcast interrupts |
| 222 | IPI2: 2102 2420 Rescheduling interrupts |
| 223 | IPI3: 0 0 Function call interrupts |
| 224 | IPI4: 1775 1722 Single function call interrupts |
| 225 | IPI5: 0 0 CPU stop interrupts |
| 226 | Err: 0 |
| 227 | }}} |
| 228 | |
| 229 | /proc/interrupts with one core: |
| 230 | {{{ |
| 231 | root@OpenWrt:/# cat /proc/interrupts |
| 232 | CPU0 |
| 233 | 29: 12577 GIC twd |
| 234 | 33: 1469 GIC mmc0 |
| 235 | 39: 398 GIC cns3xxx-i2c |
| 236 | 45: 216 GIC serial |
| 237 | 49: 5 GIC gig_stat |
| 238 | 51: 489 GIC gig_switch |
| 239 | 63: 0 GIC dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1 |
| 240 | 64: 0 GIC ehci_hcd:usb2 |
| 241 | 89: 28 GIC timer |
| 242 | 91: 1 GIC ohci_hcd:usb3 |
| 243 | FIQ: 0 0 cns3xxx-fiq |
| 244 | IPI0: 0 CPU wakeup interrupts |
| 245 | IPI1: 0 Timer broadcast interrupts |
| 246 | IPI2: 0 Rescheduling interrupts |
| 247 | IPI3: 0 Function call interrupts |
| 248 | IPI4: 0 Single function call interrupts |
| 249 | IPI5: 0 CPU stop interrupts |
| 250 | Err: 0 |
| 251 | }}} |
| 252 | Note the top command shows all processes on CPU 0:[[BR]] |
| 253 | |
| 254 | [[Image(cpu0.png)]] |
| 255 | |
| 256 | == Single Core Power Consumption (Laguna Family Only) == |
| 257 | Some quick measurements on a GW2388 with VIN=10V: |
| 258 | * At the prompt, the current difference was only 2mA (396mA single core vs 398mA dual core) |
| 259 | * Under stress, the current difference was fairly small at 44mA (438mA single core vs 482mA dual core) |
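
To put those currents in terms of power, the differences can be multiplied by the input voltage. This is plain arithmetic on the figures above, assuming the currents were measured at the 10V input.

```shell
# Sketch: convert the measured current differences into power at VIN.
vin=10                        # volts, from the measurement above
idle_delta=$(( 398 - 396 ))   # mA, dual minus single at the prompt
load_delta=$(( 482 - 438 ))   # mA, dual minus single under stress
echo "idle: $(( idle_delta * vin )) mW"   # 20 mW
echo "load: $(( load_delta * vin )) mW"   # 440 mW
```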
| 260 | |