M1 Icestorm cores can still perform very well

Apple is heavily committed to asymmetric multiprocessing (AMP) in its own chips, and in future Macs, iPhones and iPads. Its M1 SoC has four ‘Firestorm’ performance and four ‘Icestorm’ efficiency cores, and several researchers have been working to establish the differences between them in terms of structural units, behaviour and performance. For example, Dougall Johnson has meticulously documented them here and here, with measurements for each instruction. Others, including Maynard Handley, have been building a detailed picture of the many techniques which these cores use to achieve their performance.

What currently seems harder to establish is the difference in overall performance across more typical code. In real-world use, what are the penalties for processes running on Icestorm rather than Firestorm cores? Here I report one initial comparison of performance when calculating floating-point dot products, a task which you might not consider a good fit for the Icestorm cores.

Central to this is my previous observation that a process’s Quality-of-Service (QoS) setting determines which cores it runs on. In macOS 11 and 12, OperationQueue processes given a QoS of 17 or higher are invariably run on Firestorm cores (and can spill over onto the Icestorms too), while those with a QoS of 9 are run only on Icestorm cores. That might change in the face of extreme loading of either core pool, but when there are few other active processes it appears consistent.
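In Swift, those QoS numbers correspond to the raw values of Foundation’s QualityOfService enum (.background is 9, .utility is 17). A minimal sketch of asking for the efficiency cores, assuming the behaviour described above, might be:

```swift
import Foundation

// .background has a raw QoS of 9, which macOS 11/12 is observed to
// schedule on the Icestorm (efficiency) cores only; .utility (17)
// and above run on the Firestorm (performance) cores.
let queue = OperationQueue()
queue.qualityOfService = .background
queue.maxConcurrentOperationCount = 4  // at most one four-core pool

queue.addOperation {
    // dot-product test code would run here
}
queue.waitUntilAllOperationsAreFinished()
```

Note that QoS only expresses a request: which pool actually runs the work remains a scheduling decision, so this sketch can’t guarantee placement.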

Rather than using a test harness such as that developed by Dougall Johnson, these tests were performed in regular macOS, running with Full Security enabled on a stock system without any third-party kernel or system extensions. Execution times were measured using Mach ticks and converted to seconds. The number of concurrent processes allowed in the OperationQueue was constrained to 4, to try to limit core use to a single pool.
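The timing code itself isn’t shown here; a sketch of measuring in Mach ticks and converting to seconds, using mach_absolute_time() and mach_timebase_info(), might look like this:

```swift
import Darwin

// Fetch the tick-to-nanosecond conversion factors once.
var timebase = mach_timebase_info_data_t()
mach_timebase_info(&timebase)

let start = mach_absolute_time()
// ... code under test runs here ...
let end = mach_absolute_time()

// Convert elapsed Mach ticks to seconds.
let seconds = Double(end - start) * Double(timebase.numer)
    / Double(timebase.denom) / 1_000_000_000.0
```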

Four different methods were used to calculate dot products on Swift Float (32-bit floating-point, C float) numbers:

  • a tight loop of assembly language using mixed SIMD instructions on 4-wide arrays of single-precision floating-point numbers;
  • the Apple simd (a relative of the Accelerate libraries) call simd_dot() on two simd_float4 arrays, using Swift;
  • simple Swift using nested for loops;
  • a more ‘idiomatic’ Swift nested loop using map and reduce.

Code for each is given in the Appendix below.

Does setting QoS control which cores are used?

Core load was observed using Activity Monitor. In every run, tests performed with a QoS of 9 only loaded the Icestorm cores, and those with higher QoS only the Firestorm cores. The screenshot below shows a series (from the left) in which four alternating QoS settings were used. At no time did any test appear to pass any load to the other pool of cores.

(Screenshot amptest01: Activity Monitor CPU load during the series of alternating QoS runs.)

Performance

Times were measured over a range of iteration counts, and appeared most consistent and comparable for 10^8 iterations of the dot product calculation. On Firestorm cores, this was fastest using the simd (Accelerate) library, at 0.0938 seconds, followed by assembly language (0.142 s) and simple Swift (0.451 s). ‘Idiomatic’ Swift took much longer, at 15.7 seconds. That is consistent with my previous results from tests which didn’t control or observe which cores they were run on.

On the Icestorm cores, assembly language was fastest (0.271 seconds), then simd (Accelerate) (0.309 s), simple Swift (1.27 s), and ‘idiomatic’ Swift (86.3 s).

Relative to their Firestorm times, the Icestorm cores took:

  • 190% running assembly language
  • 330% running simd (Accelerate) library functions
  • 280% running simple Swift
  • 550% running ‘idiomatic’ Swift

where 100% would be the same time as the Firestorm core, and 200% would be twice that time.
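Each percentage is simply the ratio of the Icestorm time to the Firestorm time for that test, for example:

```swift
// Icestorm time / Firestorm time, as a percentage of the Firestorm time.
let assemblyRatio = 0.271 / 0.142 * 100    // ≈ 190%
let simdRatio     = 0.309 / 0.0938 * 100   // ≈ 330%
```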

My previous comparison between compression performed by AppleArchive using all eight cores and only Icestorm cores showed the latter was far slower (717%). These results show that, at their best, Icestorm cores can run SIMD vector arithmetic at slightly better than half the ‘speed’ of the Firestorm cores. Although I suspect that Apple’s simd library isn’t optimised for the Icestorm, it achieved a third of the ‘speed’ of a Firestorm when run on Icestorm.

Maynard Handley previously commented that Icestorm cores use about 10% of the power (net 25% of energy) of Firestorm cores. For SIMD vector arithmetic, at least, they perform extremely well for their economy. In the M1, multiprocessing isn’t always as asymmetric as you might expect.

Appendix: Code used in the iterative loop

In each case, the first section of code calculates the dot product itself, following which the values in one of the arrays are incremented ready for the next run through the loop.

Assembly language:
FMUL V1.4S, V2.4S, V3.4S   // element-wise products of the two vectors
FADDP V0.4S, V1.4S, V1.4S  // pairwise add: first stage of horizontal sum
FADDP V0.4S, V0.4S, V0.4S  // pairwise add again: dot product in V0.S[0]
FADD V2.4S, V2.4S, V4.4S   // increment one vector ready for the next loop

simd (Accelerate) library:
tempA = simd_dot(vA, vB)
vA = vA + vC
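The two lines above assume vA, vB and vC are simd_float4 values declared elsewhere; a self-contained sketch, with illustrative values rather than those used in the tests, is:

```swift
import simd

var vA = simd_float4(1, 2, 3, 4)
let vB = simd_float4(2, 2, 2, 2)
let vC = simd_float4(1, 1, 1, 1)

// simd_dot multiplies element-wise and sums: 2 + 4 + 6 + 8 = 20
let tempA = simd_dot(vA, vB)

// Increment vA ready for the next run through the loop.
vA = vA + vC
```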

Simple Swift:
tempA = 0.0
for i in 0...3 {
    tempA += vA[i] * vB[i]
}
for i in 0...3 {
    vA[i] = vA[i] + vC[i]
}

‘Idiomatic’ Swift:
tempA = zip(vA, vB).map(*).reduce(0, +)
for (index, value) in vA.enumerated() {
    vA[index] = value + vC[index]
}
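Similarly, the ‘idiomatic’ loop body can be sketched in self-contained form with small illustrative arrays:

```swift
var vA: [Float] = [1, 2, 3, 4]
let vB: [Float] = [0.5, 0.5, 0.5, 0.5]
let vC: [Float] = [1, 1, 1, 1]

// Dot product via zip/map/reduce: 0.5 + 1.0 + 1.5 + 2.0 = 5.0
let tempA = zip(vA, vB).map(*).reduce(0, +)

// Increment vA ready for the next run through the loop.
for (index, value) in vA.enumerated() {
    vA[index] = value + vC[index]
}
```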