WIP: Assembly
Adds the functionality to directly show the assembly output of the generated code.
Further, the base pointer specification is exposed to the user, which helps to minimize register spilling in some cases.
Activity
added feature label
Regarding base_pointer_spec: we should make an architecture-dependent default choice. Modern architectures like ARM64, RISC-V and POWER have 31 or 32 general-purpose registers and CUDA has up to 255, while x86-64 only has 16. Detecting the CPU architecture from Python probably doesn't make sense because, in the context of waLBerla, the machine generating the kernel is not necessarily the one executing the code. However, when vectorization or GPU is enabled, we know what architecture we are on and can thus infer the number of registers. When vectorization is off, we should ideally generate code like this for portability:

    #if defined(_M_X64) || defined(__amd64__)
    // code that uses 16 registers
    #else
    // code that uses 31 registers
    #endif
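A minimal sketch of what such an architecture-dependent default could look like on the Python side. The function name `default_gp_register_count`, the instruction-set names and both lookup tables are assumptions made for illustration and are not part of pystencils; the register counts are the ones quoted in the comment above.

```python
# General-purpose register counts as quoted in the comment above.
GP_REGISTERS = {"x86-64": 16, "arm64": 31, "cuda": 255}

# Hypothetical mapping from a vector instruction set to its CPU architecture.
ARCH_OF_INSTRUCTION_SET = {
    "sse": "x86-64",
    "avx": "x86-64",
    "avx512": "x86-64",
    "neon": "arm64",
}


def default_gp_register_count(target, instruction_set=None):
    """Return the inferred register count, or None if the architecture is unknown."""
    if target == "gpu":
        return GP_REGISTERS["cuda"]
    arch = ARCH_OF_INSTRUCTION_SET.get(instruction_set)
    if arch is None:
        # Vectorization is off: the executing machine is not known at generation
        # time, so the generated C code would have to branch in the preprocessor.
        return None
    return GP_REGISTERS[arch]


print(default_gp_register_count("cpu", "avx"))
print(default_gp_register_count("cpu"))
```

A `None` result marks the case where the decision has to be deferred to the preprocessor of the generated code.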
added 28 commits
- 791a5668...15a7e62d - 27 commits from branch pycodegen:master
- 502d9e5c - Merge master
      summands.insert(0, self.SummandInfo(1, "0"))

      assert len(summands) >= 2
    - processed = summands[0].term
    - for summand in summands[1:]:
    -     func = self.instruction_set['-' + suffix] if summand.sign == -1 else self.instruction_set['+' + suffix]
    -     processed = func.format(processed, summand.term)
    + if len(summands) < 10:
    +     processed = summands[0].term
    +     for summand in summands[1:]:
    +         func = self.instruction_set['-' + suffix] if summand.sign == -1 else self.instruction_set['+' + suffix]
    +         processed = func.format(processed, summand.term)
    + else:
    +     processed = visit(summands)

Yes, this is correct. I think it would be better to remove it from this MR and open a new one. I just needed it for my work, so I put it in here.
Consider a 27-point stencil, for example. At the moment pystencils generates something like this:
add(add(add(add(add(...
This works fine, but because of the nested structure each addition needs the result of the previous one. Thus it is not possible to exploit the full parallelism a single core provides, which can be seen in the assembly code as well. With the new strategy, partial sums are calculated independently and then summed up. Since this is only a problem when there are many summands, I chose a somewhat arbitrary threshold (10) to decide when to use the new strategy. It might be reasonable to always use the new strategy.
But this is also a work in progress since it does not always give a significant benefit. Thus I need to do low-level benchmarking on different architectures to find out if it performs better in most cases or if there is even a case when it performs worse. Theoretically, this should not be the case.
Overall I think it is better to remove it from here and put it on a development branch.
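To illustrate the difference between the two strategies, here is a small sketch in Python. It is not the `visit` function from the diff, whose body is not shown here; `nested_sum`, `tree_sum` and `dependency_depth` are hypothetical helpers, the `add({}, {})` format string stands in for an entry of `instruction_set`, and sign handling is left out for brevity.

```python
def nested_sum(terms, add="add({}, {})"):
    """Left-to-right chain: every add depends on the result of the previous one."""
    result = terms[0]
    for term in terms[1:]:
        result = add.format(result, term)
    return result


def tree_sum(terms, add="add({}, {})"):
    """Balanced tree: both halves are summed independently, then combined."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return add.format(tree_sum(terms[:mid], add), tree_sum(terms[mid:], add))


def dependency_depth(expr):
    """Longest chain of dependent adds, i.e. the maximum parenthesis nesting."""
    level = deepest = 0
    for ch in expr:
        if ch == "(":
            level += 1
            deepest = max(deepest, level)
        elif ch == ")":
            level -= 1
    return deepest


terms = [f"t{i}" for i in range(27)]  # summands of a 27-point stencil
print(dependency_depth(nested_sum(terms)))  # 26
print(dependency_depth(tree_sum(terms)))    # 5
```

Both variants issue 26 additions; only the length of the dependency chain changes, which is what limits how many additions can be in flight at once.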
changed this line in version 11 of the diff
      all dimensions
      skip_independence_check: don't check that loop iterations are independent. This is needed e.g. for
      periodicity kernel, that access the field outside the iteration bounds. Use with care!
    + base_pointer_spec: specifies which field accesses are resolved by pointers see :func:`parse_base_pointer_info`

The base pointer specification is also very much a work in progress right now. So far I have found that the compiler generates quite bad code in some cases even though I use only a few pointers. We are still not completely sure why this is the case. However, it might be due to the large offsets that are produced when using fewer pointers.
The only reliable strategy to produce fast code consistently is to use intrinsics. By doing so the compiler has enough hints to generate what we want.
Also I will put it in another branch because it has nothing to do with the assembly MR.
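The trade-off between pointer count and offset size mentioned above can be illustrated with a small sketch. This is not pystencils code: `pointer_strategy` is a hypothetical helper, and the 256 x 256 x 256 field with a contiguous x axis is an assumed example. It counts how many base pointers a 27-point stencil needs and how large the remaining element offsets become, depending on which axes are folded into the base pointers.

```python
from itertools import product


def pointer_strategy(strides, absorbed):
    """Return (number of base pointers, largest remaining element offset).

    strides:  element strides per axis, here ordered (z, y, x)
    absorbed: set of axis indices whose displacement is folded into a base pointer
    """
    pointers = set()
    max_offset = 0
    for d in product((-1, 0, 1), repeat=3):  # all 27 neighbour directions
        pointers.add(tuple(d[a] for a in sorted(absorbed)))
        offset = sum(d[a] * strides[a] for a in range(3) if a not in absorbed)
        max_offset = max(max_offset, abs(offset))
    return len(pointers), max_offset


strides = (256 * 256, 256, 1)  # assumed 256^3 field, x contiguous

print(pointer_strategy(strides, set()))   # (1, 65793)
print(pointer_strategy(strides, {0}))     # (3, 257)
print(pointer_strategy(strides, {0, 1}))  # (9, 1)
```

Fewer pointers relieve register pressure but leave offsets in the tens of thousands of elements; whether that actually explains the poor code is, as stated above, still an open question.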
changed this line in version 11 of the diff
added 1 commit
- 4141f192 - Remove benchmark generation bug fix from this MR
added 1 commit
- 44526560 - Remove features which do not belong to this MR
mentioned in merge request !224 (closed)