WIP: Assembly

Closed Markus Holzer requested to merge holzer/pystencils:assembly into master
2 unresolved threads

Adds the functionality to directly show the assembly output of the generated code.

Further, the base pointer specification is exposed to the user, which helps to minimize register spilling in some cases.

Activity

          summands.insert(0, self.SummandInfo(1, "0"))

      assert len(summands) >= 2
-     processed = summands[0].term
-     for summand in summands[1:]:
-         func = self.instruction_set['-' + suffix] if summand.sign == -1 else self.instruction_set['+' + suffix]
-         processed = func.format(processed, summand.term)
+     if len(summands) < 10:
+         processed = summands[0].term
+         for summand in summands[1:]:
+             func = self.instruction_set['-' + suffix] if summand.sign == -1 else self.instruction_set['+' + suffix]
+             processed = func.format(processed, summand.term)
+     else:
+         processed = visit(summands)
  • What do these changes do? There is no obvious connection to the MR title.

    Also, I see that visit does not have test coverage. len(summands) < 10 also seems like a rather arbitrary check.

  • Yes, this is correct. I think it would be better to remove it from this MR and open a new one. I only needed it for my own work, so I put it in here.

    Consider a 27-point stencil, for example. At the moment pystencils generates something like this: add(add(add(add(add(... This works fine, but because of the nested structure each addition needs the result of the previous one. It is therefore not possible to exploit the full instruction-level parallelism a single core provides, which can also be seen in the assembly code.

    With the new strategy, partial sums are calculated independently and then summed up. Since this is only a problem when there are many summands, I picked a fairly arbitrary threshold (10) to decide when to use the new strategy. It might be reasonable to always use the new strategy.

    This is also a work in progress, since it does not always give a significant benefit. I still need to do low-level benchmarking on different architectures to find out whether it performs better in most cases, or whether there is a case where it performs worse. Theoretically, there should not be one.

    Overall, I think it is better to remove it from here and put it on a development branch.

  • changed this line in version 11 of the diff
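
    The difference between the nested chain and the partial-sum strategy discussed in this thread can be sketched in a few lines of Python. This is only an illustration under assumptions: the `add(...)` template stands in for an `instruction_set['+' + suffix]` entry, and `chain_sum`, `tree_sum`, and `nesting_depth` are hypothetical helper names, not pystencils functions. The `visit` function from the diff is not shown on this page, so the pairwise split below is just one possible way to form independent partial sums.

```python
# Hypothetical sketch: none of these names are pystencils API.
ADD = "add({0}, {1})"  # stand-in for an instruction_set['+' + suffix] template


def chain_sum(terms, add=ADD):
    """Left-nested chain: each add needs the result of the previous add."""
    processed = terms[0]
    for term in terms[1:]:
        processed = add.format(processed, term)
    return processed


def tree_sum(terms, add=ADD):
    """Pairwise reduction: both halves are independent partial sums."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return add.format(tree_sum(terms[:mid], add), tree_sum(terms[mid:], add))


def nesting_depth(expr):
    """Maximum parenthesis depth, i.e. the length of the longest add dependency chain."""
    depth = deepest = 0
    for char in expr:
        if char == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif char == ")":
            depth -= 1
    return deepest


terms = [f"t{i}" for i in range(27)]  # e.g. the summands of a 27-point stencil
print(nesting_depth(chain_sum(terms)))  # 26
print(nesting_depth(tree_sum(terms)))   # 5
```

    For 27 summands the dependency chain shrinks from 26 additions to 5, which is what leaves room for the core to execute independent additions in parallel. Note that reassociating floating-point additions can change rounding, so results may differ slightly between the two forms.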

  •     all dimensions
        skip_independence_check: don't check that loop iterations are independent. This is needed e.g. for
        periodicity kernel, that access the field outside the iteration bounds. Use with care!
      + base_pointer_spec: specifies which field accesses are resolved by pointers see :func:`parse_base_pointer_info`
    • Do you want to do something about the number of registers to use depending on the architecture? Or shall we open a separate issue for that?

    • The base pointer specification is also very much a work in progress right now. So far I have found that the compiler generates quite bad code in some cases, even though I use only a few pointers. We are still not completely sure why this is the case. However, it might be due to the large offsets that are produced when using fewer pointers.

      The only reliable strategy to produce fast code consistently is to use intrinsics. That gives the compiler enough hints to generate what we want.

      I will also move this to another branch, because it has nothing to do with the assembly MR.

    • changed this line in version 11 of the diff

  • Markus Holzer added 1 commit

  • Markus Holzer added 1 commit

  • Markus Holzer added 1 commit

    • 4141f192 - Remove benchmark generation bug fix from this MR

  • Markus Holzer added 1 commit

    • 44526560 - Remove features which do not belong to this MR

  • The assembly code of the AST can be displayed via the in-core model output, which can be accessed with kerncraft. Showing the whole boilerplate code as assembly isn't that useful, so this MR is closed.

  • closed

  • Michael Kuron mentioned in merge request !224 (closed)
