Hybrid parallel CFEASolver and CMeshSolver #843

Merged — pcarruscag merged 40 commits into develop from hybrid_parallel_fea_solver on Jan 24, 2020

Conversation

@pcarruscag (Member)

Proposed Changes

Another snapshot from #824.
All major routines of CFEASolver (and by association CMeshSolver) now use OpenMP directives.
Parallelization is at the level of each individual routine, i.e. threads are "started" locally.
This required avoiding the use of small auxiliary array member variables, as that is not thread safe (multiple threads cannot write to the same location). In doing so I ended up doing some cleanup, and so I thought it would be better to move the files, strip trailing spaces, etc.; that should actually make the review easier.
I document the major changes below.

Related Work

#824 #789
Depends on #834 and #830

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with the '-Wall -Wextra -Wno-unused-parameter -Wno-empty-body' compiler flags).
  • My contribution is commented and consistent with SU2 style.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

dep bot added the dependent label — Dec 21, 2019
Comment on lines +1992 to +1995
for (iMGlevel = 0; iMGlevel <= config->GetnMGLevels(); iMGlevel++) {
numerics[iMGlevel] = new CNumerics** [MAX_SOLS];
for (iSol = 0; iSol < MAX_SOLS; iSol++)
numerics[iMGlevel][iSol] = new CNumerics* [MAX_TERMS*omp_get_max_threads()]();
Member Author:

One of the major changes is that we need one numerics instance per thread; instead of adding another * I enlarged the last dimension of the numerics container.

Contributor:

MAX_TERMS is still 6, while MAX_TERMS_FEA was 10 before, and if I recall correctly that was to enable multi-material laws. How has this changed?

Member Author:

It looked like we only have 5 material models / terms:

const int FEA_TERM = 0;      /*!< \brief Position of the finite element analysis terms in the numerics container array. */
const int DE_TERM = 1;       /*!< \brief Position of the dielectric terms in the numerics container array. */

const int MAT_NHCOMP  = 2;   /*!< \brief Position of the Neo-Hookean compressible material model. */
const int MAT_IDEALDE = 3;   /*!< \brief Position of the Ideal-DE material model. */
const int MAT_KNOWLES = 4;   /*!< \brief Position of the Knowles material model. */

And so 6 max terms would be enough?

break;
}
if (!(properties_file.fail())) {
numerics[MESH_0][FEA_SOL][MAT_NHCOMP+offset] = new CFEM_NeoHookean_Comp(nDim, nVar_FEM, config);
Member Author:

The container can/should then be used with an "offset" of thread*MAX_TERMS.
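
For illustration, a minimal sketch of that indexing pattern (only the container and constants quoted above are from the PR; everything else is assumed):

/*--- Hypothetical sketch: each thread addresses its own slice of the enlarged
 * last dimension of the numerics container via an offset of thread*MAX_TERMS. ---*/
const int thread = omp_get_thread_num();
const int offset = thread*MAX_TERMS;

/*--- This thread's private instances, e.g. the FEA term and a material model. ---*/
CNumerics* fea_term = numerics[MESH_0][FEA_SOL][FEA_TERM   + offset];
CNumerics* nh_comp  = numerics[MESH_0][FEA_SOL][MAT_NHCOMP + offset];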

Comment on lines +2739 to +2740
void CDriver::Numerics_Postprocessing(CNumerics *****numerics, CSolver***, CGeometry**,
CConfig *config, unsigned short val_iInst) {
Member Author:

I simplified (a lot) the numerics postprocessing to just delete everything.
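
For context, a minimal sketch of what "delete everything" can amount to for the container allocated as quoted earlier (index types are assumptions, and the outer instance dimension would be handled analogously); note that the value-initialization "()" in the allocation leaves unused slots null, and deleting a null pointer is a no-op:

/*--- Hypothetical sketch: blanket deletion of the numerics container,
 * slots that were never instantiated are nullptr and deleting them is harmless. ---*/
for (unsigned short iMGlevel = 0; iMGlevel <= config->GetnMGLevels(); iMGlevel++) {
  for (int iSol = 0; iSol < MAX_SOLS; iSol++) {
    for (int iTerm = 0; iTerm < MAX_TERMS*omp_get_max_threads(); iTerm++)
      delete numerics[iMGlevel][iSol][iTerm];
    delete [] numerics[iMGlevel][iSol];
  }
  delete [] numerics[iMGlevel];
}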

Contributor:

Great idea, I was searching for it 😅😅😅

Comment on lines +44 to +50
# OpenMP
omp = get_option('with-omp')

if omp
omp_dep = dependency('openmp', language:'cpp')
endif

Member Author:

I updated meson to build with OpenMP support: add -Dwith-omp=true (default false) to the call to meson.py. The number of threads is still set via environment variable (see #830).

Contributor:

Thanks! That should make it much easier to build.

Comment on lines 248 to 264
#ifdef HAVE_OMP
/*--- Get the element coloring. ---*/

const auto& coloring = geometry->GetElementColoring();

auto nColor = coloring.getOuterSize();
ElemColoring.resize(nColor);

for(auto iColor = 0ul; iColor < nColor; ++iColor) {
ElemColoring[iColor].size = coloring.getNumNonZeros(iColor);
ElemColoring[iColor].indices = coloring.innerIdx(iColor);
}

ColorGroupSize = 1; /// TODO: This needs to come from geometry or config

omp_chunk_size = computeStaticChunkSize(nPointDomain, omp_get_max_threads(), OMP_MAX_SIZE);
#endif
Member Author:

The element coloring is obtained from CGeometry as a shallow copy at construction.
I have yet to test the performance impact of the color group size; group size 1 gives maximum parallelization but worse spatial locality.

Member Author:

Update: The impact of not grouping elements for coloring (size 1) is not significant for the FEA solver: not a lot of data is accessed per element relative to the computational effort, and linear algebra takes up 80%+ of the time anyway.

Comment on lines +957 to +972
#ifdef HAVE_OMP
/*--- Chunk size is at least OMP_MIN_SIZE and a multiple of the color group size. ---*/
auto chunkSize = roundUpDiv(OMP_MIN_SIZE, ColorGroupSize)*ColorGroupSize;

/*--- Loop over element colors. ---*/
for (auto color : ElemColoring)
{
SU2_OMP_FOR_DYN(chunkSize)
for(auto k = 0ul; k < color.size; ++k) {

auto iElem = color.indices[k];
#else
/*--- Natural coloring. ---*/
{
for (auto iElem = 0ul; iElem < nElement; iElem++) {
#endif
Member Author:

The coloring is then used like this; the body of the loop is almost unchanged, apart from the use of local variables and picking one CElement and one CNumerics per thread.
When compiling without OpenMP support, everything reverts to the natural way of looping over elements, and no overhead or indirection is introduced by coloring.

Contributor:

Is this linked to the number of threads of the preconditioners, or totally independent?

Member Author:

Totally independent.


auto Ta = element->Get_Kt_a(iNode);
for (iVar = 0; iVar < nVar; iVar++)
LinSysRes[indexNode[iNode]*nVar+iVar] -= simp_penalty*Ta[iVar];
Member Author:

In a number of places we now write directly to CSysVector / CSysMatrix instead of populating a temporary first.

Contributor:

I like it, it's cleaner like this.

Comment on lines +1441 to +1445
/*--- Reduction variable to compute the maximum von Misses stress,
* each thread uses the first element of each row, the rest of
* the row is padding to prevent false sharing. ---*/
su2activematrix auxMaxVonMisses(omp_get_max_threads(), 64/sizeof(su2double));
auxMaxVonMisses.setConstant(0.0);
Member Author:

This is probably the trickiest bit in this PR; reduction operations are tricky in general. I could clean this up a bit by using the OpenMP reduction clause, but this "manual" way is future-compatible with the AD type (OpenMP reductions are only defined for built-in types; they can be declared for other types, but that sounds like witchcraft to me).
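
For context, a minimal sketch of the padded per-thread reduction being described (the stress computation and its helper name are placeholders; the loop macro and chunk size follow the ones quoted earlier in this PR):

/*--- Hypothetical sketch: each thread only writes to its own row, which is one
 * cache line wide, so there is no false sharing between threads. ---*/
su2activematrix auxMaxVonMisses(omp_get_max_threads(), 64/sizeof(su2double));
auxMaxVonMisses.setConstant(0.0);

SU2_OMP_PARALLEL
{
  const int thread = omp_get_thread_num();

  SU2_OMP_FOR_DYN(omp_chunk_size)
  for (auto iPoint = 0ul; iPoint < nPointDomain; ++iPoint) {
    const su2double vm = ComputeVonMisesStress(iPoint);   /* placeholder for the real computation */
    if (vm > auxMaxVonMisses(thread,0)) auxMaxVonMisses(thread,0) = vm;
  }
} /*--- end of the parallel region ---*/

/*--- Serial fold of the per-thread maxima (a reduction across MPI ranks would follow). ---*/
su2double maxVonMisses = 0.0;
for (int i = 0; i < omp_get_max_threads(); ++i)
  if (auxMaxVonMisses(i,0) > maxVonMisses) maxVonMisses = auxMaxVonMisses(i,0);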

app.add_flag("-d,--dryrun", dry_run, "Enable dry run mode.\n"
"Only execute preprocessing steps using a dummy geometry.");
app.add_option("-t,--threads", num_threads, "Number of OpenMP threads per MPI rank.");
Member Author:

The number of threads per rank can now be set via this command line option; it defaults to whatever is defined by the environment.
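
For illustration only, the natural way such an option would be consumed (the exact wiring in the code may differ):

/*--- Hypothetical sketch: honour the command-line thread count if given,
 * otherwise keep whatever the environment (OMP_NUM_THREADS) dictates. ---*/
#ifdef HAVE_OMP
if (num_threads > 0) omp_set_num_threads(num_threads);
#endif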

pcarruscag changed the base branch from hybrid_parallel_gradients_limiters to develop — January 18, 2020 23:04
Comment on lines +1445 to +1446
/* DESCRIPTION: Custom number of threads used for additive domain decomposition for ILU and LU_SGS (0 is "auto"). */
addUnsignedLongOption("LINEAR_SOLVER_PREC_THREADS", Linear_Solver_Prec_Threads, 0);
Member Author:

With this PR the ILU and LU_SGS preconditioners can now run with a custom number of threads specified through the config file.
For structural cases (and mesh deformation) it gets harder and harder (more linear iterations) to solve the linear systems as more processors are thrown at the problem (which is annoying in multizone problems); being able to independently control the number of preconditioner threads alleviates this.

Contributor:

I guess the default, or auto, just keeps the number of threads for the requested MPI parallelization, is that correct?

Member Author:

Yup it uses the same threads that are used everywhere else.
To answer another of your questions:

When using the mesh solver in a CFD case (so, single zone), is it possible to use different levels of parallelism for the CFD problem and for the mesh deformation problem? And, if so, is it something worth doing?

Although one could very easily change the number of threads used in all mesh deformation and FEA routines, at the moment the mesh deformation runs synchronously with the rest of the code, and so running the discretization parts with fewer threads would not be advantageous.
The area where it might be worth reducing the number of threads is those two preconditioners, if and only if:

  • The linear solver is not able to reach adequate convergence (or is diverging) due to the preconditioner being too weak (too partitioned).
  • The code is running on a very bandwidth-limited machine, in which case reducing threads a bit might give a speedup due to fewer linear iterations being required.

Other areas of the code will not suffer as much from having too many threads, since threads do not increase the number of halo points (communication costs) and allow for "automatic" load balancing.

@WallyMaier (Contributor) left a comment:

Thanks for the work @pcarruscag! This looks good to me, though my C++ isn't the best. I appreciate all the cleanup that you have done.

this->centered = NO_CENTERED;
this->upwind = NO_UPWIND;
this->space = SPACE_CENTERED;
this->space = NO_CONVECTIVE;
Contributor:

Good catch!

Member Author:

Thanks @koodlyakshay.
@rsanfer, you asked about this below; it addresses #842.

@rsanfer (Contributor) left a comment:

Hi @pcarruscag, thanks so much for this great contribution! Indeed, it addresses a problem of over-parallelization for FEA and mesh cases that has been there for a while. Kudos!

I saw tests are not passing yet and I asked a few questions here and there, but overall I'd say this is pretty much ready. Happy to approve it, but I'll wait a few days to see if there are any further comments from the community. Just a request: if possible, add one or two test cases so the implementation stays safe going forward (and, of course, so I can play around with the new features a little bit 😏).

Honestly, my knowledge of OpenMP is limited, to say the least; I have never really used it or coded with it. I guess that shows in my questions here and there, so I can't really comment on some technicalities. However, I do have some additional questions:

  • When using the mesh solver in a CFD case (so, single zone), is it possible to use different levels of parallelism for the CFD problem and for the mesh deformation problem? And, if so, is it something worth doing?
  • Should this just run "out of the box" with a working installation of OpenMP in any machine, or is there anything else fancy needed?
  • Is the previous behaviour exactly kept, or are there any modifications in the basic, non-OpenMP version of the code? (Not that I mind, just curious).

Also, just an additional (hopefully constructive) comment: I find all of these developments great, and I honestly think that you are doing an amazing job on performance and overall code improvement. However, as a non-C++-master myself, I'm just a little concerned about whether some advanced programming may become an entrance barrier to new additions to the community. I'm aware that you have been doing very well at documenting the code and the various PRs, but I'd say we should try to find a strategy to ease the learning curve for potential new developers (maybe some developer tutorials? a collection of the comments/discussions on the PRs moved to the wiki? a list of links/useful resources?).

Thanks again!

Ruben

break;
}
}
for (auto iVar = 0ul; iVar < nVar; iVar++)
Contributor:

Maybe it's due to my limited knowledge, but this declaration seems strange to me...

Member Author:

ul is for unsigned long (which I was getting tired of writing xD); otherwise auto would deduce int, which is the default type for integer literals.
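
For reference:

auto i = 0;    /* deduced as int */
auto j = 0ul;  /* deduced as unsigned long, matching the point/element counters */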

/*--- Populate ---*/
for(iElem=0; iElem<nElem; ++iElem)

SU2_OMP_PARALLEL
Contributor:

I'm not really familiar with OpenMP. What is the behaviour if the code is compiled without OpenMP?

Member Author:

"Business as usual": there should be no runtime overhead. No coloring is applied to the loops, the SU2_OMP macros are disabled, and the "omp_xxx" functions are constexpr, so the compiler should optimize them away very easily.

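A sketch of the kind of fallback definitions being described (the real ones live in the OpenMP wrapper header from #830; names and signatures here are illustrative):

/*--- Illustrative fallbacks when building without OpenMP: the macros expand to
 * nothing and the omp_* queries become compile-time constants that the
 * optimizer folds away. ---*/
#ifndef HAVE_OMP
inline constexpr int omp_get_max_threads(void) { return 1; }
inline constexpr int omp_get_thread_num(void)  { return 0; }
#define SU2_OMP_PARALLEL
#define SU2_OMP_FOR_DYN(chunk)
#endif
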
Disp_nP1 = nodes->GetSolution(iPoint);
/*--- Coordinates of the current point at n+1, n, & n-1 time levels. ---*/

const su2double* Disp_nM1 = nodes->GetSolution_time_n1(iPoint);
Contributor:

Is this for efficiency reasons?

Member Author:

see big reply

@pcarruscag (Member Author) commented Jan 23, 2020:

Thank you for the thorough review @rsanfer! I'll reply to your main questions and some of the smaller ones here to centralize things.

Just a request: if possible, add one or two test cases so the implementation stays safe going forward (and, of course, so I can play around with the new features a little bit).

The test cases are the same; no changes there other than the optional config option introduced above. When the hybrid stuff covers most of the code I would add an entire build configuration, e.g. BaseMPIOMP, and a corresponding test-case suite.

  • Should this just run "out of the box" with a working installation of OpenMP in any machine, or is there anything else fancy needed?

I would leave it to the community to decide what the defaults should be. For a lot of new users that don't run on clusters, probably just calling SU2_CFD and not having to worry about MPI would be nice (a lot of the issues on CFD Online are MPI-related).

  • Is the previous behaviour exactly kept, or are there any modifications in the basic, non-OpenMP version of the code? (Not that I mind, just curious).

Other than the algorithmic (but mathematically equivalent) changes introduced to limiters and gradients in #834, yes.

What's the advantage of having one numerics term per thread?

It is a requirement: we need to write data into the numerics before using them, and multiple threads cannot write to the same location (i.e. the internal structures of CNumerics), therefore one instance per thread is required.

...Also, I think I missed the point where the numerics container is extended beyond MAX_TERMS

The allocation of space for one numerics per thread is done above in line 1995 of my 21 Dec 2019 comment: ...MAX_TERMS*omp_get_max_threads()....
The instantiation of one numerics per thread is then done by executing the rest of the preprocessing in parallel and, instead of using XYZ_TERM, using XYZ_TERM+offset, where offset = thread_id * MAX_TERMS.
I think someone (maybe Tim) mentioned that we could revisit the ownership relations of the numerics classes, i.e. allocate them as members of their respective solvers; if we do that, we can think of having a purpose-built container that automates the per-thread creation and access.
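
As a rough illustration of that idea (purely hypothetical, not part of this PR):

#include <vector>

/*--- Hypothetical per-thread numerics container: one CNumerics* per
 * (thread, term) pair, with each calling thread indexing only its own slice. ---*/
class CNumericsPerThread {
  std::vector<CNumerics*> data;   /* laid out as [thread][term] */
  int nTerms = 0;
public:
  void allocate(int numTerms) {
    nTerms = numTerms;
    data.assign(size_t(numTerms)*omp_get_max_threads(), nullptr);
  }
  CNumerics*& operator[](int term) {
    return data[omp_get_thread_num()*nTerms + term];  /* this thread's slice */
  }
  ~CNumericsPerThread() { for (auto* p : data) delete p; }
};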

Why are they redefined each time inside the loop?
Is this for efficiency reasons?

Referring to variables being declared inside loops: one stylistic reason is that declaring everything at the top of a function is the C way of doing things; the C++ people whose books/blogs I've read and talks I've watched recommend keeping scopes (the inside of the loop being one) as clean as possible.
The only reason not to do this is if you explicitly want re-use; in the case of trivial types this does not improve efficiency, and in the context of OpenMP code it can create issues. Just like we need one numerics per thread, if we declare variables outside a parallel loop the default OpenMP behaviour is to consider them shared, and concurrent writes to shared locations = gdb and many bad words xD.
EDIT: I should mention that if the parallel region is started before the variable declarations, they become local and all is well, with the exception of class members; those will be shared most of the time (this is where const correctness can give some peace of mind).
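
A generic illustration of that point (not SU2 code; names are made up):

su2double aux[3];            /* declared before the parallel region: shared by default */

SU2_OMP_PARALLEL
{
  /* Writing to "aux" here would be a data race, every thread sees the same array. */

  su2double local[3];        /* declared inside the region: private, one copy per thread */

  SU2_OMP_FOR_DYN(chunkSize)
  for (auto iElem = 0ul; iElem < nElement; ++iElem) {
    local[0] = SomePerElementQuantity(iElem);   /* hypothetical helper; safe, "local" is private */
    /* ... */
  }
}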

Also, just an additional (hopefully constructive) comment: I find all of these developments great, and I honestly think that you are doing an amazing job on performance and overall code improvement. However, as a non-C++-master myself, I'm just a little concerned about whether some advanced programming may become an entrance barrier to new additions to the community.

As I wrote in the preamble of #789:
"But please participate even if you never heard of these topics, your opinion about readability and "developability" of the code is important! I think the code-style should be accessible to people starting a PhD (after they read a bit about C++...)."
I try to encapsulate and hide the tricky bits to keep the code as readable as possible; whether I am succeeding or not is for the community to decide. In all these PRs I've been pointing to the areas I think are trickier; if someone, anyone, feels they are absolutely incomprehensible please say something... either here, through Slack, or by email (I think it shows in the commits); I understand not everyone is keen on GitHub exposure.

I'm aware that you have been doing very well at documenting the code and the various PRs, but I'd say we should try to find a strategy to ease the learning curve for potential new developers (maybe some developer tutorials? a collection of the comments/discussions on the PRs moved to the wiki? a list of links/useful resources?).

I agree with documenting broad design decisions (that is the intent of #789) and with developer tutorials (how to implement a new X) once we are content with the restructurings, otherwise they will quickly go outdated... or actually...
We should probably first think about the answers to "how to implement a new X" and restructure/refactor as a function of that.
Based on previous efforts at keeping wikis updated while code is being developed, I much prefer this GitHub style where you can clearly tell what version of the code the comments refer to. A collection of comments/discussions organized by topic and linked to a feature is somewhat what I had in mind when I opened a "big PR" (#824) with little branches such as this one; I can try to complete that with a list of links/useful resources, references as it were. Good idea!

pcarruscag merged commit 43f01fb into develop — Jan 24, 2020
pcarruscag deleted the hybrid_parallel_fea_solver branch — January 24, 2020 13:55
jblueh mentioned this pull request — Mar 1, 2021