Hybrid parallel CFEASolver and CMeshSolver#843
Conversation
for (iMGlevel = 0; iMGlevel <= config->GetnMGLevels(); iMGlevel++) {
  numerics[iMGlevel] = new CNumerics** [MAX_SOLS];
  for (iSol = 0; iSol < MAX_SOLS; iSol++)
    numerics[iMGlevel][iSol] = new CNumerics* [MAX_TERMS*omp_get_max_threads()]();
One of the major changes is that we need one numerics instance per thread; instead of adding another * I enlarged the last dimension of the numerics container.
MAX_TERMS is still 6, while MAX_TERMS_FEA was 10 before, and if I recall correctly that was to enable multi-material laws. How has this changed?
It looks like we only have 5 material models / terms:
const int FEA_TERM = 0; /*!< \brief Position of the finite element analysis terms in the numerics container array. */
const int DE_TERM = 1; /*!< \brief Position of the dielectric terms in the numerics container array. */
const int MAT_NHCOMP = 2; /*!< \brief Position of the Neo-Hookean compressible material model. */
const int MAT_IDEALDE = 3; /*!< \brief Position of the Ideal-DE material model. */
const int MAT_KNOWLES = 4; /*!< \brief Position of the Knowles material model. */
And so 6 max terms would be enough?
    break;
  }
if (!(properties_file.fail())) {
  numerics[MESH_0][FEA_SOL][MAT_NHCOMP+offset] = new CFEM_NeoHookean_Comp(nDim, nVar_FEM, config);
The container can/should then be used with an "offset" of thread*MAX_TERMS.
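For illustration, inside a parallel routine the per-thread instance would be picked up roughly like this (a sketch, not the literal PR code; only FEA_TERM and MAX_TERMS are taken from the container layout above):

#ifdef HAVE_OMP
const int offset = omp_get_thread_num()*MAX_TERMS;
#else
const int offset = 0;
#endif
/*--- Each thread works with its own CNumerics instance for the FEA term. ---*/
CNumerics* fea_numerics = numerics[FEA_TERM + offset];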
void CDriver::Numerics_Postprocessing(CNumerics *****numerics, CSolver***, CGeometry**,
                                      CConfig *config, unsigned short val_iInst) {
I simplified (a lot) the numerics postprocessing to just delete everything.
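Roughly, that amounts to something like the following (a sketch only, mirroring the allocation shown above; the exact loop bounds and container shape in the PR may differ):

for (unsigned short iMGlevel = 0; iMGlevel <= config->GetnMGLevels(); iMGlevel++) {
  for (unsigned int iSol = 0; iSol < MAX_SOLS; iSol++) {
    /*--- Deleting a nullptr is a no-op, so unused slots are harmless. ---*/
    for (unsigned int iTerm = 0; iTerm < MAX_TERMS*omp_get_max_threads(); iTerm++)
      delete numerics[val_iInst][iMGlevel][iSol][iTerm];
    delete [] numerics[val_iInst][iMGlevel][iSol];
  }
  delete [] numerics[val_iInst][iMGlevel];
}
delete [] numerics[val_iInst];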
Great idea, I was searching for it 😅😅😅
# OpenMP
omp = get_option('with-omp')

if omp
  omp_dep = dependency('openmp', language:'cpp')
endif
I updated meson to build with OpenMP support: add -Dwith-omp=true (default false) to the call to meson.py. The number of threads is still set via environment variable (see #830).
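For example, a call along the lines of ./meson.py build -Dwith-omp=true (the build directory name and any other options here are just placeholders) should enable it.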
Thanks! That should make it much easier to build.
#ifdef HAVE_OMP
  /*--- Get the element coloring. ---*/

  const auto& coloring = geometry->GetElementColoring();

  auto nColor = coloring.getOuterSize();
  ElemColoring.resize(nColor);

  for(auto iColor = 0ul; iColor < nColor; ++iColor) {
    ElemColoring[iColor].size = coloring.getNumNonZeros(iColor);
    ElemColoring[iColor].indices = coloring.innerIdx(iColor);
  }

  ColorGroupSize = 1; /// TODO: This needs to come from geometry or config

  omp_chunk_size = computeStaticChunkSize(nPointDomain, omp_get_max_threads(), OMP_MAX_SIZE);
#endif
The element coloring is obtained from CGeometry as a shallow copy at construction.
I have yet to test the performance impact of the color group size; group size 1 gives maximum parallelization but worse spatial locality.
Update: The impact of not grouping elements for coloring (size 1) is not significant for the FEA solver; not a lot of data is accessed per element relative to the computational effort, and linear algebra takes up 80%+ of the time anyway.
#ifdef HAVE_OMP
  /*--- Chunk size is at least OMP_MIN_SIZE and a multiple of the color group size. ---*/
  auto chunkSize = roundUpDiv(OMP_MIN_SIZE, ColorGroupSize)*ColorGroupSize;

  /*--- Loop over element colors. ---*/
  for (auto color : ElemColoring)
  {
    SU2_OMP_FOR_DYN(chunkSize)
    for(auto k = 0ul; k < color.size; ++k) {

      auto iElem = color.indices[k];
#else
  /*--- Natural coloring. ---*/
  {
    for (auto iElem = 0ul; iElem < nElement; iElem++) {
#endif
The coloring is then used like this; the body of the loop is almost unchanged, apart from the use of local variables and picking one CElement and one CNumerics per thread.
When compiling without OpenMP support everything reverts to the natural way of looping over elements, and no overhead or indirection is introduced by the coloring.
Is this linked to the number of threads of the preconditioners, or totally independent?
auto Ta = element->Get_Kt_a(iNode);
for (iVar = 0; iVar < nVar; iVar++)
  LinSysRes[indexNode[iNode]*nVar+iVar] -= simp_penalty*Ta[iVar];
In a number of places we now write directly to CSysVector / CSysMatrix instead of populating a temporary first.
I like it, it's cleaner like this.
/*--- Reduction variable to compute the maximum von Misses stress,
 * each thread uses the first element of each row, the rest of
 * the row is padding to prevent false sharing. ---*/
su2activematrix auxMaxVonMisses(omp_get_max_threads(), 64/sizeof(su2double));
auxMaxVonMisses.setConstant(0.0);
This is probably the trickiest bit of this PR; reduction operations in general are tricky. I could clean it up a bit by using the OpenMP reduction clause, but this "manual" way is future-compatible with the AD type (OpenMP reductions are only defined for builtin types; they can be declared for other types, but that sounds like witchcraft to me).
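For reference, the pattern in isolation looks like the stand-alone example below (purely illustrative, not SU2 code; it assumes 64-byte cache lines): each thread writes only the first entry of its own padded row, and a short serial loop finishes the reduction.

#include <omp.h>
#include <vector>
#include <algorithm>

double maxOfSquares(const std::vector<double>& x) {
  /*--- One cache line worth of doubles per thread to avoid false sharing. ---*/
  constexpr int pad = 64/sizeof(double);
  std::vector<double> partialMax(omp_get_max_threads()*pad, 0.0);

  #pragma omp parallel
  {
    const int thread = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int i = 0; i < static_cast<int>(x.size()); ++i)
      partialMax[thread*pad] = std::max(partialMax[thread*pad], x[i]*x[i]);
  }

  /*--- Serial reduction over the per-thread partial maxima. ---*/
  double result = 0.0;
  for (int t = 0; t < omp_get_max_threads(); ++t)
    result = std::max(result, partialMax[t*pad]);
  return result;
}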
app.add_flag("-d,--dryrun", dry_run, "Enable dry run mode.\n"
             "Only execute preprocessing steps using a dummy geometry.");
app.add_option("-t,--threads", num_threads, "Number of OpenMP threads per MPI rank.");
The number of threads per rank can now be set via this command line option; it defaults to whatever is defined by the environment.
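As a usage sketch (exact launcher, option order, and file names depend on your setup), something like mpirun -n 2 SU2_CFD -t 4 config.cfg would run 2 MPI ranks with 4 OpenMP threads each.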
/* DESCRIPTION: Custom number of threads used for additive domain decomposition for ILU and LU_SGS (0 is "auto"). */
addUnsignedLongOption("LINEAR_SOLVER_PREC_THREADS", Linear_Solver_Prec_Threads, 0);
With this PR the ILU and LU_SGS preconditioners can now run with a custom number of threads specified through the config file.
For structural cases (and mesh deformation) it gets harder and harder (more linear iterations) to solve the linear systems as more processors are thrown at the problem (which is annoying in multizone problems); being able to independently control the number of preconditioner threads alleviates the problem.
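For example (the numbers are arbitrary): with 16 OpenMP threads per rank, setting LINEAR_SOLVER_PREC_THREADS= 4 in the config file limits the additive domain decomposition for ILU/LU_SGS to 4 partitions (threads) per rank, while the rest of the solver still uses all 16.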
I guess the default, or auto, just keeps the number of threads for the requested MPI parallelization, is that correct?
Yup, it uses the same threads that are used everywhere else.
To answer another of your questions:
When using the mesh solver in a CFD case (so, single zone), is it possible to use different levels of parallelism for the CFD problem and for the mesh deformation problem? And, if so, is it something worth doing?
Although one could very easily change the number of threads used in all mesh deformation and FEA routines, at the moment the mesh deformation runs synchronously with the rest of the code, and so running the discretization parts with fewer threads would not be advantageous.
The area where it might be worth reducing the number of threads is those 2 preconditioners, if and only if:
- The linear solver is not able to reach adequate convergence (or is diverging) due to the preconditioner being too weak (too partitioned).
- The code is running on a very bandwidth-limited machine, in which case reducing threads a bit might give a speedup due to fewer linear iterations being required.
Other areas of the code will not suffer as much from having too many threads, since threads do not increase the number of halo points (communication costs) and allow for "automatic" load balancing.
WallyMaier
left a comment
Thanks for the work @pcarruscag! This looks good to me, though my C++ isn't the best. I appreciate all the clean up that you have done.
this->centered = NO_CENTERED;
this->upwind = NO_UPWIND;
this->space = SPACE_CENTERED;
this->space = NO_CONVECTIVE;
Thanks @koodlyakshay
@rsanfer you asked about this below; it addresses #842
rsanfer
left a comment
Hi @pcarruscag
thanks so much for this great contribution! Indeed, it addresses a problem of over-parallelization for FEA and mesh cases that has been there for a while. Kudos!
I saw tests are not passing yet and I asked a few questions here and there, but overall I'd say this is pretty much ready. Happy to approve it, but I'll wait a few days to see if there are any further comments from the community. Just a request: if possible, please add one or two test cases so the implementation stays safe onwards (and, of course, so I can play around with the new features a little bit 😏).
Honestly, my knowledge of OpenMP is limited to say the least, I have never really used it/coded with it. I guess that shows in my questions here and there, so I can't really comment on some technicalities. However, I do have some additional questions:
- When using the mesh solver in a CFD case (so, single zone), is it possible to use different levels of parallelism for the CFD problem and for the mesh deformation problem? And, if so, is it something worth doing?
- Should this just run "out of the box" with a working installation of OpenMP in any machine, or is there anything else fancy needed?
- Is the previous behaviour exactly kept, or are there any modifications in the basic, non OpenMP version of code? (Not that I mind, just curious).
Also, just an additional (hopefully constructive) comment: I find all of these developments great, and I honestly think that you are doing an amazing job on performance and overall code improvement. However, as a non-C++-master myself, I'm just a little concerned about whether some advanced programming may become an entrance barrier for new additions to the community. I'm aware that you have been doing very well at documenting the code and the various PRs, but I'd say we should try to find a strategy to ease the learning curve for potential new developers (maybe some developer tutorials? a collection of the comments/discussions from the PRs moved to the wiki? a list of links/useful resources?).
Thanks again!
Ruben
      break;
    }
  }
for (auto iVar = 0ul; iVar < nVar; iVar++)
Maybe it's due to my limited knowledge, but this declaration seems strange to me...
ul is for unsigned long (which I was getting tired of writing xD), otherwise auto will deduce int, which is the default type for integer literals.
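In other words (small illustration, not from the PR):

auto i = 0;    // deduced as int (the default type of an integer literal)
auto j = 0ul;  // deduced as unsigned long, avoiding signed/unsigned mixing in the loop condition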
/*--- Populate ---*/
for(iElem=0; iElem<nElem; ++iElem)

SU2_OMP_PARALLEL
I'm not really familiar with OpenMP. What is the behaviour if the code is compiled without OpenMP?
"Business as usual" there should be no runtime overhead, no coloring is applied for the loops, the SU2_OMP macros are disabled and the "omp_xxx" functions are constexpr so the compiler should optimize them away very easily.
Disp_nP1 = nodes->GetSolution(iPoint);
/*--- Coordinates of the current point at n+1, n, & n-1 time levels. ---*/

const su2double* Disp_nM1 = nodes->GetSolution_time_n1(iPoint);
Is this for efficiency reasons?
Thank you for the thorough review @rsanfer! I'll reply to your main questions and some of the smaller ones here to centralize things.
The testcases are the same, no changes there other than the one optional option introduced above. When the hybrid stuff covers most of the code I would add an entire build configuration, e.g. BaseMPIOMP, and a corresponding testcase suite.
I would leave it to the community to decide what the defaults should be. Probably, for a lot of new users that don't run on clusters, just calling SU2_CFD and not having to worry about MPI would be nice (a lot of the issues on CFD Online are MPI related).
Other than the algorithmic (but mathematically equivalent) changes introduced to limiters and gradients in #834, yes.
It is a requirement: we need to write data into numerics before using them, and multiple threads cannot write to the same location (i.e. the internal structures of CNumerics), therefore one instance per thread is required.
The allocation of space for one numerics per thread is done above, in line 1995 of my 21 Dec 2019 comment.
Referring to variables being declared inside loops: one stylistic reason is that declaring everything at the top of a function is the C way of doing things; the C++ people whose books/blogs I've read and talks I've watched recommend keeping scopes (the inside of a loop being one) as clean as possible.
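There is also a practical side in the context of this PR (small stand-alone illustration, not SU2 code): a variable declared inside the body of an OpenMP-parallelized loop is automatically private to each thread, whereas one declared before the parallel region is shared by default.

#include <omp.h>

void scale(double* v, int n, double factor) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    const double x = v[i]*factor;  // local to the iteration, hence thread-private
    v[i] = x;
  }
}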
As I wrote in the preamble of #789:
I agree with documentation of broad design decisions (that is the intent of #789), and with developer tutorials (how to implement a new X) once we are content with the restructurings, otherwise they will quickly go outdated... or actually...
Proposed Changes
Another snapshot from #824.
All major routines of CFEASolver (and by association CMeshSolver) now use OpenMP directives.
Parallelization is at the level of each individual routine, i.e. threads are "started" locally.
This required avoiding the use of small auxiliary array member variables, as that is not thread safe (multiple threads cannot write to the same location). In doing that I ended up doing some cleanup, and so I thought it would be better to move the files, strip spaces, etc.; it should actually make the review easier.
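To illustrate the thread-safety point (stand-alone sketch with invented names, not code from the PR): a scratch array kept as a class member is written by all threads at once, whereas an array local to the loop body gives each thread its own copy.

#include <omp.h>

struct Example {
  double scratch_member[3];              // shared by all threads: not safe in the loop below

  void process(double* data, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
      double scratch_local[3];           // thread-private: safe
      for (int d = 0; d < 3; ++d) scratch_local[d] = data[3*i + d];
      for (int d = 0; d < 3; ++d) data[3*i + d] = 2.0*scratch_local[d];
    }
  }
};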
I will document the major changes below.
Related Work
#824 #789
Depends on #834 and #830
PR Checklist