Parallelizes local_state_space_iteration_fortran
A few insights about parallelization in Fortran MEX files
Computer: AMD Ryzen 5 2600 6-Core Processor, 16 GiB DDR4 2666 MHz
| Experiment | Number of threads | Fortran | C++ |
|---|---|---|---|
| **Previous code** | | | |
| order=3, nperiods=200, nparticles=5000, nsims=200 | 1 | 5.06 | 5.56 |
| order=3, nperiods=200, nparticles=5000, nsims=200 | 12 | 4.94 | 1.22 |
| **POSIX, allocatable arrays** | | | |
| order=3, nperiods=200, nparticles=5000, nsims=200 | 1 | 3.26 | 5.46 |
| order=3, nperiods=200, nparticles=5000, nsims=200 | 12 | 1.11 | 1.25 |
| **POSIX, contiguous pointers** | | | |
| order=3, nperiods=200, nparticles=5000, nsims=200 | 1 | 3.31 | 5.49 |
| order=3, nperiods=200, nparticles=5000, nsims=200 | 12 | 1.21 | 1.26 |
A `do concurrent` construct only allows calls to pure procedures inside the loop, and the `eval` routine is not pure.
The MEX file with coarrays compiles with `gfortran`, but the number of images cannot be set either at compile time or at run time through MATLAB. Setting the number of images at compile time would require a proprietary compiler such as Intel's `ifort`. Even then, it would not be possible to specify the number of images in `options_.threads.local_state_space_iteration_fortran`, as was done for the C++ `local_state_space_iteration` routine. MPI and OpenMP implementations seem to suffer from the same problem, in addition to compatibility problems with the built-in multi-threading of BLAS routines.
The best solution seems to rely on the POSIX C routines, which make it possible to set the number of threads through a MATLAB variable. A wrapper around `pthread_create` and `pthread_join` works well. Two implementations were tested:
- one with allocatable arrays, which requires copying the data to a global structure shared by all threads
- one with pointers, which avoids those copies
The trade-off is not clear-cut. Pointers avoid copying data, but make memory management harder. Allocatable arrays are slightly faster in the timings above (1.11 vs. 1.21 with 12 threads, 3.26 vs. 3.31 with 1 thread).
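As a rough illustration of the `pthread_create`/`pthread_join` approach, here is a minimal C sketch in which the number of threads is an ordinary run-time argument, as it would be if read from a MATLAB option. It is not the actual implementation: `chunk_t`, `worker`, `run_in_threads` and the doubling placeholder are hypothetical names chosen for the example, and the input is shared through a pointer rather than copied.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical work description: one thread handles items [begin, end). */
typedef struct {
    const double *in;   /* shared input, accessed through a pointer (no copy) */
    double *out;        /* shared output; threads write to disjoint ranges */
    size_t begin, end;
} chunk_t;

static void *worker(void *arg)
{
    chunk_t *c = arg;
    for (size_t i = c->begin; i < c->end; i++)
        c->out[i] = 2.0 * c->in[i];   /* placeholder for the per-particle work */
    return NULL;
}

/* Splits n items across nthreads threads; returns 0 on success, -1 on failure. */
int run_in_threads(const double *in, double *out, size_t n, int nthreads)
{
    if (nthreads < 1)
        return -1;

    pthread_t *tid = malloc(nthreads * sizeof *tid);
    chunk_t *chunks = malloc(nthreads * sizeof *chunks);
    if (!tid || !chunks) {
        free(tid);
        free(chunks);
        return -1;
    }

    int started = 0, rc = 0;
    for (int t = 0; t < nthreads; t++) {
        /* Contiguous ranges that together cover [0, n) exactly once. */
        chunks[t] = (chunk_t){ in, out, n * t / nthreads, n * (t + 1) / nthreads };
        if (pthread_create(&tid[t], NULL, worker, &chunks[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Wait for every thread that was actually started. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    free(tid);
    free(chunks);
    return rc;
}
```

A call such as `run_in_threads(in, out, nparticles, nthreads)` would then take `nthreads` from whatever the caller passes in. Because each thread writes to its own slice of `out`, no locking is needed in this sketch.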