Optimization of the Jacobi iteration for discretized 2D Poisson equation
The goal of this project is to optimize the Jacobi iteration for the discretized 2D Poisson equation using periodic boundary conditions.
The baseline code is based on this GitHub repo https://github.com/haicgu/training/tree/main/HPC/hands-on/poisson2d.
The following code was developed for the EuroHPC Summit 2023.
Please note that we where given a very short time frame to complete this project. Therefore, the code is not as optimized as it could be and may look a bit messy.
Please note: While the project was made as part of the EuroHPC Masters study program and has therefore indirectly recieved funding by the European High-Performance Computing Joint Undertaking, the following presentation was created nearly 2 years later and is therefore neither supported nor funded by the EU.
All code presented here was implemented by me except otherwise stated.
The result data was generated in 2023 using the MeluXina Supercomputer.
Poisson Equation
The Poisson equation is defined as:
Where is the unknown function, is the right-hand side, and is the Laplace operator. This can be written in 2D as:
Finite Differences
In order to solve this equation numerically, the domain needs to be discretized into a grid of points. We can then approximate the second-order derivatives using finite differences.
We can use the finite differences central difference formula:
Discretized Poisson Equation
By applying the above computed formula to the Poisson equation, we get:
Compact Form
We can simplify this equation by multiplying by :
Jacobi Iteration
Given a problem of the form , we can solve it using the Jacobi iteration. Let , where is the diagonal of and is the rest of the matrix. The Jacobi iteration is then defined as:
Optimization Overview and Results
I implemented multiple different optimizations. The AVX version uses vectorized operations to compute multiple values at once on a single core. The OMP and MPI version use multiple cores to compute the problem in parallel. And the CUDA version uses the GPU to compute the problem.
Implementation
The Jacobi iteration can be implemented in C++ as follows:
This baseline code is based on this GitHub repo https://github.com/haicgu/training/tree/main/HPC/hands-on/poisson2d.
while ((e > eps) && (n < nmax))
{
e = 0.0;
for( int ix = 1; ix < (nx-1); ix++ )
{
for (int iy = 1; iy < (ny-1); iy++)
{
double d;
vp[iy*nx+ix] = -0.25 * (f[iy*nx+ix] -
(v[nx*iy + ix+1] + v[nx*iy + ix-1] +
v[nx*(iy+1) + ix ] + v[nx*(iy-1) + ix ]));
d = fabs(vp[nx*iy+ix] - v[nx*iy+ix]);
e = (d > e) ? d : e;
}
}
// Update v and compute error as well as error weight factor
...
}This code was used as baseline for implementing the following optimizations.
When running this code with a domain size of 512x512 it took 3330 seconds to converge. This is roughly 55 minutes.
Single Core Optimization using AVX
In order to optimize the Jacobi iteration without making use of multiple cores, we can use the AVX instruction set. AVX is a instructions set that features vectorized operations on higher bit registers. By using AVX, we can perform multiple double precision floating point operations in parallel.
Here is used the 256-bit version of AVX, which allows for 4 double precision floating point numbers to be processed in parallel. The inner loops have also been swapped for better cache access.
int unroll = 4;
int restX = (nx-1)-((nx-1)-unroll+1);
__m256d invfour = _mm256_set1_pd(-1/4.0f);
while ((e > eps) && (n < nmax))
{
e = 0.0;
__m256d eVec = _mm256_setzero_pd();
for (int iy = 1; iy < (ny-1); iy++)
{
int ix = 1;
for( ; ix < (nx-1)-unroll+1; ix+=unroll )
{
__m256d vecf = _mm256_loadu_pd(&f[iy * nx + ix]);
__m256d vecTop = _mm256_loadu_pd(&v[(iy - 1) * nx + ix]);
__m256d vecBot = _mm256_loadu_pd(&v[(iy + 1) * nx + ix]);
__m256d vecLeft = _mm256_loadu_pd(&v[iy * nx + ix - 1]);
__m256d vecRight = _mm256_loadu_pd(&v[iy * nx + ix + 1]);
__m256d tmp1 = _mm256_add_pd(vecTop, vecBot);
__m256d tmp2 = _mm256_add_pd(vecLeft, vecRight);
__m256d tmp3 = _mm256_add_pd(tmp1, tmp2);
__m256d tmp4 = _mm256_sub_pd(vecf, tmp3);
tmp4 = _mm256_mul_pd(invfour, tmp4);
_mm256_storeu_pd(&vp[iy * nx + ix], tmp4);
__m256d vecv = _mm256_loadu_pd(&v[iy * nx + ix]);
__m256d d = _mm256_sub_pd(tmp4, vecv);
d = _mm256_andnot_pd(_mm256_set1_pd(-0.0),d); // <-- Computes absolute value
eVec = _mm256_max_pd(eVec,d);
}
for(; ix < (nx-1); ix++ )
{
double d;
vp[iy*nx+ix] = -0.25 * (f[iy*nx+ix] - (v[nx*iy+ix+1] + v[nx*iy+ix-1] + v[nx*(iy+1) + ix ] + v[nx*(iy-1) + ix ]));
d = fabs(vp[nx*iy+ix] - v[nx*iy+ix]);
e = (d > e) ? d : e;
}
}
// eVec = _mm256_sqrt_pd(eVec);
e = (eVec[0] > e) ? eVec[0]: e;
e = (eVec[1] > e) ? eVec[1]: e;
e = (eVec[2] > e) ? eVec[2]: e;
e = (eVec[3] > e) ? eVec[3]: e;
// Update v and compute error as well as error weight factor
...
}When running this code with a domain size of 512x512 it took 22 seconds to converge. This is roughly a 150x speedup.
Multi Core Optimization using OMP
When running this code with a domain size of 512x512 it took 2.6 seconds to converge using 16 cores. This is roughly a 10x speedup when compared to the AVX version. The number of 16 cores was not chosen arbitrarily but was determined to be the optimal number of cores for the given problem size. The OMP version was not written by me, therefore the code will not be shown here.
Multi Node Optimization using MPI
Using MPI allows the code to scale to multiple cores and even multiple nodes.
In order to let mutliple MPI processes work on the same problem we can split the domain into smaller subdomains. Each MPI process then works on its own subdomain.
Each subdomain is surrounded by a halo of ghost cells, which are used to exchange information between the subdomains. This allows the subdomains to be computed independently. After each iteration, the ghost cells are updated using the neighboring subdomains.
In order to exchange the ghost cells multiple MPI_Datatypes are defined.
Multi Node Optimization using MPI and AVX
On the Meluxina supercomputer, the code was run with a domain size of 8192x8192 on 16 nodes featuring 2048 MPI processes (Hyperthreading turned off). The program took 16.5 seconds to converge.
GPU Implementation using CUDA
A CUDA implementation of the Jacobi iteration can be done relatively easily. As a GPU has a large number of cores, the Jacobi iteration can be parallelized by assigning each thread to a single point in the domain. Of course, such a simple implementation would is not very efficient but it is a good starting point. For a more in depth view into CUDA optimizations see the CNN CUDA project.
__global__
void jacobi(double *v, double *vp, double *f, int X, int Y){
int tid = threadIdx.x + blockIdx.x* blockDim.x;
// if(tid >= X*Y) return;
int x = tid % X;
int y = tid / X;
if(x == 0 || y == 0 || y >= Y-1 || x >= X-1) return;
vp[y*X+x] = -0.25 * (f[y*X+x] - (v[X*y+x+1] + v[X*y+x-1] + v[X*(y+1) + x ] + v[X*(y-1) + x ]));
}The problem is calculating the error, which requires a reduction operation. To make this easier, we can use the Thrust library.
For an example of a reduction operation using CUDA, see the CNN CUDA project.
e = thrust::transform_reduce(&ds_thrust[0], &ds_thrust[nx * ny],absolute_value<double>(), (double) 0.0, thrust::maximum<double>());
double w = thrust::transform_reduce(&v_thrust[0], &v_thrust[nx * ny],absolute_value<double>(), (double) 0.0, thrust::plus<double>());Alternative Iteration Algorithms
Instead of trying to optimize the Jacobi iteration, one can simply make use of other algorithms like by using the improved convergence of the Gauss iteration. As the goal of this project was to explicitly optimize the Jacobi iteration, the Gauss iteration will not be further discussed here.