Optimization of the Jacobi iteration for discretized 2D Poisson equation

The goal of this project is to optimize the Jacobi iteration for the discretized 2D Poisson equation using periodic boundary conditions.

The baseline code is based on this GitHub repo https://github.com/haicgu/training/tree/main/HPC/hands-on/poisson2d.

The following code was developed for the EuroHPC Summit 2023.

Please note that we where given a very short time frame to complete this project. Therefore, the code is not as optimized as it could be and may look a bit messy.

This project was presented at the EuroHPC Summit 2023.

Poster
Fullsize PDF
Converence Image
Source: EuroHPC Summit 2023

This project has received funding from the European High-Performance Computing Joint Undertaking under grant agreement No 101051997 and was presented at the EuroHPC Summit 2023.

Please note: While the project was made as part of the EuroHPC Masters study program and has therefore indirectly recieved funding by the European High-Performance Computing Joint Undertaking, the following presentation was created nearly 2 years later and is therefore neither supported nor funded by the EU.

All code presented here was implemented by me except otherwise stated.

The result data was generated in 2023 using the MeluXina Supercomputer.

Poisson Equation

The Poisson equation is defined as:

Δv(x,y)=f(x,y)

Where v(x,y) is the unknown function, f(x,y) is the right-hand side, and Δ=2 is the Laplace operator. This can be written in 2D as:

2v(x,y)x22v(x,y)y2=f(x,y)

Finite Differences

In order to solve this equation numerically, the domain needs to be discretized into a grid of points. We can then approximate the second-order derivatives using finite differences.

We can use the finite differences central difference formula:

f(x)f(x+h)f(xh)2h
To approximate the second-order derivative, we can apply this formula twice:
f(x)f(x+h1)f(xh)2h1
Now let x+h=x+ and xh=x, we can write this as:
f(x)f(x+)f(x)2h1
Applying the central difference formula again for both f(x+) and f(x), we get:
f(x)f(x++h2)f(x+h2)f(x+h2)+f(xh2)4h1h2
If we let the spacing for both derivatives be identical then h1=h2=h, we get:
f(x)f(x+h)2f(x)+f(xh)h2

Discretized Poisson Equation

By applying the above computed formula to the Poisson equation, we get:

2v(x,y)x2v(x+h,y)2v(x,y)+v(xh,y)h2
Or rewritten using the notation vi,j=v(xi,yj):
2v(x,y)x2vi+1,j2vi,j+vi1,jh2
If we now apply the same formula for the second derivative with respect to y, we get:
2v(x,y)y2vi,j+12vi,j+vi,j1h2
And together with the right-hand side fi,j=f(xi,yj), we get the discretized Poisson equation:
vi+1,j2vi,j+vi1,jh2vi,j+12vi,j+vi,j1h2=fi,j

Compact Form

We can simplify this equation by multiplying by h2:

vi+1,jvi,j+1+4vi,jvi1,jvi,j1=h2fi,j
This results in a stencil looking like this:
[010141010]
Now the left-hand side can be written in matrix form:
[11411][vi1,jvi,j1vi,jvi,j+1vi+1,j]=h2fi,j
Where every represents a row of zeros. While this form may look more complicated, it allows for a simpler representation if one replaces the single entry fi,j with the whole domain f.
[410141014][v1,1v1,2v1,3]=h2[f1,1f1,2f1,3]
Av=h2f

Jacobi Iteration

Given a problem of the form Ax=b, we can solve it using the Jacobi iteration. Let A=D+R, where D=diag(A) is the diagonal of A and R is the rest of the matrix. The Jacobi iteration is then defined as:

x(k+1)=D1(bRx(k))
Or written using A:
xi(k+1)=1ai,i(bijiai,jxj(k))
Applying this to the Poisson equation, we get:
v(k+1)=D1(h2fRv(k))
Because the matrix A is symmetric and highly sparse consisting only of lines of the Stencil matrix, we can simplify the Jacobi iteration to:
vi,j(k+1)=14(vi+1,j(k)+vi1,j(k)+vi,j+1(k)+vi,j1(k)+h2fi,j)

Optimization Overview and Results

I implemented multiple different optimizations. The AVX version uses vectorized operations to compute multiple values at once on a single core. The OMP and MPI version use multiple cores to compute the problem in parallel. And the CUDA version uses the GPU to compute the problem.

Runtime 8192x8192 in sec

Implementation

The Jacobi iteration can be implemented in C++ as follows:

This baseline code is based on this GitHub repo https://github.com/haicgu/training/tree/main/HPC/hands-on/poisson2d.

c
while ((e > eps) && (n < nmax))
{
	e = 0.0;
	for( int ix = 1; ix < (nx-1); ix++ )
	{
		for (int iy = 1; iy < (ny-1); iy++)
		{
			double d;

			vp[iy*nx+ix] = -0.25 * (f[iy*nx+ix] -
				(v[nx*iy     + ix+1] + v[nx*iy     + ix-1] +
					v[nx*(iy+1) + ix  ] + v[nx*(iy-1) + ix  ]));

			d = fabs(vp[nx*iy+ix] - v[nx*iy+ix]);
			e = (d > e) ? d : e;
		}
	}

	// Update v and compute error as well as error weight factor
	...
}

This code was used as baseline for implementing the following optimizations.

When running this code with a domain size of 512x512 it took 3330 seconds to converge. This is roughly 55 minutes.

Single Core Optimization using AVX

In order to optimize the Jacobi iteration without making use of multiple cores, we can use the AVX instruction set. AVX is a instructions set that features vectorized operations on higher bit registers. By using AVX, we can perform multiple double precision floating point operations in parallel.

Here is used the 256-bit version of AVX, which allows for 4 double precision floating point numbers to be processed in parallel. The inner loops have also been swapped for better cache access.

c
int unroll = 4;
int restX = (nx-1)-((nx-1)-unroll+1);
__m256d invfour = _mm256_set1_pd(-1/4.0f);
while ((e > eps) && (n < nmax))
{
	e = 0.0;
	__m256d eVec = _mm256_setzero_pd();
	for (int iy = 1; iy < (ny-1); iy++)
	{
		int ix = 1;
		for( ; ix < (nx-1)-unroll+1; ix+=unroll )
		{ 
			__m256d vecf = _mm256_loadu_pd(&f[iy * nx + ix]);
			__m256d vecTop = _mm256_loadu_pd(&v[(iy - 1) * nx + ix]);
			__m256d vecBot = _mm256_loadu_pd(&v[(iy + 1) * nx + ix]);
			__m256d vecLeft = _mm256_loadu_pd(&v[iy * nx + ix - 1]);
			__m256d vecRight = _mm256_loadu_pd(&v[iy * nx + ix + 1]);
			__m256d tmp1 = _mm256_add_pd(vecTop, vecBot);
			__m256d tmp2 = _mm256_add_pd(vecLeft, vecRight);
			__m256d tmp3 = _mm256_add_pd(tmp1, tmp2);
			__m256d tmp4 = _mm256_sub_pd(vecf, tmp3);
			tmp4 = _mm256_mul_pd(invfour, tmp4);
			_mm256_storeu_pd(&vp[iy * nx + ix], tmp4);

			__m256d vecv = _mm256_loadu_pd(&v[iy * nx + ix]);
			__m256d d = _mm256_sub_pd(tmp4, vecv);
			d = _mm256_andnot_pd(_mm256_set1_pd(-0.0),d); // <-- Computes absolute value
			eVec = _mm256_max_pd(eVec,d);

		}
		for(;  ix < (nx-1); ix++ )
		{
			double d;

			vp[iy*nx+ix] = -0.25 * (f[iy*nx+ix] - (v[nx*iy+ix+1] + v[nx*iy+ix-1] + v[nx*(iy+1) + ix  ] + v[nx*(iy-1) + ix  ]));

			d = fabs(vp[nx*iy+ix] - v[nx*iy+ix]);
			e = (d > e) ? d : e;
		}

	}

	// eVec = _mm256_sqrt_pd(eVec);
	e = (eVec[0] > e) ? eVec[0]: e;
	e = (eVec[1] > e) ? eVec[1]: e;
	e = (eVec[2] > e) ? eVec[2]: e;
	e = (eVec[3] > e) ? eVec[3]: e;

	// Update v and compute error as well as error weight factor
	...
}

When running this code with a domain size of 512x512 it took 22 seconds to converge. This is roughly a 150x speedup.

Multi Core Optimization using OMP

When running this code with a domain size of 512x512 it took 2.6 seconds to converge using 16 cores. This is roughly a 10x speedup when compared to the AVX version. The number of 16 cores was not chosen arbitrarily but was determined to be the optimal number of cores for the given problem size. The OMP version was not written by me, therefore the code will not be shown here.

Multi Node Optimization using MPI

Using MPI allows the code to scale to multiple cores and even multiple nodes.

In order to let mutliple MPI processes work on the same problem we can split the domain into smaller subdomains. Each MPI process then works on its own subdomain.

Each subdomain is surrounded by a halo of ghost cells, which are used to exchange information between the subdomains. This allows the subdomains to be computed independently. After each iteration, the ghost cells are updated using the neighboring subdomains.

In order to exchange the ghost cells multiple MPI_Datatypes are defined.

3x3 Domain seperation using 9 MPI processes

Multi Node Optimization using MPI and AVX

On the Meluxina supercomputer, the code was run with a domain size of 8192x8192 on 16 nodes featuring 2048 MPI processes (Hyperthreading turned off). The program took 16.5 seconds to converge.

GPU Implementation using CUDA

A CUDA implementation of the Jacobi iteration can be done relatively easily. As a GPU has a large number of cores, the Jacobi iteration can be parallelized by assigning each thread to a single point in the domain. Of course, such a simple implementation would is not very efficient but it is a good starting point. For a more in depth view into CUDA optimizations see the CNN CUDA project.

c
__global__
void jacobi(double *v, double *vp, double *f, int X, int Y){
	int tid = threadIdx.x + blockIdx.x* blockDim.x;
	// if(tid >= X*Y) return;
	int x = tid % X;
	int y = tid / X;
	if(x == 0 || y == 0 || y >= Y-1 || x >= X-1) return;
    vp[y*X+x] = -0.25 * (f[y*X+x] - (v[X*y+x+1] + v[X*y+x-1] + v[X*(y+1) + x  ] + v[X*(y-1) + x  ]));
}

The problem is calculating the error, which requires a reduction operation. To make this easier, we can use the Thrust library.

For an example of a reduction operation using CUDA, see the CNN CUDA project.

c
e = thrust::transform_reduce(&ds_thrust[0], &ds_thrust[nx * ny],absolute_value<double>(), (double) 0.0, thrust::maximum<double>());
double w = thrust::transform_reduce(&v_thrust[0], &v_thrust[nx * ny],absolute_value<double>(), (double) 0.0, thrust::plus<double>());

Alternative Iteration Algorithms

Instead of trying to optimize the Jacobi iteration, one can simply make use of other algorithms like by using the improved convergence of the Gauss iteration. As the goal of this project was to explicitly optimize the Jacobi iteration, the Gauss iteration will not be further discussed here.