Canny edge detector

In this post I’m going to implement the Canny edge detector algorithm, created by John Canny.

I will provide multiple implementations in CUDA:

  • The Canny edge detector, CPU approach.
  • The Canny edge detector, GPU 1D approach.
  • The Canny edge detector, GPU 2D approach.

The Canny edge detection algorithm can be broken down into 5 different steps:

  • Apply Gaussian filter to smooth the image in order to remove the noise.
  • Find the intensity gradients of the image.
  • Apply non-maximum suppression to get rid of spurious response to edge detection.
  • Apply double threshold to determine potential edges.
  • Track edge by hysteresis: Finalise the detection of edges by suppressing all the other edges that are weak and not connected to strong edges.

An in-depth explanation can be found here.

I’m not going to derive or explain the equations; that is up to you, my dear reader. What I am going to do is explain my very own implementation of this algorithm, step by step.

First we need to know where we want to go. Treating the Canny edge detector algorithm as a box, I want something like this:

cannyedgedetectorglobalprocess
Global Process

OK, the global process seems pretty straightforward: an input image goes in, some processing happens, and an output image comes out.

Inside the box, it is going to be something like this:

cannyalgorithm

Canny edge detector CPU approach

Gaussian filter for noise reduction:

Supposing my input image has dimensions width × height, and applying a 5×5 Gaussian filter:

The 5×5 Gaussian kernel (coefficients normalized by 1/159):

	2  4  5  4  2
	4  9 12  9  4
	5 12 15 12  5
	4  9 12  9  4
	2  4  5  4  2

The noise reduction of an image can then go like this:


for(i=2; i<height-2; i++)
		for(j=2; j<width-2; j++)
		{
			// Noise reduction
			NR[i*width+j] =
				 (2.0*im[(i-2)*width+(j-2)] +  4.0*im[(i-2)*width+(j-1)] +  5.0*im[(i-2)*width+(j)] +  4.0*im[(i-2)*width+(j+1)] + 2.0*im[(i-2)*width+(j+2)]
				+ 4.0*im[(i-1)*width+(j-2)] +  9.0*im[(i-1)*width+(j-1)] + 12.0*im[(i-1)*width+(j)] +  9.0*im[(i-1)*width+(j+1)] + 4.0*im[(i-1)*width+(j+2)]
				+ 5.0*im[(i  )*width+(j-2)] + 12.0*im[(i  )*width+(j-1)] + 15.0*im[(i  )*width+(j)] + 12.0*im[(i  )*width+(j+1)] + 5.0*im[(i  )*width+(j+2)]
				+ 4.0*im[(i+1)*width+(j-2)] +  9.0*im[(i+1)*width+(j-1)] + 12.0*im[(i+1)*width+(j)] +  9.0*im[(i+1)*width+(j+1)] + 4.0*im[(i+1)*width+(j+2)]
				+ 2.0*im[(i+2)*width+(j-2)] +  4.0*im[(i+2)*width+(j-1)] +  5.0*im[(i+2)*width+(j)] +  4.0*im[(i+2)*width+(j+1)] + 2.0*im[(i+2)*width+(j+2)])
				/159.0;
		}

In NR I’m storing the image after applying the Gaussian Filter.

Finding the intensity gradient of the image and Non-maximum suppression.

To find the intensity gradient of the image we compute:

G = √(Gx² + Gy²)
Θ (phi in the code) = arctan(|Gy| / |Gx|)

and to apply non-maximum suppression we need to determine whether the gradient magnitude (G) is a local maximum along the gradient direction:

  • if Θ = 0º the edge runs N-S: it is a border if G > Ge and G > Gw.
  • if Θ = 90º the edge runs E-W: it is a border if G > Gn and G > Gs.
  • if Θ = 45º the edge runs NW-SE: it is a border if G > Gnw and G > Gse.
  • if Θ = 135º the edge runs NE-SW: it is a border if G > Gne and G > Gsw.

For this step I’ll be needing NR:


for(i=2; i<height-2; i++)
		for(j=2; j<width-2; j++)
		{
			// Intensity gradient of the image
			Gx[i*width+j] =
				 (1.0*NR[(i-2)*width+(j-2)] +  2.0*NR[(i-2)*width+(j-1)] +  (-2.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
				+ 4.0*NR[(i-1)*width+(j-2)] +  8.0*NR[(i-1)*width+(j-1)] +  (-8.0)*NR[(i-1)*width+(j+1)] + (-4.0)*NR[(i-1)*width+(j+2)]
				+ 6.0*NR[(i  )*width+(j-2)] + 12.0*NR[(i  )*width+(j-1)] + (-12.0)*NR[(i  )*width+(j+1)] + (-6.0)*NR[(i  )*width+(j+2)]
				+ 4.0*NR[(i+1)*width+(j-2)] +  8.0*NR[(i+1)*width+(j-1)] +  (-8.0)*NR[(i+1)*width+(j+1)] + (-4.0)*NR[(i+1)*width+(j+2)]
				+ 1.0*NR[(i+2)*width+(j-2)] +  2.0*NR[(i+2)*width+(j-1)] +  (-2.0)*NR[(i+2)*width+(j+1)] + (-1.0)*NR[(i+2)*width+(j+2)]);

			Gy[i*width+j] =
				 ((-1.0)*NR[(i-2)*width+(j-2)] + (-4.0)*NR[(i-2)*width+(j-1)] +  (-6.0)*NR[(i-2)*width+(j)] + (-4.0)*NR[(i-2)*width+(j+1)] + (-1.0)*NR[(i-2)*width+(j+2)]
				+ (-2.0)*NR[(i-1)*width+(j-2)] + (-8.0)*NR[(i-1)*width+(j-1)] + (-12.0)*NR[(i-1)*width+(j)] + (-8.0)*NR[(i-1)*width+(j+1)] + (-2.0)*NR[(i-1)*width+(j+2)]
				+    2.0*NR[(i+1)*width+(j-2)] +    8.0*NR[(i+1)*width+(j-1)] +    12.0*NR[(i+1)*width+(j)] +    8.0*NR[(i+1)*width+(j+1)] +    2.0*NR[(i+1)*width+(j+2)]
				+    1.0*NR[(i+2)*width+(j-2)] +    4.0*NR[(i+2)*width+(j-1)] +     6.0*NR[(i+2)*width+(j)] +    4.0*NR[(i+2)*width+(j+1)] +    1.0*NR[(i+2)*width+(j+2)]);

			G[i*width+j]   = sqrtf((Gx[i*width+j]*Gx[i*width+j])+(Gy[i*width+j]*Gy[i*width+j]));	//G = √Gx²+Gy²
			phi[i*width+j] = atan2f(fabs(Gy[i*width+j]),fabs(Gx[i*width+j]));

			if(fabs(phi[i*width+j])<=PI/8 )
				phi[i*width+j] = 0;
			else if (fabs(phi[i*width+j])<= 3*(PI/8))
				phi[i*width+j] = 45;
			else if (fabs(phi[i*width+j]) <= 5*(PI/8))
				phi[i*width+j] = 90;
			else if (fabs(phi[i*width+j]) <= 7*(PI/8))
				phi[i*width+j] = 135;
			else phi[i*width+j] = 0;
	}

Double threshold and Edge tracking by hysteresis.

 

// Edge
	for(i=3; i<height-3; i++)
		for(j=3; j<width-3; j++)
		{
			pedge[i*width+j] = 0;
			if(phi[i*width+j] == 0){
				if(G[i*width+j]>G[i*width+j+1] && G[i*width+j]>G[i*width+j-1]) //edge is in N-S
					pedge[i*width+j] = 1;

			} else if(phi[i*width+j] == 45) {
				if(G[i*width+j]>G[(i+1)*width+j+1] && G[i*width+j]>G[(i-1)*width+j-1]) // edge is in NW-SE
					pedge[i*width+j] = 1;

			} else if(phi[i*width+j] == 90) {
				if(G[i*width+j]>G[(i+1)*width+j] && G[i*width+j]>G[(i-1)*width+j]) //edge is in E-W
					pedge[i*width+j] = 1;

			} else if(phi[i*width+j] == 135) {
				if(G[i*width+j]>G[(i+1)*width+j-1] && G[i*width+j]>G[(i-1)*width+j+1]) // edge is in NE-SW
					pedge[i*width+j] = 1;
			}
		}

	// Hysteresis Thresholding
	lowthres = level/2;
	hithres  = 2*(level);

	for(i=3; i<height-3; i++)
		for(j=3; j<width-3; j++)
		{
			if(G[i*width+j]>hithres && pedge[i*width+j])
				image_out[i*width+j] = 255;
			else if(pedge[i*width+j] && G[i*width+j]>=lowthres && G[i*width+j]<hithres)
				// check neighbours 3x3
				for (ii=-1;ii<=1; ii++)
					for (jj=-1;jj<=1; jj++)
						if (G[(i+ii)*width+j+jj]>hithres)
							image_out[i*width+j] = 255;
		}

The entire code implementing the CPU approach can be found on my git repository:

git clone https://adri1988@bitbucket.org/adri1988/cannyedgedetectorcpu.git

The canny edge detector GPU 1D approach.

We already know all the theory, so I’m gonna jump right into the kernels.

I’ve coded four different kernels, one for each step of the algorithm.

Noise reduction

 

__global__ void noiseReduction1d(float *d_NR, float *d_im, int height, int width)
{

	int j;
	j = blockIdx.x * blockDim.x + threadIdx.x;

	if (j>=2 && j < width-2)
		for (int i=2;i<height-2;i++){

					d_NR[i*width+j] =  	(2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
									+ 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
									+ 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
									+ 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
									+ 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
									/159.0;
	}

}
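As a sketch, a launch for this 1D kernel could look like the following (the 128 threads per block here is my assumption, not necessarily what the repository uses):

int threadsPerBlock = 128;                                   // assumed block size
int blocks = (width + threadsPerBlock - 1)/threadsPerBlock;  // one thread per column
noiseReduction1d<<<blocks, threadsPerBlock>>>(d_NR, d_im, height, width);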

Finding the intensity gradient of the image and Non-maximum suppression

 

__global__ void intensityGradienteIm1d(float *d_G, float *d_Gx, float *d_Gy ,float *d_phi ,float *d_NR ,int height, int width)
{
	int j;
	float PI = 3.141593;
	j = blockIdx.x * blockDim.x + threadIdx.x;

	if (j>=2 && j < width-2)
		for (int i=2;i<height-2;i++){
				// Intensity gradient of the image
			d_Gx[i*width+j] =
					 (1.0*d_NR[(i-2)*width+(j-2)] +  2.0*d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
					+ 4.0*d_NR[(i-1)*width+(j-2)] +  8.0*d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
					+ 6.0*d_NR[(i  )*width+(j-2)] + 12.0*d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
					+ 4.0*d_NR[(i+1)*width+(j-2)] +  8.0*d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
					+ 1.0*d_NR[(i+2)*width+(j-2)] +  2.0*d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

			d_Gy[i*width+j] =
					 ((-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
					+ (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
					+    2.0*d_NR[(i+1)*width+(j-2)] +    8.0*d_NR[(i+1)*width+(j-1)] +    12.0*d_NR[(i+1)*width+(j)] +    8.0*d_NR[(i+1)*width+(j+1)] +    2.0*d_NR[(i+1)*width+(j+2)]
					+    1.0*d_NR[(i+2)*width+(j-2)] +    4.0*d_NR[(i+2)*width+(j-1)] +     6.0*d_NR[(i+2)*width+(j)] +    4.0*d_NR[(i+2)*width+(j+1)] +    1.0*d_NR[(i+2)*width+(j+2)]);

			 __syncthreads();
			d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j]));	//G = √Gx²+Gy²
			d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]),fabs(d_Gx[i*width+j]));
			 __syncthreads();
			 	if(fabs(d_phi[i*width+j])<=PI/8 )
					d_phi[i*width+j] = 0;
				else if (fabs(d_phi[i*width+j])<= 3*(PI/8))
					d_phi[i*width+j] = 45;
				else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
					d_phi[i*width+j] = 90;
				else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
					d_phi[i*width+j] = 135;
				else d_phi[i*width+j] = 0;
		}

}

Double threshold

 

__global__ void edges1d(int *d_pedge, float *d_G,float *d_phi, int height, int width)
{
	int j;
	j = blockIdx.x * blockDim.x + threadIdx.x;

	// Edge
	if (j>=2 && j < width-2)
		for (int i=2;i<height-2;i++){
				d_pedge[i*width+j] = 0;
				if(d_phi[i*width+j] == 0){
					if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) //edge is in N-S
						d_pedge[i*width+j] = 1;

				} else if(d_phi[i*width+j] == 45) {
					if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
						d_pedge[i*width+j] = 1;

				} else if(d_phi[i*width+j] == 90) {
					if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) //edge is in E-W
						d_pedge[i*width+j] = 1;

				} else if(d_phi[i*width+j] == 135) {
					if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
						d_pedge[i*width+j] = 1;
				}
			}

}

Edge tracking by hysteresis

 

__global__ void hysteresisThresholding1d(float *d_image_out, int level ,int* d_pedge,float *d_G, int height, int width)
{

	int j;
	j = blockIdx.x * blockDim.x + threadIdx.x;

	float lowthres,hithres;

// Hysteresis Thresholding
	lowthres = level/2;
	hithres  = 2*(level);
	if (j<width)
		for (int i=0;i<height;i++)
			d_image_out[i*width+j]=0.0; // initialize output to 0
	__syncthreads();
	if (j>=3 && j < width-3)
		for (int i=3;i<height-3;i++){
			if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
				d_image_out[i*width+j] = 255;
			else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
				// check neighbours 3x3
				for (int ii=-1;ii<=1; ii++)
					for (int jj=-1;jj<=1; jj++)
						if (d_G[(i+ii)*width+j+jj]>hithres)
							d_image_out[i*width+j] = 255;
		}
}
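Putting it together, the four kernels could be chained on the host roughly like this (a sketch; blocks and threadsPerBlock as above, level being the threshold parameter):

noiseReduction1d<<<blocks, threadsPerBlock>>>(d_NR, d_im, height, width);
intensityGradienteIm1d<<<blocks, threadsPerBlock>>>(d_G, d_Gx, d_Gy, d_phi, d_NR, height, width);
edges1d<<<blocks, threadsPerBlock>>>(d_pedge, d_G, d_phi, height, width);
hysteresisThresholding1d<<<blocks, threadsPerBlock>>>(d_image_out, level, d_pedge, d_G, height, width);
cudaDeviceSynchronize(); // make sure everything finished before copying the result back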

So basically all I did was remove the inner loop, letting one thread handle each column, and let the parallel magic do the work.

You can clone my entire project:

git clone https://adri1988@bitbucket.org/adri1988/cannyedgedetectorcpuandgpu1d.git

Executing this code on a 1920×1200 image:

cpuvsgpu1d

A speedup of 5.01×, pretty cool.

The canny edge detector GPU 2D approach.

The same as the 1D approach, four kernels, but now I parallelized both loops.

Noise reduction

 

__global__ void noiseReduction2d(float *d_NR, float *d_im, int height, int width)
{

	int j,i;

	i = blockIdx.x * blockDim.x + threadIdx.x;
	j = blockIdx.y * blockDim.y + threadIdx.y;

	if (j>=2 && j<width-2 && i>=2 && i < height-2){
		d_NR[i*width+j] =    (2.0*d_im[(i-2)*width+(j-2)] +  4.0*d_im[(i-2)*width+(j-1)] +  5.0*d_im[(i-2)*width+(j)] +  4.0*d_im[(i-2)*width+(j+1)] + 2.0*d_im[(i-2)*width+(j+2)]
							+ 4.0*d_im[(i-1)*width+(j-2)] +  9.0*d_im[(i-1)*width+(j-1)] + 12.0*d_im[(i-1)*width+(j)] +  9.0*d_im[(i-1)*width+(j+1)] + 4.0*d_im[(i-1)*width+(j+2)]
							+ 5.0*d_im[(i  )*width+(j-2)] + 12.0*d_im[(i  )*width+(j-1)] + 15.0*d_im[(i  )*width+(j)] + 12.0*d_im[(i  )*width+(j+1)] + 5.0*d_im[(i  )*width+(j+2)]
							+ 4.0*d_im[(i+1)*width+(j-2)] +  9.0*d_im[(i+1)*width+(j-1)] + 12.0*d_im[(i+1)*width+(j)] +  9.0*d_im[(i+1)*width+(j+1)] + 4.0*d_im[(i+1)*width+(j+2)]
							+ 2.0*d_im[(i+2)*width+(j-2)] +  4.0*d_im[(i+2)*width+(j-1)] +  5.0*d_im[(i+2)*width+(j)] +  4.0*d_im[(i+2)*width+(j+1)] + 2.0*d_im[(i+2)*width+(j+2)])
							/159.0;

	}

}
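A 2D launch for these kernels could look like this (the 16×16 block shape is my assumption; note that in these kernels the x dimension maps to rows i and the y dimension to columns j):

dim3 threadsPerBlock(16, 16); // assumed block shape
dim3 numBlocks((height + threadsPerBlock.x - 1)/threadsPerBlock.x,
               (width  + threadsPerBlock.y - 1)/threadsPerBlock.y);
noiseReduction2d<<<numBlocks, threadsPerBlock>>>(d_NR, d_im, height, width);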
 

Finding the intensity gradient of the image and Non-maximum suppression

 

__global__ void intensityGradienteIm2d(float *d_G, float *d_Gx, float *d_Gy, float *d_phi, float *d_NR, int height, int width)
{
	int j,i;
	float PI = 3.141593;
	i = blockIdx.x * blockDim.x + threadIdx.x;
	j = blockIdx.y * blockDim.y + threadIdx.y;

	if (j>=2 && j<width-2 && i>=2 && i < height-2)
			{
				// Intensity gradient of the image
			d_Gx[i*width+j] =
					 (1.0*d_NR[(i-2)*width+(j-2)] +  2.0*d_NR[(i-2)*width+(j-1)] +  (-2.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
					+ 4.0*d_NR[(i-1)*width+(j-2)] +  8.0*d_NR[(i-1)*width+(j-1)] +  (-8.0)*d_NR[(i-1)*width+(j+1)] + (-4.0)*d_NR[(i-1)*width+(j+2)]
					+ 6.0*d_NR[(i  )*width+(j-2)] + 12.0*d_NR[(i  )*width+(j-1)] + (-12.0)*d_NR[(i  )*width+(j+1)] + (-6.0)*d_NR[(i  )*width+(j+2)]
					+ 4.0*d_NR[(i+1)*width+(j-2)] +  8.0*d_NR[(i+1)*width+(j-1)] +  (-8.0)*d_NR[(i+1)*width+(j+1)] + (-4.0)*d_NR[(i+1)*width+(j+2)]
					+ 1.0*d_NR[(i+2)*width+(j-2)] +  2.0*d_NR[(i+2)*width+(j-1)] +  (-2.0)*d_NR[(i+2)*width+(j+1)] + (-1.0)*d_NR[(i+2)*width+(j+2)]);

			d_Gy[i*width+j] =
					 ((-1.0)*d_NR[(i-2)*width+(j-2)] + (-4.0)*d_NR[(i-2)*width+(j-1)] +  (-6.0)*d_NR[(i-2)*width+(j)] + (-4.0)*d_NR[(i-2)*width+(j+1)] + (-1.0)*d_NR[(i-2)*width+(j+2)]
					+ (-2.0)*d_NR[(i-1)*width+(j-2)] + (-8.0)*d_NR[(i-1)*width+(j-1)] + (-12.0)*d_NR[(i-1)*width+(j)] + (-8.0)*d_NR[(i-1)*width+(j+1)] + (-2.0)*d_NR[(i-1)*width+(j+2)]
					+    2.0*d_NR[(i+1)*width+(j-2)] +    8.0*d_NR[(i+1)*width+(j-1)] +    12.0*d_NR[(i+1)*width+(j)] +    8.0*d_NR[(i+1)*width+(j+1)] +    2.0*d_NR[(i+1)*width+(j+2)]
					+    1.0*d_NR[(i+2)*width+(j-2)] +    4.0*d_NR[(i+2)*width+(j-1)] +     6.0*d_NR[(i+2)*width+(j)] +    4.0*d_NR[(i+2)*width+(j+1)] +    1.0*d_NR[(i+2)*width+(j+2)]);

			d_G[i*width+j]   = sqrtf((d_Gx[i*width+j]*d_Gx[i*width+j])+(d_Gy[i*width+j]*d_Gy[i*width+j]));	//G = √Gx²+Gy²
			d_phi[i*width+j] = atan2f(fabs(d_Gy[i*width+j]),fabs(d_Gx[i*width+j]));

			 	if(fabs(d_phi[i*width+j])<=PI/8 )
					d_phi[i*width+j] = 0;
				else if (fabs(d_phi[i*width+j])<= 3*(PI/8))
					d_phi[i*width+j] = 45;
				else if (fabs(d_phi[i*width+j]) <= 5*(PI/8))
					d_phi[i*width+j] = 90;
				else if (fabs(d_phi[i*width+j]) <= 7*(PI/8))
					d_phi[i*width+j] = 135;
				else d_phi[i*width+j] = 0;
		}

}
 

Double threshold

 

__global__ void edges2d(int *d_pedge, float *d_G,float *d_phi, int height, int width)
{
	int j,i;
	i = blockIdx.x * blockDim.x + threadIdx.x;
	j = blockIdx.y * blockDim.y + threadIdx.y;

	// Edge
	if (j>=2 && j<width-2 && i>=2 && i < height-2){
				d_pedge[i*width+j] = 0;
				if(d_phi[i*width+j] == 0){
					if(d_G[i*width+j]>d_G[i*width+j+1] && d_G[i*width+j]>d_G[i*width+j-1]) //edge is in N-S
						d_pedge[i*width+j] = 1;

				} else if(d_phi[i*width+j] == 45) {
					if(d_G[i*width+j]>d_G[(i+1)*width+j+1] && d_G[i*width+j]>d_G[(i-1)*width+j-1]) // edge is in NW-SE
						d_pedge[i*width+j] = 1;

				} else if(d_phi[i*width+j] == 90) {
					if(d_G[i*width+j]>d_G[(i+1)*width+j] && d_G[i*width+j]>d_G[(i-1)*width+j]) //edge is in E-W
						d_pedge[i*width+j] = 1;

				} else if(d_phi[i*width+j] == 135) {
					if(d_G[i*width+j]>d_G[(i+1)*width+j-1] && d_G[i*width+j]>d_G[(i-1)*width+j+1]) // edge is in NE-SW
						d_pedge[i*width+j] = 1;
				}
			}

}
 

Edge tracking by hysteresis

 

__global__ void hysteresisThresholding2d(float *d_image_out, int level ,int* d_pedge,float *d_G, int height, int width)
{

	int j,i;
	i = blockIdx.x * blockDim.x + threadIdx.x;
	j = blockIdx.y * blockDim.y + threadIdx.y;

	float lowthres,hithres;

// Hysteresis Thresholding
	lowthres = level/2;
	hithres  = 2*(level);
	if (j<width && i<height)
		d_image_out[i*width+j]=0.0; // initialize output to 0
	__syncthreads();
	if (j>=3 && j<width-3 && i>=3 && i<height-3)
		{
			if(d_G[i*width+j]>hithres && d_pedge[i*width+j])
				d_image_out[i*width+j] = 255;
			else if(d_pedge[i*width+j] && d_G[i*width+j]>=lowthres && d_G[i*width+j]<hithres)
				// check neighbours 3x3
				for (int ii=-1;ii<=1; ii++)
					for (int jj=-1;jj<=1; jj++)
						if (d_G[(i+ii)*width+j+jj]>hithres)
							d_image_out[i*width+j] = 255;
		}
}
 

Now, with both loops executing in parallel, things get interesting.

You can clone the 2D approach:

git clone https://adri1988@bitbucket.org/adri1988/cannyedgedetectotgpu2d.git

Executing this code on a 1920×1200 image:

cpuvsgpu1d2d

The speedup is even bigger now: 9.4× faster than the CPU and about 2× faster than the 1D version.

 


Reduction on CUDA

In this post I’m going to solve a common problem in CUDA: a reduction.

I’ll be using the algorithms posted by Mark Harris in “Optimizing Parallel Reduction in CUDA”.

We know that performing a reduction operation on a GPU is not very efficient, and trying to write the most optimal code can be very difficult and unintuitive.

The post by Mark Harris only gives the algorithms but is not quite clear on how to use them, so that is what I’m going to do here.

Let us solve the following integral

∫₀¹ 4/(1 + x²) dx = π

This is not a calculus class, but in the old days this was solved like this:

rectangulos

The sum of all the rectangle areas gives an approximate solution to the integral, which is none other than π. And yes, you are indeed right, this is a reduction problem.
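Concretely, using the midpoint rule with n rectangles:

π ≈ (1/n) · Σ 4/(1 + xᵢ²),  for i = 0..n-1, with xᵢ = (i + 0.5)/n

which is exactly what the kernel below computes: each thread produces one rectangle height, and the 1/n width factor is applied once at the very end of the reduction.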

So… how are we going to tackle this one?

First we need to calculate all the rectangle areas. Since no rectangle depends on any other, this is perfect for our GPU.

__global__ void calcRectangles(float * rectangles,int n){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x;

        if (i >= 0 && i < n){
                x = (i + 0.5) / n;
                rectangles[i] = 4.0 / (1.0 + x*x);
        }
}

where, in the input:

float * rectangles is the array to store all the areas.

int n is the precision; the bigger n is, the more accurate the solution will be.

This kernel will give us an array of float values; this result will be our input to the reduce kernel.
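As a sketch, the host side could prepare and launch it like this (the variable names here are mine):

int n = 1 << 21; // number of rectangles
float *d_rect;
cudaMalloc((void**)&d_rect, n * sizeof(float));
calcRectangles<<<(n + 31) / 32, 32>>>(d_rect, n); // 32 threads per block, as in the reductions below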

Now let us prepare for the real deal

For each reduction I can only use 32 threads (a limitation of my graphics card), so in order to perform large reductions I needed to use recursive kernel calls.

For example, to reduce an array of 2^21 elements

reduction-process
one iteration of the loop calling all the kernels

In the image we can see how an array of 2^21 elements is reduced to an array of 2^16 elements.

Now we repeat the same process with the new array, resulting in a 2^11 array, and so on until it has 31 elements or fewer.

float calcReductionGeneral(int n, float * rectangles,int total , int type){

	clock_t start; 	//vars to measure time
	clock_t end;
	float secondsTotal = 0;

	if (n<32){ //If the length of the array is less than 32, we make the reduction on the host

		start = clock();

		float pi = 0; // must be initialized before accumulating
		for (int i=0;i<n;i++)
				pi = pi + rectangles[i];
		pi = pi / total;
		end = clock();
		secondsTotal = ((float)(end - start) / CLOCKS_PER_SEC);
		printf ("pi = %f " ,pi);

		return secondsTotal;

	}
	else{ //reduction on GPU

		float * resParcial = (float*)malloc(sizeof(float)*n/32); //partial results: one float per 32-element chunk

		for (int i=0;i<n/32;i++){  

		 /*i.e. if we have 64 rectangles we are going to make 2 kernel calls:
	  			  first call with elements 0..31,
	 			  second call with elements 32..63 */

			float * inputParcial = (float*)malloc(sizeof(float)*32); //partial input to the kernel allocation
			float * input_d;
			gpuErrchk(cudaMalloc((void**)&input_d, 32*sizeof(float))); //GPU allocation
			copia(inputParcial,rectangles,i);//copy the relative 32 elements of the rectangles to the partial input

			/*  i.e  if we have 64 elements in the first loop iteration we copy the 0..31 elements and in the second iteration from 32..63 elements
			*/
			cudaMemcpy(input_d, inputParcial, sizeof(float)*32, cudaMemcpyHostToDevice);
			//We move the parcial input to the GPU

			float * output;
			gpuErrchk(cudaMalloc((void**)&output, 32*sizeof(float)));	//allocating memory on GPU
			float * resParcial132 = (float*)malloc(sizeof(float)*32);   //allocating memory for parcial result

			start = clock();//we measure time
			switch (type){  //calling the kernel, using the type parameter
				case 0: reduce0 <<<1,32,32*sizeof(float)>>>(input_d, output);	break; // third launch parameter is the shared memory size in bytes
				case 1: reduce1 <<<1,32,32*sizeof(float)>>>(input_d, output);	break;
				case 2: reduce2 <<<1,32,32*sizeof(float)>>>(input_d, output);	break;
				default: break;
			}
			end = clock();
			secondsTotal = secondsTotal + ((float)(end - start) / CLOCKS_PER_SEC);

			gpuErrchk (cudaThreadSynchronize());
			gpuErrchk (cudaMemcpy(resParcial132,  output, sizeof(float)*32, cudaMemcpyDeviceToHost)); // we move the result to the host
			//printf ("resParcial132[0] = %f" ,resParcial132[0]);
			gpuErrchk (cudaThreadSynchronize());
			resParcial[i]= resParcial132[0];

			//clean up
			free(inputParcial);
			free(resParcial132);
			cudaFree(input_d);
			cudaFree(output);	

		}

		return calcReductionGeneral((int)n/32,resParcial,total,type) + secondsTotal;  //recursive call.
	}

}
__global__ void reduce0(float *g_idata, float *g_odata) {
	extern __shared__ float sdata[];
	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	sdata[tid] = g_idata[i];
    __syncthreads();

	// do reduction in shared mem

	for
		(unsigned int s = 1; s < blockDim.x; s *= 2) {
		if (tid % (2 * s) == 0) {
			sdata[tid] += sdata[tid + s];

		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
__global__ void reduce1(float *g_idata, float *g_odata) {
	extern __shared__ float sdata[];
	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	sdata[tid] = g_idata[i];
    __syncthreads();

	for (unsigned int s=1; s < blockDim.x; s *= 2) {
		int index = 2 * s * tid;
		if (index < blockDim.x) {
			sdata[index] += sdata[index + s];
		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

__global__ void reduce2(float *g_idata, float *g_odata) {
	extern __shared__ float sdata[];
	// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
	sdata[tid] = g_idata[i];
	__syncthreads();

	for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
		if (tid < s) {
			sdata[tid] += sdata[tid + s];
		}
	__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

where reduce0, reduce1, and reduce2 are exactly the same algorithms used by Mark Harris.
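The gpuErrchk macro used in calcReductionGeneral is not shown here; a common definition (my assumption, the repository may define it differently) is:

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort) exit(code);
	}
}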

Now, something to point out:

pi = 3.141593 Time reduce0: 0.456775 seconds
pi = 3.141593 Time reduce1: 0.459540 seconds
pi = 3.141593 Time reduce2: 0.458745 seconds

The times of each algorithm for 4M elements are pretty much the same, whereas in the article “Optimizing Parallel Reduction in CUDA” the speedup from reduce0 to reduce1 is 2.33×.

So… yeah… I think they are selling a little bit of hot air here. BUT, my implementation is very different from Nvidia’s: I try to use all the threads all the time, and in their case they don’t, sticking to 128 threads because that is where they achieve peak performance, and so on.

The main structure of the program is the same, a tree-based kernel call applied recursively, and still the results are very different.

You can clone my entire code:

git clone https://adri1988@bitbucket.org/adri1988/picudareduction.git

Multiplying two matrices in CUDA using BLAS…CUBLAS

In this post I’ll be making a simple program: multiplying two matrices using BLAS.

We already know the basics of CUDA; if not, you can check my previous post here. For the matrix operation we are going to use the GEMM operation.

C = α·op(A)·op(B) + β·C, where α and β are scalar numbers; α is going to be 1 and β is going to be 0.
In the CUDA toolkit documentation, the GEMM operation is declared as the following:

cublasStatus_t cublasSgemm(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m,
int n,
int k,
const float *alpha,
const float *A,
int lda,
const float *B,
int ldb,
const float *beta,
float *C,
int ldc)

where

  • M is number of rows of matrix op(A) and C.
  • N is number of columns of matrix op(B) and C.
  • K is number of columns of op(A) and rows of op(B).
  • transa operation op(A) that is non- or (conj.) transpose.
  • transb operation op(B) that is non- or (conj.) transpose.
  • A array of dimensions lda x k with lda>=max(1,m) if transa == CUBLAS_OP_N and lda x m with lda>=max(1,k) otherwise.
  • B array of dimension ldb x n with ldb>=max(1,k) if transb == CUBLAS_OP_N and ldb x k with ldb>=max(1,n) otherwise.
  • lda leading dimension of the two-dimensional array used to store the matrix A.
  • ldb leading dimension of the two-dimensional array used to store matrix B.
  • ldc leading dimension of the two-dimensional array used to store matrix C.

For more information about this, like return values and tons of other operations you can check the documentation here.

Now let us go to the code:
Before starting let me declare some auxiliary functions just to make my code clearer.

 


void init_matrix(float *M, int hM, int wM, float k)
{
int i,j;

for (i=0; i<hM; i++)
    for (j=0; j<wM; j++)
        if (i==j)
            M[i*wM+j] = k*k*1.0f;
        else
            M[i*wM+j] = -k*1.0f;
}

In this first function I just initialize a matrix of dimensions hM × wM; the k value is just an arbitrary number to populate the matrix with.

void print_matrix(float *M, int hM, int wM)
{
int i,j;

for (i=0; i<hM; i++){
     for (j=0; j<wM; j++)
          printf("%4.1f ", M[i*wM+j]);
     printf("\n");
}
}

This function prints the matrix on screen.

For the multiplication, let us declare a function which accepts 6 parameters: 2 input matrices, 1 output matrix, the height and width of the first matrix, and the width of the second matrix.

void Mul(float* A, float* B, int hA, int wA, int wB, float* C){

//Now we have to allocate memory on the GPU for the 3 matrices

int size; // size in bytes of each matrix

float* Ad;
size = hA * wA * sizeof(float);
cudaMalloc((void**)&Ad, size);
cudaMemcpy(Ad,A,size,cudaMemcpyHostToDevice);

float* Bd;
size = wA * wB * sizeof(float);
cudaMalloc((void**)&Bd, size);
cudaMemcpy(Bd,B,size,cudaMemcpyHostToDevice);

// Allocate C on the device
float* Cd;
size = hA * wB * sizeof(float);
cudaMalloc((void**)&Cd, size);

//declare the alfa and beta
const float alfa = 1;
const float beta = 0;
const float *alphac = &alfa;
const float *betac = &beta;

int lda=wA,ldb=hA,ldc=wA;

In order to use the cuBLAS methods, first we need to create a handle, and when we no longer need cuBLAS we should destroy it.

For the handle creation:


cublasHandle_t handle;
cublasCreate(&handle);

The cuBLAS GEMM method:


cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, wA, wB, hA, alphac, Ad,lda,Bd,ldb, betac, Cd, ldc);

Remember that cuBLAS methods WORK ON THE GPU; that means all input matrices must be in device memory.

Handle destruction:

cublasDestroy(handle);

Bring the output back to host memory:

cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

Final touches:

cudaFree(Ad);
cudaFree(Bd);
cudaFree(Cd);

So the core of the program is finished; now we need a main method to initialise some matrices and test.


int main(int argc, char *argv[])
{
	// Matrix variables
	float *A, *B, *C;//Matrices declarations
	int hA, wA, hB, wB;//declaration of dimensions

	setbuf(stdout, NULL);

	if (argc!=4){
		printf("./exec hA hB/wA wB\n");
		exit(-1);
	}
        //transform the input to integers
	hA = atoi(argv[1]);
	hB = wA = atoi(argv[2]);
	wB = atoi(argv[3]);

	// Init A and B, malloc C
	int size_A = wA * hA;
	A = (float*)malloc(size_A*sizeof(float));
	init_matrix(A, hA, wA,2.0);

	int size_B = wB * hB;
	B = (float*)malloc(size_B*sizeof(float));
	init_matrix(B, hB, wB,1.0);

	int size_C = wB * hA;
	C = (float*)malloc(size_C*sizeof(float));

        //matrix multiplication
	Mul(A, B, hA, wA, wB, C);

        //Print on console the 3 matrices
	printf("\n\nMATRIX A\n");print_matrix(A, hA, wA);
	printf("\n\nMATRIX B\n");print_matrix(B, hB, wB);
	printf("\n\nMATRIX C\n");print_matrix(C, hA, wB);

        //free the allocated memory
	free(A);
	free(B);
	free(C);
	return (1);
}

To compile, remember to add the flags -lcudart -lcublas.
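For example, assuming the source file is called cublas_mul.cu (a name I am making up here):

nvcc cublas_mul.cu -o cublas_mul -lcudart -lcublas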

An output example of a 4×4 matrix multiplication is:

cublas_mul_example

The full code can be found on my git repository; feel free to clone:

git clone https://adri1988@bitbucket.org/adri1988/cublas_multiplication.git

 

Grids, threads… THE ULTIMATE GUIDE

 

Most users struggle a lot with this topic. In CUDA it is important to choose well how many blocks and threads we use, so let’s start:

What are grids, blocks and threads?

In CUDA we have to define what our grid is going to look like and how many threads are going to be inside each block.

The kernels are executed in grids just like the following image:

Selección_015
Blocks 0 to N-1; each block has 256 threads

 

 

Identify the dimension of the problem

There are multiple ways to do this, but let’s try to make it simple.

In CUDA, or GPU programming in general, we have from 1 to 3 dimensions; values other than those will give runtime errors.

So for example, if we have an array, our problem is a one-dimensional problem, a matrix is a two-dimensional problem and a cube/parallelepiped is a three-dimensional problem.

One dimensional problems

These are the easiest ones: we only need to work on the X axis, and basically our problem reduces to the length of the array.

Let’s say we have an array of length 1 million:

We have to decide how many blocks we want and how many threads per block we want.

First we have to remember that on most devices the maximum number of threads per block is 512 or 1024; anything bigger than this will give an error.

After deciding the threads per block (I’m going to use 1024), the grid is ceil(1M/1024) blocks.

Our declaration and call to the kernel will look something like this


dim3 gridBlocks(ceil(1000000/1024.0),1,1); // note the 1024.0: integer division would truncate before ceil rounds up

dim3 threadsPerBlock(1024,1,1);

kernel<<<gridBlocks, threadsPerBlock>>> (arg1, arg2...);

This seems pretty straightforward; the Y and Z axes are 1 since we are only working on the X axis.

 

Two dimensional problems

These might be the most common: we need to work on the X and Y axes.

Let us divide this into two examples, a square and a rectangle, the first being a special case of the second: when both sides of the rectangle are the same length, it is a square (Duh!?).

Square:

Let us assume we have a square image of A x A pixels, so we need to create a grid that minimizes the number of wasted threads in order to use them in the most efficient way.

First we need to set the size of our grid; this is a problem-specific matter, but we can look at a few examples.

For very small values of A (A >8 and A <64):

We can divide the square into 4 sub-squares; for example, if A = 16 we have the following layout:

16 x 16
2 x 2 grid with 8 x 8 threads per block

So our grid will be 2 x 2 and each grid block will have 8 x 8 threads

We can code that as follows:

dim3 gridBlocks(2,2,1);
dim3 threadsPerBlock(8,8,1);
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);

 

For medium sizes (A>64 and A < 512):

We can divide the square into 16 sub-squares:

32 x32
4 x 4 grid with 8 x 8 threads per block

 

This is 32 x 32; I’m using a small grid just for illustration purposes.

We can code this as follows:

dim3 gridBlocks(4,4,1); // 4 x 4 grid
dim3 threadsPerBlock(8,8,1);// each grid will have 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);

These examples are just illustrations, not by any means a rule; remember, the number of blocks and threads depends on the specific problem.

Rectangle

So let us assume we have an image shaped like a rectangle of A x B pixels, with A != B.

Assuming A and B are big enough for illustration purposes, I will consider two cases: A < B and A > B.

A < B:

We get the following layout

16x32
4 x 2 grid, each block has 8 x 8 threads

This can be coded as follows:

dim3 gridBlocks(4,2,1); // 4 x 2 grid
dim3 threadsPerBlock(8,8,1);// each grid will have 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);

 

A > B:

We get the following layout

32x16.
2 x 4 grid, each block has 8 x 8 threads

This can be coded as follows:

dim3 gridBlocks(2,4,1); // 2 x 4 grid
dim3 threadsPerBlock(8,8,1);// each grid will have 8 x 8 threads
kernel<<<gridBlocks, threadsPerBlock>>>(arg1, arg2...);

 


My First CUDA Program

So for this post I wanted to show you guys the “Hello World” of CUDA, and no, we are not going to printf("Hello World"), we are going to add vectors! Yeah, feeling the hype!?

So let’s start:

First, if you are a complete newbie I recommend my previous post, First steps on CUDA, to get a handle on how I work, and then come back.

Since we are all good software developers, let’s define our problem and propose a simple solution in the programming language C.

Our problem in C

We want to do something like C = A + B where A, B, C are vectors; obviously these 3 have the same length.

So:

Allocate memory for A, B and C (for simplicity, the length of the array is an input parameter, and the type will be float)

        int length;
        float *vA,*vB,*vC;

        if (argc!=2){
	        printf("./exec n \n");
	        exit(-1);
	}
        length = atoi(argv[1]);
        //Allocating memory
        vA = (float*)malloc(sizeof(float)*length);
        vB = (float*)malloc(sizeof(float)*length);
        vC = (float*)malloc(sizeof(float)*length);

Initialize A, B

/* Initialize A and B: A[i] will be its own index times 2 and B[i] its own index times 3 */
	for (int i=0;i < length; i++ ){
		vA[i]=i*2;
		vB[i]=i*3;
	}

Make the computation C = A + B

        for (int i=0;i < length ; i++ )
	        vC[i] = vA[i]+ vB[i];

Show result on console

         for (int i=0;i < length; i++ )
	        printf ("C[%i] = %f \n",i,vC[i]);

Free memory

         free(vA);
         free(vB);
         free(vC);

Seems pretty straightforward, right? It’s easy to see that the computations of C[i] and C[i+1] are completely independent, making this perfect for our GPU.

 

Our problem in CUDA

Now the fun part. It is pretty similar to the solution in C, but now we have to add the GPU things (DUH!).

To make the code easier to understand, all the host variables will have the prefix h_

and our device variables will have the prefix d_

We need to:

Allocate memory for h_A, h_B and h_C on our Host. These Variables are our vectors

        int length;
        float *h_A,*h_B,*h_C,*d_A,*d_B,*d_C;

        if (argc!=2){
	        printf("./exec n \n");
	        exit(-1);
        }
        length = atoi(argv[1]);

        //Allocating memory
        h_A = (float*)malloc(sizeof(float)*length);
        h_B = (float*)malloc(sizeof(float)*length);
        h_C = (float*)malloc(sizeof(float)*length);

Allocate memory for d_A,d_B and d_C on our Device. These variables are our Vector in the device

        cudaMalloc((void **)&d_A,sizeof(float)*length);
        cudaMalloc((void **)&d_B,sizeof(float)*length);
        cudaMalloc((void **)&d_C,sizeof(float)*length);

Initialize the host vectors h_A and h_B.

         for (int i=0;i < length;i++){
	        h_A[i]=i*2;
	        h_B[i]=i*3;
         }

Transfer the memory from  (h_A, h_B) to (d_A,d_B).

      /* CPU to GPU */
        cudaMemcpy(d_A, h_A, sizeof(float)*length, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, sizeof(float)*length, cudaMemcpyHostToDevice);	

Initialize grids and threads. Since this is a 1D problem we will only work on the X axis, making Y = Z = 1.

//Each block will have 512 threads and as many blocks as needed
       dim3 dimGrid( ceil(length/512) +1, 1, 1); //	length/512 blocks
       dim3 dimBlock(512, 1, 1);	//512 threads for each block

Implement and call the kernel.

__global__ void addVector(float *d_A, float *d_B, float *d_C,int length )
{
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < length)//we need to make sure our threads are within bounds
                d_C[index]= d_A[index]+d_B[index];
}   

//call to the kernel in the main
addVector <<< dimGrid , dimBlock >>> (d_A, d_B,d_C,length);
cudaThreadSynchronize();/*let's make sure all the threads finish here to avoid race conditions.*/

Transfer memory from d_C to h_C.

         cudaMemcpy(h_C, d_C, sizeof(float)*length, cudaMemcpyDeviceToHost);

Show Result on Console.

         for (int i=0; i<length;i++)
                 printf ("C[%i] = %f \n",i,h_C[i]);

Free host and device memories.

        free(h_A);
        free(h_B);
        free(h_C);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);

also pretty straight forward.

Now… this code shows the result on the console, but for extremely large vectors that is not helpful at all.

Let’s work that GPU! We’ll measure the time with very big vectors and compare our CUDA solution to the C solution.

First, vectors with size 4096:

CPU Execution time 0.000072 ms.
GPU Execution time 0.000270 ms.

The CPU time is way smaller than the GPU’s. Why? Because of the cudaMemcpy: actually transferring the memory back and forth takes more time than the computation itself.

Vectors size 20000

CPU Execution time 0.000347 ms.
GPU Execution time 0.000471 ms.

Very similar times, but the CPU is still ahead.

Vectors size 1000000

CPU Execution time 0.015172 ms.
GPU Execution time 0.009702 ms.

Wow, nice! Our GPU is 1.66 times faster than our CPU!

Vectors size 700000000!!

CPU Execution time 13.240476 ms.
GPU Execution time 0.035979 ms.

So for this size of vector, our GPU is about 368 times faster than our CPU (13.240476/0.035979)!! It makes sense that our GPU works better with very large amounts of data, but when there’s little of it, just don’t even bother.


First steps on CUDA

In this post I’m going to use the programming language “C” for our first steps in CUDA. Remember that “C++” is also compatible.

To start, we have to know the basic structure of any CUDA program: it is like a regular program, but in this one we have to communicate with our Nvidia GPU, keeping in mind memory allocation, memory transfers and the kernel call itself.

For example, say we want to create an array of 10 float elements in “C”. It would be something like this:

mallocArrayFloats
array of 10 float elements in C
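In code, the image shows something along these lines (a minimal sketch):

float *h_array = (float*)malloc(10 * sizeof(float)); // 10 floats in host memory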

This memory is allocated in our HOST memory and is not visible to our GPU. Now let’s do the same procedure on the GPU.

cudaMalloc

This directive will let us allocate memory in our GPU’s global memory. Contrary to before, it is only visible from our GPU, therefore we cannot use it in our HOST code.

On the Nvidia Documentation we can check how a cudaMalloc works.

cudaError_t cudaMalloc(void** devPtr, size_t size): applying this procedure we can allocate an array of 10 float elements in our GPU memory.

cudaMallocArray10Elements
10 float elements allocated in the GPU global memory
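In code, that looks roughly like this:

float *d_array;
cudaMalloc((void**)&d_array, 10 * sizeof(float)); // 10 floats in GPU global memory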

For error management we can use the returned value cudaError_t. For more information click here.

Now, let’s say we populate our array on our HOST and we want to move that array to our GPU for massively parallel computations. How can we do that?

cudaMemcpy

This directive will let us transfer memory from our host to our GPU and vice-versa.

On the Nvidia Documentation we can check how a cudaMemcpy works.

cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind), where:

void* dst is our destination.

const void* src is our source.

size_t count is the total amount of memory to transfer.

cudaMemcpyKind kind is the kind of copy we want to make, it can be from our host to our device or from our device to our host. For a better explanation click here.

For our example it will go like this:

allocateAndPopulateHostArray
Allocate host memory and populate array h_array
allocateGPUmemoryAndCpyFromHostToDevice
Allocate device memory d_array and then use cudaMemcpy to copy the memory from h_array (HOST) to d_array (DEVICE) using cudaMemcpyHostToDevice
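Put together, those two captured steps look roughly like this:

// allocate and populate the host array
float *h_array = (float*)malloc(10 * sizeof(float));
for (int i = 0; i < 10; i++) h_array[i] = i;

// allocate the device array and copy host -> device
float *d_array;
cudaMalloc((void**)&d_array, 10 * sizeof(float));
cudaMemcpy(d_array, h_array, 10 * sizeof(float), cudaMemcpyHostToDevice);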

Now we have our memory in our device ready.

So let’s recap a little. We know how to allocate and transfer memory, so now it’s time for the fun part, the kernel.

Basically the kernel is the procedure that is going to be executed in our GPU.

Kernel, grids, blocks and threads

So let’s say we want our device to add a number ‘p’ to each element in the array.

arrayNElementos+P
We add each element of the array by “p”

This seems like a problem that our GPU can easily take care of. For example, with “N” threads each thread can make 1 addition, so instead of “N” iterations in a loop, it would only be 1 instruction per thread.

How to create a kernel procedure

The kernel procedure must be declared with the directive __global__. This indicates that it is a kernel and it’s going to be executed on the GPU.

Our kernel will look something like this:

__global__ void addPtoArray(float* d_array,int lenght,int p)

{

Implementation of the kernel here:

}

And to invoke our kernel we have to use the triple chevron, define grids and threads and finally the name of the kernel.

Grids, blocks and threads

It’s better explained with an image:

gridsBlocksThreads
2d grid with 16 blocks and each block is 2d and has 4 threads

In order to call a kernel, we have to specify how many threads are going to be executed. We can manage the threads in a grid like the previous image.

The grid and blocks are specified using an structure called dim3. Each dim3 structure has 3 values (X,Y and Z), so we can create a 1d,2d or 3d grid.

Inside the kernel we have information about the grid and thread that is being executed.

blockIdx is the index of the block. In the image example, the block highlighted in green is blockIdx.x=2 and blockIdx.y=1

blockDim is the dimension of the block, i.e. the number of threads per block in each axis; in the image example blockDim.x=2 and blockDim.y=2 (the dimension of the grid itself, gridDim, is 4 x 4)

threadIdx is the index of the thread, in the image example is threadIdx.x=1 and threadIdx.y=1
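These three values are what we combine to compute the global position of a thread; for a 2D grid, for example:

int x = blockIdx.x * blockDim.x + threadIdx.x; // global column of this thread
int y = blockIdx.y * blockDim.y + threadIdx.y; // global row of this thread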

Some examples about grids, blocks and threads

gridsAndThreadsExample

Going back to our example the kernel will look something like this:

kernelExample
Kernel implementation

Now we are set to work that GPU! 

Our code will look something like this:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
//CUDA
#include <cuda.h>

//kernel implementation
__global__ void addPtoArray(float *d_array, int lenght,int p){
 int i;
 i = blockIdx.x * blockDim.x + threadIdx.x; // We calculate the current thread of our context
 if (i<lenght)//if the thread is inside the lenght of our array
     d_array[i]+=p;
}

int main(int argc, char *argv[])
{
 float *h_array, *d_array,*h_hostResult;
 int p = 20;
 int lenght=5;

 // Mallocs CPU
 h_array = (float *)malloc(sizeof(float)*lenght);
 for (int i=0; i<lenght; i++){ h_array[i] = i;}

 /* Mallocs GPU */
 cudaMalloc((void **) &d_array,sizeof(float)*lenght);

 /* CPU->GPU */
 cudaMemcpy(d_array, h_array, sizeof(float)*lenght, cudaMemcpyHostToDevice); 

 /*****************/
 /* Add Array GPU*/
 /*****************/
 dim3 GridBlocks(256);      // 256 blocks of 256 threads: far more than the 5 elements need,
 dim3 threadsPerBlock(256); // but the bounds check in the kernel keeps the extra threads idle
 addPtoArray<<<GridBlocks, threadsPerBlock>>>(d_array, lenght ,p );

 /* GPU->CPU */
 h_hostResult = (float *)malloc(sizeof(float)*lenght);
 cudaMemcpy(h_hostResult,d_array,sizeof(float)*lenght,cudaMemcpyDeviceToHost);

 /************/
 /* Results */
 /************/
 for (int i=0; i<lenght; i++)
 {
 printf("h_array[%i] = %f \n",i, h_array[i]);
 printf("h_hostResult[%i] = %f \n",i, h_hostResult[i]);

 } 

 /* Free CPU */
 free(h_hostResult);
 free(h_array);
 /* Free GPU */
 cudaFree(d_array);
 return(CUDA_SUCCESS);
}

 

Save this code on a file, let’s say addPtoArray.cu

Now to compile we do nvcc addPtoArray.cu -o addPtoArray

This will generate a file called addPtoArray

Now we just execute ./addPtoArray. It should look like this:

compileAndExecuteProgram
Compile and execute program
programResult
Result of the program. We add 20 to each element of the array