Getting started with OpenCL and GPU Computing
OpenCL (Open Computing Language) is a new framework for writing programs that execute in parallel on different compute devices (such as CPUs and GPUs) from different vendors (AMD, Intel, ATI, Nvidia etc.). The framework defines a language to write “kernels” in. These kernels are the functions which are to run on the different compute devices. In this post I explain how to get started with OpenCL and how to make a small OpenCL program that will compute the sum of two lists in parallel.
Installing and setting up OpenCL on your computer
First of all you need to download the newest drivers to your graphics card. This is important because OpenCL will not work if you don’t have drivers that support OpenCL.
To install OpenCL you need to download an implementation of OpenCL. The major graphic vendors Nvidia and AMD/ATI have both released implementations of OpenCL for their GPUs. These implementation come in a so called software development kits and often include some useful tools such as a visual profiler. The next step is to download and install the SDK for the GPU you have on your computer. Note that not all graphic cards are supported. A list of which graphic cards are supported can be found on the vendors websites.
For AMD/ATI GPUs download the AMD APP SDK (formerly known as AMD Stream SDK)
For Nvidia GPUs download the CUDA Toolkit
The installation steps differ for each SDK and the OS you are running. Follow the installation manual of the SDK carefully. Personally I use Ubuntu Linux and have an AMD/ATI graphics card. Below are some installation steps for this specific setup.
Installing OpenCL on Ubuntu Linux with ATI/AMD graphics card
Download and extract the AMD APP SDK.
AMD APP SDK 2.5 includes an installer. Run this with the command: (WARNING: this installer will forcefully restart your computer! n00bish by AMD imo)
sudo sh Install-AMD-APP.sh
Set the LD_LIBRARY_PATH to the lib folder in the AMD APP folder:
export LD_LIBRARY_PATH=/path-to-the-amd-app-sdk-folder/lib/x86_64/
The default path for AMD APP SDK 2.5 is /opt/AMDAPP/
If you are running a 32bit version of your OS you need to set the path to x86 instead of x86_64. If you get an error message saying something like “error while loading shared libraries: libOpenCL.so: cannot open shared object file: No such file or directory” you have forgotten to set the LD_LIBRARY_PATH above.
Next you need to register the ICD (note that in AMD APP 2.5, the installer does this for you). To do this open the icd-registration.tgz from the downloaded SDK and decompress it in the root folder so that you will you have a folder at /etc/OpenCL/vendors containing two .icd files.
If you get the error “Error: clGetPlatformIDs failed. Error code : unknown error code” when trying to run the samples in the SDK you haven’t registered the ICD.
I can also recommend using the debian packages created by a forum user on AMD Developer forums
Installing OpenCL on Ubuntu Linux with NVIDIA graphics card
Download the CUDA toolkit for Ubuntu from NVIDIAs CUDA site. Open a terminal an run the installation file with the command:
sudo sh cudatoolkit_3.1_linux_64_ubuntu9.10.run
Download the Developer Drivers for Linux at the same website and install it by first stopping X, running the file and start X again. Do stop X use:
sudo /etc/init.d/gdm stop
Then get a terminal up by pressing CTRL+ALT+F5, login and navigate to where you downloaded the devdriver then type:
sudo sh devdriver_3.1_linux_64_256.40.run
After the driver has been installed start x again by typing
startx
Before compiling an OpenCL application you need to add the path to the lib folder of CUDA to LD_LIBRARY_PATH like so:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
Your first OpenCL program – Vector addition
To demonstrate OpenCL I explain how to perform the simple task of vector addition. Suppose we have two lists of numbers, A and B, of equal size. The task of vector addition is to add the elements of A with the elements of B and put the result in the element of a new list called C of the same size. The figure below explains the operation.

Two lists A and B and the result list C of vector addition on A and B
The naive way of performing this operation is to simply loop through the list and perform the operation on one element at a time like the C++ code below:
for(int i = 0; i < LIST_SIZE; i++) { C[i] = A[i] + B[i]; }
This algorithm is simple but has a linear time complexity, O(n) where n is the size of the list. But since each iteration of this loop is independent on the other iterations this operation is data parallel, meaning that each iteration can be computed simultaneously. So if we have n cores on a processor this operation can be performed in constant time O(1).
To make OpenCL perform this operation in parallel we need to make the kernel. The kernel is the function which will run on the compute device.
The kernel
The kernel is written in the OpenCL language which is a subset of C and has a lot of math and vector functions included. The kernel to perform the vector addition operation is defined below.
__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) { // Get the index of the current element to be processed int i = get_global_id(0); // Do the operation C[i] = A[i] + B[i]; }
The host program
The host program controls the execution of kernels on the compute devices. The host program is written in C, but bindings for other languages like C++ and Python exists. The OpenCL API is defined in the cl.h (or opencl.h for apple) header file. Below is the code for the host program that executes the kernel above on compute device. I will not go into details on each step as this is supposed to be an introductory article although I can recommend the book “The OpenCL Programming Book” if you want to dive into the details. The main steps of a host program is as follows:
- Get information about the platform and the devices available on the computer (line 42)
- Select devices to use in execution (line 43)
- Create an OpenCL context (line 47)
- Create a command queue (line 50)
- Create memory buffer objects(line 53-58)
- Transfer data (list A and B) to memory buffers on the device (line 61-64)
- Create program object (line 67)
- Load the kernel source code (line 24-35) and compile it (line 71) (online exeuction) or load the precompiled binary OpenCL program (offline execution)
- Create kernel object (line 74)
- Set kernel arguments (line 77-79)
- Execute the kernel (line 84)
- Read memory objects. In this case we read the list C from the compute device (line 88-90)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 | #include <stdio.h> #include <stdlib.h> #ifdef __APPLE__ #include <OpenCL/opencl.h> #else #include <CL/cl.h> #endif #define MAX_SOURCE_SIZE (0x100000) int main(void) { // Create the two input vectors int i; const int LIST_SIZE = 1024; int *A = (int*)malloc(sizeof(int)*LIST_SIZE); int *B = (int*)malloc(sizeof(int)*LIST_SIZE); for(i = 0; i < LIST_SIZE; i++) { A[i] = i; B[i] = LIST_SIZE - i; } // Load the kernel source code into the array source_str FILE *fp; char *source_str; size_t source_size; fp = fopen("vector_add_kernel.cl", "r"); if (!fp) { fprintf(stderr, "Failed to load kernel.\n"); exit(1); } source_str = (char*)malloc(MAX_SOURCE_SIZE); source_size = fread( source_str, 1, MAX_SOURCE_SIZE, fp); fclose( fp ); // Get platform and device information cl_platform_id platform_id = NULL; cl_device_id device_id = NULL; cl_uint ret_num_devices; cl_uint ret_num_platforms; cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms); ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_DEFAULT, 1, &device_id, &ret_num_devices); // Create an OpenCL context cl_context context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret); // Create a command queue cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret); // Create memory buffers on the device for each vector cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY, LIST_SIZE * sizeof(int), NULL, &ret); cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY, LIST_SIZE * sizeof(int), NULL, &ret); cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, LIST_SIZE * sizeof(int), NULL, &ret); // Copy the lists A and B to their respective memory buffers ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0, LIST_SIZE * sizeof(int), A, 0, NULL, NULL); ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0, LIST_SIZE * sizeof(int), B, 0, NULL, NULL); // Create a program from the kernel source cl_program program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret); // Build the program ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL); // Create the OpenCL kernel cl_kernel kernel = clCreateKernel(program, "vector_add", &ret); // Set the arguments of the kernel ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj); ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj); ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj); // Execute the OpenCL kernel on the list size_t global_item_size = LIST_SIZE; // Process the entire lists size_t local_item_size = 64; // Divide work items into groups of 64 ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL); // Read the memory buffer C on the device to the local variable C int *C = (int*)malloc(sizeof(int)*LIST_SIZE); ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0, LIST_SIZE * sizeof(int), C, 0, NULL, NULL); // Display the result to the screen for(i = 0; i < LIST_SIZE; i++) printf("%d + %d = %d\n", A[i], B[i], C[i]); // Clean up ret = clFlush(command_queue); ret = clFinish(command_queue); ret = clReleaseKernel(kernel); ret = clReleaseProgram(program); ret = clReleaseMemObject(a_mem_obj); ret = clReleaseMemObject(b_mem_obj); ret = clReleaseMemObject(c_mem_obj); ret = clReleaseCommandQueue(command_queue); ret = clReleaseContext(context); free(A); free(B); free(C); return 0; } |
To make OpenCL run the kernel on the GPU you can change the constant CL_DEVICE_TYPE_DEFAULT to CL_DEVICE_TYPE_GPU in line 43. To run on CPU you can set it to CL_DEVICE_TYPE_CPU. This shows how easy OpenCL makes it to run different programs on different compute devices.
The source code for this example can be downloaded here.
Compiling an OpenCL program
To compile an OpenCL program you need to include the OpenCL headers and link the OpenCL library. For instance if you have a host program in the file main.c written in C and you want to compile it with gcc you use the following commands:
gcc -c -I /path-to-the-include-dir-with-cl.h/ main.c -o main.o gcc main.o -o host -L /path-to-the-lib-folder-with-OpenCL-libfile/ -l OpenCL
For instance on my system this became:
gcc -c -Wall -I /home/smistad/Downloads/ati-stream-sdk-v2.1-lnx64/include main.c -o main.o gcc main.o -o host -L /home/smistad/Downloads/ati-stream-sdk-v2.1-lnx64/lib/x86_64 -l OpenCL
How to learn more
To learn more about OpenCL I recommend the book from Fixstars called The OpenCL programming book. Below are some links to useful sites with information on OpenCL:
Is it the case that the CPU must be sse3? I followed your blog but the OpenCL samples don’t run successfully (have sse2).
Also, when you say to decompress the registration in the root folder, I assumed you meant the sdk root folder … might want to clarify (for dummies like me) that it is the file system root folder.
Thanks for the article … nicely written.