OpenCL (Open Computing Language) is a new framework for writing programs that execute in parallel on different compute devices (such as CPUs and GPUs) from different vendors (AMD, Intel, Nvidia etc.). The framework defines a language for writing “kernels”, the functions that run on the compute devices. In this post I explain how to get started with OpenCL and how to write a small OpenCL program that computes the sum of two lists in parallel.

Installing and setting up OpenCL on your computer

First of all you need to download the newest drivers for your graphics card. This is important because OpenCL will not work if you don’t have drivers that support it.

To install OpenCL you need to download an implementation of OpenCL. The major graphics vendors Nvidia and AMD/ATI have both released implementations of OpenCL for their GPUs. These implementations come in so-called software development kits (SDKs) and often include useful tools such as a visual profiler. The next step is to download and install the SDK for the GPU you have in your computer. Note that not all graphics cards are supported. A list of supported cards can be found on the vendors’ websites.

For AMD/ATI GPUs download the AMD APP SDK (formerly known as AMD Stream SDK)
For Nvidia GPUs download the CUDA Toolkit

The installation steps differ for each SDK and the OS you are running. Follow the installation manual of the SDK carefully. Personally I use Ubuntu Linux and have an AMD 7970 graphics card. Below are some installation steps for this specific setup.

Installing OpenCL on Ubuntu Linux with AMD graphics card

To install the latest AMD drivers on Ubuntu 12.04, open Additional Drivers and install/activate the one called “ATI/AMD proprietary FGLRX graphic driver (post-release updates)”.
After that is done, restart, then download and extract the AMD APP SDK.

AMD APP SDK 2.8 includes an installer. Run this with the command:

sudo sh Install-AMD-APP.sh

Next, install the OpenCL header files:

sudo apt-get install opencl-headers

And you’re done! Note that the AMD APP SDK and its samples are located at /opt/AMDAPP.

Installing OpenCL on Ubuntu Linux with NVIDIA graphics card

Download the CUDA toolkit for Ubuntu from NVIDIA’s CUDA site. Open a terminal and run the installation file with the command:

sudo sh cudatoolkit_3.1_linux_64_ubuntu9.10.run

Download the Developer Drivers for Linux from the same website. To install them, stop X, run the installer and then start X again. To stop X use:

sudo /etc/init.d/gdm stop

Then bring up a terminal by pressing CTRL+ALT+F5, log in, navigate to where you downloaded the devdriver file and type:

sudo sh devdriver_3.1_linux_64_256.40.run

After the driver has been installed, start X again by typing:

startx

Before compiling an OpenCL application you need to add the path to the lib folder of CUDA to LD_LIBRARY_PATH like so:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Your first OpenCL program – Vector addition

To demonstrate OpenCL I explain how to perform the simple task of vector addition. Suppose we have two lists of numbers, A and B, of equal size. Vector addition adds each element of A to the element of B at the same position and stores the result at that position in a new list C of the same size. The figure below explains the operation.

Two lists A and B and the result list C of vector addition on A and B

The naive way of performing this operation is to simply loop through the list and perform the operation on one element at a time like the C++ code below:

for(int i = 0; i < LIST_SIZE; i++) {
    C[i] = A[i] + B[i];
}

This algorithm is simple but has linear time complexity, O(n), where n is the size of the list. However, since each iteration of this loop is independent of the other iterations, the operation is data parallel, meaning that all iterations can be computed simultaneously. So if we have n cores on a processor, this operation can be performed in constant time, O(1).
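
To make the independence of the iterations concrete, here is a minimal plain C sketch (no OpenCL involved). The function names add_slice and add_in_slices are my own, made up for illustration. The list is split into slices that are processed in reverse order on purpose: because no slice reads or writes another slice's elements, the order does not change the result, and each slice could just as well be handed to its own core.

```c
#include <assert.h>
#include <stddef.h>

/* Add one contiguous slice [begin, end) of the lists. A slice only
   touches its own elements of C, so slices do not depend on each other. */
void add_slice(const int *A, const int *B, int *C,
               size_t begin, size_t end) {
    for (size_t i = begin; i < end; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Hypothetical helper: split n elements into `workers` slices and
   process them one after another, last slice first. On a parallel
   device every slice could run at the same time instead. */
void add_in_slices(const int *A, const int *B, int *C,
                   size_t n, size_t workers) {
    assert(workers > 0);
    size_t per_worker = (n + workers - 1) / workers; /* ceiling division */
    for (size_t w = workers; w-- > 0; ) {
        size_t begin = w * per_worker;
        size_t end = begin + per_worker;
        if (begin > n) begin = n; /* more workers than elements */
        if (end > n) end = n;     /* last slice may be shorter */
        add_slice(A, B, C, begin, end);
    }
}
```

Calling add_in_slices(A, B, C, LIST_SIZE, 4) should fill C with the same values as the simple loop above, whatever number of workers is chosen.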

To make OpenCL perform this operation in parallel we need to make the kernel. The kernel is the function which will run on the compute device.

The kernel

The kernel is written in the OpenCL C language, which is based on C99 with some restrictions and extensions, and includes a lot of built-in math and vector functions. The kernel that performs the vector addition is defined below.

__kernel void vector_add(__global const int *A, __global const int *B, __global int *C) {
 
    // Get the index of the current element to be processed
    int i = get_global_id(0);
 
    // Do the operation
    C[i] = A[i] + B[i];
}
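
To see what get_global_id(0) stands for, it can help to emulate a kernel launch in plain C. The sketch below is not how an OpenCL runtime is implemented, and the names vector_add_item and emulate_launch are invented for this illustration. It assumes a one-dimensional launch without a global offset, where the global id of a work-item equals group index times group size plus the index inside the group. A real device would run the work-items in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Plain C stand-in for the kernel body: one call handles one element.
   `gid` plays the role of the value returned by get_global_id(0). */
void vector_add_item(size_t gid, const int *A, const int *B, int *C) {
    C[gid] = A[gid] + B[gid];
}

/* Hypothetical sequential emulation of a 1D launch: global_size
   work-items, divided into work-groups of local_size items each. */
void emulate_launch(size_t global_size, size_t local_size,
                    const int *A, const int *B, int *C) {
    /* OpenCL 1.x requires the global size to be a multiple of the
       work-group size; the emulation assumes the same. */
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; group++) {
        for (size_t local = 0; local < local_size; local++) {
            size_t gid = group * local_size + local;
            vector_add_item(gid, A, B, C);
        }
    }
}
```

With 1024 elements and groups of 64, the emulation makes 16 groups of 64 calls each, and every element of C is written exactly once.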

The host program

The host program controls the execution of kernels on the compute devices. The host program is written in C, but bindings for other languages like C++ and Python exist. The OpenCL API is defined in the cl.h header file (or opencl.h on Apple platforms). Below is the code for a host program that executes the kernel above on a compute device. I will not go into details on each step, as this is supposed to be an introductory article, but I can recommend the book “The OpenCL Programming Book” if you want to dive into the details. The main steps of a host program are as follows:

  • Get information about the platform and the devices available on the computer (line 42)
  • Select devices to use in execution (line 43)
  • Create an OpenCL context (line 47)
  • Create a command queue (line 50)
  • Create memory buffer objects (lines 53-58)
  • Transfer data (lists A and B) to memory buffers on the device (lines 61-64)
  • Create program object (line 67)
  • Load the kernel source code (lines 24-35) and compile it (line 71) (online execution), or load a precompiled binary OpenCL program (offline execution)
  • Create kernel object (line 74)
  • Set kernel arguments (lines 77-79)
  • Execute the kernel (line 84)
  • Read memory objects. In this case we read the list C from the compute device (lines 88-90)
#include <stdio.h>
#include <stdlib.h>
 
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
 
#define MAX_SOURCE_SIZE (0x100000)
 
int main(void) {
    // Create the two input vectors
    int i;
    const int LIST_SIZE = 1024;
    int *A = (int*)malloc(sizeof(int)*LIST_SIZE);
    int *B = (int*)malloc(sizeof(int)*LIST_SIZE);
    for(i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }
 
    // Load the kernel source code into the array source_str
    FILE *fp;
    char *source_str;
    size_t source_size;
 
    fp = fopen("vector_add_kernel.cl", "r");
    if (!fp) {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    source_str = (char*)malloc(MAX_SOURCE_SIZE);
    source_size = fread( source_str, 1, MAX_SOURCE_SIZE, fp);
    fclose( fp );
 
    // Get platform and device information
    cl_platform_id platform_id = NULL;
    cl_device_id device_id = NULL;   
    cl_uint ret_num_devices;
    cl_uint ret_num_platforms;
    cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
    ret = clGetDeviceIDs( platform_id, CL_DEVICE_TYPE_DEFAULT, 1, 
            &device_id, &ret_num_devices);
 
    // Create an OpenCL context
    cl_context context = clCreateContext( NULL, 1, &device_id, NULL, NULL, &ret);
 
    // Create a command queue
    cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
 
    // Create memory buffers on the device for each vector 
    cl_mem a_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            LIST_SIZE * sizeof(int), NULL, &ret);
    cl_mem b_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int), NULL, &ret);
    cl_mem c_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, 
            LIST_SIZE * sizeof(int), NULL, &ret);
 
    // Copy the lists A and B to their respective memory buffers
    ret = clEnqueueWriteBuffer(command_queue, a_mem_obj, CL_TRUE, 0,
            LIST_SIZE * sizeof(int), A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(command_queue, b_mem_obj, CL_TRUE, 0, 
            LIST_SIZE * sizeof(int), B, 0, NULL, NULL);
 
    // Create a program from the kernel source
    cl_program program = clCreateProgramWithSource(context, 1, 
            (const char **)&source_str, (const size_t *)&source_size, &ret);
 
    // Build the program
    ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
 
    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "vector_add", &ret);
 
    // Set the arguments of the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&a_mem_obj);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&b_mem_obj);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&c_mem_obj);
 
    // Execute the OpenCL kernel on the list
    size_t global_item_size = LIST_SIZE; // Process the entire lists
    size_t local_item_size = 64; // Divide work items into groups of 64
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, 
            &global_item_size, &local_item_size, 0, NULL, NULL);
 
    // Read the memory buffer C on the device to the local variable C
    int *C = (int*)malloc(sizeof(int)*LIST_SIZE);
    ret = clEnqueueReadBuffer(command_queue, c_mem_obj, CL_TRUE, 0, 
            LIST_SIZE * sizeof(int), C, 0, NULL, NULL);
 
    // Display the result to the screen
    for(i = 0; i < LIST_SIZE; i++)
        printf("%d + %d = %d\n", A[i], B[i], C[i]);
 
    // Clean up
    ret = clFlush(command_queue);
    ret = clFinish(command_queue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(a_mem_obj);
    ret = clReleaseMemObject(b_mem_obj);
    ret = clReleaseMemObject(c_mem_obj);
    ret = clReleaseCommandQueue(command_queue);
    ret = clReleaseContext(context);
    free(source_str); // Release the buffer that held the kernel source
    free(A);
    free(B);
    free(C);
    return 0;
}
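
The program above prints all 1024 sums, which is hard to check by eye. As a small optional addition, the result could be verified on the host instead. The helper below is a sketch of my own (the name count_mismatches is not part of OpenCL): it recomputes each sum on the CPU and counts the positions where the list read back from the device differs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the number of positions where
   C[i] != A[i] + B[i]. A return value of 0 means the list read back
   from the device matches the sums computed on the host. */
size_t count_mismatches(const int *A, const int *B, const int *C, size_t n) {
    assert(A != NULL && B != NULL && C != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        if (C[i] != A[i] + B[i]) {
            mismatches++;
        }
    }
    return mismatches;
}
```

In the host program it could be called right after clEnqueueReadBuffer, for example as count_mismatches(A, B, C, LIST_SIZE). With the input used here (A[i] = i and B[i] = LIST_SIZE - i) every element of C should equal 1024.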

To make OpenCL run the kernel on the GPU you can change the constant CL_DEVICE_TYPE_DEFAULT to CL_DEVICE_TYPE_GPU in line 43. To run on the CPU you can set it to CL_DEVICE_TYPE_CPU. This shows how easy OpenCL makes it to run the same program on different compute devices.

The source code for this example can be downloaded here.
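
One detail of the host program that may be worth adapting: it reads the kernel into a fixed buffer of MAX_SOURCE_SIZE bytes, so a longer kernel file would be cut off silently. The sketch below is one possible alternative, not part of the original example. The function name read_text_file is my own, and it assumes the file is a regular file whose size can be found by seeking to its end.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a newly allocated,
   NUL-terminated buffer. Returns NULL on failure. On success *out_size
   holds the number of bytes read (terminator not counted) and the
   caller must free() the buffer. */
char *read_text_file(const char *path, size_t *out_size) {
    assert(path != NULL && out_size != NULL);
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long len = ftell(fp);
    if (len < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *buf = malloc((size_t)len + 1); /* +1 for the terminator */
    if (!buf) { fclose(fp); return NULL; }

    size_t got = fread(buf, 1, (size_t)len, fp);
    fclose(fp);
    if (got != (size_t)len) { free(buf); return NULL; }

    buf[got] = '\0';
    *out_size = got;
    return buf;
}
```

In the host program, source_str and source_size could then be filled with read_text_file("vector_add_kernel.cl", &source_size). Because the buffer is NUL-terminated, the lengths argument of clCreateProgramWithSource may also be passed as NULL.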

Compiling an OpenCL program

If the OpenCL header and library files are located in their proper folders (/usr/include and /usr/lib) the following command will compile the vectorAddition program.

gcc main.c -o vectorAddition -l OpenCL

How to learn more

To learn more about OpenCL I recommend the book from Fixstars called The OpenCL Programming Book. Below are some links to useful sites with information on OpenCL: