If used correctly, OpenCL images / textures can give you large speedups on GPUs. In this post, I’ll show you a very short example of how to use OpenCL to blur/smooth an image. The goal is to show how images/textures are used in OpenCL and the benefits of using them.

The source code can be download from by GitHub page.

How Gaussian blurring works

Blurring is to smooth an image as shown in the top of this post. In practice, this is done by discrete convolution of the image and a mask. Gaussian blurring entails using an approximation of the Gaussian distribution as the convolution mask. Discrete convolution performs the same set of instructions on each pixel and is thus ideal for execution on a GPU. Each pixel gets a new value that is a weighted average of its closest neighbors. Mathematically, convolution of an image I, and a Gaussian mask G, with a specific \(\sigma\), can be expressed as:

\((I * G_\sigma)(\vec v) = \frac{1}{Z} \sum^N_{a=-N} \sum^N_{b=-N} I(\vec v+(a,b)) e^{-\frac{a^2+b^2}{2\sigma^2}}\)

where Z is a normalization constant equal to the sum of the mask.

OpenCL – Images and Textures

Most modern GPUs have a separate texture cache. These texture caches exists on GPUs because a lot of video games and 3D applications use texture mapping to map an image to 3D objects to create a realistic 3D scene. Textures are simply images, either 1, 2 or 3 dimensional. When a specific pixel in the texture is requested, the GPU will store the data and the neighboring data in a special buffer that is close to where the actual calculations are performed. Unlike regular linear storage buffers which only have caching in one dimension, textures can cache neighboring data in 2 or 3 dimensions. Thus, when a pixel is requested, the neighboring pixels above and below as well as those to the left and right are cached.

Note that writing to a 3D texture from inside a kernel is not supported in OpenCL by default. It is enabled through an extension called cl_khr_3d_image_writes. AMD supports this extension, while NVIDIA does not.

Caching is one of the main features of textures, but other nice features are:

  • Interpolation – Request a point in the image and get the linear approximation of that point using the closest neighboring pixels (enabled with CLK_FILTER_LINEAR flag in sampler)
  • Data type conversion – The texture fetch units on the GPU can convert between different data types. Used with the suffixes of the read_image and write_image functions (f for float, i for integer and ui for unsigned integer).
  • Normalized data types – These data types (CL_SNORM_INT8, CL_UNORM_INT8, CL_SNORM_INT16, CL_UNORM_INT16) can reduce the amount of memory used and transferred and may give you large speedups. Use them if you are using floats with a normalized range (0.0-1.0, -1.0-1.0) and accuracy is not that important. These data types store floats as a 16 or 8 bit integer in the texture and the texture fetch unit converts it to a float with a normalized range when requested.
  • Out of bounds handling – Reading outside a texture will not create a seg fault. Instead, you can define what should be returned by the read_image function in the sampler (CLK_ADDRESS_CLAMP_TO_EDGE, CLK_ADDRESS_CLAMP, CLK_ADDRESS_NONE..).
  • Channels – Each pixel in a texture can contain 1 to 4 channels. If you want to pack several values in each pixel this can be very useful. Defined in the creation of the image. ImageFormat(CL_RGBA, CL_FLOAT) is for instance a 4 channel (RGBA: red, green, blue, alpha) image of floats.

The code

OpenCL Kernel Code

Below is the OpenCL code for the Gaussian blur kernel. The functions write_image{f|i|ui} are used to write to an image in a kernel and read_image{f|i|ui} to read. Note that the read_image functions will always return a vector of size 4 and since we only use one channel in this application we retrieve the first component (.x) read_imagef(..).x

Another thing to note, is that you can only read OR write to a an image in an kernel. You can’t read and write to the same image. If you want to use images in several kernel calls after another I suggest using a dual buffering approach.

When reading images you have to define a sampler which you can see in the first line in the code below. Normalized coordinates enables accessing the image using normalized floating point coordinates (0.0 to 1.0). CLK_ADDRESS_CLAMP_TO_EDGE means that if a pixel outside the image is requested, it will use the pixel value that is closest to the edge. The last flag CLK_FILTER_NEAREST instructs the fetch unit to fetch the pixel that is closest to the coordinate. The alternative is CLK_FILTER_LINEAR which fetches a linear approximation of the 4(2D) or 8(3D) closest pixels. For more info on samplers, see the specification.

__kernel void gaussian_blur(
        __read_only image2d_t image,
        __constant float * mask,
        __global float * blurredImage,
        __private int maskSize
    ) {
    const int2 pos = {get_global_id(0), get_global_id(1)};
    // Collect neighbor values and multiply with Gaussian
    float sum = 0.0f;
    for(int a = -maskSize; a < maskSize+1; a++) {
        for(int b = -maskSize; b < maskSize+1; b++) {
            sum += mask[a+maskSize+(b+maskSize)*(maskSize*2+1)]
                *read_imagef(image, sampler, pos + (int2)(a,b)).x;
    blurredImage[pos.x+pos.y*get_global_size(0)] = sum;

Host Code

The host code is written in C++ and depends on two small libraries (SIPL and OpenCLUtilities). The function createBlurMask creates a Gaussian blur mask with a specific sigma. The mask size N, is calculated from the sigma. Larger sigma gives a larger mask size. The memory flag CL_MEM_COPY_HOST_PTR orders OpenCL to copy the contents of the last argument pointer to the the device. Each image is created with a specific image format using the ImageFormat class. The first argument is the channel format and the second is the data type. A list of all channel formats and data types can be found here. The rest should be self-explanatory.

#include "SIPL/Core.hpp"
#include "OpenCLUtilities/openCLUtilities.hpp"
using namespace cl;
float * createBlurMask(float sigma, int * maskSizePointer) {
    int maskSize = (int)ceil(3.0f*sigma);
    float * mask = new float[(maskSize*2+1)*(maskSize*2+1)];
    float sum = 0.0f;
    for(int a = -maskSize; a < maskSize+1; a++) {
        for(int b = -maskSize; b < maskSize+1; b++) {
            float temp = exp(-((float)(a*a+b*b) / (2*sigma*sigma)));
            sum += temp;
            mask[a+maskSize+(b+maskSize)*(maskSize*2+1)] = temp;
    // Normalize the mask
    for(int i = 0; i < (maskSize*2+1)*(maskSize*2+1); i++)
        mask[i] = mask[i] / sum;
    *maskSizePointer = maskSize;
    return mask;
int main(int argc, char ** argv) {
    // Initialize SIPL
    // Load image
    SIPL::Image<float> * image = new SIPL::Image<float>("images/lena.jpg");
    // Create OpenCL context
    Context context = createCLContextFromArguments(argc, argv);
    // Compile OpenCL code
    Program program = buildProgramFromSource(context, "gaussian_blur.cl");
    // Select device and create a command queue for it
    vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    CommandQueue queue = CommandQueue(context, devices[0]);
    // Create an OpenCL Image / texture and transfer data to the device
    Image2D clImage = Image2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, ImageFormat(CL_R, CL_FLOAT), image->getWidth(), image->getHeight(), 0, image->getData());
    // Create a buffer for the result
    Buffer clResult = Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float)*image->getWidth()*image->getHeight());
    // Create Gaussian mask
    int maskSize;
    float * mask = createBlurMask(10.0f, &maskSize);
    // Create buffer for mask and transfer it to the device
    Buffer clMask = Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*(maskSize*2+1)*(maskSize*2+1), mask);
    // Run Gaussian kernel
    Kernel gaussianBlur = Kernel(program, "gaussian_blur");
    gaussianBlur.setArg(0, clImage);
    gaussianBlur.setArg(1, clMask);
    gaussianBlur.setArg(2, clResult);
    gaussianBlur.setArg(3, maskSize);
        NDRange(image->getWidth(), image->getHeight()),
    // Transfer image back to host
    float * data = image->getData();
    queue.enqueueReadBuffer(clResult, CL_TRUE, 0, sizeof(float)*image->getWidth()*image->getHeight(), data);
    // Save image to disk
    image->save("images/result.jpg", "jpeg");

Download and run the example

Download the source code at http://github.com/smistad/OpenCL-Gaussian-Blur or clone it using git (Note that the example uses two small submodules: SIPL and OpenCLUtilities. The two git submodule commands below downloads these as well):

# Download
git clone git://github.com/smistad/OpenCL-Gaussian-Blur.git
cd OpenCL-Gaussian-Blur
git submodule init
git submodule update
# Compile and run (use ./blur --device cpu/gpu to run on different devices)
cmake CMakeLists.txt
./blur --device gpu