OpenCL bugs


OpenCL is a great framework; AMD's, NVIDIA's and Apple's OpenCL compilers, however, are not. I have worked with OpenCL for over four years and have experienced a lot of bugs. Here I present a list of the OpenCL bugs I have encountered, along with possible solutions/work-arounds, to help other frustrated OpenCL programmers. I created this list with a small hope that these bugs will eventually be fixed, and I will try to keep it up to date.

Apple OpenCL

The function “mix” is missing and using it results in a “could not find __gpu_mix” error when building the CL code.
Work-around: Instead of mix(x, y, a) write this: x + (y - x)*a, which is the operation mix actually performs.
System: Mac OS X 10.9 Mavericks, NVIDIA 780M GPU


AMD OpenCL

M_PI is missing, resulting in “error: identifier “M_PI” is undefined” when compiling.
Work-around: Define M_PI yourself: #define M_PI 3.14159265358979323846
System: Ubuntu 14.10, AMD APP 2.9.1, Catalyst 14.9 driver, Radeon HD5780M

Building a program from binary source which uses printf causes a segmentation fault.
Work-around: Remove printf and compile again
System: Ubuntu 14.04, AMD APP 2.9.1, Catalyst 14.9 driver, R9 290 GPU


OpenCL-OpenGL interoperability problems on AMD GPUs and Linux


After experimenting with OpenCL-OpenGL interoperability on AMD GPUs on Ubuntu Linux, I got some cryptic error messages from X (see below). This happens both for the AMD APP samples, such as SimpleGL, and for my own OpenCL implementation of Marching Cubes.

Error message:

XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0.0"
      after 28 requests (28 known processed) with 0 events remaining.

Or this message:

X Error of failed request:  BadMatch (invalid parameter attributes)
  Major opcode of failed request:  160 (GLX)
  Minor opcode of failed request:  5 (X_GLXMakeCurrent)
  Serial number of failed request:  28
  Current serial number in output stream:  28

Or something like this:

libGL error: dlopen /usr/lib/fglrx/dri/ failed (/usr/lib/fglrx/dri/
libGL error: unable to load driver:
libGL error: unable to load driver: swrast

The problem seems to be that the dynamic linker links to the wrong OpenGL libraries. When using OpenCL-OpenGL interoperability, we want to use AMD's OpenGL implementation and not Mesa's. To fix this, set the following environment variable before running your code:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/fglrx

You can add this to your .bashrc file if you want it to be permanent.

Level set segmentation on GPUs using OpenCL


Brain segmented from synthetic MR images (generated at BrainWeb) on the GPU using OpenCL and the Level Set method


The level set method is a mathematical technique for evolving contours on Cartesian grids such as images. The method works by considering a function \(\phi\), called the level set function, which has one more dimension than the Cartesian grid we want to evolve the contour on. Thus, for a 2D image the level set function defines a 3D surface, while for a 3D volume the level set function is a 4D hypersurface. For each point (x, y, z) on the grid, it defines the height h from the surface to the grid at a given time t: \(h = \phi(x,y,z,t)\)

The actual contour is defined by the zero level set: the coordinates (x,y,z) where the level set function is zero:

\(\phi(x,y,z,t) = 0\)

To move the contour, the level set function is differentiated with respect to time:

\(\frac{\partial \phi}{\partial t} = -F|\nabla \phi|\)

F is called the speed function and defines how fast and in which direction the contour moves. The speed function can be tailored for any problem. In image segmentation it is usual to model the speed function to be high at coordinates where the image has a desired intensity and visa versa. To make the contour smooth and avoid leaking into surrounding regions a curvature term (\(\kappa = \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|}\)) is often included in the speed function. A popular choice of speed function for image segmentation is:

\(F = -\alpha (\epsilon - |T - I(x,y,z)|) + (1-\alpha)\kappa(x,y,z)\)

Here \(\alpha \in [0,1]\) is a weighting parameter between the intensity term and the curvature term. The parameters T and \(\epsilon\) are used to drive the contour toward voxels with intensity in the range \(I \in [T-\epsilon,T+\epsilon]\).

Level set surface moving in the image plane. The red circles show the zero level set at various time steps. As time passes, the surface moves down through the image plane and the zero level set changes according to the shape of the surface.


The level set method is very computationally expensive because every voxel has to be updated in each iteration. However, each voxel can be updated in parallel using the same instructions, making level sets ideal for GPUs (see [2,3,4] for details on different GPU implementations). I have created a simple GPU-accelerated version of level set volume segmentation using OpenCL. The implementation uses 3D textures on the GPU to reduce memory access latency. Read more on textures in OpenCL in my previous post on Gaussian Blur using OpenCL. If you want to optimize the level set computation further, look into the narrow band, sparse field or fast marching methods (see [1] for more details).

The level set gradient \(\nabla \phi\) and the curvature \(\kappa\) have to be approximated numerically. This can be done using the upwind scheme.

The level set function has to be initialized. It is common to initialize it to the distance transform, which calculates the distance from each voxel to the initial contour. The signed distance is negative for voxels inside the initial contour and positive outside. If we use a spherical initial contour, the signed distance transform can easily be calculated in parallel for each voxel using the following equation: \(d = |\vec x - \vec c| - r\), where \(\vec x\) is the coordinate of the voxel, \(\vec c\) is the position of the center and r is the radius.

Download and run the example

The code is available on GitHub for download.

The program uses the Simple Image Processing Library (SIPL) for loading, storing and displaying the volumes. This library is dependent on GTK 2.

Below are a set of commands for downloading, compiling and running the example on Ubuntu.

# Install dependencies (OpenCL has to be installed manually)
sudo apt-get install libgtk2.0-dev
# Download
git clone git://
cd OpenCL-Level-Set-Segmentation
git submodule init
git submodule update
# Compile and run
cmake .
make
./levelSetSeg example_data/mr_brain.mhd result.mhd 100 100 100 10 2000 125 40 0.05 125 255


1. Sethian, J.A. Level Set Methods and Fast Marching Methods. Cambridge University Press.
2. Rumpf, M., Strzodka, R. Level set segmentation in graphics hardware. Proceedings 2001 International Conference on Image Processing, 1103–1106.
3. Lefohn, A., Cates, J., Whitaker, R. Interactive, GPU-based level sets for 3D segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2003, 564–572.
4. Roberts, M., Packer, J., Sousa, M.C., Mitchell, J.R. (2010). A Work-Efficient GPU Algorithm for Level Set Segmentation. Proceedings of the Conference on High Performance Graphics, 123–132.
5. BrainWeb.

Memory-mapped files using the boost library


The objective of memory-mapping files is to increase I/O performance. Memory-mapping a file creates a pointer to a segment in virtual memory, and the actual loading is performed by the operating system one page at a time. For large files, this is much faster than traditional methods in C such as fopen/fread/fwrite.

In this post, I show an example of how to use the boost iostreams library to create a memory mapped file that, unlike mmap, works for both Windows and Linux.

Start with installing the boost iostreams library. On ubuntu this is done by installing the libboost-iostreams-dev package.

sudo apt-get install libboost-iostreams-dev

The example below will create a memory mapping of 1000000 integers for the file filename.raw. The integers will be available through the pointer called data.

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main() {
    boost::iostreams::mapped_file_source file;
    int numberOfElements = 1000000;
    int numberOfBytes = numberOfElements*sizeof(int);
    file.open("filename.raw", numberOfBytes);
    // Check if file was successfully opened
    if(file.is_open()) {
        // Get pointer to the data
        const int * data = (const int *)file.data();
        // Do something with the data
        for(int i = 0; i < numberOfElements; i++)
            std::cout << data[i] << " ";
        // Remember to unmap the file
        file.close();
    } else {
        std::cout << "could not map the file filename.raw" << std::endl;
    }
}

Here is a minimal CMakeLists.txt file for compiling this example together with the boost iostreams library.

cmake_minimum_required(VERSION 2.8)
find_package(Boost COMPONENTS iostreams REQUIRED)
add_executable(memory-map main.cpp)
target_link_libraries(memory-map ${Boost_LIBRARIES})

As usual, you can download/clone the code and the sample raw file from my GitHub page.

GPU-based Gradient Vector Flow using OpenCL


Illustration of Gradient Vector Flow performed on an image. The colors represent the vector directions.

Gradient Vector Flow (GVF) is a feature-preserving diffusion of gradient information. It was originally introduced by Xu and Prince to drive snakes, or active contours, towards edges of interest in image segmentation. However, GVF is also used for detection of tubular structures and skeletonization.

I recently published an article in the Journal of Real-Time Image Processing entitled “Real-time gradient vector flow on GPUs using OpenCL”, describing an optimized OpenCL implementation of Gradient Vector Flow (GVF) that runs on GPUs and CPUs, in both 2D and 3D.

Gaussian Blur using OpenCL and the built-in Images/Textures

If used correctly, OpenCL images / textures can give you large speedups on GPUs. In this post, I’ll show you a very short example of how to use OpenCL to blur/smooth an image. The goal is to show how images/textures are used in OpenCL and the benefits of using them.

The source code can be downloaded from my GitHub page.

Measuring runtime in milliseconds using the C++ 11 chrono library


I have been playing around with the new C++ 11 standard. It includes a nice new library called chrono which includes some useful clocks and timers. Below is an example of some macros you can use to time your applications in milliseconds and print out the result. Timing can be turned off by removing the #define TIMING line. Remember to compile the program with C++11 (or C++0x) enabled. For GCC this should be:

g++ main.cpp -std=c++0x

#include <iostream>
#include <chrono>
#include <thread>
#define TIMING

#ifdef TIMING
#define INIT_TIMER auto start = std::chrono::high_resolution_clock::now();
#define START_TIMER  start = std::chrono::high_resolution_clock::now();
#define STOP_TIMER(name)  std::cout << "RUNTIME of " << name << ": " << \
    std::chrono::duration_cast<std::chrono::milliseconds>( \
            std::chrono::high_resolution_clock::now()-start \
    ).count() << " ms " << std::endl;
#else
#define INIT_TIMER
#define START_TIMER
#define STOP_TIMER(name)
#endif

int main() {
    INIT_TIMER
    std::this_thread::sleep_for(std::chrono::seconds(2));
    STOP_TIMER("sleeping for 2 seconds")
    START_TIMER
    long unsigned int b = 0;
    for(int i = 0; i < 10000000; i++) {
        b += i;
    }
    STOP_TIMER("some long loop")
}

Example output:

RUNTIME of sleeping for 2 seconds: 2000 ms 
RUNTIME of some long loop: 24 ms 

Getting started with Google Test (GTest) on Ubuntu


Google test is a framework for writing C++ unit tests. In this short post, I explain how to set it up in Ubuntu.

Start by installing the gtest development package:

sudo apt-get install libgtest-dev

Note that this package only installs the source files. You have to compile the code yourself to create the necessary library files. The source files should be located in /usr/src/gtest. Browse to this folder and use cmake to compile the library:

sudo apt-get install cmake # install cmake
cd /usr/src/gtest
sudo cmake CMakeLists.txt
sudo make
# copy or symlink libgtest.a and libgtest_main.a to your /usr/lib folder
sudo cp *.a /usr/lib

Let's say we now want to test the following simple squareRoot function:

// whattotest.cpp
#include <math.h>

double squareRoot(const double a) {
    double b = sqrt(a);
    if(b != b) { // nan check
        return -1.0;
    } else {
        return sqrt(a);
    }
}

In the following code, we create two tests that check the function using simple assertions. There exist many other assertion macros in the framework. The code contains a small main function that will run all of the tests automatically. Nice and simple!

// tests.cpp
#include "whattotest.cpp"
#include <gtest/gtest.h>

TEST(SquareRootTest, PositiveNos) {
    ASSERT_EQ(6, squareRoot(36.0));
    ASSERT_EQ(18.0, squareRoot(324.0));
    ASSERT_EQ(25.4, squareRoot(645.16));
    ASSERT_EQ(0, squareRoot(0.0));
}

TEST(SquareRootTest, NegativeNos) {
    ASSERT_EQ(-1.0, squareRoot(-15.0));
    ASSERT_EQ(-1.0, squareRoot(-0.2));
}

int main(int argc, char **argv) {
    testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}

The next step is to compile the code. I’ve set up a small CMakeLists.txt file below to compile the tests. This file locates the google test library and links it with the test application. Note that we also have to link to the pthread library or the application won’t compile.

cmake_minimum_required(VERSION 2.6)
# Locate GTest
find_package(GTest REQUIRED)
# Link runTests with what we want to test and the GTest and pthread library
add_executable(runTests tests.cpp)
target_link_libraries(runTests ${GTEST_LIBRARIES} pthread)

Compile and run the tests:

cmake CMakeLists.txt
make
./runTests

Have fun testing! You can download all of the code above from my GitHub page.


Simple Image Processing Library


I do a lot of image processing, both on 2D images and on 3D images/volumes. There exist many image processing libraries out there. Some are big and some are small, but none seems to fit my taste. ITK is one of the major image processing libraries used in my field of research, but this library is, in my opinion, extremely cumbersome. And I can't be the only one who thinks so, since an alternative called SimpleITK has been made. Many other image processing libraries try to be simple to use, but most of them don't allow you to do volume processing, which I do a lot of.

I want a library that allows me to quickly go from an algorithm concept to actual pictures on the screen, so that I can verify the results right away. So far I've been using Matlab for prototyping image processing algorithms, and it has worked quite well, but as I see it Matlab has two major problems: speed, and running computation and GUI in one thread.

Long story short, I've made my own Simple Image Processing Library (SIPL), which I now use in my research. I've added a short guide here on how to use and install it, in case anybody else feels the same and thinks this library could be of use to them as well. This small library is still in development, so if you have any feedback, suggestions, comments or bug reports, please let me know.

Main goals of the library:

  • Simple and condensed – Easy to get from an algorithm concept to pictures on the screen
  • GUI in separate thread – Display and explore images interactively while computation is still going on
  • Cross-platform – Linux, Windows and Mac compatible

Press more below to see code examples, download the library and read the install instructions.

Marching Cubes implementation using OpenCL and OpenGL


In a school project I recently created a fast implementation of Marching Cubes that uses OpenCL to extract surfaces from volumetric datasets and OpenGL to render the surfaces on screen. I wrote a paper together with my two supervisors about the implementation and presented it at the Joint Workshop on High Performance and Distributed Computing for Medical Imaging at the MICCAI 2011 conference. Our implementation achieved real-time speeds for volumes of sizes up to 512x512x512 on a standard GPU with 1GB memory. The paper entitled “Real-Time Surface Extraction and Visualization of Medical Images using OpenCL and GPUs” describing the implementation can be downloaded here. The source code of the implementation can be downloaded from my GitHub page.

