Monday, October 29, 2012

My first steps in CUDA

c=-0.8+0.156i


Very recently I got a new Nvidia graphic card (the cheapest I could find with compute capability greater than 1.3), and I started learning the very basic of GPU programming, in hope of using this for my projects.

I considered myself as an experienced Matlab programmer but I have practically no programming experience in c++, so in reality I have to learn simultaneously c++, CUDA and openGL (for some of the exercises)

My first challenge was to setup the environment to compile the CUDA programs. I had already  Visual Studio 2010 express (VS 2010) installed, so I just followed the three basic steps to install the necessary CUDA stuff, which are described here.
Next I configured the VS2010 so that I can compile CUDA programs. Although I couldn't find any official info about that (from nvidia or VS), there are many nice people, which describe these steps, for example here.

I'm repeating them briefly emphasizing on few minor things that can be a headache if they are missed.

1) Open a new project and select win32 console application and at the very last step of the wizard, before you hit finish CHOOSE EMPTY PROJECT (I have spent hours trying to find why I cant see the CUDA C/C++ compiler just because I forgot to check that box)

2) Right after you insert your first file RENAME it from *.cpp to *.cu (again I have spent few more hours to understand why my new project complains about CUDA commands, while the previous one compiled perfectly few minutes ago)

3) Next, right click on the project folder, choose the build customizations and tick the 4.2 CUDA box (or the CUDA version you use).

4) Right click on the project folder again and choose properties. Expand Likner and highlight input. Click on the box next to additional dependencies and add the cudart.lib library.

5) Last, right click on the file you want to compile and choose properties. From the general menu select for item type CUDA C/C++ and you are ready to be in a parallel universe.

I started reading this  book which I highly recommend for beginners (note that I'm beginner in both CUDA and C++. Note also that I'm not affiliated with the author in any way).
One of the book exercises is to program the Julia Set . Unfortunately, for some still unexplained reasons I couldn’t compile the provided code, so I decided to start from scratch and write a more simplistic version of the code without using classes and fancy data structures, although I realize that their programing style is actually match better than mine.

Of course to run this program you need to install openGL.
To compile it with VS2010 you need download the freeglut. (For example I choose the MSVC version from here). Unzip it anywhere and link the VS to that folder (right click on the solution, properties -> Linker ->General->Additional Library Directories add the .../freeglut/lib. Last you need to copy the freeglut.dll file into the folder where the executable is created, for example into the .../Debug folder (Unfortunately I haven't found any better way to overcome this, so I copy this file every time I need to use openGL)
 
The following program requires three inputs. (Note that if you dont give all the three inputs you get an error). First an integer number with the image resolution, second the real part of the constant complex and third the imaginary part of the constant complex. You have to be careful when choosing the complex constant number as you don't always get something nice and many times you get nothing but a white screen. For the figure on top for example I used the famous constant -0.8+0.156i.


c=-0.6+0.4i

GPU Version of Julia Set


#include "windows.h"
#include <GL/gl.h>
#include <GL/glu.h>
#include <GL/glut.h>

#include <iostream>
#include <cuda_runtime.h>
#include <cuda.h>
#include <math.h>
#include <ctime>

/* because I'm primarily a matlab user matrices always start with 1. so I use the ij function to convert to zero based*/
#define ij(i,j,ni) (i-1+ni*(j-1))
#define sgn(x) (x > 0) - (x < 0)

using namespace std;
int *host_BITMAP;
int DIM;
int host_kmin = 10000;
int host_kmax=-10000;

// OpenGl functions
void display()
{
   glClear(GL_COLOR_BUFFER_BIT);
    glColor3f (1.0, 1.0, 1.0);
    glPointSize(1);
    glBegin(GL_POINTS);
    float pnt;

    for (int i=0;i<DIM;i++){
        for (int j=0;j<DIM;j++){
            pnt=(float)host_BITMAP[ij(i+1,j+1,DIM)]*(1.0-0.0)/((float)host_kmax-(float)host_kmin);
            glColor3f((GLfloat)pnt, (GLfloat)pnt, (GLfloat)pnt);
            glVertex2i((GLint)j,(GLint)i);
       }
    }
    glEnd();
    glFlush();
}

void reshape (int w, int h)
{
   glViewport (0, 0, (GLsizei) w, (GLsizei) h);
   glMatrixMode (GL_PROJECTION);
   glLoadIdentity ();
   gluOrtho2D (0.0, (GLdouble) w, 0.0, (GLdouble) h);
}

__device__ int Julia(float x, float y, float *devc1, float *devc2){
    float xnew, ynew, Z;
    int k, TF1, TF;
    TF1=0;
    for (k=0;k<200;k++){
        xnew=pow(x,2)-pow(y,2)+devc1[0];
        ynew=2*x*y+devc2[0];
        x=xnew;
        y=ynew;
        Z=sqrt(x*x+y*y);
        TF1=TF1+ max(sgn(Z-5),0)*(200-k);
        TF=max(sgn(5-Z),0);
        x=x*TF;
        y=y*TF;
    }
    return TF1;
}

__global__ void makeBitmap(int *devBITMAP, int *devDIM,
    float *devc1, float *devc2, int *dev_kmin, int *dev_kmax){//
    int tf;
    float scale=1.5;
    float x;
    float y;
    int jj = blockIdx.x;
    int ii = blockIdx.y;
    x=scale*(float)(devDIM[0]/2.0-jj)/(devDIM[0]/2.0);
    y=scale*(float)(devDIM[0]/2.0-ii)/(devDIM[0]/2.0);
    tf=Julia(x,y,devc1,devc2);
    if (tf>dev_kmax[0])
        dev_kmax[0]=tf;
    if (tf<dev_kmin[0])
        dev_kmin[0]=tf;
    devBITMAP[ij(ii+1,jj+1,devDIM[0])]=tf;
}


int main(int nNumberofArgs, char* pszArgs[]){
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount( &count );

    for (int i=0; i< count; i++) {
        printf( "--- General Information for device %d ---\n", i );
        cudaGetDeviceProperties( &prop, i );
        printf( "Name: %s\n", prop.name );
        printf( "Max threads per block: %d\n",prop.maxThreadsPerBlock );
    }

    int dev;
    cudaGetDevice( &dev );
    printf( "ID of current CUDA device: %d\n", dev );
    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 2;
    prop.minor = 0;
    cudaChooseDevice( &dev, &prop );
    printf( "ID of CUDA device closest to revision 2.0: %d\n", dev );
    cudaSetDevice( dev );

    float c1, c2;
    DIM=atoi(pszArgs[1]);
    c1=atof(pszArgs[2]);
    c2=atof(pszArgs[3]);

    cout << "The image dimension is: " << DIM << "\n";
    cout << "The real part is: " << c1 << "\n";
    cout << "The imaginary part is: " << c2 << "\n";

    host_BITMAP = new int[DIM*DIM];

    int *dev_BITMAP;
    int *dev_DIM;
    float *dev_c1;
    float *dev_c2;
    int *dev_kmin;
    int *dev_kmax;

    cudaMalloc((void**)&dev_BITMAP, DIM*DIM*sizeof(int));
    cudaMalloc((void**)&dev_DIM, 1*sizeof(int));
    cudaMalloc((void**)&dev_c1, 1*sizeof(float));
    cudaMalloc((void**)&dev_c2, 1*sizeof(float));
    cudaMalloc((void**)&dev_kmin, 1*sizeof(float));
    cudaMalloc((void**)&dev_kmax, 1*sizeof(float));

    cudaMemcpy(dev_DIM, &DIM, 1*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c1, &c1, 1*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c2, &c2, 1*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_kmin, &host_kmin, 1*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_kmax, &host_kmax, 1*sizeof(int),cudaMemcpyHostToDevice);

    dim3 grid(DIM,DIM);

    std::clock_t start;
    double duration;
    start = std::clock();
    makeBitmap <<< grid, 1 >>> (dev_BITMAP, dev_DIM, dev_c1, dev_c2, dev_kmin, dev_kmax);
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"Julia Set in: " << duration << " sec\n";

    start = std::clock();
    cudaMemcpy(host_BITMAP,dev_BITMAP,DIM*DIM*sizeof(int),cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_kmin,dev_kmin,1*sizeof(int),cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_kmax,dev_kmax,1*sizeof(int),cudaMemcpyDeviceToHost);
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"Data Copied from GPU in: " << duration << " sec\n";

    cout << host_kmin << "\n";
    cout << host_kmax << "\n";

    glutInit(&nNumberofArgs, pszArgs);
    glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB);
    glutInitWindowSize(DIM, DIM);
    glutInitWindowPosition(100, 100);
    glutCreateWindow(pszArgs[0]);
    glClearColor (0.0, 0.0, 0.0, 0.0);
    glShadeModel(GL_FLAT);
    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutMainLoop();

    delete [] host_BITMAP;
    cudaFree (dev_BITMAP);
    cudaFree (dev_DIM);
    cudaFree (dev_c1);
    cudaFree (dev_c2);
    cudaFree (dev_kmin);
    cudaFree (dev_kmax);

    return 0;

}


(I dont remember the number)

CPU version of Julia Set

/* For ubuntu compile using : g++ filename.cpp -o julia -lGL -lglut -lGLU afer you remove the window include*/
#include "windows.h"
#include <GL/gl.h>
#include <GL/glu.h>
#include <GL/glut.h>

#include <iostream>

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <ctime>

#define ij(i,j,ni) (i-1+ni*(j-1))

#define sgn(x) (x > 0) - (x < 0)

using namespace std;

int *myBITMAP;
int DIM;
int kmin = 10000;
int kmax=-10000;

int Julia(double x, double y, double c1, double c2){

    double xnew, ynew, Z;
    int k, TF1, TF;
    TF1=0;
    for (k=0;k<200;k++){
        xnew=pow(x,2)-pow(y,2)+c1;
        ynew=2*x*y+c2;
        x=xnew;
        y=ynew;
        Z=sqrt(x*x+y*y);
        TF1=TF1+ max(sgn(Z-5),0)*(200-k);
        TF=max(sgn(5-Z),0);
        x=x*TF;
        y=y*TF;
        //cout << TF1 << "\n";

    }

    return TF1;
}

void makeBitmap(int DIM, double c1, double c2){

    int tf;
    double scale=1.5;
    double x;
    double y;

    for (int i=0;i<DIM;i++){

        //cout << i << "\n";
        for (int j=0;j<DIM;j++){
            x=scale*(double)(DIM/2-j)/(DIM/2);
            y=scale*(double)(DIM/2-i)/(DIM/2);
            tf=Julia(x,y,c1,c2);
            if (tf>kmax)
                kmax=tf;
            if (tf<kmin)
                kmin=tf;
            myBITMAP[ij(i+1,j+1,DIM)]=tf;
        }
    }

}


void display()

{
    glClear(GL_COLOR_BUFFER_BIT);
    glColor3f (1.0, 1.0, 1.0);
    glPointSize(1);
    glBegin(GL_POINTS);
    float pnt;

    for (int i=0;i<DIM;i++){

        for (int j=0;j<DIM;j++){
            pnt=(float)myBITMAP[ij(i+1,j+1,DIM)]*(1.0-0.0)/((float)kmax-(float)kmin);
            glColor3f((GLfloat)pnt, (GLfloat)pnt, (GLfloat)pnt);
            glVertex2i((GLint)j,(GLint)i);
       }
    }
    glEnd();
    glFlush();
}

void reshape (int w, int h)

{
   glViewport (0, 0, (GLsizei) w, (GLsizei) h);
   glMatrixMode (GL_PROJECTION);
   glLoadIdentity ();
   gluOrtho2D (0.0, (GLdouble) w, 0.0, (GLdouble) h);
}

int main(int nNumberofArgs, char* pszArgs[]){
    double c1;
    double c2;

    DIM=atoi(pszArgs[1]);
    c1=atof(pszArgs[2]);
    c2=atof(pszArgs[3]);
    cout << "Number of inputs:" << nNumberofArgs << "\n";
    cout << "The resolution is " << DIM << " \n";
    cout << "The real part is " << c1 << " \n";
    cout << "The imaginary part is " << c2 << " \n";

    myBITMAP = new int [DIM*DIM];

    std::clock_t start;

    double duration;
    start = std::clock();
    makeBitmap(DIM,c1,c2);
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"Julia Set in: " << duration << " sec\n";

    cout << kmin << "\n";
    cout << kmax << "\n";

    glutInit(&nNumberofArgs, pszArgs);

    glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB);
    glutInitWindowSize(DIM, DIM);
    glutInitWindowPosition(100, 100);
    glutCreateWindow(pszArgs[0]);

    glClearColor (0.0, 0.0, 0.0, 0.0);

    glShadeModel(GL_FLAT);
    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutMainLoop();

    delete [] myBITMAP;

    return 0;

}

c=0.285+0.01i

PS.
I would like to mention that initially, every time I was running the GPU code I used to get the following warning "Display driver stopped responding and has recovered"

while the CPU code was working fine. I read in few forums that the OS gives only 2 seconds to the GPU to carry out any task and after that stops and recovers, so instead of getting nice pictures I was getting blank screens and few times I was getting partial images.
 To overcome this I followed the advices from  this forum and in particular I add these values in my registry
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Contro l\GraphicsDrivers\DCI]
"Timeout"=dword:00000014
"TdrDdiDelay"=dword:00000014
"TdrDelay"=dword:00000014
"TdrLevel"=dword:00000000

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Contro l\GraphicsDrivers]

"TdrDdiDelay"=dword:00000014
"TdrDelay"=dword:00000014
"TdrLevel"=dword:00000000