Monday, October 29, 2012

My first steps in CUDA

c=-0.8+0.156i


Very recently I got a new Nvidia graphic card (the cheapest I could find with compute capability greater than 1.3), and I started learning the very basic of GPU programming, in hope of using this for my projects.

I considered myself as an experienced Matlab programmer but I have practically no programming experience in c++, so in reality I have to learn simultaneously c++, CUDA and openGL (for some of the exercises)

My first challenge was to setup the environment to compile the CUDA programs. I had already  Visual Studio 2010 express (VS 2010) installed, so I just followed the three basic steps to install the necessary CUDA stuff, which are described here.
Next I configured the VS2010 so that I can compile CUDA programs. Although I couldn't find any official info about that (from nvidia or VS), there are many nice people, which describe these steps, for example here.

I'm repeating them briefly emphasizing on few minor things that can be a headache if they are missed.

1) Open a new project and select win32 console application and at the very last step of the wizard, before you hit finish CHOOSE EMPTY PROJECT (I have spent hours trying to find why I cant see the CUDA C/C++ compiler just because I forgot to check that box)

2) Right after you insert your first file RENAME it from *.cpp to *.cu (again I have spent few more hours to understand why my new project complains about CUDA commands, while the previous one compiled perfectly few minutes ago)

3) Next, right click on the project folder, choose the build customizations and tick the 4.2 CUDA box (or the CUDA version you use).

4) Right click on the project folder again and choose properties. Expand Likner and highlight input. Click on the box next to additional dependencies and add the cudart.lib library.

5) Last, right click on the file you want to compile and choose properties. From the general menu select for item type CUDA C/C++ and you are ready to be in a parallel universe.

I started reading this  book which I highly recommend for beginners (note that I'm beginner in both CUDA and C++. Note also that I'm not affiliated with the author in any way).
One of the book exercises is to program the Julia Set . Unfortunately, for some still unexplained reasons I couldn’t compile the provided code, so I decided to start from scratch and write a more simplistic version of the code without using classes and fancy data structures, although I realize that their programing style is actually match better than mine.

Of course to run this program you need to install openGL.
To compile it with VS2010 you need download the freeglut. (For example I choose the MSVC version from here). Unzip it anywhere and link the VS to that folder (right click on the solution, properties -> Linker ->General->Additional Library Directories add the .../freeglut/lib. Last you need to copy the freeglut.dll file into the folder where the executable is created, for example into the .../Debug folder (Unfortunately I haven't found any better way to overcome this, so I copy this file every time I need to use openGL)
 
The following program requires three inputs. (Note that if you dont give all the three inputs you get an error). First an integer number with the image resolution, second the real part of the constant complex and third the imaginary part of the constant complex. You have to be careful when choosing the complex constant number as you don't always get something nice and many times you get nothing but a white screen. For the figure on top for example I used the famous constant -0.8+0.156i.


c=-0.6+0.4i

GPU Version of Julia Set


#include "windows.h"
#include <GL/gl.h>
#include <GL/glu.h>
#include <GL/glut.h>

#include <iostream>
#include <cuda_runtime.h>
#include <cuda.h>
#include <math.h>
#include <ctime>

/* because I'm primarily a matlab user matrices always start with 1. so I use the ij function to convert to zero based*/
#define ij(i,j,ni) (i-1+ni*(j-1))
#define sgn(x) (x > 0) - (x < 0)

using namespace std;
int *host_BITMAP;
int DIM;
int host_kmin = 10000;
int host_kmax=-10000;

// OpenGl functions
void display()
{
   glClear(GL_COLOR_BUFFER_BIT);
    glColor3f (1.0, 1.0, 1.0);
    glPointSize(1);
    glBegin(GL_POINTS);
    float pnt;

    for (int i=0;i<DIM;i++){
        for (int j=0;j<DIM;j++){
            pnt=(float)host_BITMAP[ij(i+1,j+1,DIM)]*(1.0-0.0)/((float)host_kmax-(float)host_kmin);
            glColor3f((GLfloat)pnt, (GLfloat)pnt, (GLfloat)pnt);
            glVertex2i((GLint)j,(GLint)i);
       }
    }
    glEnd();
    glFlush();
}

void reshape (int w, int h)
{
   glViewport (0, 0, (GLsizei) w, (GLsizei) h);
   glMatrixMode (GL_PROJECTION);
   glLoadIdentity ();
   gluOrtho2D (0.0, (GLdouble) w, 0.0, (GLdouble) h);
}

__device__ int Julia(float x, float y, float *devc1, float *devc2){
    float xnew, ynew, Z;
    int k, TF1, TF;
    TF1=0;
    for (k=0;k<200;k++){
        xnew=pow(x,2)-pow(y,2)+devc1[0];
        ynew=2*x*y+devc2[0];
        x=xnew;
        y=ynew;
        Z=sqrt(x*x+y*y);
        TF1=TF1+ max(sgn(Z-5),0)*(200-k);
        TF=max(sgn(5-Z),0);
        x=x*TF;
        y=y*TF;
    }
    return TF1;
}

__global__ void makeBitmap(int *devBITMAP, int *devDIM,
    float *devc1, float *devc2, int *dev_kmin, int *dev_kmax){//
    int tf;
    float scale=1.5;
    float x;
    float y;
    int jj = blockIdx.x;
    int ii = blockIdx.y;
    x=scale*(float)(devDIM[0]/2.0-jj)/(devDIM[0]/2.0);
    y=scale*(float)(devDIM[0]/2.0-ii)/(devDIM[0]/2.0);
    tf=Julia(x,y,devc1,devc2);
    if (tf>dev_kmax[0])
        dev_kmax[0]=tf;
    if (tf<dev_kmin[0])
        dev_kmin[0]=tf;
    devBITMAP[ij(ii+1,jj+1,devDIM[0])]=tf;
}


int main(int nNumberofArgs, char* pszArgs[]){
    cudaDeviceProp prop;
    int count;
    cudaGetDeviceCount( &count );

    for (int i=0; i< count; i++) {
        printf( "--- General Information for device %d ---\n", i );
        cudaGetDeviceProperties( &prop, i );
        printf( "Name: %s\n", prop.name );
        printf( "Max threads per block: %d\n",prop.maxThreadsPerBlock );
    }

    int dev;
    cudaGetDevice( &dev );
    printf( "ID of current CUDA device: %d\n", dev );
    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 2;
    prop.minor = 0;
    cudaChooseDevice( &dev, &prop );
    printf( "ID of CUDA device closest to revision 2.0: %d\n", dev );
    cudaSetDevice( dev );

    float c1, c2;
    DIM=atoi(pszArgs[1]);
    c1=atof(pszArgs[2]);
    c2=atof(pszArgs[3]);

    cout << "The image dimension is: " << DIM << "\n";
    cout << "The real part is: " << c1 << "\n";
    cout << "The imaginary part is: " << c2 << "\n";

    host_BITMAP = new int[DIM*DIM];

    int *dev_BITMAP;
    int *dev_DIM;
    float *dev_c1;
    float *dev_c2;
    int *dev_kmin;
    int *dev_kmax;

    cudaMalloc((void**)&dev_BITMAP, DIM*DIM*sizeof(int));
    cudaMalloc((void**)&dev_DIM, 1*sizeof(int));
    cudaMalloc((void**)&dev_c1, 1*sizeof(float));
    cudaMalloc((void**)&dev_c2, 1*sizeof(float));
    cudaMalloc((void**)&dev_kmin, 1*sizeof(float));
    cudaMalloc((void**)&dev_kmax, 1*sizeof(float));

    cudaMemcpy(dev_DIM, &DIM, 1*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c1, &c1, 1*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c2, &c2, 1*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_kmin, &host_kmin, 1*sizeof(int),cudaMemcpyHostToDevice);
    cudaMemcpy(dev_kmax, &host_kmax, 1*sizeof(int),cudaMemcpyHostToDevice);

    dim3 grid(DIM,DIM);

    std::clock_t start;
    double duration;
    start = std::clock();
    makeBitmap <<< grid, 1 >>> (dev_BITMAP, dev_DIM, dev_c1, dev_c2, dev_kmin, dev_kmax);
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"Julia Set in: " << duration << " sec\n";

    start = std::clock();
    cudaMemcpy(host_BITMAP,dev_BITMAP,DIM*DIM*sizeof(int),cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_kmin,dev_kmin,1*sizeof(int),cudaMemcpyDeviceToHost);
    cudaMemcpy(&host_kmax,dev_kmax,1*sizeof(int),cudaMemcpyDeviceToHost);
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"Data Copied from GPU in: " << duration << " sec\n";

    cout << host_kmin << "\n";
    cout << host_kmax << "\n";

    glutInit(&nNumberofArgs, pszArgs);
    glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB);
    glutInitWindowSize(DIM, DIM);
    glutInitWindowPosition(100, 100);
    glutCreateWindow(pszArgs[0]);
    glClearColor (0.0, 0.0, 0.0, 0.0);
    glShadeModel(GL_FLAT);
    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutMainLoop();

    delete [] host_BITMAP;
    cudaFree (dev_BITMAP);
    cudaFree (dev_DIM);
    cudaFree (dev_c1);
    cudaFree (dev_c2);
    cudaFree (dev_kmin);
    cudaFree (dev_kmax);

    return 0;

}


(I dont remember the number)

CPU version of Julia Set

/* For ubuntu compile using : g++ filename.cpp -o julia -lGL -lglut -lGLU afer you remove the window include*/
#include "windows.h"
#include <GL/gl.h>
#include <GL/glu.h>
#include <GL/glut.h>

#include <iostream>

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <ctime>

#define ij(i,j,ni) (i-1+ni*(j-1))

#define sgn(x) (x > 0) - (x < 0)

using namespace std;

int *myBITMAP;
int DIM;
int kmin = 10000;
int kmax=-10000;

int Julia(double x, double y, double c1, double c2){

    double xnew, ynew, Z;
    int k, TF1, TF;
    TF1=0;
    for (k=0;k<200;k++){
        xnew=pow(x,2)-pow(y,2)+c1;
        ynew=2*x*y+c2;
        x=xnew;
        y=ynew;
        Z=sqrt(x*x+y*y);
        TF1=TF1+ max(sgn(Z-5),0)*(200-k);
        TF=max(sgn(5-Z),0);
        x=x*TF;
        y=y*TF;
        //cout << TF1 << "\n";

    }

    return TF1;
}

void makeBitmap(int DIM, double c1, double c2){

    int tf;
    double scale=1.5;
    double x;
    double y;

    for (int i=0;i<DIM;i++){

        //cout << i << "\n";
        for (int j=0;j<DIM;j++){
            x=scale*(double)(DIM/2-j)/(DIM/2);
            y=scale*(double)(DIM/2-i)/(DIM/2);
            tf=Julia(x,y,c1,c2);
            if (tf>kmax)
                kmax=tf;
            if (tf<kmin)
                kmin=tf;
            myBITMAP[ij(i+1,j+1,DIM)]=tf;
        }
    }

}


void display()

{
    glClear(GL_COLOR_BUFFER_BIT);
    glColor3f (1.0, 1.0, 1.0);
    glPointSize(1);
    glBegin(GL_POINTS);
    float pnt;

    for (int i=0;i<DIM;i++){

        for (int j=0;j<DIM;j++){
            pnt=(float)myBITMAP[ij(i+1,j+1,DIM)]*(1.0-0.0)/((float)kmax-(float)kmin);
            glColor3f((GLfloat)pnt, (GLfloat)pnt, (GLfloat)pnt);
            glVertex2i((GLint)j,(GLint)i);
       }
    }
    glEnd();
    glFlush();
}

void reshape (int w, int h)

{
   glViewport (0, 0, (GLsizei) w, (GLsizei) h);
   glMatrixMode (GL_PROJECTION);
   glLoadIdentity ();
   gluOrtho2D (0.0, (GLdouble) w, 0.0, (GLdouble) h);
}

int main(int nNumberofArgs, char* pszArgs[]){
    double c1;
    double c2;

    DIM=atoi(pszArgs[1]);
    c1=atof(pszArgs[2]);
    c2=atof(pszArgs[3]);
    cout << "Number of inputs:" << nNumberofArgs << "\n";
    cout << "The resolution is " << DIM << " \n";
    cout << "The real part is " << c1 << " \n";
    cout << "The imaginary part is " << c2 << " \n";

    myBITMAP = new int [DIM*DIM];

    std::clock_t start;

    double duration;
    start = std::clock();
    makeBitmap(DIM,c1,c2);
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    std::cout<<"Julia Set in: " << duration << " sec\n";

    cout << kmin << "\n";
    cout << kmax << "\n";

    glutInit(&nNumberofArgs, pszArgs);

    glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB);
    glutInitWindowSize(DIM, DIM);
    glutInitWindowPosition(100, 100);
    glutCreateWindow(pszArgs[0]);

    glClearColor (0.0, 0.0, 0.0, 0.0);

    glShadeModel(GL_FLAT);
    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutMainLoop();

    delete [] myBITMAP;

    return 0;

}

c=0.285+0.01i

PS.
I would like to mention that initially, every time I was running the GPU code I used to get the following warning "Display driver stopped responding and has recovered"

while the CPU code was working fine. I read in few forums that the OS gives only 2 seconds to the GPU to carry out any task and after that stops and recovers, so instead of getting nice pictures I was getting blank screens and few times I was getting partial images.
 To overcome this I followed the advices from  this forum and in particular I add these values in my registry
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Contro l\GraphicsDrivers\DCI]
"Timeout"=dword:00000014
"TdrDdiDelay"=dword:00000014
"TdrDelay"=dword:00000014
"TdrLevel"=dword:00000000

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Contro l\GraphicsDrivers]

"TdrDdiDelay"=dword:00000014
"TdrDelay"=dword:00000014
"TdrLevel"=dword:00000000

Thursday, August 23, 2012

Inverse transformation of parametric elements

When one solves numerically partial differential equations, using finite element methods, soon or later will encounter with the concept of isoparametric elements. 
My goal here is not to explain the concept (I assume you have an idea), but to provide the solution to a problem that I faced up recently: I had to convert parametric or natural or local coordinates to global and vice versa.

A bit of background:
Every point in a domain is characterized by its coordinates (x, y ,z). These coordinates are called global. However in real applications it is not easy to work with the real or global coordinates, therefore it is common to use another coordinate system, where the location and the values of each point is a function of the location or values of the nodes of the element that contain the point. In other words we interpolate the value of a point  from the values of the point corners.

For example, in a triangle where v1, v2, v3 are the values on the three corners, we can express the value of a  point using the following formula:
vp=v1*N1(u,w)+v2*N2(u,w)+v3*N3(u,w) 
where, N1, N2, N3 are called shape functions, u, w, are the parametric variables and vp is the interpolated value (it can be simply the coordinates or hydraulic conductivity, or even a vector e.g velocity)
The parametric variables usually span within [0 1] or [-1 1].

Below is a small matlab/octave code that may help you to understand the concept of parametric variables.
(In case you don't know, the shape functions of a 2D triangle are N1=u, N2=w, and N3=1-u-w)

hold on % create a window plot
[x y]=ginput(3); %pick manually the three corners of your triangle
z=[0.7; 0.1; 0.3];%Assign some random property on the nodes (eg. height)
p=[]; %create an empty variable
for u=0:0.1:1 %loop through the parametric variable
   for w=0:0.1:1 %same here
      if u+w<=1% for triangles u+w must be less than one
               %otherwise the point is outside of the triangle
           p=[p;u*x(1)+w*x(2)+(1-u-w)*x(3)... %find the global
                u*y(1)+w*y(2)+(1-u-w)*y(3)... %coordinates for
                u*z(1)+w*z(2)+(1-u-w)*z(3)];  %each u,w pair
      end
   end
end
plot3(x([1:3 1]), y([1:3 1]), [0 0 0 0]);
hold on;
plot3(x([1:3 1]), y([1:3 1]), z([1:3 1]),'r'); 
plot3(p(:,1), p(:,2), p(:,3),'o');

 
The above code will produce a red triangle in 3D, the projection of the triangle in the x-y plane and a series of points that lay on the plane

We see that we can  interpolate the height value of every point within the triangle using very simple calculations.


So far we showed how you can find the global coordinates of a point if you know the local ones. Generally this is a straightforward task, yet the opposite can be cumbersome. 
Bellow I will explain a general method where using the symbolic toolbox of matlab this becomes also a trivial task.    

For a 2D triangle the problem is stated as follows, given the three corners of a triangle (x1,y1), (x2,y2), (x3,y3) and the coordinates of a point (xp,yp) find the parametric coordinates (u,w).
In case of triangles this is quite easy because the shape functions have relatively simple form. First we write the two equations:
xp=x1*u+x2*w+x3*(1-u-w)
yp=y1*u+y2*w+y3*(1-u-w)
and then we solve a system of two equations with two unknowns (u and w). This is actually something that matlab symbolic toolbox is very good at.
 Lets first create few symbolic variables
>> syms x1 x2 x3 y1 y2 y3 xp yp u w
>> eq1 = x1*u+x2*w+x3*(1-u-w) - xp
>> eq2 = y1*u+y2*w+y3*(1-u-w) - yp
Finally, all we have to do is to tell matlab to solve this system for us
>>  S=solve(eq1==0,eq2==0,u,w);
The outcome of the last command is a structured variable with fields u and w, which contains the solutions for u and w.
S.u=(x2*y3 - x3*y2 - x2*yp + xp*y2 + x3*yp - xp*y3)/(x1*y2 - x2*y1 - x1*y3 + x3*y1 + x2*y3 - x3*y2)
S.w=-(x1*y3 - x3*y1 - x1*yp + xp*y1 + x3*yp - xp*y3)/(x1*y2 - x2*y1 - x1*y3 + x3*y1 + x2*y3 - x3*y2)

Of course for the 2D triangle one can derive these formulas using a different approach.

Wednesday, August 22, 2012

Reading Modflow output files (again)

In many cases its very useful to be able to read and process in batch mode the output file of Modflow (Optimization, inverse modeling, Monte Carlo simulation etc...)

The USGS has provided a utility program to convert the binary output to ascii, which is easier to read, yet as you will see below it is rather straightforward to read the binary files without using an intermediate step.

In general, reading binary files requires precise knowledge of how the data are written on the file. Finding this information was actually more difficult than writing the program itself. Go to http://water.usgs.gov/nrp/gwsoftware/modflow2000/MFDOC/ and then and then under the FAQ look for "How can I read data from a binary file generated by MODFLOW?". At the bottom of the description click the plus next to Array Data and there they are!!.

Let's suppose we want to read the "mymodel.hds" binary file that contains the head field of a multi-layer aquifer with Nlay layers.
To read the file  we will use few Matlab/Octave commands, fopen/fclose, fread.

>>fid=fopen("mymodel.hds",'r'); % open the file for reading

Next we will start reading the data but as the description suggest we should know whether the data are written in single-precision or double-precision format. As we would never know beforehand the correct format, we just make an assumption and check this later.

 First we need to read an integer of 4 bytes, which is the the time step number.
>>KSPT = fread(fid, 1, 'uint');
similarly:
>>KPER = fread(fid, 1, 'uint');% read the stress period number
>>PERTIM = fread(fid, 1, 'float'); % the time in the current stress period, which is a real number and that's why we use 'float' format.
>> TOTIM = fread(fid, 1, 'float'); %the total elapsed time, which again is a real number.

Next we have to read a word, which is actually the moment of truth. If the assumptions about the single/double precision format or the number of bytes (4 or 8) for the real numbers are correct, we should read something that makes sense. When we read a head field, for example, we should simply read : '            HEAD'.
According to the description  we have to read 16 ANSI characters
>> DESC = fread(fid, 16, 'char');
The DESC now is an array that looks like this:
[32 32 32 32 32 72 69 65 68], which are the ANSI characters of the word we just read.
To convert it to a readable format use:
>>DESC=char(DESC');
DESC  is now the word '     HEAD' in ascii characters.
If DESC matches one of the values of the table, on the FAQ then our assumptions are correct.
Next we read 3 integer numbers
>>NCOL = fread(fid, 1, 'uint'); % number of columns
>>NROW = fread(fid, 1, 'uint'); % number of rows
>>ILAY = fread(fid, 1, 'uint'); % layer number
Finally we have to read the head field itself using a loop
>> for ir = 1 : NROW
       for ic = 1 : NCOL
           H(ir,ic,ILAY)=fread(fid, 1, float);
       end
   end

In case of multi layer we need to put all the above commands into a loop.

Last we need to close the file (which is something I always forget)
>>fclose(fid);

(I have used many times the above implementation without any problems. Therefore I cannot give recommendations if the file uses a different format).