Allocating Local Memory

>> Friday, October 3, 2014

Recently, someone asked me about the difference between the two methods of allocating local memory. That is, why would you use one method over the other? This isn't an easy question, and before I answer it here, I'd like to explain what the two methods are.

The first allocation method is performed inside the kernel. If a variable is declared in the kernel function, its declaration must identify which memory region contains its data. If the declaration is preceded by __local, the variable will be allocated from local memory. For example, the following declaration sets aside 1024 bytes of local memory to be accessed as an array named foo.

__local float foo[256];

The second method is performed in host code. clSetKernelArg assigns data to be passed into the kernel function as an argument. The last parameter of clSetKernelArg points to the argument's data. If the last argument points to a memory object (cl_mem), the kernel will access the data from global/constant memory. If the argument points to a primitive like an int, the kernel will access the data from private memory. But if the last argument is NULL, no data will be passed to the kernel. In this case, the purpose of clSetKernelArg is to allocate local memory for the kernel's use.

An example will help make this clear. Consider the following calls to clSetKernelArg:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &example_buffer);
clSetKernelArg(kernel, 1, sizeof(int), &example_int);
clSetKernelArg(kernel, 2, 256 * sizeof(float), NULL);

These lines define the first three arguments of the kernel. If the kernel function's name is foo, the following code shows how foo's arguments might be accessed as function parameters:
__kernel void foo(__global float4* ex_buffer,
                  int ex_int,
                  __local float* local_var)

In the third call to clSetKernelArg, the data pointer is set to NULL. This tells the kernel that its third argument can be allocated from global memory or from local memory. This data is uninitialized—the kernel will read/write to the memory as it performs its computation.

So, the two methods of allocating local memory are declaring a local variable in a kernel and calling clSetKernelArg with a NULL data pointer. When do you use one over the other? The answer is straightforward.

If the size of the local memory is constant, use the first method because it's simpler. But if the kernel must be executed multiple times with different sizes of local memory, use the second method. This is because the second method makes it possible to control the local memory size with a variable.

17 comments:

Manish Kumar October 7, 2014 at 5:17 AM  

Thanks A lot. Its helpful :)

Anonymous,  October 8, 2014 at 8:15 PM  

To be precise it's not the clSetKernelArg's last parameter that decides whether that argument refers to local or global memory, OpenCL specification allows any memory object argument to be set to NULL.
It's decided by the __global/__local qualifier preceding the parameter in the kernel's signature (eg. __local float* local_var).

Matt Scarpino October 9, 2014 at 6:53 AM  

Ah. I didn't know that. Thank you.

Matt Scarpino October 11, 2014 at 3:21 PM  

With regard to Anonymous's comment, I can't think of any reason why you'd set the last parameter of clSetKernelArg to NULL, and then access the kernel parameter as a global variable instead of as a local variable. After all, the host has no way to read the variable's data or write to it. Hmm...

Matt Scarpino October 11, 2014 at 3:25 PM  

I modified the article to state that it's possible to allocate data from global or local memory if the last parameter of clSetKernelArg is set to NULL. Thank you, Anonymous.

Manish Kumar October 13, 2014 at 3:44 AM  

Hi Matt and Anonymous,
Just a question out of topic.
Which type of operations are faster on the GPU side. Fixed point or Floating point?
I am getting almost same performance result for both.
Can you guide me here?

Basically I need to decide whether I should write my optimizations in fixed point code or floating point code for GPU

Matt Scarpino October 13, 2014 at 4:15 PM  

Hi Manish,

GPU architecture changes from vendor to vendor, but 99% of a GPU's regular work involves floating-point computation. For this reason, many GPUs don't have fixed-point processing units. Instead, they convert fixed-point values to floating-point values and processing the floating-point values normally.

Therefore, if you know your target will be a GPU, I recommend optimizing for floating-point code. But if you target a heterogeneous processor like Intel's Haswell or AMD's Fusion, you might be better off optimizing for fixed-point computation.

Best of luck,
Matt

Manish Kumar October 28, 2014 at 8:32 AM  

Can you give me your email id or any communication id. I just wanted to ask one question on OpenCl optimization ?

Matt Scarpino November 1, 2014 at 7:39 PM  

Sure thing. I'm at mattscar@gmail.com.

wgh April 1, 2015 at 3:41 AM  

I think using the first method, you can still control the size of local memory.
What you can do is define your local memory with a macro:

__local int array[SIZE];

then when building program, passing the SIZE as a an building option like this:

std::stringstream ss;
ss << "-D SIZE=" << size;
char * build_options = ss.str().c_str();

clBuildProgram(program, devices, device_list, build_options, NULL, NULL);

Matt Scarpino April 5, 2015 at 2:40 PM  

Ah yes. That's another way to set the size of a local memory block from the host.

youngwanLEE,  May 1, 2015 at 11:00 AM  

Hi Matt. I studied your OpenCL book recently. I have a qeustion in Chapter 10 about memory bank conflict.
I'm developing face detection app in android platform using OpenCL. face detection algorithm is based on Viola Jones algorithm. I tried to make Cascade classification step kernel code. and I set **classifier data** of cascade stage 1 among cascade stages to **local memory(__local)** because of classifier data are used for all work-items.

but, kernel profiling time without using local mem(using global mem) is more faster than that does with using local memory.

------------------------------------------------------------------------------

youngwanLEE,  May 1, 2015 at 11:11 AM  

__kernel void CASCADE(__read_only image2d_t input_image, __write_only image2d_t output_image,__constant float* classifierMem,__constant int* idxNumValStageArray, int numTotStage, __constant int* vecSkin)
{
int cascadeLocalSize = get_local_size(0);

__local float localS1F1[42];

int localIdx = get_local_id(1)*cascadeLocalSize + get_local_id(0);
if(localIdx<42)
{
int stage1Idx = localIdx + idxNumValStageArray[0]+4;
localS1F1[localIdx] = classifierMem[stage1Idx];
}

barrier(CLK_LOCAL_MEM_FENCE);

float resizeFactor = 1.0;

int2 im_dim = get_image_dim(input_image);

unsigned int srcWidth = im_dim.x*(float)resizeFactor;
unsigned int srcHeight = im_dim.y*(float)resizeFactor;

int gx = get_global_id(0);
int gy = get_global_id(1);


int skinX=0;
int skinY=0;
int coordi=vecSkin[512*gy+gx];
skinX = coordi%im_dim.x;
skinY = coordi/im_dim.x;

if( skinX >= 10 && skinY >= 10 )
{
skinX -= 10;
skinY -= 10;
}



int type = gx%3;

unsigned int windowWidth = classifierMem[0];
unsigned int windowHeight = classifierMem[1];


unsigned int stageIndex;
float stageThres;
float numFeatures;
unsigned int featureIndex;
float featureValue;

if(skinX0;i--){
if(stagePass){
if(index == 0){
stageIndex = idxNumValStageArray[0];
stageThres = classifierMem[stageIndex+2];
numFeatures = classifierMem[stageIndex+3];
featureIndex = 0;
featureValue = 0.0;
}
else{
stageIndex = idxNumValStageArray[index];
stageThres = classifierMem[stageIndex+2];
numFeatures = classifierMem[stageIndex+3];
featureIndex = stageIndex+4;
featureValue = 0.0;
}
float featureThres;
float succVal;
float failVal;
unsigned int numRegions;
float regionValue;

if(type ==0 && index==0)
{
for(unsigned int j=numFeatures; j>0;j--){
if(stagePass){
featureThres=localS1F1[featureIndex++]*(windowWidth*windowHeight);
succVal=localS1F1[featureIndex++];
failVal=localS1F1[featureIndex++];
numRegions = localS1F1[featureIndex++];
regionValue =0.0;

float4 rectValue;
int4 regionP;
for(unsigned int k=numRegions; k>0;k--){

regionP.x=(int)(localS1F1[featureIndex])+skinX;
regionP.y=(int)(localS1F1[featureIndex+1])+skinY;
regionP.z=(int)(localS1F1[featureIndex+2])+regionP.x;
regionP.w=(int)(localS1F1[featureIndex+3])+regionP.y;

rectValue.x = read_imagef(input_image, sampler, regionP.xy).x;
rectValue.y = read_imagef(input_image, sampler, regionP.zy).x;
rectValue.z = read_imagef(input_image, sampler, regionP.xw).x;
rectValue.w = read_imagef(input_image, sampler, regionP.zw).x;

regionValue += dot(rectValue, (float4)(1.0f, -1.0f, -1.0f, 1.0f))*localS1F1[featureIndex+4];

featureIndex+=5;
}
featureValue += (regionValue < featureThres)?failVal:succVal;
if(j*2 == (unsigned int)numFeatures && featureValue*2 < stageThres) stagePass =false;

}// end of if(stagePass)
}// end of for(unsigned int j=numFeatures; j>0;j--)
index++;
if(featureValue < stageThres) stagePass =false;
else if(index==numTotStage) write_imagef(output_image, (int2)(skinX, skinY), (0.1));

}

youngwanLEE,  May 1, 2015 at 11:13 AM  

else if(type ==1 && index ==0)
{
featureIndex +=14;
for(unsigned int j=numFeatures; j>0;j--){
if(stagePass){
if(j==1)
featureIndex -= 42;

featureThres=localS1F1[featureIndex++]*(windowWidth*windowHeight);
succVal=localS1F1[featureIndex++];
failVal=localS1F1[featureIndex++];
numRegions = localS1F1[featureIndex++];
regionValue =0.0;


float4 rectValue;
int4 regionP;
for(unsigned int k=numRegions; k>0;k--){

regionP.x=(int)(localS1F1[featureIndex])+skinX;
regionP.y=(int)(localS1F1[featureIndex+1])+skinY;
regionP.z=(int)(localS1F1[featureIndex+2])+regionP.x;
regionP.w=(int)(localS1F1[featureIndex+3])+regionP.y;

rectValue.x = read_imagef(input_image, sampler, regionP.xy).x;
rectValue.y = read_imagef(input_image, sampler, regionP.zy).x;
rectValue.z = read_imagef(input_image, sampler, regionP.xw).x;
rectValue.w = read_imagef(input_image, sampler, regionP.zw).x;

regionValue += dot(rectValue, (float4)(1.0f, -1.0f, -1.0f, 1.0f))*localS1F1[featureIndex+4];

featureIndex+=5;
}
featureValue += (regionValue < featureThres)?failVal:succVal;
if(j*2 == (unsigned int)numFeatures && featureValue*2 < stageThres) stagePass =false;

}
}
index++;
if(featureValue < stageThres) stagePass =false;
else if(index==numTotStage) write_imagef(output_image, (int2)(skinX, skinY), (0.1));
}

youngwanLEE,  May 1, 2015 at 11:13 AM  

else if(index == 0)
{
featureIndex +=28;
for(unsigned int j=numFeatures; j>0;j--){
if(stagePass){

if(j==2) featureIndex -= 42;

featureThres=localS1F1[featureIndex++]*(windowWidth*windowHeight);
succVal=localS1F1[featureIndex++];
failVal=localS1F1[featureIndex++];
numRegions = localS1F1[featureIndex++];
regionValue =0.0;

float4 rectValue;
int4 regionP;
for(unsigned int k=numRegions; k>0;k--){

regionP.x=(int)(localS1F1[featureIndex])+skinX;
regionP.y=(int)(localS1F1[featureIndex+1])+skinY;
regionP.z=(int)(localS1F1[featureIndex+2])+regionP.x;
regionP.w=(int)(localS1F1[featureIndex+3])+regionP.y;

rectValue.x = read_imagef(input_image, sampler, regionP.xy).x;
rectValue.y = read_imagef(input_image, sampler, regionP.zy).x;
rectValue.z = read_imagef(input_image, sampler, regionP.xw).x;
rectValue.w = read_imagef(input_image, sampler, regionP.zw).x;

regionValue += dot(rectValue, (float4)(1.0f, -1.0f, -1.0f, 1.0f))*localS1F1[featureIndex+4];

featureIndex+=5;
}// end of for(unsigned int k=numRegions; k>0;k--)
featureValue += (regionValue < featureThres)?failVal:succVal;
if(j*2 == (unsigned int)numFeatures && featureValue*2 < stageThres) stagePass =false;

}// end of if(stagePass)
}//end of for(unsigned int j=numFeatures; j>0;j--)
index++;
if(featureValue < stageThres) stagePass =false;
else if(index==numTotStage) write_imagef(output_image, (int2)(skinX, skinY), (0.1));
}
//stage
else{
for(unsigned int j=numFeatures; j>0;j--){
if(stagePass){
featureThres=classifierMem[featureIndex++]*(windowWidth*windowHeight);
succVal=classifierMem[featureIndex++];
failVal=classifierMem[featureIndex++];
numRegions = classifierMem[featureIndex++];
regionValue =0.0;
float4 rectValue;
int4 regionP;
for(unsigned int k=numRegions; k>0;k--){
regionP.x=(int)(classifierMem[featureIndex])+skinX;
regionP.y=(int)(classifierMem[featureIndex+1])+skinY;
regionP.z=(int)(classifierMem[featureIndex+2])+regionP.x;
regionP.w=(int)(classifierMem[featureIndex+3])+regionP.y;
rectValue.x = read_imagef(input_image, sampler, regionP.xy).x;
rectValue.y = read_imagef(input_image, sampler, regionP.zy).x;
rectValue.z = read_imagef(input_image, sampler, regionP.xw).x;
rectValue.w = read_imagef(input_image, sampler, regionP.zw).x;
regionValue += dot(rectValue, (float4)(1.0f, -1.0f, -1.0f, 1.0f))*classifierMem[featureIndex+4];
featureIndex+=5;
}
featureValue += (regionValue < featureThres)?failVal:succVal;
if(j*2 == (unsigned int)numFeatures && featureValue*2 < stageThres) stagePass =false;
}
}
index++;
if(featureValue < stageThres) stagePass =false;
else if(index==numTotStage) write_imagef(output_image, (int2)(skinX, skinY), (0.1));
}
}
}
}else return;
}
------------------------------------------------------------------------------
this is my code. why using local memory slower? thanks in advance.

Anonymous,  April 3, 2016 at 11:33 PM  

can any one send opencl code for prim's or kruskal's algorithm

Post a Comment

  © Blogger template Werd by Ourblogtemplates.com 2009

Back to TOP