
Problems I ran into when getting started with CUDA-accelerated parallel computing: a summary

cuda · domon · 354 views · 1 comment

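1. The kernel returns immediately or hits an unknown error, and there is no way to tell why.

Solution: add the following snippet right after the kernel launch to capture the error message.

// gal_signal_thread is the kernel launched just above
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
    printf("CUDA gal_signal_thread error: %s\n", cudaGetErrorString(error));

**cudaGetLastError() right after the launch reports launch errors. Errors that occur while the kernel is running only surface after a synchronizing call such as cudaDeviceSynchronize(), whose return value can be checked the same way.**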

2. Checking the result of host-side CUDA calls such as cudaMalloc().

Solution: reuse the code from book.h, which ships with the book CUDA by Example.

#include <stdio.h>
#include <stdlib.h>

// Print the CUDA error string with its source location, then exit.
static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

**Example: HANDLE_ERROR( cudaMalloc(...) )**

3. An error I ran into: Too many resources requested for launch.

Reason:
This error means that the launch requests more resources than the multiprocessor has available, typically registers: registers per thread times threads per block exceeds the register limit.

Solution:
Reduce the number of threads per block, or shrink the kernel so that it uses fewer registers per thread.
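
The register arithmetic behind this error can be sketched in host code. This is a rough estimate under stated assumptions, not how the driver decides: the function name max_threads_for_registers is made up for illustration, registers are assumed to be allocated in whole warps of 32 threads, and the block size is capped at 1024 threads. The real inputs would come from the device and the compiled kernel (for example cudaDeviceProp::regsPerBlock and cudaFuncAttributes::numRegs); here they are plain parameters.

```cpp
#include <algorithm>

// Rough upper bound on the block size for a kernel that needs
// `regs_per_thread` registers when a block may use `regs_per_block`.
// Assumptions (not taken from the article): whole-warp granularity of
// 32 threads and a hard limit of 1024 threads per block.
inline int max_threads_for_registers(int regs_per_block, int regs_per_thread) {
    const int kWarpSize = 32;
    const int kHardLimit = 1024;
    if (regs_per_block <= 0 || regs_per_thread <= 0) return 0;
    int threads = regs_per_block / regs_per_thread;   // registers available / cost per thread
    threads = (threads / kWarpSize) * kWarpSize;      // round down to whole warps
    return std::min(threads, kHardLimit);
}
```

With 65536 registers per block, a kernel that uses 128 registers per thread would be limited to about 512 threads per block by this estimate, so a launch with 1024 threads per block would be expected to fail.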

4. Common Errors

error: a host function call can not be configured
- This simply means that you tried to launch a routine as a kernel to be executed on the device, but you forgot to put __global__ in front of that routine.

error: Invalid Configuration Argument
- This error means that the dimensions of either the grid of blocks (dimGrid) or the threads in a block (dimBlock) are incorrect: a dimension is either zero or larger than the device allows. It typically shows up when the dimensions are determined dynamically at run time.

error: Unspecified launch failure 
- This error means that CUDA does not know what the problem was. This is the worst error to get because you do not know where to look to correct it. One way to read this message is to mentally translate it to "segmentation fault" in host code: the usual cause is an out-of-bounds or otherwise invalid memory access inside the kernel.
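
For the Invalid Configuration Argument case, a pre-launch check of dynamically computed dimensions can be sketched in host code. Everything named here is hypothetical: Dim3 stands in for CUDA's dim3, and typical_limits() hard-codes the limits commonly documented for recent GPUs (1024 threads per block, block dimensions up to 1024 x 1024 x 64, grid dimensions up to 2^31-1 x 65535 x 65535). In a real program the limits should be read from cudaDeviceProp (maxThreadsPerBlock, maxThreadsDim, maxGridSize) instead.

```cpp
// Hypothetical stand-in for CUDA's dim3, so the sketch compiles without CUDA.
struct Dim3 { unsigned x, y, z; };

struct LaunchLimits {
    unsigned max_threads_per_block;
    Dim3 max_block;   // per-dimension block limits
    Dim3 max_grid;    // per-dimension grid limits
};

// Assumed limits; query cudaDeviceProp for the actual device values.
inline LaunchLimits typical_limits() {
    return LaunchLimits{1024u, {1024u, 1024u, 64u}, {2147483647u, 65535u, 65535u}};
}

// Returns true if the configuration passes the checks described above:
// no zero dimension, no dimension above its limit, and no more threads
// per block than allowed.
inline bool launch_config_ok(Dim3 grid, Dim3 block, const LaunchLimits& lim) {
    if (grid.x == 0 || grid.y == 0 || grid.z == 0) return false;
    if (block.x == 0 || block.y == 0 || block.z == 0) return false;
    if (block.x > lim.max_block.x || block.y > lim.max_block.y ||
        block.z > lim.max_block.z) return false;
    if (grid.x > lim.max_grid.x || grid.y > lim.max_grid.y ||
        grid.z > lim.max_grid.z) return false;
    unsigned long long threads = 1ULL * block.x * block.y * block.z;
    return threads <= lim.max_threads_per_block;
}
```

Calling launch_config_ok() just before the <<<...>>> launch turns a late runtime error into an early, explicit check, which helps when the grid size is derived from the input size and can become zero for empty input.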

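5. NSIGHT hangs while debugging CUDA code.

Possible causes and fixes:

- A newer NSIGHT version was installed but the graphics driver was not updated. Download and install the latest driver.
- Check whether hardware video acceleration is enabled on the system. If it is, turn it off.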

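6. When several kernels are launched one after another, is a synchronization call (such as cudaDeviceSynchronize) needed between them?

Answer: Although CUDA kernel launches are asynchronous with respect to the host, all GPU work issued to the same stream (the default stream, unless you specify another) executes in the order it was issued. So there is no need to call cudaDeviceSynchronize() or a similar synchronization function between kernels. You can still use it to detect errors caused by a kernel, or when work runs in parallel across multiple CUDA streams.

For example: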

kernel1<<<X,Y>>>(...); // kernel1 starts executing; the CPU continues to the next statement
kernel2<<<X,Y>>>(...); // kernel2 is queued and starts after kernel1 finishes; the CPU continues to the next statement
cudaMemcpy(...); // the copy starts only after kernel2 finishes; the CPU blocks until the memory is copied

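7. Passing very large arguments to trigonometric functions such as cos doubles the run time of the CUDA kernel.

Cause: not confirmed. A likely explanation is the argument reduction inside the trigonometric functions: the CUDA programming guide describes a fast path for arguments of small magnitude and a much slower path for arguments of large magnitude.
Workaround: reduce the argument modulo 2*PI (that is, subtract a multiple 2k*PI) before the call. This avoids the problem.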
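
The modulo workaround can be sketched as a small helper. The name reduce_angle is made up for illustration, and the version below is host code; fmod is also available in CUDA device code, so the same body should work in a __device__ function. Note the trade-off: the constant is 2*PI rounded to double, so for very large arguments the reduced angle carries a small error that grows with the number of periods removed.

```cpp
#include <cmath>

// Reduce an angle in radians to the open interval (-2*pi, 2*pi) so that a
// following cos/sin call sees a small argument. fmod keeps the sign of x.
inline double reduce_angle(double x) {
    const double two_pi = 6.283185307179586476925286766559;  // 2*pi, rounded to double
    return std::fmod(x, two_pi);
}
```

Usage would be cos(reduce_angle(x)) in place of cos(x). Whether the small accuracy loss is acceptable depends on how large x gets and how precise the result has to be.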

8. double results computed on the CPU and on the GPU can differ slightly. Keep this in mind for calculations that need high precision.
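
One common source of such differences, though not necessarily the one behind this observation, is fused multiply-add (FMA): GPU compilers usually contract a*b + c into a single FMA instruction that rounds once, while CPU code often rounds the product first and the sum second. The sketch below reproduces both behaviours in host code. The function names are made up for illustration, and volatile is used only to keep the host compiler from fusing the two operations.

```cpp
#include <cmath>

// a*b + c with two roundings: the product is rounded to double, then the sum.
inline double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;  // volatile prevents contraction into an FMA
    return product + c;
}

// a*b + c with a single rounding at the end, as an FMA instruction computes it.
inline double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For a = b = 1 + 2^-27 and c = -(1 + 2^-26), the exact product is 1 + 2^-26 + 2^-54. The separate version rounds away the 2^-54 term and returns 0, while the fused version returns 2^-54. When CPU and GPU results have to match bit for bit, nvcc's --fmad=false option disables the contraction on the GPU side, at some cost in speed.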

9. **Unsolved. WHY?**

- A for loop that is never entered still adds GPU processing time, simply because there is code inside it.

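If you repost this, please credit: Show Me Code » Problems I ran into when getting started with CUDA-accelerated parallel computing: a summary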

1 comment
  1. My first post 😀
    domon, 2017-02-28 13:36