Parent Node:
Code Snippet Name Code Snippet Description Parent Node Communities
Pixel Buffer Objects: Mixing CUDA and OpenGL within the same application

Following is the source code for the Doctor Dobb's Journal article, Part 15: "Using Pixel Buffer Objects with CUDA and OpenGL." The source includes Microsoft Visual Studio build files as well as a Linux command line for building the executable.

Many thanks to Joe Stam at NVIDIA for providing the Visual Studio build files. Joe also notes you need to remove the following lines from perlinKernelPBO.cu:

#include <cutil_gl_inline.h>

#include <cuda_gl_interop.h>

 

Machine Learning and Data Mining
Using Vertex Buffer Objects with CUDA and very fast surface rendering with primitive restart

Following is the source code for a future Doctor Dobb's Journal article entitled "Using Vertex Buffer Objects with CUDA and OpenGL." The source includes Microsoft Visual Studio build files as well as a Linux command line for building the executable.

This code demonstrates how to draw 3D points, wireframes, and surfaces using the framework described in Part 15 of my Doctor Dobb's article series, "Using Pixel Buffer Objects with CUDA and OpenGL." I left in some #ifdef statements so you can verify for yourself how much faster rendering is when Primitive Restart is used to bypass PCI bus bandwidth limitations.
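As a rough illustration of the primitive-restart idea, the host-side sketch below builds one index list for a regular grid of vertices, drawn as one triangle strip per row with a restart marker between rows. The function name `buildStripIndices`, the constant `kRestartIndex`, and the row-major vertex layout are assumptions made for this example and are not taken from the article's source. In OpenGL 3.1 and later the marker would be registered with `glPrimitiveRestartIndex` after `glEnable(GL_PRIMITIVE_RESTART)`; older NVIDIA drivers expose the same feature through the `NV_primitive_restart` extension.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical restart marker: the largest 32-bit index, which a real mesh
// of this size never uses as a vertex index.
constexpr std::uint32_t kRestartIndex = 0xFFFFFFFFu;

// Build triangle-strip indices for a width x height grid of vertices stored
// in row-major order. Each pair of adjacent rows becomes one strip, and a
// restart index separates consecutive strips so the whole surface fits in a
// single index buffer.
std::vector<std::uint32_t> buildStripIndices(std::uint32_t width,
                                             std::uint32_t height) {
    std::vector<std::uint32_t> indices;
    if (width < 2 || height < 2) {
        return indices;  // not enough vertices to form a triangle
    }
    for (std::uint32_t row = 0; row + 1 < height; ++row) {
        for (std::uint32_t col = 0; col < width; ++col) {
            indices.push_back(row * width + col);        // vertex on this row
            indices.push_back((row + 1) * width + col);  // vertex on next row
        }
        if (row + 2 < height) {
            indices.push_back(kRestartIndex);  // end this strip, start another
        }
    }
    return indices;
}
```

Because the index list depends only on the grid dimensions, it can be uploaded once and reused every frame while a CUDA kernel rewrites the vertex positions in place, so the surface can be drawn with a single indexed draw call.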

Many thanks to Joe Stam at NVIDIA for providing the Visual Studio build files. Joe also notes you need to remove the following lines from perlinKernelVBO.cu and change the type of the uint variable in runCUDA to "unsigned int":

#include <cutil_gl_inline.h>

#include <cuda_gl_interop.h>

 

Machine Learning and Data Mining
Line forward projection on CUDA

 

GPU Computing Gems Source Code
Cone-Beam CT image reconstruction using the Katsevich Algorithm

This program reconstructs helical cone-beam CT images using the Katsevich algorithm.

Two versions are included. Kat_1024 reconstructs 1024x1024x1024 volumes; it allocates and deallocates intermediate memory as it goes to accommodate the large projections used at that size.

Kat_512 takes a simpler approach: it allocates all the device memory it needs at the beginning and frees it only at the end. It is the better choice when reconstructing volumes of 512x512x512 or smaller. It cannot reconstruct 1024x1024x1024 volumes unless the GPU board has 6 GB of memory or more.
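A quick footprint estimate helps put the 6 GB figure in context. The sketch below assumes 4-byte single-precision voxels, which the description does not state, and counts only the output volume, not the projections or intermediate buffers. The names `volumeBytes`, `kMiB`, and `kGiB` are made up for this example.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kGiB = 1024ull * kMiB;

// Bytes needed to store one cubic volume with `side` voxels per edge.
// bytesPerVoxel = 4 corresponds to the assumed single-precision float voxels.
constexpr std::uint64_t volumeBytes(std::uint64_t side,
                                    std::uint64_t bytesPerVoxel) {
    return side * side * side * bytesPerVoxel;
}

// Under the float assumption, a 512^3 volume takes 512 MiB and a 1024^3
// volume takes 4 GiB, eight times as much.
static_assert(volumeBytes(512, 4) == 512 * kMiB, "512^3 floats = 512 MiB");
static_assert(volumeBytes(1024, 4) == 4 * kGiB, "1024^3 floats = 4 GiB");
```

Under these assumptions the output volume alone would occupy 4 GiB at 1024x1024x1024, which is consistent with an allocate-everything-up-front strategy needing a board with roughly 6 GB once projection and working buffers are added.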

The program genproj generates the projections to be used for reconstruction tests. Real projections from CT machines can also be used.

 

GPU Computing Gems Source Code
MAGMA Library

Major chip manufacturers are developing next-generation microprocessor designs that are heterogeneous/hybrid in nature, integrating homogeneous x86-based multicore CPU components and GPU components. The goal of the MAGMA (Matrix Algebra on GPU and Multicore Architectures) project is to develop innovative linear algebra algorithms and to incorporate them into a library that is similar to LAPACK in functionality, data storage, and interface, but that targets the next generation of highly parallel, heterogeneous processors.


This will allow scientists to port any of their LAPACK-dependent software components with little effort and to take advantage of the new architectures. MAGMA is designed to run on homogeneous x86-based multicores and to take advantage of GPU components if they are available. This is achieved by developing a class of multi-level blocking algorithms that split the computation into tasks of varying granularity (e.g., large tasks for available GPUs) and dynamically schedule their execution.


The transition from small tasks (of small block size) to large tasks is done recursively, and the intermediate tasks of the transition are executed in parallel using dynamic scheduling. The new algorithms, when run on just homogeneous x86-based multicores, outperform vendor implementations (e.g., MKL) while retaining LAPACK accuracy and data layout (no block data layouts). Adding a GPU increases performance in proportion to the GPU's computational characteristics. These results are for the one-sided matrix factorizations: LU, QR, and Cholesky. Work on the two-sided factorizations, e.g., Hessenberg reduction, shows more drastic performance improvements (significantly exceeding an order of magnitude) when comparing homogeneous multicores to hybrid multicores plus GPUs. These improvements arise mainly because the two-sided factorizations have bandwidth limitations that cannot be overcome using homogeneous multicores alone.

In addition to standard-accuracy algorithms (LAPACK-compliant accuracy), we develop algorithms within MAGMA that allow a user-defined tradeoff between accuracy and speed. These algorithms are based on mixed-precision arithmetic and take advantage of the GPU's still much higher throughput in single-precision than in double-precision arithmetic.
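The mixed-precision idea can be sketched with classic iterative refinement: solve in single precision, compute the residual in double precision, and correct. The code below is a minimal CPU illustration under stated assumptions and is not MAGMA's API; the names `Matrix`, `solveSingle`, and `solveMixed` are made up here, the matrix is assumed to be small, dense, and non-singular, and for brevity the single-precision elimination is repeated at every step, whereas a production routine would factor once and reuse the factors.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve A x = b entirely in single precision using Gaussian elimination
// with partial pivoting. Assumes A is square and non-singular.
std::vector<float> solveSingle(const Matrix& a, const std::vector<double>& b) {
    const std::size_t n = b.size();
    // Augmented matrix [A | b], rounded to float.
    std::vector<std::vector<float>> m(n, std::vector<float>(n + 1));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) m[i][j] = static_cast<float>(a[i][j]);
        m[i][n] = static_cast<float>(b[i]);
    }
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t pivot = k;  // row with the largest entry in column k
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(m[i][k]) > std::fabs(m[pivot][k])) pivot = i;
        std::swap(m[k], m[pivot]);
        for (std::size_t i = k + 1; i < n; ++i) {
            const float factor = m[i][k] / m[k][k];
            for (std::size_t j = k; j <= n; ++j) m[i][j] -= factor * m[k][j];
        }
    }
    std::vector<float> x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        float sum = m[i][n];
        for (std::size_t j = i + 1; j < n; ++j) sum -= m[i][j] * x[j];
        x[i] = sum / m[i][i];
    }
    return x;
}

// Iterative refinement: the expensive solves run in float, while the solution
// and the residual r = b - A x are accumulated in double.
std::vector<double> solveMixed(const Matrix& a, const std::vector<double>& b,
                               int refinements) {
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0);
    std::vector<double> residual = b;  // with x = 0 the residual equals b
    for (int step = 0; step <= refinements; ++step) {
        const std::vector<float> correction = solveSingle(a, residual);
        for (std::size_t i = 0; i < n; ++i) x[i] += correction[i];
        for (std::size_t i = 0; i < n; ++i) {
            double r = b[i];
            for (std::size_t j = 0; j < n; ++j) r -= a[i][j] * x[j];
            residual[i] = r;
        }
    }
    return x;
}
```

For a well-conditioned system, a few refinement steps typically bring the result close to double-precision accuracy while most of the arithmetic stays in the faster single-precision path. LAPACK's DSGESV routine follows the same general scheme.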

Alternative GPU Programming Systems
A Programmable Graphics Pipeline in CUDA for Order Independent Transparency

This work presents a rasterization-based rendering pipeline using CUDA. We discuss the implementation details of the basic functions of a hardware rendering pipeline, with a focus on triangle rasterization and raster operations. Within this architecture, we propose two single-pass algorithms for efficient rendering of order-independent transparency. The results demonstrate significant speedups over state-of-the-art methods based on traditional graphics pipelines.

This work is based on the SI3D paper "FreePipe: a Programmable Parallel Rendering Architecture for Efficient Multi-Fragment Effects" (http://portal.acm.org/citation.cfm?id=1730804.1730817). The source code is attached below; please read the "Readme" before running the code.

GPU Computing Gems Source Code
Multiclass Support Vector Machine

Serial algorithms can no longer rely on CPU improvements to scale. The performance of classical Support Vector Machine (SVM) implementations has reached its limit, and the arrival of the multicore era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high-performance platforms for implementing data-parallel algorithms. This project describes how a naïve implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and use its computational throughput efficiently. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical solver, LIBSVM, while guaranteeing the same accuracy.
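To make the parallelism concrete, here is a minimal sketch of one common way to build a multiclass classifier from binary SVMs: one-vs-rest with linear decision functions. This scheme is an assumption chosen for illustration, since the description above does not say which decomposition or kernel multisvm uses. The names `BinaryModel`, `decisionValue`, and `classifyOneVsRest` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One trained binary classifier ("class c versus everything else").
// A linear decision function is assumed here for simplicity.
struct BinaryModel {
    std::vector<double> weights;
    double bias;
};

// Signed score of sample x under one binary model: w . x + b.
double decisionValue(const BinaryModel& model, const std::vector<double>& x) {
    double score = model.bias;
    for (std::size_t i = 0; i < model.weights.size() && i < x.size(); ++i) {
        score += model.weights[i] * x[i];
    }
    return score;
}

// Return the index of the class whose binary model scores x highest.
// Requires at least one model. Every score is independent of the others,
// which is the kind of data parallelism a GPU can exploit.
std::size_t classifyOneVsRest(const std::vector<BinaryModel>& models,
                              const std::vector<double>& x) {
    std::size_t best = 0;
    double bestScore = decisionValue(models[0], x);
    for (std::size_t c = 1; c < models.size(); ++c) {
        const double score = decisionValue(models[c], x);
        if (score > bestScore) {
            bestScore = score;
            best = c;
        }
    }
    return best;
}
```

In this formulation the per-class scores, and the scores for different samples, can all be evaluated concurrently, so a GPU version could assign one thread or thread block to each (sample, class) pair.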

Please find attached the multisvm 2.0 release of the source code.

The link to the source code repository where future versions will be available is http://code.google.com/p/multisvm/

* Sample datasets were removed due to their large file size. These can be obtained from the code repository site or the LIBSVM site.

** To compile the code, please add the following CUDA libraries to the bin folder of the project, or download the release code from the Google Code repository (which already contains them as part of the Visual Studio solution).

     cublas64_30_14.dll

     cudart64_30_14.dll

     cudpp64_30_14.dll

     cufft64_30_14.dll

     cutil64.dll

     glew64.dll

     glut32.dll

Computational Finance, Machine Learning and Data Mining, GPU Computing Gems Source Code
Parallelization of the x264 encoder using OpenCL

We present an OpenCL-enhanced version of the x264 video encoder that uses GPUs to accelerate motion estimation and other significant parts of the algorithm. We take a system-wide approach, concentrating on the whole encoder architecture rather than only on accelerating the critical paths.

This demo includes the full source code for the OpenCL-enhanced version of x264, plus some scripts to fetch images from "Big Buck Bunny" and encode them using x264. Please note that downloading the source images might take a long time and requires a substantial amount of disk space (~3.4 GB). We are working on adding the source code to the x264 development tree.

More info can be found at: http://li5.ziti.uni-heidelberg.de/x264gpu/

GPU Computing Gems Source Code
Haar Classifiers for Object Detection with CUDA: Pixel-parallel processing kernel

This kernel performs pixel-parallel processing of the image using a cascade of Haar classifiers. The getter functions implement interfaces to various kinds of GPU memory; the kernel template parameters select which one is used. The snippet is presented as pseudo-code. To ask about the status of the project source code, contact me directly (Anton Obukhov <aobukhov@nvidia.com>) or write to devsupport@nvidia.com.

GPU Computing Gems Source Code
RNA folding GPU

This code is a GPU implementation of the 'hybrid-ss-min' function of the UNAFold package, which computes RNA secondary structure.

GPU Computing Gems Source Code