GPU Computing Gems
Emerald Edition

Morgan Kaufmann’s Applications of GPU Computing Series
Computing is quickly becoming the third pillar of scientific research, due in large part to the performance gains achieved through graphics processing units (GPUs), which have become ubiquitous in
handhelds, laptops, desktops, and supercomputer clusters. Morgan Kaufmann’s Applications of GPU
Computing series offers training, examples, and inspiration for researchers, engineers, students, and
supercomputing professionals who want to leverage the power of GPUs incorporated into their simulations or experiments. Each high-quality, peer-reviewed book is written by leading experts uniquely
qualified to provide parallel computing insights and guidance.
Each GPU Computing Gems volume offers a snapshot of the state of parallel computing across a
carefully selected subset of industry domains, giving you a window into the leading-edge research occurring across the breadth of science, and the opportunity to observe others’ algorithm work that might
apply to your own projects.

Recommended Parallel Computing Titles
Programming Massively Parallel Processors
A Hands-on Approach
By David B. Kirk and Wen-mei W. Hwu
ISBN: 9780123814722
GPU Computing Gems: Jade Edition
Editor-in-Chief: Wen-mei W. Hwu
ISBN: 9780123859631
Coming Summer 2011
The Art of Multiprocessor Programming
By Maurice Herlihy and Nir Shavit
ISBN: 9780123705914

GPU Computing Gems
Emerald Edition

Wen-mei W. Hwu

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann Publishers is an imprint of Elsevier

Acquiring Editor: Todd Green
Assistant Editor: Robyn Day
Project Manager: Paul Gottehrer
Designer: Dennis Schaefer
Morgan Kaufmann is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
© 2011 NVIDIA Corporation and Wen-mei W. Hwu. Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or
mechanical, including photocopying, recording, or any information storage and retrieval system, without
permission in writing from the publisher. Details on how to seek permission, further information about the
Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center
and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other
than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods or professional practices may become necessary. Practitioners and
researchers must always rely on their own experience and knowledge in evaluating and using any information or
methods described herein. In using such information or methods they should be mindful of their own safety and
the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability
for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or
from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
GPU computing gems / editor, Wen-mei W. Hwu.
p. cm.
Includes bibliographical references.
ISBN 978-0-12-384988-5
1. Graphics processing units–Programming. 2. Imaging systems. 3. Computer graphics. 4. Image
processing–Digital techniques. I. Hwu, Wen-mei.
T385.G6875 2011
006.6–dc22
2010047487
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
For information on all MK publications visit our website at
www.mkp.com
Printed in the United States of America
11 12 13 14 15 11 10 9 8 7 6 5 4 3 2 1

Contents
Editors, Reviewers, and Authors
Introduction
Wen-mei W. Hwu

SECTION 1 SCIENTIFIC SIMULATION
Robert M. Farber

CHAPTER 1 GPU-Accelerated Computation and Interactive Display of Molecular Orbitals
John E. Stone, David J. Hardy, Jan Saam, Kirby L. Vandivort, Klaus Schulten

CHAPTER 2 Large-Scale Chemical Informatics on GPUs
Imran S. Haque, Vijay S. Pande

CHAPTER 3 Dynamical Quadrature Grids: Applications in Density Functional Calculations
Nathan Luehr, Ivan Ufimtsev, Todd Martinez

CHAPTER 4 Fast Molecular Electrostatics Algorithms on GPUs
David J. Hardy, John E. Stone, Kirby L. Vandivort, David Gohara, Christopher Rodrigues, Klaus Schulten

CHAPTER 5 Quantum Chemistry: Propagation of Electronic Structure on a GPU
Jacek Jakowski, Stephan Irle, Keiji Morokuma

CHAPTER 6 An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
Martin Burtscher, Keshav Pingali

CHAPTER 7 Leveraging the Untapped Computation Power of GPUs: Fast Spectral Synthesis Using Texture Interpolation
Richard Townsend, Karthikeyan Sankaralingam, Matthew D. Sinclair

CHAPTER 8 Black Hole Simulations with CUDA
Frank Herrmann, John Silberholz, Manuel Tiglio

CHAPTER 9 Treecode and Fast Multipole Method for N-Body Simulation with CUDA
Rio Yokota, Lorena A. Barba

CHAPTER 10 Wavelet-Based Density Functional Theory Calculation on Massively Parallel Hybrid Architectures
Luigi Genovese, Matthieu Ospici, Brice Videau, Thierry Deutsch, Jean-François Méhaut

SECTION 2 LIFE SCIENCES
Bertil Schmidt

CHAPTER 11 Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm
Łukasz Ligowski, Witold R. Rudnicki, Yongchao Liu, Bertil Schmidt

CHAPTER 12 Massive Parallel Computing to Accelerate Genome-Matching
Ben Weiss, Mike Bailey

CHAPTER 13 GPU-Supercomputer Acceleration of Pattern Matching
Ali Khajeh-Saeed, J. Blair Perot

CHAPTER 14 GPU Accelerated RNA Folding Algorithm
Guillaume Rizk, Dominique Lavenier, Sanjay Rajopadhye

CHAPTER 15 Temporal Data Mining for Neuroscience
Wu-chun Feng, Yong Cao, Debprakash Patnaik, Naren Ramakrishnan

SECTION 3 STATISTICAL MODELING
Mike Giles

CHAPTER 16 Parallelization Techniques for Random Number Generators
Thomas Bradley, Jacques du Toit, Robert Tong, Mike Giles, Paul Woodhams

CHAPTER 17 Monte Carlo Photon Transport on the GPU
László Szirmay-Kalos, Balázs Tóth, Milán Magdics

CHAPTER 18 High-Performance Iterated Function Systems
Christoph Schied, Johannes Hanika, Holger Dammertz, Hendrik P. A. Lensch

SECTION 4 EMERGING DATA-INTENSIVE APPLICATIONS
Volodymyr Kindratenko

CHAPTER 19 Large-Scale Machine Learning
Jerod J. Weinman, Augustus Lidaka, Shitanshu Aggarwal

CHAPTER 20 Multiclass Support Vector Machine
Sergio Herrero-Lopez

CHAPTER 21 Template-Driven Agent-Based Modeling and Simulation with CUDA
Paul Richmond, Daniela Romano

CHAPTER 22 GPU-Accelerated Ant Colony Optimization
Robin M. Weiss

SECTION 5 ELECTRONIC DESIGN AUTOMATION
Sunil P. Khatri

CHAPTER 23 High-Performance Gate-Level Simulation with GP-GPUs
Debapriya Chatterjee, Andrew DeOrio, Valeria Bertacco

CHAPTER 24 GPU-Based Parallel Computing for Fast Circuit Optimization
Yifang Liu, Jiang Hu

SECTION 6 RAY TRACING AND RENDERING
Austin Robison

CHAPTER 25 Lattice Boltzmann Lighting Models
Robert Geist, James Westall

CHAPTER 26 Path Regeneration for Random Walks
Jan Novák, Vlastimil Havran, Carsten Dachsbacher

CHAPTER 27 From Sparse Mocap to Highly Detailed Facial Animation
Bernd Bickel, Manuel Lang

CHAPTER 28 A Programmable Graphics Pipeline in CUDA for Order-Independent Transparency
Mengcheng Huang, Fang Liu, Xuehui Liu, Enhua Wu

SECTION 7 COMPUTER VISION
James Fung

CHAPTER 29 Fast Graph Cuts for Computer Vision
P.J. Narayanan, Vibhav Vineet, Timo Stich

CHAPTER 30 Visual Saliency Model on Multi-GPU
Anis Rahman, Dominique Houzet, Denis Pellerin

CHAPTER 31 Real-Time Stereo on GPGPU Using Progressive Multiresolution Adaptive Windows
Yong Zhao, Gabriel Taubin

CHAPTER 32 Real-Time Speed-Limit-Sign Recognition on an Embedded System Using a GPU
Pinar Muyan-Özçelik, Vladimir Glavtchev, Jeffrey M. Ota, John D. Owens

CHAPTER 33 Haar Classifiers for Object Detection with CUDA
Anton Obukhov

SECTION 8 VIDEO AND IMAGE PROCESSING
Timo Stich

CHAPTER 34 Experiences on Image and Video Processing with CUDA and OpenCL
Alptekin Temizel, Tugba Halici, Berker Logoglu, Tugba Taskaya Temizel, Fatih Omruuzun, Ersin Karaman

CHAPTER 35 Connected Component Labeling in CUDA
Ondřej Šťava, Bedřich Beneš

CHAPTER 36 Image De-Mosaicing
Joe Stam, James Fung

SECTION 9 SIGNAL AND AUDIO PROCESSING
John Roberts

CHAPTER 37 Efficient Automatic Speech Recognition on the GPU
Jike Chong, Ekaterina Gonina, Kurt Keutzer

CHAPTER 38 Parallel LDPC Decoding
Gabriel Falcao, Vitor Silva, Leonel Sousa

CHAPTER 39 Large-Scale Fast Fourier Transform
Yifeng Chen, Xiang Cui, Hong Mei

SECTION 10 MEDICAL IMAGING
Lawrence Tarbox

CHAPTER 40 GPU Acceleration of Iterative Digital Breast Tomosynthesis
Dana Schaa, Benjamin Brown, Byunghyun Jang, Perhaad Mistry, Rodrigo Dominguez, David Kaeli, Richard Moore, Daniel B. Kopans

CHAPTER 41 Parallelization of Katsevich CT Image Reconstruction Algorithm on Generic Multi-Core Processors and GPGPU
Abderrahim Benquassmi, Eric Fontaine, Hsien-Hsin S. Lee

CHAPTER 42 3-D Tomographic Image Reconstruction from Randomly Ordered Lines with CUDA
Guillem Pratx, Jing-Yu Cui, Sven Prevrhal, Craig S. Levin

CHAPTER 43 Using GPUs to Learn Effective Parameter Settings for GPU-Accelerated Iterative CT Reconstruction Algorithms
Wei Xu, Klaus Mueller

CHAPTER 44 Using GPUs to Accelerate Advanced MRI Reconstruction with Field Inhomogeneity Compensation
Yue Zhuo, Xiao-Long Wu, Justin P. Haldar, Thibault Marin, Wen-mei W. Hwu, Zhi-Pei Liang, Bradley P. Sutton

CHAPTER 45 ℓ1 Minimization in ℓ1-SPIRiT Compressed Sensing MRI Reconstruction
Mark Murphy, Miki Lustig

CHAPTER 46 Medical Image Processing Using GPU-Accelerated ITK Image Filters
Won-Ki Jeong, Hanspeter Pfister, Massimiliano Fatica

CHAPTER 47 Deformable Volumetric Registration Using B-Splines
James Shackelford, Nagarajan Kandasamy, Gregory Sharp

CHAPTER 48 Multiscale Unbiased Diffeomorphic Atlas Construction on Multi-GPUs
Linh Ha, Jens Krüger, Sarang Joshi, Cláudio T. Silva

CHAPTER 49 GPU-Accelerated Brain Connectivity Reconstruction and Visualization in Large-Scale Electron Micrographs
Won-Ki Jeong, Hanspeter Pfister, Johanna Beyer, Markus Hadwiger

CHAPTER 50 Fast Simulation of Radiographic Images Using a Monte Carlo X-Ray Transport Algorithm Implemented in CUDA
Andreu Badal, Aldo Badano

Index

Editors, Reviewers, and Authors
Editor-In-Chief
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

Managing Editor
Andrew Schuh, University of Illinois at Urbana-Champaign

NVIDIA Editor
Nadeem Mohammad, NVIDIA

Area Editors
Robert M. Farber, Pacific Northwest National Laboratory (Section 1)
James Fung, NVIDIA (Section 7)
Mike Giles, Oxford University (Section 3)
Sunil P. Khatri, Texas A&M University (Section 5)
Volodymyr Kindratenko, University of Illinois at Urbana-Champaign (Section 4)
John Roberts, NVIDIA (Section 9)
Austin Robison, NVIDIA (Section 6)
Bertil Schmidt, Nanyang Technological University (Section 2)
Timo Stich, NVIDIA (Section 8)
Lawrence Tarbox, Washington University in St. Louis (Section 10)

Reviewers
François Beaune, /*jupiter jazz*/ visual effects consultants
Jiawen Chen, Massachusetts Institute of Technology
Andrea Di Blas, University of California, Santa Cruz

Roshan Dsouza, University of Wisconsin-Milwaukee
Richard Edgar, Harvard University
Martin Eisemann, Technical University, Braunschweig
John Estabrook, University of Illinois at Urbana-Champaign
Cass Everitt, NVIDIA
Reza Farivar, University of Illinois at Urbana-Champaign
Vladimir Frolov, NVIDIA
Vladimir Glavtchev, BMW Technology Office
Kanupriya Gulati, Intel Corporation
Trym Vegard Haavardsholm, Norwegian Defense Research Establishment
Ken Hawick, University of Auckland, New Zealand
Jared Hoberock, NVIDIA
Tim Kaldewey, Oracle
Vinay Karkala, Advanced Micro Devices
Christian Linz, Technical University, Braunschweig
Christian Lipski, Technical University, Braunschweig
Weiguo Liu, Nanyang Technological University
Dave Luebke, NVIDIA
W. James MacLean, Google
Corey Manders, A*STAR Institute for Infocomm Research
Morgan McGuire, Williams College, Massachusetts
Derek Nowrouzezahrai, Disney Research Zurich

Ming Ouyang, University of Louisville, Kentucky
Steven Parker, NVIDIA
Kalyan Perumalla, Oak Ridge National Laboratory
Nicolas Pinto, Massachusetts Institute of Technology
Tobias Preis, Johannes Gutenberg University
Ramtin Shams, Australian National University
Craig Steffen, University of Illinois at Urbana-Champaign
Andrei Tatarinov, NVIDIA
Cristina Nader Vasconcelos, Instituto de Computação, Universidade Federal Fluminense, Brazil
Ben Weiss, Shell and Slate Software
Ruediger Westermann, Technical University, Munich
Jan Woetzel, MeVis Medical Solutions, AG
Kesheng Wu, Berkeley Lab, University of California
Ren Wu, HP Labs
Weihang Zhu, Lamar University, Texas

Authors
Shitanshu Aggarwal, Grinnell College, Iowa (Chapter 19)
Mike Bailey, Oregon State University (Chapter 12)
Andreu Badal, US Food and Drug Administration (CDRH/OSEL/DIAM) (Chapter 50)
Aldo Badano, US Food and Drug Administration (CDRH/OSEL/DIAM) (Chapter 50)
Lorena A. Barba, Boston University (Chapter 9)
Bedřich Beneš, Purdue University, Indiana (Chapter 35)
Abderrahim Benquassmi, Georgia Institute of Technology (Chapter 41)
Valeria Bertacco, University of Michigan (Chapter 23)
Johanna Beyer, King Abdullah University of Science and Technology (KAUST) (Chapter 49)

Bernd Bickel, Disney Research, Zurich (Chapter 27)
Thomas Bradley, NVIDIA (Chapter 16)
Benjamin Brown, Northeastern University (Chapter 40)
Martin Burtscher, Texas State University, San Marcos (Chapter 6)
Yong Cao, Virginia Tech (Chapter 15)
Debapriya Chatterjee, University of Michigan (Chapter 23)
Yifeng Chen, Peking University (Chapter 39)
Jike Chong, University of California, Berkeley (Chapter 37)
Jing-Yu Cui, Stanford University (Chapter 42)
Xiang Cui, Peking University (Chapter 39)
Carsten Dachsbacher, Karlsruhe Institute of Technology (Chapter 26)
Holger Dammertz, Ulm University (Chapter 18)
Andrew DeOrio, University of Michigan (Chapter 23)
Thierry Deutsch, Laboratoire de Simulation Atomistique (Chapter 10)
Rodrigo Dominguez, Northeastern University (Chapter 40)
Jacques Du Toit, Numerical Algorithms Group (Chapter 16)
Gabriel Falcao, University of Coimbra (Chapter 38)
Massimiliano Fatica, NVIDIA (Chapter 46)
Wu-chun Feng, Virginia Tech and Wake Forest University (Chapter 15)
Eric Fontaine, Georgia Institute of Technology (Chapter 41)
James Fung, NVIDIA (Chapter 36)
Robert Geist, Clemson University (Chapter 25)
Luigi Genovese, European Synchrotron Radiation Facility (Chapter 10)
Mike Giles, Oxford University (Chapter 16)
Vladimir Glavtchev, BMW Group Technology Office (Chapter 32)
David Gohara, Saint Louis University School of Medicine (Chapter 4)
Ekaterina Gonina, University of California, Berkeley (Chapter 37)
Linh Ha, University of Utah (Chapter 48)
Markus Hadwiger, King Abdullah University of Science and Technology (KAUST) (Chapter 49)
Justin P. Haldar, University of Illinois at Urbana-Champaign (Chapter 44)
Tugba Halici, Middle East Technical University (Chapter 34)
Johannes Hanika, Ulm University (Chapter 18)
Imran S. Haque, Stanford University (Chapter 2)
David J. Hardy, University of Illinois at Urbana-Champaign (Chapters 1 and 4)
Vlastimil Havran, Czech Technical University in Prague (Chapter 26)
Sergio Herrero-Lopez, Massachusetts Institute of Technology (Chapter 20)
Frank Herrmann, University of Maryland, College Park (Chapter 8)
Dominique Houzet, GIPSA-lab (Chapter 30)
Jiang Hu, Texas A&M University (Chapter 24)
Mengcheng Huang, Chinese Academy of Sciences (Chapter 28)
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign (Chapter 44)
Stephan Irle, Nagoya University (Chapter 5)
Jacek Jakowski, National Institute for Computational Sciences (Chapter 5)
Byunghyun Jang, Northeastern University (Chapter 40)
Won-Ki Jeong, Harvard University (Chapters 46 and 49)
Sarang Joshi, University of Utah, Salt Lake City (Chapter 48)
David Kaeli, Northeastern University (Chapter 40)
Nagarajan Kandasamy, Drexel University (Chapter 47)
Ersin Karaman, Middle East Technical University (Chapter 34)
Kurt Keutzer, University of California, Berkeley (Chapter 37)
Ali Khajeh-Saeed, University of Massachusetts, Amherst (Chapter 13)
Daniel B. Kopans, Massachusetts General Hospital (Chapter 40)
Jens Krüger, Interactive Visualization and Data Analysis Group, Saarbrücken (Chapter 48)
Manuel Lang, Disney Research, Zurich (Chapter 27)
Dominique Lavenier, École Normale Supérieure de Cachan (Chapter 14)
Hsien-Hsin S. Lee, Georgia Institute of Technology (Chapter 41)
Hendrik Lensch, Ulm University (Chapter 18)
Craig S. Levin, Stanford University (Chapter 42)
Zhi-Pei Liang, University of Illinois at Urbana-Champaign (Chapter 44)
Augustus Lidaka, Grinnell College (Chapter 19)
Łukasz Ligowski, University of Warsaw (Chapter 11)
Fang Liu, Chinese Academy of Sciences (Chapter 28)
Xuehui Liu, Chinese Academy of Sciences (Chapter 28)
Yifang Liu, Texas A&M University (Chapter 24)
Yongchao Liu, Nanyang Technological University (Chapter 11)
Berker Logoglu, Middle East Technical University (Chapter 34)
Nathan Luehr, Stanford University and SLAC National Accelerator Laboratory (Chapter 3)
Miki Lustig, University of California, Berkeley (Chapter 45)
Milán Magdics, Budapest University of Technology and Economics (Chapter 17)
Thibault Marin, Illinois Institute of Technology (Chapter 44)
Todd Martinez, Stanford University and SLAC National Accelerator Laboratory (Chapter 3)
Jean-François Méhaut, Université Joseph Fourier (Chapter 10)
Hong Mei, Peking University (Chapter 39)
Perhaad Mistry, Northeastern University (Chapter 40)
Richard Moore, Massachusetts General Hospital (Chapter 40)
Keiji Morokuma, Kyoto University (Chapter 5)
Klaus Mueller, State University of New York, Stony Brook (Chapter 43)
Mark Murphy, University of California, Berkeley (Chapter 45)
Pinar Muyan-Özçelik, University of California, Davis (Chapter 32)
P. J. Narayanan, International Institute of Information Technology Hyderabad (Chapter 29)
Jan Novák, Karlsruhe Institute of Technology (Chapter 26)
Anton Obukhov, NVIDIA (Chapter 33)
Fatih Omruuzun, Middle East Technical University (Chapter 34)
Matthieu Ospici, Laboratoire d’Informatique de Grenoble (Chapter 10)
Jeffrey M. Ota, BMW Group Technology Office (Chapter 32)
John D. Owens, University of California, Davis (Chapter 32)
Vijay S. Pande, Stanford University (Chapter 2)

Debprakash Patnaik, Virginia Tech (Chapter 15)
Denis Pellerin, GIPSA-lab (Chapter 30)
J. Blair Perot, University of Massachusetts, Amherst (Chapter 13)
Hanspeter Pfister, Harvard University (Chapters 46 and 49)
Keshav Pingali, Texas State University, San Marcos (Chapter 6)
Guillem Pratx, Stanford University (Chapter 42)
Sven Prevrhal, Philips Healthcare (Chapter 42)
Anis Rahman, GIPSA-lab (Chapter 30)
Sanjay Rajopadhye, Colorado State University (Chapter 14)
Naren Ramakrishnan, Virginia Tech (Chapter 15)
Paul Richmond, University of Sheffield (Chapter 21)
Guillaume Rizk, Institut de Recherche en Informatique et Systèmes Aléatoires, Université de Rennes (Chapter 14)
Christopher Rodrigues, University of Illinois at Urbana-Champaign (Chapter 4)
Daniela Romano, University of Sheffield (Chapter 21)
Witold R. Rudnicki, University of Warsaw (Chapter 11)
Jan Saam, University of Illinois at Urbana-Champaign (Chapter 1)
Karthikeyan Sankaralingam, University of Wisconsin-Madison (Chapter 7)
Dana Schaa, Northeastern University (Chapter 40)
Christoph Schied, Ulm University (Chapter 18)
Bertil Schmidt, Nanyang Technological University (Chapter 11)
Klaus Schulten, University of Illinois at Urbana-Champaign (Chapters 1 and 4)
James Shackleford, Drexel University (Chapter 47)
Gregory Sharp, Massachusetts General Hospital (Chapter 47)
John Silberholz, University of Maryland (Chapter 8)
Claudio Silva, University of Utah (Chapter 48)

Vitor Silva, University of Coimbra (Chapter 38)
Matthew D. Sinclair, University of Wisconsin-Madison (Chapter 7)
Leonel Sousa, Technical University of Lisbon (Chapter 38)
Joe Stam, NVIDIA (Chapter 36)
Ondřej Šťava, Purdue University (Chapter 35)
Timo Stich, NVIDIA (Chapter 29)
John E. Stone, University of Illinois at Urbana-Champaign (Chapters 1 and 4)
Bradley P. Sutton, University of Illinois at Urbana-Champaign (Chapter 44)
László Szirmay-Kalos, Budapest University of Technology and Economics (Chapter 17)
Gabriel Taubin, Brown University (Chapter 31)
Alptekin Temizel, Middle East Technical University (Chapter 34)
Tugba Taskaya Temizel, Middle East Technical University (Chapter 34)
Manuel Tiglio, University of Maryland (Chapter 8)
Robert Tong, Numerical Algorithms Group (Chapter 16)
Balázs Tóth, Budapest University of Technology and Economics (Chapter 17)
Richard Townsend, University of Wisconsin-Madison (Chapter 7)
Ivan Ufimtsev, Stanford University and SLAC National Accelerator Laboratory (Chapter 3)
Kirby L. Vandivort, University of Illinois at Urbana-Champaign (Chapters 1 and 4)
Brice Videau, Laboratoire de Simulation Atomistique, Grenoble (Chapter 10)
Vibhav Vineet, International Institute of Information Technology, Hyderabad (Chapter 29)
Jerod J. Weinman, Grinnell College (Chapter 19)

Ben Weiss, Oregon State University (Chapter 12)
Robin M. Weiss, Macalester College (Chapter 22)
James Westall, Clemson University (Chapter 25)
Paul Woodhams, Numerical Algorithms Group (Chapter 16)
Enhua Wu, Chinese Academy of Sciences (Chapter 28)
Xiao-Long Wu, University of Illinois at Urbana-Champaign (Chapter 44)
Wei Xu, State University of New York, Stony Brook (Chapter 43)
Rio Yokota, Brown University (Chapter 9)
Yong Zhao, Brown University (Chapter 31)
Yue Zhuo, University of Illinois at Urbana-Champaign (Chapter 44)

Introduction

Wen-mei W. Hwu

STATE OF GPU COMPUTING
We are entering the golden age of GPU computing. Since the introduction of CUDA in 2007, more
than 100 million computers with CUDA-capable GPUs have been shipped to end users. Unlike the
previous GPGPU shader programming models, CUDA supports parallel programming in C. From my
own experience in teaching CUDA programming, C programmers can begin to write basic CUDA
programs after attending only one lecture and reading one textbook chapter. With such a low barrier to
entry, researchers all over the world have been engaged in developing new algorithms and applications
to take advantage of the extreme floating-point execution throughput of these GPUs.
Today, there is a large community of GPU computing practitioners. Many of them have reported a
10 to 100 times speedup of their applications with GPU computing. To put this into perspective, with
the historical 2X performance growth every 2 years, these researchers are experiencing the equivalent
of time travel of 8 to 12 years. That is, they are getting the performance today that they would otherwise
have had to wait 8 to 12 years for if they had relied on the “free-ride” advancement of performance in microprocessors.
Interestingly, such “free-ride” advancement is no longer available. Furthermore, once they develop
their application in CUDA, they will likely see continued performance growth of 2X every two
years from this day forward.
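
The arithmetic behind this comparison can be sketched in a few lines. This is an illustrative calculation only, assuming performance doubles exactly once every two years; the function names are hypothetical and do not come from the book. Under that assumption, a 10X to 100X speedup corresponds to roughly 6.6 to 13.3 years of growth, and the quoted 8 to 12 years corresponds to a 16X to 64X speedup, so the figures in the text are best read as a rounded estimate.

```python
import math


def equivalent_years(speedup, doubling_period_years=2.0):
    """Years of steady doubling needed to reach the given speedup."""
    return doubling_period_years * math.log2(speedup)


def speedup_after(years, doubling_period_years=2.0):
    """Speedup accumulated after the given number of years of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Speedups reported by practitioners, expressed as years of 2X-per-2-years growth.
    for s in (10, 100):
        print(f"{s}X speedup is about {equivalent_years(s):.1f} years of growth")
    # The 8-to-12-year range quoted in the text, expressed as a speedup.
    for y in (8, 12):
        print(f"{y} years of growth is about a {speedup_after(y):.0f}X speedup")
```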
After discussions with numerous researchers, I have reached the conclusion that many of them are
solving similar algorithmic problems in their programming efforts. Although they are working on diverse
applications, they often end up developing similar algorithmic strategies. The idea of GPU Computing Gems is to provide a convenient means for application developers in diverse application areas to
benefit from each other’s experience. In this volume, we have collected 50 gem articles written by
researchers in 10 diverse areas. Each gem article reports a successful application experience in GPU
computing. These articles describe the techniques or “secret sauce” that contributed to the success.
The authors highlight the potential applicability of their techniques to other application areas. In our
editorial process, we have emphasized the accessibility of these gems to researchers in other areas.
When we issued the call for proposals for the first GPU Computing Gems, we received more than
280 submissions, an overwhelming response. After careful review, we accepted 110 proposals that
have a high likelihood of making valuable contributions to other application developers. Many high-quality proposals were not accepted because of concerns that they may not be accessible to a large
audience. With so many accepted proposals, we were forced to divide these gems into two volumes.
This volume covers 50 gems in the application areas of scientific simulation, life sciences, statistical
modeling, emerging data-intensive applications, electronic design automation, ray tracing and rendering, computer vision, video and image processing, signal and audio processing, and medical imaging.
Each gem is first edited by an area editor who is a GPU computing expert in that area. This is followed
by my own editing of these articles.
I would like to thank the people who have worked tirelessly on this project. Nadeem Mohammad
at NVIDIA and Andrew Schuh at UIUC have done so much heavy lifting for this project. Without
them, it would have been impossible for me to coordinate so many authors and area editors. My area
editors, whose names are in front of each section of this volume, have volunteered their valuable time
and energy to improve the quality of the gems. They worked closely with the authors to make sure that
the gems indeed meet high technical standards while remaining accessible to a wide audience. I would like
to thank all the authors who have shared their innovative work with the GPU computing community.
All authors have worked hard to respond to our requests for improvements. Finally, I would like to
acknowledge Manju Hegde, who championed the creation of GPU Computing Gems and pursued me
to serve as the editor in chief. It has been a true privilege to work with all of these great people.

Online Resources
Visit the book’s companion website and click the ONLINE RESOURCES tab to connect
to gpucomputing.net, the vibrant official community site for GPU computing, where you can download
source code examples for most chapters and join discussions with other readers and GPU developers. You’ll also find links to additional material including chapter walk-through videos and full-color
versions of many figures from the book.

SECTION 1

Scientific Simulation
Area Editor’s Introduction
Robert M. Farber

1 GPU-Accelerated Computation and Interactive Display of Molecular Orbitals
2 Large-Scale Chemical Informatics on GPUs
3 Dynamical Quadrature Grids: Applications in Density Functional Calculations
4 Fast Molecular Electrostatics Algorithms on GPUs
5 Quantum Chemistry: Propagation of Electronic Structure on a GPU
6 An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
7 Leveraging the Untapped Computation Power of GPUs: Fast Spectral Synthesis Using Texture Interpolation
8 Black Hole Simulations with CUDA
9 Treecode and Fast Multipole Method for N-Body Simulation with CUDA
10 Wavelet-Based Density Functional Theory Calculation on Massively Parallel Hybrid Architectures


THE STATE OF GPU COMPUTING IN SCIENTIFIC SIMULATION
GPU computing is revolutionizing scientific simulation by providing one to two orders of magnitude
of increased computing performance per GPU at price points even students can afford. Exciting things
are happening with this technology in the hands of the masses, as reflected by the applications, CUDA
Gems, and the extraordinary number of papers that have appeared in the literature since CUDA was
first introduced in February 2007.
Technology that provides two or more orders of magnitude of increased computational capability
is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers. I cannot help getting excited by the potential as simulations that previously would
have taken a year or more to complete can now be finished in days. Better scientific insight also becomes
possible because researchers can work with more data and have the ability to utilize more accurate,
albeit computationally expensive, approximations and numerical methods. We are now entering the
era where hybrid clusters and supercomputers containing large numbers of GPUs are being built and
used around the world. As a result, many researchers (and funding agencies) now have to rethink their
computational models and invest in software to create scalable, high-performance applications based on
this technology. The potential is there, and some lucky researchers may find themselves with a Galilean
first opportunity to see, study, and model using exquisitely detailed data from projects utilizing GPU
technology and these hybrid systems.

IN THIS SECTION
The chapters in this section provide gems of insight both in thought and CUDA implementation to
map challenging scientific simulation problems to GPU technology. Techniques to work with irregular
grids, dynamic surfaces, treecodes, and far-field calculations are presented. All of these CUDA gems
can be adapted and should provide food for thought in solving challenging computational problems
in many areas. Innovative solutions are discussed, including just-in-time (JIT) compilation; appropriate and effective use of fast on-chip GPU memory resources across GPU technology generations; the
application of texture unit arithmetic to augment GPU computational and global memory performance;
and the creation of solutions that can scale across multiple GPUs in a distributed environment. General kernel optimization principles are also provided in many chapters. Some of the kernels presented
require fewer than 200 lines of CUDA code, yet still provide impressive performance.
In Chapter 1: Evaluating molecular orbitals on 3-D lattices is a common problem in molecular visualization. This chapter discusses the design trade-offs in the popular VMD (visual molecular dynamics)
software system plus the appropriate and effective use of fast on-chip GPU memory resources across
various generations of GPUs. Several kernel optimization principles are provided. To account for varying problem size and GPU performance regimes, an innovative just-in-time (JIT) kernel compilation
technique is utilized.
In Chapter 2: The authors discuss the techniques they used to adapt the LINGO string similarity
algorithm to run efficiently on GPUs and avoid the memory bandwidth bottlenecks and conditional operations that
limit parallelism in the CPU implementation. These techniques, as well as the discussion on minimizing
CPU-GPU transfer overhead and exploiting thread-level parallelism, should benefit readers in many
areas, not just those interested in large-scale chemical informatics.


In Chapter 3: This chapter discusses a GPU-accelerated dynamic quadrature grid method where
the grid points move over the course of the calculation. The merits of several parallelization schemes,
mixed precision arithmetic as an optimization technique, and problems arising from branching within
a warp are discussed.
In Chapter 4: GPU kernels are presented that calculate electrostatic potential maps on structured
grids containing a large amount of fine-grained data parallelism. Approaches to regularize the computation work are discussed along with kernel loop optimizations and implementation notes on how to
best use the GPU memory subsystem. All of this is phrased in the context of the popular VMD (visual
molecular dynamics) and APBS (Adaptive Poisson-Boltzmann Solver) software packages.
In Chapter 5: Direct molecular dynamics (MD) requires repeated calculation of the potential energy
surface obtained from electronic structure calculations. This chapter shows how this calculation can be
rethought to propagate the electronic structure without diagonalization, a time-consuming step that
is difficult to implement on GPUs. Other topics discussed include the efficient use of CUBLAS and the
integration of CUDA within a FORTRAN framework.
In Chapter 6: Irregular tree-based data structures are a challenge because the GPU memory subsystem favors coalesced memory accesses. This chapter describes a number of techniques, both novel
and conventional, to reduce main memory accesses on an irregular tree-based data structure. All the
methods run on the GPU.
In Chapter 7: The GRASSY spectral synthesis platform is described, which utilizes GPUs to address
the computational needs of asteroseismology. In particular, this chapter demonstrates an innovative use
of interpolation by CUDA texture memory to augment arithmetic performance and reduce memory
access overhead. The low precision of texture memory arithmetic is discussed and shown not to affect
solution accuracy. Mesh building and rasterization are also covered.
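
To make the precision trade-off concrete, here is a minimal CPU-side sketch in plain Python (not the GRASSY code) of 1-D linear texture filtering in the form the CUDA programming guide documents: the lookup coordinate is shifted by 0.5 and the interpolation weight is held in fixed point with 8 fractional bits. The function name `tex1d_linear`, the clamp addressing, and the round-to-nearest quantization of the weight are assumptions made for illustration.

```python
import math


def tex1d_linear(table, x, frac_bits=8):
    """Emulate a 1-D linearly filtered texture fetch on the CPU.

    Assumptions for this sketch: unnormalized coordinates, clamp
    addressing, and round-to-nearest quantization of the weight.
    """
    xb = x - 0.5                    # texel centers sit at i + 0.5
    i = math.floor(xb)              # index of the left-hand texel
    alpha = xb - i                  # exact fractional weight in [0, 1)
    scale = 1 << frac_bits          # 256 steps for 8 fractional bits
    alpha_q = round(alpha * scale) / scale  # low-precision weight

    last = len(table) - 1
    lo = table[min(max(i, 0), last)]        # clamp out-of-range indices
    hi = table[min(max(i + 1, 0), last)]
    return (1.0 - alpha_q) * lo + alpha_q * hi


table = [0.0, 10.0, 20.0]
exact = 5.01                         # true linear interpolant at x = 1.001
approx = tex1d_linear(table, 1.001)
print(abs(approx - exact))           # bounded by (10.0 / 512) in this sketch
```

With round-to-nearest quantization the weight is off by at most 1/512, so the fetched value differs from the exact interpolant by at most 1/512 of the difference between neighboring table entries. Whether that is acceptable depends on how finely the table is sampled.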
In Chapter 8: Exploring the parameter space of a complex dynamical system is an important facet
of scientific simulation. Many problems require integration of a coupled set of ordinary differential
equations (ODEs). Rather than parallelizing a single integration, the authors use CUDA to turn the
GPU into a survey engine that performs many integrations at once. With this technology, scientists can
examine more of the phase space of the problem to gain a better understanding of the dynamics of
the simulation. In the case of black hole inspirals, GPU technology might have a significant impact
in the quest for direct measurement of gravitational waves. Robustness across GPUs in a distributed MPI
environment is also discussed.
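
The survey idea can be sketched on the CPU in a few lines of plain Python. This is only an illustration under stated assumptions, not the chapter's code: the test problem (a harmonic oscillator), the classical fourth-order Runge-Kutta integrator, and the names `rk4_step`, `integrate_one`, and `survey` are all hypothetical. The point is the structure: each integration is independent of the others, so on a GPU the body of `integrate_one` could be assigned to one thread per parameter value.

```python
import math


def rk4_step(f, t, y, dt):
    """Advance the state y by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate_one(omega, t_end, steps):
    """Integrate x'' = -omega^2 x from (x, v) = (1, 0).

    Plays the role of a single GPU thread: it reads one parameter
    value and shares no state with any other integration.
    """
    def rhs(t, y):
        x, v = y
        return [v, -omega * omega * x]

    dt = t_end / steps
    y = [1.0, 0.0]
    for n in range(steps):
        y = rk4_step(rhs, n * dt, y, dt)
    return y


def survey(omegas, t_end, steps):
    """Run one independent integration per parameter value."""
    return [integrate_one(w, t_end, steps) for w in omegas]


# Sweep a small range of frequencies over the same time span.
final_states = survey([0.5, 1.0, 1.5, 2.0], 2 * math.pi, 2000)
```

Because no integration depends on another, the outer loop in `survey` is the part that maps onto thousands of concurrent threads; the inner time-stepping loop stays sequential within each thread.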
In Chapter 9: As this chapter shows, constructing fast N-body algorithms is far from a formidable
task. Basic kernels are discussed that achieve substantial speedups (15x to 150x) in fewer than 200 lines
of CUDA code. These same kernels extend previous GPU Gems N-body CUDA mappings to encompass parallel far-field approximations that are useful for astrophysics, acoustics, molecular dynamics,
particle simulation, electromagnetics, and boundary integral formulations. Other topics include structuring the data to preserve coalesced memory accesses and balancing parallelism and data reuse through
the use of tiles.
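
As a rough illustration of tiling, the sketch below computes softened gravitational accelerations by direct summation, walking the source bodies one tile at a time. It is a plain-Python CPU sketch and not the chapter's kernels: the function name `accel_tiled`, the softening length `eps`, and the choice G = 1 are assumptions. On a GPU, each tile would typically be staged once into fast on-chip memory and then reused by every thread in a block, which is where the data reuse comes from.

```python
import math


def accel_tiled(pos, mass, tile, eps=1e-3):
    """Direct-sum accelerations (G = 1), processing sources tile by tile."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for start in range(0, n, tile):
        # On a GPU this slice would be loaded once into on-chip memory.
        tile_pos = pos[start:start + tile]
        tile_mass = mass[start:start + tile]
        for i in range(n):           # each target body reuses the same tile
            xi, yi, zi = pos[i]
            ax, ay, az = acc[i]
            for (xj, yj, zj), mj in zip(tile_pos, tile_mass):
                dx, dy, dz = xj - xi, yj - yi, zj - zi
                # Softening keeps the self-interaction finite (and zero).
                r2 = dx * dx + dy * dy + dz * dz + eps * eps
                w = mj / (r2 * math.sqrt(r2))
                ax += w * dx
                ay += w * dy
                az += w * dz
            acc[i] = [ax, ay, az]
    return acc


pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
mass = [1.0, 2.0, 3.0]
acc = accel_tiled(pos, mass, tile=2)
```

The tile size does not change the result, only how many sources are held at once. Larger tiles mean more reuse per load, while smaller tiles leave more on-chip memory free for additional concurrent blocks.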
In Chapter 10: The authors discuss the GPU-specific thought and implementation details for
BigDFT, a massively parallel implementation of a full DFT (density functional theory) code for quantum chemistry that runs on hybrid clusters and supercomputers containing many GPUs. From the
unconventional use of Daubechies wavelets, which are well suited for GPU-accelerated environments,
the authors progress to a discussion of scalability and integration in a distributed runtime environment.
