Data Parallel C++ : Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL.

Bibliographic Details
Main Author: Reinders, James.
Other Authors: Ashbaugh, Ben; Brodman, James; Kinsner, Michael; Pennycook, John; Tian, Xinmin.
Format: eBook
Language: English
Published: Berkeley, CA : Apress L. P., 2020.
Edition: 1st ed.
Table of Contents:
  • Intro
  • Table of Contents
  • About the Authors
  • Preface
  • Acknowledgments
  • Chapter 1: Introduction
  • Read the Book, Not the Spec
  • SYCL 1.2.1 vs. SYCL 2020, and DPC++
  • Getting a DPC++ Compiler
  • Book GitHub
  • Hello, World! and a SYCL Program Dissection
  • Queues and Actions
  • It Is All About Parallelism
  • Throughput
  • Latency
  • Think Parallel
  • Amdahl and Gustafson
  • Scaling
  • Heterogeneous Systems
  • Data-Parallel Programming
  • Key Attributes of DPC++ and SYCL
  • Single-Source
  • Host
  • Devices
  • Sharing Devices
  • Kernel Code
  • Kernel: Vector Addition (DAXPY)
  • Asynchronous Task Graphs
  • Race Conditions When We Make a Mistake
  • C++ Lambda Functions
  • Portability and Direct Programming
  • Concurrency vs. Parallelism
  • Summary
  • Chapter 2: Where Code Executes
  • Single-Source
  • Host Code
  • Device Code
  • Choosing Devices
  • Method#1: Run on a Device of Any Type
  • Queues
  • Binding a Queue to a Device, When Any Device Will Do
  • Method#2: Using the Host Device for Development and Debugging
  • Method#3: Using a GPU (or Other Accelerators)
  • Device Types
  • Accelerator Devices
  • Device Selectors
  • When Device Selection Fails
  • Method#4: Using Multiple Devices
  • Method#5: Custom (Very Specific) Device Selection
  • device_selector Base Class
  • Mechanisms to Score a Device
  • Three Paths to Device Code Execution on CPU
  • Creating Work on a Device
  • Introducing the Task Graph
  • Where Is the Device Code?
  • Actions
  • Fallback
  • Summary
  • Chapter 3: Data Management
  • Introduction
  • The Data Management Problem
  • Device Local vs. Device Remote
  • Managing Multiple Memories
  • Explicit Data Movement
  • Implicit Data Movement
  • Selecting the Right Strategy
  • USM, Buffers, and Images
  • Unified Shared Memory
  • Accessing Memory Through Pointers
  • USM and Data Movement
  • Explicit Data Movement in USM
  • Implicit Data Movement in USM
  • Buffers
  • Creating Buffers
  • Accessing Buffers
  • Access Modes
  • Ordering the Uses of Data
  • In-order Queues
  • Out-of-Order (OoO) Queues
  • Explicit Dependences with Events
  • Implicit Dependences with Accessors
  • Choosing a Data Management Strategy
  • Handler Class: Key Members
  • Summary
  • Chapter 4: Expressing Parallelism
  • Parallelism Within Kernels
  • Multidimensional Kernels
  • Loops vs. Kernels
  • Overview of Language Features
  • Separating Kernels from Host Code
  • Different Forms of Parallel Kernels
  • Basic Data-Parallel Kernels
  • Understanding Basic Data-Parallel Kernels
  • Writing Basic Data-Parallel Kernels
  • Details of Basic Data-Parallel Kernels
  • The range Class
  • The id Class
  • The item Class
  • Explicit ND-Range Kernels
  • Understanding Explicit ND-Range Parallel Kernels
  • Work-Items
  • Work-Groups
  • Sub-Groups
  • Writing Explicit ND-Range Data-Parallel Kernels
  • Details of Explicit ND-Range Data-Parallel Kernels
  • The nd_range Class
  • The nd_item Class
  • The group Class
  • The sub_group Class
  • Hierarchical Parallel Kernels
  • Understanding Hierarchical Data-Parallel Kernels
  • Writing Hierarchical Data-Parallel Kernels
  • Details of Hierarchical Data-Parallel Kernels
  • The h_item Class
  • The private_memory Class
  • Mapping Computation to Work-Items
  • One-to-One Mapping
  • Many-to-One Mapping
  • Choosing a Kernel Form
  • Summary
  • Chapter 5: Error Handling
  • Safety First
  • Types of Errors
  • Let's Create Some Errors!
  • Synchronous Error
  • Asynchronous Error
  • Application Error Handling Strategy
  • Ignoring Error Handling
  • Synchronous Error Handling
  • Asynchronous Error Handling
  • The Asynchronous Handler
  • Invocation of the Handler
  • Errors on a Device
  • Summary
  • Chapter 6: Unified Shared Memory
  • Why Should We Use USM?
  • Allocation Types
  • Device Allocations
  • Host Allocations
  • Shared Allocations
  • Allocating Memory
  • What Do We Need to Know?
  • Multiple Styles
  • Allocations à la C
  • Allocations à la C++
  • C++ Allocators
  • Deallocating Memory
  • Allocation Example
  • Data Management
  • Initialization
  • Data Movement
  • Explicit
  • Implicit
  • Migration
  • Fine-Grained Control
  • Queries
  • Summary
  • Chapter 7: Buffers
  • Buffers
  • Creation
  • Buffer Properties
  • use_host_ptr
  • use_mutex
  • context_bound
  • What Can We Do with a Buffer?
  • Accessors
  • Accessor Creation
  • What Can We Do with an Accessor?
  • Summary
  • Chapter 8: Scheduling Kernels and Data Movement
  • What Is Graph Scheduling?
  • How Graphs Work in DPC++
  • Command Group Actions
  • How Command Groups Declare Dependences
  • Examples
  • When Are the Parts of a CG Executed?
  • Data Movement
  • Explicit
  • Implicit
  • Synchronizing with the Host
  • Summary
  • Chapter 9: Communication and Synchronization
  • Work-Groups and Work-Items
  • Building Blocks for Efficient Communication
  • Synchronization via Barriers
  • Work-Group Local Memory
  • Using Work-Group Barriers and Local Memory
  • Work-Group Barriers and Local Memory in ND-Range Kernels
  • Local Accessors
  • Synchronization Functions
  • A Full ND-Range Kernel Example
  • Work-Group Barriers and Local Memory in Hierarchical Kernels
  • Scopes for Local Memory and Barriers
  • A Full Hierarchical Kernel Example
  • Sub-Groups
  • Synchronization via Sub-Group Barriers
  • Exchanging Data Within a Sub-Group
  • A Full Sub-Group ND-Range Kernel Example
  • Collective Functions
  • Broadcast
  • Votes
  • Shuffles
  • Loads and Stores
  • Summary
  • Chapter 10: Defining Kernels
  • Why Three Ways to Represent a Kernel?
  • Kernels As Lambda Expressions
  • Elements of a Kernel Lambda Expression
  • Naming Kernel Lambda Expressions
  • Kernels As Named Function Objects
  • Elements of a Kernel Named Function Object
  • Interoperability with Other APIs
  • Interoperability with API-Defined Source Languages
  • Interoperability with API-Defined Kernel Objects
  • Kernels in Program Objects
  • Summary
  • Chapter 11: Vectors
  • How to Think About Vectors
  • Vector Types
  • Vector Interface
  • Load and Store Member Functions
  • Swizzle Operations
  • Vector Execution Within a Parallel Kernel
  • Vector Parallelism
  • Summary
  • Chapter 12: Device Information
  • Refining Kernel Code to Be More Prescriptive
  • How to Enumerate Devices and Capabilities
  • Custom Device Selector
  • Being Curious: get_info<>
  • Being More Curious: Detailed Enumeration Code
  • Inquisitive: get_info<>
  • Device Information Descriptors
  • Device-Specific Kernel Information Descriptors
  • The Specifics: Those of "Correctness"
  • Device Queries
  • Kernel Queries
  • The Specifics: Those of "Tuning/Optimization"
  • Device Queries
  • Kernel Queries
  • Runtime vs. Compile-Time Properties
  • Summary
  • Chapter 13: Practical Tips
  • Getting a DPC++ Compiler and Code Samples
  • Online Forum and Documentation
  • Platform Model
  • Multiarchitecture Binaries
  • Compilation Model
  • Adding SYCL to Existing C++ Programs
  • Debugging
  • Debugging Kernel Code
  • Debugging Runtime Failures
  • Initializing Data and Accessing Kernel Outputs
  • Multiple Translation Units
  • Performance Implications of Multiple Translation Units
  • When Anonymous Lambdas Need Names
  • Migrating from CUDA to SYCL
  • Summary
  • Chapter 14: Common Parallel Patterns
  • Understanding the Patterns
  • Map
  • Stencil
  • Reduction
  • Scan
  • Pack and Unpack
  • Pack
  • Unpack
  • Using Built-In Functions and Libraries
  • The DPC++ Reduction Library
  • The reduction Class
  • The reducer Class
  • User-Defined Reductions
  • oneAPI DPC++ Library
  • Group Functions
  • Direct Programming
  • Map
  • Stencil
  • Reduction
  • Scan
  • Pack and Unpack
  • Pack
  • Unpack
  • Summary
  • For More Information
  • Chapter 15: Programming for GPUs
  • Performance Caveats
  • How GPUs Work
  • GPU Building Blocks
  • Simpler Processors (but More of Them)
  • Expressing Parallelism
  • Expressing More Parallelism
  • Simplified Control Logic (SIMD Instructions)
  • Predication and Masking
  • SIMD Efficiency
  • SIMD Efficiency and Groups of Items
  • Switching Work to Hide Latency
  • Offloading Kernels to GPUs
  • SYCL Runtime Library
  • GPU Software Drivers
  • GPU Hardware
  • Beware the Cost of Offloading!
  • Transfers to and from Device Memory
  • GPU Kernel Best Practices
  • Accessing Global Memory
  • Accessing Work-Group Local Memory
  • Avoiding Local Memory Entirely with Sub-Groups
  • Optimizing Computation Using Small Data Types
  • Optimizing Math Functions
  • Specialized Functions and Extensions
  • Summary
  • For More Information
  • Chapter 16: Programming for CPUs
  • Performance Caveats
  • The Basics of a General-Purpose CPU
  • The Basics of SIMD Hardware
  • Exploiting Thread-Level Parallelism
  • Thread Affinity Insight
  • Be Mindful of First Touch to Memory
  • SIMD Vectorization on CPU
  • Ensure SIMD Execution Legality
  • SIMD Masking and Cost
  • Avoid Array-of-Struct for SIMD Efficiency
  • Data Type Impact on SIMD Efficiency
  • SIMD Execution Using single_task
  • Summary
  • Chapter 17: Programming for FPGAs
  • Performance Caveats
  • How to Think About FPGAs
  • Pipeline Parallelism
  • Kernels Consume Chip "Area"
  • When to Use an FPGA
  • Lots and Lots of Work
  • Custom Operations or Operation Widths
  • Scalar Data Flow
  • Low Latency and Rich Connectivity
  • Customized Memory Systems
  • Running on an FPGA
  • Compile Times
  • The FPGA Emulator
  • FPGA Hardware Compilation Occurs "Ahead-of-Time"
  • Writing Kernels for FPGAs
  • Exposing Parallelism