Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL
Format: | eBook |
Language: | English |
Published: | Berkeley, CA : Apress L. P., 2020 |
Edition: | 1st ed. |
Table of Contents:
- Intro
- Table of Contents
- About the Authors
- Preface
- Acknowledgments
- Chapter 1: Introduction
- Read the Book, Not the Spec
- SYCL 1.2.1 vs. SYCL 2020, and DPC++
- Getting a DPC++ Compiler
- Book GitHub
- Hello, World! and a SYCL Program Dissection
- Queues and Actions
- It Is All About Parallelism
- Throughput
- Latency
- Think Parallel
- Amdahl and Gustafson
- Scaling
- Heterogeneous Systems
- Data-Parallel Programming
- Key Attributes of DPC++ and SYCL
- Single-Source
- Host
- Devices
- Sharing Devices
- Kernel Code
- Kernel: Vector Addition (DAXPY)
- Asynchronous Task Graphs
- Race Conditions When We Make a Mistake
- C++ Lambda Functions
- Portability and Direct Programming
- Concurrency vs. Parallelism
- Summary
- Chapter 2: Where Code Executes
- Single-Source
- Host Code
- Device Code
- Choosing Devices
- Method#1: Run on a Device of Any Type
- Queues
- Binding a Queue to a Device, When Any Device Will Do
- Method#2: Using the Host Device for Development and Debugging
- Method#3: Using a GPU (or Other Accelerators)
- Device Types
- Accelerator Devices
- Device Selectors
- When Device Selection Fails
- Method#4: Using Multiple Devices
- Method#5: Custom (Very Specific) Device Selection
- device_selector Base Class
- Mechanisms to Score a Device
- Three Paths to Device Code Execution on CPU
- Creating Work on a Device
- Introducing the Task Graph
- Where Is the Device Code?
- Actions
- Fallback
- Summary
- Chapter 3: Data Management
- Introduction
- The Data Management Problem
- Device Local vs. Device Remote
- Managing Multiple Memories
- Explicit Data Movement
- Implicit Data Movement
- Selecting the Right Strategy
- USM, Buffers, and Images
- Unified Shared Memory
- Accessing Memory Through Pointers
- USM and Data Movement
- Explicit Data Movement in USM
- Implicit Data Movement in USM
- Buffers
- Creating Buffers
- Accessing Buffers
- Access Modes
- Ordering the Uses of Data
- In-order Queues
- Out-of-Order (OoO) Queues
- Explicit Dependences with Events
- Implicit Dependences with Accessors
- Choosing a Data Management Strategy
- Handler Class: Key Members
- Summary
- Chapter 4: Expressing Parallelism
- Parallelism Within Kernels
- Multidimensional Kernels
- Loops vs. Kernels
- Overview of Language Features
- Separating Kernels from Host Code
- Different Forms of Parallel Kernels
- Basic Data-Parallel Kernels
- Understanding Basic Data-Parallel Kernels
- Writing Basic Data-Parallel Kernels
- Details of Basic Data-Parallel Kernels
- The range Class
- The id Class
- The item Class
- Explicit ND-Range Kernels
- Understanding Explicit ND-Range Parallel Kernels
- Work-Items
- Work-Groups
- Sub-Groups
- Writing Explicit ND-Range Data-Parallel Kernels
- Details of Explicit ND-Range Data-Parallel Kernels
- The nd_range Class
- The nd_item Class
- The group Class
- The sub_group Class
- Hierarchical Parallel Kernels
- Understanding Hierarchical Data-Parallel Kernels
- Writing Hierarchical Data-Parallel Kernels
- Details of Hierarchical Data-Parallel Kernels
- The h_item Class
- The private_memory Class
- Mapping Computation to Work-Items
- One-to-One Mapping
- Many-to-One Mapping
- Choosing a Kernel Form
- Summary
- Chapter 5: Error Handling
- Safety First
- Types of Errors
- Let's Create Some Errors!
- Synchronous Error
- Asynchronous Error
- Application Error Handling Strategy
- Ignoring Error Handling
- Synchronous Error Handling
- Asynchronous Error Handling
- The Asynchronous Handler
- Invocation of the Handler
- Errors on a Device
- Summary
- Chapter 6: Unified Shared Memory
- Why Should We Use USM?
- Allocation Types
- Device Allocations
- Host Allocations
- Shared Allocations
- Allocating Memory
- What Do We Need to Know?
- Multiple Styles
- Allocations à la C
- Allocations à la C++
- C++ Allocators
- Deallocating Memory
- Allocation Example
- Data Management
- Initialization
- Data Movement
- Explicit
- Implicit
- Migration
- Fine-Grained Control
- Queries
- Summary
- Chapter 7: Buffers
- Buffers
- Creation
- Buffer Properties
- use_host_ptr
- use_mutex
- context_bound
- What Can We Do with a Buffer?
- Accessors
- Accessor Creation
- What Can We Do with an Accessor?
- Summary
- Chapter 8: Scheduling Kernels and Data Movement
- What Is Graph Scheduling?
- How Graphs Work in DPC++
- Command Group Actions
- How Command Groups Declare Dependences
- Examples
- When Are the Parts of a CG Executed?
- Data Movement
- Explicit
- Implicit
- Synchronizing with the Host
- Summary
- Chapter 9: Communication and Synchronization
- Work-Groups and Work-Items
- Building Blocks for Efficient Communication
- Synchronization via Barriers
- Work-Group Local Memory
- Using Work-Group Barriers and Local Memory
- Work-Group Barriers and Local Memory in ND-Range Kernels
- Local Accessors
- Synchronization Functions
- A Full ND-Range Kernel Example
- Work-Group Barriers and Local Memory in Hierarchical Kernels
- Scopes for Local Memory and Barriers
- A Full Hierarchical Kernel Example
- Sub-Groups
- Synchronization via Sub-Group Barriers
- Exchanging Data Within a Sub-Group
- A Full Sub-Group ND-Range Kernel Example
- Collective Functions
- Broadcast
- Votes
- Shuffles
- Loads and Stores
- Summary
- Chapter 10: Defining Kernels
- Why Three Ways to Represent a Kernel?
- Kernels As Lambda Expressions
- Elements of a Kernel Lambda Expression
- Naming Kernel Lambda Expressions
- Kernels As Named Function Objects
- Elements of a Kernel Named Function Object
- Interoperability with Other APIs
- Interoperability with API-Defined Source Languages
- Interoperability with API-Defined Kernel Objects
- Kernels in Program Objects
- Summary
- Chapter 11: Vectors
- How to Think About Vectors
- Vector Types
- Vector Interface
- Load and Store Member Functions
- Swizzle Operations
- Vector Execution Within a Parallel Kernel
- Vector Parallelism
- Summary
- Chapter 12: Device Information
- Refining Kernel Code to Be More Prescriptive
- How to Enumerate Devices and Capabilities
- Custom Device Selector
- Being Curious: get_info<>
- Being More Curious: Detailed Enumeration Code
- Inquisitive: get_info<>
- Device Information Descriptors
- Device-Specific Kernel Information Descriptors
- The Specifics: Those of "Correctness"
- Device Queries
- Kernel Queries
- The Specifics: Those of "Tuning/Optimization"
- Device Queries
- Kernel Queries
- Runtime vs. Compile-Time Properties
- Summary
- Chapter 13: Practical Tips
- Getting a DPC++ Compiler and Code Samples
- Online Forum and Documentation
- Platform Model
- Multiarchitecture Binaries
- Compilation Model
- Adding SYCL to Existing C++ Programs
- Debugging
- Debugging Kernel Code
- Debugging Runtime Failures
- Initializing Data and Accessing Kernel Outputs
- Multiple Translation Units
- Performance Implications of Multiple Translation Units
- When Anonymous Lambdas Need Names
- Migrating from CUDA to SYCL
- Summary
- Chapter 14: Common Parallel Patterns
- Understanding the Patterns
- Map
- Stencil
- Reduction
- Scan
- Pack and Unpack
- Pack
- Unpack
- Using Built-In Functions and Libraries
- The DPC++ Reduction Library
- The reduction Class
- The reducer Class
- User-Defined Reductions
- oneAPI DPC++ Library
- Group Functions
- Direct Programming
- Map
- Stencil
- Reduction
- Scan
- Pack and Unpack
- Pack
- Unpack
- Summary
- For More Information
- Chapter 15: Programming for GPUs
- Performance Caveats
- How GPUs Work
- GPU Building Blocks
- Simpler Processors (but More of Them)
- Expressing Parallelism
- Expressing More Parallelism
- Simplified Control Logic (SIMD Instructions)
- Predication and Masking
- SIMD Efficiency
- SIMD Efficiency and Groups of Items
- Switching Work to Hide Latency
- Offloading Kernels to GPUs
- SYCL Runtime Library
- GPU Software Drivers
- GPU Hardware
- Beware the Cost of Offloading!
- Transfers to and from Device Memory
- GPU Kernel Best Practices
- Accessing Global Memory
- Accessing Work-Group Local Memory
- Avoiding Local Memory Entirely with Sub-Groups
- Optimizing Computation Using Small Data Types
- Optimizing Math Functions
- Specialized Functions and Extensions
- Summary
- For More Information
- Chapter 16: Programming for CPUs
- Performance Caveats
- The Basics of a General-Purpose CPU
- The Basics of SIMD Hardware
- Exploiting Thread-Level Parallelism
- Thread Affinity Insight
- Be Mindful of First Touch to Memory
- SIMD Vectorization on CPU
- Ensure SIMD Execution Legality
- SIMD Masking and Cost
- Avoid Array-of-Struct for SIMD Efficiency
- Data Type Impact on SIMD Efficiency
- SIMD Execution Using single_task
- Summary
- Chapter 17: Programming for FPGAs
- Performance Caveats
- How to Think About FPGAs
- Pipeline Parallelism
- Kernels Consume Chip "Area"
- When to Use an FPGA
- Lots and Lots of Work
- Custom Operations or Operation Widths
- Scalar Data Flow
- Low Latency and Rich Connectivity
- Customized Memory Systems
- Running on an FPGA
- Compile Times
- The FPGA Emulator
- FPGA Hardware Compilation Occurs "Ahead-of-Time"
- Writing Kernels for FPGAs
- Exposing Parallelism