Pro TBB : C++ Parallel Programming with Threading Building Blocks.
Main Author:
Other Authors:
Format: eBook
Language: English
Published: Berkeley, CA : Apress L. P., 2019
Edition: 1st ed.
Subjects:
Table of Contents:
- Intro
- Table of Contents
- About the Authors
- Acknowledgments
- Preface
- Part 1
- Chapter 1: Jumping Right In: "Hello, TBB!"
- Why Threading Building Blocks?
- Performance: Small Overhead, Big Benefits for C++
- Evolving Support for Parallelism in TBB and C++
- Recent C++ Additions for Parallelism
- The Threading Building Blocks (TBB) Library
- Parallel Execution Interfaces
- Interfaces That Are Independent of the Execution Model
- Using the Building Blocks in TBB
- Let's Get Started Already!
- Getting the Threading Building Blocks (TBB) Library
- Getting a Copy of the Examples
- Writing a First "Hello, TBB!" Example
- Building the Simple Examples
- Steps to Set Up an Environment
- Building on Windows Using Microsoft Visual Studio
- Building on a Linux Platform from a Terminal
- Using the Intel Compiler
- tbbvars and pstlvars Scripts
- Setting Up Variables Manually Without Using the tbbvars Script or the Intel Compiler
- A More Complete Example
- Starting with a Serial Implementation
- Adding a Message-Driven Layer Using a Flow Graph
- Adding a Fork-Join Layer Using a parallel_for
- Adding a SIMD Layer Using a Parallel STL Transform
- Summary
- Chapter 2: Generic Parallel Algorithms
- Functional / Task Parallelism
- A Slightly More Complicated Example: A Parallel Implementation of Quicksort
- Loops: parallel_for, parallel_reduce, and parallel_scan
- parallel_for: Applying a Body to Each Element in a Range
- A Slightly More Complicated Example: Parallel Matrix Multiplication
- parallel_reduce: Calculating a Single Result Across a Range
- A Slightly More Complicated Example: Calculating π by Numerical Integration
- parallel_scan: A Reduction with Intermediate Values
- How Does This Work?
- A Slightly More Complicated Example: Line of Sight
- Cook Until Done: parallel_do and parallel_pipeline
- parallel_do: Apply a Body Until There Are No More Items Left
- A Slightly More Complicated Example: Forward Substitution
- parallel_pipeline: Streaming Items Through a Series of Filters
- A Slightly More Complicated Example: Creating 3D Stereoscopic Images
- Summary
- For More Information
- Chapter 3: Flow Graphs
- Why Use Graphs to Express Parallelism?
- The Basics of the TBB Flow Graph Interface
- Step 1: Create the Graph Object
- Step 2: Make the Nodes
- Step 3: Add Edges
- Step 4: Start the Graph
- Step 5: Wait for the Graph to Complete Executing
- A More Complicated Example of a Data Flow Graph
- Implementing the Example as a TBB Flow Graph
- Understanding the Performance of a Data Flow Graph
- The Special Case of Dependency Graphs
- Implementing a Dependency Graph
- Estimating the Scalability of a Dependency Graph
- Advanced Topics in TBB Flow Graphs
- Summary
- Chapter 4: TBB and the Parallel Algorithms of the C++ Standard Template Library
- Does the C++ STL Library Belong in This Book?
- A Parallel STL Execution Policy Analogy
- A Simple Example Using std::for_each
- What Algorithms Are Provided in a Parallel STL Implementation?
- How to Get and Use a Copy of Parallel STL That Uses TBB
- Algorithms in Intel's Parallel STL
- Capturing More Use Cases with Custom Iterators
- Highlighting Some of the Most Useful Algorithms
- std::for_each, std::for_each_n
- std::transform
- std::reduce
- std::transform_reduce
- A Deeper Dive into the Execution Policies
- The sequenced_policy
- The parallel_policy
- The unsequenced_policy
- The parallel_unsequenced_policy
- Which Execution Policy Should We Use?
- Other Ways to Introduce SIMD Parallelism
- Summary
- For More Information
- Chapter 5: Synchronization: Why and How to Avoid It
- A Running Example: Histogram of an Image
- An Unsafe Parallel Implementation
- A First Safe Parallel Implementation: Coarse-Grained Locking
- Mutex Flavors
- A Second Safe Parallel Implementation: Fine-Grained Locking
- A Third Safe Parallel Implementation: Atomics
- A Better Parallel Implementation: Privatization and Reduction
- Thread Local Storage, TLS
- enumerable_thread_specific, ETS
- combinable
- The Easiest Parallel Implementation: Reduction Template
- Recap of Our Options
- Summary
- For More Information
- Chapter 6: Data Structures for Concurrency
- Key Data Structures Basics
- Unordered Associative Containers
- Map vs. Set
- Multiple Values
- Hashing
- Unordered
- Concurrent Containers
- Concurrent Unordered Associative Containers
- concurrent_hash_map
- Concurrent Support for map/multimap and set/multiset Interfaces
- Built-In Locking vs. No Visible Locking
- Iterating Through These Structures Is Asking for Trouble
- Concurrent Queues: Regular, Bounded, and Priority
- Bounding Size
- Priority Ordering
- Staying Thread-Safe: Try to Forget About Top, Size, Empty, Front, Back
- Iterators
- Why to Use This Concurrent Queue: The A-B-A Problem
- When to NOT Use Queues: Think Algorithms!
- Concurrent Vector
- When to Use tbb::concurrent_vector Instead of std::vector
- Elements Never Move
- Concurrent Growth of concurrent_vectors
- Summary
- Chapter 7: Scalable Memory Allocation
- Modern C++ Memory Allocation
- Scalable Memory Allocation: What
- Scalable Memory Allocation: Why
- Avoiding False Sharing with Padding
- Scalable Memory Allocation Alternatives: Which
- Compilation Considerations
- Most Popular Usage (C/C++ Proxy Library): How
- Linux: malloc/new Proxy Library Usage
- macOS: malloc/new Proxy Library Usage
- Windows: malloc/new Proxy Library Usage
- Testing Our Proxy Library Usage
- C Functions: Scalable Memory Allocators for C
- C++ Classes: Scalable Memory Allocators for C++
- Allocators with std::allocator&lt;T&gt; Signature
- scalable_allocator
- tbb_allocator
- zero_allocator
- cache_aligned_allocator
- Memory Pool Support: memory_pool_allocator
- Array Allocation Support: aligned_space
- Replacing new and delete Selectively
- Performance Tuning: Some Control Knobs
- What Are Huge Pages?
- TBB Support for Huge Pages
- scalable_allocation_mode(int mode, intptr_t value)
- TBBMALLOC_USE_HUGE_PAGES
- TBBMALLOC_SET_SOFT_HEAP_LIMIT
- int scalable_allocation_command(int cmd, void *param)
- TBBMALLOC_CLEAN_ALL_BUFFERS
- TBBMALLOC_CLEAN_THREAD_BUFFERS
- Summary
- Chapter 8: Mapping Parallel Patterns to TBB
- Parallel Patterns vs. Parallel Algorithms
- Patterns Categorize Algorithms, Designs, etc.
- Patterns That Work
- Data Parallelism Wins
- Nesting Pattern
- Map Pattern
- Workpile Pattern
- Reduction Patterns (Reduce and Scan)
- Fork-Join Pattern
- Divide-and-Conquer Pattern
- Branch-and-Bound Pattern
- Pipeline Pattern
- Event-Based Coordination Pattern (Reactive Streams)
- Summary
- For More Information
- Part 2
- Chapter 9: The Pillars of Composability
- What Is Composability?
- Nested Composition
- Concurrent Composition
- Serial Composition
- The Features That Make TBB a Composable Library
- The TBB Thread Pool (the Market) and Task Arenas
- The TBB Task Dispatcher: Work Stealing and More
- Putting It All Together
- Looking Forward
- Controlling the Number of Threads
- Work Isolation
- Task-to-Thread and Thread-to-Core Affinity
- Task Priorities
- Summary
- For More Information
- Chapter 10: Using Tasks to Create Your Own Algorithms
- A Running Example: The Sequence
- The High-Level Approach: parallel_invoke
- The Highest Among the Lower: task_group
- The Low-Level Task Interface: Part One - Task Blocking
- The Low-Level Task Interface: Part Two - Task Continuation
- Bypassing the Scheduler
- The Low-Level Task Interface: Part Three - Task Recycling
- Task Interface Checklist
- One More Thing: FIFO (aka Fire-and-Forget) Tasks
- Putting These Low-Level Features to Work
- Summary
- For More Information
- Chapter 11: Controlling the Number of Threads Used for Execution
- A Brief Recap of the TBB Scheduler Architecture
- Interfaces for Controlling the Number of Threads
- Controlling Thread Count with task_scheduler_init
- Controlling Thread Count with task_arena
- Controlling Thread Count with global_control
- Summary of Concepts and Classes
- The Best Approaches for Setting the Number of Threads
- Using a Single task_scheduler_init Object for a Simple Application
- Using More Than One task_scheduler_init Object in a Simple Application
- Using Multiple Arenas with Different Numbers of Slots to Influence Where TBB Places Its Worker Threads
- Using global_control to Control How Many Threads Are Available to Fill Arena Slots
- Using global_control to Temporarily Restrict the Number of Available Threads
- When NOT to Control the Number of Threads
- Figuring Out What's Gone Wrong
- Summary
- Chapter 12: Using Work Isolation for Correctness and Performance
- Work Isolation for Correctness
- Creating an Isolated Region with this_task_arena::isolate
- Oh No! Work Isolation Can Cause Its Own Correctness Issues!
- Even When It Is Safe, Work Isolation Is Not Free
- Using Task Arenas for Isolation: A Double-Edged Sword
- Don't Be Tempted to Use task_arenas to Create Work Isolation for Correctness
- Summary
- For More Information
- Chapter 13: Creating Thread-to-Core and Task-to-Thread Affinity
- Creating Thread-to-Core Affinity
- Creating Task-to-Thread Affinity
- When and How Should We Use the TBB Affinity Features?
- Summary
- For More Information