About the Book


Programming Massively Parallel Processors, 2nd Edition (English Edition)
  • By David B. Kirk (US) and Wen-mei W. Hwu (US)
  • Publisher: China Machine Press, Beijing
  • ISBN: 9787111416296
  • Publication date: 2013
  • Listed page count: 496
  • File size: 106 MB
  • File page count: 517
  • Subject terms: parallel programs - programming - English

PDF Download

  • Online PDF download [recommended: cloud extraction, quick and convenient]: direct PDF download, works on both mobile and PC.
  • Torrent download [fast via BT]. Tip: please use the BT download tool FDM; see the software download page.
  • Direct-link download [convenient but slow]
  • [Read this book online]
  • [Get the extraction password online]

Download Notes

Programming Massively Parallel Processors, 2nd Edition (English Edition), PDF e-book download.

The downloaded file is a RAR archive; use decompression software to extract the PDF.

We recommend downloading with the BT tool Free Download Manager (FDM), which is free, ad-free, and cross-platform. All resources on this site are packaged as BT seeds, so a dedicated BT client is required, such as BitComet, qBittorrent, or uTorrent. Thunder (Xunlei) is currently not recommended because this site's resources are not popular; once a resource becomes popular, Thunder will also work.

(The file page count should exceed the listed page count, except for multi-volume e-books.)

Note: all archives on this site require an extraction password: click to download the archive extraction tool.

Table of Contents

CHAPTER 1 Introduction  1
1.1 Heterogeneous Parallel Computing  2
1.2 Architecture of a Modern GPU  8
1.3 Why More Speed or Parallelism?  10
1.4 Speeding Up Real Applications  12
1.5 Parallel Programming Languages and Models  14
1.6 Overarching Goals  16
1.7 Organization of the Book  17
References  21

CHAPTER 2 History of GPU Computing  23
2.1 Evolution of Graphics Pipelines  23
  The Era of Fixed-Function Graphics Pipelines  24
  Evolution of Programmable Real-Time Graphics  28
  Unified Graphics and Computing Processors  31
2.2 GPGPU: An Intermediate Step  33
2.3 GPU Computing  34
  Scalable GPUs  35
  Recent Developments  36
  Future Trends  37
References and Further Reading  37

CHAPTER 3 Introduction to Data Parallelism and CUDA C  41
3.1 Data Parallelism  42
3.2 CUDA Program Structure  43
3.3 A Vector Addition Kernel  45
3.4 Device Global Memory and Data Transfer  48
3.5 Kernel Functions and Threading  53
3.6 Summary  58
  Function Declarations  59
  Kernel Launch  59
  Predefined Variables  59
  Runtime API  60
3.7 Exercises  60
References  62

CHAPTER 4 Data-Parallel Execution Model  63
4.1 CUDA Thread Organization  64
4.2 Mapping Threads to Multidimensional Data  68
4.3 Matrix-Matrix Multiplication: A More Complex Kernel  74
4.4 Synchronization and Transparent Scalability  81
4.5 Assigning Resources to Blocks  83
4.6 Querying Device Properties  85
4.7 Thread Scheduling and Latency Tolerance  87
4.8 Summary  91
4.9 Exercises  91

CHAPTER 5 CUDA Memories  95
5.1 Importance of Memory Access Efficiency  96
5.2 CUDA Device Memory Types  97
5.3 A Strategy for Reducing Global Memory Traffic  105
5.4 A Tiled Matrix-Matrix Multiplication Kernel  109
5.5 Memory as a Limiting Factor to Parallelism  115
5.6 Summary  118
5.7 Exercises  119

CHAPTER 6 Performance Considerations  123
6.1 Warps and Thread Execution  124
6.2 Global Memory Bandwidth  132
6.3 Dynamic Partitioning of Execution Resources  141
6.4 Instruction Mix and Thread Granularity  143
6.5 Summary  145
6.6 Exercises  145
References  149

CHAPTER 7 Floating-Point Considerations  151
7.1 Floating-Point Format  152
  Normalized Representation of M  152
  Excess Encoding of E  153
7.2 Representable Numbers  155
7.3 Special Bit Patterns and Precision in IEEE Format  160
7.4 Arithmetic Accuracy and Rounding  161
7.5 Algorithm Considerations  162
7.6 Numerical Stability  164
7.7 Summary  169
7.8 Exercises  170
References  171

CHAPTER 8 Parallel Patterns: Convolution  173
8.1 Background  174
8.2 1D Parallel Convolution: A Basic Algorithm  179
8.3 Constant Memory and Caching  181
8.4 Tiled 1D Convolution with Halo Elements  185
8.5 A Simpler Tiled 1D Convolution: General Caching  192
8.6 Summary  193
8.7 Exercises  194

CHAPTER 9 Parallel Patterns: Prefix Sum  197
9.1 Background  198
9.2 A Simple Parallel Scan  200
9.3 Work Efficiency Considerations  204
9.4 A Work-Efficient Parallel Scan  205
9.5 Parallel Scan for Arbitrary-Length Inputs  210
9.6 Summary  214
9.7 Exercises  215
Reference  216

CHAPTER 10 Parallel Patterns: Sparse Matrix-Vector Multiplication  217
10.1 Background  218
10.2 Parallel SpMV Using CSR  222
10.3 Padding and Transposition  224
10.4 Using Hybrid to Control Padding  226
10.5 Sorting and Partitioning for Regularization  230
10.6 Summary  232
10.7 Exercises  233
References  234

CHAPTER 11 Application Case Study: Advanced MRI Reconstruction  235
11.1 Application Background  236
11.2 Iterative Reconstruction  239
11.3 Computing FHD  241
  Step 1: Determine the Kernel Parallelism Structure  243
  Step 2: Getting Around the Memory Bandwidth Limitation  249
  Step 3: Using Hardware Trigonometry Functions  255
  Step 4: Experimental Performance Tuning  259
11.4 Final Evaluation  260
11.5 Exercises  262
References  264

CHAPTER 12 Application Case Study: Molecular Visualization and Analysis  265
12.1 Application Background  266
12.2 A Simple Kernel Implementation  268
12.3 Thread Granularity Adjustment  272
12.4 Memory Coalescing  274
12.5 Summary  277
12.6 Exercises  279
References  279

CHAPTER 13 Parallel Programming and Computational Thinking  281
13.1 Goals of Parallel Computing  282
13.2 Problem Decomposition  283
13.3 Algorithm Selection  287
13.4 Computational Thinking  293
13.5 Summary  294
13.6 Exercises  294
References  295

CHAPTER 14 An Introduction to OpenCL™  297
14.1 Background  297
14.2 Data Parallelism Model  299
14.3 Device Architecture  301
14.4 Kernel Functions  303
14.5 Device Management and Kernel Launch  304
14.6 Electrostatic Potential Map in OpenCL  307
14.7 Summary  311
14.8 Exercises  312
References  313

CHAPTER 15 Parallel Programming with OpenACC  315
15.1 OpenACC Versus CUDA C  315
15.2 Execution Model  318
15.3 Memory Model  319
15.4 Basic OpenACC Programs  320
  Parallel Construct  320
  Loop Construct  322
  Kernels Construct  327
  Data Management  331
  Asynchronous Computation and Data Transfer  335
15.5 Future Directions of OpenACC  336
15.6 Exercises  337

CHAPTER 16 Thrust: A Productivity-Oriented Library for CUDA  339
16.1 Background  339
16.2 Motivation  342
16.3 Basic Thrust Features  343
  Iterators and Memory Space  344
  Interoperability  345
16.4 Generic Programming  347
16.5 Benefits of Abstraction  349
16.6 Programmer Productivity  349
  Robustness  350
  Real-World Performance  350
16.7 Best Practices  352
  Fusion  353
  Structure of Arrays  354
  Implicit Ranges  356
16.8 Exercises  357
References  358

CHAPTER 17 CUDA FORTRAN  359
17.1 CUDA FORTRAN and CUDA C Differences  360
17.2 A First CUDA FORTRAN Program  361
17.3 Multidimensional Array in CUDA FORTRAN  363
17.4 Overloading Host/Device Routines With Generic Interfaces  364
17.5 Calling CUDA C Via Iso_C_Binding  367
17.6 Kernel Loop Directives and Reduction Operations  369
17.7 Dynamic Shared Memory  370
17.8 Asynchronous Data Transfers  371
17.9 Compilation and Profiling  377
17.10 Calling Thrust from CUDA FORTRAN  378
17.11 Exercises  382

CHAPTER 18 An Introduction to C++ AMP  383
18.1 Core C++ AMP Features  384
18.2 Details of the C++ AMP Execution Model  391
  Explicit and Implicit Data Copies  391
  Asynchronous Operation  393
  Section Summary  395
18.3 Managing Accelerators  395
18.4 Tiled Execution  398
18.5 C++ AMP Graphics Features  401
18.6 Summary  405
18.7 Exercises  405

CHAPTER 19 Programming a Heterogeneous Computing Cluster  407
19.1 Background  408
19.2 A Running Example  408
19.3 MPI Basics  410
19.4 MPI Point-to-Point Communication Types  414
19.5 Overlapping Computation and Communication  421
19.6 MPI Collective Communication  431
19.7 Summary  431
19.8 Exercises  432
Reference  433

CHAPTER 20 CUDA Dynamic Parallelism  435
20.1 Background  436
20.2 Dynamic Parallelism Overview  438
20.3 Important Details  439
  Launch Environment Configuration  439
  API Errors and Launch Failures  439
  Events  439
  Streams  440
  Synchronization Scope  441
20.4 Memory Visibility  442
  Global Memory  442
  Zero-Copy Memory  442
  Constant Memory  442
  Texture Memory  443
20.5 A Simple Example  444
20.6 Runtime Limitations  446
  Memory Footprint  446
  Nesting Depth  448
  Memory Allocation and Lifetime  448
  ECC Errors  449
  Streams  449
  Events  449
  Launch Pool  449
20.7 A More Complex Example  449
  Linear Bezier Curves  450
  Quadratic Bezier Curves  450
  Bezier Curve Calculation (Predynamic Parallelism)  450
  Bezier Curve Calculation (with Dynamic Parallelism)  453
20.8 Summary  456
Reference  457

CHAPTER 21 Conclusion and Future Outlook  459
21.1 Goals Revisited  459
21.2 Memory Model Evolution  461
21.3 Kernel Execution Control Evolution  464
21.4 Core Performance  467
21.5 Programming Environment  467
21.6 Future Outlook  468
References  469

Appendix A: Matrix Multiplication Host-Only Version Source Code  471
Appendix B: GPU Compute Capabilities  481
Index  487
