Book Introduction

High Performance Spark (Reprint Edition) | PDF / EPUB / MOBI / Kindle ebook edition, Baidu Cloud download

High Performance Spark (Reprint Edition)
  • Authors: Holden Karau, Rachel Warren
  • Publisher: Southeast University Press, Nanjing
  • ISBN: 9787564175184
  • Publication year: 2018
  • Listed page count: 344
  • File size: 42 MB
  • File page count: 360
  • Subject: Data processing software (English)

PDF Download



Download Notes

High Performance Spark (Reprint Edition), PDF ebook download

The downloaded file is a RAR archive; extract it with decompression software to obtain the PDF.

We recommend downloading with Free Download Manager (FDM), a free, ad-free, cross-platform BT client. All resources on this site are packaged as BT torrents, so a dedicated BT client is required, such as BitComet, qBittorrent, or uTorrent. Thunder (Xunlei) is currently not recommended, since this site's resources are not popular with that network; once a resource becomes popular, Thunder can be used as well.

(The file page count should be greater than the listed page count, except for multi-volume ebooks.)

Note: every archive on this site is protected by a decompression code; a decompression tool can be downloaded from the site.

Table of Contents

1. Introduction to High Performance Spark
  • What Is Spark and Why Performance Matters
  • What You Can Expect to Get from This Book
  • Spark Versions
  • Why Scala?
  • To Be a Spark Expert You Have to Learn a Little Scala Anyway
  • The Spark Scala API Is Easier to Use Than the Java API
  • Scala Is More Performant Than Python
  • Why Not Scala?
  • Learning Scala
  • Conclusion

2. How Spark Works
  • How Spark Fits into the Big Data Ecosystem
  • Spark Components
  • Spark Model of Parallel Computing: RDDs
  • Lazy Evaluation
  • In-Memory Persistence and Memory Management
  • Immutability and the RDD Interface
  • Types of RDDs
  • Functions on RDDs: Transformations Versus Actions
  • Wide Versus Narrow Dependencies
  • Spark Job Scheduling
  • Resource Allocation Across Applications
  • The Spark Application
  • The Anatomy of a Spark Job
  • The DAG
  • Jobs
  • Stages
  • Tasks
  • Conclusion

3. DataFrames, Datasets, and Spark SQL
  • Getting Started with the SparkSession (or HiveContext or SQLContext)
  • Spark SQL Dependencies
  • Managing Spark Dependencies
  • Avoiding Hive JARs
  • Basics of Schemas
  • DataFrame API
  • Transformations
  • Multi-DataFrame Transformations
  • Plain Old SQL Queries and Interacting with Hive Data
  • Data Representation in DataFrames and Datasets
  • Tungsten
  • Data Loading and Saving Functions
  • DataFrameWriter and DataFrameReader
  • Formats
  • Save Modes
  • Partitions (Discovery and Writing)
  • Datasets
  • Interoperability with RDDs, DataFrames, and Local Collections
  • Compile-Time Strong Typing
  • Easier Functional (RDD "like") Transformations
  • Relational Transformations
  • Multi-Dataset Relational Transformations
  • Grouped Operations on Datasets
  • Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
  • Query Optimizer
  • Logical and Physical Plans
  • Code Generation
  • Large Query Plans and Iterative Algorithms
  • Debugging Spark SQL Queries
  • JDBC/ODBC Server
  • Conclusion

4. Joins (SQL and Core)
  • Core Spark Joins
  • Choosing a Join Type
  • Choosing an Execution Plan
  • Spark SQL Joins
  • DataFrame Joins
  • Dataset Joins
  • Conclusion

5. Effective Transformations
  • Narrow Versus Wide Transformations
  • Implications for Performance
  • Implications for Fault Tolerance
  • The Special Case of coalesce
  • What Type of RDD Does Your Transformation Return?
  • Minimizing Object Creation
  • Reusing Existing Objects
  • Using Smaller Data Structures
  • Iterator-to-Iterator Transformations with mapPartitions
  • What Is an Iterator-to-Iterator Transformation?
  • Space and Time Advantages
  • An Example
  • Set Operations
  • Reducing Setup Overhead
  • Shared Variables
  • Broadcast Variables
  • Accumulators
  • Reusing RDDs
  • Cases for Reuse
  • Deciding if Recompute Is Inexpensive Enough
  • Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
  • Alluxio (née Tachyon)
  • LRU Caching
  • Noisy Cluster Considerations
  • Interaction with Accumulators
  • Conclusion

6. Working with Key/Value Data
  • The Goldilocks Example
  • Goldilocks Version 0: Iterative Solution
  • How to Use PairRDDFunctions and OrderedRDDFunctions
  • Actions on Key/Value Pairs
  • What's So Dangerous About the groupByKey Function
  • Goldilocks Version 1: groupByKey Solution
  • Choosing an Aggregation Operation
  • Dictionary of Aggregation Operations with Performance Considerations
  • Multiple RDD Operations
  • Co-Grouping
  • Partitioners and Key/Value Data
  • Using the Spark Partitioner Object
  • Hash Partitioning
  • Range Partitioning
  • Custom Partitioning
  • Preserving Partitioning Information Across Transformations
  • Leveraging Co-Located and Co-Partitioned RDDs
  • Dictionary of Mapping and Partitioning Functions PairRDDFunctions
  • Dictionary of OrderedRDDOperations
  • Sorting by Two Keys with SortByKey
  • Secondary Sort and repartitionAndSortWithinPartitions
  • Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
  • How Not to Sort by Two Orderings
  • Goldilocks Version 2: Secondary Sort
  • A Different Approach to Goldilocks
  • Goldilocks Version 3: Sort on Cell Values
  • Straggler Detection and Unbalanced Data
  • Back to Goldilocks (Again)
  • Goldilocks Version 4: Reduce to Distinct on Each Partition
  • Conclusion

7. Going Beyond Scala
  • Beyond Scala within the JVM
  • Beyond Scala, and Beyond the JVM
  • How PySpark Works
  • How SparkR Works
  • Spark.jl (Julia Spark)
  • How Eclair JS Works
  • Spark on the Common Language Runtime (CLR): C# and Friends
  • Calling Other Languages from Spark
  • Using Pipe and Friends
  • JNI
  • Java Native Access (JNA)
  • Underneath Everything Is FORTRAN
  • Getting to the GPU
  • The Future
  • Conclusion

8. Testing and Validation
  • Unit Testing
  • General Spark Unit Testing
  • Mocking RDDs
  • Getting Test Data
  • Generating Large Datasets
  • Sampling
  • Property Checking with ScalaCheck
  • Computing RDD Difference
  • Integration Testing
  • Choosing Your Integration Testing Environment
  • Verifying Performance
  • Spark Counters for Verifying Performance
  • Projects for Verifying Performance
  • Job Validation
  • Conclusion

9. Spark MLlib and ML
  • Choosing Between Spark MLlib and Spark ML
  • Working with MLlib
  • Getting Started with MLlib (Organization and Imports)
  • MLlib Feature Encoding and Data Preparation
  • Feature Scaling and Selection
  • MLlib Model Training
  • Predicting
  • Serving and Persistence
  • Model Evaluation
  • Working with Spark ML
  • Spark ML Organization and Imports
  • Pipeline Stages
  • Explain Params
  • Data Encoding
  • Data Cleaning
  • Spark ML Models
  • Putting It All Together in a Pipeline
  • Training a Pipeline
  • Accessing Individual Stages
  • Data Persistence and Spark ML
  • Extending Spark ML Pipelines with Your Own Algorithms
  • Model and Pipeline Persistence and Serving with Spark ML
  • General Serving Considerations
  • Conclusion

10. Spark Components and Packages
  • Stream Processing with Spark
  • Sources and Sinks
  • Batch Intervals
  • Data Checkpoint Intervals
  • Considerations for DStreams
  • Considerations for Structured Streaming
  • High Availability Mode (or Handling Driver Failure or Checkpointing)
  • GraphX
  • Using Community Packages and Libraries
  • Creating a Spark Package
  • Conclusion

A. Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist

Index
