Book Introduction
High Performance Spark (Reprint Edition) 2025 | PDF | EPUB | MOBI | Kindle ebook | Baidu Cloud download

- Authors: Holden Karau, Rachel Warren
- Publisher: Southeast University Press (Nanjing)
- ISBN: 9787564175184
- Publication year: 2018
- Listed page count: 344 pages
- File size: 42 MB
- File page count: 360 pages
- Subject: Data processing software (in English)
PDF Download
Download Instructions
High Performance Spark (Reprint Edition), PDF ebook download
The downloaded file is a RAR archive; extract it with an archive tool to obtain the PDF. We recommend downloading with Free Download Manager (FDM), which is free, ad-free, and cross-platform. All resources on this site are packaged as BitTorrent seeds, so a dedicated BitTorrent client such as BitComet, qBittorrent, or uTorrent is required. Xunlei (Thunder) is currently not recommended, because this site's resources are not popular enough to have many peers; once a resource becomes popular, Xunlei will also work.
(The file page count should be greater than the listed page count, except for multi-volume ebooks.)
Note: every archive on this site requires an extraction password: click here to download the archive extraction tool.
Table of Contents
1. Introduction to High Performance Spark  1
What Is Spark and Why Performance Matters  1
What You Can Expect to Get from This Book  2
Spark Versions  3
Why Scala?  3
To Be a Spark Expert You Have to Learn a Little Scala Anyway  3
The Spark Scala API Is Easier to Use Than the Java API  4
Scala Is More Performant Than Python  4
Why Not Scala?  4
Learning Scala  5
Conclusion  6
2. How Spark Works  7
How Spark Fits into the Big Data Ecosystem  8
Spark Components  8
Spark Model of Parallel Computing: RDDs  10
Lazy Evaluation  11
In-Memory Persistence and Memory Management  13
Immutability and the RDD Interface  14
Types of RDDs  16
Functions on RDDs: Transformations Versus Actions  17
Wide Versus Narrow Dependencies  17
Spark Job Scheduling  19
Resource Allocation Across Applications  20
The Spark Application  20
The Anatomy of a Spark Job  22
The DAG  22
Jobs  23
Stages  23
Tasks  24
Conclusion  26
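The Chapter 2 entries above cover lazy evaluation and the distinction between transformations and actions. As a brief hedged illustration of those ideas (this is not an excerpt from the book; the local-mode setup, object name, and data are invented for the example), a minimal Scala sketch:

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // local mode, for illustration only
      .appName("lazy-evaluation-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)
    // Transformations are lazy: nothing is computed at this point.
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n.toLong * n)

    // An action (count) triggers evaluation of the whole lineage.
    println(s"count = ${squared.count()}")

    spark.stop()
  }
}
```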
3. DataFrames, Datasets, and Spark SQL  27
Getting Started with the SparkSession (or HiveContext or SQLContext)  28
Spark SQL Dependencies  30
Managing Spark Dependencies  31
Avoiding Hive JARs  32
Basics of Schemas  33
DataFrame API  36
Transformations  36
Multi-DataFrame Transformations  48
Plain Old SQL Queries and Interacting with Hive Data  49
Data Representation in DataFrames and Datasets  49
Tungsten  50
Data Loading and Saving Functions  51
DataFrameWriter and DataFrameReader  51
Formats  52
Save Modes  61
Partitions (Discovery and Writing)  62
Datasets  62
Interoperability with RDDs, DataFrames, and Local Collections  63
Compile-Time Strong Typing  64
Easier Functional (RDD "like") Transformations  65
Relational Transformations  65
Multi-Dataset Relational Transformations  65
Grouped Operations on Datasets  66
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)  67
Query Optimizer  69
Logical and Physical Plans  69
Code Generation  70
Large Query Plans and Iterative Algorithms  70
Debugging Spark SQL Queries  71
JDBC/ODBC Server  71
Conclusion  72
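Chapter 3 lists both relational transformations and typed, RDD-"like" functional transformations on Datasets. The following small sketch of that distinction assumes a local SparkSession and a hypothetical Panda case class; it is not taken from the book:

```scala
import org.apache.spark.sql.SparkSession

object DatasetSketch {
  // Hypothetical record type used only for this example.
  case class Panda(name: String, happiness: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    val pandas = Seq(Panda("bao", 0.9), Panda("mei", 0.4)).toDS()

    // Relational (DataFrame-style) transformation using the column DSL.
    pandas.filter($"happiness" > 0.5).show()

    // Typed, RDD-"like" functional transformation on the same Dataset.
    val names = pandas.map(_.name)
    names.show()

    spark.stop()
  }
}
```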
4. Joins (SQL and Core)  75
Core Spark Joins  75
Choosing a Join Type  77
Choosing an Execution Plan  78
Spark SQL Joins  81
DataFrame Joins  82
Dataset Joins  85
Conclusion  86
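Among the join topics above, broadcasting the smaller side is one common way to influence the execution plan. A minimal sketch, assuming two tiny hypothetical tables and Spark's broadcast hint from org.apache.spark.sql.functions; not an excerpt from the book:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical tables: a larger fact table and a small dimension table.
    val events = Seq((1, "click"), (2, "view"), (1, "view")).toDF("userId", "action")
    val users  = Seq((1, "alice"), (2, "bob")).toDF("userId", "name")

    // Hint that the small side should be broadcast, so the large side
    // does not need to be shuffled for a sort-merge join.
    val joined = events.join(broadcast(users), Seq("userId"))
    joined.show()

    spark.stop()
  }
}
```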
5. Effective Transformations  87
Narrow Versus Wide Transformations  88
Implications for Performance  90
Implications for Fault Tolerance  91
The Special Case of coalesce  92
What Type of RDD Does Your Transformation Return?  92
Minimizing Object Creation  94
Reusing Existing Objects  94
Using Smaller Data Structures  97
Iterator-to-Iterator Transformations with mapPartitions  100
What Is an Iterator-to-Iterator Transformation?  101
Space and Time Advantages  102
An Example  103
Set Operations  106
Reducing Setup Overhead  107
Shared Variables  108
Broadcast Variables  108
Accumulators  109
Reusing RDDs  114
Cases for Reuse  114
Deciding if Recompute Is Inexpensive Enough  117
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files  118
Alluxio (nee Tachyon)  122
LRU Caching  123
Noisy Cluster Considerations  124
Interaction with Accumulators  125
Conclusion  126
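Chapter 5 devotes a section to iterator-to-iterator transformations with mapPartitions. A short illustrative sketch of the pattern (per-partition setup done once, records streamed through the iterator without materializing the partition); the input data and names are invented for this example and are not from the book:

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-partitions-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("1,2", "3,4", "5,6"), numSlices = 2)

    // Iterator-to-iterator transformation: the setup runs once per partition,
    // and each record flows through iter.map without buffering the partition.
    val sums = lines.mapPartitions { iter =>
      val separator = ","          // stand-in for expensive per-partition setup
      iter.map { line =>
        line.split(separator).map(_.toInt).sum
      }
    }

    println(sums.collect().toList)   // List(3, 7, 11)
    spark.stop()
  }
}
```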
6. Working with Key/Value Data  127
The Goldilocks Example  129
Goldilocks Version 0: Iterative Solution  130
How to Use PairRDDFunctions and OrderedRDDFunctions  132
Actions on Key/Value Pairs  133
What’s So Dangerous About the groupByKey Function  134
Goldilocks Version 1: groupByKey Solution  134
Choosing an Aggregation Operation  138
Dictionary of Aggregation Operations with Performance Considerations  138
Multiple RDD Operations  141
Co-Grouping  141
Partitioners and Key/Value Data  142
Using the Spark Partitioner Object  144
Hash Partitioning  144
Range Partitioning  144
Custom Partitioning  145
Preserving Partitioning Information Across Transformations  146
Leveraging Co-Located and Co-Partitioned RDDs  146
Dictionary of Mapping and Partitioning Functions PairRDDFunctions  148
Dictionary of OrderedRDDOperations  149
Sorting by Two Keys with SortByKey  151
Secondary Sort and repartitionAndSortWithinPartitions  151
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function  152
How Not to Sort by Two Orderings  155
Goldilocks Version 2: Secondary Sort  156
A Different Approach to Goldilocks  159
Goldilocks Version 3: Sort on Cell Values  164
Straggler Detection and Unbalanced Data  165
Back to Goldilocks (Again)  167
Goldilocks Version 4: Reduce to Distinct on Each Partition  167
Conclusion  173
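The chapter above contrasts groupByKey with aggregation operations that combine values before the shuffle. A hedged sketch of that contrast on a toy key/value RDD (this is not the book's Goldilocks example; the data and names are made up):

```scala
import org.apache.spark.sql.SparkSession

object KeyValueSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("key-value-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val scores = sc.parallelize(Seq(("panda", 1), ("bear", 3), ("panda", 5)))

    // groupByKey shuffles every value for a key to one executor and can
    // exhaust memory when keys are skewed.
    val grouped = scores.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side before the shuffle, which is
    // usually the safer way to express the same aggregation.
    val reduced = scores.reduceByKey(_ + _)

    println(grouped.collect().toMap)   // Map(panda -> 6, bear -> 3)
    println(reduced.collect().toMap)
    spark.stop()
  }
}
```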
7. Going Beyond Scala  175
Beyond Scala within the JVM  176
Beyond Scala, and Beyond the JVM  180
How PySpark Works  181
How SparkR Works  189
Spark.jl (Julia Spark)  191
How Eclair JS Works  192
Spark on the Common Language Runtime (CLR) - C# and Friends  193
Calling Other Languages from Spark  193
Using Pipe and Friends  193
JNI  195
Java Native Access (JNA)  198
Underneath Everything Is FORTRAN  199
Getting to the GPU  200
The Future  201
Conclusion  201
8. Testing and Validation  203
Unit Testing  203
General Spark Unit Testing  204
Mocking RDDs  208
Getting Test Data  210
Generating Large Datasets  210
Sampling  211
Property Checking with ScalaCheck  213
Computing RDD Difference  213
Integration Testing  216
Choosing Your Integration Testing Environment  216
Verifying Performance  217
Spark Counters for Verifying Performance  217
Projects for Verifying Performance  218
Job Validation  219
Conclusion  220
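Chapter 8 covers unit testing Spark code. As a rough sketch of the general shape, the self-contained example below runs a word-count function against a local-mode SparkSession and checks the result with a plain assertion; real suites would typically use a test framework such as ScalaTest or a Spark testing library, and nothing here is taken from the book:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object WordCountTestSketch {
  // The function under test: kept free of I/O so it is easy to unit test.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")           // small local "cluster" for the test
      .appName("word-count-test-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val input  = sc.parallelize(Seq("a b", "b c"))
    val result = wordCount(input).collect().toMap

    assert(result == Map("a" -> 1, "b" -> 2, "c" -> 1))
    println("test passed")
    spark.stop()
  }
}
```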
9. Spark MLlib and ML  221
Choosing Between Spark MLlib and Spark ML  221
Working with MLlib  222
Getting Started with MLlib (Organization and Imports)  222
MLlib Feature Encoding and Data Preparation  223
Feature Scaling and Selection  228
MLlib Model Training  228
Predicting  229
Serving and Persistence  230
Model Evaluation  232
Working with Spark ML  233
Spark ML Organization and Imports  233
Pipeline Stages  234
Explain Params  235
Data Encoding  236
Data Cleaning  239
Spark ML Models  239
Putting It All Together in a Pipeline  240
Training a Pipeline  241
Accessing Individual Stages  241
Data Persistence and Spark ML  242
Extending Spark ML Pipelines with Your Own Algorithms  244
Model and Pipeline Persistence and Serving with Spark ML  252
General Serving Considerations  252
Conclusion  253
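Chapter 9 includes putting stages together in a Spark ML Pipeline. A minimal hedged sketch of a pipeline built from standard stages (StringIndexer, VectorAssembler, LogisticRegression) on an invented toy DataFrame; the column names and data are assumptions for this example, not an excerpt from the book:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny hypothetical training set: two numeric features and a string label.
    val training = Seq(
      (0.0, 1.1, "happy"),
      (2.0, 1.0, "sad"),
      (0.1, 1.2, "happy"),
      (3.0, 0.5, "sad")
    ).toDF("f1", "f2", "mood")

    // Each stage is a Transformer or Estimator; the Pipeline chains them.
    val indexer   = new StringIndexer().setInputCol("mood").setOutputCol("label")
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
    val model    = pipeline.fit(training)

    model.transform(training).select("mood", "prediction").show()
    spark.stop()
  }
}
```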
10. Spark Components and Packages  255
Stream Processing with Spark  257
Sources and Sinks  257
Batch Intervals  259
Data Checkpoint Intervals  260
Considerations for DStreams  261
Considerations for Structured Streaming  262
High Availability Mode (or Handling Driver Failure or Checkpointing)  270
GraphX  271
Using Community Packages and Libraries  271
Creating a Spark Package  273
Conclusion  274
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist  275
Index  325