In 2010 UC Berkeley’s AMPLab open sourced a big data framework called Spark, which later moved to the Apache Software Foundation and became Apache Spark. It’s one of Apache’s most popular frameworks to date, and with the Apache community supporting its development I have a feeling it’ll be around for years to come.
Picking up Spark or big data from scratch can be intimidating. But in this post I’ve collected the best books you can read for getting up to speed with this emerging technology.
If you’re completely new to Spark then you’ll want an easy book that introduces topics in a gentle yet practical manner. For this I’d recommend Apache Spark in 24 Hours.
It’s absolutely huge, totaling 592 pages full of Spark tips, tricks, workflows, and exercises for newbies. If you need an intro to Spark I would highly recommend starting with this book.
Learning Spark teaches big data analysis through APIs for three languages: Python, Scala, and Java. But this book is more than just an intro programming guide to the framework.
You’ll learn a lot of theory behind the Spark framework and what makes it tick. Beginners will learn the value of distributed datasets and command line interfaces.
What I like most is that the book lets each programmer stick with a single API. You don’t need to pick up Hadoop or learn both Java and Scala to get this working. Just be good at whatever language you choose, and be willing to hit Stack Overflow for answers along the way.
This book is perfect for beginners who want to understand why Spark is so important to big data. It does get easier the more you practice but you’ll need prior programming experience.
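If the idea of a distributed dataset sounds abstract, here’s a tiny sketch of the concept. To be clear, this is plain Python and not the Spark API: the `partition`, `count_words`, and `word_count` helpers are names I made up for illustration, and the "workers" are just list chunks on one machine. It only shows the general pattern of splitting data up, working on each piece independently, and merging the partial results.

```python
from collections import Counter
from functools import reduce


def partition(records, num_partitions):
    """Split records into chunks, standing in for data spread across workers."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(lines):
    """Per-partition work: count the words in this chunk only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(records, num_partitions=3):
    """Count words chunk by chunk, then merge the partial counts."""
    partials = [count_words(chunk) for chunk in partition(records, num_partitions)]
    return reduce(lambda a, b: a + b, partials, Counter())


lines = [
    "spark makes big data simple",
    "big data needs big tools",
    "spark scales",
]
totals = word_count(lines)
print(totals["big"])    # 3
print(totals["spark"])  # 2
```

The answer doesn’t depend on how many chunks you use, which is the property that lets a real framework spread the same work over many machines.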
I’m a fan of the Sams Teach Yourself series because these books explain topics in a way that just makes sense to beginners. Apache Spark in 24 Hours written by Jeffrey Aven teaches the fundamentals of big data webapps that connect into the Spark framework.
You’ll not only learn what Spark is, but also how to deploy it locally and externally. This includes the Spark CLI and database access through Spark SQL and NoSQL stores.
By the end you’ll be learning about 3rd party resources and extensions like Apache Kafka. You’ll also get into distributed data concepts and touch upon the more advanced enterprise features.
A good portion of this book covers the Spark API and how to interact with it on a regular basis. But it’s really an overall intro to Spark made for complete beginners, and it’s easily the best intro book I can find on the market.
Data analysis covers a much larger space when you talk about big data projects. Big Data Analytics with Spark explains workflows for the most common features like interactive data, graph data, and online streaming data.
I’d recommend this book more to professionals who already work in big data environments. The chapters are very practical and you’ll want live working examples to get your hands dirty.
However you do not need any prior knowledge of Spark to read through this book. You do need to know at least one programming language, but that’s true for all the books in this post.
So what’s so special about this book? It’s basically an all-in-one intro to Spark analysis for all forms of big data. It’s also pretty quick and to the point so it can act as a reference guide once you’re done reading.
Complete newbies are never worried about best practices and scaling their architecture. These topics are so complicated that you really can’t even worry about them when first getting started.
But Spark is a framework that handles a lot of data, and for this reason High Performance Spark is a must-read book at some point during your studies.
Naturally this is more of an advanced book so it’s not meant for beginners. But if you’re scaling large data applications you’ll need to know how to extrapolate data, adjust 3rd party add-ons, and optimize both your code and the server for handling intense loads.
Spark isn’t easy to optimize and it’s typically just one part of a larger ecosystem including other programs running on the server itself.
But if you’re looking for an ultimate guide to optimizing Spark then this is the book for you.
Most intro books consider the value of Python for Spark users. It’s a very popular language and it’s extremely versatile when it comes to dynamic web applications.
Spark for Python Developers introduces big data functionality on the web through Python. There’s a whole online guide to PySpark but it’s not the easiest thing to follow when first getting into Apache Spark development.
This book teaches you how to build a real time webapp relying on smaller exercises throughout the book. You’ll learn about advanced caching in Python and machine learning models that run over prebuilt datasets.
Real time data handling is not one of the easiest tasks to practice. Most devs need a live deployment server to access that much data.
But if you follow the exercises in this book you’ll walk away with a better understanding of Spark, Python, and real time big data analysis.
In a recent post I shared the best books for machine learning with a variety of topics from AI to probabilistic programming. In Machine Learning with Spark you’ll learn how machine learning can be applied to a big data framework like Spark.
This book spans 330 pages of live machine learning examples written in Python, Scala, and Java. It’s oriented towards developers who want to get into machine learning on big data, mostly those who already have a basic knowledge of machine learning.
But you can pick up this book with little knowledge of either machine learning or Spark. However you will need extensive knowledge of a programming language, either Python or Scala, to work through the lessons.
By the end you’ll learn how to build programs that automate sorting and cleaning data, how to handle online Spark streaming, and how to work with Spark data on an Amazon EC2 platform.
It takes hard work to understand the intricate details of Spark and big data webapps. But rather than forcing you down a study path, Apache Spark Machine Learning Blueprints shares guidelines and blueprints you can follow for various situations.
Early chapters cover the basics of machine learning, Spark RDDs, and MLlib. But you’ll quickly be thrown into the world of R programming on top of Spark SQL and other similar resources.
This book is not for the faint of heart. But it will teach you real world solutions for detecting spammy behavior, fraud, vote rigging, and similar machine learning patterns that can be analyzed through the Spark framework.
I would highly recommend that you study the basics of Spark and R programming before getting this book. The lessons are challenging and you will struggle unless you have some deeper background in big data.
This is one of the more unique Spark books because it shares real world scenarios and case studies including real solutions. Lessons are split individually by chapter with each one presenting a scenario, a goal, and a final solution.
Advanced Analytics with Spark applies big data to the real world. You’ll learn how to create a music recommendation engine, how to detect traffic surges, and how to track deforestation using publicly available data.
Most topics relate to statistics and big data. You’ll work over the three popular Spark API languages and each example shares different syntax for problem solving. This book is really unique because it doesn’t guide you in a particular direction, but instead offers pre-built solutions for real world problems.
If you’re looking for a variety of case studies and real examples for using Spark then definitely check out this book. I did a technical review of this book if you want to read more.
Live data and Spark Streaming are becoming much more attractive to enterprise developers. This book covers the fundamentals of real time data processing over a total of 200 pages of exercises with Scala and Spark libraries like GraphX, Spark SQL, and MLlib.
Learning Real-time Processing with Spark Streaming is not meant for complete beginners, although you can pick up this book with zero knowledge of Spark. However, you will need to understand Scala and machine learning basics to work through the lessons comfortably.
You’ll learn how to apply transformations to streaming data and how the results can be filtered with machine learning. Early chapters introduce you to the basics of Spark and a live example of streaming log files and data through Spark extensions.
Later in the book you’ll learn how to deploy these apps live and how to keep them secure. Big data is moving into real time data and if you’re interested in the field of data science then Spark Streaming is a topic you’ll want to learn.
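To give a rough feel for what a transformation on streaming log data looks like, here’s a small sketch. Again this is plain Python and not the Spark Streaming API: `micro_batches` and `error_counts` are hypothetical helpers, the log lines are invented, and I’m assuming each line starts with its log level. Spark Streaming works on small batches of incoming records, and this mimics that idea on a single machine.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterator of records into fixed-size batches (the last may be smaller)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def error_counts(log_lines, batch_size=2):
    """Filter each batch down to ERROR lines and yield how many were found."""
    for batch in micro_batches(log_lines, batch_size):
        errors = [line for line in batch if line.startswith("ERROR")]
        yield len(errors)


# Made-up log lines; the level is assumed to be the first token on each line.
logs = [
    "INFO service started",
    "ERROR disk full",
    "ERROR request timeout",
    "WARN slow response",
    "INFO shutdown complete",
]
print(list(error_counts(logs, batch_size=2)))  # [1, 1, 0]
```

Because the batches come from a generator, the same code works on a source that never ends, which is the main difference between streaming and ordinary batch processing.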
It can be a challenging process to launch a local app live. Getting your local project into production is always the final goal. But a proper workflow can make that goal a lot easier to achieve.
Spark: Big Data Cluster Computing in Production shares tips and workflows you can apply to move your demo Spark apps into live production. You’ll learn about common pitfalls, challenges, and real world scenarios to help you avoid disaster.
The authors of this book are all experts in big data so they’re worthwhile instructors on this topic. They not only teach how to launch, but also how to enhance security and how to optimize your Spark environment for better performance.
Make sure that you understand the basics of Spark and how it operates before you pick up this book. It probably won’t be easy to understand but it will offer the best workflow for anyone working in a Spark environment.
Visuals always appeal to techies and management alike, and this is especially true in enterprise businesses. Spark GraphX in Action is the go-to book for learning graph rendering and visual big data analysis using GraphX.
The GraphX library is covered in many other books mentioned in this post. But none of them have the level of depth you’ll get with Spark GraphX in Action.
This book teaches you all the fundamentals of GraphX and data visualization from scratch. You’ll learn how to write algorithms for data sorting and how to work with machine learning for visualizing your projects.
By the end you’ll have a much deeper understanding of Spark data and the GraphX library. This is one of the newest books in this list and it’s absolutely perfect for anyone that wants to master GraphX for big data.
It’s hard to say if anyone can ever truly master a framework. But with books like Mastering Apache Spark you can get pretty damn close.
A good portion of this book looks into 3rd party extensions for building on top of the Spark foundation. You’ll learn how to merge Spark apps with Cassandra, HBase, and paid services like Databricks.
The author Mike Frampton teaches by example with each lesson modeling a new topic. This can include data storage, advanced clustering, and even cloud computing with AWS.
I can only recommend this book to someone who really loves sysadmin or DevOps work. It gets into a lot of detail for many 3rd party libraries and you should already be very comfortable with Spark before digging into this title.
Looking for simple recipes for analyzing data trends in Spark? Want to solve common problems with Spark SQL, GraphX, and related libraries?
Then you’ll want a copy of the Spark Cookbook written by Rishi Yadav. The book does teach little exercises and applications, but it’s mostly a resource guide for solving common problems.
You get over 60 unique recipes for working in the Spark shell, working in AWS, querying through Spark SQL, and optimizing Spark’s performance.
Cookbooks can be a lot of fun but they’re best used in situations where you need to solve common problems frequently. This cookbook covers a lot of ground so it’s great for anyone from beginner to advanced.
And it’s one of the newer books in this post so the source code is reliable and up to date.
Big data is an emerging field where true experts are needed badly. Apache Spark is just one framework and while it’s a powerful one, it’s also difficult to just get started with no background.
But I’m confident the books in this post can guide you from a complete novice to an expert if you put in the work to build some Spark projects. This framework may not click in a day or even a month. But given enough time you can reach a professional level whether you’re an engineer, database admin, systems administrator, data scientist, or just a student of big data.