Book Review: Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Big data is taking over the world in the form of enterprise applications and social networks. Apache Spark is a powerful open source engine for processing and analyzing that data.

The authors of Advanced Analytics with Spark put down their favorite case studies and patterns for sorting and handling tremendous amounts of data. These patterns get very complex, ranging from estimating financial risk to offering custom music suggestions with the Audioscrobbler data set.

The book was written by four data scientists from Cloudera who know a lot about large enterprise applications.

If you’re getting into big data or just have a curiosity to understand how it works then this book will offer the most practical tips you could ever ask for.

Book Contents

With 276 pages I was surprised at how much got crammed into this book. I wouldn’t say the writing style is terse, but it gets right to the point without wasting time.

This is not a beginner’s book, and you should really understand the basics of Spark and Scala for enterprise projects before reading. Big data is a complex topic, which explains why this book has such complex analytics patterns.

Here’s a breakdown of all 11 chapters:

  1. Analyzing Big Data
  2. Introduction to Data Analysis with Scala and Spark
  3. Recommending Music and the Audioscrobbler Data Set
  4. Predicting Forest Cover with Decision Trees
  5. Anomaly Detection in Network Traffic with K-means Clustering
  6. Understanding Wikipedia with Latent Semantic Analysis
  7. Analyzing Co-occurrence Networks with GraphX
  8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  9. Estimating Financial Risk through Monte Carlo Simulation
  10. Analyzing Genomics Data and the BDG Project
  11. Analyzing Neuroimaging Data with PySpark and Thunder

The first couple of chapters get you started with Spark and big data. You learn about enterprise datasets and how to manage data at this scale. I found these two chapters fascinating, and while they can appeal to beginners, they probably aren’t the best introductory material for learning Spark.

The other nine chapters each cover individual case studies with real data from real situations. Topics range from music recommendation to forest cover prediction and neuroimaging, with a unique story behind each one.

During each chapter you’ll get a recap of the project, what the data means, and some design patterns with source code that you can try yourself.
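To give a feel for the kind of pattern these case studies work through, here’s a minimal sketch of Monte Carlo risk estimation, the topic of chapter 9. This is my own toy illustration in plain Python, not code from the book and not Spark. The function name simulate_var, the assumption that returns follow a normal distribution, and the return and volatility numbers are all made up for the example.

```python
import random


def simulate_var(mean_return, volatility, n_trials=10_000, confidence=0.95, seed=42):
    """Estimate one-period Value at Risk (VaR) from simulated returns.

    Hypothetical helper: assumes returns are normally distributed,
    which is a simplification chosen only to keep the sketch short.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    # Draw one simulated portfolio return per trial and sort from worst to best.
    returns = sorted(rng.gauss(mean_return, volatility) for _ in range(n_trials))
    # VaR is the loss at the (1 - confidence) quantile of the simulated returns.
    cutoff = int((1 - confidence) * n_trials)
    return -returns[cutoff]


# Illustrative inputs only: 0.1% mean daily return, 2% daily volatility.
var_95 = simulate_var(mean_return=0.001, volatility=0.02)
print(f"Estimated 95% one-day VaR: {var_95:.4f}")
```

In a Spark setting the trials would presumably be spread across a cluster rather than run in a single loop, but the underlying idea of simulating many outcomes and reading a quantile off the results stays the same.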

All the code is free and open source on GitHub if you want to examine it closer.

Each chapter offers source code for Scala and Python along with some Java too. These languages are all supported by the Spark API, along with the R programming language (not covered in this book).

The ever-expanding field of data science is quickly becoming necessary for larger companies. In just 10-20 years data science could be a major market with countless opportunities. And this book really sells the idea that doing data science right requires the best patterns and the best lines of reasoning.

You get to learn far beyond generic algorithms for sorting data. You learn how to analyze data and what the results actually mean. These patterns can be reused over and over and even customized for your own projects.

It’s very hard to find data science books that cover this much information. And the case studies used in this book are simply phenomenal compared to what else is out there.

If you’re interested in this field of study I would absolutely recommend this book on Spark. But you need to make sure you have the fundamentals down first.

The book focuses heavily on Scala with a lot of Python too. It does offer Java but it doesn’t seem to be the focus, and there are no R code examples at all. If you just need to get going fast then try Scala for the Impatient. Or you can follow the tutorials on /r/LearnPython if you’re OK with online tutorials.

If you’re already a competent programmer with no experience using Spark then pick up a copy of Learning Spark first. This covers everything you need before going into Advanced Analytics with Spark.

Once you’re ready for this book I would highly recommend adding it to your bookshelf. It will push you to that next level of thinking that’s required for professional data science.

Pros & Cons

My favorite part of this book is the pragmatic nature of each case study. The datasets are pulled from real situations where you get to analyze real data with real algorithms.

This is a very rare opportunity to study big data with Spark from experts in the field.

I’ve tried searching for other related books and very few reach this level of detail. Yes, it is fairly advanced and you do need to know some programming (you’re best with Scala or Python). But it’s a one-of-a-kind book on the fascinating world of data science.

One downside of the structure is that you may end up skipping chapters or just not caring about parts of the content. Since each chapter presents a different case study, you’re entering a whole new conceptual model each time.

For the most part this is great, especially if you want to get into data science. I personally don’t love that area so I wasn’t as excited about every case study. I’m mostly interested in large platforms and the applications of these patterns for the web.

Granted, plenty of these examples caught my attention. But others, like predicting forest cover, really didn’t grab me.

However, if you wish to work in the field of data science you rarely get to pick your projects. You’re handed certain requirements and you have to analyze the data as needed. I think this book is a powerful example of these career requirements in action.

Who Is This For?

There’s no getting around the fact that Advanced Analytics with Spark is made for intermediate-to-advanced engineers. You need prior experience working with Spark and preferably Scala or Python.

This book is not for anyone who’s a complete beginner with Spark. If you get this book without any prior knowledge you will be confused and pretty pissed.

As an alternative for beginners I’d recommend Learning Spark if you’re passionate and willing to get started.

Data analytics can get tricky and when you throw in the Spark API it’s basically impossible without prior knowledge. Spark is a powerful platform but you need to be ready for it.

If you already have some fundamental knowledge of Spark and Scala/Python you should really enjoy this book.

The case studies are phenomenal if you can follow what they’re saying. And the authors have a way of blending their writing together so that the book feels like multiple books in one.

The design patterns for Spark are replicable and easy to clone for your own projects. If you can write great Scala code then you’ll get a lot from this book that you can apply to your own data analysis projects.

Final Summary

Anyone interested in big data, enterprise applications, or using Spark should definitely pick up a copy of this book.

Just be warned that it’s targeted towards intermediate-to-advanced users and you really need that prior knowledge. Advanced Analytics with Spark shares big data analysis techniques that you can duplicate and customize for your own needs. The case studies are all unique and incredibly powerful.

My only issue is that I wish some of the examples went into greater detail. I know this is asking for a lot so I really don’t want to knock down my rating much for this trivial complaint.

Bottom line, if you’re ready to delve into professional data science this book will rock your socks into next October.

Review Rating: 4.5/5


Alex is a fullstack developer with years of experience working in digital agencies and as a freelancer. He writes about educational resources and tools for programmers building the future of the web.