As of February 6th, 2018, Spark: The Definitive Guide has gone to print. This was the most intensive project and process that I've ever undertaken in my life. It was filled with frustrations and anticipations, excitements and fears. I must extend thanks to those who encouraged me to lead the writing of the book, namely Ion Stoica, Patrick Wendell, Ali Ghodsi, and (somewhat obviously) Matei Zaharia. These folks were the ones who recommended that I take the lead on the book, and I am forever grateful to them for granting me such an opportunity.
In particular, I would like to thank my co-author, Matei Zaharia, for his help and graciousness throughout the writing process.
How it got done
Writing the book was simple. Simple is not easy. You have to churn out a lot of words and then refine them for the audience. It's really that simple.
Matei and I had a core thesis from the beginning: roughly 80% of the value should be in the first ~20% of the book, or, in other terms, 80% of readers will only read the first 80 pages of the book. The reason is straightforward: the vast majority of the audience will read through the beginning and then cherry-pick relevant sections after that. Importantly, we were 100% OK with that!
For this reason, the first part of the book (approximately 80 pages) is a narrative explaining Spark's background and fundamental concepts. We focused on the simplest explanations, eschewing technical jargon in order to make it approachable to the largest possible audience.
The first two months of the "writing" process were focused on organizational concepts - how could we ensure that we would cover all the topics we needed to in a coherent manner? It became readily apparent that length was going to be an issue. Spark is an immensely capable system that takes the "standard library" approach toward big data. For this reason, it has a massive set of capabilities - capabilities that we continue to improve upon at Databricks with the Databricks Runtime.
After laying out the general organizational structure of introduction, Spark SQL, streaming, machine learning, and then the ecosystem, we were ready to get started.
Part 1: The First Draft
Instead of trying to get every chapter perfect (as some of the early readers may have noticed), we took a more turbulent approach. From November 2016 to mid-2017, I wrote the first draft of a chapter each week. A chapter a week, good or bad, was my commitment. Naturally, some chapters ended up quite bad at the outset.
One point worth noting is that we wrote the entire book inside Databricks notebooks because they made it easy to write in various languages and test the code as I wrote it - all in one narrative notebook.
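To give a sense of that workflow, here is an illustrative sketch (not an excerpt from the book) of the kind of snippet that could be written and verified directly in a notebook cell, assuming the pre-created `spark` SparkSession that Databricks notebooks provide:

    # Hypothetical example cell: build a small DataFrame and check the
    # output right in the notebook while drafting the surrounding prose.
    # (Assumes `spark`, the SparkSession Databricks notebooks create for you.)
    df = spark.range(500).withColumnRenamed("id", "number")
    df.selectExpr("number % 2 AS parity").groupBy("parity").count().show()

Keeping the prose and runnable code side by side made it much easier to catch examples that had drifted out of date.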
After writing several chapters, Matei and I would read through them and write up our edits in order to make sure that we were aligned on tone, content, and direction. This process was excellent. The raw first drafts helped sketch some lines out and make sure that we were on the same page about what we wanted to achieve. This was the "dry run" of reviewing before we were ready to send everything to the reviewers.
The ideal would have been to write a full draft of the book and work from there; however, batching was necessary so that our excellent reviewers could make progress reading and editing the book themselves.
Part 2: Reviewing
After getting feedback from our reviewers, it came time to integrate that feedback. This was a labor-intensive process that required me to diff the various comments from our reviewers. Making one change meant that other proposed changes might be affected as well, making it hard to keep track of what was and was not fixed. This part took longer than I had hoped because the feedback was good and some rather obvious issues had to be resolved.
Upon completing the review edits as well as our own changes, we realized that the book's length issues had gotten worse. We were at 650+ pages (on what was expected to be a 450-page book).
Part 3: Editing & Quality Control
By the time quality control came around, we had been working on the book for over a year. By then my "light PM workload" had fully vaporized, and I was in the throes of working on Databricks Delta and another two teams full time. Matei had returned to teaching as the Stanford year picked up, so we were both short on time.
This part of the book was probably the most difficult. It was extremely arduous and painful while still being inspiring and rewarding. It felt like a battle of inches, where the finest details have to be ironed out in a coherent and consistent way. Our production editor, Justin Billing, did an incredible job helping us along this path. With time, we hacked the length down, removing detail where we could, and focused on keeping the book extremely high quality and in depth. This was terribly challenging, as there is just so much to cover - especially for a definitive guide.
After much time, we finally landed at 600 pages, and after two quality control periods, we completed the entire book on February 6th, 2018.
What it means now that it's done
This book was an unbelievable effort, not just because writing a book is hard but because Spark is an amazing tool. It's used everywhere by so many companies out there. If you're doing big data, you're probably doing Spark. Those are facts. What we are most proud of is that this will help the community grow and enable new people to learn about and use the toolkit to solve the problems they would like to solve.
The fact that thousands of eyeballs will read those words is incredibly rewarding in and of itself. The fact that people will use it to do their jobs day in and day out makes it even more so.
Matei and I sincerely hope that you enjoy the book. Thanks so much for reading!