“If we were to make a mistake it could cost above $2,000 a second. So no pressure,” explained Niall Richard Murphy, head of the ads reliability engineering team at Google in Ireland and the co-author and editor of a definitive book on site reliability engineering.
One of the little-known facts about Google’s 5,000-person-strong operations in Dublin is that it has been the lynchpin of the company’s global data and connectivity strategy since 2003. Under the leadership of Terence McGoff, some 400 engineers have set a global benchmark for data centre design, software-defined networks and site reliability engineering.
Google is building a second $150m data centre in west Dublin, alongside an earlier data centre it built in 2012.
Because Google is a company that was born on the internet in 1998, it had to learn and apply engineering lessons and methods from a standing start, often as its user base grew exponentially and it acquired and integrated companies like YouTube.
In the years since, many of those lessons have become codified under the term site reliability engineering (SRE), a methodology applying the principles of computer science and engineering to the design and development of computing systems, particularly large, distributed ones.
Now those principles have been published in a new book from O’Reilly Media called Site Reliability Engineering, involving contributions from engineers in Ireland and Silicon Valley.
The book should be a bible for fast-scaling internet businesses, showing how to manage engineering and deploy technology while rolling out new products on a constant basis.
Niall Richard Murphy was one of the co-authors and editors of the book.
Would you agree that the role of Google’s Irish engineering operations is the untold story of the company’s history here?
I think there are about 400 people in engineering now. You are right to observe that, in general, it is the untold story. We have pretty difficult missions that keep us absorbed a lot of the time, so we’re not into publicity. As a bunch of introverts, we are not greatly inclined to manage the public message, but the book provides an opportunity to talk about a bunch of things that we are doing here. Beyond the day job, my current role is to manage the ads reliability engineering team here, which is going to be 80 people by June; it is currently around 70 people.
My teams look after, in all senses of the word – engineering and operations – the systems that underpin over 90pc of Google’s revenues: all the AdSense and AdWords platforms.
I try not to think about the fact that if you divide our public revenue number by the number of seconds in a quarter, a mistake could cost above $2,000 a second. So, no pressure!
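The arithmetic behind that figure is simple enough to sketch; the quarterly revenue number below is an illustrative assumption, not a reported figure:

```python
# Rough sketch of the "$2,000 a second" arithmetic: public quarterly
# revenue divided by the number of seconds in a quarter.
# The revenue figure is assumed for illustration only.
SECONDS_PER_QUARTER = 91 * 24 * 60 * 60  # roughly 91 days in a quarter

quarterly_revenue_usd = 18_000_000_000  # assumed, for illustration

revenue_per_second = quarterly_revenue_usd / SECONDS_PER_QUARTER
print(f"${revenue_per_second:,.0f} per second")  # roughly $2,289 per second
```

Any sustained outage in the ads serving path is therefore priced by the second, which is the point of the quote.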
We look after all of that infrastructure, including the YouTube pre-rolls, and the engineering involved in that alone is considerable.
So, the Irish engineering operations are integral to the global Google operation?
Absolutely. For example, we are engaged in internal disaster recovery and we run the company from non-Mountain View offices for a stretch of days at a time.
We are perfectly capable in Ireland of running the systems and extending and improving them, so it’s not just an operational component for a multi-billion-dollar business.
Speaking as a person who grew up in the Ireland of the 1970s and 1980s, it is just a fantastic achievement to run something of this scale on an international basis from these shores.
When Google came to Ireland in 2003 it was still a start-up. What was it like dealing with engineering challenges at scale for the first time?
I have been involved with ads for the last five years, but was involved with other teams long before then.
We have storage and we have a bunch of other functions, including the search engine, here. We had to build a lot of things from scratch – not just physical things like data centres and fibre, but we had to actually arrive at our best practices, and at the processes and organisational set-up that are tried and tested by now.
That’s what’s going on in the book: we are taking our conventions about what’s best to do and why, and codifying them.
We hope that the book can go out and be helpful to other organisations and hopefully help in their reliability as well.
As the data revolution accelerates there appears to be an openness among the biggest operators of data centres to work together on issues like virtualisation, commoditisation and software-defined networks. Do you see SRE entering this discussion?
We invented the term: Ben Treynor Sloss came up with it when he started the team in mid-2003. We were using it internally for a while until we started publishing public job postings about it, and then other companies started doing it as well.
It is close to the term DevOps. There are still a lot of benefits you get from the traditional separation of engineering and operations; the wisdom of operations informs better engineering on every level.
I have difficulties with the DevOps model, because the community has refused to define it adequately. Specifically, the differences between DevOps and SRE in production engineering are not at all clear, which is pretty difficult from, at the very least, a recruitment perspective.
Just because someone says they were in production engineering or SRE in Apple for a few years, for example, we don’t necessarily know what that means and part of what we are trying to do with the book is draw our line in the sand about what SRE means.
At least the industry will know what it is coalescing around. Being much more public about what we are doing will also help in conversations around cloud and how commoditisation is happening in the industry. It makes a lot more sense for us to build the perception of reliability around our cloud services when people know how SRE works inside the company.
In a practical sense, what are the basic tenets of SRE?
I think organisationally, today, we have a small number of fixed responsibilities and how these responsibilities are implemented can vary from team to team. But we obviously have responsibility for reliability of the product.
One of the key takeaways in the book is an approach called ‘budget-based risk management’, because when you have a product that is being used by a large number of people, you have significant infrastructure associated with it.
You very often are able to make some kind of choice about how fast you push that product forward and that choice often has a reliability-like downside to it.
In other words, I can release a lot of new stuff but it is probably going to break because it hasn’t been tested. New stuff often correlates with breakages in various ways.
One of the key realisations we have in SRE at Google is that the team as a whole that owns the product – the SRE team and the product development team – gets a budget: you can set the availability of the product to be four or five nines.
And then if you change enough about the product that it breaks too much, your budget is gone and you can’t launch any more features, because that would be actively harmful to users’ expectations of how the product will behave when they use it.
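The error-budget idea can be sketched in a few lines: an availability target implies an allowed amount of downtime per window, and launches are frozen once that budget is spent. The targets and the 30-day window below are illustrative assumptions, not actual Google figures:

```python
# Minimal sketch of error-budget-based risk management: an availability
# target implies a budget of allowed downtime per window. Release freely
# while the budget lasts; freeze launches once it is spent.
# The targets and the 30-day window are illustrative assumptions.

def downtime_budget_seconds(availability: float, window_days: int = 30) -> float:
    """Allowed unavailability, in seconds, for a given availability target."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - availability) * window_seconds

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    budget = downtime_budget_seconds(target)
    print(f"{label}: {budget / 60:.1f} minutes of downtime per 30 days")

def can_launch(downtime_so_far_s: float, availability: float) -> bool:
    """A launch is allowed only while some error budget remains."""
    return downtime_so_far_s < downtime_budget_seconds(availability)
```

At four nines, for example, the whole team shares roughly four minutes of downtime a month; every risky launch spends from that same pot.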
Really it is about codifying what it means to make changes safely. SRE is a tool for helping you make changes safely and increase product velocity, because if you go too fast and release a load of random stuff quickly, things break, users go away and very often they never return.
Or, if they do return, they will do so more reluctantly and use your product less often than they otherwise would.
SRE is a big hammer for fixing that problem.
The first mention of SRE was among the original production team in Google Mountain View in 2003/2004.
Dublin was the first SRE site outside of California; that was in 2005. It wasn’t invented here, but we were among the first practitioners.