I’ve got a bit of a problem in that I’ve spent most of my career working in the engineering space, but most of my thought capital goes to larger problems of organizational design, technical strategy, and laying down foundations today for problems we’re going to need to solve in a year or more. This frustrates my bosses to no end, who just want me to build a server or swap out a bad hard drive or any of a number of other mundane day-to-day sysadmin tasks. I’m left without much of an outlet for this stuff besides meetup groups and, when I find the time, blogging. Thanks for humoring me.
One of my frequent frustrations is we tend to carry too much legacy around in how we work, in how we organize. We do things all wrong because, well, that’s how we’ve always done it. But I’m thinking farther out, and I see many operations teams on a collision course with the hard limits of the human brain. To wit: the hierarchical limitations of Dunbar’s number and the human neocortex.
As the theory goes, we can only maintain about 150 human:human relationships before our brain starts demoting less important relationships out into the realm of mere acquaintances, or recognition, or worse yet… a person is completely forgotten. This all dates back to our hunter-gatherer days, when we moved past the more primitive great ape behavior of grooming one another and adopted verbal communication as the glue between us. The number varies from individual to individual, from a lower bound of about 100 to an upper bound of about 250.
But within that ~150 social group, we have layers. There may be about 5 people that we’re intimately familiar with… mother, father, closer siblings, a lover, a best friend. Further out we may have a relatively deep sense of kinship with about 30 or so people, our extended family, our tribe, or in the modern context, our department at work.
Our brains evolved to increase our ability to be social with one another using spoken language as our glue, and gave us enough capacity in the neocortex to maintain the bonds we form. But there are upper limits on the number of bonds we can form, with a relatively low limit on the closest bonds and a relatively higher limit on the looser bonds.
What the hell does this have to do with Service Oriented Architecture?
Mark Burgess has been pioneering the field of Promise Theory, and its practical application to human:machine and machine:machine relationships through the ongoing development of CFEngine. Most of you reading this have probably indirectly benefited from Burgess’ research by way of using configuration management tools like Puppet or Chef. In his book “In Search of Certainty”, Burgess explores how we have behaviorally taken advantage of the neocortex to build relationships with the machines that we rule over, and how our ability to rule over those machines is limited by the capacity of the neocortex to maintain those relationships.
In the bad old days of operations engineering, it was not uncommon to see a 1:20 or a 1:30 relationship between sysadmins and servers. I’ve even seen ratios as poor as 1:12, and even worse in Windows shops. In shops like this, our servers were special snowflakes, lovingly built by hand and given cute names. The upper boundary on how many machines we could handle was more of a capacity limit on the neocortex than any sort of time boundary.
If you were lucky enough to work in a LAMP shop back then, and you didn’t have much of a social life hogging up your more intimate Dunbar slots, you had the luxury of having deep, intimate knowledge of your full stack. It wasn’t very complicated. It was within the realm of reason to be a full stack ninja rock star (or whatever the recruiters are calling people like that these days).
When CFEngine came out and inspired the release of other automation tools like Puppet, Chef, etc., we moved to an Infrastructure as Code mentality. This didn’t eliminate the limitations of the neocortex, but it did add a layer of abstraction. Instead of being intimately familiar with individual machines, we now had to become intimately familiar with the roles defined in code. As many shops had fewer than 30 major server roles in production, our brains coped, and by all appearances it was Mission Accomplished! We licked that capacity problem, and now one engineer can run 10,000 servers. Indeed, through use of automation, I had at one point in my career been solely responsible for over 4,000 machines at once and still had time left over to help out the Windows guys.
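The mental shift is easy to sketch. Here’s a toy illustration in Python (not real Puppet, Chef, or CFEngine syntax; the role name and package list are made up): the engineer holds one role definition in their head, and the tooling mechanically applies it to however many hosts carry that role.

```python
# Hypothetical sketch of the "roles defined in code" idea. One role definition
# (a single Dunbar slot) drives the configuration of the entire fleet.

ROLES = {
    "webserver": {
        "packages": ["httpd", "php"],
        "services": ["httpd"],
    },
}

def configure(hostname: str, role: str) -> list[str]:
    """Return the configuration actions for a host, derived purely from its role."""
    spec = ROLES[role]
    actions = [f"install {pkg}" for pkg in spec["packages"]]
    actions += [f"enable {svc}" for svc in spec["services"]]
    return actions

# 4,000 machines, one role: you reason about the role, not the individual hosts.
fleet = [f"web{i:04d}" for i in range(4000)]
plans = {host: configure(host, "webserver") for host in fleet}
```

The point is that `plans` has 4,000 entries but only one definition behind them; intimate familiarity with the role is intimate familiarity with the whole fleet.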
But most of those 4,000 machines shared one role. I’d only consumed one Dunbar slot for over 99% of my domain of responsibility. Of course I could be intimately familiar with it all!
Service Oriented Architecture, combined with the rise of cloud, the maturation of configuration management tools, Agile methodologies, etc. represented an ideal confluence of new ways of doing things. We were able to go back to the old school UNIX best practice of making small tools that do specific things really well, and then gluing them together to solve bigger problems. We got better at decomposing big problems into small ones, and solving those small ones with discrete services. This didn’t just solve a lot of technical problems for us, but it also solved some organizational scaling problems. Now engineering teams could truly focus on smaller parts of the service stack, knowing that as long as the interfaces were stable and well-understood, they had a good bit of autonomy on everything that happens inside of their domain space.
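That decomposition pattern can be shown in miniature. The sketch below is deliberately toy-sized (word counting stands in for a real business problem): each function is a stand-in for a discrete service with one job and a stable interface, and the larger problem is solved by composing them.

```python
# Toy illustration of decomposing a big problem into small "services" glued
# together through stable interfaces (UNIX small-tools philosophy in miniature).

def tokenize(text: str) -> list[str]:
    """Service 1: normalize and split raw input."""
    return text.lower().split()

def count(tokens: list[str]) -> dict[str, int]:
    """Service 2: aggregate tokens into frequencies."""
    counts: dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

def top(counts: dict[str, int], n: int) -> list[str]:
    """Service 3: rank and truncate."""
    return sorted(counts, key=counts.get, reverse=True)[:n]

# The composition solves the bigger problem; as long as each interface stays
# stable, each team can rework its own internals with full autonomy.
result = top(count(tokenize("to be or not to be")), 2)
```

Swap any function’s internals and the pipeline still works, which is exactly the autonomy-behind-stable-interfaces argument above.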
But it hasn’t been so awesome on the operations side. In many shops, we’ve still got the age-old problem of development teams tossing things over the wall at operations. And as service offerings get more comfortable embracing SOA, the variety of services that operations engineers are responsible for is growing.
In some cases, growth will exceed the boundaries of a Dunbar layer.
While all of this is going on, the rise of the DevOps movement is placing greater emphasis on our human:human relationships, which is putting even greater strain on the limitations of human biology. The neocortex can only handle so much before somebody gets demoted.
So how do we get back to the operations engineer knowing all the things about all the things? We don’t. It’s a fallacy. You can go through the motions, but at the end of the day, the human mind can only have intimate knowledge of a finite number of entities. And remember, if you try to load them up with more machine contexts to be intimately familiar with, you’re asking them to drop a slot that would go to another human being.
We’ve seen some movement in the DevOps space towards shifting part of the operational burden to product development teams, and in some cases this works very well. And it makes sense: they are already intimately familiar with their code. Would building greater familiarity with operationalizing that code occupy another Dunbar slot? Or would it just add depth to the slot already being consumed by familiarity with the service?
In working this way, the remaining operations team is no longer burdened with intimate familiarity with the services running on their infrastructure. Instead they can focus on excellence in providing the Infrastructure as a Service. And if that is composed of only a few discrete systems, can it then occupy one of the closer orbit slots in the neocortex? This approach would marry better with the social creatures that we’ve evolved to become. We’d use our biological limitations as a strength rather than as a weakness.
Whether it was obvious to the author or not, such a shift happened in “Turn the Ship Around!” by L. David Marquet (an excellent book, by the way). The Santa Fe, a nuclear submarine in the US Navy, had a crew complement of 135. That fits comfortably into the Dunbar theory of social capacity. One of the things the captain changed, though, was moving intimate technical knowledge down the organizational stack, placing decisions in the hands of those closest to the impacted domain space. Marquet realized rather profound improvements in organizational performance and engagement, but I’m not sure he recognized that part of the reason for this success was restructuring responsibilities around smaller working groups that built deeper, more intimate relationships with their areas of responsibility (and removing himself from the decision space in the process).
Humans make pretty poor machine emulators, but we’ve got tens of thousands of years of experience at being primates. We ought to tap into what we’ve learned about hominid social structures to build more effective engineering organizations. SOA happens to offer up some convenient abstraction boundaries for partitioning domain knowledge and responsibilities.