Walls in Data Teams: Dysfunctions and Unintended Consequences
2021 Feb 01
This essay does not set out to argue that data teams should be more end-to-end, or that all developers need a basic understanding of how to work across the numerous parts of a Machine Learning service; those arguments have already been made by Eric Colson and Eugene Yan in their respective essays, “Beware the data science pin factory: The power of the full-stack data science generalist and the perils of division of labor through function” and “Unpopular Opinion: Data Scientists Should Be More End-to-End”.
Instead, I will describe, anecdotally (and within certain limits of privacy and timeline), some events I witnessed, both as a member and as a client of data teams with hyper-specialized roles, along with some of their unintended consequences.
Although this essay speaks specifically about a data product team, the same can be applied to Analytics, BI, Data Intelligence, Delivery Science, and other types of data analysis teams.
The Origin of the Wall
At a certain point in my career, I was part of a Data team that had the following composition: one Machine Learning Engineer, one Data Scientist, and one Software Engineer.
The team had three very peculiar characteristics that served as contributing factors to the cases presented here:
- (1) The team had no technical leader, because the organization believed in a Holacracy model;
- (2) Due to (1), the product team established an informal authority, with total precedence over the engineering and analytics side, to the point of dictating which parts of the code would or would not go into production; and
- (3) Due to (2), the team did not work towards solving problems for a service that served an end consumer; instead, each engineer received tasks exactly according to the job title on their HR record.
Clarifying the last point:
- The Data Scientist would do nothing more than take data analysis or prototyping tickets, even though they had a professional background in software engineering and API implementation;
- The Machine Learning Engineer could only translate R code and wrap everything in a Falcon API using Python (a minimal sketch of such an endpoint follows this list), even though they had experience with statistical modeling; and finally,
- The Software Engineer was expressly discouraged from touching the Python code of the Falcon implementations, even for performance improvements, despite vast backend experience.
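For readers who have never seen one, here is a minimal sketch of what such a Falcon prediction endpoint typically looks like; the model path, route, and payload shape are illustrative assumptions, not the team's actual code:

```python
# A minimal Falcon (3+) prediction endpoint. The model path, route, and
# payload shape are illustrative assumptions, not the team's real code.
import pickle

import falcon


class PredictResource:
    def __init__(self, model):
        self.model = model

    def on_post(self, req, resp):
        payload = req.get_media()          # parsed JSON body
        features = [payload["features"]]   # a single row of features
        prediction = self.model.predict(features)
        resp.media = {"prediction": prediction.tolist()}


with open("model.pkl", "rb") as f:         # hypothetical serialized model
    model = pickle.load(f)

app = falcon.App()
app.add_route("/predict", PredictResource(model))
# Serve with any WSGI server, e.g.: gunicorn my_service:app
```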
In short, the team was doing knowledge work but operating under a Taylorist paradigm of workflow. Initially, we were all satisfied with having a specific focus on our own disciplines; after all, we wouldn't be distracted by context switching or pulled into matters deemed "non-relevant" to our own specialty.
The Beautiful Side of the Wall
In the beginning, there was a perception of increased team productivity, as each individual had a well-defined backlog of tasks. Since each professional had their sprint tasks previously established within a personalized Jira board, they didn’t have to worry about absolutely anything beyond their sphere of skills.
Explicitly:
- The Data Scientist wouldn't have to deploy models or push updated Docker images to ECR;
- The ML Engineer wouldn't have to read and understand endless lines of Scala + Ruby code to embed the model into the platform, or worry about integration with other client services; and finally,
- The Software Engineer wouldn't need to spend hours debugging serialized models whenever response times rose above 500ms per request; they could simply wait for the problem to be solved by the person with that specific competence on their badge.
Initially, team productivity was at stratospheric levels. Ticket after ticket was closed in Jira with impressive speed, as the degree of uncertainty about the knowledge needed to deliver each task was much lower.
Data scientists were training models with state-of-the-art libraries, engineers were using AWS services launched weeks earlier at AWS re:Invent, and of course, product teams were extremely happy with things coming out as expected.
But to paraphrase Shakespeare's Hamlet, "something was rotten in the state of Denmark" in the way things were being built, and especially in what happened whenever something needed maintenance…
The Reality of Walls in Data Teams
As the pillar of contemporary philosophy Mike Tyson would say, "Everyone has a plan until they get punched in the face." And our plan was simple: since everyone was doing very specific work, we would get more optimized solutions from the start of the development cycle, as each person would work strictly within their own discipline, free of distractions and other uncertainties.
To top it off, our plan assumed that knowledge transfer to the next professional in the work sequence would be simple and easy.
In other words, the equation left out issues such as uncertainty about the receiver's prior knowledge, the learning time required when there were knowledge gaps, clarifications, and other points of friction during that handover.
And our punch in the face started getting closer as we began putting numerous Machine Learning models/services/products into production.
For each of these models, we had to deal with aspects such as: monitoring of requests and outputs, concept/data drift, increasingly complex pipelines (including cases of real-time updates of training data), corner cases showing up, model retraining orchestration, security, logging, business rules embedded in the APIs, and of course, bugs.
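To make the drift item concrete: one common data-drift check is the Population Stability Index (PSI) between a feature's training distribution and its live distribution. A minimal sketch follows; the 10-bin setup and the thresholds are conventional choices, not something from our stack:

```python
# A sketch of a Population Stability Index (PSI) check for data drift.
# Bin count and thresholds are conventional choices, not from our stack.
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference sample (training) and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch tails of the live data
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


# Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate.
```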
When the Machine Learning models/services/products went into production to help our clients, feelings of satisfaction, pride, and relief flourished as the first recommendations/predictions/classifications/inferences reached our clients; but after the ecstasy of the first five minutes, everyone was thinking only one thing:
“If this service stops and so-and-so isn’t here, how are we going to fix it?”
…
Often, luck and the robustness of most current technologies hide numerous latent problems without anyone noticing. This gives a false sense that we are sagacious planners, and in some cases it ends in excess optimism on the part of some stakeholders. The reasoning was: since no team member had ever been absent during a production problem, it wasn't a problem.
But in reality, with numerous moving parts to orchestrate, monitor, and maintain, and with a dysfunction established by design in which collaboration was considered transgression, it was more than obvious that reality's inevitable punch in the face wouldn't arrive on horseback: this punch would come mounted on a 1300cc Hayabusa at 300 km/h.
Here are two examples of how this artificial wall directly harmed the organization.
The first case involved a recommendation service. After a broad debate about the observability instrumentation of the ML service, the person responsible for implementing it arbitrarily decided that the recommendation platform would have no tracing, logging, heartbeat, or monitoring.
The implementation nearly caused a civil war among team members, but because of the wall within the team (only the domain specialist could actually do the task), it was decided that no one would spend eight more hours adding observability to the service, even though one team member already had 95% of the instrumentation code ready.
It turned out that, due to an error in the scheduler service that orchestrated the update of the data feeding the Machine Learning service, the recommendation lists sent to clients were not being updated.
The production service had no alerts or monitoring on either the scheduler machine or the data update orchestration system. There were not even mechanisms for data consistency and update checks. The observability tools were available, and some of the SREs had offered to help with integration, but due to the previously mentioned precedence issues and the prohibition of collaboration, the data update problem went on silently for weeks.
Result: All clients received the same items, 99% of items expired after 72 hours, and of course, all revenue from recommendations dropped to zero.
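For context, the kind of freshness check that was missing can be very small. Here is a sketch, assuming the serving data lands in a table with an updated_at column reachable through a DB-API connection; the table name, SLA, and alerting behavior are all illustrative:

```python
# A sketch of the freshness check that was missing. Table name, SLA, and
# alerting behavior are illustrative; conn is any DB-API connection.
import datetime
import logging

FRESHNESS_SLA = datetime.timedelta(hours=6)  # assumed SLA, not the real one


def recommendation_data_is_fresh(conn) -> bool:
    """Check that the serving data was refreshed within the SLA."""
    row = conn.execute(
        "SELECT MAX(updated_at) FROM recommendation_items"  # hypothetical table
    ).fetchone()
    last_update = row[0]  # assumed to be a naive UTC datetime
    if last_update is None:
        logging.error("No recommendation data found at all")
        return False
    age = datetime.datetime.utcnow() - last_update
    if age > FRESHNESS_SLA:
        # In production this should page someone, not just write a log line.
        logging.error("Recommendation data is stale; last update: %s", last_update)
        return False
    return True
```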
In the second case, the core platform started seeing average response times of 11 seconds in a prediction module, which drew many complaints from clients. A ticket was opened for the ML Engineer stating only that "the API was slow."
The investigation started in the API serving the model, from which the core platform consumed the predictions.
After some tests, we saw that the serialized model returned predictions in under 40ms, and the Falcon API running the prediction service took just over 47ms end-to-end. Reassuring for our own code, but the remaining ~10.9 seconds still had to be found somewhere.
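The test itself was nothing exotic; it was something along these lines, measuring raw model inference separately from the end-to-end HTTP call (the URL and payload are illustrative):

```python
# A sketch of the latency triage described above: time the raw model and
# the end-to-end HTTP call separately. URL and payload are illustrative.
import time

import requests


def time_model(model, features, runs: int = 100) -> float:
    """Average raw inference latency, in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(features)
    return (time.perf_counter() - start) / runs * 1000


def time_endpoint(url: str, payload: dict, runs: int = 20) -> float:
    """Average end-to-end HTTP latency, in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        requests.post(url, json=payload, timeout=30)
    return (time.perf_counter() - start) / runs * 1000


# If time_model() is ~40ms but time_endpoint() is in the seconds, the
# bottleneck is outside the model: middleware, I/O, or an upstream call.
```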
Since the team had no access to other developers' personal backlogs, nor to the codebase to see what had changed, it took us a long time to discover that the slowness had started after a deployment made by a developer on a "parallel task" at the request of a stakeholder.
There were only a few details preventing the solution of the problem: (i) the developer responsible went on unpaid leave the next day and (ii) no one besides this soul knew where the codebase for the module in question was.
To give you an idea of the degree of hyper-specialization and the absence of cross-functionality, the other Scala + Ruby developers in the company couldn’t even create the same environment setup that this developer used to put that code into production.
And now comes the worst part: one of the team members had asked, about a month earlier, for access to the repository and information about the setup so he could read the codebase and approve PRs (given his experience with Ruby); but because of this artificial wall, the initiative was shut down. Result: we had to live with the problem for a long period, until this developer returned from leave and fixed it.
Needless to say, this whole situation caused an enormous amount of stress and frustration.
…
Socio-technical systems are highly dynamic and involve numerous variables such as people, material resources, environment, pressures, psychological factors, and so on. It would be very pedantic of me to prescribe a list of X things every data team should do to avoid problems like walls in data teams without the proper context.
However, within what I experienced, I can speak a little about some of the unintended consequences when collaboration between members of the same team is interpreted as domain transgression.
Some of the Unintended Consequences of These Walls
Eugene Yan made an interesting argument about the advantages of cross-functionality in data teams and why these teams manage to deliver more.
Thus, instead of talking about the good side of not having walls in data teams, I will list some of the consequences I experienced firsthand, and I leave the conclusions to each of the 3 readers who managed to reach this part of the post:
- "He who has one has none": One of the things my friends Daniel and Eiti always told me is that {…} he who has one has none, he who has two has one, he who has three has only two {…}. This is not a matter of redundancy or headcount, but of understanding how fragile services become when team members lack even minimal knowledge and context about what was implemented, and especially about how to act in the event of a problem or unavailability. The point is that, in the absence of knowledge sharing, when a team member is absent, the platform/product/service will stop. Simple as that. Two outcomes follow: (i) it stays unavailable and life goes on, or (ii) the time to resolve the problem will be high;
- Hot potato as a knowledge transfer methodology: The new normal was that what should have been a knowledge-sharing activity between two team members (e.g., DS and MLE, or MLE and SE) became one developer passing a hot potato to the other. Since the work was highly specialized and the gap between sender and receiver was large, two patterns emerged: (i) for each solution, the sender practically had to deliver a MOOC to explain the implementation logic (e.g., explaining learning rate updates in a CNN to improve convergence), and (ii) when the sender knew the receiver would lose the thread after 5 minutes, both agreed it was best to save time, put it into production, and close the ticket in Jira. Since each task was individualized and there was no possibility of collaborating to give or receive help, the most important thing was to close one's own ticket and pass the task along rather than be seen as the bottleneck in the entire process;
- Technical Latifundia: With a wall preventing collaboration on what would be done, what I crudely call technical latifundia emerged within the team: large technology silos owned by one person or a small group, estates that generally have low productivity. And there, everything happens: code review among buddies, direct commits to main/master, people refusing to remove AWS credentials from the code, deployments to production with a pile of ghost environment variables, overestimations, etc.;
- Review Theater: Because of hyper-specialization, it reached the point where no one had the slightest idea of what was actually running on the production platform. With everyone out of the big picture and unaware of what the person at the next workstation was doing, every time a PR came up for review, no matter how technical and well-intentioned the reviewer was, they had no idea what the code did, whether the tests were in the right places, or whether the logic made sense within the requirements and current architecture of the platform. The reviewer often stopped being the person who helps minimize errors and bugs and improve code performance and quality, and turned into merely the last obstacle between someone's work and the "resolved" status in Jira;
- The eternal game of telephone and communication overhead: Since people were isolated, communication had to be done individually for each team member, with distinct expressions, information, and contexts for each receiver. Due to this decentralized diffusion of information, interpretation problems were recurrent. Another obstacle was that technical communication was often performed by people without a technical background. This caused two problems: (i) the time spent in meetings on technical communication was often longer than the implementation itself, and (ii) due to a lack of perception of criticality, much of the technical communication that demanded a high degree of precision was lost, either through premature abstraction by the communicator or through simple forgetfulness;
- Increase in lead time for solutions: With each professional depending on the previous link in the workflow, everything became very slow. Hyper-specialization made knowledge transfer difficult: instead of a systematic investment in educating people so that knowledge was common a priori, what happened recurrently was a hot, ad hoc transfer. Fundamental aspects that team members could have learned through training were "learned" at implementation time. Moreover, due to hyper-specialization, not only the developers but also other stakeholders often grew impatient with performing this knowledge transfer, and tasks ended up piling onto a single person. It was common for one developer to be handling 4 critical problems while the other two were, respectively, doing PoCs and increasing the size of a text box in non-critical software for an internal stakeholder;
- Creation of silent dissidents within the team and the normalization of deviance: The machine learning engineer needed to understand the backend to decide between NiFi and Flink as a streaming data processing solution, but couldn't, because that was the integration engineer's task. The data scientist wanted to join experimentation meetings to help shape strategies for A/B tests or Multi-Armed Bandits, but was blocked, because that was exclusive to the product management team. What in its idealization was a simple and harmless division of tasks, in practice fostered an environment of convoluted solutions arrived at without dialogue, collaboration, or joint construction. People, even with better ideas, arguments, and data, simply didn't feel safe expressing opinions, given that in the end a single dominant voice, or groupthink, was responsible for all decisions regardless of the quality of the information. And that was where the small normalized deviances began, which months later turned into outages; but that is a story for another time;
- And in the end, people leave: Software development has some unique characteristics as a profession: (i) it is a practice-based profession that only improves, or is even maintained, with constant exercise; (ii) it involves applied knowledge and creativity (i.e., the continuous exercise of cognitive skills for learning, retaining, and refining knowledge); and (iii) it has extremely short update cycles (i.e., a framework from two years ago can be obsolete today). Professionals in the information economy (specifically in software development) know that if they land in an environment that doesn't minimally offer these three aspects (practice, applied knowledge, and an update cycle that keeps up with the market), they will fall behind very quickly. And when people feel this way, they simply leave. The rationale: a little more money in exchange for comfort, complacency, and an absence of learning in the present can mean future salary stagnation, plain unemployment, or a much slower and more difficult return to the market.
FINAL CONSIDERATIONS
My goal with this anecdotal evidence was to comment on a side of walls in Data/Machine Learning/Data Science/BI/Analytics teams that is not always talked about. Of course, managerial precedence and the adoption of a holacracy model played an important part in what was presented.
The creation of these artificial silos, besides placing a fictitious ceiling on team productivity, also prevents professionals from growing in numerous spheres (e.g., career, corporate, financial [N1]). Furthermore, it is not by chance that a considerable portion of Machine Learning and Data Science projects fail miserably, in part because of this lack of expertise.
The main point is that in the knowledge economy, most people will be inclined to go after opportunities that not only exercise their skills but also offer problems that help them expand and/or refine that knowledge through real-world experience.
The knowledge worker knows that their greatest assets are (i) the capacity for learning, (ii) the refinement and/or expansion of knowledge that solves real-world problems and is coveted by the market, and (iii) the capacity to recognize that the degradation of these skills can put them out of the market in an increasingly short period of time.
REFERENCES
Beware the data science pin factory: The power of the full-stack data science generalist and the perils of division of labor through function — https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/
Unpopular Opinion — Data Scientists Should Be More End-to-End — https://eugeneyan.com/writing/end-to-end-data-science/
Collective Ownership — http://www.extremeprogramming.org/rules/collective.html
WikiWikiWeb — Thrown Over The Wall
NOTES
[N1] — The pairing of qualified professionals and higher salaries is debatable, but I personally believe in the relationship, even if it is asymmetrical between the parties.