Software Development Today: SWAT Team, a Pattern for Overloaded, Multi-project Organizations

We have all seen it in our projects: some teams are overloaded and cannot deliver what we need, while others are less busy and could take on more work if needed.

This type of partial overloading is easily understood if we picture the workflow as a network of nodes. The throughput of the overall organization is limited by the nodes that are overloaded, even if most nodes are not.
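To make the network picture concrete, here is a minimal sketch in Python. The team names, capacities and demand figures are invented for illustration, and the model assumes the simplest case: a piece of work has to pass through every team in a chain, so the chain can deliver no faster than its slowest team.

```python
# Hypothetical teams: weekly capacity (work items a team can finish per week)
# and weekly demand (work items arriving per week). All numbers are invented.
teams = {
    "ui": {"capacity": 10, "demand": 6},
    "backend": {"capacity": 8, "demand": 8},
    "platform": {"capacity": 4, "demand": 7},
    "test": {"capacity": 12, "demand": 7},
}


def utilization(team):
    """Demand divided by capacity; a value above 1.0 means the team is overloaded."""
    return team["demand"] / team["capacity"]


def overloaded(teams):
    """Names of the teams whose demand exceeds their capacity, sorted by name."""
    return sorted(name for name, team in teams.items() if utilization(team) > 1.0)


def chain_throughput(teams, chain):
    """Throughput of work that must pass through every team in `chain`.

    Assumption: the chain is limited only by its slowest team.
    """
    return min(teams[name]["capacity"] for name in chain)


print(overloaded(teams))
print(chain_throughput(teams, ["ui", "backend", "platform", "test"]))
```

In this toy data only one team is overloaded, yet any work that needs all four teams is capped at that one team's capacity.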

The situation is even worse in multi-project organizations, where many projects compete for staff time. This competition leads the organization to optimize for staff loading and ultimately to overload itself, creating bottlenecks that block all work for one or more projects.

Why does this happen, and how can we tackle it?

Below is a short pattern description of one possible way to tackle this type of situation.

Problem and Contribution

As explained above, the problem symptoms are typically that some teams are overloaded and cannot take on more work. This work is essential to meet a particular deadline or to get a project started.

Other symptoms may include:

Line Managers request more resources on a regular basis.
Long meetings are held in which two or more people discuss the allocation of a single person at great length.
Projects are started officially but make no progress because ongoing projects have higher priority.

A typical result is that some project gets delayed, or a project that has been started gets starved for resources. The problem can generically be defined as a resource allocation problem. In computer science literature this is sometimes also called a scheduling problem.

In this pattern we outline a simple strategy to tackle this project starvation in an overloaded multi-project organization.

Context

This pattern is applicable to any organization that has several teams and/or multiple projects (typically a non-trivial number) and where the symptoms can be seen, but the root cause cannot easily be found or solved.

An example is a large software development organization where many teams contribute to multiple parallel projects. In these cases it is perhaps possible to find the root cause for the delays and starvation of competing projects, but it is either not practical (nobody has enough knowledge of the overall organization, or nobody has experience in managing large complex networks) or not economically feasible (the time and money it would take to find the root cause is better spent applying a simple pattern like this one).

Forces

In an organization, Line Managers (people who coordinate and prioritize work at a sub-level, typically a team or subset of teams) are motivated to show that their teams are busy. Given this constraint it is typically very difficult to assess the actual load of all the teams in an organization. Even if one line manager were honest and reported that their teams have some bandwidth, more work would quickly be assigned to those teams until they were overloaded.
Alternatively, other managers would complain that their teams are overloaded and demand that the available teams be assigned to them. This gives the "busy" line manager more power in the organization, which in turn reinforces the message that "it is not OK to say that your teams are not overloaded".
Teams themselves have no incentive to publicly state that they are not overloaded because that leads to job insecurity or scorn from surrounding teams (see also above).
In any mid- to large-size organization (more than a few teams) it is very difficult to identify unequivocally which teams are overloaded and which may have some bandwidth. This is because changes to the software may cause a previously free team to become busy, and assigning more work to that team may jeopardize the work needed when future changes are recognized.
The larger the organization, the more specialized the workers tend to be. In practice this makes it very difficult for teams to help each other by sharing a common work log (aka backlog), or for one team to help another that is working on a different component or sub-system.
Many organizations start multiple parallel projects. This typically overloads the organization even further and puts more pressure on the teams that are, in practice, bottlenecks for the flow of work. Multiple projects that depend on one such team therefore become "stuck" and their requests cannot be completed. This often leaves other teams stuck as well, because they depend on the bottleneck teams. In the end the organization may have multiple teams that cannot deliver their work because of a single bottleneck. This is a common feature of networked systems such as the Internet.
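The cascading effect described in the last force can be sketched with a small dependency graph. This is only an illustration: the team names and the `depends_on` map are hypothetical, and the model assumes a team is stuck as soon as anything it waits on is stuck.

```python
from collections import deque

# Hypothetical dependency map: each team lists the teams it waits on.
depends_on = {
    "app_a": ["middleware"],
    "app_b": ["middleware", "ui_toolkit"],
    "middleware": ["kernel"],
    "ui_toolkit": [],
    "kernel": [],
}


def blocked_by(bottleneck, depends_on):
    """Return every team that is stuck, directly or indirectly, behind `bottleneck`."""
    # Invert the map so we can walk from a team to the teams waiting on it.
    dependents = {team: [] for team in depends_on}
    for team, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(team)

    # Breadth-first walk outwards from the bottleneck.
    blocked = set()
    queue = deque([bottleneck])
    while queue:
        current = queue.popleft()
        for team in dependents.get(current, []):
            if team not in blocked:
                blocked.add(team)
                queue.append(team)
    return blocked


print(sorted(blocked_by("kernel", depends_on)))
```

With this invented data, a single overloaded team at the bottom of the graph stalls every other team except the one that does not depend on it.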

Solution

In this post I try to present a simple solution. The goal with this solution is not to optimize staff utilization, or to explore complex mathematical models. Rather this solution focuses on something that is easy to understand albeit more expensive than other possible competing solutions.

As explained above, mid- to large-size organizations tend to build complex networks of teams through which the work flows. These networks tend to be complex as well as unstable (i.e. the node links change frequently due to rerouting of different types of work).

A simple solution will have to work independently of the characteristics of that network, thereby being network agnostic.

Create a team of generalists, assign it to one of the ongoing projects, and have it report to the project management team rather than to the line managers. This way the project management team can send this team of generalists to help whichever teams are blocked on some piece of work. Additionally, this team will collect tacit knowledge about the prevailing bottleneck areas and will therefore also be useful in debugging and investigating specific problems in those areas later on.

This team must be staffed with people that can easily touch any component or subsystem and ideally these people will be very well networked in the organization in question, allowing them to quickly tap into more specialized knowledge when needed (say over the lunch break or on a quick IRC chat).

This team, which I'll call the SWAT team (for SoftWare Action Team), will work under the guidance of one person with Project Management responsibilities and will answer only to the project management team, not to the respective Line Managers. The reason for this is that the project management team is motivated to solve the specific problems that keep the project from progressing, and has access to the information about which areas, teams or sub-systems need more work to make that happen.

Resulting context

When the SWAT team is nominated it must be staffed with the best software specialists (architects, testers, designers, coders) in the organization. This means that other projects will lose some manpower, and potentially lose the very people who helped those projects progress faster. For this reason it is important to be flexible at first in the nomination of this team and to consider every individual assignment together with both the software development teams and the project management teams.

A possible consequence of the nomination of this team is that the team is kept together for a very long time, reducing the staffing levels of other software development teams or projects indefinitely. This is not a desirable situation. The SWAT team should be a temporary team, brought together only to help the previously starved project progress. Once that project rises in the priority list or is no longer starved, we must consider stopping the SWAT team's work and returning its members to their original teams.

Another risk that should be considered is the loss of specific knowledge on the part of the SWAT team members. This is particularly problematic if some of the individuals come from highly specialized teams working in a knowledge area that evolves fast. Therefore the SWAT team should be as short-lived as possible, but put together whenever needed.

When selecting the individuals for the SWAT team one must also consider the motivation of those individuals to participate in a generalist team. Some will undoubtedly be happy to do so, but others may not.

Queuing Theory and Throughput

The basic systemic impact of the SWAT team is that it transfers people from day-to-day work (which is typically optimized for staff utilization, i.e. keeping everybody busy) to sporadic, ad-hoc work that is assigned as needed. This effectively creates a staffing "buffer" that reduces the number of people who are constantly kept busy or at full utilization. We then use this SWAT team to tackle particular bottlenecks.

Finally, the SWAT team has very little impact on the organization's overall utilization (only a few individuals compared to many teams), but it creates a small surplus capacity that can be applied to any bottleneck to solve the problem of a starving project.
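Why a small surplus capacity matters so much can be illustrated with the simplest textbook queueing model. The post does not name a model, so treating a team as an M/M/1 queue (a single server with random arrivals and random service times) is an assumption made only for this sketch. In that model the mean time a work item spends waiting and being served is the service time divided by one minus the utilization.

```python
def mean_time_in_system(service_time, utilization):
    """Mean time an item spends in an M/M/1 queue (waiting plus service).

    Assumption: the team behaves like a single-server queue with random
    arrivals and random service times. `utilization` must be in [0, 1).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be at least 0 and below 1")
    return service_time / (1.0 - utilization)


# Show how lead time grows as a team gets closer to full utilization,
# measured in multiples of the time it takes to do one item of work.
for rho in (0.50, 0.80, 0.90, 0.95):
    print(f"utilization {rho:.0%}: {mean_time_in_system(1.0, rho):.1f}x service time")
```

Under these assumptions a team running at 80% utilization takes about 5 times the bare service time to deliver an item, while a team at 95% takes about 20 times as long. Moving a bottleneck team a few points away from full utilization therefore shortens its queue far more than the small loss in overall utilization would suggest.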

A Question for you

Have you seen this pattern applied in your organization? What were the results? Share those with us in the comments below.

Pattern format description

In "Name" I give a name to the pattern.
In "Problem" I describe the problem the team is facing in the form of a root cause and a set of symptoms that may be detected and lead to this problem.
In "Context" I explain where the pattern is applicable.
In "Forces" I try to describe the constraints or forces that the solution has to take into account.
In "Solution" I try to describe the instructions and practices to be used to solve the problem as described.
In "Example" I briefly explain the case of one team at work that has faced this problem and how they solved it.

Photo credit: Dunechaser @ flickr

Labels: coplien, network, organization, patterns, queue, reinertsen, Team

Friday, January 07, 2011
