Feature flags and canary, dark, and A/B releases

Alec Lazarescu, May 18, 2014, Continuous Delivery

What are feature flags?

Feature flags are toggles in the code base that allow UI areas and/or backend functionality to be enabled or disabled via a configuration file or other configuration system. Users that have the feature disabled see no trace of it.
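As a minimal sketch of the idea, a flag check might look like the following in Python. The flag names (new_checkout, recommendations), the inline JSON standing in for a configuration file, and the helper functions are all hypothetical and only for illustration.

```python
import json

# Hypothetical flag configuration; in practice this would be loaded from a
# configuration file or another configuration system.
FLAG_CONFIG = json.loads('{"new_checkout": true, "recommendations": false}')


def is_enabled(flag_name, config=FLAG_CONFIG):
    """Return True only if the flag is explicitly switched on."""
    # Unknown flags default to off so unfinished features stay hidden.
    return bool(config.get(flag_name, False))


def render_cart_page(config=FLAG_CONFIG):
    """Build the list of page sections, leaving out disabled features."""
    sections = ["cart_items"]
    if is_enabled("recommendations", config):
        sections.append("recommendations")
    if is_enabled("new_checkout", config):
        sections.append("new_checkout_button")
    else:
        sections.append("legacy_checkout_button")
    return sections
```

Defaulting unknown flags to off is a design choice in this sketch: a feature that is missing from the configuration stays invisible rather than leaking out by accident.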

What can feature flags do for you?

Fast Feedback Loops

Fast feedback loops are one of the cornerstones of realizing the benefits of Agile development. Developers who work independently on isolated versions or feature branches of the code base for too long run the risk of diverging substantially and facing a large, risky merge at the end.

With feature flags, developers can all work in the same branch and merge more routinely, as long as they keep their feature’s flag disabled in check-ins until it is ready for testing. Merging once a day or even more often is not uncommon. This brings to the fore any files where multiple developers have been working and makes the mechanics of the merge simpler, since small changes that are still fresh in memory are easier to reconcile. It also keeps developers aware of each other’s work and may prompt useful conversations on code design in an area of joint interest. This is one of many practices that can take some getting used to, but doing it more often will rapidly build team alignment and understanding after the initial growing pains.

Experiments and A/B Testing

Experiments and A/B tests have become popular, especially in e-commerce, marketing, advertising, and some UX circles. Effective experiments require a good target metric and a historical record of its measurements. Your experiments will be attempting to influence this metric. A few examples:

sales $/time
ratio of shopping cart checkouts to abandonments
advertising click rates/time
ratio of registrations to visitors

As in the scientific tradition that experiments come from, a group of users is selected to take part in the test, whether it’s of a new code build, UI, or product offering, and has the feature flag for the experiment enabled. Non-participating users form the control group. Keep in mind strategies to minimize selection bias. Some types of changes may rate particularly well or poorly with power users, so it’s best to be aware of the characteristics of your selected user samples. You may even want to consider separate experiments by user archetype to avoid conflating too many variables.
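One common way to pick the test group without hand-selecting users is to hash each user ID into a bucket. The sketch below assumes string user IDs; the function names and the experiment name are made up for illustration. Hashing spreads users evenly, but it does not by itself account for differences such as power users versus casual users.

```python
import hashlib


def bucket(user_id, experiment, buckets=100):
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    # Salting with the experiment name keeps assignments independent
    # from one experiment to the next.
    key = f"{experiment}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % buckets


def in_treatment(user_id, experiment, percent):
    """True if the user should have the experiment's flag enabled."""
    return bucket(user_id, experiment) < percent


# Hypothetical usage: enable the flag for roughly 10% of users.
show_new_ui = in_treatment("user-42", "new_checkout_ui", 10)
```

Because the assignment depends only on the user ID and the experiment name, a user sees the same variant on every visit without any stored state.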

During the course of the experiment, the target metric for the test group can be compared to that of the control group to verify whether any statistically significant change has occurred.
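For a ratio metric such as registrations to visitors, one standard way to make this comparison is a two-proportion z-test. This is a sketch of that textbook test using only the standard library, not a method the techniques above prescribe, and the function name is an assumption.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if se == 0:
        # Every user or no user converted in both groups: nothing to detect.
        return 0.0, 1.0
    z = (p_treatment - p_control) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value (0.05 is a common, if arbitrary, threshold) suggests the difference is unlikely to be chance alone. The test assumes reasonably large groups and a sample size decided in advance.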

If the change is promising, it can be rolled out to all users. If not, it can be scrapped.

Canary Releases

Despite best intentions, and even with a serious automated testing suite, there’s always a chance a particular change may have corner case bugs, performance issues, or unintended consequences.

Given a well-instrumented system, it is beneficial to release initially to a very small subset of servers or users by enabling the feature flag for them and monitoring for any issues. For full value, it is imperative to monitor not only error rates, CPU, and the like, but also metrics that track normal usage levels, such as posts/minute, checkouts/minute, and registrations/minute. If for some reason your application is unusable yet fails silently, the drop in traffic compared to typical activity will be what gives this away.
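A canary check along these lines can be sketched as a comparison of the canary’s metrics against the baseline. The metric names, the dictionary shape, and the thresholds below are assumptions for illustration, and the activity figures are assumed to be normalized per server or per user so the two groups are comparable.

```python
def canary_is_healthy(baseline, canary, max_error_rate=0.01, min_activity_ratio=0.8):
    """Compare canary metrics against the baseline group's metrics."""
    # Loud failures: the canary is throwing too many errors.
    if canary["error_rate"] > max_error_rate:
        return False
    # Silent failures: no errors are raised, but users have stopped doing
    # what they normally do, so activity falls well below the baseline.
    for metric in ("checkouts_per_min", "registrations_per_min"):
        expected = baseline[metric]
        if expected > 0 and canary[metric] / expected < min_activity_ratio:
            return False
    return True


# Hypothetical readings: low error rate, but checkouts have collapsed.
baseline = {"error_rate": 0.002, "checkouts_per_min": 50, "registrations_per_min": 20}
canary = {"error_rate": 0.001, "checkouts_per_min": 5, "registrations_per_min": 19}
healthy = canary_is_healthy(baseline, canary)  # False: disable the flag
```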

Dark Launch

Though it has been around for a while, this style of release was publicized by Facebook. With this technique, enabling the feature flag has no impact on the UI, but behind the scenes additional work is being done, generally to provide real-world test data to a system. This could be querying a new data store or sending data through a new data channel, for example.
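A dark launch of a new data store could be sketched as below. The old_store and new_store callables and the function name are hypothetical; the point is that the user is always served from the existing path while the new path is exercised and its result only logged.

```python
import logging

logger = logging.getLogger("dark_launch")


def fetch_profile(user_id, old_store, new_store, dark_launch_enabled):
    """Serve from the old store; optionally exercise the new store in the dark."""
    result = old_store(user_id)
    if dark_launch_enabled:
        try:
            shadow = new_store(user_id)
            if shadow != result:
                logger.warning("dark launch mismatch for user %s", user_id)
        except Exception:
            # A failure in the new path must never reach the user.
            logger.exception("dark launch call failed for user %s", user_id)
    # The response is always the old store's, flag or no flag.
    return result
```

This sketch calls the new store synchronously for simplicity, which adds its latency to every request. A real system would more likely run the shadow call in the background.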

Load tests and staged data are very important first steps, but for crucial changes, the additional intermediate step of a dark launch can help root out further scalability bottlenecks or unexpected behaviors in production.

What to test?

Testing every permutation of feature flags is unnecessary. It’s important to test at a minimum:

every flag enabled
every flag expected to be on for the release deployment enabled

I would add a further case: if you are running experiments where different combinations of flags are enabled for production users, you can consider each of those a release deployment configuration of sorts and should give thought to testing each one to ensure there’s no odd interaction between flag states.
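To show how small this test matrix is compared to every permutation, here is a sketch that builds the configurations described above. The flag names and the helper are hypothetical, and each experiment variant is assumed to be given as the full set of flags enabled for that variant.

```python
def configurations_to_test(all_flags, release_on, experiment_variants=()):
    """Return the flag configurations worth testing, without duplicates."""
    configs = []

    def add(enabled):
        config = {flag: flag in enabled for flag in all_flags}
        if config not in configs:
            configs.append(config)

    add(set(all_flags))   # every flag enabled
    add(set(release_on))  # flags expected to be on for the release
    for variant in experiment_variants:
        add(set(variant))  # each combination served to production users
    return configs


# Hypothetical flags: three flags give 2**3 = 8 permutations in total.
flags = ["new_checkout", "recommendations", "dark_read_new_store"]
configs = configurations_to_test(
    flags,
    release_on=["new_checkout"],
    experiment_variants=[["new_checkout", "recommendations"], ["new_checkout"]],
)
```

Each resulting dictionary can then be fed to the test suite as one run, for instance as a parameter of a parameterized test.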

Closing

Feature flags, together with a rock solid configuration management system to apply them, enable all of these techniques, which help get value to your users at a much faster rate.

These techniques have been in use for a few years at some of the internet technology leaders and startups, so they are not new. Those are exciting and often very vocal places, though, and that can skew the perception of how entrenched the techniques actually are. Scott Hanselman’s notion of the Dark Matter Developers comes to mind.

Nonetheless, with even SAP arriving at continuous delivery, it’s time for all companies to take notice.