Putting the drama back in deployment

Maybe I've just been lucky, but over the years I've noticed a tendency to make product launches, and releases generally, less eventful. The Big Bang release is now such an anti-pattern that the very term is almost impossible to bring up without a derisive laugh. Continuous Integration and Delivery with trunk-based development and feature toggles, plus microservices that are fast to build and deploy, are making releases practically mundane.

But when you've hit MVP, your product owner is happy to make it public, and all you have to do is push some config change... well, where's the fun in that? Sure, you can take some satisfaction in a job well done, but where's the big moment?

When we recently relaunched the Journal pages for Palgrave Macmillan it went pretty much just like that and our PO, a sensible person who wants things to work and go smoothly, reacted with a "so, that's it? It didn't feel very dramatic".

"Dramatic" was the word I latched on to here, and as it was the day before one of our fortnightly 10% time "hack days" I took it as a challenge. What could I do as an engineer to make our next launch more dramatic? While at the same time being technically interesting? And achievable within a few hours?

"Computer..."

It is a truth universally acknowledged that literally everyone wants to live and work on the Enterprise-D. This isn't going to happen, but every little step we can make in that direction is surely a massive improvement in our lives. So with one eye to gaining familiarity with the W3C speech recognition and synthesis APIs, here's what I went for.

Product Owner: Computer!
Computer: Working
Product Owner: Launch the new blog site
Computer: Launching the new blog site requires the authorization of a senior engineer and a product owner. Please identify.
Senior Engineer: Identify Jane Jones
Computer: Senior Engineer Jane Jones identified
Product Owner: Identify Sarah Smith
Computer: Product Owner Sarah Smith identified.

Computer: New blog site launch commencing in 5... 4... 3... 2... 1...

[pause while the launch and confirmation process happens in the background]

Computer: New blog site launched

Before starting on anything here I had to know whether I could use the speech APIs at all. Fortunately finding out was very straightforward, aided immensely by the typically excellent MDN documentation.

Basic speech synthesis is remarkably straightforward, as long as your browser supports it:

const synth = window.speechSynthesis;

// The closest available match to the Star Trek computer's voice
const computerVoice = () => synth.getVoices().find(voice => voice.name === 'Samantha');

function speak(text) {
    const utterThis = new SpeechSynthesisUtterance(text);
    utterThis.voice = computerVoice();
    synth.speak(utterThis);
}

"Samantha" was the closest I could find given I was looking for the Star Trek flavour; perhaps one day Majel Barret's voice itself will be available for use... One headscratcher here was that the synth.getVoices() call returned no voices until after a new SpeechSynthesisUtterance had been constructed, which is presumably a bug.

Recognition required more work, unsurprisingly. First we make sure the classes we need are available to us - I didn't bother providing a graceful fallback for when no recognition is available at all, as this was just a hackday project, but of course that would have been the Right Thing To Do.

// These are prefixed in Chrome; going via window avoids the const
// declarations shadowing the very globals they read
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
const SpeechRecognitionEvent = window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;

Next we initialise and configure a recognition instance:

const recognition = new SpeechRecognition();
recognition.lang = 'en-GB';
recognition.maxAlternatives = 20;

The lang field specifies the language to recognise, and is pretty clear. The maxAlternatives field is more interesting, and I will come back to it later. Note: the tutorials always include setting up a grammar, but my experience showed that this made no difference to the quality of the recognition results (in Chrome) and it could be safely left out.

Recognition provides a number of useful event handlers, of which I used the following:

recognition.onspeechend = function() {
    recognition.stop();
};

Stopping recognition when it detects the end of some speech is necessary when the interface talks back to you, as otherwise it will listen to, and potentially process, its own speech. You therefore need to remember to start it again once the computer has finished speaking, which can be achieved via the utterance's end event.
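A sketch of how that restart might look, extending the speak function from earlier:

function speak(text) {
    const utterThis = new SpeechSynthesisUtterance(text);
    utterThis.voice = computerVoice();
    // Only resume listening once the computer has finished talking,
    // so the recognizer never processes its own voice
    utterThis.onend = () => recognition.start();
    synth.speak(utterThis);
}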

The following was to cover what appears to be a bug where the speech recognition would unexpectedly stop. I was never able to reproduce this consistently - sometimes it would stop during recognition, other times during silence, and never at a regular interval. This is life at the bleeding edge!

recognition.onend = function(event) {
    // Restart immediately, swallowing the error thrown if
    // recognition is already running
    try { recognition.start(); } catch (e) { }
};

I also used the error event to get feedback on problems during development, though in practice I don't recall seeing an error come out of it:

recognition.onerror = function(event) {
    console.error(`Error occurred in recognition: ${event.error}`);
};

Most importantly, recognition.onresult is where the meat of the work is happening. This is where you will need to interpret the results from the recognition system, and due to the nature of the beast you will have to get creative.

recognition.onresult = function(event) {
    // ...
};

The event contains a result list object, in turn containing result objects, which in turn contain a number of 'alternative' objects, each having a transcript field. This is where the maxAlternatives field set when initialising the system comes in - you could trust the system to precisely recognise your speech and stick with just the one alternative, and in good conditions with distinctive words this can actually work. In practice, however... well, the following is a list of alternatives given by the system for my name:

["Jim McKenzie", "Jim MacKenzie", "it Jim McKenzie", "the Jim McKenzie", "gym McKinsey", "Jim McKinsey", "gym MacKenzie", "gym McKenzie", "a Jim McKenzie", "the gym MacKenzie", "the gym McKenzie", "the gym McKinsey", "a gym McKinsey"]

This is presumably where the grammar would help in theory, but it does not appear to. Ensuring you get a wide list of alternatives and then checking it for known phrases you want to recognise is a good work-around:

const recognisedPeople = knownPeople.filter((person) => result.transcript === person.name);
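For context, a sketch of how those alternatives might be gathered inside the onresult handler, assuming knownPeople holds the people from the launch model shown later:

recognition.onresult = function(event) {
    // The first result carries up to maxAlternatives alternatives
    const alternatives = Array.from(event.results[0]);
    const recognisedPeople = knownPeople.filter((person) =>
        alternatives.some((result) => result.transcript === person.name)
    );
    // ... act on any recognised people
};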

I also experimented with having a data structure with a 'sounds-like' field, just in case a name is unusual enough that the correct transcription fails to turn up in even 20 alternatives - it is best not to request too many alternatives, as you run the risk of too many false positives:

const recognisedPeople = knownPeople.filter((person) => result.transcript === person.name || (person.soundsLike || []).indexOf(result.transcript) > -1);

To avoid false positives, having a restricted grammar of commands is very helpful - the commands in this case follow this format:

[command word] [additional message]

Where the command word indicates what structure, if any, the additional message has. For example, "Computer!" has no variable content and gets the system ready to listen for more commands, while "Launch New Blog" combines the command word that initiates a launch conversation with the name of the feature to be launched. Keeping the command words distinct is important in avoiding false positives and a paralysed app - and one approach that can help with this is treating the conversation as a finite state machine.

Finite State Machines

[A finite state machine] is an abstract machine that can be in exactly one of a finite number of states at any given time. The FSM can change from one state to another in response to some external inputs; the change from one state to another is called a transition. An FSM is defined by a list of its states, its initial state, and the conditions for each transition.

from https://en.wikipedia.org/wiki/Finite-state_machine

Each step of the launch process is a state - the computer is standing by; a feature has been selected for launch, requiring authorization; x/n authorized personnel have been identified; authorization complete, feature launching; feature is launched.

As each state has a very limited set of transitions it is unlikely there will be false positives. Even if two authorized personnel have names which are close to one another, one can be arbitrarily chosen and counted before the other.

I haven't yet applied this, at least in a formal way - I'm saving it for another hackday (maybe - I have a long list of saved up projects...) - but it should be interesting.
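For a flavour, here is a minimal sketch of how the conversation states might be modelled, with each state accepting only its own small set of command words (all names here are hypothetical, and the authorization counting is elided):

const states = {
    standingBy: {
        // Only the wake word is valid while standing by
        computer: () => 'awaitingCommand'
    },
    awaitingCommand: {
        launch: () => 'awaitingAuthorization'
    },
    awaitingAuthorization: {
        identify: () => 'awaitingAuthorization' // until every role is counted
    }
};

let currentState = 'standingBy';

// A recognised command word either triggers a transition defined for
// the current state, or is ignored entirely
function handleCommand(word) {
    const transition = states[currentState][word];
    if (transition) currentState = transition();
}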

Security Theatre, Literally

Of course, we are talking about voice recognition not identification. As long as the system recognises the name spoken, it will be authorized, no matter who did the talking. This is fine - the whole project is to add a bit of fun to launches, and is best used when you're gathering the team around to launch some interesting new feature, not as a replacement for general deployments where hopefully you have some kind of access control and auditing.

But you can make it a little more secure in a fun way by adding a step after 'Identify' where the identifying user is prompted for a security code:

Product Owner: Identify Sarah Smith
Computer: Sarah Smith identified. Please state your authorization code
Product Owner: Papa Oscar Sierra Sierra
Computer: Authorization accepted

The authorizer's code can at least be stored hashed for comparison. This makes it harder to rely on looking for matching alternatives in the recognition result, but by limiting the vocabulary used in pass phrases you can significantly reduce recognition failures. The NATO phonetic alphabet is great for this, and has the advantage of sounding very movie-like. I wouldn't recommend using numbers or letters as they are easily mis-recognised.
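As a sketch of how that comparison might work in the browser, assuming each stored code is a SHA-256 hex digest of its pass phrase (the Web Crypto API requires a secure context):

async function matchesStoredCode(spokenPhrase, storedDigestHex) {
    const data = new TextEncoder().encode(spokenPhrase.toLowerCase());
    const hashBuffer = await crypto.subtle.digest('SHA-256', data);
    // Render the digest as lower-case hex for comparison
    const hex = Array.from(new Uint8Array(hashBuffer))
        .map((b) => b.toString(16).padStart(2, '0'))
        .join('');
    return hex === storedDigestHex;
}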

Making it look good, too

There's no real need for a visual element to a voice UI, but then there's no real need for any of this, so what the heck...

First of all I wanted the words to be on the screen, too. Naturally this means they should be in green on black in a futuristic font, and the words should appear as spoken by the computer.

Google fonts has the lovely Audiowide, which is a great fit:

[Image: some text using the Audiowide font on a black background]

To get the words printed to the screen at the right moment we can hook into the onboundary event when speaking an utterance:

// Inside speak(), before calling synth.speak(utterThis):
clearMessage();
let remainingWords = text.split(' ');
utterThis.onboundary = (event) => {
    // A 'word' boundary event fires as the synthesiser reaches each word
    if (event.name === 'word') {
        appendMessage(`${remainingWords[0]} `);
        remainingWords = remainingWords.slice(1);
    }
};

Where appendMessage adds the text to an element in the document which is cleared by clearMessage. There is potential here for making it more flashy using CSS transitions, but I didn't get to it at the time.
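The helpers themselves can be one-liners - a sketch, assuming a target element with the id 'message':

function appendMessage(text) {
    document.getElementById('message').textContent += text;
}

function clearMessage() {
    document.getElementById('message').textContent = '';
}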

Finally I wanted something on the screen while the app is polling to check on the launch status. I decided to have a fairly abstract computer diagram of some kind of object with a trajectory, with various important-looking fields constantly updating themselves. Then, when the launch status indicates success, this disappears in a rapidly expanding red circle before the final '[feature] Launched' utterance.

I used the HTML Canvas API with some timeouts to do this - it is very raw, naive code and a colleague pointed me in the direction of a library which would have made it all much simpler afterwards, so I won't go into details here.

Putting it all together

Once you have got your head around all of this, the rest of the app is fairly standard stuff - a REST API for setting up feature launches, client-side JS that does most of the work, and server-side JS that performs the feature launching and checking to avoid CORS constraints (naturally on my first pass I was testing on localhost and didn't think about this at all...).

The model looks like this:

{
  "name": "New Blog site",
  "trigger": {
    "url": "http://features.myamazingsite.com/new-blog-site/enable",
    "method": "POST",
    "headers": [{ "name": "X-API-Key", "value": "not_very_secure_really" }]
  },
  "check" : {
    "url": "https://blog.myamazingsite.com/"
  },
  "people": [
    {
      "name": "Jane Jones",
      "soundsLike": [ "Jayne Jones", "J N Jones" ],
      "role": "Senior Engineer"
    },
    {
      "name": "Sarah Smith",
      "role": "Product Owner"
    }
  ],
  "launchRoles": [ "Senior Engineer", "Product Owner" ]
}

The trigger mechanism became quite flexible after I looked at a few different ways of enabling features in our systems. When we eventually used this in production, the app we were working on had no way of changing a feature state other than through a config file packaged up with it during the build process, so we prepared a commit and had the trigger approve a deployment to production via our CI server (which required a header to be set, but no other security to speak of).

The check is just looking for a 200 OK from a target URL; in our particular case the target URL was redirecting to an old implementation in production until the feature was toggled on. Checks could also involve looking for text in a response body, for example.
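Both the trigger and the check boil down to plain HTTP calls from the server side - a sketch, assuming a recent Node with a global fetch and the model format above:

async function fireTrigger(trigger) {
    // Build a headers object from the model's name/value pairs
    const headers = {};
    trigger.headers.forEach((h) => { headers[h.name] = h.value; });
    await fetch(trigger.url, { method: trigger.method, headers });
}

async function isLaunched(check) {
    // The feature counts as launched once the target URL returns 200 OK
    const response = await fetch(check.url);
    return response.status === 200;
}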

people lists all personnel known to this app; by decoupling this from the launchRoles you can have some fun telling someone they are not authorized to launch a feature due to their role on the team. I shot myself in the foot by making launchRoles an array, and so somewhat flexible. As this is a natural language interface, I had to write some code to "prettify" an arbitrary set of launch roles - "A senior engineer and a product owner"; "two senior engineers"; "a senior engineer, a QA and a product owner", etc. Admittedly, this was fun to solve anyway:

function prettyLaunchRoles(roles) {

    // Count occurrences of each role, e.g. { "Senior Engineer": 2 }
    const roleCounts = roles.reduce(function (acc, role) {
        return Object.assign(acc, { [role]: (acc[role] || 0) + 1 });
    }, {});

    function englishNumber(num) {
        if (num === 2) return "two";
        if (num === 3) return "three";
        return num.toString();
    }

    // 'a' for a single role-holder, otherwise the number word
    function article(count) {
        return (count === 1) ? 'a' : englishNumber(count);
    }

    function pluralSuffix(count) {
        return (count === 1) ? '' : 's';
    }

    // Join the descriptions with commas, with an 'and' before the last
    const roleNames = Object.keys(roleCounts);
    return roleNames.reduce(function (acc, role, index) {
        let connector = '';
        if (index > 0) {
            connector = (index === roleNames.length - 1) ? ' and ' : ', ';
        }
        return `${acc}${connector}${article(roleCounts[role])} ${role}${pluralSuffix(roleCounts[role])}`;
    }, '');
}

(and now that I've copied and pasted it in here, all I can see are the flaws)
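For the launchRoles in the example model above it does produce the right phrase:

prettyLaunchRoles([ "Senior Engineer", "Product Owner" ]);
// => "a Senior Engineer and a Product Owner"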

The following video is my hackday demo, a rare occasion where the demo gods (mostly) smiled upon me:

[Video: hackday demo]

Using it for real

"It's all very well squandering your 10% time on these flashy little projects which look good in hack day demos, but aren't you going to do something you'll use for real?" asked the strawman. Well, we did use this for real when my team relaunched some important pages on our website after taking over responsibility from a soon-to-be-decommissioned legacy system.

We gathered round the projector at the end of the room with the lights down, had the Good Speakers out, and I set up the dramatiser for a senior engineer and product owner to authorize the launch. And it all worked! And it was fun! If there was anything like a hiccup, it was that we did it via a CI deployment pipeline, and the deployment of our app currently takes several minutes to complete (next time, feature toggles we can alter at runtime!).

Conclusion

The point of this, however, is not that this in particular is a great way of doing fun launches. It's that you can experiment with something new (the W3C speech APIs), sharpen your coding skills with kata-like problems (the natural-language description of groupable sets of roles), have fun (and hopefully bring some fun to your colleagues), and build something real. All of this is valuable, and I am glad to have worked somewhere that provides people with 10% of their time for self-improvement and letting off steam like this.

About

Jim Kinsey is an English software engineer based in Berlin. Currently at Springer Nature, formerly of BBC News & Sport.