Design for Devs – Change sequence diagrams

I’ve been asked a few times by junior developers how to get started in designing code, as if it’s some sort of art technique. In truth every developer is doing design; no one spontaneously writes complex formal logic. Most just do it in their head, based on experience and patterns they’ve used before. For small, well-understood problems this is fine, but when you are dealing with larger or more complex changes, doing a little design work up-front can really help clarify things and ultimately save time while giving a better solution.

I’m writing this to document a simple type of design I’ve used on many projects: a change sequence diagram. You can do one quickly on paper or on a whiteboard in ten minutes, and I’ve found it very helpful for thinking about what changes are required, gauging the size of the overall change and promoting re-use of existing code.

Here’s an example:

It’s a pretty simple variation of a sequence diagram, where you show the sequence of events which should occur as a series of interactions between the involved people/components. It normally starts with a person, like someone clicking a link on a web page, then shows how the system responds. The change part is about highlighting which components in each part of the system need to change to handle the functionality: which parts need to be added, updated or removed.

Doing this forces you to think up-front about what you will need to change and how the system will interact to get the end result. It ties the individual component changes to the overall user requirement, e.g. you’re not just adding a new database column and view field, you’re adding them so the user can see and update their middle name on the personal details screen. This helps you understand how the parts of your system interact and use consistent implementations and design patterns in your changes, plus identify the unit tests and test scenarios.

When you are done, the number and type of changes show the scale of the overall change, which is useful for estimates, and break it down into manageable chunks of work. You’ll get the best results if you do it paired with someone or get someone else to review your design. Doing this checks that you aren’t breaking existing patterns in the code or missing something that could increase or decrease the complexity. You can expand it to include alternate flows and consider NFRs for security and performance.

Next time you’re looking at a new requirement or user story, give this a try. You’ll be surprised how easy it is to do and what you’ll get out of it.
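One way to see how the diagram feeds an estimate is to write its steps down as data and tally them. The sketch below is purely illustrative: the `ChangeStep` type, the `tallyChanges` helper and the component names are all invented here, loosely following the middle name example above.

```typescript
// Hypothetical model of a change sequence diagram: each step in the sequence
// names the component involved and what kind of change (if any) it needs.
type ChangeKind = "add" | "update" | "remove" | "none";

interface ChangeStep {
  component: string;  // e.g. a view, controller or table (names are invented)
  action: string;     // what happens at this point in the sequence
  change: ChangeKind; // what has to change to support it
}

const middleNameChange: ChangeStep[] = [
  { component: "User", action: "opens personal details screen", change: "none" },
  { component: "Personal details view", action: "shows middle name field", change: "add" },
  { component: "Personal details controller", action: "validates and saves middle name", change: "update" },
  { component: "Person table", action: "stores middle name column", change: "add" },
];

// Count the steps per change kind, giving a rough size for the overall change.
function tallyChanges(steps: ChangeStep[]): Record<ChangeKind, number> {
  const tally: Record<ChangeKind, number> = { add: 0, update: 0, remove: 0, none: 0 };
  for (const step of steps) {
    tally[step.change] += 1;
  }
  return tally;
}

const middleNameTally = tallyChanges(middleNameChange);
// middleNameTally is { add: 2, update: 1, remove: 0, none: 1 }
```

Each step with a change other than "none" is a candidate chunk of work, which is roughly how the diagram breaks a requirement down.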

Hadoop summit Dublin 2016


Just back from Hadoop Summit in Dublin, so I thought I would write up the talks I went to and my impressions. All of the videos have been put up, so this could help you decide what to spend time on.

Overall it was good, a nice spread of speakers covering highly technical topics, new products and business approaches. The keynotes were good, some promotional talks by sponsors but balanced with some very good speakers covering interesting topics.

One impression I got was that no one was trying to answer the question of why people should use big data anymore; that has been accepted, and the topics have moved on to how best to use it. There were a lot of talks about security, governance and how to efficiently roll out big data analytics across organisations, loads of new products to manage analytics workflows and simplify user access to multiple resources, and organisational approaches that treat analytics as part of business strategy rather than as cool new tools for individual projects.

One nitpick is they tried to push using a conference mobile app which required crazy permissions. No. I just wanted a schedule. A mobile first web site would have done the job and been more appropriate for the data conscious audience.


Enterprise data lake – Metadata and security – Hortonworks

Mentions of ‘Data lakes’ were all over the conference, as were security concerns about how to manage and govern data access when you start to roll out access across your organisation. This talk covered Hortonworks projects which are attempting to address these concerns, Apache Atlas and Ranger.

Atlas is all about tagging data in your resources to allow you to classify them, e.g. put a ‘personal’ tag on columns in your Hive table which identify people, or an ‘expiry(1/1/2015)’ on tax data from 2012. Ranger is a security policy manager which uses the Atlas tags and has plugins that you add to your resources to control access, e.g. so that only Finance users can access tax data, and to enforce expiries.

You create policies to restrict and control who can do what to your resource data based on this metadata. This approach scales and follows your data as it is used, and gives you a single place to manage your policies, rather than attempting to control access at each individual resource as data is ingested, which becomes unmanageable as your data and resources grow. It also provides audits of access. Later talks also suggested using automation to detect and tag data, such as identifiable or sensitive data, based on its content, to avoid having to find it manually.
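To make the tag-based idea concrete, here is a minimal sketch of a policy check driven by tags rather than by individual resources. It is not the Atlas or Ranger API: the `TaggedColumn` and `TagPolicy` types, the `canRead` function and the sample data are assumptions made up for illustration, and in this sketch a tag with no policy is simply readable.

```typescript
// Illustrative only: this is not the Apache Atlas or Ranger API.
// Columns carry tags; policies are written against tags, not individual columns.
interface TaggedColumn {
  name: string;
  tags: string[];
  expires?: Date; // optional expiry carried as metadata
}

interface TagPolicy {
  tag: string;             // tag the policy applies to
  allowedGroups: string[]; // groups allowed to read data with that tag
}

function canRead(column: TaggedColumn, userGroups: string[], tagPolicies: TagPolicy[], now: Date): boolean {
  // Expired data is denied regardless of group membership.
  if (column.expires !== undefined && now.getTime() > column.expires.getTime()) {
    return false;
  }
  // Every policy matching one of the column's tags must allow one of the user's groups.
  for (const tag of column.tags) {
    for (const policy of tagPolicies.filter(p => p.tag === tag)) {
      const allowed = policy.allowedGroups.some(g => userGroups.indexOf(g) !== -1);
      if (!allowed) return false;
    }
  }
  return true;
}

const taxPolicies: TagPolicy[] = [{ tag: "tax", allowedGroups: ["finance"] }];
const taxColumn: TaggedColumn = { name: "tax_2012.amount", tags: ["tax"], expires: new Date(2015, 0, 1) };

const financeBeforeExpiry = canRead(taxColumn, ["finance"], taxPolicies, new Date(2014, 5, 1));     // true
const marketingBeforeExpiry = canRead(taxColumn, ["marketing"], taxPolicies, new Date(2014, 5, 1)); // false
const financeAfterExpiry = canRead(taxColumn, ["finance"], taxPolicies, new Date(2016, 0, 1));      // false
```

The point of the design is that a new table only needs tagging; the existing policies then apply to it without being rewritten per resource.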

Querying the IoT with streaming SQL

This talk was a bait and switch, not really about IoT but it still was interesting. It was really about streaming SQL, which the presenter thinks will become a popular way to query streaming data across multiple tools. I do agree with the idea, SQL is such a common querying language and most users would prefer not to learn tool specific query languages all the time.

The push for using streaming data is that your data is worth most when it is new and new data plus old is worth more. This means you should be trying to process your data as you get it, producing insights as quickly as possible. Streaming makes this possible.

He went into a lot of technical detail, with examples of how you would use it as a superset of SQL, and mentioned using Apache Calcite as a query optimiser to run these queries across multiple data sources.
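As a rough illustration of what a windowed streaming query computes, the sketch below counts events per device in fixed, non-overlapping time windows. It is not streaming SQL or Calcite itself: the `SensorEvent` shape and the `countPerWindow` function are invented here, and it processes a finished array where a real streaming engine would emit results as data arrives.

```typescript
interface SensorEvent {
  deviceId: string;
  timestampMs: number; // event time in milliseconds
}

interface WindowCount {
  windowStartMs: number;
  deviceId: string;
  count: number;
}

// Groups events into fixed, non-overlapping (tumbling) windows and counts them per device.
function countPerWindow(events: SensorEvent[], windowMs: number): WindowCount[] {
  const counts: Record<string, WindowCount> = {};
  for (const e of events) {
    const windowStartMs = Math.floor(e.timestampMs / windowMs) * windowMs;
    const key = windowStartMs + "|" + e.deviceId;
    if (counts[key] === undefined) {
      counts[key] = { windowStartMs: windowStartMs, deviceId: e.deviceId, count: 0 };
    }
    counts[key].count += 1;
  }
  return Object.keys(counts).map(k => counts[k]);
}

const sampleEvents: SensorEvent[] = [
  { deviceId: "a", timestampMs: 1000 },
  { deviceId: "a", timestampMs: 30000 },
  { deviceId: "b", timestampMs: 45000 },
  { deviceId: "a", timestampMs: 61000 },
];

const perMinute = countPerWindow(sampleEvents, 60000);
// Window starting at 0: device a has 2 events, device b has 1.
// Window starting at 60000: device a has 1 event.
```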

Advanced execution visualisation of Spark job – Hungarian Academy of Sciences

Talk from researchers who worked with local telcos to try and analyse mobile data. They won a Spark community award for their work in creating visualisations of Spark jobs to help find anomalies in data that cause jobs to finish slower.

They named this the ‘Bieber’ effect, based on the spike in tweets caused by Justin Bieber (and other celebrities). This spike can hurt job execution if you are using the default random bucket allocation in Spark based on hashes of keys, as suddenly a load of work needs to be aggregated across multiple nodes where it would be more efficient to partition it into specific ones closer to the data. The real example they found was cell tower usage spiking due to a local football match.

They’ve created the tools to view these spikes and test in advance using samples, and aim to create ways to dynamically allocate and partition tasks based on these spikes and improve the efficiency of their jobs.
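The skew problem is easy to reproduce in miniature. The sketch below assigns records to partitions by hashing their key, then adds one very popular key. The hash function, key names and record counts are all invented for illustration, and this is not the hash Spark uses.

```typescript
// Simple deterministic string hash (illustrative; not the hash Spark uses).
function hashKey(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) % 1000003;
  }
  return h;
}

// Counts how many records land in each partition when partition = hash(key) mod N.
function partitionSizes(recordKeys: string[], partitions: number): number[] {
  const sizes: number[] = [];
  for (let p = 0; p < partitions; p++) sizes.push(0);
  for (const key of recordKeys) {
    sizes[hashKey(key) % partitions] += 1;
  }
  return sizes;
}

// 1,000 ordinary keys with one record each, plus one "celebrity" key with 5,000 records.
const tweetKeys: string[] = [];
for (let i = 0; i < 1000; i++) tweetKeys.push("user-" + i);
for (let i = 0; i < 5000; i++) tweetKeys.push("bieber");

const tweetPartitionSizes = partitionSizes(tweetKeys, 8);

let largestPartition = 0;
for (const size of tweetPartitionSizes) {
  if (size > largestPartition) largestPartition = size;
}
// Every "bieber" record hashes to the same partition, so one partition
// holds at least 5,000 of the 6,000 records while the other seven share the rest.
```

Whichever node gets that partition finishes long after the others, which is the kind of anomaly the visualisations are meant to expose.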

Award blog

Real world NoSQL schema design – MapR

Talk about how you can take advantage of the flexible schema in NoSQL DBs to improve data read times, doing things like putting your data into the column names and keys.

Very useful for people doing analysis on large amounts of simple data or large de-normalised datasets.
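As a rough idea of what putting data into keys and column names can look like, the sketch below stores time-series readings with the sensor and day in the row key and the time of day as the column name. It is an in-memory stand-in rather than a real MapR or HBase client, and the `WideTable` type, function names and key layout are assumptions for illustration, not the design from the talk.

```typescript
// Illustrative in-memory stand-in for a wide-column table:
// row key -> (column name -> value). Not a real NoSQL client.
type WideTable = Record<string, Record<string, number>>;

// The sensor id and day go in the row key; the time of day becomes the column name.
function putReading(table: WideTable, sensorId: string, isoTimestamp: string, value: number): void {
  const day = isoTimestamp.substring(0, 10);        // e.g. "2016-04-13"
  const timeOfDay = isoTimestamp.substring(11, 19); // e.g. "09:15:00"
  const rowKey = sensorId + "#" + day;
  if (table[rowKey] === undefined) {
    table[rowKey] = {};
  }
  table[rowKey][timeOfDay] = value;
}

// All readings for one sensor on one day come back from a single row lookup.
function readingsForDay(table: WideTable, sensorId: string, day: string): Record<string, number> {
  return table[sensorId + "#" + day] || {};
}

const readings: WideTable = {};
putReading(readings, "s1", "2016-04-13T09:15:00Z", 21.5);
putReading(readings, "s1", "2016-04-13T09:20:00Z", 21.7);
putReading(readings, "s1", "2016-04-14T09:15:00Z", 20.9);

const s1OnThe13th = readingsForDay(readings, "s1", "2016-04-13");
// s1OnThe13th is { "09:15:00": 21.5, "09:20:00": 21.7 }
```

Because the lookup key already encodes what the reader wants, a day's readings need no scan or join, which is where the read-time benefit comes from.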

TensorFlow – Large scale deep learning for intelligent computer systems – Google

Could you recognise a British Shorthair cat?

If not, you know less about cats than a server rack at Google. Expect that list of things to grow.

Good talk on how they are using machine learning at Google, using classified, tagged image data to train models that can recognise objects in other images, including things like whether the people in the picture are smiling.

Talked about TensorFlow, their open source deep learning project. If you are interested in machine learning I’d take a look at the videos they have up.

Migrating Hundreds of Pipelines in Docker containers – Spotify

I like containers, so was looking forward to this.

Good talk covering Spotify’s use of Big Data over the last 6 years, going from AWS-hosted Hadoop as a service, to running their own cluster, to their current move to Google Compute using Docker with their own container orchestration solution, Helios.

They are now working on a service, Styx (don’t search Spotify Styx, you’ll just get heavy metal), which will allow them to do “Execution as a service”. This is a very exciting idea, allowing users to define jobs to run along with the Docker images to execute them. This is a great way to manage dependencies and resources for complex big data tasks, making it easier to do self-service for users with governance.

Hadoop helps deliver high quality, low cost healthcare services – Healtrix

After a load of talks mainly about ROI and getting revenue it was nice to hear a talk about trying to give something back and improve quality of life. The speaker grew up in a poor Indian village and had experience of poor access to healthcare.

His talk was about providing at risk people with healthcare sensors (for blood sugar/pressure etc.) that connect to common mobile devices and send sensor data to backend servers that analyse the data. This can be used as part of predictive and preventative care to reduce the cost of unplanned hospital visits. Using this, healthcare providers can monitor patients with Diabetes or heart conditions, vary their drug prescriptions or advise appointments without waiting for the standard time between appointments.

This is especially important in areas with poor health coverage and bad transport links, as the data can move a lot easier than the patient can see a doctor.


Apache Hadoop YARN and the Docker container runtime – Hortonworks

Nice to know about this but the talk was pretty dry unless you are really interested in YARN. YARN supports running in Docker containers and there are now some resources provided by new docker and YARN releases for how to manage security and networking.

It did show that it’s possible to run YARN in YARN (YARNception), which apparently has real world uses for testing new versions of YARN without updating your existing version of YARN. YARN.


Organising the data lake – Information governance in a Big Data world – Mike Ferguson

More coverage of governance and security when using big data in your organisation, mainly from a business view rather than technical. If you are interested in how to roll out access of data across your organisation while centralising control and governance you should watch this talk.


Using natural language processing on non-textual data with MLLib – Hortonworks

Good talk, mainly about using Word2Vec (a Google algorithm for finding relationships between words) to analyse medical records and find links between diagnosis codes (US data). This can be used to find previously undocumented links between conditions to aid diagnosis or even predict undiagnosed conditions (note, not a doctor).

The approach could be used in many other contexts and seems very straightforward to apply.

How do you decide where your customer was – Turkcell

Slightly creepy talk (from implications, the speaker was very nice and genuine) about how Turkcell, a telco, is using mobile cell tower data to analyse their customers’ movements, currently to predict demand for growth and roll out LTE upgrades. But they are also using it to get extremely valuable data about the movement and locations of customer demographics, which they can provide to businesses like shopping centers.

From a technical point of view it was interesting and gave a good perspective on the challenges of processing very high volumes of data in the real world.

Made me think, is my mobile company doing this? Then I realised: of course they are, they would be stupid not to.


Using sequence statistics to fight advanced persistent threats – Ted Dunning – MapR

Great talk by an excellent speaker, highly recommend watching the video. Real world examples of large hacking attacks on businesses and insight into how large companies manage those threats.

It was about using very simple counting techniques on massive volumes of data and variables, comparing how often certain conditions (such as header values/orders, request timings etc.) occur together. Using these you can identify patterns of how normal requests look and detect the anomalous patterns used by attackers. The approach is simple and works “well enough” to detect attackers, who cannot know the internals of your servers in advance and so cannot mask themselves.
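A minimal sketch of the counting idea: record how often pairs of request features appear together in normal traffic, then flag pairs in a new request that have rarely or never been seen. The feature names, the `countPairs` and `unusualPairs` functions and the plain threshold are assumptions made for illustration; the talk’s actual statistics may well differ.

```typescript
// Each request is reduced to a list of simple observed features (invented examples).
type RequestFeatures = string[];

// Order-independent key for a pair of features.
function pairKey(a: string, b: string): string {
  return a < b ? a + "&" + b : b + "&" + a;
}

// Count how often each pair of features appears together in known-normal traffic.
function countPairs(requests: RequestFeatures[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const features of requests) {
    for (let i = 0; i < features.length; i++) {
      for (let j = i + 1; j < features.length; j++) {
        const key = pairKey(features[i], features[j]);
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}

// Return the feature pairs in a request seen fewer than minSeen times in the baseline.
function unusualPairs(request: RequestFeatures, baseline: Record<string, number>, minSeen: number): string[] {
  const unusual: string[] = [];
  for (let i = 0; i < request.length; i++) {
    for (let j = i + 1; j < request.length; j++) {
      const key = pairKey(request[i], request[j]);
      if ((baseline[key] || 0) < minSeen) unusual.push(key);
    }
  }
  return unusual;
}

const normalRequest: RequestFeatures = ["ua:chrome", "header-order:A", "lang:en"];
const pairBaseline = countPairs([normalRequest, normalRequest, normalRequest]);

// Same browser and language, but the headers arrive in an order never seen with them.
const suspectPairs = unusualPairs(["ua:chrome", "header-order:B", "lang:en"], pairBaseline, 2);
// suspectPairs is ["header-order:B&ua:chrome", "header-order:B&lang:en"]
```

An attacker can copy the individual values easily enough; matching every combination the server has quietly been counting is much harder.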

If you have an interest in security take a look at the talk.

Ted Dunning is a rock star, notice the abnormal co-occurrence of female audience members in the front row.


Videos for the keynotes aren’t published but I thought I should recognise some of the really interesting ones.

Data is beautiful – David McCandless

Standout talk from David McCandless with a focus on how to use visualisations to show complex relationships and share insights, data visualisation as a language. Lots of great visual examples, showing the scale of costs of recent high profile events, a timeline of ‘fear’ stories in the press and the most common breakup times from scraping Facebook.

Very interesting talk and I’m going to look up more about his work. Check out his visualisation twitter account for examples:

Technoethics – Emer Coleman – Irish gov

Talk about how big data is shaping society and how we should be considering the ethics of the software and services we are creating, both in private companies and government. Mentioned the UK’s snooper’s charter, Facebook’s experiment in manipulating users’ emotions and Google’s ability to skew election results.

I do think we should consider the ethical impact of our work more in software development, trying to do more good and less ignoring the negative effects.

The incremental complexity trap

This is a common problem which developers face, particularly in agile projects when it’s normal to make user stories related to previous user stories to add new functions to existing screens and flows.

It happens all the time. You start with something like a simple screen and do the necessary work to make it. Then you get a few new requirements, add some more fields and new validation for an alternate user flow etc. You do the same, add the fields to your screens and logic to the controllers. Then it happens again, then again and again. More user flows are added, more fields appearing in some of those flows, more complex validation. Your original simple screen, controller and validation logic is now a monster, unmaintainable and a nightmare to debug.

Often teams don’t do anything to solve it, just live with the problems. Commonly by the time they realise what’s happening it’s easier just to keep layering on the complexity rather than deal with it properly. No one wants to be holding the bag when it’s time to stop and refactor everything.

This can be particularly bad if it’s happening in multiple places at the same time during a project. The effects of shortsighted decisions start to snowball, affecting development speed and increasing regression issues, and it’s not feasible to refactor everything (try selling a sprint full of that to a Product Owner).

So how do you avoid the trap?

  • Keep it simple

Establish code patterns that encourage separation of concerns, make the team aware of them and how to repeat the patterns.

  • Unit tests

Test complexity helps highlight when things are getting out of hand before it’s too late, and tests make it much easier to refactor by reducing the chance of regression issues.

  • Anticipate it before it happens and design

This is the job of the Technical Architect, to know what requirements and stories are coming and how they will be implemented. If an area is going to get a lot of complexity it needs to be handled or you will end up in the trap.

What can you do if you are already in it?

  • Stop making it worse

For complex classes stop adding new layers of logic. Obvious, but the temptation will be there. Stop and plan a better approach, as the sooner you do this the easier it will be.

  • Don’t try for perfection and refactor everything

Big bang refactors are risky and time consuming. Take part of the complexity and split it out, establishing a pattern to remove more.

  • Focus on your goals

Overly complex classes aren’t bad out of principle. They are bad because they slow development and give bugs places to hide. Refactoring and creating a framework to add new functionality can speed development and reduce the occurrence of defects, which also eat time. You should focus your effort on refactoring areas which will need more complexity later, investing time now to save more later.
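As one possible shape for splitting part of the complexity out and establishing a pattern, the sketch below pulls validation out of a single large method into small rule functions that each user flow composes. The `PersonForm` fields, rule names and flow names are hypothetical, and this is just one way of doing it.

```typescript
// Hypothetical form model; field names are invented for illustration.
interface PersonForm {
  name: string;
  middleName?: string;
  email?: string;
}

// Each rule is a small function that returns an error message, or null if the form passes.
type ValidationRule = (form: PersonForm) => string | null;

const nameRequired: ValidationRule = form =>
  form.name.trim() === "" ? "Name is required" : null;

const emailLooksValid: ValidationRule = form =>
  form.email !== undefined && form.email.indexOf("@") === -1 ? "Email is invalid" : null;

// Each user flow lists the rules it needs instead of branching inside one large validator.
const rulesByFlow: Record<string, ValidationRule[]> = {
  quickAdd: [nameRequired],
  fullEdit: [nameRequired, emailLooksValid],
};

function validateForm(flow: string, form: PersonForm): string[] {
  const rules = rulesByFlow[flow] || [];
  const errors: string[] = [];
  for (const rule of rules) {
    const error = rule(form);
    if (error !== null) errors.push(error);
  }
  return errors;
}

const fullEditErrors = validateForm("fullEdit", { name: "", email: "nope" });
// fullEditErrors is ["Name is required", "Email is invalid"]
const quickAddErrors = validateForm("quickAdd", { name: "Ann", email: "nope" });
// quickAddErrors is []
```

A new flow or field then means adding a rule and a list entry, each of which can be unit tested alone, rather than adding another branch to the monster.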

Infrastructure as code, containers as artifacts

As a developer, one of the things I love about containers is how fast they are to create, spin up and then destroy. For rapid testing of my code in production-like environments this is invaluable.

But containers offer another advantage, stability. Containers can be versioned, defined with static build artifact copies of the custom components they will host and explicitly versioned dependencies (e.g. nginx version). This allows for greater control in release management, knowing exactly what version of not just the custom code you have on an environment but the infrastructure running it. Your containers become an artifact of your build process.

While managing versions of software has long been standard practice, this isn’t commonly extended to infrastructure as code (environment creation/update by scripts and tools like Puppet). Environments are commonly moving targets: the separation of development and operations teams means software and environment releases are done independently, with environment dependencies and products being patched independently of functionality releases (security patching, version updates etc.). This can cause major regression issues which often can’t be anticipated until they hit pre-production (if you are lucky).

By using containerisation with versioning you can release environmental changes with precise control, something that is very important when dealing with a complex distributed architecture. You can release and test changes to individual servers, then trace back issues to the changes introduced. The containers that make up your infrastructure become build artifacts, which can be identified and updated like any other.
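To illustrate tracing issues back to the changes introduced, the sketch below treats a release as a manifest of pinned container versions and lists what differs between two releases. The manifest format, component names and version numbers are invented for illustration; a real pipeline would more likely read this from image tags or a registry.

```typescript
// Hypothetical release manifest: every container, custom or infrastructure,
// is pinned to an explicit version. Names and versions are invented.
type ReleaseManifest = Record<string, string>;

const release41: ReleaseManifest = { "web-app": "2.3.0", "nginx": "1.9.9", "api": "1.7.2" };
const release42: ReleaseManifest = { "web-app": "2.3.0", "nginx": "1.9.12", "api": "1.8.0" };

// List what changed between two releases, so an issue can be traced to a specific change.
function diffReleases(before: ReleaseManifest, after: ReleaseManifest): string[] {
  const componentNames = Object.keys(before).concat(
    Object.keys(after).filter(n => before[n] === undefined)
  );
  const changes: string[] = [];
  for (const n of componentNames.sort()) {
    const oldVersion = before[n];
    const newVersion = after[n];
    if (oldVersion !== newVersion) {
      changes.push(n + ": " + (oldVersion || "(none)") + " -> " + (newVersion || "(none)"));
    }
  }
  return changes;
}

const changes41to42 = diffReleases(release41, release42);
// changes41to42 is ["api: 1.7.2 -> 1.8.0", "nginx: 1.9.9 -> 1.9.12"]
```

Here an infrastructure change (nginx) shows up in the same list as a code change (api), which is the traceability the versioned containers are buying you.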

Here’s a sequence diagram showing how this can be introduced into your build process:

Containers as artifacts (1)

At the end of this process you have a fixed release deployed into production, with traceable changes to both custom code and infrastructure. Following this pattern allows upfront testing of infrastructure changes (including developer level) and makes it very difficult to accidentally cause any differences between your test and production environments.

ASP.MVC Datatables server-side

This is an example implementation of JQuery Datatables with server-side processing. The source is here.



JQuery Datatables is a great tool, attach it to a table of results and it gives you quick and easy sorting/searching. Against a small dataset this works fine, but once you start to have >1000 records your page load is going to take a long time. To solve this Datatables recommend server-side processing.

This code is an example of implementing server-side processing for an ASP.MVC web application, using a generic approach with Linq so that you can re-use it for different entities easily with little code repetition. It also shows an implementation of full word search across all columns, which is something that the Javascript processing version offers but is very tricky to implement on the database side with decent performance. It’s a C# .NET implementation but you can take the interfaces and calls from the controllers and convert the approach for Java or Ruby (missing the nice Linq stuff though).


I’ll skip the basic view/js details as they are easily available in the Datatables documentation.

The request comes into the controller as a GET with all the sort/search details as query parameters (see here). Datatables expects a result matching this interface:

public interface IDatatablesResponse<T>
{
    int draw { get; set; }
    int recordsTotal { get; set; }
    int recordsFiltered { get; set; }
    IEnumerable<T> data { get; set; }
    string error { get; set; }
}

The controller extracts the parameters, creates the DB context and repository and makes three calls asynchronously:

  • get the total records
  • get the total filtered records
  • get the searched/sorted/paged data

The data is returned and Datatables Javascript uses it to render the table and controls for the correct searched/sorted/paged results.

The magic happens in the DatatablesRepository objects which handle those calls.

DatatablesRepository classes


public interface IDatatablesRepository<TEntity>
{
    Task<IEnumerable<TEntity>> GetPagedSortedFilteredListAsync(int start, int length, string orderColumnName, ListSortDirection order, string searchValue);
    Task<int> GetRecordsTotalAsync();
    Task<int> GetRecordsFilteredAsync(string searchValue);
    string GetSearchPropertyName();
}

The base class DatatablesRepository has a default implementation which provides generic logic for paging, searching and ordering an entity:

protected virtual IQueryable<TEntity> CreateQueryWithWhereAndOrderBy(string searchValue, string orderColumnName, ListSortDirection order)
{
    // Note: the line that creates the initial 'query' for TEntity is missing from this listing.
    query = GetWhereQueryForSearchValue(query, searchValue);
    query = AddOrderByToQuery(query, orderColumnName, order);
    return query;
}

protected virtual IQueryable<TEntity> GetWhereQueryForSearchValue(IQueryable<TEntity> queryable, string searchValue)
{
    string searchPropertyName = GetSearchPropertyName();
    if (!string.IsNullOrWhiteSpace(searchValue) && !string.IsNullOrWhiteSpace(searchPropertyName))
    {
        // Split the search text into words; every word must be contained in the search property.
        var searchValues = Regex.Split(searchValue, "\\s+");
        foreach (string value in searchValues)
        {
            if (!string.IsNullOrWhiteSpace(value))
            {
                queryable = queryable.Where(GetExpressionForPropertyContains(searchPropertyName, value));
            }
        }
        return queryable;
    }
    return queryable;
}

protected virtual IQueryable<TEntity> AddOrderByToQuery(IQueryable<TEntity> query, string orderColumnName, ListSortDirection order)
{
    var orderDirectionMethod = order == ListSortDirection.Ascending
            ? "OrderBy"
            : "OrderByDescending";

    // Build the expression p => p.<orderColumnName> and pass it to OrderBy/OrderByDescending.
    var type = typeof(TEntity);
    var property = type.GetProperty(orderColumnName);
    var parameter = Expression.Parameter(type, "p");
    var propertyAccess = Expression.MakeMemberAccess(parameter, property);
    var orderByExp = Expression.Lambda(propertyAccess, parameter);
    var filteredAndOrderedQuery = Expression.Call(typeof(Queryable), orderDirectionMethod, new Type[] { type, property.PropertyType }, query.Expression, Expression.Quote(orderByExp));

    return query.Provider.CreateQuery<TEntity>(filteredAndOrderedQuery);
}

The default implementation for creating the Where query (for searching) will only work if GetSearchPropertyName returns the name of a property that exists in the database and holds a concatenation of all the values you want to search, in the format they are displayed.
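The matching rule behind that default can be shown in a few lines. The sketch below applies the same idea in memory rather than through Linq: split the search text on whitespace and keep only the rows whose concatenated search string contains every word. The `SearchableRow` type, the `filterByAllTerms` function and the sample rows are invented for illustration, and the in-memory comparison is case-sensitive, which may not match your database collation.

```typescript
interface SearchableRow {
  searchString: string; // concatenation of the displayed column values
}

// Split the search text on whitespace and keep rows whose search string contains every term.
function filterByAllTerms<T extends SearchableRow>(rows: T[], searchValue: string): T[] {
  const terms = searchValue.split(/\s+/).filter(t => t !== "");
  return rows.filter(row => terms.every(term => row.searchString.indexOf(term) !== -1));
}

const personRows: SearchableRow[] = [
  { searchString: "1 Ann Smith 04/09/1985 Finance" },
  { searchString: "2 Bob Jones 12/01/1990 IT" },
];

const smithInFinance = filterByAllTerms(personRows, "Smith Finance");
// smithInFinance contains only the first row: both words appear in its search string.
```

Because each word is checked separately, the words can match different columns and appear in any order, which is what makes a single concatenated search column enough.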

If your Entity does not support this, you can override the method with a custom implementation. Here is an example from the Person Entity:

public class PeopleDatatablesRepository : DatatablesRepository<Person>
{
    protected override IQueryable<Person> GetWhereQueryForSearchValue(IQueryable<Person> queryable, string searchValue)
    {
        return queryable.Where(x =>
                // id column (int) - this condition is missing from the listing; the next line is an assumed reconstruction
                SqlFunctions.StringConvert((double)x.Id).Contains(searchValue)
                // name column (string)
                || x.Name.Contains(searchValue)
                // date of birth column (datetime, formatted as d/M/yyyy) - limitation of sql prevented us from getting leading zeros in day or month
                || (SqlFunctions.StringConvert((double)SqlFunctions.DatePart("dd", x.DateOfBirth)) + "/" + SqlFunctions.DatePart("mm", x.DateOfBirth) + "/" + SqlFunctions.DatePart("yyyy", x.DateOfBirth)).Contains(searchValue));
    }
}

The same is true of the order by query, which may need customisation to sort some data correctly, e.g. dates. Here is an example from the PersonDepartmentListViewRepository, which swaps the formatted date column for the raw date column when ordering:

public class PersonDepartmentListViewRepository : DatatablesRepository<PersonDepartmentListView>
{
    protected override IQueryable<PersonDepartmentListView> AddOrderByToQuery(IQueryable<PersonDepartmentListView> query, string orderColumnName, ListSortDirection order)
    {
        // Sort on the raw date rather than the formatted string
        if (orderColumnName == "DateOfBirthFormatted")
            orderColumnName = "DateOfBirth";
        return base.AddOrderByToQuery(query, orderColumnName, order);
    }
}

Using a view will make life much easier, as the data can be pre-formatted and you can supply a search column to do the full word searching. Here’s the view I used to combine results from two tables:

CREATE VIEW [dbo].[PersonDepartmentListView]
AS
-- The Name and DateOfBirth columns and the FROM clause are missing from this listing;
-- they are reconstructed here, and the INNER JOIN is an assumption.
SELECT dbo.Person.Id,
dbo.Person.Name,
dbo.Person.DateOfBirth,
CONVERT(varchar(10), CONVERT(date, dbo.Person.DateOfBirth, 106), 103) AS DateOfBirthFormatted,
dbo.Department.Name AS DepartmentName,
CONVERT(varchar(10), dbo.Person.Id) + ' ' + dbo.Person.Name + ' ' + CONVERT(varchar(10), CONVERT(date, dbo.Person.DateOfBirth, 106), 103) + ' ' + dbo.Department.Name AS SearchString
FROM dbo.Person INNER JOIN
       dbo.Department ON dbo.Person.DepartmentId = dbo.Department.Id


  • If you are displaying date values be aware that you will need to format the date for display before returning it in JSON, and the date format will affect how you sort the column on the backend, as you will need to identify the actual date column property rather than the formatted string
  • For effort and performance you are better off creating a view than using complex Linq queries
  • I created the initial example with the help of Stephen Anderson

Normalization of deviance in Software projects

Found this article before Christmas talking about “normalization of deviance”, where groups of people develop a culture of accepting bad behaviour as the norm. It compares the behaviour of people at Volkswagen, who introduced and hid the emissions defeat device, with Johnson & Johnson, whose management proactively encouraged moral responsibility.

I believe this phenomenon occurs in Software companies too; we often grow to accept bad practices and shortcuts despite knowing the risks and problems they will cause. How often have you joined a new project, found something worrying and been told “Yeah, we know it’s a problem but…”?

To avoid deviance becoming the norm, managers and technical leads need to take a proactive approach, encouraging and rewarding people for avoiding and fixing bad practices that cause issues and technical debt, even if it takes more time. We should also listen to new people, who see our projects from a fresh perspective and often spot the problems we ignore.

Here’s another article more focused on this issue in technical and healthcare cultures.

Offline web

The problem

I get the train to work every morning, and like a good commuter I avoid conversation and eye contact with my fellows, browsing the Internet on my phone. This works well up until 10 minutes into the journey when I hit a signal dead spot, then there’s a grey browser error screen frustrating me.


Why in 2015 is this an acceptable user experience for every web site? We cache HTML, images, CSS and JS for performance, but when your browser can’t reach the site for a response it just gives up and displays a meaningless generic message.

My trivial train problems aside, this is a serious usability issue for web applications. Mobile is now how most people access the Internet, and with mobile you cannot assume perfect network connectivity. They could be moving in a poor mobile signal area, in a Wi-Fi dead zone or just behind thick walls. It probably happens to you dozens of times a day and you never realise until you check something on your phone and the grey “Network Error” stares back at you.

I design and make business software applications, normally web applications. Mobile is big in business now as well. It’s not a feature; it’s a functional requirement. Customers want and expect their users to be able to walk around with their phone or iPad and check stock, view records and submit reports, without having to waste time finding a desktop. Lots of times business environments are bad for connectivity, big warehouses with metal walls, spotty Wi-Fi, hospitals built in the 50s with underground levels.

The normal way to deal with this is native applications, iOS/Android etc., which give you better offline options. But this massively complicates development, adding an entirely new tech stack and making you deal with installation and maintenance on client devices.

People like web sites because they are simple. HTML and HTTP are reliable, you navigate to a site and it loads the page, no install and works on all devices/browsers the same (mostly). It gives the same core experience on your desktop or phone, with no unique application UI and controls to understand.

So can we make websites more offline friendly?

Offline web

This is a bit of a silly sounding term, “Offline web”. We’ve lived with the restrictions of the HTTP request for HTML response pattern for so long that a web page being able to do something useful without needing a response seems strange. But what if after the first request, first visit, the web site could tell your browser what to do when you lose connection?

In HTML5, app cache was introduced as a way to provide offline functionality, but it’s a very static and cumbersome method which hasn’t taken off well (see here for a criticism of it). Local Storage and Web Workers also help web sites cache and dynamically request resources.

A new approach, Service Workers, has been proposed by Google, Samsung, Mozilla and others. A Service Worker is a JavaScript script that the browser runs in the background, separately from the web page. It can intercept requests and cache responses, allowing you to programmatically handle offline functionality. It’s a very interesting idea, have a look at this article for more detail.

Since it’s just an additional JavaScript file that can be loaded if the client browser supports Service Workers, you can use the progressive enhancement approach for adding offline functionality so your site still works on older/incompatible browsers. Currently it’s not supported on Safari or IE, which is a big limitation, but hopefully it will get adopted in the future.

Offline Service Worker example

To try out the offline functionality I wanted to use it in a dynamic web application rather than static content, so I went for a simple CRUD Person web site using the backup cache pattern (see here for Jake Archibald’s excellent offline cookbook).

Offline Service Worker cache fallback

The source is available here.

The site uses it to enhance the application with limited offline functions when available, caching responses for the Person list and Person entries. If a request fails, due to connection problems or the site being down, the cached response is retrieved and modified to display an offline warning message and the time it came from. Edit/Delete/Add buttons are hidden, as they would not work without a connection.


This allows users to continue viewing records they already retrieved even if they lose connection, something that can be very important in business applications.
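The core of the cache fallback pattern fits in one function. The sketch below is not the code from the example site: it swaps the browser pieces for simple stand-ins (a `Fetcher` function instead of `fetch`, a plain object instead of the Cache API) so the logic can run anywhere, and all of the type and function names are invented here. In a real Service Worker the same logic would sit inside a `fetch` event handler.

```typescript
// Minimal stand-ins for the browser pieces so the pattern can run anywhere.
interface CachedEntry {
  body: string;
  cachedAt: Date;
}

interface PageResult {
  body: string;
  offline: boolean;      // true when the body was served from the cache
  cachedAt: Date | null; // when the cached copy was stored, for the warning message
}

type Fetcher = (url: string) => Promise<string>;

function fetchWithCacheFallback(
  url: string,
  fetcher: Fetcher,
  store: Record<string, CachedEntry>,
  now: () => Date
): Promise<PageResult> {
  return fetcher(url).then(
    (body): PageResult => {
      // Network succeeded: remember the response for later and return it as-is.
      store[url] = { body: body, cachedAt: now() };
      return { body: body, offline: false, cachedAt: null };
    },
    (error): PageResult => {
      // Network failed: fall back to the cached copy if there is one.
      const cached = store[url];
      if (cached === undefined) throw error;
      return { body: cached.body, offline: true, cachedAt: cached.cachedAt };
    }
  );
}
```

The `offline` flag and `cachedAt` time are what the page would use to show the warning message and hide the Edit/Delete/Add buttons. A URL that has never been fetched successfully still fails, since there is nothing to fall back to.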

It’s a simple example; you can do a lot more with Service Workers than just backup offline caching. You can proactively request resources and store locally to dramatically improve page load time, use in combination with a single page application framework (AngularJS, Ember etc.) to cache API call results to display while waiting for a response. Future functionality for Service Workers will include push notifications, background sync, and geo-fencing.


I believe with the new functionality available offline support in web sites is going to become the norm, with users expecting decent sites to handle connection problems gracefully with degraded functionality rather than total failure.

This is going to require new techniques for creating web sites, good knowledge of JavaScript and use of new frameworks to help. Already the Polymer web framework has a module to support offline. As more browsers and devices support the standards, the number of frameworks and tools to help develop offline functionality will increase.

This is an exciting time to be doing web development, lots of new possibilities for web applications are coming, changing what’s possible and making the web user experience better than before.