The symmetrical architecture trap

Often when thinking about topics to write about I hesitate, because in retrospect what I'm saying seems obvious. But it's very common to fall into simple patterns when you're in the thick of a project, and the flaws only become apparent later, once the early chaos is over and you can think clearly.

One of these traps is symmetrical application architecture, making two similar components in your solution use the same architecture, even when they have different requirements.

A common example of this is a web application with a public-facing external site and a more secure internal site for administration. On the surface these two components have similarities: they both serve HTML and need to access/persist data, so you may initially use the same architecture for them.

[Diagram: symmetric architecture]

However, you soon realise the external site needs to handle much more traffic than the internal one, and its data requirements are different (higher read or write volumes, only needing access to specific data). You can resolve this by scaling the architecture, but it's clunky.

[Diagram: symmetric architecture under external load]

Then you realise the internal site has more security and auditing requirements. You can resolve this with implementation changes, but it would be neater to include additional layers or services.

[Diagram: symmetric architecture with internal audit/security layers]

The symmetry of the architecture becomes a conceptual barrier to change: changing either one appears to introduce more complexity, but in reality the implementations are diverging anyway due to their different circumstances. Looking at them individually, and at how they will be hosted on less abstract infrastructure diagrams, can help. It could be that your external and internal sites don't need the same data store or layers, and changing them could save resources and simplify the implementation.

[Diagram: asymmetric architecture]

Embracing asymmetry in your architecture early can help you break out of this mindset and prevent you hitting problems later, when your implementation workarounds start to creak.


Design – Minimalist principle of least privilege web application approach

[Diagram: principle of least privilege web application design]

I used this design on a recent project and wanted to write up my thoughts.

This approach was taken because the project was a small-scale web application with tight timescales. I'd previously worked on several projects which took a generic web-api-db pattern even when there was no plan or ability to scale or separate the components out, so the implementation increased complexity for little gain. The principle of least privilege had also come up in some security reviews, with database permissions not really being considered early on.

I wanted to see if I could cut out the API components, which added an additional layer of mainly boilerplate code, without resorting to a monolith design. This also reduced the complexity of the infrastructure and networking. Experience of working with database permissions made me aware that user/role/schema permissions can be set at a very fine-grained level, providing assurance that connections can be locked down to specific tables/operations (e.g. SELECT/UPDATE only, no DELETE).
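As a rough illustration of the kind of grants this relies on (a minimal sketch with hypothetical schema, table and role names, using Postgres-style syntax):

-- the public site's login can submit applications and read them back, nothing else
CREATE ROLE external_web_user LOGIN PASSWORD 'change-me';
GRANT USAGE ON SCHEMA external TO external_web_user;
GRANT SELECT, INSERT ON external.applications TO external_web_user;

-- the internal site's login can read and update applications, but never delete them
CREATE ROLE internal_site_user LOGIN PASSWORD 'change-me';
GRANT USAGE ON SCHEMA internal TO internal_site_user;
GRANT SELECT, UPDATE ON internal.applications TO internal_site_user;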

You could take this further and go for a full microservice split, with internal/external each having a separate worker and communicating via a limited set of exposed API endpoints, but for this project that wasn't really necessary, and I was sick of designs dictated by patterns rather than needs.

Scenario

You have two applications:

  • external-web
    • Public site used by unauthenticated users and exposed to the internet
    • Allows users to submit application data to be processed, with a limited view of previously submitted application data
    • High risk, don’t want users to potentially view other users’ application data or change details
    • Higher usage than internal (public facing, unpredictable traffic)
  • internal-site
    • Internal site used by authenticated users and IP restricted
    • Allows users to process applications
    • Lower risk, but still don’t want users to be able to perform actions like deleting records or submitting applications
    • Low number of active users (small team)

Proposed solution:

  • Split the data stores, so external-web and internal-site have their own stores, with external-web only holding data as long as necessary
  • Use permissions to prevent each application from doing anything other than the minimum they need on their stores (principle of least privilege)
  • Use a worker application, not exposed or directly connected to either application, to move data between the two
  • Use either a special API or function to allow external-web to query historic data with limited access, so it cannot query the entire store

Thoughts on outcome

Pros

  • Simple and low number of components (moving parts that could go wrong)
  • Low infrastructure requirements
  • Still able to scale internal/external independently
  • Less code and complexity
  • Public facing site only has access to data in transit, not large amounts of long term data

Cons

  • Public facing application has access to database (even if limited to select/updates)
  • Unable to scale external/internal API independently from sites
  • Worker unable to scale independently of external/internal
  • Lose a lot of relational integrity from copying between stores if using relational stores


Building and tagging container images in CI

[Diagram: Docker CI build and tagging]

I’ve been thinking a lot recently about how to manage versioning and deployment using Docker for a small-scale containerised solution. It’s different from a traditional release pipeline, as the build artifacts are the container images with the latest code and configuration, instead of the CI holding a zip of the built application.

In a completely ideal containerised microservice solution all containers are loosely coupled and can be tested and built independently. Their CI configuration can be kept independent as well, with the CI and testing setup for the entire orchestrated solution taking the latest safe versions of the containers and performing integration/smoke tests against test/staging environments.

If your solution is smaller scale and the containers are more tightly linked together, this is my proposed setup.

Build

Images should be built consistently, so dependencies should be resolved and fixed at the point of build. For Node this is done with npm shrinkwrap, which generates a file pinning npm install to specific dependency versions. This should be done as part of development each time package.json is updated, to ensure all developers as well as images use the exact same versions of packages.
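In practice that workflow is roughly:

npm install
npm shrinkwrap
git add package.json npm-shrinkwrap.json
git commit -m "Pin dependency versions"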

On each commit to develop the image is built and tagged twice, once with “develop” to tag it as the latest version for develop branch code, and then with the version number in the git repo VERSION.md (“1.0.1”). You cannot currently build with multiple tags, but building images with same content/instructions does not duplicate image storage due to Docker image layers.
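Something along these lines in the CI build step, assuming a hypothetical image name and a VERSION.md that contains just the version number:

VERSION=$(cat VERSION.md)                    # e.g. 1.0.1
docker build -t myorg/my-web:develop .
docker build -t myorg/my-web:"$VERSION" .    # second build reuses the cached layers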

Tagging

The “develop” tagged image is used as the latest current version of the image to be deployed as part of automated builds to the Development environment; in the develop branch docker-compose.yml all referenced images use that tag.

The version-number tagged image, “1.0.1”, is kept as a fixed historical version for traceability, so for specific releases the tagged master docker-compose.yml will reference specific versioned images. This means we have a store of built, versioned images which can be deployed on demand to re-create and trace issues.
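Concretely, the two compose files differ only in the image tags they reference; a minimal sketch of the develop branch file, with hypothetical image names:

web:
  image: myorg/my-web:develop
api:
  image: myorg/my-api:develop

The tagged master docker-compose.yml looks the same, except each image line references a fixed version such as myorg/my-web:1.0.1.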

On each significant release, the latest version image will be pushed to the image repository with the tag “latest” (corresponding to the code in the master branch).
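The release step is then roughly (again with a hypothetical image name):

docker tag myorg/my-web:1.0.1 myorg/my-web:latest
docker push myorg/my-web:1.0.1
docker push myorg/my-web:latest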

Managing data store changes in containers

[Diagram: container CI and data migrations]

When creating microservices it’s common to keep their persistence stores loosely coupled, so that changes to one service do not affect others. Each service should manage its own concerns, be the owner of retrieving/updating its data, and define how and where it gets it from.

When using a relational database for a store there is an additional problem: each release may require schema/data updates, the well-known problem of database migration (schema migration/database change management).

There are a large number of tools for doing this: Flyway, Liquibase, DbUp. They allow you to define the schema/data for your service as a series of ordered migration scripts, which can be applied to your database regardless of its state, whether it’s a fresh DB or an existing one with production data.

When your container service needs a relational database with a specific schema and you are performing continuous delivery, you will need to handle this problem. Traditionally this is handled separately from the service by CI, where a Jenkins/TeamCity task runs the database migration tool before the task that deploys the updated release code for the service. You will have similar problems with containers that require config changes to non-relational stores (Redis/Mongo etc.).

This is still possible in a containerised deployment, but it has disadvantages. Your CI will need knowledge of, and a connection to, each container’s data store, and it must run the task for each container with a store. As the number of containers increases this adds more and more complexity into your CI, which needs to be aware of all their needs and release changes.

To prevent this from happening, the responsibility for updating its persistence store should sit with the container itself, as part of the container’s definition, code and orchestration details. This allows the developers to define what their persistence store is and how it should be updated each release, leaving CI only responsible for deploying the latest version of the containers.

node_modules/.bin/knex migrate:latest --env development

As an example of this I created a simple People API Node application and container, which has a dependency on a MySQL database with people data. Using Knex for database migration, the source defines the scripts necessary to set up the database or upgrade it to the latest version. The Dockerfile startup command waits for the database to be available, then runs the migration before starting the Node application. The containers necessary for the solution and the dependency on MySQL are defined and configured in the docker-compose.yml.
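The end of the Dockerfile is along these lines (a sketch rather than the exact file from the example repo; wait-for-mysql.sh stands in for whatever script polls the database until it accepts connections, and server.js for the app’s entry point):

# wait for the DB, apply any pending Knex migrations, then start the app
# (the Knex environment is picked up from the knexfile/NODE_ENV)
CMD ./wait-for-mysql.sh && \
    node_modules/.bin/knex migrate:latest && \
    node server.js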

docker-compose up

For a web application example I created a People Web Node application that wraps the API and displays the results as HTML. It has a docker-compose.yml that spins up containers for MySQL, node-people-api (using the latest image pushed to Docker Hub) and itself. As node-people-api manages its own store inside the container, node-people-web doesn’t need any knowledge of the migration scripts to set up the MySQL database.
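A sketch of what that docker-compose.yml looks like (not the repo’s exact file; the Docker Hub image name, credentials and ports are placeholders):

mysql:
  image: mysql:5.7
  environment:
    MYSQL_ROOT_PASSWORD: example
node-people-api:
  image: example-user/node-people-api:latest   # latest image pushed to Docker Hub
  links:
    - mysql
node-people-web:
  build: .
  links:
    - node-people-api
  ports:
    - "3000:3000"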


Session data is evil

I’ve been working in ASP.MVC recently after working in Java for a long time. One of the things that struck me was the common use of session data in web applications.

Now I know that people can and do abuse sessions in Java, but the default routing and ease of access make using them more tempting in ASP.MVC. The standard routing convention of “/Controller/Action/:id” means you need to explicitly code to use RESTful paths that give you multiple IDs in URLs, like “order/2/item/3”, for non-trivial scenarios, and out-of-the-box convenience methods like “TempData” seem to offer magical persistence between requests. These incentives combine to make using session data the path of least resistance in ASP.MVC.


Any data stored in session is inherently unreliable, and use of it makes load balancing and scaling your application much more difficult. Once you use it, each instance of your web application must be able to find the user’s session data to reliably handle requests. Since it’s now extremely common to use multiple instances even for small applications (it’s irresponsible not to, for disaster recovery and redundancy), you will need to think about this before you deploy into production.

It also adds hidden complexity to testing your application. Each endpoint which relies on state stored in session needs to be tested with that application state simulated. This means you have at least two places in your code defining and using the same semi-structured data, which makes your tests complex/fragile and your code harder to maintain.

[Image: abandoned silk mill with broken windows]

Once you make using session data part of your architecture it’s very hard to refactor and remove it. That little innocent use of TempData to store details from the last request will spread as developers think “if it was OK there then it’s OK here…” and “one more can’t hurt” (the broken windows theory). Now your user flows in the web application rely on session-stored details to go from screen A to B to C, and refactoring them means re-writing and testing a lot of the view/controller logic to replace the data held in session.

There are acceptable uses for session data in web applications; authentication is the obvious one. What they have in common is having alternative flows to cope, without breaking functionality, if the session data is not found.

If you have an over-reliance on session in your application you are making a flaky, hard-to-scale and hard-to-maintain application that will at best limp into production. At worst it will fall over and take your users’ data with it.

There are common patterns and methods to avoid needing session data; below are some links to help:

Design for Devs – Change sequence diagrams

I’ve been asked a few times by junior developers how to get started in designing code, as if it’s some sort of art technique. In truth every developer is doing design; no one spontaneously writes complex formal logic. Most just do it in their head, based on experience and patterns they’ve used before. For small, well understood problems this is fine, but when you are dealing with larger or more complex changes, doing a little design work up-front can really help clarify things and ultimately save time while giving a better solution.

I’m writing this to document a simple type of design I’ve used on many projects: the change sequence diagram. It’s one you can do quickly on paper or a whiteboard in ten minutes, and I’ve found it very helpful for thinking about what changes are required and the size of the overall change, and for promoting re-use of existing code.

Here’s an example:

[Example change sequence diagram]
It’s a pretty simple variation of a sequence diagram, where you show the sequence of events which should occur as a series of interactions between the people/components involved. It normally starts with a person, like someone clicking a link on a web page, then shows how the system responds. The change part is about highlighting what components in each part of the system need to change to handle the functionality: what parts need to be added/updated/removed.

Doing this forces you to think up-front about what you will need to change and how the system will interact to get the end result. It ties the individual component changes to the overall user requirement, e.g. you’re not just adding a new database column and view field, you’re adding them so the user can see and update their middle name on the personal details screen. This helps you understand how the parts of your system interact and use consistent implementations and design patterns in your changes, plus identify the unit tests and test scenarios.

When you are done, the number and type of changes shows the scale of the overall change, which is useful for estimates, and breaks it down into manageable chunks of work. You’ll get the best results if you do it paired with someone or get someone else to review your design; doing this checks that you aren’t breaking existing patterns in the code or missing something that could increase or decrease the complexity. You can expand it to include alternate flows and to consider NFRs for security and performance.

Next time you’re looking at a new requirement or user story give this a try, you’ll be surprised how easy it is to do and what you’ll get out of it.

Hadoop summit Dublin 2016


Just back from Hadoop Summit in Dublin; I thought I would give a write-up of the talks I went to and my impressions. All of the videos have been put up, so this could help you decide what to spend time on.

https://www.youtube.com/channel/UCAPa-K_rhylDZAUHVxqqsRA

Overall it was good: a nice spread of speakers covering highly technical topics, new products and business approaches. The keynotes were good, with some promotional talks by sponsors, but balanced with some very good speakers covering interesting topics.

One impression I got was that no one was trying to answer the question of why people should use big data anymore; that has been accepted, and the topics have moved on to how best to use it. There were a lot of talks about security, governance and how to efficiently roll out big data analytics across organisations; loads of new products to manage analytics workflows and simplify user access to multiple resources; and organisational approaches that treat analytics as part of business strategy rather than cool new tools for individual projects.

One nitpick is that they tried to push a conference mobile app which required crazy permissions. No. I just wanted a schedule. A mobile-first website would have done the job and been more appropriate for the data-conscious audience.

Talks

Enterprise data lake – Metadata and security – Hortonworks

Mentions of ‘data lakes’ were all over the conference, as were security concerns about how to manage and govern data access when you start to roll out access across your organisation. This talk covered the Hortonworks projects which are attempting to address these concerns: Apache Atlas and Ranger.

Atlas is all about tagging data in your resources to allow you to classify them, e.g. putting a ‘personal’ tag on columns in your Hive table which identify people, or an ‘expiry(1/1/2015)’ tag on tax data from 2012. Ranger is a security policy manager which uses the Atlas tags and has plugins that you add to your resources to control access, e.g. only Finance users can access tax data, and expiries are enforced.

You create policies to restrict and control who can do what to your resource data based on this metadata. This approach scales and follows your data as it is used, providing a single place to manage your policies, rather than attempting to control access at each individual resource as data is ingested, which becomes unmanageable as your data and resources grow. It also provides audits of access. Later talks also suggested using automation to detect and tag data based on content, such as identifiable or sensitive data, to avoid having to find it manually.

http://ranger.apache.org/
http://atlas.incubator.apache.org/

Querying the IoT with streaming SQL

This talk was a bit of a bait and switch, not really about IoT, but it was still interesting. It was really about streaming SQL, which the presenter thinks will become a popular way to query streaming data across multiple tools. I do agree with the idea: SQL is such a common querying language, and most users would prefer not to learn tool-specific query languages all the time.

The push for using streaming data is that your data is worth most when it is new, and new data combined with old is worth even more. This means you should be trying to process your data as you get it, producing insights as quickly as possible. Streaming makes this possible.

He went into a lot of technical detail and examples of how you would use it as a superset of SQL, and mentioned using Apache Calcite as a query optimiser to run these queries across your multiple data sources.

https://calcite.apache.org/

Advanced execution visualisation of Spark job – Hungarian Academy of Sciences

Talk from researchers who worked with local telcos to try and analyse mobile data. They won a Spark community award for their work in creating visualisations of Spark jobs to help find anomalies in data that cause jobs to finish slower.

They named this the ‘Bieber effect’, based on the spike in tweets caused by Justin Bieber (and other celebrities). This spike can hurt job execution if you are using Spark’s default random bucket allocation based on hashes of keys, as suddenly a load of work needs to be aggregated across multiple nodes, where it would be more efficient to partition it onto specific nodes closer to the data. The real example they found was cell tower usage spiking due to a local football match.

They’ve created the tools to view these spikes and test in advance using samples, and aim to create ways to dynamically allocate and partition tasks based on these spikes and improve the efficiency of their jobs.

Award blog

Real world NoSQL schema design – MapR

A talk about how you can take advantage of the flexible schemas in NoSQL DBs to improve data read times, doing things like putting your data into the column names and keys.

Very useful for people doing analysis on large amounts of simple data or large de-normalised datasets.

TensorFlow – Large scale deep learning for intelligent computer systems – Google

Could you recognise a British Shorthair cat?

If not, you know less about cats than a server rack at Google. Expect that list of things to grow.

A good talk on how they are using machine learning at Google, using classified, tagged image data to train models that can recognise objects in other images, including things like whether the people in the picture are smiling.

Talked about TensorFlow, their open source deep learning project. If you are interested in machine learning I’d take a look at the videos they have up.

https://www.tensorflow.org/

Migrating Hundreds of Pipelines in Docker containers – Spotify

I like containers, so was looking forward to this.

A good talk covering Spotify’s use of big data over the last 6 years: going from AWS-hosted Hadoop as a service, to running their own cluster, to their current move to Google Compute using Docker with their own container orchestration solution, Helios.

They are now working on a service, Styx (don’t search for Spotify Styx, you’ll just get heavy metal), which will allow them to do “execution as a service”. This is a very exciting idea, allowing users to define jobs to run along with the Docker images to execute them. It’s a great way to manage dependencies and resources for complex big data tasks, making it easier to offer self-service to users while keeping governance.

https://github.com/spotify/helios

Hadoop helps deliver high quality, low cost healthcare services – Healtrix

After a load of talks mainly about ROI and getting revenue it was nice to hear a talk about trying to give something back and improve quality of life. The speaker grew up in a poor Indian village and had experience of poor access to healthcare.

His talk was about providing at-risk people with healthcare sensors (for blood sugar/pressure etc.) that connect to common mobile devices and send sensor data to backend servers that analyse it. This can be used as part of predictive and preventative care to reduce the cost of unplanned hospital visits. Using this, healthcare providers can monitor patients with diabetes or heart conditions, vary their drug prescriptions or advise appointments without waiting for the standard time between appointments.

This is especially important in areas with poor health coverage and bad transport links, as the data can move a lot easier than the patient can see a doctor.


Apache Hadoop YARN and the Docker container runtime – Hortonworks

Nice to know about, but the talk was pretty dry unless you are really interested in YARN. YARN supports running in Docker containers, and new Docker and YARN releases now provide some resources for how to manage security and networking.

It did show that it’s possible to run YARN in YARN (YARNception), which apparently has real world uses for testing new versions of YARN without updating your existing version of YARN. YARN.


Organising the data lake – Information governance in a Big Data world – Mike Ferguson

More coverage of governance and security when using big data in your organisation, mainly from a business view rather than technical. If you are interested in how to roll out access of data across your organisation while centralising control and governance you should watch this talk.


Using natural language processing on non-textual data with MLLib – Hortonworks

A good talk, mainly about using Word2Vec (a Google algorithm for finding relationships between words) to analyse medical records and find links between diagnosis codes (US data). This can be used to find previously undocumented links between conditions to aid diagnosis, or even to predict undiagnosed conditions (note: I’m not a doctor).

The approach could be used in many other contexts and seems very straightforward to apply.

https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/presentation/NLP_on_non_textual_data.pdf

How do you decide where your customer was – Turkcell

A slightly creepy talk (from the implications; the speaker was very nice and genuine) about how Turkcell, a telco, is using mobile cell tower data to analyse their customers’ movements, currently to predict demand for growth and roll out LTE upgrades. But they are also using it to get extremely valuable data about the movement and locations of customer demographics, which they can provide to businesses like shopping centres.

From a technical point of view it was interesting and gave a good perspective on the challenges of processing very high volumes of data in the real world.

It made me think: is my mobile company doing this? Then I realised that of course they are, they would be stupid not to.


Using sequence statistics to fight advanced persistent threats – Ted Dunning – MapR

Great talk by an excellent speaker, highly recommend watching the video. Real world examples of large hacking attacks on businesses and insight into how large companies manage those threats.

It was about using very simple counting techniques with massive volumes of data and variables, comparing how often certain conditions (such as header values/orders, request timings etc.) occur together. Using these you can identify patterns of how normal requests look and detect the anomalous patterns used by attackers. This approach is simple and works “well enough” to detect attackers, who cannot know the internals of your servers in advance in order to mask themselves.

If you have an interest in security take a look at the talk.

Ted Dunning is a rock star, notice the abnormal co-occurrence of female audience members in the front row.

https://en.wikipedia.org/wiki/Likelihood-ratio_test

Keynotes

Videos for the keynotes aren’t published but I thought I should recognise some of the really interesting ones.

Data is beautiful – David McCandless

A standout talk from David McCandless with a focus on how to use visualisations to show complex relationships and share insights: data visualisation as a language. Lots of great visual examples, showing the scale of costs of recent high-profile events, a timeline of ‘fear’ stories in the press, and the most common breakup times from scraping Facebook.

A very interesting talk, and I’m going to look up more of his work. Check out his visualisation Twitter account for examples:

https://twitter.com/infobeautiful

Technoethics – Emer Coleman – Irish gov

A talk about how big data is shaping society and how we should be considering the ethics of the software and services we are creating, both in private companies and in government. It mentioned the UK’s snoopers’ charter, Facebook’s experiment in manipulating users’ emotions, and Google’s ability to skew election results.

I do think we should consider the ethical impact of our work more in software development, trying to do more good and less ignoring the negative effects.

http://siliconangle.com/blog/2016/04/14/rise-of-the-robots-finding-a-place-for-people-in-the-business-models-of-the-future-hs16dublin/