Just back from Hadoop summit in Dublin, thought I would give a write up of the talks I went to and my impressions. All of the videos have been put up so it could help you decide what to spend time on.
Overall it was good, a nice spread of speakers covering highly technical topics, new products and business approaches. The key notes were good, some promotional talks by sponsors but balanced with some very good speakers covering interesting topics.
One impression I got was no one was trying to answer the question why people should try to use big data anymore, that has been accepted and now the topics have moved onto how to best use it. A lot of talks about security, governance and how to efficiently roll out big data analytics across organisations. Loads of new products to manage analytics workflows and simplify access for users for multiple resources. Organisational approaches, analytics as part of business strategy rather than cool new tools for individual projects.
One nitpick is they tried to push using a conference mobile app which required crazy permissions. No. I just wanted a schedule. A mobile first web site would have done the job and been more appropriate for the data conscious audience.
Enterprise data lake – Metadata and security – Hortonworks
Mentions of ‘Data lakes’ were all over the conference, as were security concerns about how to manage and govern data access when you start to roll out access across your organisation. This talk covered Hortonworks projects which are attempting to address these concerns, Apache Atlas and Ranger.
Altas is all about tagging data in your resources to allow you to classify them, e.g. put a ‘personal’ tag on columns in your Hive table which identify people, or an ‘expiry(1/1/2015)’ on tax data from 2012. Ranger is a security policy manager which uses the Altas tags and has plugins that you add to your resources to control access, e.g. only Finance users can access tax data and to enforce expiries.
You create policies to restrict and control who can do what to your resource data based on this metadata. This is an approach which scales and follows your data as it is used, rather than attempting to control access at each individual resource as data is ingested, which gets unmanageable as your data/resources grows, providing a single place to manage your policies. It also provides audits of access. Later talks also suggested using automation to detect and tag data based on content to avoid having to manually find it, such as identifiable or sensitive data.
Querying the IoT with streaming SQL
This talk was a bait and switch, not really about IoT but it still was interesting. It was really about streaming SQL, which the presenter thinks will become a popular way to query streaming data across multiple tools. I do agree with the idea, SQL is such a common querying language and most users would prefer not to learn tool specific query languages all the time.
The push for using streaming data is that your data is worth most when it is new and new data plus old is worth more. This means you should be trying to process your data as you get it, producing insights as quickly as possible. Streaming makes this possible.
He went into a lot of technical detail and examples of how you would use it as a super-set of SQL. Mentioned using Apache Calcite as a query optimiser to run these queries across your multiple data sources.
Advanced execution visualisation of Spark job – Hungarian Academy of Sciences
Talk from researchers who worked with local telcos to try and analyse mobile data. They won a Spark community award for their work in creating visualisations of Spark jobs to help find anomalies in data that cause jobs to finish slower.
They named this the ‘Bieber’ effect, based on the spike in tweets caused by Justin Bieber (and other celebrities). This spike can hurt job executiion if you are using default random bucket allocation in Spark based on hashes of keys, as suddenly a load of work needs to be aggregated across multiple nodes where it would be more efficient to partition it into specific ones closer to the data. The real example they found was when cell tower usage spiking due to a local Football match.
They’ve created the tools to view these spikes and test in advance using samples, and aim to create ways to dynamically allocate and partition tasks based on these spikes and improve the efficiency of their jobs.
Real world NoSQL schema design – MapR
Talk about how you can take advantage of the flexible schema in NoSQL DB’s to improve data read times, doing things like putting your data into the column names and keys.
Very useful for people doing analysis on large amounts of simple data or large de-normalised datasets.
TensorFlow – Large scale deep learning for intelligent computer systems – Google
Could you recognise a British Shorthair cat?
If not, you know less about cats than a server rack at Google. Expect that list of things to grow.
Good talk on how they are using machine learning at Google, using classified tagged image data to train models that can recognise objects in other images, including things like are the people in the picture smiling.
Talked about TensorFlow, their open source deep learning project. If you are interested in machine learning I’d take a look at the videos they have up.
Migrating Hundreds of Pipelines in Docker containers – Spotify
I like containers, so was looking forward to this.
Good talk, covered Spotify’s use of Big Data over the last 6 years, going from AWS hosted Hadoop as a service, running their own cluster and their current move to Google Compute using Docker with their own container orchestration solution, Helios.
They are now working a service, Styx (don’t search Spotify Styx, you’ll just get heavy metal), which will allow them to do “Execution as a service”. This is a very exciting idea, allowing users to define jobs to run along with the docker images to execute it. This is a great way to manage dependencies and resources for complex big data tasks, making it easier to do self-service for users with governance.
Hadoop helps deliver high quality, low cost healthcare services – Healtrix
After a load of talks mainly about ROI and getting revenue it was nice to hear a talk about trying to give something back and improve quality of life. The speaker grew up in a poor Indian village and had experience of poor access to healthcare.
His talk was about providing at risk people with healthcare sensors (for blood sugar/pressure etc.) that connect to common mobile devices and send sensor data to backend servers that analyse the data. This can be used as part of predictive and preventative care to reduce the cost of unplanned hospital visits. Using this, healthcare providers can monitor patients with Diabetes or heart conditions, vary their drug prescriptions or advise appointments without waiting for the standard time between appointments.
This is especially important in areas with poor health coverage and bad transport links, as the data can move a lot easier than the patient can see a doctor.
Apache Hadoop YARN and the Docker container runtime – Hortonworks
Nice to know about this but the talk was pretty dry unless you are really interested in YARN. YARN supports running in Docker containers and there are now some resources provided by new docker and YARN releases for how to manage security and networking.
It did show that it’s possible to run YARN in YARN (YARNception), which apparently has real world uses for testing new versions of YARN without updating your existing version of YARN. YARN.
Organising the data lake – Information governance in a Big Data world – Mike Ferguson
More coverage of governance and security when using big data in your organisation, mainly from a business view rather than technical. If you are interested in how to roll out access of data across your organisation while centralising control and governance you should watch this talk.
Using natural language processing on non-textual data with MLLib – Hortonworks
Good talk, mainly about using Word2Vec (a google algorithm for finding relationships between words) to analyse medical records and find links between diagnosis codes (US data). This can be used to find previously undocumented links between conditions to aid diagnosis or even predict undiagnosed conditions (note, not a doctor).
The approach could be used in many other contexts and seems very straightforward to apply.
How do you decide where you customer was – Turkcell
Slightly creepy talk (from implications, the speaker was very nice and genuine) about how Turkcell, a telco, is using mobile cell tower data to analyse their customers movements, currently to predict demand for growth and rollout LTE upgrades. But they are also using it to get extremely valuable data about movement and locations of customer demographics which they can provide to businesses like shopping centers.
From a technical point of view it was interesting and gave a good perspective on the challenges of processing very high volumes of data in the real world.
Made me think, is my mobile company doing this? Then I realise of course, they would be stupid not to.
Using sequence statistics to fight advanced persistent threats – Ted Dunning – MapR
Great talk by an excellent speaker, highly recommend watching the video. Real world examples of large hacking attacks on businesses and insight into how large companies manage those threats.
Was about using very simple counting techniques with mass volumes of data and variables, comparing how often certain conditions (such as header values/orders, request timings etc.) occur together. Using these you can identify patterns of how normal requests look and detect anomalous patterns used by attackers. This approach is simple, works “well enough” to detect attackers who cannot know in advance the internals of your servers to mask themselves.
If you have an interest in security take a look at the talk.
Ted Dunning is a rock star, notice the abnormal co-occurrence of female audience members in the front row.
Videos for the keynotes aren’t published but I thought I should recognise some of the really interesting ones.
Data is beautiful – David McCandless
Standout talk from David McCandless with a focus on how to use visualisations to show complex relationships and share insights, data visualisation as a language. Lots of great visual examples, showing scale of costs of recent high profile events, timeline of ‘fear’ stories in the press and most common breakup times from scaping facebook.
Very interesting talk and I’m going to look up more about his work. Check out his visualisation twitter account for examples:
Technoethics – Emer Coleman – Irish gov
Talk about how big data is shaping society and how we should be considering the ethics of the software and services we are creating, both in private companies and government. Mentioned the UK’s snoopers charter, Facebook’s experiment in manipulating users emotions and Google’s ability to skew election results.
I do think we should consider the ethical impact of our work more in software development, trying to do more good and less ignoring the negative effects.