I love the Etsy notion of “measure anything, measure everything”

4/10/2015 1:06:30 PM

Since my move back to consulting I have been involved with a bunch of different big and small companies, each with several big and small projects.  The one thing I almost always see missing is the ability to measure the success of a project in any of the numerous ways success can be gauged.  I have been a fan of the notion of “Measure anything, measure everything” since reading it in February of 2011.

This article basically suggests that you should have the ability to measure anything in your “system”.  And by system I don’t just mean tracking aspects of your code.  There are a great many companies that seem to think that you simply write some code, deploy the world, and cross your fingers.  I have worked at many companies of this style, and it is no secret that they spend a lot of time “reacting” to customer complaints rather than “proactively” identifying and fixing issues iteratively…and hopefully not in production.

There are a few companies I have worked with that have taken this one step further.  They attempt to remove the random behavior of their applications by creating performance baselines.  This essentially means running a load test in some form or another to see what works, what doesn’t, response times, the effect on the underlying system, etc.  With all this timing data in hand you have a baseline.  This means that you can now safely add features to your application, rerun the performance tests, and identify differences between your baseline and the new measurements.  This is considerably better!

But it could be so much better!  I am now working with a few customers that have sizeable applications, both in terms of complexity and load.  In both cases the systems are distributed, multi-runtime systems.  At a minimum they have web applications, API applications, a messaging bus, various data stores, many third-party dependencies, various tablet clients, etc.  There are a lot of moving pieces.  A simple load test won’t give us a complete picture of subtle changes in the system.  Also, with this many things to monitor, testing once before deploying won’t cut it.  You really need to measure the health of the system, in many different ways, continuously.

When you ask a developer how they get information about their system, the default answer is usually a logging strategy.  Generally this is something like log4net writing to a file or a database.  But when you have 100 nodes all doing different things, it is generally difficult to figure out what you should be looking at by digging through a log dump.  I was in a brown bag at Clear Measure (we do those weekly) with one of my favorite logging vendors, LogEntries.  My friend Trevor Parsons was telling us about all the new features that their centralized logging solution provides.  I was quite impressed at the rate they are adding awesome new features.

LogEntries

LogEntries is essentially (at least in my world) an appender for log4net.  This means writing logging code is the same as it always has been.  Only now we redirect that log data to a central hosted location where we can easily mine it.  Most people might jump in and say that sounds like SumoLogic or LogStash or any number of other logging aggregators.  On the surface this is indeed true.  But what I like about LogEntries is that you can start to shape your log data into part of your APM (application performance monitoring) story.  With LogEntries you can write tags for various patterns in your log data.  In LogEntries the log data is parsed and processed as it comes in, in real time, not as some post-process.  Meaning when data comes in that says the order-taking service has stopped processing (or some other contrived, really important business feature in your system), you can see it immediately.
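To make that concrete, here is roughly what the application side looks like.  This is plain log4net; the OrderService class and its message are invented for illustration, and the redirect to a hosted appender happens entirely in configuration, not in code:

using log4net;
using log4net.Config;

public class OrderService
{
    // Standard log4net usage; nothing vendor-specific in the code itself.
    private static readonly ILog Log = LogManager.GetLogger(typeof(OrderService));

    public void ProcessOrder(int orderId)
    {
        // The appender configured in app.config decides where this line ends up:
        // a file, a database, or a hosted service such as LogEntries.
        Log.InfoFormat("order-service processed order {0}", orderId);
    }
}

public static class Program
{
    public static void Main()
    {
        // Reads the log4net section from app.config, including whichever
        // appender (file, database, or hosted) is wired up there.
        XmlConfigurator.Configure();
        new OrderService().ProcessOrder(42);
    }
}

Swapping from a local file appender to a hosted one is a one-element change in the config file, which is exactly why the calling code never has to change.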

Of course, tagging is only worthwhile if you are looking at the log stream.  We don’t usually do that.  So how do we take advantage of this real-time processing and tagging?  We can build reports around the log data.  Huge dashboards full of information can be populated based on log processing.  In the case that something important happens, you can visualize the occurrence on their dashboards.  Again, this is only valuable if you have an information radiator in front of you…you have that, right?  We use GeckoBoard for this, if you were wondering.  So how do we take this to a passive style of monitoring?  LogEntries also has the ability to broadcast events.  This used to be just via email, which was good enough.  But now they have all sorts of other integrations, such as Slack and PagerDuty, which make this story even better.  These two tools have become very valuable to me.
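One habit that makes both tagging and alerting much easier: log a stable, machine-matchable marker for the business event rather than free-form prose, and point the tag pattern at that marker.  The marker token and class below are made up for illustration:

using log4net;

public class OrderIntakeMonitor
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(OrderIntakeMonitor));

    public void OnQueueStalled(string queueName, int pendingCount)
    {
        // A tag or alert rule can pattern-match on the exact token
        // ORDER_INTAKE_STOPPED and fan out to email, Slack, or PagerDuty.
        Log.ErrorFormat("ORDER_INTAKE_STOPPED queue={0} pending={1}", queueName, pendingCount);
    }
}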


NewRelic / APM

This is just logging data, though.  There is so much more to the monitoring story.  Since I mentioned APM above, you should also be looking at NewRelic!  Any good APM tool will allow you to visualize your system from the client (JavaScript, iOS, Android) all the way back to the most internal of systems (think databases).  These sorts of tools will tell you all about how long a process took.  They should allow you to visualize where a bottleneck is in your system.  And you should be able to drill down into the visual map to determine, with great granularity, exactly where to look for problems.  If you have slow queries, your APM tool should be able to pinpoint that you have a stored procedure call that isn’t performing, and when.

With this sort of data in hand you can head back to LogEntries to ferret out exactly what is going on.  But can you answer why this happened?  It was working an hour ago.  Did we do something to cause this new flare-up?  Hmm.  Enter StatsD and the notion of capturing metrics for everything you do.  Tracking business metrics in the same way that you track application code, subsystems, and processes is very important.  The couple of events that have historically bitten me in the butt are easy stories that most people can relate to.  Everybody has had a working system.  And shortly after launching new code, a feature started to act up in unexplainable ways.  Yes – you launched 3 new features…and 10 new bugs.  One of those bugs may have caused a database process to go wonky.  Or you may have killed your ability to accept orders.  Hopefully you have unit tests, integration tests, and functional tests to ensure that this type of error doesn’t find its way into production…but.

StatsD

How do you track when you last did a deployment?  Most people go talk to the guy with the deployment button at his fingertips.  However, if you work somewhere that doesn’t have automated deployments, this “button” may be a manual process.  This means that while the deployment guy may normally do 10 steps, today he did 9.  Missing that 10th step may have caused the issue.  And not having metrics around when you deployed, and specifically what was deployed, means it will take longer to figure out the issue.

StatsD is a metrics tool.  It only does a few very simple things.  It can say “this metric occurred”.  It can say “this metric took this long to complete”.  Or “this value is now X”.  It does this in a very passive manner over UDP, which means that if nothing is listening on the other side your code doesn’t fail.  It also means that there is nearly no performance impact to gathering this data from any part of your system or processes.  And you can “sample” the data in cases where you just need rough numbers – for example, only tracking every 10th time something occurs.  In C# you are just a NuGet package away from starting to use StatsD.
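The protocol underneath is simple enough that a minimal client fits on one page.  The sketch below speaks the standard StatsD UDP wire format (counters, timers, gauges, and sampled counters); a real NuGet client library wraps this same idea with batching and other niceties.  The class name and defaults are my own invention:

using System;
using System.Globalization;
using System.Net.Sockets;
using System.Text;

public class MiniStatsd : IDisposable
{
    private readonly UdpClient _udp;
    private readonly Random _rng = new Random();

    public MiniStatsd(string host, int port = 8125)
    {
        _udp = new UdpClient();
        _udp.Connect(host, port); // UDP "connect" just fixes the destination; nothing has to be listening.
    }

    // "this metric occurred": counter increment.
    public void Increment(string name) { Send(name + ":1|c"); }

    // "this metric took this long to complete": timer in milliseconds.
    public void Timing(string name, long milliseconds) { Send(name + ":" + milliseconds + "|ms"); }

    // "this value is now X": gauge.
    public void Gauge(string name, long value) { Send(name + ":" + value + "|g"); }

    // Sampled counter: only send a fraction of the time, and tell the server
    // the rate so it can scale the count back up (rate = 0.1 ~ every 10th occurrence).
    public void Increment(string name, double rate)
    {
        if (_rng.NextDouble() <= rate)
            Send(name + ":1|c|@" + rate.ToString(CultureInfo.InvariantCulture));
    }

    private void Send(string payload)
    {
        var bytes = Encoding.UTF8.GetBytes(payload);
        try { _udp.Send(bytes, bytes.Length); } // fire-and-forget
        catch (SocketException) { /* metrics must never take the application down */ }
    }

    public void Dispose() { _udp.Close(); }
}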

With StatsD you can wire metrics into your deployment process with a great deal of granularity.  Enough to see that “the deployment happened at this time on this day”.  And you can track that “step 1 happened, step 2 happened, step 4 happened, step 5 happened”.  Hey wait…where is step 3?  Like LogEntries, StatsD is only half of the equation.  Without the ability to get to and visualize the data, report on the data, or get alerts from the data, why capture it?
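For concreteness, instrumenting those deployment steps with the MiniStatsd sketch above might look like the following; the host, metric names, and build number are all invented:

// Hypothetical deployment script instrumentation.
using (var stats = new MiniStatsd("metrics.example.local"))
{
    stats.Increment("deploy.web.started");                 // "the deployment happened at this time"
    stats.Increment("deploy.web.step.1.backup_database");
    stats.Increment("deploy.web.step.2.copy_binaries");
    stats.Increment("deploy.web.step.4.warm_cache");       // step 3 never fires...
    stats.Increment("deploy.web.step.5.smoke_test");
    stats.Gauge("deploy.web.build_number", 1042);          // specifically what was deployed
}

A flat line where deploy.web.step.3 should spike answers “where is step 3?” at a glance.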

Graphite

Graphite is one of many reporting packages that work with StatsD repositories of data.  I have customers that don’t necessarily want to stand up these sorts of components in their environments.  For that reason I tend to stick to hosted solutions whenever possible.  They are usually easier to get up and running quickly.  And they tend not to require buy-in from all the parties needed to keep the tool running.  For this I have been tinkering with Hosted Graphite a lot lately.  It basically allows you to put your metrics and counters on a graph over time.
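Normally your StatsD daemon flushes aggregated metrics into Graphite for you, but it is worth seeing how simple the underlying feed is.  Carbon, Graphite’s ingestion service, accepts a plaintext “metric-path value unix-timestamp” line on TCP port 2003 (hosted providers typically also want an API key prefixed to the metric path).  A sketch, with a placeholder host and metric name:

using System;
using System.Globalization;
using System.IO;
using System.Net.Sockets;

public static class GraphiteSender
{
    // Carbon's plaintext protocol: "metric-path value unix-timestamp\n" on TCP 2003.
    public static void Send(string host, string metricPath, double value)
    {
        var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        long now = (long)(DateTime.UtcNow - epoch).TotalSeconds;

        using (var client = new TcpClient(host, 2003))
        using (var writer = new StreamWriter(client.GetStream()))
        {
            writer.Write(string.Format(CultureInfo.InvariantCulture,
                "{0} {1} {2}\n", metricPath, value, now));
        }
    }
}

// e.g. GraphiteSender.Send("carbon.example.com", "orders.api.response_ms", 182);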


Now you are ready

With all of this data in hand you are ready to answer almost any question about your system.  And if you are doing any form of continuous delivery, or even continuous deployment, when you find that someone has not measured enough data you can easily wire in new metrics, deploy, and start capturing more information about your system.  I can’t wait to tell you how we are using these principles to monitor the health of our projects, our engineering teams’ morale, and our clients’ happiness.
