Metrics Metrics Everywhere Coda Hale

This is Metrics, Metrics Everywhere, and today I'm going to talk to you about how to make better decisions by using numbers. My name is Coda. You've probably seen me on Twitter yelling about things that I don't like, or maybe on GitHub working on things that I do like. I'm the infrastructure architect at Yammer.com, which is the enterprise social network, and a big part of what I do at Yammer is write code. But that's not actually my job; that's not actually why I take a paycheck. The reason why I take a paycheck is because I write code that then generates business value.

Immediately this should raise the question in all of your minds: what the hell is business value? I thought this was going to be about numbers, and here I am standing in front of you spouting B-school. It turns out that we as software engineers are actually intimately acquainted with business value, because we're so involved in construction and in creating new things. We know what business value is, and we see it every day as we work. Business value means shipping a new feature that the users absolutely adore. It means going back to an old feature that never actually really worked and really making it pop this time. It means writing code with fewer bugs, which means fewer support tickets, which means less time spent in JIRA or, probably here, Pivotal Tracker. It means not pissing our users off with a slow site, or an ugly site, or even a pretty site, because it really depends on your users. It means refactoring your code and having well-factored code to make future changes easier. It means adding a unit test before you fix that bug so you'll never actually have to fix it again. All of these little moments that we experience as software engineers, that's business value, because business value is anything which makes people more likely to give us money. And because we like money, we want to generate more business value. Now, as
software engineers, what this means is we need to make better decisions about our code, and we can't do this unless we realize a couple of really important things. The first is that our code generates business value when it runs, not when we write it. People don't give us money to take the code and put it up on the wall like a trophy. People don't give us money to take the code and read it late at night in bed. They take the code, they run the code, and they give us money for that, because of the functionality our code provides when it runs. Which means that in order to make better decisions about our code, we need to know what our code does when it runs, and we can't do this unless we actually measure it.

You should be asking: why measure it? I know what my code does; I wrote it. The reason why we need to measure it is because of a basic, fundamental truth about reality, and that's that the map is not the territory. This is the famous dictum of the Polish philosopher Alfred Korzybski. He said a bunch of really crazy things, but this one he's actually right about. General semantics I'm not necessarily on board with; this, however, is true. In the same way that a map of San Francisco is not actually the city of San Francisco, as anyone who's tried to tell someone how to drive down Market Street knows, the way that we talk about something isn't actually the way that thing is, and the thing that we think of is not actually the thing in and of itself. It's trite, but our perception is not actually reality. And because both our understanding of what's going on and what's actually going on, both the map and the territory, are in a constant state of flux, there's always a gap between what we know and what actually is, and we need to mind the gap.

For software engineers this matters a lot, because we have a mental model of what our code does. This is what we use when we write our code and work with our code, when we track down that bug, when we predict the effects that a
particular change will make. But it's a mental model; it's not actually the code itself. And a mental model is often wrong, and very much like someone attempting to drive down Market Street, when our mental model is wrong we experience this as confusion. We say, "This code can't possibly work," and then it works. Gotta mind the gap. We say, "This code can't possibly fail," and then it proceeds to blow up. Gotta mind the gap.

Minding the gap in our profession especially is hard. There's no physical correlate for what we do. We don't have an off-kilter chair to remind us of a bad carpenter. We spend so much time inside of our heads that it's very easy to mistake the inside of our heads for reality, to mistake the map for the territory.

For example, which of these two code samples is faster: items.sort_by on name, or items.sort comparing a.name to b.name? And I know, especially here, that there's at least one person in the audience who's thinking: well, the first one is an example of a Schwartzian transform, so despite the fact that the complexity of the underlying sort algorithm is the same, the lambda in the first one is evaluated N times, whereas the lambda in the second one is evaluated N log N times, which means that despite the fact that the first one uses more space, it's actually faster in terms of runtime. So which is it? Which one's actually faster? We don't know. We don't know which one is faster because we're working with our mental model. And in this case, the first one got monkey-patched to have a 100-second sleep in it, so sort_by is probably not faster here. And unfortunately sort is no better; it actually just blows up. So it's not just that we don't know in this particular case; it's that we can't know until we measure.

For software engineers, this affects how we make decisions. It affects how we spend our time. It affects how we allocate the resources that are given to us by our employer or, you know, by the sweat off my
brow if we're self-employed. For example: your boss, your manager, comes in. She says, "The application is slow. This page takes 500 milliseconds to run. That's unacceptable. Fix it." And so you crack open that bit of code in an editor, you look it over, and you play the game that we all hate to play: Find the Bottleneck. You look at the code, you think it over, and you think: it could be an SQL query; we may need to hint an index, we may need to redo this. It could be template rendering, because we just DRYed up a lot of templates, and in that refactor we may have introduced a performance regression. Or it could be session storage, because I just read this really cool blog post about memcached and slab sizes. So which is it? Which of these do you work on? Which of these do you spend your time on? Which of these do you allocate resources towards? We don't know, and that's what makes it such a frustrating game.

So let's play it again, this time with a little twist, and I want you guys to pay attention to how you're thinking about this problem. We're going to play Find the Bottleneck 2.0, and I'm just going to straight-up tell you that the SQL query took 53 milliseconds, the template rendering took 1 millisecond, and session storage took 315 milliseconds. Where's the bottleneck? It's right there. Did you see what happened when we added that information? When we added that extra layer of data on top of what we were doing, all of the confusion that we had vanished, and as a result we made a better decision. We made a better decision because we improved our mental model by measuring what our code does. The map's not the territory, and it will never be the territory, but we can always have a more accurate map. We can always have a map which is more faithful to the territory. We can always improve our mental model. That's critical, because we use our mental model to decide what to do, and naturally a better mental
model makes us better at deciding what to do. Specifically, it makes us better at generating business value. So the gist of this is that measuring makes your decisions better. It removes that confusion, removes that frustration, but only if you're measuring the right thing. And this is really important, because if I had told you that the session storage had taken 3 milliseconds instead of 300, you would have worked on the SQL query. You would have wasted your time; you would have wasted your resources. So it's critically important that we're measuring the right thing. We need to measure our code where it actually matters. We need to measure our code in the wild, as it is actually generating this business value that we care about. Specifically, we need to measure our code in the magical land called production. Not on your laptop, not on a test cluster, not on EC2 unless we're actually running on EC2. We need to be continuously measuring our code in production.

At Yammer we do this with a library I wrote, very originally called Metrics. It's written in Java, it has a Scala facade and integration points for a bunch of different things, it's MIT licensed, it's available on GitHub, and it gives us a toolkit of five different ways of measuring what our code is doing in production: gauges, counters, meters, histograms, and timers. Each one of these is associated with a class, which is where what it measures lives, and has a name, because random numbers are really hard to figure out at 3:00 a.m.
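
A minimal sketch of the "class plus name" idea may help make it concrete. This is not the Metrics library's API: the talk's library is Java with a Scala facade, and the Registry, Counter, gauge, and snapshot names here are hypothetical, written in Python only to keep the example short. It assumes two of the simpler metric kinds, a counter (a value you increment and decrement) and a gauge (a callback read on demand), and shows how labelling each value with its owning class and a name makes a raw number readable.

```python
from typing import Callable, Dict, Tuple, Union

Number = Union[int, float]


class Counter:
    """A value that is incremented and decremented explicitly."""

    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        self.count += n

    def dec(self, n: int = 1) -> None:
        self.count -= n


class Registry:
    """Hypothetical registry keyed by (owning class name, metric name)."""

    def __init__(self) -> None:
        self._counters: Dict[Tuple[str, str], Counter] = {}
        self._gauges: Dict[Tuple[str, str], Callable[[], Number]] = {}

    def counter(self, owner: type, name: str) -> Counter:
        # Asking twice for the same (class, name) returns the same counter.
        return self._counters.setdefault((owner.__name__, name), Counter())

    def gauge(self, owner: type, name: str, read: Callable[[], Number]) -> None:
        # A gauge stores only the callback; the value is read on demand.
        self._gauges[(owner.__name__, name)] = read

    def snapshot(self) -> Dict[str, Number]:
        """Read every metric once, labelled as 'Class.name'."""
        values: Dict[str, Number] = {}
        for (owner, name), counter in self._counters.items():
            values[f"{owner}.{name}"] = counter.count
        for (owner, name), read in self._gauges.items():
            values[f"{owner}.{name}"] = read()
        return values
```

With a registry like this, a hypothetical AutocompleteService class could register a "cities" gauge and a "connections" counter, and a snapshot would report them as AutocompleteService.cities and AutocompleteService.connections rather than as two unlabelled numbers.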
And so I'm going to walk you guys through this and explain how you might use it in your application to improve your decisions. Let's walk through an example application. Let's say we're building an autocomplete service for city names. Real simple web service: you say "complete San Fran" and it says "San Francisco." And so as we are gathering requirements, as we are writing tests and code, as we're putting together the staging environment, as we're putting together the production environment, as we're maintaining this over time, we need to be asking ourselves two questions. One is: what does this code do that affects its business value? And then: how can we measure that? If you've got something you can't measure with one of these, I would like to talk to you. So let's talk about what our theoretical autocompleter does that actually affects its business value, and how we can measure that.

Let's start with gauges. A gauge is an instantaneous value of something at a particular point in time. So what is a value of this autocompleter that actually affects its business value? Well, how about the number of cities? Because if your autocompleter can only complete three cities, then it's really only useful in a post-apocalyptic scenario. One of them is Bartertown; I'm not sure what the other two are in those movies. On the other side of the spectrum, if you can complete three billion cities, then things are seriously interesting but probably broken. So we can measure this with a gauge like this: we give it a name, "cities", and it has a callback associated with it, where we return the number of cities in our list or prefix tree or whatever we have. And with this gauge we can say the service has 589 cities registered. So that's what you get with a gauge.

Next up are counters. These are pretty much exactly what you would expect: an incrementing and decrementing value. For a web service, one of the things that you might care about is the
number of open connections. You don't actually have those connections lined up in a big list, but you can count them. So we can create a counter, we give it a name, "connections", and when someone connects we increment it, and when someone disconnects we decrement it. If we have this counter, we can say there are 584 active sessions on that server. Pretty simple.

Next up are meters, and this is where things actually start to get fun. A meter is the average rate of events over a period of time, and because our autocompleter is a web service, the classic example of this is the number of requests per second that the autocompleter is handling at any given point in time. You can measure that with a meter: create one with the name "requests", give it a rate unit, meaning requests per second, and then when a request happens we mark the occurrence of that event with the meter.

Usually when people talk about rates, when people talk about things per second, they're talking about the mean rate, which is the number of events divided by the elapsed time it took those events to occur. So we say miles per hour: miles driven divided by the number of hours it took to drive that distance. And we can take that and map it onto what our autocompleter is doing. So here is the number of requests being processed at each particular instant, which isn't actually a thing, but we'll assume it is, at each instant of time. You can see it goes up and down like a regular process, it's normally distributed, and there are two inflection points, right here and right here, where the rate changes. It looks like it's being used more. And if we graph the mean rate against this, it looks like this. You see there's an initial instability, and then it regresses to the mean, and then at the first inflection point it kicks up slightly, and at the second inflection point it kicks up even less, and at the end we're actually presented with this weird gap
between what we think is happening and what actually is happening, and we know that we need to mind the gap. So the mean rate is not actually the process. Specifically, when it comes to rates, we care very much about a sense of recency. We care about what the process is doing right now, not some kind of historical trivia. The mean rate provides us with the rate over the entire period of the service's lifetime, which is not what we want. It's not useful.

So to actually come up with a rate which makes a lot more sense, which maps much more closely to what we care about, we turn to math. Specifically, we turn to exponentially weighted moving averages. These have been used in industrial processes for decades, and they've been kicking around the guts of Unix for easily 40 years. It looks a little something like this, and you don't actually need to memorize that, but the one important thing is that there's a constant, alpha, which is the smoothing factor. An exponentially weighted moving average that has a high smoothing factor is very insensitive to variance in the underlying data; a low smoothing factor means that it's very sensitive to variance in the underlying data.

So if we go back to what our process is doing, we can chart an exponentially weighted moving average over this. This one has a very low smoothing factor, so it's banging up and down a whole lot. You can see at the two inflection points that it actually changes too when what the process is doing changes. And so we can increase the smoothing factor more and more, and you can see that this red line here is actually a really faithful metric of what our process is doing over time. The meters in Metrics calculate a 1-minute rate, a 5-minute rate, and a 15-minute rate using literally the exact same formula that the Unix load average does. And so if we have a meter on our autocomplete service, we can say we went from 3,000 requests a second to less than 500 requests a second, and you can see how
recency really does matter. That's what you get with a meter.

Next up are histograms, and these are by far my favorite. This is where things get really fun. A histogram measures the statistical distribution of values in a stream of data. So what's the stream of data that this service is processing or generating where we care about the statistical distribution of its contents? Well, how about the number of cities returned? If our autocompleter returns mostly zero results, that's not very useful. If it returns mostly 300 results, again, there are just not that many places named San something. So this matters very much as far as business value goes. It's something that we potentially deeply care about. So we measure that with a histogram: we give it the name "response-sizes", and then when we have a response, we update the histogram with the number of cities in that particular response. The histogram measures the statistical distribution of those values, and it has the usual suspects like minimum, maximum, mean, and standard deviation. But the thing about these is that the mean, specifically, is a really good measure of centrality for normally distributed data. For data which is not normally distributed, which is to say the vast majority of data that we as software engineers see, it is not useful at all. How many times have you done something and you've got the mean and the standard deviation, and the standard deviation is several times larger than the mean? All that's telling you is that you've got two numbers which are useless.

So what we really want, as a general-purpose tool, is quantiles. We want the median, we want the 75th percentile, the 95th percentile, the 98th, the 99th, and then, because I'm a bit of a perfectionist, the 99.9th percentile. But the way that people usually calculate these is to keep a giant sorted list of all values, and then to get the 75th percentile you go 75 percent of the way in, and the value there is the
75th percentile. We can't possibly keep all of these values, especially for the high-performance services like we build at Yammer. If you have a service which is doing a thousand requests a second, and for each one of those requests you're measuring a thousand different things, and you do that for a day, you're going to end up with about 86 billion values, and for 64-bit integers that's about 640 gigs of data. Now, there are clever things you can do with, you know, delta compression and variable-length encoding, but it's just not going to happen. We need a better tool, and for that we naturally turn to math.

We use a technique called reservoir sampling, which is that we keep a statistically representative sample of the stream of data, and instead of attempting to perform the quantile measurement over the entire stream of data, we make a measurement on a small, around thousand-element sample, and the results are statistically representative of the underlying data. Really cool technique, not nearly as widely known as it should be. The canonical way of doing this (I'll take questions afterwards) is Vitter's Algorithm R. There's a reference right there. I can't really get into it here.

We can take a data set from our hypothetical service. This is the number of cities it's returning for each request over time, and because it's a setup, there are two inflection points, about two thirds of the way in, where we start returning more cities. We don't know why; did the search stuff change? Whatever. This is the behavior of the process. And so using Vitter's Algorithm R, using this algorithm for reservoir sampling, we can calculate the median, the 75th percentile, and the 95th percentile. You can see at the first inflection point they don't actually change, and at the second inflection point they barely change. By the time we get to the end, you can see just visually: there should be 95% of these
data points underneath this line. It's the 95th percentile; that's what it does. But that's more than 5% above it; that's a lot more than 5% right there. There's actually this big gap, and by now you should know we need to mind the gap. The reason why there's that gap is because Vitter's Algorithm R produces uniform samples. These are samples which are statistically representative of the data that the service generates over its entire lifetime. But again, we care about recency. We care about what it's doing right now. So how do we calculate quantiles with a sense of recency? Well, we turn to even more math. And I realize that I'm hitting you with a lot of math, so here's a picture of my kittens. They're not this small anymore, and they like to play hockey at 4 a.m., but they're very cute when they're asleep. So, deep breath.

We use a really cool technique called forward decay priority sampling, and there's a reference right there. A bit of a one-two punch: kittens, math. The reference is right there, and the slides are online. I don't have enough time to talk about it really in depth, but it's an even cooler technique. The gist of it is that it allows us to maintain a statistically representative sample of the last 5 minutes of data, or X minutes; we chose 5 because that's the kind of granularity that we care about. Again, we don't have to store all of the data, but we end up with a statistically representative sample with an actual sense of recency too. And so if we go back to our data set, using forward decay priority sampling, and calculate the median, 75th percentile, and 95th percentile, you can just immediately see the change. At the first inflection point, all of a sudden this kicks up. At the second inflection point, all of a sudden this kicks up. And by the end, you can see about 95% of those X's are underneath that red line. If you put them back to back it's even more pronounced. Which one of these does a better job of measuring what the process is doing over time? The biased one. And so if we have a histogram of the
number of cities in our autocomplete service, we can say something like: 95% of our autocomplete results return three cities or fewer. Is this good? Is this bad? Either way, we know that this is true. That's what you get with a histogram.

Next up are timers, and timers are basically a compound metric. They're a histogram of durations and a meter of calls. So we're already getting requests per second, but we don't just care about throughput; we also care about latency. We want to know the number of milliseconds that it takes us to respond, and specifically we want to know the distribution of that data. We can do that with a timer like this: we give it a name, "requests", with which we replace the meter, and we give it a duration unit of milliseconds, so we're measuring the time that each request takes in milliseconds. We give it the same rate unit, seconds, because we're measuring requests that happen per second. And then we pass the timer a callback. It measures how long that callback takes to execute, updates the histogram, and marks the occurrence of that event in the meter. And so if we have a timer on our service, we can say: at around 2,000 requests a second, our 99th percentile latency jumps from 13 milliseconds to 453 milliseconds. Hugely, hugely important from the perspective of someone who actually cares what their code does once it hits the server. So that's what you get with a timer.

So we have these five tools: gauges, counters, meters, histograms, and timers. Cool. Now what? What do you do? Well, the first thing you do is take this toolkit, go through your code, and instrument it. A rule of thumb that we have is: if it could affect the business value of your code, add a metric. If it affects the business value, start measuring it. To give you a baseline for comparison, most of our services, which are very small, composed code bases, export about 40 to 50 metrics. Each one of those contains anywhere from one to ten different measurements, so some of our larger and
more complex services are actually exporting about 2,000 values about what they're doing at any particular time.

Okay, so you go through and you instrument your code. Now you collect that data. If you're measuring these things just so that, you know, when you're curious you can come by and see how your pet rock is doing, that's not enough. You need to actually have an automated process for going around and collecting it. So the Metrics library has a servlet which exposes the values of a service's metrics as a JSON object over HTTP, which is great because everyone speaks those two things, and we have a set of scripts which come by every minute and collect this data from all of our services.

You've collected this data; now you monitor it. You can do this with Nagios or Zabbix or whatever you want, but the critical thing is that if it affects the business value, someone gets their ass woken up. So if your 99th percentile latency jumps at 3:00 in the morning, you get a page, or someone who can actively deal with that gets a page.

Once you've got monitoring set up, you then start to aggregate this data, and you can do that with Ganglia or Graphite or Cacti or Munin. There's a whole bunch of tools; they're all horrible in slightly different ways, but they're there, and you don't have to build it yourself yet. This allows you to place current values in historical context, which means that you can then see long-term patterns in the behavior of your code.

And this is really where the rubber meets the road, because if you get this process set up, it allows you to go faster. With this process we can shorten our decision-making cycles. This is critical. Our decision-making cycle looks a little something like this: we observe something about the world; we take that observation and we orient it within all of the other knowledge that we have, within our strategies and desires and tactics and fears, all the various things that
kind of compose who we are, who we are as organizations. This allows us to extract meaning from the observations, which means that we can decide what to do, and having made a decision, we can then act. This should be familiar: it's the OODA loop, a piece of military theory that Colonel John Boyd, who was a pilot in the Korean War, came up with. It applies to fighter pilots; it also applies to software engineers. There's less shooting and flames, but it's still the same process.

We make an observation about the world: what is the 99th percentile latency of our autocomplete service right now? It's around 500 milliseconds. We take this observation and we orient it within all of the other things we know about software, about this software and other software. How does this compare to other parts of our system, both currently and historically? Well, it's 500 milliseconds now, and it was 50 milliseconds last night. I think something may have changed; it's way slower. Now we have meaning about this observation, and now we can decide what to do. Now we can decide where to allocate resources: should we make it faster, or should we add a new feature? Well, in this particular case a 10x slowdown is probably bad, so we should probably have someone make it faster. Having made our decision, we can then act. We get to write some code; specifically, probably remove that sleep.

This is an iterative process. This is constantly happening. We observe the consequences of our actions and orient that with what we know about how we act, to make decisions about how we change tactics or strategies, and then we can actually make those changes. So we have this decision-making cycle, and if we iterate faster, we will win. A shorter decision-making cycle is a huge, huge cultural advantage. It's a huge cognitive advantage. It's a huge competitive advantage. If your decision-making cycle is shorter, you will ship fewer bugs, because your mistakes will not live as
long. You will ship more features, because you won't be working on things that don't matter. This means happier users, and happier users should mean more money.

So I've covered a lot of ground here, and I want to sum it all up. First: we might write code, and chances in this room are really high that we probably do write code, but we have to generate business value. In order to know how well our code is generating business value, we need metrics about that code. We need to know what that code is doing in production. So at Yammer we've got this toolkit; it provides gauges, counters, meters, histograms, and timers. Instrument your code; know what your code is doing. Monitor these values for current problems, so that you can respond to those problems in real time. Aggregate these values for historical perspective, so that you can see long-term patterns in what your code is doing in production. Because the map's not the territory, you can always have a better map. We can always improve our mental model of our code, because we have to mind the gap. Minding the gap means that our OODA cycle gets faster, our decision-making cycle gets faster, and we win.

So if you're on the JVM right now, you can use this library that I wrote, if you're using Java, Scala, JRuby, Clojure, Fortress, Mirah... did I miss anything? There's a lot of them. Groovy, sorry. It's MIT licensed, it's available on GitHub, and in about the time it takes you to get a JAR you can start using it. If you're not on the JVM, you can build this. You can actually build this. If you're using Ruby, there's the ruby-metrics gem that my friend John is working on. If you're using JavaScript, my coworker Mike Ihbe has a JavaScript library which does a lot of this. For Erlang, Joe Williams from Boundary has Folsom, which has a lot of these same tools. And if you're not on any of those platforms, you're capable of actually implementing all the stuff that I've talked about. All the algorithms, all the data structures, all the techniques, all these things are completely and totally within
your grasp. It's not just that you can build this, but please, please, please build this, because I think that you can benefit from having these tools, from knowing what your code is doing in production, always, as it's generating business value. I think that if you have this, you can make better decisions by using numbers. Thanks.

The Ruby gem is called ruby-metrics. I don't know what the JavaScript library is called; it's on GitHub under Mike Ihbe's account [spelling unclear]. Questions? Yeah.

A general strategy question. Your library is sort of fundamentally about having the pieces of software that are doing the work be interested in the kinds of data you want about what they're doing, right? I mean, they know: oh, I want a moving average of this, oh, I want a histogram of that. And there's sort of another strategy of, well, the software that's doing the work should just have detailed event logging, send it all somewhere, and then post-process it. [unclear]

That's totally awesome, that's totally possible, and for some things that's a very good strategy. I know of others, StatsD maybe, where you basically send UDP to an aggregator service and then query statistics from that service. The reason why we have an in-process approach is because at Yammer, with a lot of our services, if we were to start [unclear] we would have two scaling problems: you have the thing we're trying to do, and then you have the consequence of logging massive amounts of data, either banging it over into an asynchronous processing pipeline or just this death by trickling a little bit of data out of a socket. So we really wanted to push that aggregation up into the application so that we could aggregate there.

Thanks. How would you go about using SQL technologies to actually track all this data, so that when someone makes a request to a server
a table in a database gets updated, and especially [unclear]?

So that's all fine and good, but then you have to find a database technology which provides you with the relational model and which enables hundreds of thousands of writes a second. Again, this is the same thing: here we already have one scaling problem, and we're introducing another. That said, I think that there's a tremendous amount of value that the relational model provides in an analytical role. Currently at Yammer our analytics group is really, really good. [unclear] We have these excellent pipelines, so we really do have a preposterous amount of insight into what our users do. [unclear] So I really want to use their data warehouse to analyze my numbers. But I think that at some point you have to change your granularity, because there's this massive amount. If you're talking about a service which is deployed on 50 machines, and the raw data is many megabytes per second, and then you have to put that data somewhere, I would instead much rather figure out what the actual data points are and then ship an aggregate. [unclear]

The way your services have anywhere between [unclear] 3,000 data points that they're exporting: in terms of just writing your program as you go, are there strategies that you've developed for deciding, as you write, what's important to measure, or knowing what you could possibly be missing? And how do you avoid it basically at some point being too much of a hassle? [unclear]

So one of the reasons why we decided to go with manual instrumentation, as opposed to the more "just profile everything all the time" approach, is because profiling is very rarely at the right granularity. Okay, this one method does this, but that's not actually really what you care about. What you
care about is only this kind of larger aggregation of various behaviors. So as far as how to balance the instrumentation [unclear], historically the way it's worked is we've written something, it does something unexpected, and we have to bail ourselves out. We realize that we have this gigantic blind spot, and so we add instrumentation to it. And since we started using this approach, we've kind of developed an intuitive feel for "we're going to want to know about this; this is going to come back to bite us in the ass." But it's still one of those things [unclear], but even that is still really useful, because it means that this is something you [unclear]. Right, so we have this mental model at play, like, what about your application is slow? And everyone's probably got this kind of [unclear] of, like, "Well, it's probably this [unclear]," but it turns out that that actually tends to be incredibly bad intuition, because there are things in our application where I was like, "Yeah, this is totally a problem, I can see so many optimizations, I can do this and [unclear] that and blah blah blah," and it's like, that takes about two microseconds [unclear], like, move on. This other thing takes orders of magnitude more time and has more variance [unclear]. So I think it's just shaken things up.

Do you test your metrics collection in your code?

To some degree, yes. Not religiously; we are not a test-driven development shop over here. [unclear] we surprised ourselves.

So are you, uh, measuring the cost of your measuring?

We have, jokes aside. [unclear] Yeah, so the amount of overhead is trivial. So hooking up a timer just over
to [unclear], a timer is going to be the most heavyweight thing that we have. It takes a small amount of memory: at any point in time we have [unclear] samples [unclear]. Yep. Oh yeah, the amount of overhead that a timer adds, after the JIT inlines everything, is about a microsecond. So there are definitely some hot loops where you wouldn't want to do it, but there's been a lot of work put into making measurement [unclear].

[unclear] Is the aggregation done within the request, or not within the request?

Something like a meter, what it does is basically increment an atomic variable in memory, which is real cheap, and there are background threads that go in and [unclear] that value.

[unclear] the aggregation? With 50 web servers all recording requests per second, do you just have, like, [unclear] that shows all that stuff together?

Currently I cry quietly at night, I mean, with a bottle of scotch. So this is one of the reasons why we want to use various tools, because they have that [unclear], whereas Ganglia is just like, "Here are these [unclear], examine them for days." But some metrics are really easily combined. For example, rates: you can stack those on top of each other. And so we do have some dashboards, but there's no general-purpose tool [unclear]. Aggregating quantiles is a little bit different, a little more complicated, in that you can't just kind of average the two. And so what we have is the ability to export every histogram's and timer's sample, which would allow downstream aggregation, but at this point our primary means of engagement are Ganglia and monitoring. So if the 99th percentile of [unclear] services goes above a hundred milliseconds, alarm bells go off. [unclear]

So you have well-instrumented code. Aside from that being useful for monitoring performance while in production, it seems like it would be interesting to set up something like the
CI server always running performance tests on [unclear] with whatever data it needs?

Yeah, so the cool thing about Metrics is that it does double as a very sweet benchmarking library. So when we're curious about the amount of time that something takes, Metrics also has a console reporter, which every N seconds or what-have-you prints out all of the metrics to standard out. So it would not be impossible to [unclear] saying: take this code, make sure it's exposing these metrics correctly, run it through a canned workload, and send everybody an email if you get something that's [unclear]. That's cool. [unclear] in part because we find it very, very, very hard to simulate the sorts of failure scenarios that we've seen in production. There's just something about unleashing a couple million people on something that really causes things to come out of the woodwork.

[unclear] Could you go maybe one level deeper on why you use reservoir sampling? It's a random sample, and [unclear].

Sure. So, right, I mean, I think that's effectively what priority sampling is: it's just basically, how do you keep that random sample recent? The other drawback with a random sample is that it doesn't actually decay. If things stop, it stays at exactly the same value; it doesn't change. If things slow down dramatically, [unclear] the sample [unclear] skewed results [unclear].

[unclear] with reservoir sampling?

Specifically with forward decay, very simply, the probability that an element replaces any other element in the sample is an exponential function of time from the landmark value, and so basically it gets more and more and more likely, relative to old values, that new values
will replace it. So if all of a sudden there's this weird five-minute gap where [unclear], that next value that's like, "Well, this took 17 minutes to complete," that is definitely [unclear].

Yeah, I mean, we have not at all looked into the CI thing, in part because I'd love to have another production cluster and then a gigantic set of drivers and actually simulate failure scenarios we've seen in production. For our stuff, the high-value elements have not been "well, this thing's slightly slower" or "this thing's kind of exceptionally slow"; it's like, "Oh, you throw hundreds of thousands of users at this and all of a sudden HttpClient 3.1, because of its desire to share [unclear] between all threads, blocks [unclear]." That's the sort of thing you see in production that's actually going to cost us time.

[unclear] All of these values that we are calculating get collected by scripts. The ones that are high priority get sent to Zabbix for monitoring; everything else goes to Ganglia. And so we have these, like, really long Ganglia pages that just have all the things, right, [unclear] happening and