Monitoring and Alerting
=======================
Why do we monitor? To check the metrics and figure out when something goes wrong
Operations and DevOps people are the ones who respond to alerts most often
Managers and developers are also interested in the metrics and in the trends the monitoring shows,
which help them understand how the application is being used, how many users there are and which features are used the most
Why are we collecting the data in the first place?
There is more than one answer:
Analysis of the data and trending issues
Anomaly detection
Capacity planning
Prediction
Making our systems highly available and resilient to failure
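To make the anomaly detection item concrete, here is a rough sketch of one very simple approach: flag a data point when it is far away from the average of the points just before it. This is only an illustration, not a recommendation of a specific method. The function name, the window size, the threshold and the sample latencies are all made up for this example.

```python
from statistics import mean, stdev

def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes of values that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        # Flag the point if it is more than `threshold` standard deviations away.
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99]
print(find_anomalies(latencies_ms))  # the spike at index 6 is flagged
```

Real monitoring tools use more robust techniques, but the idea is the same: compare what we see now with what we normally see.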
Different SLA availability:
---------------------------
People expect our services to be available 99.999% of the time (five nines)
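To put the availability targets in perspective, the allowed downtime per year can be worked out directly. A small sketch, assuming a 365-day year (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

Five nines leaves a little over 5 minutes of downtime for the whole year, which is why reacting after things break is not enough.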
We should shift our focus to predicting problems and fixing them before they actually go wrong
Usually people fix things when they break, which is more like general maintenance
Firefighting
Fix it and go on with our day
We start to view ourselves as service workers, maintainers and firefighters
If something is wrong with the code, we have to fix it, and that's good
The bigger picture:
-------------------
What is the objective of the business? What does the business care about?
HA is important
What else is important?
Existing customers: keep them happy and keep pricing competitive
Customers do care about high uptime
Think of logging in to services like Facebook, Yahoo, Google, Snapchat
There is no innovation in the following scenario:
When we are constantly putting out fires and trying to maintain things, we are not learning
The innovation piece is really missing in monitoring and alerting
Why is it even important?
We don't have to look too far back
If we are not a learning organization, we need to become one: learn and improve
E.g.: Blockbuster could have been doing what Netflix is doing
When Netflix started streaming, Blockbuster wasn't doing that
They weren't doing it because they were not a learning organization
Borders was the old way: we used to buy books from the store
Now we just go to Amazon and buy them there
Companies that did not innovate and implement change do not exist anymore
They were focused on keeping the lights on and keeping things as they were
In the IT department, there is often no real chance for the learning that uncovers this innovation
They could have been doing the next big thing if they had focused on learning
and responded to the feedback
With monitoring you can understand what is going on and get a lot of information, which helps you decide what actions to take
How we deliver and maintain the systems
Puppet Labs has documentation about monitoring and how to maintain it:
https://docs.puppet.com/pe/latest/puppet_server_metrics.html
MTTR: mean time to recover (and how to drive that number down)
--------------------------------------------------------------
Responding to the problem
Reducing the cost of downtime is very significant
We should be able to calculate the cost of downtime
Reduce the MTTR to 1 hour or less
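As a rough illustration of how MTTR and the cost of downtime can be calculated from an incident log, here is a minimal sketch. The incidents, the cost per hour and the function names are all hypothetical; plug in your own numbers.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage started, service restored).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 45)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 16, 0)),
    (datetime(2024, 3, 2, 22, 30), datetime(2024, 3, 2, 22, 45)),
]
COST_PER_HOUR = 10_000  # hypothetical cost of one hour of downtime

def mean_time_to_recover(incidents):
    """Average time from outage start to recovery."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

def downtime_cost(incidents, cost_per_hour):
    """Total downtime in hours multiplied by the hourly cost."""
    seconds = sum((end - start).total_seconds() for start, end in incidents)
    return seconds / 3600 * cost_per_hour

print("MTTR:", mean_time_to_recover(incidents))
print("Cost of downtime:", downtime_cost(incidents, COST_PER_HOUR))
```

With these made-up numbers the three outages add up to 3 hours, so the MTTR is exactly 1 hour. Tracking the figure over time shows whether the team is actually driving it down.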
Those who focus on reducing MTTR will find ways to innovate
It's all about predicting and preventing
We should make sure we know about these problems and respond to them very quickly, so we end up repairing them very quickly
Engineers fiddling with the systems by hand is still very prevalent, even where we use DevOps
Now for mean time between failures (MTBF), alongside MTTR
Even high-performing teams are not in a position to predict and prevent every issue
When failure does happen, it should be a short, low-impact outage
Failure is something that is always going to happen in a complex system
Our systems are going to fail because, in a complicated and complex environment,
there are going to be problems
So we need to understand, hold retrospectives and figure out what we need to do
Take information about what we have learned
Understand and learn
If we start thinking a little bit about resiliency:
The ability to build systems resilient to change is a starting point
Make systems that are resilient to failure caused by a change
To build systems that are resilient to change, we should embrace change rather than be afraid of it
Developers are incentivized to get quality code out
They are wanting to
We should be able to recover from the failure, learn what went wrong and improve the process
The way we approach our work and how we collaborate
Are we putting processes into place?
Are we paying attention to MTTR?
Postmortem reports
"Without deviation from the norm, progress is not possible"
To make things better, we have to go through some change
Change is what causes failure
Inevitably something is going to go wrong
We are going to fix what went wrong
We are trying to learn what went wrong and how it was fixed
We want to know what the responders looked at: graphs, logs, queries in the database?
Traditionally, lots of people hop on the phone during an outage, share information, chat and do the work
DevOps: who said what and at exactly what time, all the information is captured
Something to keep in mind: we have to roll out what went wrong
ChatOps is great: during an incident, everything that was said and done is recorded along with the time
It reads like a Q and A session, and we are able to correlate the events
Who was saying what, how things were being resolved, what needed to be done, how to make the process better
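The idea of capturing who said what and when can be sketched in a few lines. This is not how any particular ChatOps tool works; the class names, fields and sample messages are invented purely to illustrate the kind of timeline a postmortem can be built from.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Entry:
    at: datetime
    who: str
    text: str

@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def record(self, who, text, at=None):
        """Store a message or action with its timestamp (defaults to now, UTC)."""
        self.entries.append(Entry(at or datetime.now(timezone.utc), who, text))

    def render(self):
        """Return the timeline in chronological order, one line per entry."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M:%S} {e.who}: {e.text}" for e in ordered)

# Hypothetical incident, recorded out of order on purpose.
timeline = IncidentTimeline()
timeline.record("bob", "rolled back the last deploy",
                datetime(2024, 1, 5, 9, 12, tzinfo=timezone.utc))
timeline.record("alice", "API latency alert fired",
                datetime(2024, 1, 5, 9, 0, tzinfo=timezone.utc))
print(timeline.render())
```

Because every entry carries a timestamp, the record can later be lined up against graphs and logs during the postmortem.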
Utilize a pretty well-known system to do the RCA (root cause analysis)
RCA is good
Five whys:
We do not get the full picture of what is going on
Five whys is a terrible way to look into what took place
--------------------------------------------------------
When something goes wrong, we go around and do an RCA, but nothing is going to undo the issue
The problem is our own obsession with RCA
We are just putting things back the way they were
The reason five whys is not good is that it leads to blame
It doesn't help us make progress or make things better
When we make people feel bad, they tend to shut down and stop talking
People become afraid to speak up and give relevant information
We want to avoid blame at all costs
when we go into these things
Example:
Jason is the one who ran the command that caused a system-wide problem
Something went wrong because of a bad command, so we need to improve the system to make it resilient
We have to iterate constantly; we are here to learn from failures and successes
What did we do well? Keep doing that to avoid failure
In case of an issue:
The team tries to detect and resolve the issue
Swarm on the problem
Respond as a team to solve the problem
Learning organizations
-----------------------
Learning doesn't come from reading and listening
That is not necessarily real learning
We are not really learning if we are not implementing
Implementing is what unlocks opportunities to truly learn
During a learning exercise, ask: what action can we take to do this process better?
E.g.: guitar
Start plucking the strings and trying the slides; from here we actually start learning
Knowing, understanding, learning: there will be mistakes along the way
If we place a finger in the wrong place, there will be issues
If we place the finger in the right place, it plays well
We learn from mistakes only by implementing
Engineers at Netflix say:
We will trade some uptime in exchange for innovation
----------------------------------------------------
Netflix uses Chaos Monkey to test the stability of their systems
Maintaining and protecting
That's the part that sets up our business
When we start to see it this way, the monitoring and alerting tooling that we use
will help us learn and improve, simply because we started implementing it
Right after the experiment, we may fail
We learn and improve, and implement again
Hypothesize, experiment, learn and implement
Be innovative
Make the systems more resilient
The byproduct of a highly resilient system is a highly available system
Protect our system so it stays highly available
Learning and innovation , implementing , embracing change