AWS notes: Monitoring

Monitoring and Alerting
=======================

Why do we monitor? We look at the metrics to figure out whether something is going wrong

Operations and DevOps people are the ones who respond to alerts the most
Managers and developers are also interested in the metrics and in the trends that monitoring shows,
which help them understand how the application is being used, how many users there are and which features are used the most

Why are we collecting the data in the first place?
There is more than one answer:
Analysis of the data and trending issues

Anomaly detection

Capacity planning

Prediction

Making our systems highly available and resilient to failure

Different SLA availability:
---------------------------
People expect our services to be available 99.999% of the time (five nines)
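
As a rough illustration of what an availability target means in practice, the sketch below converts a percentage into the downtime it allows per year. The function name and the 365-day year are assumptions made for this example and are not part of the original notes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common targets, from two nines to five nines
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Under that assumption, five nines leaves a little over five minutes of downtime for the whole year.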

We should shift our focus to predicting and fixing things before they actually go wrong

People fix things when they break, which is more like general maintenance
Firefighting
Fix it and get on with our day
We start to view ourselves as service workers, maintenance staff and firefighters
If something is wrong with the code, we have to fix it, and that's good

The Bigger Picture:
-------------------
What is the objective of the business? What does the business care about?
HA is important
What else is important?
Existing customers: keep them happy and keep pricing competitive
Customers do care about high uptime

Logging in to services: Facebook, Yahoo, Google, Snapchat

There is no innovation in the following scenario:
When we are constantly putting out fires and trying to maintain things, we are not learning

The innovation piece is really missing in monitoring and alerting

Why is it even important?

We don't have to think too far back
If we are not yet a learning organization, we need to become one: learn and improve

E.g. Blockbuster could have been doing what Netflix is doing
When Netflix started streaming, Blockbuster wasn't doing that
They weren't doing it because they were not a learning organization

Borders was the old way: we used to buy books from the store
Now we just go to Amazon and buy them there
Companies like these, which did not innovate and implement CHANGE, do not exist anymore

They were focused on keeping the lights on and keeping things as they were

In the IT department, there is really no chance to do the learning that uncovers all this innovation
They could be doing the next big thing if they focused on learning
and responded to the feedback

You can understand what is going on and get a lot of information, which helps you decide on your actions:
how we deliver and maintain the systems

Puppet Labs has information about monitoring and how to maintain it:
https://docs.puppet.com/pe/latest/puppet_server_metrics.html

Idea of MTTR - Mean time to recover { how to drive that number down }
---------------------------------------------------------------------

Responding to the problem
Reducing the cost of downtime is very significant

We should be able to calculate the cost of downtime
Reduce the MTTR to 1 hr or less
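
A minimal sketch of how MTTR and the cost of downtime could be calculated from incident records. The incident timestamps, the `COST_PER_HOUR` figure and the function names are all made up for illustration; real numbers would come from your own incident tracker and business data.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (time the outage started, time service was restored)
incidents = [
    (datetime(2019, 1, 5, 10, 0), datetime(2019, 1, 5, 11, 30)),
    (datetime(2019, 1, 12, 2, 15), datetime(2019, 1, 12, 2, 45)),
    (datetime(2019, 1, 20, 16, 0), datetime(2019, 1, 20, 17, 0)),
]

COST_PER_HOUR = 10_000.0  # assumed cost of one hour of downtime


def total_downtime(records) -> timedelta:
    """Add up the length of every outage."""
    return sum((restored - started for started, restored in records), timedelta())


def mean_time_to_recover(records) -> timedelta:
    """Average outage length across all incidents."""
    return total_downtime(records) / len(records)


def downtime_cost(records, cost_per_hour: float) -> float:
    """Total downtime in hours multiplied by the hourly cost."""
    return total_downtime(records).total_seconds() / 3600 * cost_per_hour


print("MTTR:", mean_time_to_recover(incidents))
print("Cost of downtime:", downtime_cost(incidents, COST_PER_HOUR))
```

With these made-up incidents the outages last 90, 30 and 60 minutes, so the MTTR works out to exactly one hour.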

Those who focus on reducing MTTR will find ways to innovate

It's all about predicting and preventing
We should make sure we know about these problems and respond to them quickly, so that we end up repairing them quickly

Engineers fiddling with the systems by hand is still very much prevalent, even where we use DevOps

Now for mean time between failures (MTBF)
MTTR
High-performance teams are not in a place to predict and prevent the issues

When it comes to failure, it would be a very small, low-impact outage
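
For reference, a small sketch of how mean time between failures could be estimated from a list of failure timestamps. The timestamps and the function name are hypothetical. As a simplification, the sketch measures the gap from the start of one failure to the start of the next; a stricter definition would subtract the repair time.

```python
from datetime import datetime, timedelta

# Hypothetical start times of successive failures, oldest first
failure_times = [
    datetime(2019, 1, 1, 0, 0),
    datetime(2019, 1, 11, 0, 0),
    datetime(2019, 1, 31, 0, 0),
]


def mean_time_between_failures(times) -> timedelta:
    """Average gap between consecutive failures; needs at least two failures."""
    if len(times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps, timedelta()) / len(gaps)


print("MTBF:", mean_time_between_failures(failure_times))
```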

It's something that is always going to happen in a complex system
Our systems are going to have failures because, in a complicated and complex environment,
there are going to be problems

So we need to understand, reflect and figure out what we need to do

Take information about what we have learned

Understand and learn

If we start thinking a little bit about resiliency:

The ability to build systems resilient to change is also a starting point
Make systems that are resilient to failure caused by a change

In order to build systems that are resilient to change, the systems should embrace change rather than be afraid of it
Developers are incentivized to get quality code out

They are wanting to

We should be able to recover from the failure, learn what went wrong and improve the process:
the way we approach our work and how we collaborate

Are we putting processes into place?
Are we paying attention to MTTR?
Postmortem reports

" Without deviation from the norm , progress is not possible "
In order to make things better, we have to go through some change

Change is what causes failure

Inevitably, something is going to go wrong
We are going to fix what went wrong

We are trying to learn what went wrong and how it was fixed
We want to know what people looked at: graphs, logs, queries in the database?

Traditionally, lots of people hop on the phone during an outage, share information, chat and do the work

DevOps: who said what at exactly what time, all the information is captured

Something to keep in mind: we have to roll out what went wrong

ChatOps is great: during an incident, everything that was said and done is recorded along with the time
It works like a Q&A session, and we are able to correlate things:
who was saying what, how things were getting resolved, what needed to be done, how to make the process better
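
One way to picture that correlation is to merge timestamped chat messages and alerts into a single ordered timeline. The sketch below is only an illustration: the record layout, the sample events and the `build_timeline` helper are all hypothetical and not tied to any particular chat tool.

```python
from datetime import datetime

# Hypothetical records: (timestamp, source, text)
chat_messages = [
    (datetime(2019, 1, 27, 14, 5), "chat", "alice: restarting the web tier"),
    (datetime(2019, 1, 27, 14, 1), "chat", "bob: seeing 500s on checkout"),
]
alerts = [
    (datetime(2019, 1, 27, 14, 0), "alert", "error rate above threshold"),
    (datetime(2019, 1, 27, 14, 9), "alert", "error rate back to normal"),
]


def build_timeline(*sources):
    """Merge all event lists into one list ordered by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


timeline = build_timeline(chat_messages, alerts)
for when, source, text in timeline:
    print(when.strftime("%H:%M"), f"[{source}]", text)
```

Reading the merged timeline top to bottom shows what was said and done next to what the monitoring reported at the same moment.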

Use a well-known method to do the RCA (root cause analysis)
RCA is good
Five whys:
it does not give the full picture of what is going on
Five whys is a terrible way to look into what took place
--------------------------------------------------------
When something goes wrong, we go around and do an RCA, but nothing is going to undo the issue

RCA becomes our own obsession:
we are just putting things back the way they were

The reason five whys is not good is that it leads to blame
It doesn't help us make progress or make things better
When we make people feel bad, they tend to shut down and stop talking

People become afraid to speak up and give relevant information
We want to avoid blame at all costs
when we go into these things
Example:
Jason is the one who ran the command that caused a system-wide problem
Something went wrong because of a bad command, so we need to improve the system to make it resilient

We have to iterate constantly; we are here to learn from failures and successes
What did we do well? Keep doing that to avoid failure

In case of an issue:
the team tries to detect and resolve the issue,
swarms the problem
and responds as a team to solve it

Learning organizations
-----------------------
Learning doesn't come from reading and listening
We are not necessarily learning
We are not really learning if we are not implementing

This is what unlocks opportunities to truly learn and do the implementation
During a learning exercise: what action do we take to do this process better?

E.g. guitar:
start plucking the strings and playing slides; from here we actually start learning
Knowing, understanding, learning: there will be mistakes along the way

If we place the finger in the wrong place, then there will be issues
If we place the finger in the right place, then it plays well
We learn from mistakes only by implementing

Engineers at Netflix say:
We will trade some uptime in exchange for innovation
----------------------------------------------------
Netflix uses Chaos Monkey to test the stability of its systems

Maintaining and protecting
That's the part that sets up our business

When we start to see this, the monitoring and alerting tooling that we use,
simply because we started implementing it, will help us learn and improve

Right after the experiment, we fail
We learn and improve, and implement again
Hypothesize, experiment, learn and implement
Be innovative
Make the systems more resilient

The by-product of a highly resilient system is a highly available system
Protect our system so that it stays highly available
Learning and innovation, implementing, embracing change


Sunday, January 27, 2019
