Monitoring and Alerting
=======================
Why do we monitor? To look at the metrics and figure out when something goes wrong
Operations and DevOps people are the ones who respond to alerts the most
Managers and developers are also interested in the metrics and in the trends the monitoring shows,
which help them understand how the application is being used, how many users it has and which features are used the most
Why are we collecting the data in the first place?
There is more than one answer:
Analysis of the data and trending issues
Anomaly detection
Capacity planning
Prediction
Making our systems highly available and resilient to failure
Different SLA availability:
---------------------------
People expect our services to be available 99.999% of the time (five nines)
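To make the availability targets concrete, here is a minimal sketch that turns a percentage into a yearly downtime budget. The function name and the 365-day year are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Return the downtime budget, in minutes, for an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Compare a few common targets, from two nines up to five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Five nines leaves a little over five minutes of downtime per year, which is why prediction and fast recovery matter more than manual fixes.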
We should shift our focus to predicting and fixing things before they actually go wrong
People fix things when they break, which is more like general maintenance
Fire fighting
Fix it and go on with our day
We start to view ourselves as service workers, maintenance staff and firefighters
If something is wrong with the code, we have to fix it, and that's good
The Bigger Picture:
-------------------
What is the objective of the business? What does the business care about?
High availability (HA) is important
What else is important?
Existing customers - keep them happy, keep pricing competitive
Customers do care about high uptime
Think of logging in to services such as Facebook, Yahoo, Google and Snapchat
There is no innovation in the following scenario:
When we are constantly putting out fires and trying to maintain things, we are not learning
The innovation piece is really missing in monitoring and alerting
Why is it even important?
We don't have to think too far back
If we are not a learning organization, we need to become one: learn and improve
E.g. Blockbuster could have been doing what Netflix is doing
When Netflix started streaming, Blockbuster wasn't doing that
They weren't doing it because they were not a learning organization
Borders is another example: we used to buy books from the store
Now we just go to Amazon and buy them there
Companies that did not innovate and implement change do not exist anymore
They were focused on keeping the lights on and keeping things as they were
In an IT department like that, there is really no chance to do the learning that uncovers innovation
They could have been doing the next big thing if they had focused on learning
and if they had responded to feedback
With monitoring you can understand what is going on and get a lot of information that helps you decide your actions
How we deliver and maintain the systems
Puppet labs has information about monitoring - how to maintain
https://docs.puppet.com/pe/latest/puppet_server_metrics.html
Idea of MTTR - Mean time to recover { how to drive that number down }
---------------------------------------------------------------------
Responding to that problem
Reducing the cost of downtime is very significant
We should be able to calculate the cost of downtime
Reduce the MTTR to 1 hr or less
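A rough sketch of how MTTR and the cost of downtime could be calculated from incident records. The incident timestamps and the cost-per-hour figure are made-up values for illustration, and "recover" is assumed to mean the time from detection to resolution.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def downtime_cost(incidents, cost_per_hour):
    """Total downtime in hours multiplied by an assumed hourly cost."""
    total_seconds = sum((r - d).total_seconds() for d, r in incidents)
    return total_seconds / 3600 * cost_per_hour


# Hypothetical incidents: one 30-minute outage and one 90-minute outage.
incidents = [
    (datetime(2019, 1, 5, 10, 0), datetime(2019, 1, 5, 10, 30)),
    (datetime(2019, 1, 12, 2, 15), datetime(2019, 1, 12, 3, 45)),
]

print("MTTR:", mean_time_to_recover(incidents))
print("Cost:", downtime_cost(incidents, cost_per_hour=1000))
```

Tracking these two numbers over time shows whether the team is actually driving MTTR down.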
Those who focus on reducing MTTR will look for ways to innovate
It's all about predicting and preventing
We should make sure we know about these problems and respond to them very quickly, so we end up repairing them very quickly
Engineers fiddling with the systems by hand is still very prevalent, even where we use DevOps
Now for mean time between failures (MTBF), alongside MTTR
Even high-performance teams are not in a position to predict and prevent every issue
When a failure does come, it should be a small, low-impact outage
Failure is something that is always going to happen in a complex system
Our systems are going to fail because, in a complicated and complex environment,
there are going to be problems
So we need to understand, run retrospectives and figure out what we need to do
Take information about what we have learned
Understand and learn
If we start thinking a little bit about resiliency
The ability to build systems resilient to change is also a starting point
Make systems that are resilient to failure caused by a change
To build systems that are resilient to change, the systems should embrace change rather than be afraid of it
Developers are incentivized to ship quality code
They are wanting to
We should be able to recover from failure, learn what went wrong and improve the process
This covers the way we approach our work and how we collaborate
Are we putting process into place?
Are we paying attention to MTTR?
Postmortem reports
"Without deviation from the norm, progress is not possible"
To make things better, we have to go through some change
Change is what causes failure
Inevitably something is going to go wrong
We are going to fix what went wrong
We are trying to learn what went wrong and how it was fixed
We want to know what people looked at: graphs, logs, queries in the database
Traditionally, lots of people hop on the phone during an outage, share information, chat and do the work
DevOps - who said what at exactly what time; all the information is captured
Something to keep in mind: we have to roll out what went wrong
ChatOps is great - during an incident, everything that was said and done is recorded along with the time
It reads like a Q and A session, and we are able to correlate things
Who was saying what, how things were resolved, what needed to be done, how to make the process better
Use a well-known method to do the root cause analysis (RCA)
RCA is good
Five whys -
It does not give the full picture of what is going on
Five whys is a terrible way to look into what took place
---------------------------------------------------------
When something goes wrong, we do an RCA, but nothing is going to undo the issue
RCA becomes an obsession of its own
We are just putting things back the way they were
The reason five whys is not good is that it leads to blame
It doesn't help us make progress or make things better
When we make people feel bad, they tend to shut down and stop talking
If people are afraid to speak up, we lose relevant information
We want to avoid blame at all costs
When we go into these things
Example:
Jason is the one who ran the command that caused a system-wide problem
Something went wrong because of a bad command, so we need to improve the system to make it resilient
We have to iterate constantly; we are here to learn from failures and successes
What did we do well? Keep doing that to avoid failure
In case of an issue
The team tries to detect and resolve the issue
Swarm onto the problem
Respond as a team to solve the problem
Learning organizations
-----------------------
Learning doesn't come from reading and listening
We are not really learning if we are not implementing
Implementing is what unlocks opportunities to truly learn
During a learning exercise, ask: what action do we take to do this process better?
E.g. guitar:
Start plucking the strings and playing slides; from here we actually start learning
Knowing - understanding - learning - there will be mistakes along the way
If we place the finger in the wrong place, it sounds wrong
If we place the finger in the right place, it plays well
We learn from mistakes only by implementing
Engineers at Netflix say:
We will trade some up time in exchange for innovation
-----------------------------------------------------
Netflix uses Chaos Monkey to test the stability of its systems
Maintaining and protecting
That's the part that sets up our business
When we start to see it this way, the monitoring and alerting tooling that we use,
simply because we started implementing it, will help us learn and improve
Right after the experiment, we fail
We learn and improve, and implement again
Hypothesize, experiment, learn and implement
Be innovative
Make the systems more resilient
The byproduct of a highly resilient system is a highly available system
Protect our system to be highly available
Learning and innovation, implementing, embracing change
Sunday, January 27, 2019
Splunk Training Notes
1) Using the pivot interface
2) Run basic searches
3) Using fields in searches
4) Creating reports
Information collected from the logs
Apache logs from public web site of customer interactions with store
Linux logs of logins and failed logins
Logs of sales to distributors
Linux logs
1) Roles and responsibilities are to gather data and statistics and report on:
- Security
- IT operations
- Business intelligence, etc.
- Application management
- Operations management
- Security and compliance
- and the rest
The first task is to index the data
Once the data is indexed, we move on to the next phase: search and investigate
We need to investigate what the problem is and where it is
The full workflow:
1) Index the data
2) Search and investigate the issue using the data that has been indexed
3) Add knowledge, based on the investigation
4) Monitor and alert
5) Report and analyze, which is the final part
==============================================
Knowledge objects
-----------------
Knowledge objects make your data more robust, providing ways to interpret, classify, enrich and normalize your events
- We create knowledge objects to add value to the data
- Knowledge objects can be reused and shared
Click Settings to access your knowledge objects
KOs enhance your productivity in many ways
Speed
Reuse
Quality
Depth
Speed - Reports give you previously created searches, saving typing time and allowing you to execute searches without knowledge of the search language
Splunk users are assigned roles
The roles determine the capabilities and data access
1) Admin
2) Power
3) User
Splunk administrators can create additional roles
=========================================
Apps allow different workspaces, tailored to a specific use case or user role, to exist on a single Splunk instance
This class focuses on the Search and Reporting app (also called the Search app)
Administrators can install additional apps on your Splunk instance from
http://apps.splunk.com
=========================================
Search: navigate your data using an ordered string of terms and values
Event: searches return events; an event is a single piece of data (i.e. a record in a log file or other data input)
Field: a searchable name/value pair in event data; fields give you more precision in searches
Data model: an abstract visual layer between the user and the raw data, which makes it easier to interact with the data
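The idea of events and fields can be sketched in plain Python. This is only an illustration of the concepts, not Splunk's search language or API; the event contents and the `search` helper are hypothetical.

```python
# Each event is one record; its fields are name/value pairs.
events = [
    {"host": "web01", "status": "200", "action": "purchase"},
    {"host": "web01", "status": "404", "action": "view"},
    {"host": "web02", "status": "200", "action": "view"},
]


def search(events, **fields):
    """Return the events whose fields match every name=value pair given."""
    return [
        event for event in events
        if all(event.get(name) == value for name, value in fields.items())
    ]


# Adding more fields narrows the result, which is the extra precision
# that fields give a search.
print(len(search(events, status="200")))
print(search(events, status="200", host="web02"))
```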
Saturday, January 19, 2019
AWS Networking
Parts of a Network You Should Know About
If you’re running infrastructure and applications on AWS then you will encounter all of these things. They’re not the only parts of a network setup but they are, in my experience, the most important ones.
VPC
A virtual private cloud - VPC - is a private network space in which you can run your infrastructure. It has an address space (CIDR range) which you choose, e.g. 10.0.0.0/16. This determines how many IP addresses you can assign within the VPC. Each server you create inside the VPC will need an IP address, so this address space defines the limit of how many resources you can have within the network. The 10.0.0.0/16 address space can use the addresses from 10.0.0.0 to 10.0.255.255, which is 65,536 IP addresses.
The VPC is the basis of your network on AWS and all new accounts include a default VPC with a subnet in each availability zone.
+---------------+
| VPC | The Internet
| |
| |
| 10.0.0.0/16 |
| |
| |
+---------------+
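The size of a CIDR range can be checked with Python's standard `ipaddress` module. A small sketch, where the helper name `address_range` is made up for illustration:

```python
import ipaddress


def address_range(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    network = ipaddress.ip_network(cidr)
    return str(network[0]), str(network[-1]), network.num_addresses


first, last, count = address_range("10.0.0.0/16")
print(first, "to", last, "=", count, "addresses")
```

A shorter prefix gives a larger range: each bit removed from the prefix doubles the number of addresses.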
Subnets
A subnet is a section of your VPC, with its own CIDR range and rules on how traffic can flow. Its CIDR range has to be a subset of the VPC’s, for example 10.0.1.0/24, which would allow for IPs from 10.0.1.0 to 10.0.1.255, giving 256 possible IP addresses.
Subnets are often denominated as ‘public’ or ‘private’ depending on whether traffic can reach them from outside the VPC (the Internet). This visibility is controlled by the traffic routing rules and each subnet can have its own rules.
A subnet has to be in a specific availability zone within a region so it’s good practice to have a subnet in each zone. If you plan to have public and private subnets then there should be one of each per availability zone.
+---------------------------+
| VPC |
| |
+------------+ +------------+
|| Subnet 1 | | Subnet 2 ||
||10.0.1.0/24| |10.0.2.0/24||
|| | | ||
|------------+ +------------|
+---------------------------+
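Whether a subnet range is a valid subset of the VPC range, and whether two subnets collide, can be checked with the standard `ipaddress` module (`subnet_of` needs Python 3.7 or later). The `outside` range below is an invented counter-example.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet_1 = ipaddress.ip_network("10.0.1.0/24")
subnet_2 = ipaddress.ip_network("10.0.2.0/24")
outside = ipaddress.ip_network("10.1.0.0/24")  # hypothetical range outside the VPC


def fits_in_vpc(subnet, vpc_network):
    """True when every address of the subnet lies inside the VPC range."""
    return subnet.subnet_of(vpc_network)


print(fits_in_vpc(subnet_1, vpc))    # inside the VPC range
print(fits_in_vpc(outside, vpc))     # not inside the VPC range
print(subnet_1.overlaps(subnet_2))   # subnets must not overlap each other
print(subnet_1.num_addresses)
```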
Availability Zones
We’ve said that there should be subnets per availability zone, but what does that actually mean?
Each AWS region is divided into 2 or more different zones which, between them, aim to guarantee a very high level of availability for that region. Essentially, at least one zone should be able to operate, even if others suffer outages (:fire:).
+----------+ +----------+
|us-ea)t-1a| |us-east-1b|
|_____(____| |__________|
| ) | | |
| ( &() | | ✔ |
| ) () &( | | 8-) |
+----------+ +----------+
Routing Tables
A routing table contains rules about how IP packets in the subnets can travel to different IP addresses. There is always a default route table which will only allow traffic to travel locally, within the VPC. If a subnet has no routing table associated with it then it uses the default one. These would be ‘private’ subnets.
If you want external traffic to be able to get to a subnet then you need to create a routing table with a rule explicitly allowing this. Subnets associated with that routing table would be ‘public’.
All of the subnets in the default VPCs are associated with a route table which makes them public.
Internet Gateways
The routing table which makes a subnet public needs to reference an Internet gateway to allow the flow of external IP packets into and out of the VPC. You create your Internet gateway and then create a rule which says that packets to 0.0.0.0/0 - all IP addresses - need to go there.
Route table
+-------------------+
| 10.0.0.0/16: local | Requests within the VPC go over local connections.
+--+ 0.0.0.0/0: ig-123 | Requests to any other IPs go via the Internet Gateway.
| | |
| +-------------------+
|
|
+-----+-------+ +-------------+
| Subnet 1 | | Subnet 2 |
| 10.0.1.0/24 | | 10.0.2.0/24 |
| | | |
| | 10.0.2.9 | |
| +--------->| |
| | | |
+-------+-----+ +-------------+
| 8.8.4.4
|
| +--------+
+-->| ig-123 |
| +-----> The Internet
+--------+
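How a route table picks a target can be sketched as a longest-prefix match: of all the routes that contain the destination, the most specific one wins. This is a simplified model, not AWS code; it uses the VPC's 10.0.0.0/16 range for the local route and the `ig-123` name from the diagram.

```python
import ipaddress

# (destination CIDR, target) pairs, as in the diagram above.
ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "ig-123"),
]


def next_hop(destination, routes=ROUTES):
    """Return the target of the most specific route matching the destination."""
    address = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if address in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # The longest prefix is the most specific match.
    return max(matches)[1]


print(next_hop("10.0.2.9"))  # stays inside the VPC
print(next_hop("8.8.4.4"))   # leaves through the Internet gateway
```

Without the 0.0.0.0/0 rule, the second lookup would find no route, which is what makes a subnet private.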
NAT Gateways
If you have an EC2 instance in a private subnet - one which doesn’t allow traffic from the Internet to reach it - then there’s also no way for IP packets to reach the Internet. We need a mechanism for sending those packets out, and then routing the replies correctly. This is called network address translation and is very likely done in your house by your wifi router.
A NAT gateway is a device which sits in the public subnets, accepts any IP packets bound for the Internet coming from the private subnets, sends those packets on to their destination and then sends the returning packets back to the source.
It’s not necessary to have NAT gateways if you don’t intend instances in your private subnets to talk outside of your VPC. If you do need that, e.g. to use an external API or a SaaS database, then you can either set up an appropriately configured EC2 instance (might be cheaper, depending on your traffic) or use an AWS managed NAT gateway resource (will be easier to manage because you won’t be doing it).
+---------------------+
| Public Subnet |
| 10.0.1.0/24 |
| |
| +------------+ |
| | +-----------> The Internet
| | nat-123 | |
| | | |
| +-------^----+ |
| | |
+------------|--------+
|
| 8.8.4.4
| Route table
+------------+---------+ +---------------------+
| Private Subnet +----+ 10.0.0.0/16: local |
| 10.0.20.0/24 | | 0.0.0.0/0: nat-123 |
| | +---------------------+
+----------------------+
- The public subnet contains the NAT gateway
- A request is made from the private subnet to an IP address somewhere on the Internet
- The route table says that it needs to go to the NAT gateway
- The NAT gateway sends it on
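The steps above can be sketched as a toy translation table. This is a deliberately simplified model of network address translation, not how the AWS NAT gateway is implemented; the class name, the public IP (from the 203.0.113.0/24 documentation range), the private address and the starting port are all assumptions.

```python
class Nat:
    """Toy NAT: rewrites the source of outbound packets and remembers the mapping."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.table = {}  # public port -> (private ip, private port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Send a packet out: replace the private source with the public one."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (src_ip, src_port)
        return (self.public_ip, public_port, dst_ip, dst_port)

    def inbound(self, public_port):
        """Route a reply back: look up which private host owns this port."""
        return self.table.get(public_port)


nat = Nat("203.0.113.10")
packet = nat.outbound("10.0.20.5", 51000, "8.8.4.4", 443)
print(packet)                  # what the Internet sees
print(nat.inbound(packet[1]))  # where the reply is delivered
```

Replies only get in if a matching outbound entry exists, which is why instances behind NAT can reach the Internet without being reachable from it.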
Security Groups
VPC network security groups denote what traffic can flow to (and from) EC2 instances within your VPC. A security group can specify ingress (inbound) and egress (outbound) traffic rules, limiting them to certain sources (inbound) and destinations (outbound). They are associated with EC2 instances rather than subnets.
By default all traffic is allowed out, but no traffic is allowed in. Inbound rules can specify a source address - either a CIDR block or another security group - and a port range. When the source is another security group then that must be within the same VPC. For example, a VPC is created with a default security group which allows traffic from anything which has that same security group. Assigning the group to everything created in the VPC (not necessarily the most secure practice) means that all those resources can talk to each other.
+---------------+
| sg-abcde |
| ALLOW TCP 443 |
+----+----------+
|
+----+------+
| i-67890 |
10.0.1.123:22 | | 10.0.1.123:443
------------------>X| <----------------
| |
+-----------+
- An instance (i-67890) has a security group (sg-abcde) which allows TCP traffic on port 443
- A request is made to its IP address (10.0.1.123) on port 22 which doesn’t get through
- A request is made to port 443 on the instance and the traffic is allowed
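The inbound check in this example can be sketched as follows. It is a simplified model that only covers CIDR sources, not security-group sources. The source range 0.0.0.0/0 and the client address are assumptions, since the diagram does not say where the requests come from.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    protocol: str
    from_port: int
    to_port: int
    source_cidr: str


def is_allowed(rules, protocol, port, source_ip):
    """Inbound traffic is denied unless at least one rule matches it."""
    source = ipaddress.ip_address(source_ip)
    return any(
        rule.protocol == protocol
        and rule.from_port <= port <= rule.to_port
        and source in ipaddress.ip_network(rule.source_cidr)
        for rule in rules
    )


# sg-abcde from the diagram: allow TCP 443 (source range assumed to be anywhere).
sg_abcde = [IngressRule("tcp", 443, 443, "0.0.0.0/0")]

print(is_allowed(sg_abcde, "tcp", 22, "198.51.100.7"))   # port 22 is blocked
print(is_allowed(sg_abcde, "tcp", 443, "198.51.100.7"))  # port 443 gets through
```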
Putting it All Together
The complete picture of your virtual private network looks something like the picture below, with public and private subnets spread across availability zones, network address translation sitting in the public subnets and route tables to specify how packets are routed. EC2 instances are run in any subnet and have security groups attached to them.
+-------+
| ig-1 |
| |
vpc-123: 10.0.0.0/16 | | | |
+----------------------+---------+-------+--------+---------------------+
| | | |
| +-----+ | +-----+ | +-----+ |
| | NAT | | | NAT | | | NAT | |
public | | | | | | | | | |
subnets| +-----+ | +-----+ | +-----+ |
| | | |
| | | |
| | | |
| +-------+ +-------+ +-------+
| | rt-1a | | rt-1b | | rt-1c |
| 10.0.1.0/24 | | 10.0.2.0/24 | | 10.0.3.0/24 | |
-------+-----------------------------------------------------------------------+
| 10.0.4.0/24 | rt-2a | 10.0.5.0/24 | rt-2b | 10.0.6.0/24 | rt-2c |
| | | | | | |
| +-------+ +-------+ +-------+
private| | | |
subnets| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
+----------------------+--------------------------+---------------------+
| AZ 1 | AZ 2 | AZ 3 |
Tuesday, January 1, 2019
AWS 21- RDS Databases
There are 5 types of Databases
Overview:
1) RDS (Relational Database Service) - regular relational database system
Structured DB.
SQL Server, Oracle, MySQL, PostgreSQL, Amazon Aurora (MySQL flavor), MariaDB
2) DynamoDB - This is a NoSQL database (similar to Cassandra and MongoDB)
3) ElastiCache - In-memory DB used for caching by websites (Redis and Memcached)
4) Neptune - Graph database, very new
5) Amazon Redshift - Used for data warehousing
Heavily used databases are RDS, DynamoDB and Amazon Redshift.
RDS is platform as a service (PaaS)
AWS takes care of the database
They are responsible for maintaining it
Ready-made service
PaaS is like a furnished house: we do not have control over it
There is no chance to customize it as needed.