Sunday, January 27, 2019

Monitoring

Monitoring and Alerting
=======================

Why do we do monitoring? We look at the metrics to figure out whether something is going wrong

Operations and DevOps people are the ones who respond to alerts the most
Managers and developers are also interested in the metrics and in the trends that monitoring shows,
which help them understand how the application is being used, how many users it has and which features are used the most

Why are we collecting the data in the first place?
There is more than one answer:
Analysis of the data and trending issues

Anomaly detection

Capacity planning

Prediction

Make our systems Highly available and resilient to failure

Different SLA availability:
---------------------------
People expect our services to be available 99.999% of the time (five nines)
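
To make the five nines target concrete, here is a minimal sketch that turns an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration, not part of any SLA definition.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Under these assumptions five nines leaves a little over five minutes of downtime for the whole year, which is why prediction matters more than firefighting at that level.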

We should shift our focus to predicting problems and fixing them before they actually go wrong

People fix things when they break; it is mostly general maintenance work
Firefighting
Fix it and go back to our day
We start to view ourselves as service workers, maintenance staff and firefighters
If something is wrong with the code, we have to fix it, and that is good

The Bigger PICTURE :
--------------------
What is the objective of the business? What does the business care about?
HA is important
What else is important?
Existing customers - keep them happy - pricing is competitive
Customers do care about high uptime

Logging in to services - Facebook, Yahoo, Google, Snapchat

There is no innovation in the following scenario:
When we are constantly putting out fires and trying to maintain things, we are not learning

The innovation piece is really missing in monitoring and alerting

Why is it even important?

We don't have to think too far back
If we are not a learning organization, we need to learn and improve

EG: Blockbuster could have been doing what Netflix is doing
When Netflix started streaming, Blockbuster wasn't doing that
They weren't doing that because they were not a learning organization

Borders is an older example: we used to buy books from the store
Now we just go to Amazon and buy them there
Companies that did not innovate and implement CHANGE do not exist anymore

They were focused on keeping the lights on and keeping things as they were

In the IT department, there is really no chance to learn, and learning is what uncovers all this innovation
They could be doing the next best thing if they focused on learning
and responded to the feedback

You can understand what is going on and get a lot of information, which helps you decide what actions to take
How we deliver and maintain the systems

Puppet Labs has information about monitoring - how to maintain
https://docs.puppet.com/pe/latest/puppet_server_metrics.html

Idea of MTTR - Mean time to recover { how to drive that number down }
---------------------------------------------------------------------

Responding to the problem
Reducing the cost of downtime is very significant

We should be able to calculate the cost of downtime
Reduce the MTTR to 1 hr or less
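
A rough sketch of how MTTR and the cost of downtime could be calculated from an incident log. The incident timestamps and the cost per hour are made-up values for illustration; a real calculation would pull them from the incident tracker and from the business.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, recovered) timestamps.
incidents = [
    (datetime(2019, 1, 5, 9, 0), datetime(2019, 1, 5, 9, 45)),
    (datetime(2019, 1, 12, 14, 0), datetime(2019, 1, 12, 16, 0)),
    (datetime(2019, 1, 20, 22, 30), datetime(2019, 1, 20, 22, 45)),
]


def mean_time_to_recover(incidents):
    """Average time between detection and recovery."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def downtime_cost(incidents, cost_per_hour):
    """Total downtime in hours multiplied by an assumed hourly cost."""
    total_seconds = sum((end - start).total_seconds() for start, end in incidents)
    return total_seconds / 3600 * cost_per_hour


print("MTTR:", mean_time_to_recover(incidents))
print("Cost:", downtime_cost(incidents, cost_per_hour=10_000))
```

Tracking both numbers over time shows whether the MTTR is actually being driven down and what each hour saved is worth.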

Those who focus their time on reducing MTTR will look for ways to innovate

It's all about predicting and preventing
We should make sure we know about these problems and respond to them quickly, so that we end up repairing them quickly

Engineers fiddling with the systems is still very prevalent, even where DevOps is used

Now for mean time between failures
MTTR
High-performance teams are not in a place to predict and prevent the issues

When it comes to failure, it would be a very low, small-impact outage

It's something that is always going to happen in a complex system
Our systems are going to have failures because, in a complicated and complex environment,
there are going to be problems

So we need to understand, retrospect and figure out what we need to do

Take information about what we have learned

Understand and learn

If we start thinking a little bit about resiliency

The ability to build systems resilient to change is also a starting point
Make systems that are resilient to failure caused by a change

In order to build systems that are resilient to change, the systems should embrace change rather than be afraid of it
Developers are incentivized to get quality code out

They are wanting to

We should be able to recover from the failure, learn what went wrong and improve the process
The way we approach our work and how we collaborate

Are we putting process into place?
Are we paying attention to MTTR?
Postmortem reports

"Without deviation from the norm, progress is not possible"
In order to make things better, we have to go through some change

Change is the one that causes the failure

Inevitably something is going to go wrong
We are going to fix what went wrong

We are trying to learn what went wrong and how it was fixed
We want to know what they looked at: graphs, logs, queries in the database?

Traditionally, lots of people hop on the phone during an outage, share information, chat and do the work

DevOps - who said what at exactly what time, all the information is captured

Something to keep in mind: we have to roll out what went wrong

ChatOps is great - during an incident, everything that was said and done is recorded along with the time
This is a Q and A session - and we will be able to correlate the stuff
Who was saying what, how things were being resolved, what needed to be done, how to make the process better

Utilize a pretty well-known system to do the RCA
RCA is good
Five whys -
Not getting the full picture of what is going on
Five whys is a terrible way to go about looking into what took place
------------------------------------------------------------------------
When something goes wrong, we go around and do an RCA - but nothing is going to undo the issue

RCA - it is our own obsession with RCA
We are just putting things back the way they were

The reason five whys is not good is that it leads to blame
It doesn't help us make progress or make things better
We make people feel bad; they tend to shut down and stop talking

If people are afraid to speak up and give relevant information
We want to avoid blame at all costs
When we go into these things
example:
Jason is the one who ran the command that caused a system-wide problem,
Something went wrong because of a bad command, so we need to improve the system to make it resilient

We have to iterate constantly; we are here to learn from failures and successes
What did we do well? Keep doing that to avoid failure

In case of an issue
Team tries to detect and resolve an issue
swarm into a problem
Respond as a team to solve the problem

Learning organizations
-----------------------
Learning doesn't come from reading and listening
We are not really learning necessarily
We are not really learning if we are not implementing

This is what unlocks opportunities to truly learn and do the implementation
During a learning exercise, what action do we take to do this process better?

EG: guitar
Start plucking the strings and the slides; from here we actually start learning
Knowing - understanding - learning - there will be mistakes along the way

If we place the finger in the wrong place, there will be issues
If we place the finger in the right place, it plays well
We learn from mistakes only by implementing

Engineers at Netflix say:
We will trade some up time in exchange for innovation
-----------------------------------------------------
Netflix uses Chaos Monkey to test the stability of its systems

Maintaining and protecting
That's the part that sets up our business

When we start to see this, the monitoring and alerting tooling that we use,
simply because we started implementing it, will help us learn and improve

Right after the experiment, we fail
We learn and improve, and implement again
Hypothesize, experiment, learn and implement
Be innovative
Make the systems more resilient

The by-product of a highly resilient system is a highly available system
Protect our system to be highly available
Learning and innovation, implementing, embracing change

Splunk Training Notes

1) Using the pivot interface

2) Run basic searches

3) Using fields in searches

4) Creating reports

Information collected from the logs

Apache logs from public web site of customer interactions with store

Linux logs of logins and failed logins

Logs of sales to distributors

Linux logs

1) Roles and responsibilities are to gather data and statistics and report on:

- Security

- IT operations

- Business intelligence, etc.

- Application management

- Operations management

- Security and compliance

- and the rest

The first task is to index the data

Once the data is indexed, we move on to the next phase

That is search and investigate

We need to investigate what the problem is and where it is

The full workflow:

1) Index the data

2) Search and investigate the issue from the data that has been indexed

3) Add knowledge, based on the investigation

4) Monitor and alert

5) Report and analyze, which is the final part

==============================================

Knowledge objects

-----------------

Knowledge objects make your data more robust, providing ways to interpret, classify, enrich and normalize your events

- We create knowledge objects to add value to our data

- Knowledge objects can be reused and shared

Click Settings to access your knowledge objects

KOs enhance your productivity in many ways

Speed

Reuse

Quality

Depth

Speed - Reports give you previously created searches, saving typing time and allowing you to execute searches without knowledge of the search language

Splunk users are assigned roles

The roles determine the capabilities and data access

1) Admin

2) Power

3) User

Splunk administrators can create additional roles

=========================================

Apps allow different workspaces, tailored to a specific use case or user role, to exist on a single Splunk instance

This class focuses on the Search and Reporting app (also called the Search app)

Administrators can install additional apps to your Splunk instance from

http://apps.splunk.com

=========================================

Search: To navigate your data using an ordered string of terms and values

Event: Searches return events; an event is a single piece of data (i.e. a record in a log file or other data input)

Field: Searchable name/value pair in event data. Fields give you more precision in searches.

Data Model: An abstract visual layer between the user and the raw data, which makes it easier to interact with the data
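
To illustrate the idea of events and fields outside of Splunk, here is a small sketch that pulls name/value pairs out of log lines and filters events on them. This is plain Python, not Splunk's search language or API; the sample events, the `key=value` log format and both function names are assumptions made up for the example.

```python
import re

# Matches key=value pairs; the value may be bare or double-quoted.
FIELD_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(event: str) -> dict:
    """Return the name/value pairs found in one event (one log line)."""
    return {name: value.strip('"') for name, value in FIELD_PATTERN.findall(event)}


def search(events, **criteria):
    """Keep only the events whose fields match every given name/value pair."""
    return [
        event
        for event in events
        if all(extract_fields(event).get(k) == v for k, v in criteria.items())
    ]


# Hypothetical login events.
events = [
    "user=alice action=login status=failed",
    "user=bob action=login status=success",
    "user=alice action=login status=success",
]

print(search(events, user="alice", status="failed"))
```

Filtering on fields rather than on raw text is what gives the extra precision: a plain text match on "failed" would also hit events where the word appears in some other position.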

Saturday, January 19, 2019

AWS Networking

Parts of a Network You Should Know About

If you’re running infrastructure and applications on AWS then you will encounter all of these things. They’re not the only parts of a network setup but they are, in my experience, the most important ones.

VPC

A virtual private cloud - VPC - is a private network space in which you can run your infrastructure. It has an address space (CIDR range) which you choose e.g. 10.0.0.0/16. This determines how many IP addresses you can assign within the VPC. Each server you create inside the VPC will need an IP address so this address space defines the limit of how many resources you can have within the network. The 10.0.0.0/16 address space can use the addresses from 10.0.0.0 to 10.0.255.255, which is 65,536 IP addresses.
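
The address arithmetic above can be checked with Python's standard `ipaddress` module. This is only a sketch of the calculation and does not talk to AWS.

```python
import ipaddress

# The example VPC range from the text.
vpc = ipaddress.ip_network("10.0.0.0/16")

print(vpc.num_addresses)  # total number of addresses in the block
print(vpc[0], vpc[-1])    # first and last address in the block
```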

The VPC is the basis of your network on AWS and all new accounts include a default VPC with a subnet in each availability zone.

+---------------+
|     VPC       |     The Internet
|               |
|               |
|  10.0.0.0/16  |
|               |
|               |
+---------------+

Subnets

A subnet is a section of your VPC, with its own CIDR range and rules on how traffic can flow. Its CIDR range has to be a subset of the VPC’s, for example 10.0.1.0/24 which would allow for IPs from 10.0.1.0 to 10.0.1.255 giving 256 possible IP addresses.
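
A quick way to check that a subnet range really is a subset of the VPC range, again using the standard `ipaddress` module. The 192.168.1.0/24 range is an extra value added here only to show a failing case.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")
outside = ipaddress.ip_network("192.168.1.0/24")  # illustrative range outside the VPC

print(subnet.subnet_of(vpc))     # is the subnet inside the VPC range?
print(outside.subnet_of(vpc))    # a range outside the VPC fails the check
print(subnet.num_addresses)      # size of the /24
print(subnet[0], subnet[-1])     # first and last address of the subnet
```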

Subnets are often described as ‘public’ or ‘private’ depending on whether traffic can reach them from outside the VPC (the Internet). This visibility is controlled by the traffic routing rules and each subnet can have its own rules.

A subnet has to be in a specific availability zone within a region so it’s good practice to have a subnet in each zone. If you plan to have public and private subnets then there should be one of each per availability zone.

+---------------------------+
|            VPC            |
|                           |
|+-----------+ +-----------+|
||  Subnet 1 | |  Subnet 2 ||
||10.0.1.0/24| |10.0.2.0/24||
||           | |           ||
|+-----------+ +-----------+|
+---------------------------+
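
One way to plan a public and a private subnet per availability zone is to carve the VPC range into /24 blocks and hand them out in order. The sketch below assumes three zones with hypothetical names and an arbitrary numbering scheme (public blocks first, then private); it only computes the ranges and creates nothing on AWS.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # hypothetical zone list

# All /24 blocks inside the VPC: 10.0.0.0/24, 10.0.1.0/24, ...
blocks = list(vpc.subnets(new_prefix=24))

plan = {}
for i, zone in enumerate(zones):
    plan[zone] = {
        "public": blocks[1 + i],                # 10.0.1.0/24 onwards
        "private": blocks[1 + len(zones) + i],  # blocks following the public ones
    }

for zone, subnets in plan.items():
    print(zone, "public:", subnets["public"], "private:", subnets["private"])
```

Because every block comes from `vpc.subnets()`, the ranges are guaranteed not to overlap and to stay inside the VPC.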

Availability Zones

We’ve said that there should be subnets per availability zone, but what does that actually mean?

Each AWS region is divided into 2 or more different zones which, between them, aim to guarantee a very high level of availability for that region. Essentially, at least one zone should be able to operate, even if others suffer outages (:fire:).

+----------+          +----------+
|us-ea)t-1a|          |us-east-1b|
|_____(____|          |__________|
|     )    |          |          |
|   ( &()  |          |    ✔     |
|  ) () &( |          |    8-)   |
+----------+          +----------+

Routing Tables

A routing table contains rules about how IP packets in the subnets can travel to different IP addresses. There is always a default route table which will only allow traffic to travel locally, within the VPC. If a subnet has no routing table associated with it then it uses the default one. These would be ‘private’ subnets.

If you want external traffic to be able to get to a subnet then you need to create a routing table with a rule explicitly allowing this. Subnets associated with that routing table would be ‘public’.

All of the subnets in the default VPCs are associated with a route table which makes them public.

Internet Gateways

The routing table which makes a subnet public needs to reference an Internet gateway to allow the flow of external IP packets into and out of the VPC. You create your Internet gateway and then create a rule which says that packets to 0.0.0.0/0 - all IP addresses - need to go there.

          Route table
         +--------------------+
         | 10.0.0.0/16: local | Requests within the VPC go over local connections.
      +--+ 0.0.0.0/0: ig-123  | Requests to any other IPs go via the Internet Gateway.
      |  |                    |
      |  +--------------------+
      |
      |
+-----+-------+          +-------------+
|  Subnet 1   |          |  Subnet 2   |
| 10.0.1.0/24 |          | 10.0.2.0/24 |
|             |          |             |
|             | 10.0.2.9 |             |
|             +--------->|             |
|             |          |             |
+-------+-----+          +-------------+
        | 8.8.4.4
        |
        |   +--------+
        +-->| ig-123 |
            |        +-----> The Internet
            +--------+
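
The route lookup in the diagram can be sketched as a most-specific-match search: of all routes whose range contains the destination, the one with the longest prefix wins. This is a simplified model for illustration, assuming the local route covers the VPC's 10.0.0.0/16 range; `ROUTES` and `next_hop` are names made up for the example.

```python
import ipaddress

# Route table modelled on the diagram: destination range -> target.
ROUTES = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0": "ig-123",
}


def next_hop(destination: str, routes=ROUTES) -> str:
    """Return the target of the most specific route matching the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError(f"no route to {destination}")
    # The longest prefix is the most specific route.
    _, target = max(matches, key=lambda match: match[0].prefixlen)
    return target


print(next_hop("10.0.2.9"))  # address in Subnet 2
print(next_hop("8.8.4.4"))   # address on the Internet
```

Without the 0.0.0.0/0 entry the second lookup would raise, which mirrors a private subnet: traffic can only travel locally.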

NAT Gateways

If you have an EC2 instance in a private subnet - one which doesn’t allow traffic from the Internet to reach it - then there’s also no way for IP packets to reach the Internet. We need a mechanism for sending those packets out, and then routing the replies correctly. This is called network address translation and is very likely done in your house by your wifi router.

A NAT gateway is a device which sits in the public subnets, accepts any IP packets bound for the Internet coming from the private subnets, sends those packets on to their destination and then sends the returning packets back to the source.

It’s not necessary to have NAT gateways if you don’t intend instances in your private subnets to talk outside of your VPC but if you do need to do that e.g. using an external API, SaaS database etc. then you can simply set up an EC2 instance (might be cheaper, depending on your traffic), configured appropriately, or use an AWS managed NAT gateway resource (will be easier to manage because you won’t be doing it).

  +---------------------+
  |   Public Subnet     |
  |   10.0.1.0/24       |
  |                     |
  |    +------------+   |
  |    |            +----------->   The Internet
  |    |  nat-123   |   |
  |    |            |   |
  |    +-------^----+   |
  |            |        |
  +------------|--------+
               |
               | 8.8.4.4
               |                       Route table
  +------------+---------+    +---------------------+
  |  Private Subnet      +----+  10.0.0.0/16: local |
  |  10.0.20.0/24        |    |  0.0.0.0/0: nat-123 |
  |                      |    +---------------------+
  +----------------------+

The public subnet contains the NAT gateway
A request is made from the private subnet to an IP address somewhere on the Internet
The route table says that it needs to go to the NAT gateway
The NAT gateway sends it on
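
The translation step itself can be pictured as a lookup table: outgoing packets get a public address and port, and replies are matched back to the private source. The class below is a toy model to show the idea, not how the AWS NAT gateway is implemented; the public IP (from a documentation range), the private address and the port numbers are all made up.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class NatTable:
    """Toy port-translation table for illustration only."""

    def __init__(self, public_ip: str, first_port: int = 1024) -> None:
        self.public_ip = public_ip
        self._next_port = first_port
        self._by_private: Dict[Endpoint, int] = {}
        self._by_public_port: Dict[int, Endpoint] = {}

    def outbound(self, private_ip: str, private_port: int) -> Endpoint:
        """Map a private source to a public (IP, port), reusing known mappings."""
        key = (private_ip, private_port)
        if key not in self._by_private:
            self._by_private[key] = self._next_port
            self._by_public_port[self._next_port] = key
            self._next_port += 1
        return (self.public_ip, self._by_private[key])

    def inbound(self, public_port: int) -> Optional[Endpoint]:
        """Find the private source a reply belongs to, or None if unknown."""
        return self._by_public_port.get(public_port)


nat = NatTable("203.0.113.10")
public = nat.outbound("10.0.20.5", 51000)
print(public)                  # what the Internet sees as the source
print(nat.inbound(public[1]))  # where the reply is sent back to
print(nat.inbound(9999))       # unsolicited traffic has no mapping
```

The last lookup shows why the private subnet stays private: only replies to connections that started inside have an entry to follow back.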

Security Groups

VPC network security groups denote what traffic can flow to (and from) EC2 instances within your VPC. A security group can specify ingress (inbound) and egress (outbound) traffic rules, limiting them to certain sources (inbound) and destinations (outbound). They are associated with EC2 instances rather than subnets.

By default all traffic is allowed out, but no traffic is allowed in. Inbound rules can specify a source address - either a CIDR block or another security group - and a port range. When the source is another security group then that must be within the same VPC. For example, a VPC is created with a default security group which allows traffic from anything which has that same security group. Assigning the group to everything created in the VPC (not necessarily the most secure practice) means that all those resources can talk to each other.

                   +---------------+
                   | sg-abcde      |
                   | ALLOW TCP 443 |
                   +----+----------+
                         |
                    +----+------+
                    |  i-67890  |
 10.0.1.123:22      |           | 10.0.1.123:443
------------------>X|           <----------------
                    |           |
                    +-----------+

An instance (i-67890) has a security group (sg-abcde) which allows TCP traffic on port 443
A request is made to its IP address (10.0.1.123) on port 22 which doesn’t get through
A request is made to port 443 on the instance and the traffic is allowed
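
The inbound check in this example can be sketched as "allow if any rule matches, otherwise deny". The rule structure below is a simplified stand-in for a real security group, the source range 0.0.0.0/0 is an assumption (the diagram does not say where traffic is allowed from), and the client address is made up.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    protocol: str
    from_port: int
    to_port: int
    source_cidr: str


def is_allowed(rules, protocol: str, port: int, source_ip: str) -> bool:
    """Inbound traffic is denied unless at least one rule matches it."""
    source = ipaddress.ip_address(source_ip)
    return any(
        rule.protocol == protocol
        and rule.from_port <= port <= rule.to_port
        and source in ipaddress.ip_network(rule.source_cidr)
        for rule in rules
    )


# sg-abcde from the diagram: allow TCP 443 (source range assumed).
sg_abcde = [IngressRule("tcp", 443, 443, "0.0.0.0/0")]

print(is_allowed(sg_abcde, "tcp", 22, "198.51.100.7"))   # SSH attempt
print(is_allowed(sg_abcde, "tcp", 443, "198.51.100.7"))  # HTTPS request
```

There is no explicit deny rule anywhere: port 22 is blocked simply because nothing allows it.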

Putting it All Together

The complete picture of your virtual private network looks something like the picture below, with public and private subnets spread across availability zones, network address translation sitting in the public subnets and route tables to specify how packets are routed. EC2 instances are run in any subnet and have security groups attached to them.

                                        +-------+                                  
                                        | ig-1  |                                  
                                        |       |                                  
        vpc-123: 10.0.0.0/16  |         |       |        |                         
       +----------------------+---------+-------+--------+---------------------+
       |                      |                          |                     |   
       |  +-----+             |  +-----+                 |  +-----+            |   
       |  | NAT |             |  | NAT |                 |  | NAT |            |   
public |  |     |             |  |     |                 |  |     |            |   
subnets|  +-----+             |  +-----+                 |  +-----+            |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       |              +-------+                  +-------+             +-------+
       |              | rt-1a |                  | rt-1b |             | rt-1c |
       | 10.0.1.0/24  |       | 10.0.2.0/24      |       | 10.0.3.0/24 |       |   
-------+-----------------------------------------------------------------------+
       | 10.0.4.0/24  | rt-2a | 10.0.5.0/24      | rt-2b | 10.0.6.0/24 | rt-2c |
       |              |       |                  |       |             |       |   
       |              +-------+                  +-------+             +-------+
private|                      |                          |                     |   
subnets|                      |                          |                     |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       |                      |                          |                     |   
       +----------------------+--------------------------+---------------------+
       |         AZ 1         |          AZ 2            |        AZ 3         |

Tuesday, January 1, 2019

AWS 21- RDS Databases

There are 5 types of Databases

Overview:

1) RDS - Relational Database Service
Structured DB.
SQL Server, Oracle, MySQL, PostgreSQL, Amazon Aurora (MySQL flavor), MariaDB

2) DynamoDB - This is a NoSQL database (similar to Cassandra and MongoDB)
3) ElastiCache - In-memory DB used for caching by websites (Redis and Memcached)
4) Neptune - Graph database (very new)
5) Amazon Redshift - Used for data warehousing

Heavily used databases are RDS, DynamoDB and Amazon Redshift.

RDS is platform as a service (PaaS)
AWS takes care of the database
They are responsible for maintaining it
Ready-made service
PaaS is like a furnished house: we do not have control over it

No chance for customization as needed.
