Learning from Operational mistakes

February 2nd, 2019

In the world of Ops, it’s always good to learn from mistakes. It is not enough to solve the problem (*fix*); we must also do a post-mortem to understand what went wrong (*root cause*) and what we can do to prevent it in the future (*long term solution*).

I am of the opinion that long term solutions are preferable to short term fixes (hacks!). But long term solutions are not easy: they almost always require understanding the root cause, and that is not always obvious.

After any incident, crisis, problem, whatever you want to call it, make sure you have a *blame-free* post-mortem. This is very important. We are not looking to blame anyone; we should focus on the root cause and how it can be prevented from happening again. Going into a post-mortem with the right mind-set also helps the process go much more smoothly. You will get better cooperation from the involved parties. It’s a team effort to improve everyone’s job.

The process should be something like this.

  • Assign an owner of the post-mortem process, usually the lead engineer involved in the incident. That person is empowered to call for help from anyone needed.
  • Set a specific deadline by which the post-mortem must conclude. You do not want to let it drag on. Let’s get it done and move on. I recommend no more than two weeks from the date of the incident.
  • Communicate what is expected of the post-mortem output.
    • When — Timeline of incidents
    • What — specific details of alerts, failures, etc.
    • Communications during the incidents — within team, with other teams, internal and external (customers, press).
    • Root Cause Analysis
    • Prevention
      • Training – better training
      • Monitoring – better monitoring (add monitor, alerts)
      • Failure detection – missed failures
      • SPOF – Single Point of Failure. Add redundancy. Re-architect.
      • etc.
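
To make the outline above concrete, here is a minimal sketch of a post-mortem record. The class and field names are my own invention for illustration, not part of any tool; the only rule taken from the list is the two-week deadline.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class PostMortem:
    """Hypothetical record mirroring the outline above (names are illustrative)."""
    owner: str
    incident_date: date
    timeline: list = field(default_factory=list)        # When
    details: list = field(default_factory=list)         # What
    communications: list = field(default_factory=list)  # Who was told what
    root_cause: str = ""
    prevention: list = field(default_factory=list)

    @property
    def due_date(self) -> date:
        # No more than two weeks from the date of the incident.
        return self.incident_date + timedelta(weeks=2)

    def missing_sections(self) -> list:
        # Names of the sections that are still empty.
        sections = ("timeline", "details", "communications",
                    "root_cause", "prevention")
        return [name for name in sections if not getattr(self, name)]


pm = PostMortem(owner="lead engineer", incident_date=date(2019, 2, 2))
print(pm.due_date, pm.missing_sections())
```

A record like this makes it easy to see, at the deadline, which sections nobody has filled in yet.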

It’s good if we can learn from past mistakes. It is even better if we can learn from others’ mistakes!

Here is the start of a list of Operational mistakes published on the web. I will be adding more as I find them. Feel free to submit any that I missed. Thanks!

Kubernetes

 


Differences between API Gateway and Service Mesh

May 24th, 2018

https://gluesolution.xyz/devops/2018/05/22/What-Is-Difference-Between-An-API-Gateway-And-A-Service-Mesh.html

Intuitively, I knew they were different, but I could not explain it as clearly as the post above does.

 

Monitoring SendGrid with Elasticsearch

April 20th, 2018

If you are using SendGrid as a service for your outbound email, you will want to monitor it and be able to answer questions such as:

  • how much email are you sending
  • status of sent email – success, bounced, delayed, etc.
  • trends
  • etc.

We get questions all the time from $WORK customer support folks on whether an email sent to a customer got there (the customer claims they never got it). There could be any number of reasons why a customer does not see an email sent from us.

  • our email is filtered into the customer’s spam folder
  • the email is rejected/bounced by the customer’s mail service
  • any number of network/server/service related errors between us and the customer’s mail service
  • the email address the customer provided is invalid (and the email bounced)

With access to the event logs from SendGrid, we can quickly answer these types of questions.

Luckily, SendGrid offers an Event Webhook.

Verbatim quote from the link above:

SendGrid’s Event Webhook will notify a URL of your choice via HTTP POST with information about events that occur as SendGrid processes your email. Common uses of this data are to remove unsubscribes, react to spam reports, determine unengaged recipients, identify bounced email addresses, or create advanced analytics of your email program. With Unique Arguments and Category parameters, you can insert dynamic data that will help build a sharp, clear image of your mailings.
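
To give a feel for the data, here is a minimal sketch that tallies one webhook POST body by event type. It assumes the body is a JSON array of event objects, each with an `event` field; the sample addresses, timestamps and the `count_events` helper are made up for illustration.

```python
import json
from collections import Counter


def count_events(body: str) -> Counter:
    """Tally one webhook POST body by event type.

    Assumes the body is a JSON array of objects that each carry an
    "event" field; anything without one is counted as "unknown".
    """
    events = json.loads(body)
    return Counter(e.get("event", "unknown") for e in events)


# Made-up sample payload (addresses and timestamps are illustrative).
sample_body = json.dumps([
    {"email": "alice@example.com", "timestamp": 1524182400, "event": "delivered"},
    {"email": "bob@example.com", "timestamp": 1524182460, "event": "bounce"},
    {"email": "carol@example.com", "timestamp": 1524182520, "event": "delivered"},
])

print(count_events(sample_body))
```

This is only to show the shape of the events; the listener described below does the real work of receiving them and storing them in Elasticsearch.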

Log in to your SendGrid account and click on Mail Settings.

Then click on Event Notification.

 

In HTTP Post URL, enter the DNS name of the service endpoint you are going to set up next.

For example, mine is (not a valid endpoint, but close enough): https://sendlog.mydomain.com/logger

I do not believe in re-inventing the wheel, and Adly Abdullah has already written a simple SendGrid event listener (note: the link is to my forked version, which works with ES 6.x). It is a Node.js service that you can install via npm.

$ sudo npm install -g sendgrid-event-logger pm2

The command above also installs pm2 (Process Manager 2), a very nice Node.js process manager.

Next, edit and configure sendgrid-event-logger (SEL for short). If the default config works for you, there is no need to do anything. Check that it points to where your ES host is located (mine is running on the same instance, hence localhost). I also left SEL listening on port 8080, as that port is available on this instance.


$ cat /etc/sendgrid-event-logger.json
{
    "elasticsearch_host": "localhost:9200",
    "port": 8080,
    "use_basicauth": true,
    "basicauth": {
        "user": "sendgridlogger",
        "password": "KLJSDG(#@%@!gBigSecret"
    },
    "use_https": false,
    "https": {
        "key_file": "",
        "cert_file": ""
    },
    "days_to_retain_log": 365
}

NOTE: I have use_https set to false because my nginx front-end is already using https.
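
Before starting the service, a quick sanity check of the config can save a restart or two. This is a minimal sketch: the key names come from the file above, but which keys SEL actually requires, and the rules in `check_config`, are my assumptions.

```python
import json

# Keys taken from the config shown above; treating them all as required
# is an assumption, not something SEL documents.
EXPECTED_KEYS = ("elasticsearch_host", "port", "use_basicauth",
                 "use_https", "days_to_retain_log")


def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = [f"missing key: {k}" for k in EXPECTED_KEYS if k not in cfg]
    if cfg.get("use_basicauth"):
        auth = cfg.get("basicauth", {})
        if not auth.get("user") or not auth.get("password"):
            problems.append("use_basicauth is true but user/password is empty")
    if cfg.get("use_https"):
        tls = cfg.get("https", {})
        if not tls.get("key_file") or not tls.get("cert_file"):
            problems.append("use_https is true but key_file/cert_file is empty")
    return problems


# Same shape as /etc/sendgrid-event-logger.json, with a placeholder password.
sample_cfg = json.loads("""
{
    "elasticsearch_host": "localhost:9200",
    "port": 8080,
    "use_basicauth": true,
    "basicauth": {"user": "sendgridlogger", "password": "changeme"},
    "use_https": false,
    "https": {"key_file": "", "cert_file": ""},
    "days_to_retain_log": 365
}
""")

print(check_config(sample_cfg))
```

To check the real file, load it with `json.load(open("/etc/sendgrid-event-logger.json"))` instead of the inline sample.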

Since SEL is listening on port 8080 (an unprivileged port, above 1023), you can run it as yourself rather than as root.

$ pm2 start sendgrid-event-logger -i 0 --name "sendgrid-event-logger"

Verify that SEL is running.

$ pm2 show 0

Describing process with id 0 - name sendgrid-event-logger
┌───────────────────┬──────────────────────────────────────────────────────┐
│ status            │ online                                               │
│ name              │ sendgrid-event-logger                                │
│ restarts          │ 0                                                    │
│ uptime            │ 11m                                                  │
│ script path       │ /usr/bin/sendgrid-event-logger                       │
│ script args       │ N/A                                                  │
│ error log path    │ $HOME/.pm2/logs/sendgrid-event-logger-error-0.log    │
│ out log path      │ $HOME/.pm2/logs/sendgrid-event-logger-out-0.log      │
│ pid path          │ $HOME/.pm2/pids/sendgrid-event-logger-0.pid          │
│ interpreter       │ node                                                 │
│ interpreter args  │ N/A                                                  │
│ script id         │ 0                                                    │
│ exec cwd          │ $HOME                                                │
│ exec mode         │ fork_mode                                            │
│ node.js version   │ 8.11.1                                               │
│ watch & reload .  │ ✘                                                    │
│ unstable restarts │ 0                                                    │
│ created at        │ 2018-02-14T23:36:06.705Z                             │
└───────────────────┴──────────────────────────────────────────────────────┘
Code metrics value
┌─────────────────┬────────┐
│ Loop delay .    │ 0.68ms │
│ Active requests │ 0      │
│ Active handles  │ 4      │
└─────────────────┴────────┘

I use nginx and here is my nginx config for SEL.

/etc/nginx/sites-available $ cat sendgrid-logger
upstream sendgrid_logger {
  server 127.0.0.1:8080;
}

server {
  server_name slog.mysite.org slog;
  listen 443 ssl ;

  include snippets/ssl.conf;
  access_log /var/log/nginx/slog/access.log;
  error_log /var/log/nginx/slog/error.log;
  proxy_connect_timeout 5m;
  proxy_send_timeout 5m;
  proxy_read_timeout 5m;

  location / {
    proxy_pass http://sendgrid_logger;
  }
}
$ sudo ln -s /etc/nginx/sites-available/sendgrid-logger /etc/nginx/sites-enabled/
$ sudo systemctl reload nginx

Make sure the SendGrid Event Webhook is turned on, and you should see events coming in. Check your Elasticsearch cluster for new indices.

$ curl -s localhost:9200/_cat/indices|grep mail
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb

etc.
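
If you want numbers rather than eyeballing the listing, here is a minimal sketch that parses the `_cat/indices` output above and totals the document counts. It assumes the default column order shown (index name in the third column, docs.count in the seventh); `docs_per_index` is a name I made up.

```python
def docs_per_index(cat_output: str, prefix: str = "mail-") -> dict:
    """Map index name to document count from `_cat/indices` text output.

    Assumes the default columns: health, status, index, uuid, pri, rep,
    docs.count, docs.deleted, store.size, pri.store.size.
    """
    counts = {}
    for line in cat_output.splitlines():
        cols = line.split()
        if len(cols) < 7 or not cols[2].startswith(prefix):
            continue  # skip blank lines and unrelated indices
        counts[cols[2]] = int(cols[6])
    return counts


# The four lines from the listing above.
sample = """\
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb
"""

counts = docs_per_index(sample)
print(sum(counts.values()), "events across", len(counts), "daily indices")
```

Each index holds one day of events, so a day with zero or unusually few documents is a cheap hint that the webhook or the listener was down.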

In Kibana, set up an index pattern. In my case, it’s mail-*. Then go to Discover, select the mail-* index pattern, and play around.

Here is my simple report. I can see that around 9am something caused a huge spike in mail events.

 

The next step is to create dashboards that fit your needs.
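
For the customer support question from earlier ("did this email get there?"), a direct query can be quicker than a dashboard. Here is a minimal sketch that builds the search body. It assumes SEL stores the SendGrid fields under their original names (`email`, `timestamp`), which I have not verified against the index mapping; `build_lookup_query` is a made-up helper.

```python
import json


def build_lookup_query(email: str, size: int = 20) -> dict:
    """Build an Elasticsearch search body for one recipient's latest events.

    Field names "email" and "timestamp" are assumptions about how the
    events are indexed; adjust them to match your mapping.
    """
    return {
        "size": size,
        "sort": [{"timestamp": {"order": "desc"}}],
        # match_phrase keeps the whole address together even if the field
        # is an analyzed text field rather than a keyword.
        "query": {"match_phrase": {"email": email}},
    }


body = build_lookup_query("customer@example.com")
print(json.dumps(body, indent=2))
```

POST the resulting JSON to `mail-*/_search` on your Elasticsearch host and look at the `event` value of the most recent hits.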

Enjoy!

 

USB Ethernet Adapters for TiVo

March 27th, 2018


Here is a collected list of USB adapters I got from
http://www.tivocommunity.com/tivo-vb/showthread.php?s=&threadid=54620&pagenumber=3

I bought a cheap one (Farallon USB1.1 to ethernet) for $13 from Computer
Geek, and it worked great. Just plug-n-play 🙂

09/11/2005 Got word from Antonio Carlos that a Linksys USB200M
works great.

06/17/2004 I’ve received feedback from Rob Clark
that a D-Link DSB-H3ETX (USB to enet adapter) also works.
He bought his locally for $15 and the link he sent is
http://support.dlink.com/products/view.asp?productid=DSB%2DH3ETX

Basically, any USB-enet adapter that uses the Pegasus chipset should
work with TiVo, as Linux has driver support for that chip.

3COM
3Com USB Ethernet 3C460B

ABOCOM
USB 10/100 Fast Ethernet
USB HPNA/Ethernet

ACCTON
Accton USB 10/100 Ethernet Adapter
SpeedStream USB 10/100 Ethernet

ADMTEK
ADMtek ADM8511 Pegasus II USB Ethernet
ADMtek AN986 Pegasus USB Ethernet (eval. board)

ALLIEDTEL
Allied Telesyn Int. AT-USB100

BELKIN
Belkin F5D5050 USB Ethernet

BILLIONTON
Billionton USB-100
Billionton USBE-100
Billionton USBEL-100
Billionton USBLP-100

COMPAQ
iPAQ Networking 10/100 USB

COREGA
Corega FEter USB-TX

DLINK
D-Link DSB-650
D-Link DSB-650TX
D-Link DSB-650TX(PNA)

ELSA
Elsa Micolink USB2Ethernet

HAWKING
Hawking UF100 10/100 Ethernet

IODATA
IO DATA USB ET/TX
IO DATA USB ET/TX-S

KINGSTON
Kingston KNU101TX Ethernet

LANEED
LANEED USB Ethernet LD-USB/T
LANEED USB Ethernet LD-USB/TX

LINKSYS
Linksys USB100TX
Linksys USB10TX
Linksys USB Ethernet Adapter
Linksys USB USB10TX
Linksys USB100M
Linksys USB200M

MELCO
MELCO/BUFFALO LUA2-TX
MELCO/BUFFALO LUA-TX

SIEMENS
SpeedStream USB 10/100 Ethernet

SMARTBRIDGES
smartNIC 2 PnP Adapter

SMC
SMC 202 USB Ethernet

SOHOWARE
SOHOware NUB100 Ethernet

Tin “Tin Man” Le /
tin@le.org



Last Updated: $Date: 2003/08/19 04:32:49 $


Blockchain RSS

January 28th, 2018

I created an RSS feed for cryptocurrency prices. It lists the 10 that I am interested in, but I can certainly add more. Let me know if you want one added to the list.

https://blog.tinle.org/blockchain/

Site instabilities due to Meltdown and Spectre (indirectly)

January 9th, 2018

You may have noticed that this blog has been mostly unavailable or showing 5xx errors lately. That is because I am on AWS, and the recent Intel vulnerabilities have all the cloud vendors patching and rebooting their hypervisors, which is causing various issues with my infrastructure.

I don’t blame the vendors; they are doing what they are supposed to be doing :-). I am waiting for my turn… when the clouds are done with their patching, I will have to patch my instances and reboot them too. Ugh, joy…

Categories: Cloud, EC2

Optimizing webservers

September 7th, 2017

This is an awesome article from Alexey Ivanov on tuning your web servers.

https://blogs.dropbox.com/tech/2017/09/optimizing-web-servers-for-high-throughput-and-low-latency/

Categories: Tech

Bye bye Sun and Solaris :-(

September 7th, 2017

So sad… but it was inevitable: Oracle is killing Solaris and Sun.

Oracle Finally Killed Sun

Categories: Java, Tech

Fair use of web content

August 11th, 2017

This story was buried among many other news items, but I feel it deserves wider attention.

It is about “fair use” of publicly available web content. What is “fair use”, and when can content be restricted?

The original article is here.

A small company called hiQ is locked in a high-stakes battle over Web scraping with LinkedIn. It’s a fight that could determine whether an anti-hacking law can be used to curtail the use of scraping tools across the Web.

HiQ scrapes data about thousands of employees from public LinkedIn profiles, then packages the data for sale to employers worried about their employees quitting. LinkedIn, which was acquired by Microsoft last year, sent hiQ a cease-and-desist letter warning that this scraping violated the Computer Fraud and Abuse Act, the controversial 1986 law that makes computer hacking a crime. HiQ sued, asking courts to rule that its activities did not, in fact, violate the CFAA.

James Grimmelmann, a professor at Cornell Law School, told Ars that the stakes here go well beyond the fate of one little-known company.

I will leave it up to you to read it and form your own opinion.

Warranty service for Enphase converter

May 26th, 2017

Is anyone else having issues getting warranty service for their solar panel converter? Enphase claims a 15-year warranty. My system was installed in mid-2010, and one of the converters has failed. It is the only one that has been reporting low/no voltage for the past 4 weeks. The rest of my 20+ panels and converters are fine.

It looks like Enphase is not geared to support home end users. They keep redirecting me to my “installer”. Unfortunately, my installer went out of business a few years ago. Yes, live and learn for me. Next time I’ll use a more reputable company.

In any case, Enphase is giving me the runaround. Sounds like it is time to complain to a consumer protection agency and the local state agency.