Website Down: Guided Troubleshooting Story

This is the order I used when a site went down and people were already asking for answers. I started from the outside (DNS, TLS, routing) and worked inward toward the logs.

Recommended Triage Order

I never start with the application logs. If the domain, certificate, or routing is broken, requests may never reach the app, so its logs can send you in the wrong direction and waste time.

Order | Layer | Why it comes here
1 | DNS / public IP | Prove the domain resolves to the correct destination.
2 | TLS / certificate | Prove the site can complete HTTPS without a certificate or hostname problem.
3 | Load balancer / reverse proxy | Prove traffic can reach the front door and is routed to a healthy backend.
4 | Web server / app host | Check the service process, port binding, and app startup logs.
5 | App dependencies | Check database, cache, queue, filesystem, and external API reachability.
6 | Application logs | Now the logs are useful because the upstream layers are already proven.
7 | Recovery validation | Prove the fix works from browser, API, and monitoring.
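
The outside-in order in the table can be sketched as a small runner that stops at the first layer that fails. This is a minimal Python sketch, not the tooling used in the incident: the function name `first_failing_layer` and the placeholder lambdas are hypothetical, and each lambda would be replaced by a real probe for that layer.

```python
from typing import Callable, List, Optional, Tuple

# (layer name, check function). Each check returns True when the layer is healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first layer that fails, or None."""
    for layer, check in checks:
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated as a failure of that layer.
            healthy = False
        if not healthy:
            return layer
    return None


if __name__ == "__main__":
    # Placeholder results only; swap in real probes for your environment.
    checks = [
        ("DNS", lambda: True),
        ("TLS", lambda: True),
        ("Routing", lambda: False),  # simulated proxy failure
        ("App", lambda: True),
    ]
    print(first_failing_layer(checks))  # prints "Routing"
```

Because the loop returns on the first failure, later layers are never probed until the earlier ones are proven, which is the point of the ordering.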

Incident Path

This is the first pass I made in the opening minutes of an outage. Each step either confirmed a layer or told me where to look next.

01. DNS: Confirm the domain resolves to the expected public IP.
02. TLS: Confirm certificate chain, SNI, and expiry.
03. Routing: Confirm load balancer, reverse proxy, and firewall path.
04. App: Confirm the service is up and the host is serving requests.
05. Validate: Prove the fix works from browser and monitoring.

What each step covers:

01. Resolve the Domain: Check that DNS resolves to the correct public IP and that the record did not drift.
02. Validate TLS: Confirm the certificate, SNI host name, chain, and expiry before moving deeper.
03. Check Routing: Verify load balancer, reverse proxy, firewall, and port bindings before the app layer.
04. Check App and Dependencies: Inspect service status, logs, database, cache, queue, and recent deployment changes.

Windows Commands
DNS, TLS, and HTTP checks from a Windows machine.

DNS and routing

nslookup anupkumarctech.com
Resolve-DnsName anupkumarctech.com
ping anupkumarctech.com
tracert anupkumarctech.com

Use this to confirm the public hostname points where you expect before checking the server.
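
The same DNS check can be scripted so the resolved addresses are compared against the ones you expect. This is a minimal Python sketch under stated assumptions: `resolve_ips` and `dns_matches` are hypothetical helper names, the expected IP is something you supply, and the lookup goes through the local resolver, so a stale local cache will show up here just as it does with nslookup.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ips(hostname: str) -> Set[str]:
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_matches(hostname: str, expected: Iterable[str],
                resolver: Callable[[str], Set[str]] = resolve_ips) -> bool:
    """True when the name resolves and every address is one of the expected ones."""
    try:
        resolved = resolver(hostname)
    except socket.gaierror:
        return False  # the name did not resolve at all
    return bool(resolved) and resolved <= set(expected)
```

A call such as `dns_matches("anupkumarctech.com", {"203.0.113.10"})` would return False if the record drifted; the address shown is a documentation placeholder, not the site's real IP.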

TLS and HTTP

curl.exe -I https://anupkumarctech.com/articles/website-down-runbook/
curl.exe -vk https://anupkumarctech.com
openssl s_client -connect anupkumarctech.com:443 -servername anupkumarctech.com

Use this to verify certificate chain, SNI, redirect behavior, and HTTPS handshake.
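
Certificate expiry is easy to miss when reading handshake output by eye, so it can help to compute the remaining days. The following is a minimal Python sketch using only the standard library; the helper names `days_until_expiry` and `fetch_not_after` are hypothetical, and the default context verifies the chain and hostname, so a broken certificate raises an error instead of returning a date.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Complete a verified TLS handshake with SNI and return the notAfter field."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after("anupkumarctech.com"))` gives the days remaining; an `ssl.SSLCertVerificationError` from `fetch_not_after` points at the chain or hostname rather than expiry alone.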

Linux Commands
Web server, ports, logs, and local app health checks.

Service and ports

systemctl status nginx
systemctl status apache2
ss -lntp | grep ':80\|:443'
sudo journalctl -u nginx -n 100 --no-pager

Use this to check if the web server is running and listening on the expected ports.
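
Seeing a port in the ss output shows that something is bound to it; a connection attempt shows that it actually accepts traffic. This is a minimal Python sketch of that second check, with `port_open` as a hypothetical helper name.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_open("127.0.0.1", 443)` run on the web host checks the local listener, and the same call against a database host and port checks the network path to a dependency.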

Application and dependencies

ps aux | grep -i dotnet
tail -n 100 /var/log/nginx/error.log
tail -n 100 /var/log/app/app.log
curl -I http://127.0.0.1:5000/health
psql -h db-host -U appuser -d appdb -c "select 1;"

Use this to confirm the app process, local health endpoint, logs, and database connectivity.
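
The local health check can also be scripted so the status code is captured rather than read off the terminal. This is a minimal Python sketch with a hypothetical helper name, `http_status`. It sends a GET rather than the HEAD request that `curl -I` sends, and the `/health` path on port 5000 is taken from the command above, not a guarantee about any particular app.

```python
import urllib.error
import urllib.request
from typing import Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for a GET to url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, or DNS failure
```

A result of None from `http_status("http://127.0.0.1:5000/health")` means the app is not answering locally at all, while a 5xx code means it is running but unhealthy.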

Database Connectivity
Direct connection tests for SQL Server, PostgreSQL, and MySQL.

SQL Server

Test-NetConnection db-host -Port 1433
sqlcmd -S db-host -U appuser -P "password" -Q "SELECT 1"

Use this when the site depends on Microsoft SQL Server.

PostgreSQL / MySQL

Test-NetConnection db-host -Port 5432
psql -h db-host -U appuser -d appdb -c "select 1;"

Test-NetConnection db-host -Port 3306
mysql -h db-host -u appuser -p -e "select 1;"

Use this when the site depends on PostgreSQL or MySQL.

Windows IIS and App Host
Bindings, sites, app pools, and event logs on Windows.

IIS bindings and sites

Import-Module WebAdministration
Get-Service W3SVC
Get-Website
Get-WebBinding
Get-ChildItem IIS:\AppPools

Use this to confirm the site exists, the bindings are correct, and the app pool is in a healthy state.

Service and event logs

Get-WinEvent -LogName Application -MaxEvents 20
Get-WinEvent -LogName System -MaxEvents 20
curl.exe -vk http://localhost

Use this to check for startup failures, pool crashes, and local response behavior.

Failure Map
Quick symptom-to-layer lookup for faster triage.
Symptom | Likely layer | What to do
Domain points to wrong IP | DNS | Fix the A record or stale cache.
HTTPS fails but HTTP works | TLS | Check certificate, SNI, chain, and expiry.
404 or wrong site responds | Reverse proxy / routing | Check nginx/IIS bindings and site mapping.
502/503 | App process | Check service is running and listening.
App starts but one page fails | Dependency | Check database, cache, queue, or API call.
Only database-backed pages fail | Database | Test the DB connection directly before changing code.

Inference Guide
What each symptom usually means before you start changing things.

If this is happening...

  • Domain resolves wrong: DNS change, stale cache, or wrong environment record.
  • HTTP works but HTTPS fails: certificate, SNI, or binding mismatch.
  • Home page works but one app page fails: dependency issue in that code path.
  • Only after deployment: release regression, config drift, or missing secret.
  • Only some users fail: CDN edge, geo, browser cache, or auth session issue.

What to assume next

  • Do not assume the app is broken until DNS, TLS, and routing are proven.
  • Do not assume the database is broken until the local app and network path are healthy.
  • Do not assume the browser is the problem until server-side checks are complete.
  • Do not restart services blindly; capture the exact failure first.
  • Do not close the incident until one clean request path is verified end to end.
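
The symptom-to-layer lookup and the rules above can be condensed into one function that takes the results of the outside-in checks and names the layer to inspect first. This is a minimal Python sketch: `likely_layer` is a hypothetical name, and two branches are my own assumptions rather than rows from the table (no response at all is treated as a routing problem, and 5xx codes other than 502/503 are treated as a dependency problem).

```python
from typing import Optional


def likely_layer(dns_ok: bool, tls_ok: bool, status: Optional[int]) -> Optional[str]:
    """Return the first layer worth inspecting, or None if nothing looks wrong."""
    if not dns_ok:
        return "DNS"
    if not tls_ok:
        return "TLS"
    if status is None or status == 404:
        # Assumption: no answer at all is treated like a routing problem.
        return "Reverse proxy / routing"
    if status in (502, 503):
        return "App process"
    if status >= 500:
        # Assumption: other 5xx codes mean the app ran but a dependency failed.
        return "Dependency"
    return None
```

The order of the branches mirrors the rule that the app is not assumed broken until DNS, TLS, and routing are proven.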
Delivery Experience
A real outage pattern, the checks I ran, and the fix order I followed.

01. Symptom: After deployment, the site returns 502 errors but DNS still resolves.
02. External: Confirm HTTPS, redirects, and the public response path first.
03. Host: Verify reverse proxy, bindings, ports, and the app process.
04. Dependency: Test the health endpoint and database connectivity if only some pages fail.
05. Validate: Re-test from browser and monitoring before closing the incident.

What Happened

The site starts returning 502 errors after a deployment. Users can resolve DNS and reach the domain, but the home page does not load.

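A 502 means a proxy or load balancer answered but could not get a valid response from the backend. The working assumption is therefore that the failure sits between the edge and the app host, not that the entire app is down.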

  • Step 1: Confirm DNS and TLS still work.
  • Step 2: Check the reverse proxy and port bindings.
  • Step 3: Confirm the app service is running.
  • Step 4: Check the app log for startup or dependency failure.
  • Step 5: Validate database connectivity if only some pages fail.

What I’d run first

curl -vk https://anupkumarctech.com
Get-WebBinding
Get-Service W3SVC
systemctl status nginx
ss -lntp | grep ':80\|:443'
tail -n 100 /var/log/nginx/error.log
curl -I http://127.0.0.1:5000/health
Test-NetConnection db-host -Port 5432

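That order keeps the diagnosis focused. Each command proves one layer before moving to the next. Get-WebBinding and Get-Service W3SVC apply to a Windows IIS host; systemctl, ss, and tail apply to a Linux host, so run whichever set matches the server.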
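
Once the failing layer is fixed, the final validation step can be made a little stricter than a single successful request. This is a minimal Python sketch, assuming that a few consecutive successes are a reasonable bar for recovery; the threshold of three, the name `recovered`, and the `probe` callable are my own choices for illustration, not part of the original runbook.

```python
import time
from typing import Callable


def recovered(probe: Callable[[], bool], required: int = 3,
              attempts: int = 10, delay: float = 0.0) -> bool:
    """True once probe succeeds `required` times in a row within `attempts` tries."""
    streak = 0
    for _ in range(attempts):
        # A single failure resets the streak, so flapping does not count as recovery.
        streak = streak + 1 if probe() else 0
        if streak >= required:
            return True
        if delay:
            time.sleep(delay)
    return False
```

The probe would typically wrap one end-to-end request against the public URL, so that a pass covers DNS, TLS, routing, and the app in a single path.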