Monitoring Guide · Performance Engineering

What Do We Monitor During Performance Testing?

The objective isn't just to generate load; it's to continuously monitor the entire system's behavior to identify bottlenecks, validate non-functional requirements (NFRs), and ensure stability.

Identify Bottlenecks · Detect Failures · Validate NFRs · Analyze Scalability · Find Memory Leaks · Detect Slow Queries
Client-Side
User perspective: Response Time, Throughput, Error Rate, Latency
Server-Side
Infrastructure: CPU, Memory, Disk I/O, Network, Thread Pool
APM
Code-level: DB Queries, GC, Cache Hit Ratio, Exceptions

1. Client-Side Metrics

User Perspective
A. Response Time

Total time from sending the request to receiving the complete response. Includes network delay, server processing, and data transfer.
Response Time = Network Delay + Server Processing + Data Transfer
Metric            Meaning
Average RT        Mean of all requests
90th Percentile   90% of requests completed within this time
95th Percentile   95% of requests completed within this time; the percentile most often used in NFRs
Min / Max RT      Fastest / slowest request
NFR Example: 95% of login requests should complete within 2 seconds.
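As a rough illustration of how the table above maps onto an NFR check, the sketch below computes the average, percentiles, and min/max from a list of response times. The sample values, the `percentile` and `summarize` helper names, and the nearest-rank percentile method are assumptions made for this example; load-testing tools may interpolate percentiles differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples):
    """Return the response-time metrics listed in the table above."""
    return {
        "avg": sum(samples) / len(samples),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
        "min": min(samples),
        "max": max(samples),
    }

# Hypothetical login response times in seconds (not measured data).
login_rt = [0.8, 0.9, 1.0, 1.1, 1.2, 1.2, 1.3, 1.4, 1.5, 1.6,
            1.6, 1.7, 1.7, 1.8, 1.8, 1.9, 1.9, 1.9, 2.0, 3.5]

stats = summarize(login_rt)
nfr_met = stats["p95"] <= 2.0  # NFR: 95% of login requests within 2 seconds
print(stats, "NFR met:", nfr_met)
```

Note that the single 3.5 s outlier barely moves the average but shows up in Max RT, which is why percentiles and min/max are reported alongside the mean.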
B. Throughput

Number of requests or transactions processed per unit time. Measures system capacity: a system that sustains higher throughput at acceptable response times has more capacity.
Throughput = Total Requests / Time → e.g. 6000 requests / 60 s = 100 RPS
Units: Requests/sec (RPS) · Transactions/sec (TPS) · Hits/sec · Bytes/sec
Low throughput indicates: Server overload, DB bottleneck, connection pool issue, or network congestion.
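The throughput formula above can be sketched in a few lines. The example also buckets completion timestamps into one-second windows, since a dip in a per-second count is often easier to spot than a change in the overall average. The function names and the timestamps are hypothetical.

```python
from collections import Counter

def throughput(total_requests, duration_seconds):
    """Overall throughput in requests per second."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return total_requests / duration_seconds

def per_second_counts(completion_times):
    """Count completed requests per 1-second window (times are seconds since test start)."""
    return Counter(int(t) for t in completion_times)

overall_rps = throughput(6000, 60)  # the worked example from the text

# Hypothetical completion timestamps: throughput drops from 3 to 1 per second.
counts = per_second_counts([0.1, 0.5, 0.9, 1.2, 1.8, 2.4])
print(overall_rps, dict(counts))
```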
C. Error Rate

Percentage of failed requests compared to total requests. Increasing error rate indicates system instability or resource exhaustion.
Error Rate = (Failed / Total) × 100 → e.g. (200 / 10000) × 100 = 2%
Status Code   Meaning
500           Internal Server Error
502           Bad Gateway
503           Service Unavailable
504           Gateway Timeout
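A minimal sketch of the error-rate calculation, assuming that only 5xx responses count as failures (as in the table above). Whether 4xx responses, timeouts, or failed assertions also count depends on how the test is defined. The status-code list is made up to reproduce the 2% example.

```python
from collections import Counter

def error_rate(status_codes):
    """Percentage of responses with a 5xx status code (assumed definition of 'failed')."""
    if not status_codes:
        raise ValueError("no responses recorded")
    failed = sum(1 for code in status_codes if code >= 500)
    return 100 * failed / len(status_codes)

def failure_breakdown(status_codes):
    """Count failures per status code to see which kind of error dominates."""
    return Counter(code for code in status_codes if code >= 500)

# Hypothetical run: 10,000 responses, 200 of them failed.
statuses = [200] * 9800 + [500] * 120 + [503] * 50 + [504] * 30

rate = error_rate(statuses)
breakdown = failure_breakdown(statuses)
print(f"{rate}% errors", dict(breakdown))
```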
D. Latency

Time between sending the request and receiving the first byte. Measures initial server responsiveness and network waiting time.
Metric          Measures
Latency         Time until first byte
Response Time   Time until full response
High latency indicates: network delay · load balancer (LB) issue · DNS issue · slow backend initialization
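To make the latency versus response time distinction concrete, the sketch below derives both from three timestamps per request. The `RequestTiming` class and its field names are hypothetical; real tools expose these values under their own names.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    """Timestamps for one request, in seconds since test start."""
    sent: float        # request sent
    first_byte: float  # first byte of the response received
    last_byte: float   # complete response received

    @property
    def latency(self):
        """Time until first byte."""
        return self.first_byte - self.sent

    @property
    def response_time(self):
        """Time until the full response has arrived."""
        return self.last_byte - self.sent

    @property
    def transfer_time(self):
        """Portion of the response time spent receiving the body."""
        return self.last_byte - self.first_byte

timing = RequestTiming(sent=0.0, first_byte=0.25, last_byte=1.0)
print(timing.latency, timing.response_time, timing.transfer_time)
```

A request with low latency but a long response time points at data transfer or payload size rather than at network delay or slow backend initialization.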

2. Server-Side Metrics

Infrastructure
A. CPU Utilization

Percentage of processor capacity being used. High CPU indicates heavy processing, infinite loops, poor optimization, or thread contention.
Example levels: 30% Healthy · 60% Moderate · 85% Warning · 95%+ Critical
Threshold: CPU should remain below 80% during steady state.
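One way to apply the 80% steady-state threshold is to check sampled CPU readings after the test, as sketched below. The sample values and the `cpu_steady_state_report` helper are illustrative assumptions; in practice the readings would come from a tool such as vmstat or PerfMon.

```python
def cpu_steady_state_report(samples, threshold=80.0):
    """Summarize CPU samples (percent) against a steady-state threshold."""
    if not samples:
        raise ValueError("no CPU samples")
    over = [s for s in samples if s >= threshold]
    return {
        "avg": sum(samples) / len(samples),
        "peak": max(samples),
        "pct_samples_over": 100 * len(over) / len(samples),
        "within_threshold": not over,  # True only if every sample stayed below the threshold
    }

# Hypothetical CPU % readings taken at fixed intervals during steady state.
report = cpu_steady_state_report([42, 55, 61, 58, 63, 79, 85, 60])
print(report)
```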
B. Memory Usage

Tracks RAM, heap memory, non-heap memory, and cache consumption. If memory increases continuously and never drops, even after garbage collection, a memory leak is likely.
Symptoms: GC Pauses · App Crashes · OutOfMemoryError · Slow Response
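The "increases continuously and never drops" pattern can be approximated by fitting a straight line to heap usage over time and looking at the slope. This is only a heuristic sketch: the sample data, the 1 MB/min cutoff, and the function names are assumptions, and a steady upward trend is a reason to take a heap dump, not proof of a leak.

```python
def slope_per_minute(samples):
    """Least-squares slope of (minute, heap_mb) points, in MB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx

def looks_like_leak(samples, min_slope_mb_per_min=1.0):
    """Flag a sustained upward trend (the cutoff is a hypothetical value)."""
    return slope_per_minute(samples) >= min_slope_mb_per_min

# Hypothetical heap usage after GC, as (minute, MB) pairs.
growing = [(0, 500), (10, 620), (20, 750), (30, 880), (40, 1010)]
stable = [(0, 500), (10, 640), (20, 510), (30, 650), (40, 505)]

print(slope_per_minute(growing), looks_like_leak(growing))
print(slope_per_minute(stable), looks_like_leak(stable))
```

Sampling heap usage right after a collection, rather than at arbitrary moments, keeps the normal sawtooth pattern from masking the trend.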
C. Disk I/O

Measures read/write operations, disk queue length, and disk latency. DB-heavy apps depend heavily on disk performance.
Slow disk I/O leads to: Slow DB Queries · Long API Response · Queue Buildup
D. Network Bandwidth

Amount of data transferred over the network. When utilization approaches capacity, the link saturates, causing packet drops and increased latency.
Effects: Packet Drops · Increased Latency · Request Timeout · Reduced Throughput
E. Thread Count / Connection Pool

Monitors active application threads, DB connection pool usage, and HTTP connections. When the pool is maxed out, new requests queue up and eventually time out.
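The queue-then-timeout behaviour can be illustrated with a small deterministic simulation. It assumes a first-come-first-served queue, a fixed time each request holds a connection, and a fixed wait timeout; the `simulate_pool` function and all numbers are hypothetical, and real pools behave less regularly.

```python
import heapq

def simulate_pool(arrivals, pool_size, hold_time, timeout):
    """Return (served, timed_out) for requests arriving at the given times.

    Each request takes the connection that frees up earliest; if it would
    have to wait longer than `timeout` for one, it gives up instead.
    """
    free_at = [0.0] * pool_size  # min-heap: time at which each connection is next free
    heapq.heapify(free_at)
    served = timed_out = 0
    for arrival in sorted(arrivals):
        start = max(arrival, free_at[0])
        if start - arrival > timeout:
            timed_out += 1  # waited too long; no connection is consumed
            continue
        heapq.heapreplace(free_at, start + hold_time)
        served += 1
    return served, timed_out

burst = [0.0] * 10  # ten requests arriving at the same instant

small_pool = simulate_pool(burst, pool_size=2, hold_time=1.0, timeout=2.0)
large_pool = simulate_pool(burst, pool_size=5, hold_time=1.0, timeout=2.0)
print(small_pool, large_pool)
```

With two connections only six of the ten requests get a connection before the timeout; with five connections all ten are served, at the cost of more concurrent load on the database.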
Component      Monitoring Tool
Linux Server   top, vmstat, iostat
Windows        PerfMon
Database       Oracle AWR, SQL Profiler
JVM            JConsole, VisualVM

3. Application Performance Monitoring

Code-Level
Popular APM Tools

New Relic · AppDynamics · Dynatrace · Datadog · Splunk · Grafana
A. Database Query Time

Measures execution time of SQL queries. Slow queries cause slow APIs, thread blocking, high CPU, and lock contention.
Example: SELECT * FROM ORDERS WHERE CUSTOMER_ID='1001' takes 8 seconds under load. This slow query cascades into full app degradation.
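APM tools capture query timings automatically, but the idea can be sketched by hand: wrap query execution in a timer and log anything over a threshold. The example uses an in-memory SQLite database as a stand-in for the real one; the `timed_query` helper, the table layout, and the 1-second threshold are assumptions for illustration.

```python
import sqlite3
import time

SLOW_QUERY_SECONDS = 1.0  # hypothetical threshold

def timed_query(conn, sql, params=(), slow_log=None, threshold=SLOW_QUERY_SECONDS):
    """Run a query, return (rows, elapsed_seconds), and record it if it is slow."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if slow_log is not None and elapsed >= threshold:
        slow_log.append((sql, elapsed))
    return rows, elapsed

# Stand-in ORDERS table with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(str(1000 + i % 50), float(i)) for i in range(5000)],
)

slow_queries = []
# threshold=0.0 forces the query into the log so the example prints something.
rows, elapsed = timed_query(
    conn,
    "SELECT * FROM orders WHERE customer_id = ?",
    ("1001",),
    slow_log=slow_queries,
    threshold=0.0,
)
print(len(rows), f"{elapsed:.4f}s", slow_queries)
```

Running the same timer under load, rather than on an idle database, is what exposes queries like the 8-second example above.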
B. Garbage Collection (GC)

GC frees unused memory in Java apps. Frequent GC causes repeated Stop-The-World pauses, during which the application temporarily freezes.
Example: GC runs every 2 seconds → response time spikes from 1s to 12s. Monitor via GC logs, VisualVM, JConsole, Grafana JVM dashboards.
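A simple way to quantify "frequent GC" is the share of wall-clock time spent in pauses. The sketch below assumes pause durations have already been extracted from GC logs; the 400 ms pause length is a made-up figure chosen to match a collection every 2 seconds, and `gc_pause_stats` is a hypothetical helper.

```python
def gc_pause_stats(pause_ms, window_seconds):
    """Summarize Stop-The-World pauses observed during a time window."""
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    total_ms = sum(pause_ms)
    return {
        "count": len(pause_ms),
        "total_pause_ms": total_ms,
        "max_pause_ms": max(pause_ms) if pause_ms else 0,
        # Percentage of the window during which the application was frozen.
        "overhead_pct": 100 * total_ms / (window_seconds * 1000),
        "avg_interval_s": window_seconds / len(pause_ms) if pause_ms else None,
    }

# Hypothetical: one 400 ms pause every 2 seconds over a 60-second window.
gc_stats = gc_pause_stats([400] * 30, window_seconds=60)
print(gc_stats)
```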
C. Cache Hit Ratio

Measures how effectively cache serves requests. Low ratio = more DB hits = higher load, slower responses.
Cache Hit Ratio = (Cache Hits / Total Requests) × 100 → e.g. (90 / 100) × 100 = 90%
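The hit-ratio formula can be tried out with Python's built-in `functools.lru_cache`, whose `cache_info()` method reports hits and misses. The `load_product` function stands in for a database lookup, and the request sequence is invented for the example.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def load_product(product_id):
    """Stand-in for a database lookup; only runs on a cache miss."""
    return f"product-{product_id}"

def hit_ratio(info):
    """Cache Hit Ratio = (hits / total requests) x 100."""
    total = info.hits + info.misses
    return 100 * info.hits / total if total else 0.0

# Hypothetical request sequence: 4 distinct products, 10 requests.
for product_id in [1, 2, 3, 1, 1, 2, 4, 1, 2, 3]:
    load_product(product_id)

info = load_product.cache_info()
ratio = hit_ratio(info)
print(info, f"{ratio}% hit ratio")
```

Here 6 of the 10 requests are served from the cache, so the other 4 would each have hit the database.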

End-to-End Monitoring Example

Login API response becomes slow. Monitoring reveals:
Response Time   10 sec
CPU Usage       95%
DB Query Time   8 sec
Error Rate      12%
GC Activity     Very High
Root Cause: Slow SQL query causing CPU spike and GC pressure
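The reasoning behind this diagnosis can be sketched as a threshold check plus a breakdown of where the response time goes. The observed values are the ones from this example; the 2-second response-time limit and the 80% CPU limit come from earlier in this guide, while the DB query and error-rate limits are hypothetical.

```python
observed = {"response_time_s": 10, "cpu_pct": 95, "db_query_s": 8, "error_rate_pct": 12}

limits = {
    "response_time_s": 2,   # from the login NFR example
    "cpu_pct": 80,          # steady-state CPU threshold
    "db_query_s": 1,        # hypothetical limit
    "error_rate_pct": 1,    # hypothetical limit
}

def find_breaches(observed, limits):
    """Return {metric: (observed, limit)} for every metric above its limit."""
    return {
        name: (observed[name], limit)
        for name, limit in limits.items()
        if observed[name] > limit
    }

breaches = find_breaches(observed, limits)

# Share of the total response time spent waiting on the database.
db_share_pct = 100 * observed["db_query_s"] / observed["response_time_s"]

print(breaches)
print(f"DB query accounts for {db_share_pct}% of the response time")
```

The database query alone accounts for 80% of the 10-second response, which is what points to the slow SQL query rather than CPU or GC as the starting point of the problem.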

Monitoring Summary

Category      Metrics
Client-Side   Response Time, Throughput, Error Rate, Latency
Server-Side   CPU, Memory, Disk I/O, Network, Thread Pool
APM           DB Queries, GC, Cache Hit Ratio, Exceptions