Monitoring Guide · Performance Engineering

What Do We Monitor During Performance Testing?

The objective isn't just to generate load; it's to continuously monitor the entire system's behavior to identify bottlenecks, validate non-functional requirements (NFRs), and ensure stability.

Identify Bottlenecks, Detect Failures, Validate NFRs, Analyze Scalability, Find Memory Leaks, Detect Slow Queries

👤

Client-Side

User perspective — Response Time, Throughput, Error Rate, Latency

🖥️

Server-Side

Infrastructure — CPU, Memory, Disk I/O, Network, Thread Pool

🔬

APM

Code-level — DB Queries, GC, Cache Hit Ratio, Exceptions

👤

1. Client-Side Metrics

User Perspective

Response Time

⏱️

Total time from sending the request to receiving the complete response. Includes network delay, server processing, and data transfer.

Response Time = Network Delay + Server Processing + Data Transfer

Metric	Meaning
Average RT	Mean of all requests
90th Percentile	90% completed within this time
95th Percentile	95% completed within this time; commonly used in NFRs and SLAs
Min / Max RT	Fastest / slowest request

NFR Example: 95% of login requests should complete within 2 seconds.
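
The metrics in the table can be computed from raw samples. The following is a minimal sketch, assuming response times are collected in seconds in a plain list. The function names, the sample data, and the nearest-rank percentile method are illustrative choices; load-testing tools may interpolate percentiles differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize_response_times(samples):
    """Return the response-time metrics listed in the table above."""
    return {
        "avg": sum(samples) / len(samples),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
        "min": min(samples),
        "max": max(samples),
    }

# Hypothetical login response times in seconds (one slow outlier).
login_times = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.9, 6.0]
summary = summarize_response_times(login_times)

# NFR from the text: 95% of login requests should complete within 2 seconds.
nfr_met = summary["p95"] <= 2.0
print(summary, "NFR met:", nfr_met)
```

In this sample the average stays under 2 seconds while the 95th percentile does not, which is why NFRs are usually written against percentiles rather than averages.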

Throughput

📊

Number of requests or transactions processed per unit time. Measures system capacity: throughput that keeps rising with load while response times stay acceptable indicates good scalability.

Throughput = Total Requests / Time → e.g. 6000 requests / 60 sec = 100 RPS

Requests/sec (RPS), Transactions/sec (TPS), Hits/sec, Bytes/sec

Low throughput indicates: Server overload, DB bottleneck, connection pool issue, or network congestion.
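
The throughput formula above can be sketched in a few lines. This is an illustrative example, assuming request completion times are available as seconds since test start; the function names and timestamps are hypothetical.

```python
from collections import Counter

def throughput(total_requests, duration_seconds):
    """Overall throughput in requests per second."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return total_requests / duration_seconds

def per_second_throughput(completion_times):
    """Count completed requests in each whole-second bucket."""
    buckets = Counter(int(t) for t in completion_times)
    return dict(sorted(buckets.items()))

# The example from the text: 6000 requests in 60 seconds.
overall_rps = throughput(6000, 60)

# Hypothetical completion timestamps, in seconds since test start.
timeline = per_second_throughput([0.1, 0.4, 0.9, 1.2, 1.8, 2.5])
print(overall_rps, timeline)
```

A per-second view like `timeline` helps spot the moment throughput starts to drop, which a single overall figure hides.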

Error Rate

⚠️

Percentage of failed requests compared to total requests. An increasing error rate indicates system instability or resource exhaustion.

Error Rate = (Failed / Total) × 100 → e.g. (200/10000) × 100 = 2%

Error	Meaning
500	Internal Server Error
502	Bad Gateway
503	Service Unavailable
504	Gateway Timeout
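
As a rough sketch, the error rate can be derived from the HTTP status codes recorded during a test. This example assumes only 5xx responses count as failures, matching the table above; some teams also count 4xx responses or failed assertions. The function names and sample data are hypothetical.

```python
from collections import Counter

def is_server_error(code):
    """Assumption: a request has failed when the status code is 5xx."""
    return 500 <= code <= 599

def error_rate(status_codes):
    """Error Rate = (Failed / Total) x 100."""
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if is_server_error(code))
    return 100 * failed / len(status_codes)

def error_breakdown(status_codes):
    """Count failures per status code to see which error dominates."""
    return Counter(code for code in status_codes if is_server_error(code))

# Hypothetical run: 10000 requests, 200 of which failed.
codes = [200] * 9800 + [500] * 120 + [503] * 50 + [504] * 30
rate = error_rate(codes)
breakdown = error_breakdown(codes)
print(rate, breakdown)
```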

Latency

🌐

Time between sending the request and receiving the first byte. Measures initial server responsiveness and network waiting time.

Metric	Measures
Latency	Time until first byte
Response Time	Time until full response

HIGH LATENCY INDICATES

Network Delay, LB Issue, DNS Issue, Slow Backend Init
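
The difference between latency and response time can be measured for a single request. The following is a sketch using Python's standard `http.client`; it treats the moment the status line and headers have been read as an approximation of "first byte", which is an assumption of this example. The function name and the URL in the comment are hypothetical.

```python
import http.client
import time
from urllib.parse import urlsplit

def measure_latency_and_response_time(url, timeout=10.0):
    """Return (latency_s, response_time_s, body_size) for one GET request."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    start = time.perf_counter()
    try:
        conn.request("GET", path)
        # getresponse() returns once the status line and headers have arrived,
        # used here as an approximation of time to first byte.
        response = conn.getresponse()
        latency = time.perf_counter() - start
        body = response.read()  # wait for the complete response body
        response_time = time.perf_counter() - start
    finally:
        conn.close()
    return latency, response_time, len(body)

# Example call (hypothetical URL):
# latency, total, size = measure_latency_and_response_time("http://localhost:8080/login")
```

A large gap between the two values points at data transfer; a high latency with a small gap points at the network path or slow backend processing.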

🖥️

2. Server-Side Metrics

Infrastructure

CPU Utilization

🔥

Percentage of processor capacity being used. High CPU indicates heavy processing, infinite loops, poor optimization, or thread contention.

30% Healthy, 60% Moderate, 85% Warning, 95%+ Critical

Threshold: CPU should remain below 80% during steady state.
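
A minimal sketch of checking the steady-state threshold, assuming CPU utilization has already been sampled at regular intervals (for example from `vmstat` output) into a list of percentages. The function name and sample values are hypothetical.

```python
def steady_state_cpu_report(samples, threshold=80.0):
    """Summarize CPU samples against the steady-state threshold from the text."""
    if not samples:
        raise ValueError("no samples")
    over = [s for s in samples if s >= threshold]
    return {
        "avg": sum(samples) / len(samples),
        "peak": max(samples),
        "pct_samples_over": 100 * len(over) / len(samples),
        "within_threshold": max(samples) < threshold,
    }

# Hypothetical CPU utilization samples (%) taken during steady state.
cpu_samples = [55, 62, 71, 78, 84, 91, 76, 69]
report = steady_state_cpu_report(cpu_samples)
print(report)
```

Here the average looks acceptable, but a quarter of the samples exceed 80%, so the steady-state threshold is not met.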

Memory Usage

🧠

Tracks RAM, heap memory, non-heap memory, and cache consumption. If memory increases continuously and never drops, even after garbage collection, it suggests a memory leak.

GC Pauses, App Crashes, OutOfMemoryError, Slow Response
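
One way to turn "increases continuously and never drops" into a check is to fit a trend line to heap usage measured after each garbage collection. This is a sketch under stated assumptions: the samples are (minutes, MB) pairs, and the function names and the `min_slope` cut-off are hypothetical and would need tuning per application.

```python
def heap_growth_slope(samples):
    """Least-squares slope of (time, used_heap) points, in heap units per time unit."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must differ")
    return numerator / denominator

def looks_like_leak(samples, min_slope):
    """Flag a possible leak when post-GC heap grows faster than min_slope."""
    return heap_growth_slope(samples) > min_slope

# Hypothetical post-GC heap usage: (minutes since start, MB used).
leaking = [(0, 400), (10, 450), (20, 500), (30, 550)]
stable = [(0, 400), (10, 420), (20, 395), (30, 405)]

print(heap_growth_slope(leaking), looks_like_leak(leaking, min_slope=1.0))
print(heap_growth_slope(stable), looks_like_leak(stable, min_slope=1.0))
```

A positive trend is only a hint; a heap dump comparison is still needed to confirm which objects are being retained.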

Disk I/O

💾

Measures read/write operations, disk queue length, and disk latency. DB-heavy apps depend heavily on disk performance.

Slow DB Queries, Long API Response, Queue Buildup

Network Bandwidth

📡

Amount of data transferred over the network. When utilization approaches capacity → saturation, packet drops, increased latency.

Packet Drops, Increased Latency, Request Timeout, Reduced Throughput

Thread Count / Connection Pool

🧵

Monitors active application threads, DB connection pool usage, and HTTP connections. When pool is maxed → new requests queue up → timeouts.
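
Why a slow query can exhaust a pool follows from Little's Law: average connections in use = request rate × average time each request holds a connection. The sketch below applies it to a DB connection pool; the function names, pool size, and timings are hypothetical.

```python
def connections_in_use(requests_per_second, avg_hold_seconds):
    """Little's Law: average concurrency = arrival rate x average time held."""
    return requests_per_second * avg_hold_seconds

def pool_utilization(requests_per_second, avg_hold_seconds, pool_size):
    """Demand on the pool as a percentage of its size (over 100 means queuing)."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 100 * connections_in_use(requests_per_second, avg_hold_seconds) / pool_size

# Hypothetical pool of 20 connections at 100 requests/sec.
normal = pool_utilization(100, 0.05, pool_size=20)    # each request holds a connection for 50 ms
degraded = pool_utilization(100, 0.30, pool_size=20)  # queries slow down to 300 ms

print(normal, degraded, "pool saturated:", degraded > 100)
```

At the same request rate, a sixfold increase in query time pushes demand past the pool size, so new requests start to queue and eventually time out.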

Component	Monitoring Tool
Linux Server	top, vmstat, iostat
Windows	PerfMon
Database	Oracle AWR, SQL Server Profiler
JVM	JConsole, VisualVM

🔬

3. Application Performance Monitoring

Code-Level

APM

Popular APM Tools

🛠️

New Relic, AppDynamics, Dynatrace, Datadog, Splunk, Grafana

Database Query Time

🗄️

Measures execution time of SQL queries. Slow queries cause slow APIs, thread blocking, high CPU, and lock contention.

Example: SELECT * FROM ORDERS WHERE CUSTOMER_ID='1001' → takes 8 seconds under load. This is a slow query that cascades into full app degradation.
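
Query time can also be measured at the application level by timing each call. The following is a sketch using an in-memory SQLite database as a stand-in for the real database; the table layout, the 1-second threshold, and the `timed_query` helper are assumptions for illustration, not part of any APM tool.

```python
import sqlite3
import time

SLOW_QUERY_THRESHOLD_SECONDS = 1.0  # assumed threshold, tune per NFR

def timed_query(conn, sql, params=(), threshold=SLOW_QUERY_THRESHOLD_SECONDS):
    """Run a query and return (rows, elapsed_seconds, is_slow)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed, elapsed > threshold

# Stand-in database with a hypothetical ORDERS table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDERS (ORDER_ID INTEGER PRIMARY KEY, CUSTOMER_ID TEXT)")
conn.executemany(
    "INSERT INTO ORDERS (CUSTOMER_ID) VALUES (?)",
    [(str(1000 + i % 5),) for i in range(1000)],
)

rows, elapsed, is_slow = timed_query(
    conn, "SELECT * FROM ORDERS WHERE CUSTOMER_ID = ?", ("1001",)
)
print(len(rows), f"{elapsed:.6f}s", "slow:", is_slow)
```

Logging every query flagged as slow during a load test gives a shortlist to investigate with the database's own tooling.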

Garbage Collection (GC)

♻️

GC frees unused memory in Java apps. Frequent GC causes repeated Stop-The-World pauses, during which the application temporarily freezes.

Example: GC runs every 2 seconds → response time spikes from 1s → 12s. Monitor via GC Logs, VisualVM, JConsole, Grafana JVM dashboards.
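
A simple way to quantify GC impact is the share of wall-clock time spent in pauses. This sketch assumes pause durations have already been extracted from the GC logs into a list of milliseconds; the function name and the pause values are hypothetical.

```python
def gc_overhead(pause_ms, window_seconds):
    """Summarize GC pauses observed within a monitoring window."""
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    total_pause_s = sum(pause_ms) / 1000
    return {
        "pause_count": len(pause_ms),
        "total_pause_s": total_pause_s,
        "max_pause_ms": max(pause_ms, default=0),
        "overhead_pct": 100 * total_pause_s / window_seconds,
    }

# Hypothetical: GC every 2 seconds over a 60-second window, 400 ms per pause.
pauses = [400] * 30
stats = gc_overhead(pauses, window_seconds=60)
print(stats)
```

In this example the application is frozen for 12 of every 60 seconds, which is enough to explain large response time spikes.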

Cache Hit Ratio

💨

Measures how effectively cache serves requests. Low ratio = more DB hits = higher load, slower responses.

Cache Hit Ratio = (Cache Hits / Total Requests) × 100 → e.g. (90/100) × 100 = 90%
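
The ratio can be computed from hit and miss counters. As an illustration, the sketch below reads them from Python's built-in `functools.lru_cache`; `load_customer` is a hypothetical stand-in for a database lookup.

```python
from functools import lru_cache

def cache_hit_ratio(hits, misses):
    """Cache Hit Ratio = (Cache Hits / Total Requests) x 100."""
    total = hits + misses
    if total == 0:
        return 0.0
    return 100 * hits / total

@lru_cache(maxsize=128)
def load_customer(customer_id):
    # Stand-in for a database lookup (hypothetical).
    return {"id": customer_id}

# Hypothetical access pattern: 10 lookups across 3 distinct customers.
for customer_id in [1, 2, 1, 1, 3, 2, 1, 2, 3, 1]:
    load_customer(customer_id)

info = load_customer.cache_info()
ratio = cache_hit_ratio(info.hits, info.misses)
print(info.hits, info.misses, ratio)
```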

🎯

End-to-End Monitoring Example

Login API response becomes slow — monitoring reveals:

Metric	Observed Value
Response Time	10 sec
CPU Usage	95%
DB Query Time	8 sec
Error Rate	12%
GC Activity	Very High

⚡ Root Cause: Slow SQL query causing CPU spike and GC pressure
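
The first step in an analysis like this is flagging which metrics breach their limits. A minimal sketch follows: the response time and CPU limits come from the NFR and threshold examples earlier in this guide, while the DB query and error rate limits are assumed values. GC activity is left out because it is only described qualitatively.

```python
THRESHOLDS = {
    "response_time_s": 2.0,  # from the login NFR example
    "cpu_pct": 80.0,         # from the steady-state CPU threshold
    "db_query_time_s": 1.0,  # assumed limit
    "error_rate_pct": 1.0,   # assumed limit
}

def find_breaches(observed, thresholds=THRESHOLDS):
    """Return {metric: (observed, limit)} for every metric above its limit."""
    return {
        name: (value, thresholds[name])
        for name, value in observed.items()
        if name in thresholds and value > thresholds[name]
    }

# Values observed in the login API example.
observed = {
    "response_time_s": 10.0,
    "cpu_pct": 95.0,
    "db_query_time_s": 8.0,
    "error_rate_pct": 12.0,
}
breaches = find_breaches(observed)
for name, (value, limit) in breaches.items():
    print(f"{name}: {value} exceeds {limit}")
```

A check like this only lists symptoms. Identifying the slow SQL query as the root cause still requires correlating the metrics over time, for example noticing that DB query time rose before CPU and error rate did.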

∑

Monitoring Summary

Category	Metrics
Client-Side	Response Time, Throughput, Error Rate, Latency
Server-Side	CPU, Memory, Disk I/O, Network, Thread Pool
APM	DB Queries, GC, Cache Hit Ratio, Exceptions