Monitoring and Troubleshooting

This guide covers FLOPY-NET's monitoring capabilities and troubleshooting procedures, helping you identify issues, optimize performance, and maintain system health.

Monitoring Overview

FLOPY-NET provides multi-layered monitoring through:

  • Real-time Dashboard: Web-based visualization and alerts
  • Metrics Collection: Time-series data aggregation and storage
  • Log Aggregation: Centralized logging from all components
  • Health Checks: Automated service health monitoring
  • Alert System: Proactive issue detection and notification

Dashboard Monitoring

1. Main Dashboard

Access the main monitoring dashboard at http://localhost:3001:

System Overview Panel:

  • Service health status (green/yellow/red indicators)
  • Resource utilization (CPU, memory, network)
  • Active experiments count
  • System uptime and version info

Experiment Monitoring Panel:

  • Current experiment status and progress
  • Real-time accuracy and loss graphs
  • Client participation and status
  • Network performance metrics

Network Topology Panel:

  • Live network topology visualization
  • Switch and link status indicators
  • Traffic flow visualization
  • QoS policy enforcement status

2. Detailed Monitoring Views

FL Training Metrics:

// Real-time FL metrics
const flMetrics = {
  current_round: 12,
  global_accuracy: 0.89,
  global_loss: 0.34,
  convergence_rate: 0.02,
  active_clients: 5,
  client_metrics: [
    {
      id: "client_001",
      local_accuracy: 0.87,
      local_loss: 0.38,
      training_time: "45s",
      communication_time: "12s"
    }
  ]
};

Network Performance Metrics:

// Network monitoring data
const networkMetrics = {
  bandwidth_utilization: 45.2,
  average_latency: "15ms",
  packet_loss_rate: 0.001,
  qos_enforcements: 156,
  flow_modifications: 12,
  topology_changes: 0
};

3. Custom Dashboards

Create custom monitoring dashboards:

curl -X POST http://localhost:3001/api/v1/dashboards \
  -H "Content-Type: application/json" \
  -d '{
    "name": "FL Performance Dashboard",
    "layout": {
      "panels": [
        {
          "type": "line_chart",
          "title": "Global Model Accuracy",
          "metrics": ["global_accuracy"],
          "time_range": "1h"
        },
        {
          "type": "gauge",
          "title": "Network Utilization",
          "metrics": ["bandwidth_utilization"],
          "thresholds": [70, 90]
        }
      ]
    }
  }'

Metrics Collection and Analysis

1. Time-Series Metrics

Query historical metrics using the Collector API:

# Get FL training progress over time
curl "http://localhost:8081/api/v1/metrics/query?metric=global_accuracy&from=2024-01-15T10:00:00Z&to=2024-01-15T11:00:00Z&granularity=1m"

# Get network utilization statistics
curl "http://localhost:8081/api/v1/metrics/summary?metric=bandwidth_utilization&from=2024-01-15T10:00:00Z&to=2024-01-15T11:00:00Z"

2. Performance Analytics

Convergence Analysis:

import requests
import matplotlib.pyplot as plt

# Fetch convergence data
response = requests.get(
    "http://localhost:8081/api/v1/metrics/query",
    params={
        "metric": "global_accuracy",
        "experiment_id": "exp_001",
        "granularity": "1m"
    }
)

data = response.json()["data_points"]
rounds = [point["round"] for point in data]
accuracy = [point["value"] for point in data]

# Plot convergence curve
plt.plot(rounds, accuracy)
plt.xlabel("Training Round")
plt.ylabel("Global Accuracy")
plt.title("FL Model Convergence")
plt.show()
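
The convergence_rate figure shown on the dashboard can be approximated from the same series. A rough sketch (mean per-round accuracy gain; not necessarily the Collector's exact definition):

# Approximate the convergence rate as the mean per-round accuracy gain
deltas = [b - a for a, b in zip(accuracy, accuracy[1:])]
convergence_rate = sum(deltas) / len(deltas) if deltas else 0.0
print(f"Approximate convergence rate: {convergence_rate:.4f} accuracy/round")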

Communication Overhead Analysis:

# Analyze communication patterns
response = requests.get(
    "http://localhost:8081/api/v1/metrics/query",
    params={
        "metric": "communication_overhead",
        "experiment_id": "exp_001"
    }
)

data = response.json()["data_points"]

# Calculate efficiency metrics
total_data = sum(point["value"] for point in data)
training_time = data[-1]["timestamp"] - data[0]["timestamp"]
final_accuracy = accuracy[-1]  # final value from the convergence analysis above
efficiency = final_accuracy / (total_data / 1024**2)  # accuracy per MB

print(f"Communication Efficiency: {efficiency:.4f} accuracy/MB")

3. Anomaly Detection

Set up automated anomaly detection:

curl -X POST http://localhost:8081/api/v1/metrics/anomalies \
  -H "Content-Type: application/json" \
  -d '{
    "metric": "global_accuracy",
    "algorithm": "isolation_forest",
    "sensitivity": 0.1,
    "alert_threshold": 0.8
  }'
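
The same technique can be prototyped locally before enabling server-side detection. A sketch using scikit-learn's IsolationForest on a fetched metric series (requires pip install scikit-learn numpy; the contamination parameter here mirrors the API's sensitivity knob):

import requests
import numpy as np
from sklearn.ensemble import IsolationForest

# Fetch a metric series from the same query endpoint used above
resp = requests.get(
    "http://localhost:8081/api/v1/metrics/query",
    params={"metric": "global_accuracy", "experiment_id": "exp_001"},
)
values = np.array([[p["value"]] for p in resp.json()["data_points"]])

# fit_predict returns -1 for points the model considers anomalous
model = IsolationForest(contamination=0.1, random_state=0)
labels = model.fit_predict(values)
anomalies = [i for i, label in enumerate(labels) if label == -1]

print(f"Anomalous data points at indices: {anomalies}")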

Log Management

1. Centralized Logging

All FLOPY-NET components use structured logging:

# View all service logs
docker-compose logs -f

# View specific service logs
docker logs flopy-net-fl-server -f
docker logs flopy-net-policy-engine -f
docker logs flopy-net-sdn-controller -f

# Filter logs by level
docker logs flopy-net-dashboard 2>&1 | grep ERROR

2. Log Analysis

Parse structured logs:

import json
import subprocess

def analyze_fl_logs():
    # Get FL server logs
    result = subprocess.run(
        ["docker", "logs", "flopy-net-fl-server"],
        capture_output=True, text=True
    )

    errors = []
    warnings = []

    for line in result.stdout.split('\n'):
        if line.strip():
            try:
                log_entry = json.loads(line)
                if log_entry.get("level") == "ERROR":
                    errors.append(log_entry)
                elif log_entry.get("level") == "WARNING":
                    warnings.append(log_entry)
            except json.JSONDecodeError:
                continue

    return {"errors": errors, "warnings": warnings}

analysis = analyze_fl_logs()
print(f"Found {len(analysis['errors'])} errors and {len(analysis['warnings'])} warnings")

3. Log Aggregation with ELK Stack

Set up centralized logging (optional):

# docker-compose.elk.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.5.0
    volumes:
      - ./logstash/config:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.5.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"

Health Monitoring

1. Service Health Checks

Monitor individual service health:

# Check all services - Note: Replace with actual port numbers
services=("dashboard:3001" "collector:8081" "policy-engine:5000" "fl-server:8080" "sdn-controller:8181")
for service_port in "${services[@]}"; do
  service_name=$(echo $service_port | cut -d: -f1)
  port=$(echo $service_port | cut -d: -f2)
  echo "Checking $service_name on port $port..."
  curl -s http://localhost:$port/api/v1/health | jq '.status'
done

#!/bin/bash
# Automated health check script
services=(
  "dashboard:3001"
  "collector:8081"
  "policy-engine:5000"
  "fl-server:8080"
  "sdn-controller:8181"
)

for service_port in "${services[@]}"; do
  service=${service_port%:*}
  port=${service_port#*:}

  if curl -s -f http://localhost:$port/api/v1/health > /dev/null; then
    echo "✓ $service is healthy"
  else
    echo "✗ $service is unhealthy"
  fi
done

2. Dependency Health

Check external dependencies:

# Check database connections
curl http://localhost:8081/api/v1/health/dependencies

# Check network connectivity
curl http://localhost:8181/api/v1/topology

# Check GNS3 integration
curl http://localhost:3080/v2/projects

3. Resource Monitoring

Monitor system resources:

# Docker container resources
docker stats --no-stream

# System resources
top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1
free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'
df -h | awk '$NF=="/"{printf "Disk Usage: %d/%dGB (%s)\n", $3,$2,$5}'

# Network interface statistics
cat /proc/net/dev | grep eth0 | awk '{print "RX bytes: " $2 ", TX bytes: " $10}'
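
For a cross-platform alternative to these shell one-liners, the same snapshot can be taken with Python's psutil library (pip install psutil):

import psutil

# One-shot resource snapshot, roughly equivalent to the shell commands above
cpu = psutil.cpu_percent(interval=1)
mem = psutil.virtual_memory()
disk = psutil.disk_usage("/")
net = psutil.net_io_counters()

print(f"CPU: {cpu:.1f}%")
print(f"Memory: {mem.used // 2**20}/{mem.total // 2**20} MB ({mem.percent:.1f}%)")
print(f"Disk: {disk.used // 2**30}/{disk.total // 2**30} GB ({disk.percent:.1f}%)")
print(f"Network: RX {net.bytes_recv} bytes, TX {net.bytes_sent} bytes")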

Alert System

1. Configure Alerts

Set up proactive alerting:

curl -X POST http://localhost:8081/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Network Utilization",
    "condition": {
      "metric": "bandwidth_utilization",
      "operator": "greater_than",
      "threshold": 80,
      "duration": "5m"
    },
    "actions": [
      {
        "type": "webhook",
        "url": "https://alerts.example.com/webhook",
        "payload": {
          "message": "Network utilization exceeded 80%",
          "severity": "warning"
        }
      },
      {
        "type": "email",
        "recipients": ["admin@example.com"],
        "subject": "FLOPY-NET Alert: High Network Utilization"
      }
    ]
  }'
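
The webhook action POSTs the configured payload to your endpoint. A minimal receiver sketch using Flask, listening on an arbitrary port 9000 and expecting the payload fields shown above:

from flask import Flask, request

app = Flask(__name__)

# Accept alert payloads like the one configured above
@app.route("/webhook", methods=["POST"])
def receive_alert():
    alert = request.get_json(force=True)
    print(f"[{alert.get('severity', 'unknown')}] {alert.get('message', '')}")
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)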

2. Alert Types

Performance Alerts:

  • FL convergence stalled
  • High communication overhead
  • Poor client participation
  • Network congestion

System Alerts:

  • Service unavailable
  • High resource usage
  • Database connection failure
  • Policy enforcement failure

Security Alerts:

  • Unauthorized access attempts
  • Anomalous client behavior
  • Network intrusion detection
  • Data integrity violations

3. Alert Management

# List active alerts
curl "http://localhost:8081/api/v1/alerts?status=active"

# Acknowledge alert
curl -X POST http://localhost:8081/api/v1/alerts/alert_001/acknowledge \
  -H "Content-Type: application/json" \
  -d '{"comment": "Investigating high network usage"}'

# Resolve alert
curl -X POST http://localhost:8081/api/v1/alerts/alert_001/resolve \
  -H "Content-Type: application/json" \
  -d '{"resolution": "Network optimized, utilization reduced"}'

Troubleshooting Guide

1. Common Issues

Issue: Experiment Won't Start

Symptoms:

  • Experiment status stuck in "initializing"
  • No client connections
  • FL server not responding

Diagnosis:

# Check FL server status
curl http://localhost:8080/api/v1/health

# Check client connectivity
docker exec flopy-net-client-001 ping fl-server

# Review FL server logs
docker logs flopy-net-fl-server --tail=50

Solutions:

# Restart FL server
docker restart flopy-net-fl-server

# Check network configuration
docker network inspect flopy-net-network

# Verify port bindings
docker port flopy-net-fl-server

Issue: Poor FL Performance

Symptoms:

  • Slow convergence
  • High communication overhead
  • Frequent client disconnections

Diagnosis:

# Check network metrics
curl http://localhost:8181/api/v1/statistics

# Analyze client performance
curl http://localhost:3001/api/v1/experiments/exp_001/clients

# Review policy enforcement
curl "http://localhost:5000/api/v1/events?severity=warning"

Solutions:

# Optimize network policies
curl -X PUT http://localhost:5000/api/v1/policies/pol_001 \
  -H "Content-Type: application/json" \
  -d '{"actions": [{"type": "allocate_bandwidth", "parameters": {"min_bandwidth": "20Mbps"}}]}'

# Adjust FL parameters
curl -X PUT http://localhost:3001/api/v1/experiments/exp_001 \
  -H "Content-Type: application/json" \
  -d '{"configuration": {"local_epochs": 3, "batch_size": 64}}'

Issue: Network Policies Not Enforcing

Symptoms:

  • QoS not applied
  • Traffic not prioritized
  • Policy events missing

Diagnosis:

# Check policy engine status
curl http://localhost:5000/api/v1/health

# Verify SDN controller connection
curl http://localhost:8181/api/v1/health

# Review policy evaluation logs
curl "http://localhost:5000/api/v1/events?type=evaluation"

Solutions:

# Restart policy engine
docker restart flopy-net-policy-engine

# Re-sync with SDN controller
curl -X POST http://localhost:5000/api/v1/sync

# Validate policy configuration
curl http://localhost:5000/api/v1/policies/pol_001/validate

2. Performance Optimization

FL Training Optimization:

# Reduce communication frequency
curl -X PUT http://localhost:8080/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"aggregation_interval": 120}'  # 2 minutes

# Enable model compression
curl -X PUT http://localhost:8080/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"compression": {"enabled": true, "algorithm": "gzip"}}'

# Optimize client selection
curl -X PUT http://localhost:8080/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"client_selection": {"strategy": "fastest", "max_clients": 5}}'

Network Optimization:

# Raise the flow table limit on the bridge
docker exec flopy-net-sdn-controller \
  ovs-vsctl set Bridge br0 other_config:flow-limit=200000

# Optimize flow table
curl -X POST http://localhost:8181/api/v1/flows/optimize

# Enable checksum offloading on the interface
docker exec flopy-net-sdn-controller \
  ethtool -K eth0 rx on tx on

3. Debug Mode

Enable comprehensive debugging:

# Enable debug logging
export FLOPY_NET_DEBUG=true
export FLOPY_NET_LOG_LEVEL=debug

# Enable metrics collection
export FLOPY_NET_METRICS_DETAILED=true

# Enable profiling
export FLOPY_NET_PROFILING=true

# Restart with debug settings
docker-compose down
docker-compose up -d

Debug Tools:

# Network packet capture
docker exec flopy-net-sdn-controller tcpdump -i any -w /tmp/capture.pcap

# Flow table analysis
docker exec flopy-net-sdn-controller ovs-ofctl dump-flows br0

# Performance profiling
curl http://localhost:8080/api/v1/debug/profile > profile.json

4. Recovery Procedures

Experiment Recovery:

# Save experiment state
curl http://localhost:3001/api/v1/experiments/exp_001/checkpoint

# Restart from checkpoint
curl -X POST http://localhost:3001/api/v1/experiments/exp_001/recover \
  -H "Content-Type: application/json" \
  -d '{"checkpoint_id": "checkpoint_001"}'

System Recovery:

# Full system restart (note: prune removes stopped containers, unused networks, and dangling images)
docker-compose down
docker system prune -f
docker-compose up -d

# Database recovery
docker exec flopy-net-collector influx restore --bucket primary backup.tar.gz

# Configuration restore
curl -X POST http://localhost:5000/api/v1/config/restore \
  --data-binary @backup_config.json

Monitoring Best Practices

1. Proactive Monitoring

  • Set up comprehensive alerts for all critical metrics
  • Monitor trends, not just current values (see the sketch after this list)
  • Use predictive analytics to identify potential issues
  • Regularly review and update monitoring thresholds
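
For example, a trend check can compare a short recent window against a longer baseline using the Collector's query endpoint. A sketch, assuming the response format shown earlier:

import requests
from datetime import datetime, timedelta, timezone

QUERY_URL = "http://localhost:8081/api/v1/metrics/query"

def window_avg(metric, start, end):
    # Average a metric over [start, end) via the Collector query endpoint
    resp = requests.get(QUERY_URL, params={
        "metric": metric,
        "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "to": end.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "granularity": "1m",
    })
    vals = [p["value"] for p in resp.json()["data_points"]]
    return sum(vals) / len(vals) if vals else 0.0

now = datetime.now(timezone.utc)
recent = window_avg("bandwidth_utilization", now - timedelta(minutes=10), now)
baseline = window_avg("bandwidth_utilization", now - timedelta(hours=1), now)

# Flag a sustained upward trend rather than an instantaneous spike
if baseline and recent / baseline > 1.5:
    print(f"Utilization trending up: {recent:.1f} vs baseline {baseline:.1f}")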

2. Performance Baselines

  • Establish baseline performance metrics
  • Compare experiments against baselines
  • Track performance degradation over time
  • Document known performance characteristics

3. Documentation

  • Document all monitoring procedures
  • Maintain troubleshooting runbooks
  • Record solutions to common issues
  • Keep monitoring configuration in version control

4. Automation

  • Automate routine monitoring tasks
  • Use scripts for common diagnostic procedures
  • Implement self-healing where possible
  • Create automated reporting and dashboards

Next Steps