Monitoring and Troubleshooting

This guide covers FLOPY-NET's monitoring capabilities and troubleshooting procedures, helping you identify issues, optimize performance, and maintain system health.

Monitoring Overview

FLOPY-NET provides multi-layered monitoring through:

  • Real-time Dashboard: Web-based visualization and alerts
  • Metrics Collection: Time-series data aggregation and storage
  • Log Aggregation: Centralized logging from all components
  • Health Checks: Automated service health monitoring
  • Alert System: Proactive issue detection and notification

Dashboard Monitoring

1. Main Dashboard

Access the main monitoring dashboard at http://localhost:3001:

System Overview Panel:

  • Service health status (green/yellow/red indicators)
  • Resource utilization (CPU, memory, network)
  • Active experiments count
  • System uptime and version info

Experiment Monitoring Panel:

  • Current experiment status and progress
  • Real-time accuracy and loss graphs
  • Client participation and status
  • Network performance metrics

Network Topology Panel:

  • Live network topology visualization
  • Switch and link status indicators
  • Traffic flow visualization
  • QoS policy enforcement status

2. Detailed Monitoring Views

FL Training Metrics:

// Real-time FL metrics
const flMetrics = {
  current_round: 12,
  global_accuracy: 0.89,
  global_loss: 0.34,
  convergence_rate: 0.02,
  active_clients: 5,
  client_metrics: [
    {
      id: "client_001",
      local_accuracy: 0.87,
      local_loss: 0.38,
      training_time: "45s",
      communication_time: "12s"
    }
  ]
};

Network Performance Metrics:

// Network monitoring data
const networkMetrics = {
  bandwidth_utilization: 45.2,
  average_latency: "15ms",
  packet_loss_rate: 0.001,
  qos_enforcements: 156,
  flow_modifications: 12,
  topology_changes: 0
};

3. Custom Dashboards

Create custom monitoring dashboards:

curl -X POST http://localhost:3001/api/v1/dashboards \
  -H "Content-Type: application/json" \
  -d '{
    "name": "FL Performance Dashboard",
    "layout": {
      "panels": [
        {
          "type": "line_chart",
          "title": "Global Model Accuracy",
          "metrics": ["global_accuracy"],
          "time_range": "1h"
        },
        {
          "type": "gauge",
          "title": "Network Utilization",
          "metrics": ["bandwidth_utilization"],
          "thresholds": [70, 90]
        }
      ]
    }
  }'

Metrics Collection and Analysis

1. Time-Series Metrics

Query historical metrics using the Collector API:

# Get FL training progress over time
curl "http://localhost:8081/api/v1/metrics/query?metric=global_accuracy&from=2024-01-15T10:00:00Z&to=2024-01-15T11:00:00Z&granularity=1m"

# Get network utilization statistics
curl "http://localhost:8081/api/v1/metrics/summary?metric=bandwidth_utilization&from=2024-01-15T10:00:00Z&to=2024-01-15T11:00:00Z"

2. Performance Analytics

Convergence Analysis:

import requests
import matplotlib.pyplot as plt

# Fetch convergence data
response = requests.get(
    "http://localhost:8081/api/v1/metrics/query",
    params={
        "metric": "global_accuracy",
        "experiment_id": "exp_001",
        "granularity": "1m"
    }
)

data = response.json()["data_points"]
rounds = [point["round"] for point in data]
accuracy = [point["value"] for point in data]

# Plot convergence curve
plt.plot(rounds, accuracy)
plt.xlabel("Training Round")
plt.ylabel("Global Accuracy")
plt.title("FL Model Convergence")
plt.show()
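
The convergence_rate figure shown on the dashboard can be approximated from the same series. A rough sketch (mean per-round accuracy gain; not necessarily the Collector's exact definition):

# Approximate the convergence rate as the mean per-round accuracy gain
deltas = [b - a for a, b in zip(accuracy, accuracy[1:])]
convergence_rate = sum(deltas) / len(deltas) if deltas else 0.0
print(f"Approximate convergence rate: {convergence_rate:.4f} accuracy/round")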

Communication Overhead Analysis:

# Analyze communication patterns
response = requests.get(
    "http://localhost:8081/api/v1/metrics/query",
    params={
        "metric": "communication_overhead",
        "experiment_id": "exp_001"
    }
)

data = response.json()["data_points"]

# Calculate efficiency metrics
total_data = sum(point["value"] for point in data)
training_time = data[-1]["timestamp"] - data[0]["timestamp"]
final_accuracy = accuracy[-1]  # final value from the convergence analysis above
efficiency = final_accuracy / (total_data / 1024**2)  # accuracy per MB

print(f"Communication Efficiency: {efficiency:.4f} accuracy/MB")

3. Anomaly Detection

Set up automated anomaly detection:

curl -X POST http://localhost:8081/api/v1/metrics/anomalies \
  -H "Content-Type: application/json" \
  -d '{
    "metric": "global_accuracy",
    "algorithm": "isolation_forest",
    "sensitivity": 0.1,
    "alert_threshold": 0.8
  }'
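
The same technique can be prototyped locally before enabling server-side detection. A sketch using scikit-learn's IsolationForest on a fetched metric series (requires pip install scikit-learn numpy; the contamination parameter here mirrors the API's sensitivity knob):

import requests
import numpy as np
from sklearn.ensemble import IsolationForest

# Fetch a metric series from the same query endpoint used above
resp = requests.get(
    "http://localhost:8081/api/v1/metrics/query",
    params={"metric": "global_accuracy", "experiment_id": "exp_001"},
)
values = np.array([[p["value"]] for p in resp.json()["data_points"]])

# fit_predict returns -1 for points the model considers anomalous
model = IsolationForest(contamination=0.1, random_state=0)
labels = model.fit_predict(values)
anomalies = [i for i, label in enumerate(labels) if label == -1]

print(f"Anomalous data points at indices: {anomalies}")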

Log Management

1. Centralized Logging

All FLOPY-NET components use structured logging:

# View all service logs
docker-compose logs -f

# View specific service logs
docker logs flopy-net-fl-server -f
docker logs flopy-net-policy-engine -f
docker logs flopy-net-sdn-controller -f

# Filter logs by level
docker logs flopy-net-dashboard 2>&1 | grep ERROR

2. Log Analysis

Parse structured logs:

import json
import subprocess

def analyze_fl_logs():
    # Get FL server logs
    result = subprocess.run(
        ["docker", "logs", "flopy-net-fl-server"],
        capture_output=True, text=True
    )

    errors = []
    warnings = []

    for line in result.stdout.split('\n'):
        if line.strip():
            try:
                log_entry = json.loads(line)
                if log_entry.get("level") == "ERROR":
                    errors.append(log_entry)
                elif log_entry.get("level") == "WARNING":
                    warnings.append(log_entry)
            except json.JSONDecodeError:
                continue

    return {"errors": errors, "warnings": warnings}

analysis = analyze_fl_logs()
print(f"Found {len(analysis['errors'])} errors and {len(analysis['warnings'])} warnings")

3. Log Aggregation with ELK Stack

Set up centralized logging (optional):

# docker-compose.elk.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.5.0
    volumes:
      - ./logstash/config:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.5.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"

Health Monitoring

1. Service Health Checks

Monitor individual service health:

# Check all services - Note: Replace with actual port numbers
services=("dashboard:3001" "collector:8081" "policy-engine:5000" "fl-server:8080" "sdn-controller:8181")
for service_port in "${services[@]}"; do
  service_name=$(echo $service_port | cut -d: -f1)
  port=$(echo $service_port | cut -d: -f2)
  echo "Checking $service_name on port $port..."
  curl -s http://localhost:$port/api/v1/health | jq '.status'
done

#!/bin/bash
# Automated health check script
services=(
  "dashboard:3001"
  "collector:8081"
  "policy-engine:5000"
  "fl-server:8080"
  "sdn-controller:8181"
)

for service_port in "${services[@]}"; do
  service=${service_port%:*}
  port=${service_port#*:}

  if curl -s -f http://localhost:$port/api/v1/health > /dev/null; then
    echo "✓ $service is healthy"
  else
    echo "✗ $service is unhealthy"
  fi
done

2. Dependency Health

Check external dependencies:

# Check database connections
curl http://localhost:8081/api/v1/health/dependencies

# Check network connectivity
curl http://localhost:8181/api/v1/topology

# Check GNS3 integration
curl http://localhost:3080/v2/projects

3. Resource Monitoring

Monitor system resources:

# Docker container resources
docker stats --no-stream

# System resources
top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1
free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'
df -h | awk '$NF=="/"{printf "Disk Usage: %d/%dGB (%s)\n", $3,$2,$5}'

# Network interface statistics
cat /proc/net/dev | grep eth0 | awk '{print "RX bytes: " $2 ", TX bytes: " $10}'
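
For a cross-platform alternative to these shell one-liners, the same snapshot can be taken with Python's psutil library (pip install psutil):

import psutil

# One-shot resource snapshot, roughly equivalent to the shell commands above
cpu = psutil.cpu_percent(interval=1)
mem = psutil.virtual_memory()
disk = psutil.disk_usage("/")
net = psutil.net_io_counters()

print(f"CPU: {cpu:.1f}%")
print(f"Memory: {mem.used // 2**20}/{mem.total // 2**20} MB ({mem.percent:.1f}%)")
print(f"Disk: {disk.used // 2**30}/{disk.total // 2**30} GB ({disk.percent:.1f}%)")
print(f"Network: RX {net.bytes_recv} bytes, TX {net.bytes_sent} bytes")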

Alert System

1. Configure Alerts

Set up proactive alerting:

curl -X POST http://localhost:8081/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '{
    "name": "High Network Utilization",
    "condition": {
      "metric": "bandwidth_utilization",
      "operator": "greater_than",
      "threshold": 80,
      "duration": "5m"
    },
    "actions": [
      {
        "type": "webhook",
        "url": "https://alerts.example.com/webhook",
        "payload": {
          "message": "Network utilization exceeded 80%",
          "severity": "warning"
        }
      },
      {
        "type": "email",
        "recipients": ["admin@example.com"],
        "subject": "FLOPY-NET Alert: High Network Utilization"
      }
    ]
  }'
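
The webhook action POSTs the configured payload to your endpoint. A minimal receiver sketch using Flask, listening on an arbitrary port 9000 and expecting the payload fields shown above:

from flask import Flask, request

app = Flask(__name__)

# Accept alert payloads like the one configured above
@app.route("/webhook", methods=["POST"])
def receive_alert():
    alert = request.get_json(force=True)
    print(f"[{alert.get('severity', 'unknown')}] {alert.get('message', '')}")
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9000)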

2. Alert Types

Performance Alerts:

  • FL convergence stalled
  • High communication overhead
  • Poor client participation
  • Network congestion

System Alerts:

  • Service unavailable
  • High resource usage
  • Database connection failure
  • Policy enforcement failure

Security Alerts:

  • Unauthorized access attempts
  • Anomalous client behavior
  • Network intrusion detection
  • Data integrity violations

3. Alert Management

# List active alerts
curl "http://localhost:8081/api/v1/alerts?status=active"

# Acknowledge alert
curl -X POST http://localhost:8081/api/v1/alerts/alert_001/acknowledge \
  -H "Content-Type: application/json" \
  -d '{"comment": "Investigating high network usage"}'

# Resolve alert
curl -X POST http://localhost:8081/api/v1/alerts/alert_001/resolve \
  -H "Content-Type: application/json" \
  -d '{"resolution": "Network optimized, utilization reduced"}'

Troubleshooting Guide

1. Common Issues

Issue: Experiment Won't Start

Symptoms:

  • Experiment status stuck in "initializing"
  • No client connections
  • FL server not responding

Diagnosis:

# Check FL server status
curl http://localhost:8080/api/v1/health

# Check client connectivity
docker exec flopy-net-client-001 ping fl-server

# Review FL server logs
docker logs flopy-net-fl-server --tail=50

Solutions:

# Restart FL server
docker restart flopy-net-fl-server

# Check network configuration
docker network inspect flopy-net-network

# Verify port bindings
docker port flopy-net-fl-server

Issue: Poor FL Performance

Symptoms:

  • Slow convergence
  • High communication overhead
  • Frequent client disconnections

Diagnosis:

# Check network metrics
curl http://localhost:8181/api/v1/statistics

# Analyze client performance
curl http://localhost:3001/api/v1/experiments/exp_001/clients

# Review policy enforcement
curl "http://localhost:5000/api/v1/events?severity=warning"

Solutions:

# Optimize network policies
curl -X PUT http://localhost:5000/api/v1/policies/pol_001 \
  -H "Content-Type: application/json" \
  -d '{"actions": [{"type": "allocate_bandwidth", "parameters": {"min_bandwidth": "20Mbps"}}]}'

# Adjust FL parameters
curl -X PUT http://localhost:3001/api/v1/experiments/exp_001 \
  -H "Content-Type: application/json" \
  -d '{"configuration": {"local_epochs": 3, "batch_size": 64}}'

Issue: Network Policies Not Enforcing

Symptoms:

  • QoS not applied
  • Traffic not prioritized
  • Policy events missing

Diagnosis:

# Check policy engine status
curl http://localhost:5000/api/v1/health

# Verify SDN controller connection
curl http://localhost:8181/api/v1/health

# Review policy evaluation logs
curl "http://localhost:5000/api/v1/events?type=evaluation"

Solutions:

# Restart policy engine
docker restart flopy-net-policy-engine

# Re-sync with SDN controller
curl -X POST http://localhost:5000/api/v1/sync

# Validate policy configuration
curl http://localhost:5000/api/v1/policies/pol_001/validate

2. Performance Optimization

FL Training Optimization:

# Reduce communication frequency
curl -X PUT http://localhost:8080/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"aggregation_interval": 120}'  # 2 minutes

# Enable model compression
curl -X PUT http://localhost:8080/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"compression": {"enabled": true, "algorithm": "gzip"}}'

# Optimize client selection
curl -X PUT http://localhost:8080/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"client_selection": {"strategy": "fastest", "max_clients": 5}}'

Network Optimization:

# Raise the flow table limit on the bridge
docker exec flopy-net-sdn-controller \
  ovs-vsctl set Bridge br0 other_config:flow-limit=200000

# Optimize flow table
curl -X POST http://localhost:8181/api/v1/flows/optimize

# Enable checksum offloading on the interface
docker exec flopy-net-sdn-controller \
  ethtool -K eth0 rx on tx on

3. Debug Mode

Enable comprehensive debugging:

# Enable debug logging
export FLOPY_NET_DEBUG=true
export FLOPY_NET_LOG_LEVEL=debug

# Enable metrics collection
export FLOPY_NET_METRICS_DETAILED=true

# Enable profiling
export FLOPY_NET_PROFILING=true

# Restart with debug settings
docker-compose down
docker-compose up -d

Debug Tools:

# Network packet capture
docker exec flopy-net-sdn-controller tcpdump -i any -w /tmp/capture.pcap

# Flow table analysis
docker exec flopy-net-sdn-controller ovs-ofctl dump-flows br0

# Performance profiling
curl http://localhost:8080/api/v1/debug/profile > profile.json

4. Recovery Procedures

Experiment Recovery:

# Save experiment state
curl http://localhost:3001/api/v1/experiments/exp_001/checkpoint

# Restart from checkpoint
curl -X POST http://localhost:3001/api/v1/experiments/exp_001/recover \
  -H "Content-Type: application/json" \
  -d '{"checkpoint_id": "checkpoint_001"}'

System Recovery:

# Full system restart (note: prune removes stopped containers, unused networks, and dangling images)
docker-compose down
docker system prune -f
docker-compose up -d

# Database recovery
docker exec flopy-net-collector influx restore --bucket primary backup.tar.gz

# Configuration restore
curl -X POST http://localhost:5000/api/v1/config/restore \
  --data-binary @backup_config.json

Monitoring Best Practices

1. Proactive Monitoring

  • Set up comprehensive alerts for all critical metrics
  • Monitor trends, not just current values (see the sketch after this list)
  • Use predictive analytics to identify potential issues
  • Regularly review and update monitoring thresholds
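
For example, a trend check can compare a short recent window against a longer baseline using the Collector's query endpoint. A sketch, assuming the response format shown earlier:

import requests
from datetime import datetime, timedelta, timezone

QUERY_URL = "http://localhost:8081/api/v1/metrics/query"

def window_avg(metric, start, end):
    # Average a metric over [start, end) via the Collector query endpoint
    resp = requests.get(QUERY_URL, params={
        "metric": metric,
        "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "to": end.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "granularity": "1m",
    })
    vals = [p["value"] for p in resp.json()["data_points"]]
    return sum(vals) / len(vals) if vals else 0.0

now = datetime.now(timezone.utc)
recent = window_avg("bandwidth_utilization", now - timedelta(minutes=10), now)
baseline = window_avg("bandwidth_utilization", now - timedelta(hours=1), now)

# Flag a sustained upward trend rather than an instantaneous spike
if baseline and recent / baseline > 1.5:
    print(f"Utilization trending up: {recent:.1f} vs baseline {baseline:.1f}")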

2. Performance Baselines

  • Establish baseline performance metrics
  • Compare experiments against baselines
  • Track performance degradation over time
  • Document known performance characteristics

3. Documentation

  • Document all monitoring procedures
  • Maintain troubleshooting runbooks
  • Record solutions to common issues
  • Keep monitoring configuration in version control

4. Automation

  • Automate routine monitoring tasks
  • Use scripts for common diagnostic procedures
  • Implement self-healing where possible
  • Create automated reporting and dashboards

Next Steps