
BitEdge Integration Specification

This document specifies how edge node applications integrate with the BitFlow platform over RabbitMQ (AMQP) messaging.

Overview

BitEdge is the edge computing component of the BitFlow platform that executes workflow tasks on distributed edge nodes. This specification defines the messaging protocol and data structures required for edge nodes to:

  • Consume workflow tasks from RabbitMQ queues
  • Execute containerized workloads (Docker, Kubernetes, etc.)
  • Report progress and results back to the platform
  • Provide real-time monitoring and logging

Implementation Note: This specification is language-agnostic and can be implemented in Python, Go, Java, Node.js, or any language with RabbitMQ support.

Architecture

┌─────────────────┐      ┌──────────────┐      ┌─────────────────┐
│   BitFlow API   │─────▶│   RabbitMQ   │─────▶│  BitEdge Node   │
│     Server      │      │   Message    │      │   Application   │
│                 │      │    Broker    │      │                 │
└─────────────────┘      └──────────────┘      └─────────────────┘
         ▲                                              │
         │              ┌─────────────────┐             │
         │              │                 │             │
         └──────────────┤   Progress &    │◀────────────┘
                        │     Results     │
                        └─────────────────┘

Prerequisites

  • Container Runtime (Docker, containerd, or Kubernetes)
  • RabbitMQ Client Library for your chosen language
  • RabbitMQ Access (connection URL and credentials)
  • Network Connectivity to RabbitMQ broker and container registries

Connection Configuration

RabbitMQ Connection Parameters

rabbitmq:
  url: "amqp://username:password@rabbitmq-host:5672/"
  exchange_name: "bitflow"
  connection_name: "bitedge-{node_id}"
  prefetch_count: 1
  publisher_confirm: true
  reconnect_delay: 5s
  max_reconnects: 10

Required Environment Variables

RABBITMQ_URL=amqp://guest:guest@localhost:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001

API Endpoints for BitEdge Integration

Artifact Upload using Presigned URLs

Important: BitEdge nodes should use presigned URLs for artifact uploads instead of direct API uploads. Presigned URLs are provided in the task configuration and offer secure, direct-to-storage uploads without exposing credentials.

Task Configuration with Presigned URLs: Tasks include artifact_upload_urls in their configuration:

{
  "task_id": "task_1234567890",
  "execution_config": {
    "image": "python:3.11-slim",
    "command": ["python", "train.py"],
    "artifact_upload_urls": {
      "base_upload_url": "https://minio.example.com/bitflow-data/runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
      "expires_at": "2024-01-16T11:30:00Z",
      "max_file_size": 1073741824
    }
  }
}

Upload Process:

  1. Package Artifacts: Create a tar.gz archive of all output files
  2. Upload using Presigned URL: Use HTTP PUT to upload directly to storage
  3. Register Artifact: Call the artifact registration endpoint to record the upload
  4. Report Results: Include artifact information in task result message

Example Upload:

# Package artifacts
tar -czf output.tar.gz -C /workspace/outputs .

# Upload using presigned URL (HTTP PUT, not POST)
curl -X PUT \
  -H "Content-Type: application/gzip" \
  -T output.tar.gz \
  "${PRESIGNED_UPLOAD_URL}"

# Register the uploaded artifact (NEW STEP)
curl -X POST \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "output.tar.gz",
    "type": "model",
    "path": "runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz",
    "size": 134217728,
    "mime_type": "application/gzip",
    "checksum": "sha256:a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3",
    "uploaded_at": "2024-01-16T10:30:00Z",
    "metadata": {
      "task_id": "task_1234567890",
      "framework": "tensorflow",
      "model_type": "resnet50"
    }
  }' \
  "${API_BASE_URL}/api/v1/runs/run_0987654321/artifacts/register"

Benefits of Presigned URLs:

  • Security: No credentials exposed to edge nodes
  • Performance: Direct upload to storage, bypassing API server
  • Reliability: Pre-authorized access with automatic expiration
  • Scalability: Reduces API server load for large file uploads

Message Types and Protocol

1. Node Registration Message

Queue: node.status
Routing Key: node.registration.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "event_type": "node_registration",
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:30:00Z",
  "node_info": {
    "name": "GPU Training Node 1",
    "architecture": "amd64",
    "os": "linux",
    "capabilities": ["docker", "gpu", "pytorch", "tensorflow"],
    "resources": {
      "cpu_cores": 16,
      "memory": "64Gi",
      "gpus": 4,
      "gpu_type": "nvidia-a100",
      "storage": "2Ti",
      "network_speed": "10Gbps"
    },
    "labels": {
      "node-type": "gpu",
      "zone": "us-west-1a",
      "tier": "premium"
    }
  }
}

Field Descriptions:

  • event_type: Always node_registration for registration messages
  • node_id: Unique identifier for the node within the datacenter
  • datacenter_id: ID of the datacenter this node belongs to
  • node_info.capabilities: List of supported features (docker, gpu, pytorch, etc.)
  • node_info.resources.gpu_type: Specific GPU model (nvidia-a100, nvidia-v100, etc.)
  • node_info.labels: Key-value pairs for node classification and scheduling
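
The registration payload can be assembled with any language's JSON support before being published to the `bitflow` exchange. A minimal Python sketch (the helper name and the trimmed-down `node_info` are illustrative, not part of the spec; the actual publish call depends on your AMQP client and is omitted):

```python
import json
from datetime import datetime, timezone

def build_registration_message(node_id: str, datacenter_id: str, node_info: dict) -> dict:
    """Assemble a node_registration payload matching the structure above."""
    return {
        "event_type": "node_registration",
        "node_id": node_id,
        "datacenter_id": datacenter_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "node_info": node_info,
    }

msg = build_registration_message(
    "edge-node-001",
    "edge-datacenter-1",
    {"name": "GPU Training Node 1", "capabilities": ["docker", "gpu"]},
)
body = json.dumps(msg)  # publish this body to exchange "bitflow"
routing_key = f"node.registration.{msg['datacenter_id']}"
```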

2. Node Heartbeat Message

Queue: node.status
Routing Key: node.heartbeat.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "event_type": "node_heartbeat",
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:35:00Z",
  "health": {
    "status": "healthy",
    "uptime_seconds": 86400,
    "resource_usage": {
      "cpu_percent": 45.2,
      "memory_used": 13631488000,
      "memory_percent": 12.5,
      "gpu_usage": [
        {
          "gpu_id": 0,
          "name": "NVIDIA A100",
          "utilization": 85.5,
          "memory_used": 68719476736,
          "memory_total": 85899345920,
          "temperature": 72.0,
          "power_usage": 220.0
        }
      ],
      "network_io": {
        "bytes_received": 1048576000,
        "bytes_transmitted": 524288000,
        "packets_received": 1000000,
        "packets_transmitted": 500000
      },
      "disk_io": {
        "bytes_read": 2147483648,
        "bytes_written": 1073741824,
        "reads_count": 10000,
        "writes_count": 5000
      },
      "duration": "300s",
      "peak_memory": 4294967296,
      "peak_cpu": 89.7
    },
    "tasks_processed": 127,
    "tasks_successful": 122,
    "tasks_failed": 5,
    "checks": {
      "rabbitmq": { "status": "ok", "latency_ms": 5 },
      "docker": { "status": "ok", "version": "24.0.7" },
      "gpu_driver": { "status": "ok", "version": "535.104.12" }
    }
  },
  "dataset_stats": [
    {
      "dataset_id": "dataset_abc123",
      "status": "available",
      "file_count": 1250,
      "samples_count": 50000,
      "total_size": 2147483648,
      "local_path": "/data/datasets/dataset_abc123",
      "last_accessed": "2024-01-15T09:15:00Z",
      "last_synced": "2024-01-14T20:00:00Z",
      "sync_progress": 1.0,
      "checksum_verified": true,
      "access_count": 45,
      "bytes_read": 524288000
    }
  ]
}

Field Descriptions:

  • event_type: Always node_heartbeat for heartbeat messages
  • health.status: Node health status (healthy, degraded, unhealthy)
  • health.resource_usage: Current resource consumption metrics
  • health.tasks_*: Cumulative task execution statistics
  • health.checks: Health check results for various components
  • dataset_stats: Current dataset availability and statistics
  • dataset_stats[].samples_count: Critical field - Number of samples in dataset
  • dataset_stats[].status: Dataset availability (available, syncing, corrupted, missing)
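
One way to populate `health.status` is to derive it from the per-component `checks`. The sketch below is an assumption about policy, not a rule from the spec: it treats a failed broker check as fatal and any other failed check as degradation.

```python
def derive_health_status(checks: dict) -> str:
    """Map component check results to healthy | degraded | unhealthy."""
    if all(c.get("status") == "ok" for c in checks.values()):
        return "healthy"
    # Assumption: losing the message broker makes the node unusable,
    # since all task delivery and reporting flows through RabbitMQ.
    if checks.get("rabbitmq", {}).get("status") != "ok":
        return "unhealthy"
    return "degraded"
```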

3. Workflow Task Message

Queue: workflow.tasks.{datacenter_id}
Routing Key: workflow.task.execute.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "workflow_id": "wf_abcdef123456",
  "datacenter_id": "edge-datacenter-1",
  "user_id": "user_123",
  "task_type": "execution",
  "priority": 3,
  "created_at": "2024-01-15T10:30:00Z",
  "expires_at": "2024-01-15T11:30:00Z",
  "retry_count": 0,
  "max_retries": 3,
  "execution_config": {
    "image": "python:3.11-slim",
    "command": ["python", "train.py", "--epochs", "100"],
    "environment": {
      "MODEL_TYPE": "resnet50",
      "BATCH_SIZE": "32",
      "LEARNING_RATE": "0.001"
    },
    "working_dir": "/app",
    "resources": {
      "cpu_limit": "2",
      "memory_limit": "4Gi",
      "gpu_count": 1,
      "gpu_type": "nvidia.com/gpu",
      "storage_limit": "10Gi"
    },
    "mounts": [
      {
        "type": "dataset",
        "source": "dataset_abc123",
        "target": "/data",
        "permissions": "ro",
        "filter_paths": ["train/", "validation/"]
      }
    ],
    "timeout": "3600s",
    "kill_timeout": "30s",
    "artifact_upload_urls": {
      "base_upload_url": "https://minio.example.com/bitflow-data/runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
      "expires_at": "2024-01-16T11:30:00Z",
      "max_file_size": 1073741824
    }
  },
  "callback": {
    "url": "https://api.bitroc.ai/v1/workflows/{workflow_id}/callback",
    "token": "secure-callback-token",
    "events": ["started", "progress", "completed", "failed"]
  }
}

Field Descriptions:

  • task_type: execution | monitoring | cleanup
  • priority: 1 (high) to 5 (low)
  • expires_at: Task expires and should be rejected if current time > expires_at
  • execution_config.mounts: Volume mount specifications for datasets
  • execution_config.artifact_upload_urls: Presigned URLs for secure artifact upload
    • base_upload_url: Presigned URL for uploading the packaged artifact file
    • expires_at: ISO 8601 timestamp when the upload URL expires
    • max_file_size: Maximum allowed file size in bytes (typically 1GB)
  • callback: Structured callback configuration for status updates
    • url: HTTP endpoint for posting task status updates
    • token: Authentication token to include in callback requests
    • events: List of events that trigger callbacks
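
Putting the `expires_at` and capability rules into code, a minimal acceptance check might look like the following Python sketch (the GPU-capability inference is an assumption for illustration; a real node would check every resource and capability it advertises):

```python
from datetime import datetime, timezone

def should_accept_task(task: dict, node_capabilities: set, now: datetime = None) -> bool:
    """Reject expired tasks and tasks the node cannot run."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.strptime(
        task["expires_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    if now > expires:
        return False  # expired tasks must be rejected
    # Assumption: gpu_count > 0 implies the node needs the "gpu" capability.
    needs_gpu = task["execution_config"]["resources"].get("gpu_count", 0) > 0
    if needs_gpu and "gpu" not in node_capabilities:
        return False
    return True
```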

4. Task Result Message

Queue: workflow.results
Routing Key: workflow.task.result
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "status": "completed",
  "exit_code": 0,
  "error_message": null,
  "started_at": "2024-01-15T10:30:15Z",
  "completed_at": "2024-01-15T11:25:30Z",
  "container_id": "container_xyz789",
  "node_info": {
    "node_id": "edge-node-001",
    "node_name": "edge-gpu-server-01",
    "datacenter_id": "edge-datacenter-1",
    "architecture": "amd64",
    "os": "linux",
    "capabilities": ["docker", "gpu", "high-memory"],
    "resources": {
      "cpu_cores": 16,
      "memory": "64Gi",
      "gpus": 4,
      "storage": "2Ti",
      "network_speed": "10Gbps"
    },
    "labels": { "node-type": "gpu", "zone": "us-west-1a" }
  },
  "resource_usage": {
    "cpu_percent": 75.5,
    "memory_used": 3221225472,
    "memory_percent": 12.5,
    "gpu_usage": [
      {
        "gpu_id": 0,
        "name": "NVIDIA A100",
        "utilization": 95.2,
        "memory_used": 68719476736,
        "memory_total": 85899345920,
        "temperature": 78.5,
        "power_usage": 250.0
      }
    ],
    "network_io": {
      "bytes_received": 1048576000,
      "bytes_transmitted": 524288000,
      "packets_received": 1000000,
      "packets_transmitted": 500000
    },
    "disk_io": {
      "bytes_read": 2147483648,
      "bytes_written": 1073741824,
      "reads_count": 10000,
      "writes_count": 5000
    },
    "duration": "3315s",
    "peak_memory": 4294967296,
    "peak_cpu": 89.7
  },
  "log_summary": "Training completed successfully. Final accuracy: 94.2%",
  "output_summary": "Model saved to /output/model.pth. Metrics logged to /output/metrics.json",
  "artifacts": [
    {
      "name": "output.tar.gz",
      "path": "runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz",
      "size": 134217728,
      "content_type": "application/gzip",
      "checksum": "sha256:a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3",
      "uploaded_at": "2024-01-15T11:25:30Z",
      "upload_method": "presigned_url"
    }
  ],
  "metrics": {
    "final_accuracy": 94.2,
    "final_loss": 0.045,
    "training_time": 3315,
    "epochs_completed": 100,
    "samples_processed": 50000
  }
}

Field Descriptions:

  • status: completed | failed | cancelled
  • exit_code: Container exit code (null if not available)
  • resource_usage.memory_used: Memory in bytes
  • resource_usage.duration: Execution time in seconds
  • artifacts: Output files uploaded using presigned URLs
    • name: Artifact filename (typically “output.tar.gz” for packaged outputs)
    • path: Object storage path where the artifact is stored
    • size: File size in bytes
    • content_type: MIME type of the uploaded file
    • checksum: SHA256 checksum for integrity verification
    • uploaded_at: Timestamp when artifact was uploaded to object storage
    • upload_method: Always “presigned_url” for BitEdge uploads
  • metrics: Custom metrics specific to the workflow
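
The `checksum` field uses the `sha256:<hex>` form. A small Python sketch for computing it over a packaged artifact (the function names are illustrative; the streaming variant avoids loading large tarballs into memory):

```python
import hashlib

def artifact_checksum(data: bytes) -> str:
    """Return the sha256:<hex> string used in the artifacts list."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file (e.g. output.tar.gz) in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()
```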

5. Progress Updates

Queue: workflow.progress
Routing Key: workflow.task.progress
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "timestamp": "2024-01-15T10:45:30Z",
  "stage": "running",
  "progress": 0.65,
  "message": "Training epoch 65/100 completed",
  "details": {
    "current_epoch": 65,
    "total_epochs": 100,
    "current_loss": 0.087,
    "current_accuracy": 0.891,
    "time_remaining": "1800s"
  }
}

Field Descriptions:

  • stage: preparing | running | finalizing | started | completed | failed
  • progress: Float between 0.0 and 1.0 representing the fraction of work completed (e.g. 0.65 = 65%)
  • details: Optional task-specific progress information

6. Log Messages

Queue: workflow.logs
Routing Key: workflow.task.log
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "timestamp": "2024-01-15T10:45:35Z",
  "level": "info",
  "source": "stdout",
  "message": "Epoch 65/100 - loss: 0.087 - accuracy: 0.891 - val_loss: 0.092 - val_accuracy: 0.883",
  "stream": "training_logs",
  "metadata": {
    "container_id": "abc123",
    "framework": "tensorflow",
    "log_type": "training"
  }
}

Field Descriptions:

  • level: debug | info | warn | error
  • source: stdout | stderr | system | container
  • stream: Optional stream identifier for log categorization
  • metadata: Optional structured metadata for enhanced log processing
    • framework: ML framework identifier (tensorflow, pytorch, scikit-learn) - Used for direct parser routing
    • log_type: Log classification (console, training, validation, metrics)
    • container_id: Container identifier for log correlation

Framework Selection Priority:

BitFlow uses a two-tier priority system for optimal log parsing performance:

  1. Priority 1 - Metadata-based Direct Routing (Recommended):

    • When metadata.framework is specified, routes directly to framework parser
    • Supports: tensorflow, tf, pytorch, torch, plain, console
    • Case-insensitive matching
    • 23% faster with 23% less memory allocation
  2. Priority 2 - Content-based Detection:

    • Falls back to pattern matching when no framework metadata provided
    • Analyzes log message content to determine appropriate parser
    • Maintains backward compatibility with existing edge nodes
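
The two-tier priority can be sketched as a routing function. The alias table follows the supported values listed above; the content-detection patterns in the fallback are illustrative assumptions, since the spec does not define the exact heuristics:

```python
# Priority 1: metadata-based aliases (case-insensitive), per the supported list above.
FRAMEWORK_ALIASES = {
    "tensorflow": "tensorflow", "tf": "tensorflow",
    "pytorch": "pytorch", "torch": "pytorch",
    "plain": "plain", "console": "plain",
}

def select_parser(metadata: dict, message: str) -> str:
    """Priority 1: metadata.framework; Priority 2: content-based detection."""
    framework = (metadata or {}).get("framework", "").lower()
    if framework in FRAMEWORK_ALIASES:
        return FRAMEWORK_ALIASES[framework]
    # Fallback pattern matching -- example heuristics only.
    if "Epoch" in message and " - loss:" in message:
        return "tensorflow"
    if "Train Loss:" in message:
        return "pytorch"
    return "plain"
```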

Recommended Usage:

{
  "metadata": {
    "framework": "tensorflow",  // Direct routing - optimal performance
    "log_type": "training",
    "phase": "train"
  }
}

6.1. Enhanced ML Training Logs

For machine learning training workflows, BitEdge nodes should categorize and enrich log messages to enable automatic metric extraction and training progress tracking.

ML Training Log Categories:

  1. Training Logs (stream: "training_logs"):

    • Training progress messages with epoch/step information
    • Loss, accuracy, and other training metrics
    • Learning rate schedules and optimizer updates
  2. Validation Logs (stream: "validation_logs"):

    • Validation metrics and performance scores
    • Model evaluation results
    • Early stopping indicators
  3. System Logs (stream: "system_logs"):

    • Resource usage notifications
    • GPU memory allocation/deallocation
    • Checkpoint save/load operations

Enhanced Training Log Example:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "timestamp": "2024-01-15T10:45:35Z",
  "level": "info",
  "source": "stdout",
  "message": "Epoch 65/100 - loss: 0.087 - accuracy: 0.891 - val_loss: 0.092 - val_accuracy: 0.883 - lr: 0.0001",
  "stream": "training_logs",
  "metadata": {
    "framework": "tensorflow",
    "log_type": "training",
    "phase": "train",
    "container_id": "abc123",
    "step": 6500,
    "epoch": 65
  }
}

Metric Extraction Guidelines:

BitEdge nodes should format training logs to enable automatic metric extraction:

  • TensorFlow Format: Epoch 1/10 - loss: 0.5 - accuracy: 0.8 - val_loss: 0.6 - val_accuracy: 0.75
  • PyTorch Format: [Epoch 1/10] Train Loss: 0.5, Train Acc: 80.0%, Val Loss: 0.6, Val Acc: 75.0%
  • Custom Format: Include metric names with numeric values for easy parsing
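
For the TensorFlow-style `name: value` format above, a simple regex is enough to pull out metric pairs; this sketch shows the idea (the pattern is illustrative and handles only non-negative decimal values):

```python
import re

# Matches "name: 0.5" pairs in a log line, e.g. "loss: 0.087".
METRIC_RE = re.compile(r"(\w[\w_]*):\s*([0-9]*\.?[0-9]+)")

def extract_metrics(message: str) -> dict:
    """Pull metric name/value pairs out of a TensorFlow-style training log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(message)}
```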

Training Phase Indicators:

Use metadata.phase to indicate training phases:

  • train: Training phase metrics
  • validation: Validation/test phase metrics
  • test: Final evaluation metrics

7. Node Health Status

Queue: node.health
Routing Key: node.health.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "healthy",
  "checks": {
    "rabbitmq": { "status": "ok", "latency_ms": 5 },
    "docker": { "status": "ok", "version": "24.0.7" },
    "resources": {
      "cpu_available": 85.3,
      "memory_available_gb": 28.5,
      "disk_available_gb": 450.2
    }
  },
  "capabilities": ["docker", "gpu", "high-memory"],
  "uptime_seconds": 86400
}

8. Node Performance Metrics

Queue: node.metrics
Routing Key: node.metrics.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:30:00Z",
  "tasks_processed": 1247,
  "tasks_successful": 1198,
  "tasks_failed": 49,
  "average_execution_time_seconds": 3240.5,
  "resource_usage": {
    "cpu_percent": 45.2,
    "memory_used_gb": 12.8,
    "memory_total_gb": 64.0,
    "disk_used_gb": 125.3,
    "network_bytes_sent": 5368709120,
    "network_bytes_received": 10737418240
  }
}

Field Descriptions:

  • status: healthy | degraded | unhealthy | offline
  • Health messages should be sent every 30-60 seconds
  • Metrics messages should be sent every 5-10 minutes
  • Both message types help the API server monitor datacenter health

Status Update Mechanisms

BitEdge nodes should use message queues as the primary mechanism for reporting task status updates. HTTP callbacks are optional and supplementary.

Primary Method: Message Queue Updates

All status updates should be published to the appropriate RabbitMQ queues:

  1. Progress Updates: workflow.progress queue
  2. Final Results: workflow.results queue
  3. Log Streams: workflow.logs queue
  4. Node Health: node.health queue
  5. Performance Metrics: node.metrics queue

This ensures reliable, asynchronous status reporting that works well with edge nodes that may have intermittent connectivity.
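
The queue and routing-key pairs used by each update type can be centralized in one lookup, which keeps publish calls consistent across the codebase. A Python sketch (the function name is illustrative; the queue names and routing keys come from the message sections above):

```python
def route_for(update_type: str, datacenter_id: str) -> tuple:
    """Return the (queue, routing_key) pair for a status update type."""
    routes = {
        "progress": ("workflow.progress", "workflow.task.progress"),
        "result":   ("workflow.results", "workflow.task.result"),
        "log":      ("workflow.logs", "workflow.task.log"),
        "health":   ("node.health", f"node.health.{datacenter_id}"),
        "metrics":  ("node.metrics", f"node.metrics.{datacenter_id}"),
    }
    return routes[update_type]
```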

Optional Method: HTTP Callback Protocol

When a task includes callback configuration, the edge node may optionally send HTTP POST requests to the specified callback URL for critical events only (started, completed, failed). High-frequency progress updates should only use message queues to avoid overwhelming the API.

Callback Request Format:

POST {callback.url}
Authorization: Bearer {callback.token}
Content-Type: application/json

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "datacenter_id": "edge-datacenter-1",
  "event": "progress",
  "status": "running",
  "progress": 0.65,
  "message": "Training epoch 65/100 completed",
  "error_message": null,
  "exit_code": null,
  "metrics": {
    "current_epoch": 65,
    "total_epochs": 100,
    "current_loss": 0.087,
    "current_accuracy": 0.891
  },
  "logs": [
    "Epoch 65/100 - loss: 0.087 - accuracy: 0.891"
  ]
}

Callback Events:

  1. started: Task execution has begun

    • Include: task_id, run_id, datacenter_id, event: "started"
    • Optional: message with startup information
  2. progress: Periodic progress updates during execution

    • Include: task_id, run_id, datacenter_id, event: "progress", progress (0.0-1.0)
    • Optional: message, metrics with current metrics
  3. completed: Task finished successfully

    • Include: task_id, run_id, datacenter_id, event: "completed", exit_code: 0
    • Optional: message, metrics with final results
  4. failed: Task execution failed

    • Include: task_id, run_id, datacenter_id, event: "failed", error_message
    • Optional: exit_code, logs with error details

Callback Response Handling:

  • 2xx Success: Callback acknowledged, continue processing
  • 401/403: Invalid token, log error but continue task execution
  • 404: Endpoint not found, disable further callbacks for this task
  • 5xx: Server error, retry with exponential backoff (max 3 retries)
  • Timeout: After 10 seconds, log warning and continue
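
The response-handling rules above reduce to a small decision function plus a backoff schedule. This Python sketch assumes a 1-second base delay for the exponential backoff, which the spec does not mandate:

```python
def backoff_delays(base: float = 1.0, retries: int = 3) -> list:
    """Exponential backoff schedule for 5xx failures (base delay is an assumption)."""
    return [base * (2 ** attempt) for attempt in range(retries)]

def handle_callback_status(status_code: int) -> str:
    """Map a callback HTTP status code to the action described above."""
    if 200 <= status_code < 300:
        return "continue"            # acknowledged
    if status_code in (401, 403):
        return "log_and_continue"    # bad token; task execution proceeds
    if status_code == 404:
        return "disable_callbacks"   # endpoint gone; stop calling back
    if status_code >= 500:
        return "retry"               # retry with exponential backoff
    return "log_and_continue"
```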

Security Requirements:

  • Always include the authentication token in the Authorization header
  • Use HTTPS for production callback URLs
  • Never include sensitive data in callback payloads
  • Validate SSL certificates for callback endpoints

Edge Node Protocol Implementation

1. Connection and Queue Management

Edge Node Startup Process:

  1. Initialize RabbitMQ Connection

    • Connect using configuration parameters
    • Set connection name as bitedge-{node_id}
    • Configure prefetch count to 1 for fair task distribution
    • Enable publisher confirms for reliable message delivery
  2. Declare Required Queues

    • Task queue: workflow.tasks.{datacenter_id}
    • Results queue: workflow.results
    • Progress queue: workflow.progress
    • Logs queue: workflow.logs
    • Health queue: node.health
    • Metrics queue: node.metrics
  3. Start Task Consumer

    • Consume from workflow.tasks.{datacenter_id}
    • Handle message acknowledgments properly
    • Implement graceful shutdown on connection loss
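
The startup settings above can be collected into one configuration structure before handing them to an AMQP client such as pika; the dict layout here is an illustrative assumption, not a client API:

```python
def connection_config(node_id: str, datacenter_id: str, rabbitmq_url: str) -> dict:
    """Connection settings and queue names from the startup steps above."""
    return {
        "url": rabbitmq_url,
        "client_properties": {"connection_name": f"bitedge-{node_id}"},
        "prefetch_count": 1,         # fair task distribution
        "publisher_confirms": True,  # reliable message delivery
        "queues": [
            f"workflow.tasks.{datacenter_id}",
            "workflow.results",
            "workflow.progress",
            "workflow.logs",
            "node.health",
            "node.metrics",
        ],
    }
```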

2. Task Processing Protocol

Message Processing Flow:

  1. Receive Task Message

    • Parse JSON message structure
    • Validate required fields (task_id, run_id, execution_config)
    • Check task expiration timestamp
  2. Task Validation

    • Verify node capabilities match task requirements
    • Check resource availability (CPU, memory, GPU)
    • Validate container image accessibility
  3. Execution Lifecycle

    • Send “started” progress update immediately
    • If callback configured, send HTTP POST with “started” event
    • Pull container image if not cached
    • Create and start container with specified configuration
    • Stream real-time logs during execution
    • Monitor resource usage and send periodic progress updates
    • If callback configured, send periodic “progress” events
    • Handle timeouts and cancellation requests
  4. Artifact Upload and Result Publication

    • Collect execution metrics and logs
    • Package output files into tar.gz archive
    • Upload artifacts using presigned URL from task configuration
    • Publish final result with artifact storage path and metadata
    • If callback configured, send “completed” or “failed” event
    • Clean up containers and temporary resources

3. Container Execution Requirements

Container Runtime Integration:

  • Image Management: Pull images from configured registries
  • Resource Limits: Apply CPU, memory, and GPU constraints
  • Volume Mounts: Mount datasets and output directories
  • Environment Variables: Inject task-specific configuration
  • Network Isolation: Configure container networking
  • Security Context: Apply security policies and constraints

Resource Monitoring:

  • Track CPU usage, memory consumption, and GPU utilization
  • Monitor network I/O and disk operations
  • Collect container logs (stdout/stderr)
  • Report resource metrics in result messages

Configuration

Environment Variables

Create a .env file or set environment variables:

# Required
RABBITMQ_URL=amqp://guest:guest@localhost:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001

# Optional
RABBITMQ_EXCHANGE=bitflow
RABBITMQ_QUEUE_PREFIX=workflow
RABBITMQ_CONNECTION_NAME=bitedge-node
RABBITMQ_PREFETCH_COUNT=1
RABBITMQ_RECONNECT_DELAY=5s
RABBITMQ_MAX_RECONNECTS=10

# Docker configuration
DOCKER_HOST=unix:///var/run/docker.sock

Configuration File

Create config.yaml:

datacenter_id: "edge-datacenter-1"
node_id: "edge-node-001"

rabbitmq:
  url: "amqp://guest:guest@localhost:5672/"
  exchange_name: "bitflow"
  queue_prefix: "workflow"
  connection_name: "bitedge-node"
  prefetch_count: 1
  reconnect_delay: "5s"
  max_reconnects: 10
  publisher_confirm: true

docker:
  host: "unix:///var/run/docker.sock"

resources:
  max_cpu: "8"
  max_memory: "16Gi"
  max_storage: "500Gi"
  gpus: 2

capabilities:
  - docker
  - gpu
  - high-memory

Deployment

Docker Deployment

Container Requirements:

  • Base image with container runtime support (Docker, containerd)
  • RabbitMQ client library for chosen language
  • Process manager for graceful shutdowns
  • Health check endpoints for monitoring

Required Docker Configuration:

# Environment variables
RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001

# Volume mounts
/var/run/docker.sock:/var/run/docker.sock  # Docker socket access
/data:/data                                # Dataset storage
/tmp:/tmp                                  # Temporary files

# Network access
# - RabbitMQ broker connectivity
# - Container registry access
# - Dataset download endpoints

Deployment Command Example:

docker run -d \
  --name bitedge-node \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /data:/data \
  -e RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/ \
  -e DATACENTER_ID=edge-datacenter-1 \
  -e NODE_ID=edge-node-001 \
  your-registry/bitedge-node:latest

Kubernetes Deployment

Resource Requirements:

  • Service account with container management permissions
  • ConfigMap or Secret for configuration
  • PersistentVolume for dataset caching
  • NodeAffinity for GPU nodes (if required)

Required Kubernetes Resources:

  • Deployment: BitEdge node application
  • ConfigMap: Application configuration
  • Secret: RabbitMQ credentials and certificates
  • PersistentVolumeClaim: Dataset and cache storage
  • ServiceAccount: RBAC permissions for container operations

Node Selection and Affinity:

nodeSelector:
  node-type: edge-compute
  gpu-support: "true"

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: edge-zone
              operator: In
              values: ["us-west-1a", "us-west-1b"]

Monitoring and Observability

Health Checks

Required Health Check Endpoints:

  • RabbitMQ Connectivity: Verify message broker connection
  • Container Runtime: Check Docker/containerd daemon status
  • Resource Availability: Monitor CPU, memory, and disk space
  • Node Capabilities: Validate GPU drivers and other requirements

Health Check Response Format:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "rabbitmq": { "status": "ok", "latency_ms": 5 },
    "docker": { "status": "ok", "version": "24.0.7" },
    "resources": {
      "cpu_available": 85.3,
      "memory_available_gb": 28.5,
      "disk_available_gb": 450.2
    }
  },
  "node_info": {
    "node_id": "edge-node-001",
    "datacenter_id": "edge-datacenter-1",
    "capabilities": ["docker", "gpu", "high-memory"]
  }
}

Metrics Collection

Node Performance Metrics:

{
  "tasks_processed": 1247,
  "tasks_successful": 1198,
  "tasks_failed": 49,
  "average_execution_time_seconds": 3240.5,
  "resource_usage": {
    "cpu_percent": 45.2,
    "memory_used_gb": 12.8,
    "memory_total_gb": 64.0,
    "disk_used_gb": 125.3,
    "network_bytes_sent": 5368709120,
    "network_bytes_received": 10737418240
  },
  "last_activity": "2024-01-15T10:30:00Z",
  "uptime_seconds": 86400
}

Security Considerations

Container Security

Required Security Measures:

  • Read-only Root Filesystem: Prevent container modification
  • Capability Dropping: Remove unnecessary Linux capabilities
  • Resource Limits: Enforce CPU, memory, and process limits
  • Network Isolation: Use bridge networking with restricted access
  • User Namespace: Run containers as non-root users
  • AppArmor/SELinux: Apply mandatory access controls

Security Configuration Requirements:

security_config:
  read_only_root_fs: true
  no_new_privileges: true
  drop_capabilities: ["ALL"]
  add_capabilities: ["CHOWN", "SETUID", "SETGID"]
  run_as_user: 1000
  run_as_group: 1000
  seccomp_profile: "runtime/default"
  apparmor_profile: "docker-default"

resource_limits:
  memory_bytes: 4294967296  # 4GB
  cpu_quota: 200000         # 2 CPU cores
  pids_limit: 1024
  ulimits:
    nofile: 65536
    nproc: 65536

network_config:
  network_mode: "bridge"
  port_bindings: {}  # No exposed ports by default
  dns_servers: ["8.8.8.8", "8.8.4.4"]

Credential Management

  • Use Kubernetes secrets or Docker secrets for sensitive data
  • Implement credential rotation
  • Never log or expose credentials in error messages

Testing

Unit Tests

Test Coverage Requirements:

  • Message Parsing: Validate JSON deserialization of all message types
  • Configuration Loading: Test environment variable and config file handling
  • Task Validation: Verify expiration checks and resource validation
  • Container Creation: Mock container runtime operations
  • Resource Monitoring: Test metrics collection and reporting
  • Error Handling: Validate failure scenarios and recovery

Test Data Examples:

{
  "test_task": {
    "task_id": "test-task-123",
    "run_id": "test-run-456",
    "execution_config": {
      "image": "alpine:latest",
      "command": ["echo", "hello world"],
      "timeout": "300s"
    }
  },
  "expected_result": {
    "status": "completed",
    "exit_code": 0,
    "error_message": null
  }
}

Integration Tests

Test Environment Setup:

  • RabbitMQ Container: Run message broker for testing
  • Docker-in-Docker: Enable container execution in test environment
  • Mock Datasets: Provide sample data for mount testing
  • Network Isolation: Test container networking restrictions

End-to-End Test Scenarios:

  1. Successful Task Execution: Complete workflow with result reporting
  2. Task Timeout Handling: Verify timeout behavior and cleanup
  3. Resource Limit Enforcement: Test CPU and memory constraints
  4. Error Recovery: Validate retry logic and error reporting
  5. Connection Failure: Test RabbitMQ reconnection behavior
  6. Concurrent Tasks: Verify multiple task handling capabilities

Troubleshooting

Common Issues

  1. Connection Issues

    # Check RabbitMQ connectivity
    curl -u guest:guest http://localhost:15672/api/overview

    # Verify queue exists
    curl -u guest:guest http://localhost:15672/api/queues
  2. Docker Issues

    # Check Docker daemon
    docker info

    # Test container creation
    docker run --rm alpine:latest echo "test"
  3. Resource Constraints

    # Monitor resource usage
    docker stats

    # Check system resources
    free -h
    df -h

Logging Configuration

logging:
  level: info
  format: json
  outputs:
    - type: stdout
    - type: file
      path: /var/log/bitedge/node.log
      max_size: 100MB
      max_backups: 5
      max_age: 30

Advanced Features

Custom Task Types

Task Handler Interface Requirements:

  • Task Type Detection: Identify if a task can be handled by the node
  • Resource Validation: Check if required resources are available
  • Execution Logic: Implement custom task execution behavior
  • Progress Reporting: Send progress updates during execution
  • Result Formatting: Return standardized task results

Custom Task Handler Registration:

{
  "task_handlers": [
    {
      "name": "pytorch-trainer",
      "supported_task_types": ["pytorch_training", "model_inference"],
      "required_capabilities": ["gpu", "pytorch"],
      "resource_requirements": {
        "min_gpu_memory_gb": 8,
        "min_cpu_cores": 4
      }
    },
    {
      "name": "data-processor",
      "supported_task_types": ["data_preprocessing", "feature_extraction"],
      "required_capabilities": ["high-memory"],
      "resource_requirements": {
        "min_memory_gb": 32
      }
    }
  ]
}

Resource Management

Resource Allocation Protocol:

  • Resource Discovery: Detect available CPU, memory, GPU, storage
  • Allocation Tracking: Monitor resource usage per task
  • Queuing Logic: Schedule tasks based on resource availability
  • Resource Limits: Enforce per-task resource constraints
  • Cleanup Handling: Release resources after task completion
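
The allocation-tracking and limit-enforcement steps reduce to simple pool arithmetic; this Python sketch is an illustrative minimum (function names are assumptions), not a full scheduler:

```python
def can_allocate(available: dict, required: dict) -> bool:
    """True if every required resource fits in what is currently free."""
    return all(available.get(k, 0) >= v for k, v in required.items())

def allocate(available: dict, required: dict) -> dict:
    """Return the remaining pool after reserving resources for a task."""
    if not can_allocate(available, required):
        raise ValueError("insufficient resources")  # queue the task instead
    return {k: v - required.get(k, 0) for k, v in available.items()}
```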

Resource Status Reporting:

{
  "resource_status": {
    "total_resources": {
      "cpu_cores": 16,
      "memory_gb": 64,
      "gpus": 4,
      "storage_gb": 1000
    },
    "available_resources": {
      "cpu_cores": 12,
      "memory_gb": 48,
      "gpus": 2,
      "storage_gb": 750
    },
    "allocated_tasks": 3,
    "queued_tasks": 1
  }
}

Plugin System

Plugin Architecture Requirements:

  • Dynamic Loading: Support runtime plugin registration
  • API Compatibility: Maintain consistent plugin interfaces
  • Configuration Management: Handle plugin-specific settings
  • Dependency Resolution: Manage plugin dependencies
  • Error Isolation: Prevent plugin failures from affecting core functionality

Plugin Manifest Format:

{
  "plugin_info": {
    "name": "custom-ml-plugin",
    "version": "1.2.3",
    "description": "Custom machine learning task handler",
    "author": "Your Organization",
    "license": "Apache-2.0"
  },
  "capabilities": {
    "task_types": ["custom_ml_training", "custom_inference"],
    "required_resources": ["gpu", "tensorflow"],
    "supported_frameworks": ["tensorflow", "pytorch"]
  },
  "configuration_schema": {
    "model_registry_url": { "type": "string", "required": true },
    "max_concurrent_tasks": { "type": "integer", "default": 2 }
  }
}

Support and Contributing

For issues, questions, or contributions:

  1. Check the troubleshooting section
  2. Review existing GitHub issues 
  3. Create a new issue with detailed information
  4. Follow the contributing guidelines for pull requests

License

This integration guide is part of the BitFlow platform documentation.
