
BitEdge Integration Specification

This document specifies how edge node applications integrate with the BitFlow platform over RabbitMQ (AMQP) messaging.

Overview

BitEdge is the edge computing component of the BitFlow platform that executes workflow tasks on distributed edge nodes. This specification defines the messaging protocol and data structures required for edge nodes to:

  • Consume workflow tasks from RabbitMQ queues
  • Execute containerized workloads (Docker, Kubernetes, etc.)
  • Report progress and results back to the platform
  • Provide real-time monitoring and logging

Implementation Note: This specification is language-agnostic and can be implemented in Python, Go, Java, Node.js, or any language with RabbitMQ support.

Architecture

┌─────────────────┐      ┌──────────────┐      ┌─────────────────┐
│   BitFlow API   │─────▶│   RabbitMQ   │─────▶│  BitEdge Node   │
│     Server      │      │   Message    │      │   Application   │
│                 │      │    Broker    │      │                 │
└─────────────────┘      └──────────────┘      └─────────────────┘
         ▲                                              │
         │              ┌─────────────────┐             │
         │              │                 │             │
         └──────────────┤   Progress &    │◀────────────┘
                        │     Results     │
                        └─────────────────┘

Prerequisites

  • Container Runtime (Docker, containerd, or Kubernetes)
  • RabbitMQ Client Library for your chosen language
  • RabbitMQ Access (connection URL and credentials)
  • Network Connectivity to RabbitMQ broker and container registries

Connection Configuration

RabbitMQ Connection Parameters

rabbitmq:
  url: "amqp://username:password@rabbitmq-host:5672/"
  exchange_name: "bitflow"
  connection_name: "bitedge-{node_id}"
  prefetch_count: 1
  publisher_confirm: true
  reconnect_delay: 5s
  max_reconnects: 10

Required Environment Variables

RABBITMQ_URL=amqp://guest:guest@localhost:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001

API Endpoints for BitEdge Integration

Artifact Upload using Presigned URLs

Important: BitEdge nodes should use presigned URLs for artifact uploads instead of direct API uploads. Presigned URLs are provided in the task configuration and offer secure, direct-to-storage uploads without exposing credentials.

Task Configuration with Presigned URLs: Tasks include artifact_upload_urls in their configuration:

{
  "task_id": "task_1234567890",
  "execution_config": {
    "image": "python:3.11-slim",
    "command": ["python", "train.py"],
    "artifact_upload_urls": {
      "base_upload_url": "https://minio.example.com/bitflow-data/runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
      "expires_at": "2024-01-16T11:30:00Z",
      "max_file_size": 1073741824
    }
  }
}

Upload Process:

  1. Package Artifacts: Create a tar.gz archive of all output files
  2. Upload using Presigned URL: Use HTTP PUT to upload directly to storage
  3. Register Artifact: Call the artifact registration endpoint to record the upload
  4. Report Results: Include artifact information in task result message

Example Upload:

# Package artifacts
tar -czf output.tar.gz -C /workspace/outputs .

# Upload using presigned URL (HTTP PUT, not POST)
curl -X PUT \
  -H "Content-Type: application/gzip" \
  -T output.tar.gz \
  "${PRESIGNED_UPLOAD_URL}"

# Register the uploaded artifact (NEW STEP)
curl -X POST \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "output.tar.gz",
    "type": "model",
    "path": "runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz",
    "size": 134217728,
    "mime_type": "application/gzip",
    "checksum": "sha256:a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3",
    "uploaded_at": "2024-01-16T10:30:00Z",
    "metadata": {
      "task_id": "task_1234567890",
      "framework": "tensorflow",
      "model_type": "resnet50"
    }
  }' \
  "${API_BASE_URL}/api/v1/runs/run_0987654321/artifacts/register"

Benefits of Presigned URLs:

  • Security: No credentials exposed to edge nodes
  • Performance: Direct upload to storage, bypassing API server
  • Reliability: Pre-authorized access with automatic expiration
  • Scalability: Reduces API server load for large file uploads

Message Types and Protocol

1. Node Registration Message

Queue: node.status
Routing Key: node.registration.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "event_type": "node_registration",
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:30:00Z",
  "node_info": {
    "name": "GPU Training Node 1",
    "architecture": "amd64",
    "os": "linux",
    "capabilities": ["docker", "gpu", "pytorch", "tensorflow"],
    "resources": {
      "cpu_cores": 16,
      "memory": "64Gi",
      "gpus": 4,
      "gpu_type": "nvidia-a100",
      "storage": "2Ti",
      "network_speed": "10Gbps"
    },
    "labels": {
      "node-type": "gpu",
      "zone": "us-west-1a",
      "tier": "premium"
    }
  }
}

Field Descriptions:

  • event_type: Always node_registration for registration messages
  • node_id: Unique identifier for the node within the datacenter
  • datacenter_id: ID of the datacenter this node belongs to
  • node_info.capabilities: List of supported features (docker, gpu, pytorch, etc.)
  • node_info.resources.gpu_type: Specific GPU model (nvidia-a100, nvidia-v100, etc.)
  • node_info.labels: Key-value pairs for node classification and scheduling
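
The registration payload can be assembled with any language's JSON support before being published to the `bitflow` exchange. A minimal Python sketch (the helper name and the trimmed-down `node_info` are illustrative, not part of the spec; the actual publish call depends on your AMQP client and is omitted):

```python
import json
from datetime import datetime, timezone

def build_registration_message(node_id: str, datacenter_id: str, node_info: dict) -> dict:
    """Assemble a node_registration payload matching the structure above."""
    return {
        "event_type": "node_registration",
        "node_id": node_id,
        "datacenter_id": datacenter_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "node_info": node_info,
    }

msg = build_registration_message(
    "edge-node-001",
    "edge-datacenter-1",
    {"name": "GPU Training Node 1", "capabilities": ["docker", "gpu"]},
)
body = json.dumps(msg)  # publish this body to exchange "bitflow"
routing_key = f"node.registration.{msg['datacenter_id']}"
```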

2. Node Heartbeat Message

Queue: node.status
Routing Key: node.heartbeat.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "event_type": "node_heartbeat",
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:35:00Z",
  "health": {
    "status": "healthy",
    "uptime_seconds": 86400,
    "resource_usage": {
      "cpu_percent": 45.2,
      "memory_used": 13631488000,
      "memory_percent": 12.5,
      "gpu_usage": [
        {
          "gpu_id": 0,
          "name": "NVIDIA A100",
          "utilization": 85.5,
          "memory_used": 68719476736,
          "memory_total": 85899345920,
          "temperature": 72.0,
          "power_usage": 220.0
        }
      ],
      "network_io": {
        "bytes_received": 1048576000,
        "bytes_transmitted": 524288000,
        "packets_received": 1000000,
        "packets_transmitted": 500000
      },
      "disk_io": {
        "bytes_read": 2147483648,
        "bytes_written": 1073741824,
        "reads_count": 10000,
        "writes_count": 5000
      },
      "duration": "300s",
      "peak_memory": 4294967296,
      "peak_cpu": 89.7
    },
    "tasks_processed": 127,
    "tasks_successful": 122,
    "tasks_failed": 5,
    "checks": {
      "rabbitmq": { "status": "ok", "latency_ms": 5 },
      "docker": { "status": "ok", "version": "24.0.7" },
      "gpu_driver": { "status": "ok", "version": "535.104.12" }
    }
  },
  "dataset_stats": [
    {
      "dataset_id": "dataset_abc123",
      "status": "available",
      "file_count": 1250,
      "samples_count": 50000,
      "total_size": 2147483648,
      "local_path": "/data/datasets/dataset_abc123",
      "last_accessed": "2024-01-15T09:15:00Z",
      "last_synced": "2024-01-14T20:00:00Z",
      "sync_progress": 1.0,
      "checksum_verified": true,
      "access_count": 45,
      "bytes_read": 524288000
    }
  ]
}

Field Descriptions:

  • event_type: Always node_heartbeat for heartbeat messages
  • health.status: Node health status (healthy, degraded, unhealthy)
  • health.resource_usage: Current resource consumption metrics
  • health.tasks_*: Cumulative task execution statistics
  • health.checks: Health check results for various components
  • dataset_stats: Current dataset availability and statistics
  • dataset_stats[].samples_count: Critical field - Number of samples in dataset
  • dataset_stats[].status: Dataset availability (available, syncing, corrupted, missing)
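
One way to populate `health.status` is to derive it from the per-component `checks`. The sketch below is an assumption about policy, not a rule from the spec: it treats a failed broker check as fatal and any other failed check as degradation.

```python
def derive_health_status(checks: dict) -> str:
    """Map component check results to healthy | degraded | unhealthy."""
    if all(c.get("status") == "ok" for c in checks.values()):
        return "healthy"
    # Assumption: losing the message broker makes the node unusable,
    # since all task delivery and reporting flows through RabbitMQ.
    if checks.get("rabbitmq", {}).get("status") != "ok":
        return "unhealthy"
    return "degraded"
```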

3. Workflow Task Message

Queue: workflow.tasks.{datacenter_id}
Routing Key: workflow.task.execute.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "workflow_id": "wf_abcdef123456",
  "datacenter_id": "edge-datacenter-1",
  "user_id": "user_123",
  "task_type": "execution",
  "priority": 3,
  "created_at": "2024-01-15T10:30:00Z",
  "expires_at": "2024-01-15T11:30:00Z",
  "retry_count": 0,
  "max_retries": 3,
  "execution_config": {
    "image": "python:3.11-slim",
    "command": ["python", "train.py", "--epochs", "100"],
    "environment": {
      "MODEL_TYPE": "resnet50",
      "BATCH_SIZE": "32",
      "LEARNING_RATE": "0.001"
    },
    "working_dir": "/app",
    "resources": {
      "cpu_limit": "2",
      "memory_limit": "4Gi",
      "gpu_count": 1,
      "gpu_type": "nvidia.com/gpu",
      "storage_limit": "10Gi"
    },
    "mounts": [
      {
        "type": "dataset",
        "source": "dataset_abc123",
        "target": "/data",
        "permissions": "ro",
        "filter_paths": ["train/", "validation/"]
      }
    ],
    "timeout": "3600s",
    "kill_timeout": "30s",
    "artifact_upload_urls": {
      "base_upload_url": "https://minio.example.com/bitflow-data/runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
      "expires_at": "2024-01-16T11:30:00Z",
      "max_file_size": 1073741824
    }
  },
  "callback": {
    "url": "https://api.bitroc.ai/v1/workflows/{workflow_id}/callback",
    "token": "secure-callback-token",
    "events": ["started", "progress", "completed", "failed"]
  }
}

Field Descriptions:

  • task_type: execution | monitoring | cleanup
  • priority: 1 (high) to 5 (low)
  • expires_at: Task expires and should be rejected if current time > expires_at
  • execution_config.mounts: Volume mount specifications for datasets
  • execution_config.artifact_upload_urls: Presigned URLs for secure artifact upload
    • base_upload_url: Presigned URL for uploading the packaged artifact file
    • expires_at: ISO 8601 timestamp when the upload URL expires
    • max_file_size: Maximum allowed file size in bytes (typically 1GB)
  • callback: Structured callback configuration for status updates
    • url: HTTP endpoint for posting task status updates
    • token: Authentication token to include in callback requests
    • events: List of events that trigger callbacks
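
Putting the `expires_at` and capability rules into code, a minimal acceptance check might look like the following Python sketch (the GPU-capability inference is an assumption for illustration; a real node would check every resource and capability it advertises):

```python
from datetime import datetime, timezone

def should_accept_task(task: dict, node_capabilities: set, now: datetime = None) -> bool:
    """Reject expired tasks and tasks the node cannot run."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.strptime(
        task["expires_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    if now > expires:
        return False  # expired tasks must be rejected
    # Assumption: gpu_count > 0 implies the node needs the "gpu" capability.
    needs_gpu = task["execution_config"]["resources"].get("gpu_count", 0) > 0
    if needs_gpu and "gpu" not in node_capabilities:
        return False
    return True
```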

4. Task Result Message

Queue: workflow.results
Routing Key: workflow.task.result
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "status": "completed",
  "exit_code": 0,
  "error_message": null,
  "started_at": "2024-01-15T10:30:15Z",
  "completed_at": "2024-01-15T11:25:30Z",
  "container_id": "container_xyz789",
  "node_info": {
    "node_id": "edge-node-001",
    "node_name": "edge-gpu-server-01",
    "datacenter_id": "edge-datacenter-1",
    "architecture": "amd64",
    "os": "linux",
    "capabilities": ["docker", "gpu", "high-memory"],
    "resources": {
      "cpu_cores": 16,
      "memory": "64Gi",
      "gpus": 4,
      "storage": "2Ti",
      "network_speed": "10Gbps"
    },
    "labels": { "node-type": "gpu", "zone": "us-west-1a" }
  },
  "resource_usage": {
    "cpu_percent": 75.5,
    "memory_used": 3221225472,
    "memory_percent": 12.5,
    "gpu_usage": [
      {
        "gpu_id": 0,
        "name": "NVIDIA A100",
        "utilization": 95.2,
        "memory_used": 68719476736,
        "memory_total": 85899345920,
        "temperature": 78.5,
        "power_usage": 250.0
      }
    ],
    "network_io": {
      "bytes_received": 1048576000,
      "bytes_transmitted": 524288000,
      "packets_received": 1000000,
      "packets_transmitted": 500000
    },
    "disk_io": {
      "bytes_read": 2147483648,
      "bytes_written": 1073741824,
      "reads_count": 10000,
      "writes_count": 5000
    },
    "duration": "3315s",
    "peak_memory": 4294967296,
    "peak_cpu": 89.7
  },
  "log_summary": "Training completed successfully. Final accuracy: 94.2%",
  "output_summary": "Model saved to /output/model.pth. Metrics logged to /output/metrics.json",
  "artifacts": [
    {
      "name": "output.tar.gz",
      "path": "runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz",
      "size": 134217728,
      "content_type": "application/gzip",
      "checksum": "sha256:a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3",
      "uploaded_at": "2024-01-15T11:25:30Z",
      "upload_method": "presigned_url"
    }
  ],
  "metrics": {
    "final_accuracy": 94.2,
    "final_loss": 0.045,
    "training_time": 3315,
    "epochs_completed": 100,
    "samples_processed": 50000
  }
}

Field Descriptions:

  • status: completed | failed | cancelled
  • exit_code: Container exit code (null if not available)
  • resource_usage.memory_used: Memory in bytes
  • resource_usage.duration: Execution time in seconds
  • artifacts: Output files uploaded using presigned URLs
    • name: Artifact filename (typically “output.tar.gz” for packaged outputs)
    • path: Object storage path where the artifact is stored
    • size: File size in bytes
    • content_type: MIME type of the uploaded file
    • checksum: SHA256 checksum for integrity verification
    • uploaded_at: Timestamp when artifact was uploaded to object storage
    • upload_method: Always “presigned_url” for BitEdge uploads
  • metrics: Custom metrics specific to the workflow
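
The `checksum` field uses the `sha256:<hex>` form. A small Python sketch for computing it over a packaged artifact (the function names are illustrative; the streaming variant avoids loading large tarballs into memory):

```python
import hashlib

def artifact_checksum(data: bytes) -> str:
    """Return the sha256:<hex> string used in the artifacts list."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file (e.g. output.tar.gz) in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()
```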

5. Progress Updates

Queue: workflow.progress
Routing Key: workflow.task.progress
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "timestamp": "2024-01-15T10:45:30Z",
  "stage": "running",
  "progress": 0.65,
  "message": "Training epoch 65/100 completed",
  "details": {
    "current_epoch": 65,
    "total_epochs": 100,
    "current_loss": 0.087,
    "current_accuracy": 0.891,
    "time_remaining": "1800s"
  }
}

Field Descriptions:

  • stage: preparing | running | finalizing | started | completed | failed
  • progress: Float between 0.0 and 1.0 representing the fraction of work completed (e.g. 0.65 = 65%)
  • details: Optional task-specific progress information

6. Log Messages

Queue: workflow.logs
Routing Key: workflow.task.log
Content-Type: application/json

Message Structure:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "timestamp": "2024-01-15T10:45:35Z",
  "level": "info",
  "source": "stdout",
  "message": "Epoch 65/100 - loss: 0.087 - accuracy: 0.891 - val_loss: 0.092 - val_accuracy: 0.883",
  "stream": "training_logs",
  "metadata": {
    "container_id": "abc123",
    "framework": "tensorflow",
    "log_type": "training"
  }
}

Field Descriptions:

  • level: debug | info | warn | error
  • source: stdout | stderr | system | container
  • stream: Optional stream identifier for log categorization
  • metadata: Optional structured metadata for enhanced log processing
    • framework: ML framework identifier (tensorflow, pytorch, scikit-learn) - Used for direct parser routing
    • log_type: Log classification (console, training, validation, metrics)
    • container_id: Container identifier for log correlation

Framework Selection Priority:

BitFlow uses a two-tier priority system for optimal log parsing performance:

  1. Priority 1 - Metadata-based Direct Routing (Recommended):

    • When metadata.framework is specified, routes directly to framework parser
    • Supports: tensorflow, tf, pytorch, torch, plain, console
    • Case-insensitive matching
    • 23% faster with 23% less memory allocation
  2. Priority 2 - Content-based Detection:

    • Falls back to pattern matching when no framework metadata provided
    • Analyzes log message content to determine appropriate parser
    • Maintains backward compatibility with existing edge nodes
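
The two-tier priority can be sketched as a routing function. The alias table follows the supported values listed above; the content-detection patterns in the fallback are illustrative assumptions, since the spec does not define the exact heuristics:

```python
# Priority 1: metadata-based aliases (case-insensitive), per the supported list above.
FRAMEWORK_ALIASES = {
    "tensorflow": "tensorflow", "tf": "tensorflow",
    "pytorch": "pytorch", "torch": "pytorch",
    "plain": "plain", "console": "plain",
}

def select_parser(metadata: dict, message: str) -> str:
    """Priority 1: metadata.framework; Priority 2: content-based detection."""
    framework = (metadata or {}).get("framework", "").lower()
    if framework in FRAMEWORK_ALIASES:
        return FRAMEWORK_ALIASES[framework]
    # Fallback pattern matching -- example heuristics only.
    if "Epoch" in message and " - loss:" in message:
        return "tensorflow"
    if "Train Loss:" in message:
        return "pytorch"
    return "plain"
```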

Recommended Usage:

{
  "metadata": {
    "framework": "tensorflow",  // Direct routing - optimal performance
    "log_type": "training",
    "phase": "train"
  }
}

6.1. Enhanced ML Training Logs

For machine learning training workflows, BitEdge nodes should categorize and enrich log messages to enable automatic metric extraction and training progress tracking.

ML Training Log Categories:

  1. Training Logs (stream: "training_logs"):

    • Training progress messages with epoch/step information
    • Loss, accuracy, and other training metrics
    • Learning rate schedules and optimizer updates
  2. Validation Logs (stream: "validation_logs"):

    • Validation metrics and performance scores
    • Model evaluation results
    • Early stopping indicators
  3. System Logs (stream: "system_logs"):

    • Resource usage notifications
    • GPU memory allocation/deallocation
    • Checkpoint save/load operations

Enhanced Training Log Example:

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "timestamp": "2024-01-15T10:45:35Z",
  "level": "info",
  "source": "stdout",
  "message": "Epoch 65/100 - loss: 0.087 - accuracy: 0.891 - val_loss: 0.092 - val_accuracy: 0.883 - lr: 0.0001",
  "stream": "training_logs",
  "metadata": {
    "framework": "tensorflow",
    "log_type": "training",
    "phase": "train",
    "container_id": "abc123",
    "step": 6500,
    "epoch": 65
  }
}

Metric Extraction Guidelines:

BitEdge nodes should format training logs to enable automatic metric extraction:

  • TensorFlow Format: Epoch 1/10 - loss: 0.5 - accuracy: 0.8 - val_loss: 0.6 - val_accuracy: 0.75
  • PyTorch Format: [Epoch 1/10] Train Loss: 0.5, Train Acc: 80.0%, Val Loss: 0.6, Val Acc: 75.0%
  • Custom Format: Include metric names with numeric values for easy parsing
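
For the TensorFlow-style `name: value` format above, a simple regex is enough to pull out metric pairs; this sketch shows the idea (the pattern is illustrative and handles only non-negative decimal values):

```python
import re

# Matches "name: 0.5" pairs in a log line, e.g. "loss: 0.087".
METRIC_RE = re.compile(r"(\w[\w_]*):\s*([0-9]*\.?[0-9]+)")

def extract_metrics(message: str) -> dict:
    """Pull metric name/value pairs out of a TensorFlow-style training log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(message)}
```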

Training Phase Indicators:

Use metadata.phase to indicate training phases:

  • train: Training phase metrics
  • validation: Validation/test phase metrics
  • test: Final evaluation metrics

7. Node Health Status

Queue: node.health
Routing Key: node.health.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "healthy",
  "checks": {
    "rabbitmq": { "status": "ok", "latency_ms": 5 },
    "docker": { "status": "ok", "version": "24.0.7" },
    "resources": {
      "cpu_available": 85.3,
      "memory_available_gb": 28.5,
      "disk_available_gb": 450.2
    }
  },
  "capabilities": ["docker", "gpu", "high-memory"],
  "uptime_seconds": 86400
}

8. Node Performance Metrics

Queue: node.metrics
Routing Key: node.metrics.{datacenter_id}
Content-Type: application/json

Message Structure:

{
  "node_id": "edge-node-001",
  "datacenter_id": "edge-datacenter-1",
  "timestamp": "2024-01-15T10:30:00Z",
  "tasks_processed": 1247,
  "tasks_successful": 1198,
  "tasks_failed": 49,
  "average_execution_time_seconds": 3240.5,
  "resource_usage": {
    "cpu_percent": 45.2,
    "memory_used_gb": 12.8,
    "memory_total_gb": 64.0,
    "disk_used_gb": 125.3,
    "network_bytes_sent": 5368709120,
    "network_bytes_received": 10737418240
  }
}

Field Descriptions:

  • status: healthy | degraded | unhealthy | offline
  • Health messages should be sent every 30-60 seconds
  • Metrics messages should be sent every 5-10 minutes
  • Both message types help the API server monitor datacenter health

Status Update Mechanisms

BitEdge nodes should use message queues as the primary mechanism for reporting task status updates. HTTP callbacks are optional and supplementary.

Primary Method: Message Queue Updates

All status updates should be published to the appropriate RabbitMQ queues:

  1. Progress Updates: workflow.progress queue
  2. Final Results: workflow.results queue
  3. Log Streams: workflow.logs queue
  4. Node Health: node.health queue
  5. Performance Metrics: node.metrics queue

This ensures reliable, asynchronous status reporting that works well with edge nodes that may have intermittent connectivity.
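
The queue and routing-key pairs used by each update type can be centralized in one lookup, which keeps publish calls consistent across the codebase. A Python sketch (the function name is illustrative; the queue names and routing keys come from the message sections above):

```python
def route_for(update_type: str, datacenter_id: str) -> tuple:
    """Return the (queue, routing_key) pair for a status update type."""
    routes = {
        "progress": ("workflow.progress", "workflow.task.progress"),
        "result":   ("workflow.results", "workflow.task.result"),
        "log":      ("workflow.logs", "workflow.task.log"),
        "health":   ("node.health", f"node.health.{datacenter_id}"),
        "metrics":  ("node.metrics", f"node.metrics.{datacenter_id}"),
    }
    return routes[update_type]
```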

Optional Method: HTTP Callback Protocol

When a task includes callback configuration, the edge node may optionally send HTTP POST requests to the specified callback URL for critical events only (started, completed, failed). High-frequency progress updates should only use message queues to avoid overwhelming the API.

Callback Request Format:

POST {callback.url}
Authorization: Bearer {callback.token}
Content-Type: application/json

{
  "task_id": "task_1234567890",
  "run_id": "run_0987654321",
  "datacenter_id": "edge-datacenter-1",
  "event": "progress",
  "status": "running",
  "progress": 0.65,
  "message": "Training epoch 65/100 completed",
  "error_message": null,
  "exit_code": null,
  "metrics": {
    "current_epoch": 65,
    "total_epochs": 100,
    "current_loss": 0.087,
    "current_accuracy": 0.891
  },
  "logs": [
    "Epoch 65/100 - loss: 0.087 - accuracy: 0.891"
  ]
}

Callback Events:

  1. started: Task execution has begun

    • Include: task_id, run_id, datacenter_id, event: "started"
    • Optional: message with startup information
  2. progress: Periodic progress updates during execution

    • Include: task_id, run_id, datacenter_id, event: "progress", progress (0.0-1.0)
    • Optional: message, metrics with current metrics
  3. completed: Task finished successfully

    • Include: task_id, run_id, datacenter_id, event: "completed", exit_code: 0
    • Optional: message, metrics with final results
  4. failed: Task execution failed

    • Include: task_id, run_id, datacenter_id, event: "failed", error_message
    • Optional: exit_code, logs with error details

Callback Response Handling:

  • 2xx Success: Callback acknowledged, continue processing
  • 401/403: Invalid token, log error but continue task execution
  • 404: Endpoint not found, disable further callbacks for this task
  • 5xx: Server error, retry with exponential backoff (max 3 retries)
  • Timeout: After 10 seconds, log warning and continue
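
The response-handling rules above reduce to a small decision function plus a backoff schedule. This Python sketch assumes a 1-second base delay for the exponential backoff, which the spec does not mandate:

```python
def backoff_delays(base: float = 1.0, retries: int = 3) -> list:
    """Exponential backoff schedule for 5xx failures (base delay is an assumption)."""
    return [base * (2 ** attempt) for attempt in range(retries)]

def handle_callback_status(status_code: int) -> str:
    """Map a callback HTTP status code to the action described above."""
    if 200 <= status_code < 300:
        return "continue"            # acknowledged
    if status_code in (401, 403):
        return "log_and_continue"    # bad token; task execution proceeds
    if status_code == 404:
        return "disable_callbacks"   # endpoint gone; stop calling back
    if status_code >= 500:
        return "retry"               # retry with exponential backoff
    return "log_and_continue"
```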

Security Requirements:

  • Always include the authentication token in the Authorization header
  • Use HTTPS for production callback URLs
  • Never include sensitive data in callback payloads
  • Validate SSL certificates for callback endpoints

Edge Node Protocol Implementation

1. Connection and Queue Management

Edge Node Startup Process:

  1. Initialize RabbitMQ Connection

    • Connect using configuration parameters
    • Set connection name as bitedge-{node_id}
    • Configure prefetch count to 1 for fair task distribution
    • Enable publisher confirms for reliable message delivery
  2. Declare Required Queues

    • Task queue: workflow.tasks.{datacenter_id}
    • Results queue: workflow.results
    • Progress queue: workflow.progress
    • Logs queue: workflow.logs
    • Health queue: node.health
    • Metrics queue: node.metrics
  3. Start Task Consumer

    • Consume from workflow.tasks.{datacenter_id}
    • Handle message acknowledgments properly
    • Implement graceful shutdown on connection loss
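
The startup settings above can be collected into one configuration structure before handing them to an AMQP client such as pika; the dict layout here is an illustrative assumption, not a client API:

```python
def connection_config(node_id: str, datacenter_id: str, rabbitmq_url: str) -> dict:
    """Connection settings and queue names from the startup steps above."""
    return {
        "url": rabbitmq_url,
        "client_properties": {"connection_name": f"bitedge-{node_id}"},
        "prefetch_count": 1,         # fair task distribution
        "publisher_confirms": True,  # reliable message delivery
        "queues": [
            f"workflow.tasks.{datacenter_id}",
            "workflow.results",
            "workflow.progress",
            "workflow.logs",
            "node.health",
            "node.metrics",
        ],
    }
```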

2. Task Processing Protocol

Message Processing Flow:

  1. Receive Task Message

    • Parse JSON message structure
    • Validate required fields (task_id, run_id, execution_config)
    • Check task expiration timestamp
  2. Task Validation

    • Verify node capabilities match task requirements
    • Check resource availability (CPU, memory, GPU)
    • Validate container image accessibility
  3. Execution Lifecycle

    • Send “started” progress update immediately
    • If callback configured, send HTTP POST with “started” event
    • Pull container image if not cached
    • Create and start container with specified configuration
    • Stream real-time logs during execution
    • Monitor resource usage and send periodic progress updates
    • If callback configured, send periodic “progress” events
    • Handle timeouts and cancellation requests
  4. Artifact Upload and Result Publication

    • Collect execution metrics and logs
    • Package output files into tar.gz archive
    • Upload artifacts using presigned URL from task configuration
    • Publish final result with artifact storage path and metadata
    • If callback configured, send “completed” or “failed” event
    • Clean up containers and temporary resources

3. Container Execution Requirements

Container Runtime Integration:

  • Image Management: Pull images from configured registries
  • Resource Limits: Apply CPU, memory, and GPU constraints
  • Volume Mounts: Mount datasets and output directories
  • Environment Variables: Inject task-specific configuration
  • Network Isolation: Configure container networking
  • Security Context: Apply security policies and constraints

Resource Monitoring:

  • Track CPU usage, memory consumption, and GPU utilization
  • Monitor network I/O and disk operations
  • Collect container logs (stdout/stderr)
  • Report resource metrics in result messages

Configuration

Environment Variables

Create a .env file or set environment variables:

# Required
RABBITMQ_URL=amqp://guest:guest@localhost:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001

# Optional
RABBITMQ_EXCHANGE=bitflow
RABBITMQ_QUEUE_PREFIX=workflow
RABBITMQ_CONNECTION_NAME=bitedge-node
RABBITMQ_PREFETCH_COUNT=1
RABBITMQ_RECONNECT_DELAY=5s
RABBITMQ_MAX_RECONNECTS=10

# Docker configuration
DOCKER_HOST=unix:///var/run/docker.sock

Configuration File

Create config.yaml:

datacenter_id: "edge-datacenter-1"
node_id: "edge-node-001"

rabbitmq:
  url: "amqp://guest:guest@localhost:5672/"
  exchange_name: "bitflow"
  queue_prefix: "workflow"
  connection_name: "bitedge-node"
  prefetch_count: 1
  reconnect_delay: "5s"
  max_reconnects: 10
  publisher_confirm: true

docker:
  host: "unix:///var/run/docker.sock"

resources:
  max_cpu: "8"
  max_memory: "16Gi"
  max_storage: "500Gi"
  gpus: 2

capabilities:
  - docker
  - gpu
  - high-memory

Deployment

Docker Deployment

Container Requirements:

  • Base image with container runtime support (Docker, containerd)
  • RabbitMQ client library for chosen language
  • Process manager for graceful shutdowns
  • Health check endpoints for monitoring

Required Docker Configuration:

# Environment variables
RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001

# Volume mounts
/var/run/docker.sock:/var/run/docker.sock  # Docker socket access
/data:/data                                # Dataset storage
/tmp:/tmp                                  # Temporary files

# Network access
# - RabbitMQ broker connectivity
# - Container registry access
# - Dataset download endpoints

Deployment Command Example:

docker run -d \
  --name bitedge-node \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /data:/data \
  -e RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/ \
  -e DATACENTER_ID=edge-datacenter-1 \
  -e NODE_ID=edge-node-001 \
  your-registry/bitedge-node:latest

Kubernetes Deployment

Resource Requirements:

  • Service account with container management permissions
  • ConfigMap or Secret for configuration
  • PersistentVolume for dataset caching
  • NodeAffinity for GPU nodes (if required)

Required Kubernetes Resources:

  • Deployment: BitEdge node application
  • ConfigMap: Application configuration
  • Secret: RabbitMQ credentials and certificates
  • PersistentVolumeClaim: Dataset and cache storage
  • ServiceAccount: RBAC permissions for container operations

Node Selection and Affinity:

nodeSelector:
  node-type: edge-compute
  gpu-support: "true"

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: edge-zone
              operator: In
              values: ["us-west-1a", "us-west-1b"]

Monitoring and Observability

Health Checks

Required Health Check Endpoints:

  • RabbitMQ Connectivity: Verify message broker connection
  • Container Runtime: Check Docker/containerd daemon status
  • Resource Availability: Monitor CPU, memory, and disk space
  • Node Capabilities: Validate GPU drivers and other requirements

Health Check Response Format:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "checks": {
    "rabbitmq": { "status": "ok", "latency_ms": 5 },
    "docker": { "status": "ok", "version": "24.0.7" },
    "resources": {
      "cpu_available": 85.3,
      "memory_available_gb": 28.5,
      "disk_available_gb": 450.2
    }
  },
  "node_info": {
    "node_id": "edge-node-001",
    "datacenter_id": "edge-datacenter-1",
    "capabilities": ["docker", "gpu", "high-memory"]
  }
}

Metrics Collection

Node Performance Metrics:

{
  "tasks_processed": 1247,
  "tasks_successful": 1198,
  "tasks_failed": 49,
  "average_execution_time_seconds": 3240.5,
  "resource_usage": {
    "cpu_percent": 45.2,
    "memory_used_gb": 12.8,
    "memory_total_gb": 64.0,
    "disk_used_gb": 125.3,
    "network_bytes_sent": 5368709120,
    "network_bytes_received": 10737418240
  },
  "last_activity": "2024-01-15T10:30:00Z",
  "uptime_seconds": 86400
}

Security Considerations

Container Security

Required Security Measures:

  • Read-only Root Filesystem: Prevent container modification
  • Capability Dropping: Remove unnecessary Linux capabilities
  • Resource Limits: Enforce CPU, memory, and process limits
  • Network Isolation: Use bridge networking with restricted access
  • User Namespace: Run containers as non-root users
  • AppArmor/SELinux: Apply mandatory access controls

Security Configuration Requirements:

security_config:
  read_only_root_fs: true
  no_new_privileges: true
  drop_capabilities: ["ALL"]
  add_capabilities: ["CHOWN", "SETUID", "SETGID"]
  run_as_user: 1000
  run_as_group: 1000
  seccomp_profile: "runtime/default"
  apparmor_profile: "docker-default"

resource_limits:
  memory_bytes: 4294967296  # 4GB
  cpu_quota: 200000         # 2 CPU cores
  pids_limit: 1024
  ulimits:
    nofile: 65536
    nproc: 65536

network_config:
  network_mode: "bridge"
  port_bindings: {}  # No exposed ports by default
  dns_servers: ["8.8.8.8", "8.8.4.4"]

Credential Management

  • Use Kubernetes secrets or Docker secrets for sensitive data
  • Implement credential rotation
  • Never log or expose credentials in error messages

Testing

Unit Tests

Test Coverage Requirements:

  • Message Parsing: Validate JSON deserialization of all message types
  • Configuration Loading: Test environment variable and config file handling
  • Task Validation: Verify expiration checks and resource validation
  • Container Creation: Mock container runtime operations
  • Resource Monitoring: Test metrics collection and reporting
  • Error Handling: Validate failure scenarios and recovery

Test Data Examples:

{
  "test_task": {
    "task_id": "test-task-123",
    "run_id": "test-run-456",
    "execution_config": {
      "image": "alpine:latest",
      "command": ["echo", "hello world"],
      "timeout": "300s"
    }
  },
  "expected_result": {
    "status": "completed",
    "exit_code": 0,
    "error_message": null
  }
}

Integration Tests

Test Environment Setup:

  • RabbitMQ Container: Run message broker for testing
  • Docker-in-Docker: Enable container execution in test environment
  • Mock Datasets: Provide sample data for mount testing
  • Network Isolation: Test container networking restrictions

End-to-End Test Scenarios:

  1. Successful Task Execution: Complete workflow with result reporting
  2. Task Timeout Handling: Verify timeout behavior and cleanup
  3. Resource Limit Enforcement: Test CPU and memory constraints
  4. Error Recovery: Validate retry logic and error reporting
  5. Connection Failure: Test RabbitMQ reconnection behavior
  6. Concurrent Tasks: Verify multiple task handling capabilities

Troubleshooting

Common Issues

  1. Connection Issues

    # Check RabbitMQ connectivity
    curl -u guest:guest http://localhost:15672/api/overview

    # Verify queue exists
    curl -u guest:guest http://localhost:15672/api/queues
  2. Docker Issues

    # Check Docker daemon
    docker info

    # Test container creation
    docker run --rm alpine:latest echo "test"
  3. Resource Constraints

    # Monitor resource usage
    docker stats

    # Check system resources
    free -h
    df -h

Logging Configuration

logging:
  level: info
  format: json
  outputs:
    - type: stdout
    - type: file
      path: /var/log/bitedge/node.log
      max_size: 100MB
      max_backups: 5
      max_age: 30

Advanced Features

Custom Task Types

Task Handler Interface Requirements:

  • Task Type Detection: Identify if a task can be handled by the node
  • Resource Validation: Check if required resources are available
  • Execution Logic: Implement custom task execution behavior
  • Progress Reporting: Send progress updates during execution
  • Result Formatting: Return standardized task results

Custom Task Handler Registration:

{
  "task_handlers": [
    {
      "name": "pytorch-trainer",
      "supported_task_types": ["pytorch_training", "model_inference"],
      "required_capabilities": ["gpu", "pytorch"],
      "resource_requirements": {
        "min_gpu_memory_gb": 8,
        "min_cpu_cores": 4
      }
    },
    {
      "name": "data-processor",
      "supported_task_types": ["data_preprocessing", "feature_extraction"],
      "required_capabilities": ["high-memory"],
      "resource_requirements": {
        "min_memory_gb": 32
      }
    }
  ]
}

Resource Management

Resource Allocation Protocol:

  • Resource Discovery: Detect available CPU, memory, GPU, storage
  • Allocation Tracking: Monitor resource usage per task
  • Queuing Logic: Schedule tasks based on resource availability
  • Resource Limits: Enforce per-task resource constraints
  • Cleanup Handling: Release resources after task completion
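
The allocation-tracking and limit-enforcement steps reduce to simple pool arithmetic; this Python sketch is an illustrative minimum (function names are assumptions), not a full scheduler:

```python
def can_allocate(available: dict, required: dict) -> bool:
    """True if every required resource fits in what is currently free."""
    return all(available.get(k, 0) >= v for k, v in required.items())

def allocate(available: dict, required: dict) -> dict:
    """Return the remaining pool after reserving resources for a task."""
    if not can_allocate(available, required):
        raise ValueError("insufficient resources")  # queue the task instead
    return {k: v - required.get(k, 0) for k, v in available.items()}
```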

Resource Status Reporting:

{
  "resource_status": {
    "total_resources": {
      "cpu_cores": 16,
      "memory_gb": 64,
      "gpus": 4,
      "storage_gb": 1000
    },
    "available_resources": {
      "cpu_cores": 12,
      "memory_gb": 48,
      "gpus": 2,
      "storage_gb": 750
    },
    "allocated_tasks": 3,
    "queued_tasks": 1
  }
}

Plugin System

Plugin Architecture Requirements:

  • Dynamic Loading: Support runtime plugin registration
  • API Compatibility: Maintain consistent plugin interfaces
  • Configuration Management: Handle plugin-specific settings
  • Dependency Resolution: Manage plugin dependencies
  • Error Isolation: Prevent plugin failures from affecting core functionality

Plugin Manifest Format:

{
  "plugin_info": {
    "name": "custom-ml-plugin",
    "version": "1.2.3",
    "description": "Custom machine learning task handler",
    "author": "Your Organization",
    "license": "Apache-2.0"
  },
  "capabilities": {
    "task_types": ["custom_ml_training", "custom_inference"],
    "required_resources": ["gpu", "tensorflow"],
    "supported_frameworks": ["tensorflow", "pytorch"]
  },
  "configuration_schema": {
    "model_registry_url": { "type": "string", "required": true },
    "max_concurrent_tasks": { "type": "integer", "default": 2 }
  }
}

Support and Contributing

For issues, questions, or contributions:

  1. Check the troubleshooting section
  2. Review existing GitHub issues 
  3. Create a new issue with detailed information
  4. Follow the contributing guidelines for pull requests

License

This integration guide is part of the BitFlow platform documentation.
