BitEdge Integration Specification
This document defines the specification for integrating edge node applications with the BitFlow platform using the RabbitMQ messaging protocol.
Overview
BitEdge is the edge computing component of the BitFlow platform that executes workflow tasks on distributed edge nodes. This specification defines the messaging protocol and data structures required for edge nodes to:
- Consume workflow tasks from RabbitMQ queues
- Execute containerized workloads (Docker, Kubernetes, etc.)
- Report progress and results back to the platform
- Provide real-time monitoring and logging
Implementation Note: This specification is language-agnostic and can be implemented in Python, Go, Java, Node.js, or any language with RabbitMQ support.
Architecture
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ BitFlow API │───▶│ RabbitMQ │───▶│ BitEdge Node │
│ Server │ │ Message │ │ Application │
│ │ │ Broker │ │ │
└─────────────────┘ └──────────────┘ └─────────────────┘
▲ ▲ │
│ ┌────────┴────────┐ │
│ │ │ │
└──────────────┤ Progress & │◀─────────────┘
│ Results │
└─────────────────┘
Prerequisites
- Container Runtime (Docker, containerd, or Kubernetes)
- RabbitMQ Client Library for your chosen language
- RabbitMQ Access (connection URL and credentials)
- Network Connectivity to RabbitMQ broker and container registries
Connection Configuration
RabbitMQ Connection Parameters
rabbitmq:
url: "amqp://username:password@rabbitmq-host:5672/"
exchange_name: "bitflow"
connection_name: "bitedge-{node_id}"
prefetch_count: 1
publisher_confirm: true
reconnect_delay: 5s
max_reconnects: 10
Required Environment Variables
RABBITMQ_URL=amqp://guest:guest@localhost:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001
API Endpoints for BitEdge Integration
Artifact Upload using Presigned URLs
Important: BitEdge nodes should use presigned URLs for artifact uploads instead of direct API uploads. Presigned URLs are provided in the task configuration and offer secure, direct-to-storage uploads without exposing credentials.
Task Configuration with Presigned URLs:
Tasks include artifact_upload_urls in their configuration:
{
"task_id": "task_1234567890",
"execution_config": {
"image": "python:3.11-slim",
"command": ["python", "train.py"],
"artifact_upload_urls": {
"base_upload_url": "https://minio.example.com/bitflow-data/runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
"expires_at": "2024-01-16T11:30:00Z",
"max_file_size": 1073741824
}
}
}
Upload Process:
- Package Artifacts: Create a tar.gz archive of all output files
- Upload using Presigned URL: Use HTTP PUT to upload directly to storage
- Register Artifact: Call the artifact registration endpoint to record the upload
- Report Results: Include artifact information in task result message
Example Upload:
# Package artifacts
tar -czf output.tar.gz -C /workspace/outputs .
# Upload using presigned URL (HTTP PUT, not POST)
curl -X PUT \
-H "Content-Type: application/gzip" \
-T output.tar.gz \
"${PRESIGNED_UPLOAD_URL}"
# Register the uploaded artifact (NEW STEP)
curl -X POST \
-H "Authorization: Bearer ${API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "output.tar.gz",
"type": "model",
"path": "runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz",
"size": 134217728,
"mime_type": "application/gzip",
"checksum": "sha256:a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3",
"uploaded_at": "2024-01-16T10:30:00Z",
"metadata": {
"task_id": "task_1234567890",
"framework": "tensorflow",
"model_type": "resnet50"
}
}' \
"${API_BASE_URL}/api/v1/runs/run_0987654321/artifacts/register"Benefits of Presigned URLs:
- Security: No credentials exposed to edge nodes
- Performance: Direct upload to storage, bypassing API server
- Reliability: Pre-authorized access with automatic expiration
- Scalability: Reduces API server load for large file uploads
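The packaging, upload, and registration flow above can be sketched in Python. This is an illustrative, stdlib-only fragment: the checksum streaming and the registration payload mirror the fields shown in the curl example, while the actual HTTP PUT/POST calls are left to whatever client the node uses. The helper names (`sha256_checksum`, `build_registration_payload`) are assumptions, not part of the spec.

```python
import hashlib
from datetime import datetime, timezone

def sha256_checksum(path, chunk_size=1 << 20):
    # Stream the archive in 1 MiB chunks so large artifacts never load
    # fully into memory; format matches the spec's "sha256:<hex>" style.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"

def build_registration_payload(name, storage_path, size_bytes, checksum, task_id):
    # Body for the artifact registration call shown above; "type" and extra
    # "metadata" fields would be filled in per workflow.
    return {
        "name": name,
        "path": storage_path,
        "size": size_bytes,
        "mime_type": "application/gzip",
        "checksum": checksum,
        "uploaded_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "metadata": {"task_id": task_id},
    }
```

After the PUT to the presigned URL succeeds, the node would POST this payload to the registration endpoint and echo the same fields in the task result message.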
Message Types and Protocol
1. Node Registration Message
Queue: node.status
Routing Key: node.registration.{datacenter_id}
Content-Type: application/json
Message Structure:
{
"event_type": "node_registration",
"node_id": "edge-node-001",
"datacenter_id": "edge-datacenter-1",
"timestamp": "2024-01-15T10:30:00Z",
"node_info": {
"name": "GPU Training Node 1",
"architecture": "amd64",
"os": "linux",
"capabilities": ["docker", "gpu", "pytorch", "tensorflow"],
"resources": {
"cpu_cores": 16,
"memory": "64Gi",
"gpus": 4,
"gpu_type": "nvidia-a100",
"storage": "2Ti",
"network_speed": "10Gbps"
},
"labels": {
"node-type": "gpu",
"zone": "us-west-1a",
"tier": "premium"
}
}
}
Field Descriptions:
- event_type: Always node_registration for registration messages
- node_id: Unique identifier for the node within the datacenter
- datacenter_id: ID of the datacenter this node belongs to
- node_info.capabilities: List of supported features (docker, gpu, pytorch, etc.)
- node_info.resources.gpu_type: Specific GPU model (nvidia-a100, nvidia-v100, etc.)
- node_info.labels: Key-value pairs for node classification and scheduling
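Building the registration body is mechanical; a minimal Python sketch, assuming the function names (they are not part of the spec). A real node would serialize the result with `json.dumps` and publish it to the bitflow exchange with the routing key shown above.

```python
from datetime import datetime, timezone

def build_node_registration(node_id, datacenter_id, node_info):
    # Message body for the node.status queue; node_info carries the
    # capabilities/resources/labels structure from the example above.
    return {
        "event_type": "node_registration",
        "node_id": node_id,
        "datacenter_id": datacenter_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "node_info": node_info,
    }

def registration_routing_key(datacenter_id):
    # Routing key pattern: node.registration.{datacenter_id}
    return f"node.registration.{datacenter_id}"
```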
2. Node Heartbeat Message
Queue: node.status
Routing Key: node.heartbeat.{datacenter_id}
Content-Type: application/json
Message Structure:
{
"event_type": "node_heartbeat",
"node_id": "edge-node-001",
"datacenter_id": "edge-datacenter-1",
"timestamp": "2024-01-15T10:35:00Z",
"health": {
"status": "healthy",
"uptime_seconds": 86400,
"resource_usage": {
"cpu_percent": 45.2,
"memory_used": 13631488000,
"memory_percent": 12.5,
"gpu_usage": [
{
"gpu_id": 0,
"name": "NVIDIA A100",
"utilization": 85.5,
"memory_used": 68719476736,
"memory_total": 85899345920,
"temperature": 72.0,
"power_usage": 220.0
}
],
"network_io": {
"bytes_received": 1048576000,
"bytes_transmitted": 524288000,
"packets_received": 1000000,
"packets_transmitted": 500000
},
"disk_io": {
"bytes_read": 2147483648,
"bytes_written": 1073741824,
"reads_count": 10000,
"writes_count": 5000
},
"duration": "300s",
"peak_memory": 4294967296,
"peak_cpu": 89.7
},
"tasks_processed": 127,
"tasks_successful": 122,
"tasks_failed": 5,
"checks": {
"rabbitmq": { "status": "ok", "latency_ms": 5 },
"docker": { "status": "ok", "version": "24.0.7" },
"gpu_driver": { "status": "ok", "version": "535.104.12" }
}
},
"dataset_stats": [
{
"dataset_id": "dataset_abc123",
"status": "available",
"file_count": 1250,
"samples_count": 50000,
"total_size": 2147483648,
"local_path": "/data/datasets/dataset_abc123",
"last_accessed": "2024-01-15T09:15:00Z",
"last_synced": "2024-01-14T20:00:00Z",
"sync_progress": 1.0,
"checksum_verified": true,
"access_count": 45,
"bytes_read": 524288000
}
]
}
Field Descriptions:
- event_type: Always node_heartbeat for heartbeat messages
- health.status: Node health status (healthy, degraded, unhealthy)
- health.resource_usage: Current resource consumption metrics
- health.tasks_*: Cumulative task execution statistics
- health.checks: Health check results for various components
- dataset_stats: Current dataset availability and statistics
- dataset_stats[].samples_count: Critical field - Number of samples in the dataset
- dataset_stats[].status: Dataset availability (available, syncing, corrupted, missing)
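The spec does not fix how health.status is derived from the per-component checks, so that roll-up is left to the implementation. A hypothetical policy, sketched in Python: a failed critical check (rabbitmq, docker) marks the node unhealthy, any other failed check only degrades it.

```python
def classify_health(checks):
    # Hypothetical roll-up rule - NOT mandated by the spec: entries without
    # a "status" key (e.g. the resources sub-object) are ignored.
    critical = {"rabbitmq", "docker"}
    failed = {name for name, check in checks.items()
              if isinstance(check, dict) and check.get("status") not in (None, "ok")}
    if failed & critical:
        return "unhealthy"
    return "degraded" if failed else "healthy"
```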
3. Workflow Task Message
Queue: workflow.tasks.{datacenter_id}
Routing Key: workflow.task.execute.{datacenter_id}
Content-Type: application/json
Message Structure:
{
"task_id": "task_1234567890",
"run_id": "run_0987654321",
"workflow_id": "wf_abcdef123456",
"datacenter_id": "edge-datacenter-1",
"user_id": "user_123",
"task_type": "execution",
"priority": 3,
"created_at": "2024-01-15T10:30:00Z",
"expires_at": "2024-01-15T11:30:00Z",
"retry_count": 0,
"max_retries": 3,
"execution_config": {
"image": "python:3.11-slim",
"command": ["python", "train.py", "--epochs", "100"],
"environment": {
"MODEL_TYPE": "resnet50",
"BATCH_SIZE": "32",
"LEARNING_RATE": "0.001"
},
"working_dir": "/app",
"resources": {
"cpu_limit": "2",
"memory_limit": "4Gi",
"gpu_count": 1,
"gpu_type": "nvidia.com/gpu",
"storage_limit": "10Gi"
},
"mounts": [
{
"type": "dataset",
"source": "dataset_abc123",
"target": "/data",
"permissions": "ro",
"filter_paths": ["train/", "validation/"]
}
],
"timeout": "3600s",
"kill_timeout": "30s",
"artifact_upload_urls": {
"base_upload_url": "https://minio.example.com/bitflow-data/runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
"expires_at": "2024-01-16T11:30:00Z",
"max_file_size": 1073741824
}
},
"callback": {
"url": "https://api.bitroc.ai/v1/workflows/{workflow_id}/callback",
"token": "secure-callback-token",
"events": ["started", "progress", "completed", "failed"]
}
}
Field Descriptions:
- task_type: execution | monitoring | cleanup
- priority: 1 (high) to 5 (low)
- expires_at: Task expires and should be rejected if current time > expires_at
- execution_config.mounts: Volume mount specifications for datasets
- execution_config.artifact_upload_urls: Presigned URLs for secure artifact upload
  - base_upload_url: Presigned URL for uploading the packaged artifact file
  - expires_at: ISO 8601 timestamp when the upload URL expires
  - max_file_size: Maximum allowed file size in bytes (typically 1GB)
- callback: Structured callback configuration for status updates
  - url: HTTP endpoint for posting task status updates
  - token: Authentication token to include in callback requests
  - events: List of events that trigger callbacks
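The expiration and capability rules above translate into a small pre-execution check. A Python sketch; the helper name and the reason-string return convention are assumptions, not part of the protocol.

```python
from datetime import datetime, timezone

def should_reject(task, node_capabilities, now=None):
    # Returns a rejection reason, or None if the task may run.
    now = now or datetime.now(timezone.utc)
    # Reject if current time > expires_at, as the spec requires.
    expires = datetime.strptime(task["expires_at"], "%Y-%m-%dT%H:%M:%SZ")
    if now > expires.replace(tzinfo=timezone.utc):
        return "expired"
    # Example capability check: GPU tasks need a node advertising "gpu".
    resources = task.get("execution_config", {}).get("resources", {})
    if resources.get("gpu_count", 0) > 0 and "gpu" not in node_capabilities:
        return "missing_capability: gpu"
    return None
```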
4. Task Result Message
Queue: workflow.results
Routing Key: workflow.task.result
Content-Type: application/json
Message Structure:
{
"task_id": "task_1234567890",
"run_id": "run_0987654321",
"status": "completed",
"exit_code": 0,
"error_message": null,
"started_at": "2024-01-15T10:30:15Z",
"completed_at": "2024-01-15T11:25:30Z",
"container_id": "container_xyz789",
"node_info": {
"node_id": "edge-node-001",
"node_name": "edge-gpu-server-01",
"datacenter_id": "edge-datacenter-1",
"architecture": "amd64",
"os": "linux",
"capabilities": ["docker", "gpu", "high-memory"],
"resources": {
"cpu_cores": 16,
"memory": "64Gi",
"gpus": 4,
"storage": "2Ti",
"network_speed": "10Gbps"
},
"labels": {
"node-type": "gpu",
"zone": "us-west-1a"
}
},
"resource_usage": {
"cpu_percent": 75.5,
"memory_used": 3221225472,
"memory_percent": 12.5,
"gpu_usage": [
{
"gpu_id": 0,
"name": "NVIDIA A100",
"utilization": 95.2,
"memory_used": 68719476736,
"memory_total": 85899345920,
"temperature": 78.5,
"power_usage": 250.0
}
],
"network_io": {
"bytes_received": 1048576000,
"bytes_transmitted": 524288000,
"packets_received": 1000000,
"packets_transmitted": 500000
},
"disk_io": {
"bytes_read": 2147483648,
"bytes_written": 1073741824,
"reads_count": 10000,
"writes_count": 5000
},
"duration": "3315s",
"peak_memory": 4294967296,
"peak_cpu": 89.7
},
"log_summary": "Training completed successfully. Final accuracy: 94.2%",
"output_summary": "Model saved to /output/model.pth. Metrics logged to /output/metrics.json",
"artifacts": [
{
"name": "output.tar.gz",
"path": "runs/run_0987654321/tasks/task_1234567890/artifacts/output.tar.gz",
"size": 134217728,
"content_type": "application/gzip",
"checksum": "sha256:a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3",
"uploaded_at": "2024-01-15T11:25:30Z",
"upload_method": "presigned_url"
}
],
"metrics": {
"final_accuracy": 94.2,
"final_loss": 0.045,
"training_time": 3315,
"epochs_completed": 100,
"samples_processed": 50000
}
}
Field Descriptions:
- status: completed | failed | cancelled
- exit_code: Container exit code (null if not available)
- resource_usage.memory_used: Memory in bytes
- resource_usage.duration: Execution time in seconds
- artifacts: Output files uploaded using presigned URLs
  - name: Artifact filename (typically "output.tar.gz" for packaged outputs)
  - path: Object storage path where the artifact is stored
  - size: File size in bytes
  - content_type: MIME type of the uploaded file
  - checksum: SHA256 checksum for integrity verification
  - uploaded_at: Timestamp when the artifact was uploaded to object storage
  - upload_method: Always "presigned_url" for BitEdge uploads
- metrics: Custom metrics specific to the workflow
5. Progress Updates
Queue: workflow.progress
Routing Key: workflow.task.progress
Content-Type: application/json
Message Structure:
{
"task_id": "task_1234567890",
"run_id": "run_0987654321",
"timestamp": "2024-01-15T10:45:30Z",
"stage": "running",
"progress": 0.65,
"message": "Training epoch 65/100 completed",
"details": {
"current_epoch": 65,
"total_epochs": 100,
"current_loss": 0.087,
"current_accuracy": 0.891,
"time_remaining": "1800s"
}
}
Field Descriptions:
- stage: preparing | running | finalizing | started | completed | failed
- progress: Float between 0.0 and 1.0 representing completion percentage
- details: Optional task-specific progress information
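The range and stage constraints above can be enforced at message-build time. A minimal Python sketch; the helper name and the choice to clamp (rather than reject) out-of-range values are assumptions.

```python
from datetime import datetime, timezone

VALID_STAGES = {"preparing", "running", "finalizing", "started", "completed", "failed"}

def build_progress(task_id, run_id, stage, progress, message=None, details=None):
    # Build a body for routing key workflow.task.progress; progress is
    # clamped into the 0.0-1.0 range the spec requires.
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage: {stage}")
    body = {
        "task_id": task_id,
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "stage": stage,
        "progress": max(0.0, min(1.0, float(progress))),
    }
    if message is not None:
        body["message"] = message
    if details is not None:
        body["details"] = details
    return body
```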
6. Log Messages
Queue: workflow.logs
Routing Key: workflow.task.log
Content-Type: application/json
Message Structure:
{
"task_id": "task_1234567890",
"run_id": "run_0987654321",
"timestamp": "2024-01-15T10:45:35Z",
"level": "info",
"source": "stdout",
"message": "Epoch 65/100 - loss: 0.087 - accuracy: 0.891 - val_loss: 0.092 - val_accuracy: 0.883",
"stream": "training_logs",
"metadata": {
"container_id": "abc123",
"framework": "tensorflow",
"log_type": "training"
}
}
Field Descriptions:
- level: debug | info | warn | error
- source: stdout | stderr | system | container
- stream: Optional stream identifier for log categorization
- metadata: Optional structured metadata for enhanced log processing
  - framework: ML framework identifier (tensorflow, pytorch, scikit-learn) - used for direct parser routing
  - log_type: Log classification (console, training, validation, metrics)
  - container_id: Container identifier for log correlation
Framework Selection Priority:
BitFlow uses a two-tier priority system for optimal log parsing performance:
1. Priority 1 - Metadata-based Direct Routing (Recommended):
   - When metadata.framework is specified, routes directly to the framework parser
   - Supports: tensorflow, tf, pytorch, torch, plain, console
   - Case-insensitive matching
   - 23% faster with 23% less memory allocation
2. Priority 2 - Content-based Detection:
   - Falls back to pattern matching when no framework metadata is provided
   - Analyzes log message content to determine the appropriate parser
   - Maintains backward compatibility with existing edge nodes
Recommended Usage:
{
"metadata": {
"framework": "tensorflow", // Direct routing - optimal performance
"log_type": "training",
"phase": "train"
}
}
6.1. Enhanced ML Training Logs
For machine learning training workflows, BitEdge nodes should categorize and enrich log messages to enable automatic metric extraction and training progress tracking.
ML Training Log Categories:
1. Training Logs (stream: "training_logs"):
   - Training progress messages with epoch/step information
   - Loss, accuracy, and other training metrics
   - Learning rate schedules and optimizer updates
2. Validation Logs (stream: "validation_logs"):
   - Validation metrics and performance scores
   - Model evaluation results
   - Early stopping indicators
3. System Logs (stream: "system_logs"):
   - Resource usage notifications
   - GPU memory allocation/deallocation
   - Checkpoint save/load operations
Enhanced Training Log Example:
{
"task_id": "task_1234567890",
"run_id": "run_0987654321",
"timestamp": "2024-01-15T10:45:35Z",
"level": "info",
"source": "stdout",
"message": "Epoch 65/100 - loss: 0.087 - accuracy: 0.891 - val_loss: 0.092 - val_accuracy: 0.883 - lr: 0.0001",
"stream": "training_logs",
"metadata": {
"framework": "tensorflow",
"log_type": "training",
"phase": "train",
"container_id": "abc123",
"step": 6500,
"epoch": 65
}
}
Metric Extraction Guidelines:
BitEdge nodes should format training logs to enable automatic metric extraction:
- TensorFlow Format: Epoch 1/10 - loss: 0.5 - accuracy: 0.8 - val_loss: 0.6 - val_accuracy: 0.75
- PyTorch Format: [Epoch 1/10] Train Loss: 0.5, Train Acc: 80.0%, Val Loss: 0.6, Val Acc: 75.0%
- Custom Format: Include metric names with numeric values for easy parsing
Training Phase Indicators:
Use metadata.phase to indicate training phases:
- train: Training phase metrics
- validation: Validation/test phase metrics
- test: Final evaluation metrics
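To show what "metric names with numeric values for easy parsing" buys the platform, here is a sketch of extracting metrics from a TensorFlow-style line. The regular expressions are assumptions based on the formats listed above, not the platform's actual parser.

```python
import re

# "name: number" pairs such as "loss: 0.087" or "val_accuracy: 0.883".
METRIC_RE = re.compile(r"([A-Za-z_]\w*):\s*([0-9]*\.?[0-9]+)")
# Epoch progress such as "Epoch 65/100".
EPOCH_RE = re.compile(r"Epoch\s+(\d+)/(\d+)")

def extract_metrics(line):
    metrics = {name: float(value) for name, value in METRIC_RE.findall(line)}
    epoch = EPOCH_RE.search(line)
    if epoch:
        metrics["epoch"] = int(epoch.group(1))
        metrics["total_epochs"] = int(epoch.group(2))
    return metrics
```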
7. Node Health Status
Queue: node.health
Routing Key: node.health.{datacenter_id}
Content-Type: application/json
Message Structure:
{
"node_id": "edge-node-001",
"datacenter_id": "edge-datacenter-1",
"timestamp": "2024-01-15T10:30:00Z",
"status": "healthy",
"checks": {
"rabbitmq": { "status": "ok", "latency_ms": 5 },
"docker": { "status": "ok", "version": "24.0.7" },
"resources": {
"cpu_available": 85.3,
"memory_available_gb": 28.5,
"disk_available_gb": 450.2
}
},
"capabilities": ["docker", "gpu", "high-memory"],
"uptime_seconds": 86400
}
8. Node Performance Metrics
Queue: node.metrics
Routing Key: node.metrics.{datacenter_id}
Content-Type: application/json
Message Structure:
{
"node_id": "edge-node-001",
"datacenter_id": "edge-datacenter-1",
"timestamp": "2024-01-15T10:30:00Z",
"tasks_processed": 1247,
"tasks_successful": 1198,
"tasks_failed": 49,
"average_execution_time_seconds": 3240.5,
"resource_usage": {
"cpu_percent": 45.2,
"memory_used_gb": 12.8,
"memory_total_gb": 64.0,
"disk_used_gb": 125.3,
"network_bytes_sent": 5368709120,
"network_bytes_received": 10737418240
}
}
Field Descriptions:
- status: healthy | degraded | unhealthy | offline
- Health messages should be sent every 30-60 seconds
- Metrics messages should be sent every 5-10 minutes
- Both message types help the API server monitor datacenter health
Status Update Mechanisms
BitEdge nodes should use message queues as the primary mechanism for reporting task status updates. HTTP callbacks are optional and supplementary.
Primary Method: Message Queue Updates
All status updates should be published to the appropriate RabbitMQ queues:
- Progress Updates → workflow.progress queue
- Final Results → workflow.results queue
- Log Streams → workflow.logs queue
- Node Health → node.health queue
- Performance Metrics → node.metrics queue
This ensures reliable, asynchronous status reporting that works well with edge nodes that may have intermittent connectivity.
Optional Method: HTTP Callback Protocol
When a task includes callback configuration, the edge node may optionally send HTTP POST requests to the specified callback URL for critical events only (started, completed, failed). High-frequency progress updates should only use message queues to avoid overwhelming the API.
Callback Request Format:
POST {callback.url}
Authorization: Bearer {callback.token}
Content-Type: application/json
{
"task_id": "task_1234567890",
"run_id": "run_0987654321",
"datacenter_id": "edge-datacenter-1",
"event": "progress",
"status": "running",
"progress": 0.65,
"message": "Training epoch 65/100 completed",
"error_message": null,
"exit_code": null,
"metrics": {
"current_epoch": 65,
"total_epochs": 100,
"current_loss": 0.087,
"current_accuracy": 0.891
},
"logs": [
"Epoch 65/100 - loss: 0.087 - accuracy: 0.891"
]
}
Callback Events:
1. started: Task execution has begun
   - Include: task_id, run_id, datacenter_id, event: "started"
   - Optional: message with startup information
2. progress: Periodic progress updates during execution
   - Include: task_id, run_id, datacenter_id, event: "progress", progress (0.0-1.0)
   - Optional: message, metrics with current metrics
3. completed: Task finished successfully
   - Include: task_id, run_id, datacenter_id, event: "completed", exit_code: 0
   - Optional: message, metrics with final results
4. failed: Task execution failed
   - Include: task_id, run_id, datacenter_id, event: "failed", error_message
   - Optional: exit_code, logs with error details
Callback Response Handling:
- 2xx Success: Callback acknowledged, continue processing
- 401/403: Invalid token, log error but continue task execution
- 404: Endpoint not found, disable further callbacks for this task
- 5xx: Server error, retry with exponential backoff (max 3 retries)
- Timeout: After 10 seconds, log warning and continue
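The response-handling table above reduces to a small decision function. A Python sketch; the return labels are illustrative names for the node's internal actions, not part of the protocol, and a None status code stands for the 10-second timeout case.

```python
def callback_action(status_code, attempt, max_retries=3):
    # attempt counts retries already made for this callback delivery.
    if status_code is None:
        return "log_and_continue"        # timeout after 10 seconds
    if 200 <= status_code < 300:
        return "acknowledged"            # callback accepted
    if status_code in (401, 403):
        return "log_and_continue"        # bad token; task keeps running
    if status_code == 404:
        return "disable_callbacks"       # endpoint gone for this task
    if status_code >= 500:
        # Exponential backoff between retries, e.g. 2 ** attempt seconds.
        return "retry" if attempt < max_retries else "give_up"
    return "log_and_continue"
```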
Security Requirements:
- Always include the authentication token in the Authorization header
- Use HTTPS for production callback URLs
- Never include sensitive data in callback payloads
- Validate SSL certificates for callback endpoints
Edge Node Protocol Implementation
1. Connection and Queue Management
Edge Node Startup Process:
1. Initialize RabbitMQ Connection
   - Connect using configuration parameters
   - Set connection name as bitedge-{node_id}
   - Configure prefetch count to 1 for fair task distribution
   - Enable publisher confirms for reliable message delivery
2. Declare Required Queues
   - Task queue: workflow.tasks.{datacenter_id}
   - Results queue: workflow.results
   - Progress queue: workflow.progress
   - Logs queue: workflow.logs
   - Health queue: node.health
   - Metrics queue: node.metrics
3. Start Task Consumer
   - Consume from workflow.tasks.{datacenter_id}
   - Handle message acknowledgments properly
   - Implement graceful shutdown on connection loss
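The queue names in the startup checklist can be centralized in one helper so every publisher and consumer agrees on them. A stdlib-only Python sketch; a node would pass these names to its client library's queue-declare call (e.g. `channel.queue_declare` in pika).

```python
def required_queues(datacenter_id):
    # Queue names from the startup checklist above; only the task queue
    # is scoped per datacenter, the rest are shared platform queues.
    return {
        "tasks": f"workflow.tasks.{datacenter_id}",
        "results": "workflow.results",
        "progress": "workflow.progress",
        "logs": "workflow.logs",
        "health": "node.health",
        "metrics": "node.metrics",
    }
```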
2. Task Processing Protocol
Message Processing Flow:
1. Receive Task Message
   - Parse JSON message structure
   - Validate required fields (task_id, run_id, execution_config)
   - Check task expiration timestamp
2. Task Validation
   - Verify node capabilities match task requirements
   - Check resource availability (CPU, memory, GPU)
   - Validate container image accessibility
3. Execution Lifecycle
   - Send "started" progress update immediately
   - If callback configured, send HTTP POST with "started" event
   - Pull container image if not cached
   - Create and start container with specified configuration
   - Stream real-time logs during execution
   - Monitor resource usage and send periodic progress updates
   - If callback configured, send periodic "progress" events
   - Handle timeouts and cancellation requests
4. Artifact Upload and Result Publication
   - Collect execution metrics and logs
   - Package output files into tar.gz archive
   - Upload artifacts using presigned URL from task configuration
   - Publish final result with artifact storage path and metadata
   - If callback configured, send "completed" or "failed" event
   - Clean up containers and temporary resources
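Steps 3-4 above can be sketched as a lifecycle function with the container runtime and the queue publisher injected as callables, so the message ordering is testable without Docker or RabbitMQ. All names are illustrative; artifact upload and log streaming are elided.

```python
def run_task_lifecycle(task, run_container, publish):
    # run_container(execution_config) -> exit code; publish(routing_key, body).
    task_id, run_id = task["task_id"], task["run_id"]
    # Send the "started" progress update immediately.
    publish("workflow.task.progress",
            {"task_id": task_id, "run_id": run_id,
             "stage": "started", "progress": 0.0})
    try:
        exit_code = run_container(task["execution_config"])
        status = "completed" if exit_code == 0 else "failed"
        error = None if exit_code == 0 else f"container exited with code {exit_code}"
    except Exception as exc:  # runtime failure: image pull error, OOM kill, ...
        status, exit_code, error = "failed", None, str(exc)
    # Publish the final result (after artifacts are uploaded, in a real node).
    publish("workflow.task.result",
            {"task_id": task_id, "run_id": run_id, "status": status,
             "exit_code": exit_code, "error_message": error})
    return status
```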
3. Container Execution Requirements
Container Runtime Integration:
- Image Management: Pull images from configured registries
- Resource Limits: Apply CPU, memory, and GPU constraints
- Volume Mounts: Mount datasets and output directories
- Environment Variables: Inject task-specific configuration
- Network Isolation: Configure container networking
- Security Context: Apply security policies and constraints
Resource Monitoring:
- Track CPU usage, memory consumption, and GPU utilization
- Monitor network I/O and disk operations
- Collect container logs (stdout/stderr)
- Report resource metrics in result messages
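Applying the resource limits from execution_config requires converting its Kubernetes-style quantities ("4Gi", "500Mi", "2Ti") to the byte counts container runtimes expect. A small sketch; the helper name is an assumption.

```python
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(value):
    # "4Gi" -> 4294967296; a bare integer string is taken as a byte count.
    for suffix, factor in _UNITS.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)
```

The result can be handed directly to the runtime's memory or storage limit (e.g. a Docker container's memory setting).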
Configuration
Environment Variables
Create a .env file or set environment variables:
# Required
RABBITMQ_URL=amqp://guest:guest@localhost:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001
# Optional
RABBITMQ_EXCHANGE=bitflow
RABBITMQ_QUEUE_PREFIX=workflow
RABBITMQ_CONNECTION_NAME=bitedge-node
RABBITMQ_PREFETCH_COUNT=1
RABBITMQ_RECONNECT_DELAY=5s
RABBITMQ_MAX_RECONNECTS=10
# Docker configuration
DOCKER_HOST=unix:///var/run/docker.sock
Configuration File
Create config.yaml:
datacenter_id: "edge-datacenter-1"
node_id: "edge-node-001"
rabbitmq:
url: "amqp://guest:guest@localhost:5672/"
exchange_name: "bitflow"
queue_prefix: "workflow"
connection_name: "bitedge-node"
prefetch_count: 1
reconnect_delay: "5s"
max_reconnects: 10
publisher_confirm: true
docker:
host: "unix:///var/run/docker.sock"
resources:
max_cpu: "8"
max_memory: "16Gi"
max_storage: "500Gi"
gpus: 2
capabilities:
- docker
- gpu
- high-memory
Deployment
Docker Deployment
Container Requirements:
- Base image with container runtime support (Docker, containerd)
- RabbitMQ client library for chosen language
- Process manager for graceful shutdowns
- Health check endpoints for monitoring
Required Docker Configuration:
# Environment variables
RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672/
DATACENTER_ID=edge-datacenter-1
NODE_ID=edge-node-001
# Volume mounts
/var/run/docker.sock:/var/run/docker.sock # Docker socket access
/data:/data # Dataset storage
/tmp:/tmp # Temporary files
# Network access
# - RabbitMQ broker connectivity
# - Container registry access
# - Dataset download endpoints
Deployment Command Example:
docker run -d \
--name bitedge-node \
--restart unless-stopped \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /data:/data \
-e RABBITMQ_URL=amqp://user:pass@rabbitmq:5672/ \
-e DATACENTER_ID=edge-datacenter-1 \
-e NODE_ID=edge-node-001 \
your-registry/bitedge-node:latest
Kubernetes Deployment
Resource Requirements:
- Service account with container management permissions
- ConfigMap or Secret for configuration
- PersistentVolume for dataset caching
- NodeAffinity for GPU nodes (if required)
Required Kubernetes Resources:
- Deployment: BitEdge node application
- ConfigMap: Application configuration
- Secret: RabbitMQ credentials and certificates
- PersistentVolumeClaim: Dataset and cache storage
- ServiceAccount: RBAC permissions for container operations
Node Selection and Affinity:
nodeSelector:
node-type: edge-compute
gpu-support: "true"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: edge-zone
operator: In
values: ["us-west-1a", "us-west-1b"]Monitoring and Observability
Health Checks
Required Health Check Endpoints:
- RabbitMQ Connectivity: Verify message broker connection
- Container Runtime: Check Docker/containerd daemon status
- Resource Availability: Monitor CPU, memory, and disk space
- Node Capabilities: Validate GPU drivers and other requirements
Health Check Response Format:
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"checks": {
"rabbitmq": { "status": "ok", "latency_ms": 5 },
"docker": { "status": "ok", "version": "24.0.7" },
"resources": {
"cpu_available": 85.3,
"memory_available_gb": 28.5,
"disk_available_gb": 450.2
}
},
"node_info": {
"node_id": "edge-node-001",
"datacenter_id": "edge-datacenter-1",
"capabilities": ["docker", "gpu", "high-memory"]
}
}
Metrics Collection
Node Performance Metrics:
{
"tasks_processed": 1247,
"tasks_successful": 1198,
"tasks_failed": 49,
"average_execution_time_seconds": 3240.5,
"resource_usage": {
"cpu_percent": 45.2,
"memory_used_gb": 12.8,
"memory_total_gb": 64.0,
"disk_used_gb": 125.3,
"network_bytes_sent": 5368709120,
"network_bytes_received": 10737418240
},
"last_activity": "2024-01-15T10:30:00Z",
"uptime_seconds": 86400
}
Security Considerations
Container Security
Required Security Measures:
- Read-only Root Filesystem: Prevent container modification
- Capability Dropping: Remove unnecessary Linux capabilities
- Resource Limits: Enforce CPU, memory, and process limits
- Network Isolation: Use bridge networking with restricted access
- User Namespace: Run containers as non-root users
- AppArmor/SELinux: Apply mandatory access controls
Security Configuration Requirements:
security_config:
read_only_root_fs: true
no_new_privileges: true
drop_capabilities: ["ALL"]
add_capabilities: ["CHOWN", "SETUID", "SETGID"]
run_as_user: 1000
run_as_group: 1000
seccomp_profile: "runtime/default"
apparmor_profile: "docker-default"
resource_limits:
memory_bytes: 4294967296 # 4GB
cpu_quota: 200000 # 2 CPU cores
pids_limit: 1024
ulimits:
nofile: 65536
nproc: 65536
network_config:
network_mode: "bridge"
port_bindings: {} # No exposed ports by default
dns_servers: ["8.8.8.8", "8.8.4.4"]
Credential Management
- Use Kubernetes secrets or Docker secrets for sensitive data
- Implement credential rotation
- Never log or expose credentials in error messages
Testing
Unit Tests
Test Coverage Requirements:
- Message Parsing: Validate JSON deserialization of all message types
- Configuration Loading: Test environment variable and config file handling
- Task Validation: Verify expiration checks and resource validation
- Container Creation: Mock container runtime operations
- Resource Monitoring: Test metrics collection and reporting
- Error Handling: Validate failure scenarios and recovery
Test Data Examples:
{
"test_task": {
"task_id": "test-task-123",
"run_id": "test-run-456",
"execution_config": {
"image": "alpine:latest",
"command": ["echo", "hello world"],
"timeout": "300s"
}
},
"expected_result": {
"status": "completed",
"exit_code": 0,
"error_message": null
}
}
Integration Tests
Test Environment Setup:
- RabbitMQ Container: Run message broker for testing
- Docker-in-Docker: Enable container execution in test environment
- Mock Datasets: Provide sample data for mount testing
- Network Isolation: Test container networking restrictions
End-to-End Test Scenarios:
- Successful Task Execution: Complete workflow with result reporting
- Task Timeout Handling: Verify timeout behavior and cleanup
- Resource Limit Enforcement: Test CPU and memory constraints
- Error Recovery: Validate retry logic and error reporting
- Connection Failure: Test RabbitMQ reconnection behavior
- Concurrent Tasks: Verify multiple task handling capabilities
Troubleshooting
Common Issues
1. Connection Issues
   # Check RabbitMQ connectivity
   curl -u guest:guest http://localhost:15672/api/overview
   # Verify queue exists
   curl -u guest:guest http://localhost:15672/api/queues
2. Docker Issues
   # Check Docker daemon
   docker info
   # Test container creation
   docker run --rm alpine:latest echo "test"
3. Resource Constraints
   # Monitor resource usage
   docker stats
   # Check system resources
   free -h
   df -h
Logging Configuration
logging:
level: info
format: json
outputs:
- type: stdout
- type: file
path: /var/log/bitedge/node.log
max_size: 100MB
max_backups: 5
max_age: 30
Advanced Features
Custom Task Types
Task Handler Interface Requirements:
- Task Type Detection: Identify if a task can be handled by the node
- Resource Validation: Check if required resources are available
- Execution Logic: Implement custom task execution behavior
- Progress Reporting: Send progress updates during execution
- Result Formatting: Return standardized task results
Custom Task Handler Registration:
{
"task_handlers": [
{
"name": "pytorch-trainer",
"supported_task_types": ["pytorch_training", "model_inference"],
"required_capabilities": ["gpu", "pytorch"],
"resource_requirements": {
"min_gpu_memory_gb": 8,
"min_cpu_cores": 4
}
},
{
"name": "data-processor",
"supported_task_types": ["data_preprocessing", "feature_extraction"],
"required_capabilities": ["high-memory"],
"resource_requirements": {
"min_memory_gb": 32
}
}
]
}
Resource Management
Resource Allocation Protocol:
- Resource Discovery: Detect available CPU, memory, GPU, storage
- Allocation Tracking: Monitor resource usage per task
- Queuing Logic: Schedule tasks based on resource availability
- Resource Limits: Enforce per-task resource constraints
- Cleanup Handling: Release resources after task completion
Resource Status Reporting:
{
"resource_status": {
"total_resources": {
"cpu_cores": 16,
"memory_gb": 64,
"gpus": 4,
"storage_gb": 1000
},
"available_resources": {
"cpu_cores": 12,
"memory_gb": 48,
"gpus": 2,
"storage_gb": 750
},
"allocated_tasks": 3,
"queued_tasks": 1
}
}
Plugin System
Plugin Architecture Requirements:
- Dynamic Loading: Support runtime plugin registration
- API Compatibility: Maintain consistent plugin interfaces
- Configuration Management: Handle plugin-specific settings
- Dependency Resolution: Manage plugin dependencies
- Error Isolation: Prevent plugin failures from affecting core functionality
Plugin Manifest Format:
{
"plugin_info": {
"name": "custom-ml-plugin",
"version": "1.2.3",
"description": "Custom machine learning task handler",
"author": "Your Organization",
"license": "Apache-2.0"
},
"capabilities": {
"task_types": ["custom_ml_training", "custom_inference"],
"required_resources": ["gpu", "tensorflow"],
"supported_frameworks": ["tensorflow", "pytorch"]
},
"configuration_schema": {
"model_registry_url": { "type": "string", "required": true },
"max_concurrent_tasks": { "type": "integer", "default": 2 }
}
}
Support and Contributing
For issues, questions, or contributions:
- Check the troubleshooting section
- Review existing GitHub issues
- Create a new issue with detailed information
- Follow the contributing guidelines for pull requests
License
This integration guide is part of the BitFlow platform documentation.