Key Concepts

HySDS (Hybrid Cloud Science Data System) is an open-source science data processing system designed to handle large-scale Earth Science data processing across hybrid cloud and on-premise environments. The system can process hundreds of terabytes per day and scale to thousands of parallel compute nodes.

Core Components

1. Resource Management and Control Layer

GRQ (Geo Region Query)

  • Purpose: Geospatial data catalog and management system
  • Key Functions:
    • Faceted search for data discovery
    • Data product metadata management
    • Production rules evaluation and triggers
    • Spatial and temporal data querying
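
As a concrete illustration of spatial and temporal querying, the sketch below runs a bounding-box and time-range search against GRQ's Elasticsearch backend. The endpoint, index name, and metadata field names (location, starttime, endtime) are placeholders for whatever a given deployment actually uses.

```python
import requests

# Placeholder GRQ/Elasticsearch endpoint and product index -- adjust per deployment.
GRQ_ES_URL = "http://grq.example.com:9200"
INDEX = "grq_v1.0_l2_product"

# Bounding-box + time-window search; "location", "starttime", and "endtime"
# are assumed metadata field names used here for illustration.
query = {
    "size": 10,
    "query": {
        "bool": {
            "filter": [
                {"range": {"starttime": {"gte": "2024-01-01T00:00:00Z"}}},
                {"range": {"endtime": {"lte": "2024-01-31T23:59:59Z"}}},
                {
                    "geo_shape": {
                        "location": {
                            "shape": {
                                "type": "envelope",
                                # upper-left [lon, lat], lower-right [lon, lat]
                                "coordinates": [[-125.0, 42.0], [-114.0, 32.0]],
                            },
                            "relation": "intersects",
                        }
                    }
                },
            ]
        }
    },
}

resp = requests.post(f"{GRQ_ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"])
```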

Mozart

  • Purpose: Job and workflow management system
  • Key Functions:
    • Job queue management
    • Worker coordination
    • Production rules processing
    • Job state tracking and control
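
The sketch below illustrates how a job might be submitted to a queue through Mozart's REST interface. The endpoint path, queue name, job type, and parameter names are assumptions for illustration; the exact contract depends on the deployed HySDS release.

```python
import json
import requests

# Placeholder Mozart submission endpoint; exact path and fields vary by release.
MOZART_SUBMIT_URL = "https://mozart.example.com/mozart/api/v0.1/job/submit"

payload = {
    "type": "job-hello_world:develop",           # job type + release tag (assumed)
    "queue": "job_worker-small",                 # target job queue (assumed name)
    "priority": 5,                               # scheduling priority band
    "params": json.dumps({"message": "hello"}),  # job-specific parameters
    "tags": json.dumps(["demo"]),
}

resp = requests.post(MOZART_SUBMIT_URL, data=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # typically echoes the queued job's ID and initial status
```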

Metrics

  • Purpose: Runtime analytics and monitoring
  • Key Functions:
    • Real-time job metrics collection
    • Worker fleet monitoring
    • Performance analytics
    • Resource utilization tracking
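
Because job metrics are indexed for analytics, a deployment can aggregate them directly. The sketch below computes the average job duration per job type; the endpoint, index name, and field names are assumptions for illustration.

```python
import requests

# Placeholder Metrics/Elasticsearch endpoint, index, and field names.
METRICS_ES_URL = "http://metrics.example.com:9200"
INDEX = "job_status-current"

agg_query = {
    "size": 0,
    "aggs": {
        "by_job_type": {
            "terms": {"field": "type", "size": 20},
            "aggs": {
                "avg_duration_s": {"avg": {"field": "job.job_info.duration"}}
            },
        }
    },
}

resp = requests.post(f"{METRICS_ES_URL}/{INDEX}/_search", json=agg_query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_job_type"]["buckets"]:
    avg = bucket["avg_duration_s"]["value"] or 0.0
    print(f'{bucket["key"]}: {avg:.1f} s average over {bucket["doc_count"]} jobs')
```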

Factotum

  • Purpose: Management of "hot" helper workers
  • Key Functions:
    • Maintains always-ready ("hot") workers for low-latency jobs
    • Handles immediate processing needs
    • Assists with job management tasks

2. Processing Architecture

Verdi Workers

  • Distributed compute nodes that run the actual processing jobs
  • Auto-scale based on workload demands
  • Support multiple computing environments (cloud, on-premise, HPC)
  • Handle data staging and processing
  • Self-monitor and report status

Job Processing Flow

  1. Jobs enter the system through queues
  2. Each queue corresponds to a specific job type
  3. Auto-scaling groups manage worker pools
  4. Workers process jobs and report status
  5. System tracks and manages the entire lifecycle
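
The toy sketch below traces that lifecycle for a handful of in-memory jobs. It is purely illustrative: a real deployment uses message queues and an indexed job-status store rather than an in-process loop, and the state names shown are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from queue import Queue

class JobState(Enum):
    QUEUED = "job-queued"
    STARTED = "job-started"
    COMPLETED = "job-completed"
    FAILED = "job-failed"

@dataclass
class Job:
    job_id: str
    job_type: str
    state: JobState = JobState.QUEUED

def worker_loop(job_queue: Queue) -> None:
    """Drain the queue: start each job, run it, and record the final state."""
    while not job_queue.empty():
        job = job_queue.get()
        job.state = JobState.STARTED
        try:
            print(f"processing {job.job_id} ({job.job_type})")
            job.state = JobState.COMPLETED
        except Exception:
            job.state = JobState.FAILED
        finally:
            job_queue.task_done()
            print(f"{job.job_id} -> {job.state.value}")

# One queue per job type; here a single toy queue with three jobs.
q: Queue = Queue()
for i in range(3):
    q.put(Job(job_id=f"job-{i}", job_type="job-hello_world"))
worker_loop(q)
```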

Key Operational Concepts

1. Trigger Rules System

Monitoring Triggers

  • Automatically evaluate new data as it enters the system
  • Execute predefined actions based on data characteristics
  • Enable automated workflow chains

On-demand Triggers

  • Manual triggering based on search criteria
  • Support interactive processing requests
  • Enable custom processing workflows

Job State Triggers

  • React to changes in job status
  • Enable complex workflow dependencies
  • Support error handling and recovery
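
A trigger rule ties a query to an action. The sketch below shows one hypothetical monitoring rule as a plain record, plus a toy evaluation function; the field names are illustrative rather than a fixed HySDS schema.

```python
import json

# Hypothetical monitoring trigger rule: when a product matching the query is
# ingested, submit the named job type to the named queue.
trigger_rule = {
    "rule_name": "generate_l2_on_new_l1",
    "query": json.dumps({"query": {"term": {"dataset_type.keyword": "L1_RAW"}}}),
    "job_type": "job-l2_processor:release-20240101",
    "queue": "job_worker-large",
    "priority": 5,
    "enabled": True,
    "kwargs": json.dumps({"apply_corrections": True}),
}

def rule_fires(rule: dict, product_metadata: dict) -> bool:
    """Toy check: would this rule trigger for a newly ingested product?"""
    wanted = json.loads(rule["query"])["query"]["term"]["dataset_type.keyword"]
    return rule["enabled"] and product_metadata.get("dataset_type") == wanted

print(rule_fires(trigger_rule, {"dataset_type": "L1_RAW"}))   # True
print(rule_fires(trigger_rule, {"dataset_type": "L2_GEO"}))   # False
```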

2. Auto-scaling System

Scale-Out Process

  • Monitor queue depths
  • Track processing demands
  • Automatically increase worker count
  • Balance resources across job types

Scale-In Process

  • Monitor worker utilization
  • Track idle time
  • Gracefully terminate unused workers
  • Optimize resource usage
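
The sketch below captures the scale-out/scale-in decision as a single function: grow the fleet when the backlog exceeds current capacity, and shrink it only after the queue has sat empty for a while. Thresholds and parameter names are assumptions; in practice this logic is usually delegated to cloud autoscaling policies driven by queue-depth metrics.

```python
def target_worker_count(queue_depth: int,
                        current_workers: int,
                        idle_minutes: float,
                        jobs_per_worker: int = 2,
                        max_workers: int = 100,
                        idle_threshold_min: float = 10.0) -> int:
    """Return the desired fleet size for one queue (illustrative thresholds)."""
    desired = -(-queue_depth // jobs_per_worker)   # ceil(queue_depth / jobs_per_worker)
    if desired > current_workers:                  # scale out to cover the backlog
        return min(desired, max_workers)
    if queue_depth == 0 and idle_minutes >= idle_threshold_min:
        return 0                                   # graceful scale-in once idle
    return current_workers                         # otherwise hold steady

print(target_worker_count(queue_depth=250, current_workers=40, idle_minutes=0))  # 100 (capped)
print(target_worker_count(queue_depth=0, current_workers=40, idle_minutes=15))   # 0
```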

3. Data Management

Data Discovery

  • Faceted search capabilities
  • Spatial and temporal queries
  • Metadata-based filtering
  • Product lineage tracking

Data Processing

  • Distributed processing across workers
  • Data locality optimization
  • Automated staging and management
  • Product generation and validation
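
Automated staging typically means localizing input products into the job's work directory before the algorithm runs. A minimal sketch, assuming HTTP-accessible inputs and placeholder paths:

```python
import os
import urllib.request

def localize(url: str, work_dir: str = "work") -> str:
    """Stage an input product into the job's work directory and return its path."""
    os.makedirs(work_dir, exist_ok=True)
    local_path = os.path.join(work_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, local_path)   # simple HTTP stage-in
    return local_path

# Example call with a placeholder URL:
# input_file = localize("https://data.example.com/products/granule_20240101.h5")
```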

Deployment Models

1. Cloud Deployment

  • Support for major cloud providers (AWS, GCP, Azure)
  • Cloud-native scaling capabilities
  • Managed service integration
  • Cost optimization features

2. On-Premise Deployment

  • Local infrastructure support
  • HPC integration
  • Shared filesystem support
  • Resource management integration

3. Hybrid Deployment

  • Spans cloud and on-premise resources
  • Unified management interface
  • Cross-environment data handling
  • Flexible resource allocation

Processing Concepts

1. Job Types

  • Science Algorithm Execution
  • Data Preprocessing
  • Product Generation
  • Quality Assessment
  • Data Distribution

2. Workflow Management

  • Multi-step processing chains
  • Dependency management
  • Error handling and recovery
  • Resource optimization
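
A minimal sketch of dependency management: express the chain as a dependency graph and execute steps in a valid order. Step names are illustrative; in HySDS the chaining is typically realized with queued jobs and job-state trigger rules rather than an in-process loop.

```python
from graphlib import TopologicalSorter

# Illustrative multi-step chain: each key depends on the steps in its set.
chain = {
    "preprocess": set(),
    "l2_processing": {"preprocess"},
    "quality_assessment": {"l2_processing"},
    "distribution": {"quality_assessment"},
}

for step in TopologicalSorter(chain).static_order():
    print("run", step)   # preprocess -> l2_processing -> quality_assessment -> distribution
```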

Security Model

1. Authentication and Authorization

  • Multi-level access control
  • Role-based permissions
  • API security
  • Resource isolation

2. Data Protection

  • Encryption in transit and at rest
  • Secure data access
  • Audit logging
  • Compliance management

Integration Points

1. External Systems

  • Data archives and DAACs
  • Science processing systems
  • Monitoring and alerting
  • User interfaces

2. APIs and Interfaces

  • RESTful service interfaces
  • Monitoring endpoints
  • Control interfaces
  • Data access APIs

Cost and Resource Management

1. Resource Optimization

  • Automated scaling
  • Resource pooling
  • Cost tracking
  • Usage optimization

2. Performance Management

  • Real-time monitoring
  • Performance metrics
  • Resource utilization
  • Capacity planning

Community Model

1. Open Source Development

  • Community contributions
  • Shared improvements
  • Collaborative development
  • Version control

2. Multi-Mission Support

  • Adaptable to different missions
  • Shared infrastructure
  • Common tooling
  • Best practices

Conclusion

HySDS provides a comprehensive framework for handling large-scale science data processing needs. Its modular architecture, scalability features, and flexible deployment options make it suitable for a wide range of Earth Science processing requirements. Understanding these key concepts is essential for effectively using, operating, or developing within the HySDS ecosystem.