For Developers
HySDS (Hybrid Cloud Science Data Processing System) is an open source science data processing system used across many large-scale Earth Science missions. This guide provides key information for developers looking to work with or contribute to HySDS.
Key Components
Core Architecture
- GRQ (Geo Region Query): Geospatial data catalog and management system
  - Provides faceted search of data products
  - Evaluates production rules and fires triggers
  - Manages data product metadata
- Mozart: Job management system
  - Faceted search for job management
  - Production rule evaluation and actions
  - Queue management
- Metrics: Runtime analytics
  - Real-time job metrics
  - Worker metrics tracking
  - Performance monitoring
- Factotum: "Hot" helper workers
  - Maintains always-on workers for low-latency processing
  - Assists with job management
- Verdi Workers: Distributed compute nodes
  - Run Product Generation Executables (PGEs) at scale
  - Handle data staging and processing
  - Auto-scale based on workload
Development Environment
Source Code Access
- Main repository: https://github.com/hysds/
- Releases: https://github.com/hysds/hysds-framework/releases
- Community wiki: https://hysds-core.atlassian.net/
- Issue tracking: https://hysds-core.atlassian.net/jira/software/c/projects/HC/issues
Container Support
- Docker support for cloud deployments
- Podman support for HPC environments
- Singularity/Apptainer support for specific HPC deployments
Key Development Concepts
Job Processing Flow
- Jobs are submitted to queues
- Each queue is backed by an auto-scaling group (ASG)
- Queue-ASG pairs represent specific job types
- Jobs are processed by Verdi workers
- Workers scale up/down based on workload
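The flow above can be sketched as a minimal in-memory model. The queue name, payload fields, and worker loop here are illustrative only, not the actual HySDS/Mozart API:

```python
from collections import defaultdict, deque

class JobQueueModel:
    """Toy model of the queue -> worker flow: jobs land in a named
    queue (one queue per job type) and a worker drains it."""

    def __init__(self):
        self.queues = defaultdict(deque)  # queue name -> pending payloads
        self.completed = []

    def submit(self, queue_name, payload):
        # Jobs are submitted to a named queue; each queue maps to a job type.
        self.queues[queue_name].append(payload)

    def backlog(self, queue_name):
        return len(self.queues[queue_name])

    def run_worker(self, queue_name):
        # A Verdi-style worker processes its queue one job at a time.
        while self.queues[queue_name]:
            job = self.queues[queue_name].popleft()
            self.completed.append(job["job_type"])

model = JobQueueModel()
model.submit("l2_pge", {"job_type": "l2_pge", "params": {"granule": "G1"}})
model.submit("l2_pge", {"job_type": "l2_pge", "params": {"granule": "G2"}})
model.run_worker("l2_pge")
```

In a real deployment the queue is a message broker and the worker is a Verdi node; the one-queue-per-job-type pairing is what lets each queue drive its own auto-scaling group.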
Trigger Rules System
- Monitoring Triggers: Automatically evaluate new data
- On-demand Triggers: Manual triggering based on search criteria
- Job State Change Triggers: React to changes in job status
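A monitoring trigger boils down to matching new product metadata against a rule's saved query and, on a hit, submitting the rule's job type. The sketch below uses exact key/value matching as a simplified stand-in for the faceted search queries real rules use; the rule fields are illustrative:

```python
def rule_matches(rule, metadata):
    """True if every key/value constraint in the rule's query matches
    the product metadata (simplified stand-in for a faceted query)."""
    return all(metadata.get(k) == v for k, v in rule["query"].items())

rule = {
    "name": "trigger-l2-on-new-l1",  # illustrative rule name
    "query": {"dataset_type": "L1B", "sensor": "SAR"},
    "job_type": "l2_pge",
}

new_product = {"dataset_type": "L1B", "sensor": "SAR", "id": "P-001"}
jobs_to_submit = [rule["job_type"]] if rule_matches(rule, new_product) else []
```

On-demand triggers run the same matching over existing catalog entries instead of newly ingested ones, and job-state-change triggers match on job status fields rather than product metadata.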
Auto-scaling Implementation
Scale Up Logic
- Based on PCM queue backlog
- Target-tracking policies grow the fleet toward the desired size
- Supports both cloud and HPC environments
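Target tracking reduces to sizing the fleet from the backlog. A minimal sketch, assuming one scaling metric (jobs per worker) and a hard cap; the parameter names and defaults are illustrative, and real deployments tune this through ASG policies:

```python
import math

def desired_fleet_size(backlog, jobs_per_worker=1, max_workers=100):
    """Size the fleet so each worker has roughly `jobs_per_worker`
    queued jobs, capped at `max_workers`."""
    if backlog <= 0:
        return 0
    return min(max_workers, math.ceil(backlog / jobs_per_worker))
```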
Scale Down Logic
- Based on worker idle time
- Workers self-terminate after timeout
- Graceful shutdown process
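The scale-down decision each worker makes can be sketched as a single predicate. The 600-second default is illustrative, not a HySDS setting; a worker with a job in flight never terminates, which is what keeps the shutdown graceful:

```python
def should_self_terminate(idle_seconds, idle_timeout=600, job_running=False):
    """Worker-side scale-down check: terminate only when idle past the
    timeout and no job is in flight."""
    return (not job_running) and idle_seconds >= idle_timeout
```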
Best Practices
Development Guidelines
- Follow existing code style and patterns
- Write comprehensive tests
- Document all new features and changes
- Use proper error handling and logging
- Consider both cloud and HPC compatibility
Contributing
- Join community discussions
- Submit pull requests through GitHub
- Follow the project's coding standards
- Update documentation for changes
- Participate in bi-weekly coordination meetings
Deployment Considerations
- Support for hybrid cloud/on-premise operations
- Consider security implications
- Plan for proper monitoring and logging
- Implement appropriate error handling
- Design for scalability
Resource Management
Data Management
- Use appropriate storage solutions for different data types
- Implement efficient data staging
- Consider data locality for processing
- Handle proper cleanup of temporary data
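Cleanup of temporary staged data is easiest to guarantee with a context manager, so scratch space is removed even when processing raises. This is a generic pattern, not a HySDS API:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def staged_workdir(prefix="pge-"):
    """Create a scratch directory for staged inputs and guarantee
    its removal, even if processing raises."""
    workdir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield workdir
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

with staged_workdir() as wd:
    (wd / "input.dat").write_text("staged granule bytes")
    staged = (wd / "input.dat").exists()
cleaned = not wd.exists()
```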
Job Management
- Implement proper queue management
- Handle job dependencies correctly
- Provide adequate monitoring
- Implement proper error handling and recovery
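Error handling and recovery for transient job failures usually means bounded retries with backoff. A minimal sketch (attempt count and backoff values are illustrative):

```python
import time

def run_with_retries(job, max_attempts=3, backoff_seconds=0.0):
    """Retry a transient job failure with linear backoff, re-raising
    once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)

calls = {"n": 0}

def flaky_job():
    # Fails twice, then succeeds -- simulates a transient error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result = run_with_retries(flaky_job)
```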
Integration Points
External Systems
- Support for various cloud providers (AWS, GCP, Azure)
- Integration with HPC systems
- Support for various data archives and DAACs
- Integration with monitoring systems
APIs and Interfaces
- REST APIs for job management
- Metrics collection endpoints
- Data catalog interfaces
- Processing status APIs
Testing and Validation
Testing Requirements
- Unit tests for components
- Integration tests for workflows
- Performance testing for scalability
- Security testing for deployments
Monitoring and Debugging
- Use logging effectively
- Monitor system metrics
- Track job performance
- Debug deployment issues
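Effective logging starts with tagging every record so a failure traces back to a specific job. A common pattern (the logger naming scheme here is an assumption, not a HySDS requirement):

```python
import logging

def make_job_logger(job_id):
    """Return a logger whose name embeds the job id, so every record
    can be traced back to one job."""
    logger = logging.getLogger(f"verdi.job.{job_id}")
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s [%(name)s] %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

log = make_job_logger("job-0001")
log.info("staging inputs")
log.warning("retrying download")
```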
Security Considerations
Authentication and Authorization
- Implement proper access controls
- Handle credentials securely
- Follow security best practices
- Support various authentication methods
Network Security
- Secure communication between components
- Handle firewalls and network restrictions
- Implement proper encryption
- Consider cloud security groups
Performance Optimization
Resource Utilization
- Optimize compute resource usage
- Manage memory effectively
- Handle storage efficiently
- Monitor network usage
Scaling Considerations
- Design for horizontal scaling
- Handle state management
- Consider data locality
- Implement proper caching
Troubleshooting Guide
Common Issues
- Job failures
- Scaling problems
- Network connectivity
- Resource exhaustion
Debug Procedures
- Check logs
- Monitor metrics
- Verify configurations
- Test connectivity
Additional Resources
- Community Slack channels
- Developer documentation
- API documentation
- Example implementations
Remember to check the official documentation for the most up-to-date information and best practices.