Service Solution

About Project: Server Failure Forecast System

The Server Failure Forecast System is an intelligent system that detects abnormal changes in system operations and alerts on them. Alert thresholds can be adjusted manually or automatically, scripts or commands can be triggered when specific conditions are met, and events and logs are correlated across applications to analyze and predict potential issues. The system supports high availability, lets administrators add new devices and services, and runs in public cloud, hybrid cloud, private cloud, cloud-native, and on-premise environments. It also analyzes transaction data, correlates information to identify root causes, supports real-time analysis, and integrates with existing infrastructure and data warehouses without disrupting them.
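To make the idea of manual and automatic alert thresholds concrete, here is a minimal sketch in Python. It is not the project's actual implementation: the function name `is_anomalous`, the parameter `k`, and the "mean plus k standard deviations" rule are all assumptions chosen for illustration, and the sketch only alerts on increases.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def is_anomalous(history: Sequence[float], value: float,
                 manual_threshold: Optional[float] = None,
                 k: float = 3.0) -> bool:
    """Return True when `value` should raise an alert (hypothetical helper)."""
    if manual_threshold is not None:
        # Manual mode: the operator fixes the alert limit.
        return value > manual_threshold
    if len(history) < 2:
        # Not enough data to estimate the spread, so do not alert.
        return False
    # Automatic mode (assumed rule): the limit follows the recent
    # mean plus k sample standard deviations.
    auto_threshold = mean(history) + k * stdev(history)
    return value > auto_threshold


# Made-up operation volumes, used only to exercise the function.
recent_volumes = [100, 102, 98, 101, 99, 103, 97, 100]
print(is_anomalous(recent_volumes, 104))                        # automatic mode
print(is_anomalous(recent_volumes, 150))                        # automatic mode
print(is_anomalous(recent_volumes, 104, manual_threshold=103))  # manual mode
```

With this sample history the automatic limit works out to 106, so a reading of 104 passes in automatic mode but is flagged once an operator sets a manual limit of 103.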

Scope of Work


  1. Detection and Alerting: Develop the capability to detect and alert when there are abnormal changes in system operation volumes.

  2. Threshold Management: Implement manual and automatic configurations for alert thresholds.

  3. Script Triggering: Create rules to trigger scripts/commands when timeouts exceed set limits.

  4. Event/Log Correlation: Enable cross-application event and log correlation for proactive incident prediction.

  5. Pattern Learning: Implement alerting based on learned patterns from system logs and resource usage.

  6. High Availability Support: Ensure the system supports high availability.

  7. Admin Functionality: Allow administrators to add new devices and services.

  8. Environment Compatibility: Ensure compatibility with public cloud, hybrid cloud, private cloud, cloud-native, and on-premise environments.

  9. Transaction Analysis: Develop capabilities for analyzing various system transactions.

  10. Data Correlation: Correlate data from different systems to identify root causes of issues.

  11. Real-time Analysis: Support real-time data analysis.

  12. Automatic Model Improvement: Implement automatic model updates to adapt to system changes.

  13. Seamless Integration: Integrate with existing infrastructure without causing disruptions.

  14. Data Warehouse Utilization: Enable use of data from data warehouses.

  15. Data Integration: Integrate with data warehouse and virtualization systems.

  16. Activity Log Management: Store and retrieve activity logs for administrators and users.

  17. Log Transmission: Transmit internal logs externally, including audit, system, and access logs.

  18. Data Archiving: Automate data archiving, purging, or cleansing according to retention policies.

  19. Backup Support: Support data backup using existing backup systems.

  20. Recovery Capability: Back up and restore configurations and data for system recovery.

  21. Statistical Forecasting: Implement statistical forecasting functionalities.
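
As an illustration of the statistical forecasting item above, the following Python sketch fits a straight-line trend to recent resource-utilization samples and estimates how many sampling intervals remain before a limit is reached. The linear model, the function names `fit_linear_trend` and `steps_until_limit`, and the sample values are assumptions made for this example; the project's own forecasting method is not described here.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values[i] ~ intercept + slope * i."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def steps_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Estimate intervals from the last sample until the trend hits `limit`.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = fit_linear_trend(values)
    if slope <= 0:
        return None
    crossing_index = (limit - intercept) / slope
    # Measure from the most recent sample; never report a negative wait.
    return max(0.0, crossing_index - (len(values) - 1))


# Made-up disk usage percentages, one sample per interval.
disk_usage_pct = [50, 52, 54, 56, 58]
print(steps_until_limit(disk_usage_pct, limit=90))
```

A straight-line fit is the simplest possible choice and ignores seasonality and sudden shifts, so it is only a starting point for experimenting with the idea.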


Project Achievements

This project delivered a system that detects operational anomalies and alerts on them. The system offers flexible configuration options and analyzes potential issues before they occur. It learns usage patterns to provide timely alerts and operates across public cloud, hybrid cloud, private cloud, cloud-native, and on-premise environments. It also correlates data from multiple systems to identify root causes of problems and supports real-time data analysis.

Furthermore, the system updates its models automatically to adapt to system changes and integrates with existing systems and data warehouses, supporting comprehensive data management.

Importantly, this project led to a research paper published at ICSEng 2022 in Tokyo, Japan. This publication highlights the project's success and its value in advancing research in this field.

For more detailed information about the research, see the paper "ML-Based System Failure Prediction Using Resource Utilization".

