System Health Checks

Regular system health checks help maintain high availability and performance and catch potential issues early, minimizing service disruptions.

This guide provides a structured approach, with specific commands, for performing routine health checks on your ServiceOps environment.

1. Application Services Health Check

  • Objective: To ensure all core ServiceOps application services are running correctly.

  • Frequency: Daily

  • Procedure:

    1. Check Core Services: Verify that the main ServiceOps services are active.

      Example: Check status of all key services with one command

      systemctl status ft-main-server ft-analytics-server elasticsearch.service

    2. Review System Logs: Check for any new errors since the last check.

      Example: Check main server logs in real-time

      tail -f /opt/flotomate/main-server/logs/system.log

      Example: Check analytics server logs in real-time

      tail -f /opt/flotomate/cm-analytics/logs/system.log
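
For a scripted daily run, the two steps above can be combined into a non-interactive check. The following is a minimal sketch, not part of ServiceOps: the function names are illustrative, and it assumes that error entries in system.log contain the literal string ERROR, which you should confirm against your own logs.

```shell
# Sketch only: helper names are illustrative, not ServiceOps tooling.

# Print the state of each systemd unit; return non-zero if any is not active.
check_services() {
  local failed=0 unit state
  for unit in "$@"; do
    state=$(systemctl is-active "$unit" 2>/dev/null)
    echo "$unit: ${state:-unknown}"
    [ "$state" = "active" ] || failed=1
  done
  return "$failed"
}

# Count lines containing ERROR in the last N lines of a log file.
# Assumption: the log marks errors with the literal word ERROR.
count_recent_errors() {
  local file="$1" lines="${2:-1000}"
  tail -n "$lines" "$file" | grep -c 'ERROR' || true
}

# Usage:
#   check_services ft-main-server ft-analytics-server elasticsearch.service
#   count_recent_errors /opt/flotomate/main-server/logs/system.log 1000
```

Unlike tail -f, this variant exits on its own, so it can be run from a scheduled job.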


2. Database Health Check

  • Objective: To ensure the PostgreSQL database is running, accessible, and performing optimally.

  • Frequency: Daily

  • Procedure:

    1. Check PostgreSQL Service Status:

      Example: Check the status of the main PostgreSQL service

      systemctl status postgresql

    2. Test Database Connection: From the application server, attempt to connect to the database.

      Syntax: psql -h <DB_HOST_IP> -p <PORT> -U <USER> -d <DB_NAME>

      Example: psql -h 172.16.13.40 -p 5432 -U postgres -d serviceops

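      Example: List the PostgreSQL clusters and their status (run on the database server; available on Debian/Ubuntu)

      pg_lsclusters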

    3. Check Database Logs: Review the PostgreSQL logs for any errors.

      Example: Watch the PostgreSQL log file for new entries. The path may vary based on OS and version.

      tail -f /var/log/postgresql/postgresql-16-main.log
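
If you want a connection test that needs no password prompt, the pg_isready utility shipped with the PostgreSQL client tools reports whether the server is accepting connections. The sketch below wraps it and translates its documented exit codes into messages; the function names and the 5-second timeout are illustrative choices, not ServiceOps defaults.

```shell
# Sketch only: function names and the timeout value are illustrative.

# Translate a pg_isready exit code into a human-readable message.
pg_status_message() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (invalid parameters)" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Probe the server and print a one-line summary; returns the pg_isready exit code.
check_db() {
  local host="$1" port="${2:-5432}" rc=0
  pg_isready -h "$host" -p "$port" -t 5 >/dev/null 2>&1 || rc=$?
  echo "PostgreSQL at $host:$port: $(pg_status_message "$rc")"
  return "$rc"
}

# Usage (host and port taken from the example above):
#   check_db 172.16.13.40 5432
```

pg_isready confirms only that the server accepts connections; it does not verify credentials or that the serviceops database exists, so the psql test above is still the more complete check.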


3. Server Resource Health Check

  • Objective: To monitor server resources to prevent performance degradation and outages.

  • Frequency: Daily

  • Procedure:

    1. Check Disk Space: Verify that all server drives have adequate free space (>20%).

      df -h

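    2. Check Disk I/O Performance: Ensure sequential write throughput is adequate (e.g., > 200 MB/s).

      Example: This writes a 1 GB test file to /tmp and reports the write speed. Use with caution when free space is limited, and delete /tmp/test1.img afterwards.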

      dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync

    3. Monitor CPU and Memory: Check the current CPU and memory utilization.

      Example: Display real-time process and resource usage (Press 'q' to exit)

      top

      Example: Get a snapshot of memory usage

      free -h
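
To apply the 20% free-space guideline without reading the df output by eye, the check can be scripted. The following sketch filters the POSIX-format output of df -P and prints any filesystem whose usage exceeds a threshold (80% used corresponds to less than 20% free). The function name is illustrative, and the parsing assumes mount points contain no spaces.

```shell
# Sketch only: the function name is illustrative.
# Reads `df -P` output on stdin and prints "<mount point> <usage>%"
# for every filesystem above the given usage limit (default 80).
# Assumption: mount points contain no spaces.
flag_full_filesystems() {
  local limit="${1:-80}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the percent sign from the Capacity column
      if (use + 0 > limit) print $6 " " use "%"
    }'
}

# Usage:
#   df -P | flag_full_filesystems 80
```

No output means every filesystem is within the limit.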


4. Network Connectivity Health Check

  • Objective: To ensure the ServiceOps server can communicate with critical internal and external services.
  • Frequency: Daily
  • Procedure:
    1. Database Connectivity: Ping the database server from the application server.

      Syntax: ping <DATABASE_SERVER_IP>

      Example:

      ping 172.16.13.40
    2. Check Port Connectivity: Use telnet or nc to confirm the database port is open.

      Syntax: nc -zv <DATABASE_SERVER_IP> <PORT>

      Example: nc -zv 172.16.13.40 5432
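
When several endpoints need checking, the nc test above can be looped over a list of host:port pairs. This is a minimal sketch with illustrative function names; it assumes your netcat build supports the -z and -w options (flags differ between netcat implementations), and the 3-second timeout is an arbitrary choice.

```shell
# Sketch only: function names and the timeout are illustrative.

# Extract the host and the port from a "host:port" string.
target_host() { echo "${1%:*}"; }
target_port() { echo "${1##*:}"; }

# Test each host:port pair with nc; return non-zero if any is unreachable.
check_targets() {
  local failed=0 target host port
  for target in "$@"; do
    host=$(target_host "$target")
    port=$(target_port "$target")
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (database endpoint taken from the example above):
#   check_targets 172.16.13.40:5432
```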


5. Backup and Recovery Health Check

  • Objective: To ensure data can be recovered in case of a disaster.

  • Frequency: Daily, weekly, and monthly, as indicated for each step below.

  • Procedure:

    1. Daily Verification: Confirm that the automated database backups completed successfully by checking the backup entries in the service.log file. As the root user, you can find the file at the following path:

      /opt/flotomate/main-server/logs/common

    2. Weekly Backup Integrity Check: Use pg_restore to list the contents of a backup file. This verifies that the backup is readable without performing a full restore.

      Syntax: pg_restore --list <PATH_TO_BACKUP_FILE>

      Example:

      pg_restore --list /home/motadata/backupDB_12-03-2025/flotoitsmdb_dump | head -10
    3. Monthly Restore Test: Perform a test restore of the database to a separate, non-production environment.
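
The daily and weekly checks above can be approximated in a script that tests whether a backup exists, is recent, and is readable by pg_restore. This is a sketch under stated assumptions: the function names and the 24-hour freshness window are illustrative, and it relies on the -maxdepth and -mmin options found in GNU and BSD find.

```shell
# Sketch only: function names and the 24-hour default are illustrative.

# Return 0 if the backup file or directory was modified within the last N hours.
backup_is_recent() {
  local path="$1" max_hours="${2:-24}"
  [ -e "$path" ] || return 1
  [ -n "$(find "$path" -maxdepth 0 -mmin "-$((max_hours * 60))" 2>/dev/null)" ]
}

# Return 0 if pg_restore can read the backup's table of contents.
backup_is_readable() {
  pg_restore --list "$1" >/dev/null 2>&1
}

# Usage (path taken from the example above):
#   backup_is_recent /home/motadata/backupDB_12-03-2025/flotoitsmdb_dump 24 \
#     && backup_is_readable /home/motadata/backupDB_12-03-2025/flotoitsmdb_dump \
#     && echo "backup OK"
```

A readable table of contents does not prove the data restores cleanly, which is why the monthly restore test remains necessary.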


6. Security Health Check

  • Objective: To ensure the system remains secure.

  • Frequency: Weekly

  • Procedure:

    1. SSL Certificates: Verify that all SSL/TLS certificates are valid and not nearing their expiration date.

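      To check the SSL certificate's validity dates, you can use the following openssl command. Replace your-domain.com with your actual domain. Redirecting standard input from /dev/null makes s_client exit after the handshake instead of waiting for input.

      openssl s_client -connect your-domain.com:443 -servername your-domain.com </dev/null 2>/dev/null | openssl x509 -noout -dates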

      Alternatively, you can check the certificate directly in your web browser:

      • Access your application using HTTPS.
      • Click the lock icon in the address bar.
      • View the certificate details to verify its validity and domain match.
    2. Firewall Rules: Ensure the necessary ports are open and that no unauthorized ports are exposed.

      Example: Check the status and rules for UFW (Uncomplicated Firewall)

      sudo ufw status
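
To turn the certificate check into a pass/fail test, openssl x509 offers the -checkend option, which exits with status 0 only if the certificate is still valid the given number of seconds from now. The sketch below wraps it; the function names and the 30-day warning window are illustrative assumptions, and port 443 is assumed as in the command above.

```shell
# Sketch only: function names and the 30-day default are illustrative.

# Convert a number of days to seconds.
days_to_seconds() { echo $(( $1 * 86400 )); }

# Return 0 if the certificate served on host:443 is still valid N days from now.
cert_valid_for_days() {
  local host="$1" days="${2:-30}"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$(days_to_seconds "$days")" >/dev/null 2>&1
}

# Usage:
#   cert_valid_for_days your-domain.com 30 || echo "certificate expires within 30 days or could not be read"
```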