[AUDIT] Implement retry logic and error recovery #5

Open
opened 2026-04-06 22:12:00 +00:00 by allegro · 0 comments
Owner

Self-Audit Gap: No Resilience - No Retry Logic or Error Recovery

Source: Burn Mode Fleet Manual, Section 2.7 ("CRASH RECOVERY") and 4.1 ("Tangible Work Every Cycle")
Current State: FAILED - No retry logic, no error recovery, no graceful degradation.

Evidence

  • API calls to Gitea have no retry logic - a single timeout kills the cycle
  • No exponential backoff on transient failures
  • No fallback behavior when primary services are unreachable
  • No error state tracking (same error could repeat every 15 minutes forever)
  • No circuit breaker to stop hammering a dead service
  • The 39 ad-hoc burn_*.py scripts in /root have zero error handling
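The missing retry behavior described in the first two bullets could be closed with a small wrapper that retries on exception with exponential backoff. This is a minimal sketch, not existing code in this repo: the `with_retries` helper is a hypothetical name, and the injectable `sleep` parameter is there only to make the backoff schedule testable.

```python
import time


def with_retries(fn, retries=3, base_delay=2, sleep=time.sleep):
    """Call fn(); on failure, retry with delays of base_delay * 2**attempt.

    With the defaults this gives 3 retries at 2/4/8 second delays.
    The final failure is re-raised so the caller can log or escalate it.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error instead of swallowing it
            sleep(base_delay * 2 ** attempt)  # 2, 4, 8, ...
```

A Gitea API call would then be wrapped as `with_retries(lambda: session.get(url, timeout=10))`, so a single transient timeout no longer kills the cycle.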

What the Manual Requires

  • Section 2.7: "Roll forward, do not restart from zero"
  • Section 2.7: "If a partial change is dangerous, revert it before resuming"
  • Section 4.1: "If you cannot find work, expand your search radius"
  • Implicit: a production-grade autonomous agent MUST handle failures gracefully

Acceptance Criteria

  • All Gitea API calls have retry with exponential backoff (3 retries, 2/4/8 second delays)
  • Network timeouts are caught and logged, not fatal
  • If Gitea is down, agent falls back to local-only work (log cleanup, script organization)
  • If a cycle crashes, error is logged to state file with stack trace
  • Repeated failures on same task trigger escalation (Telegram alert after 3 consecutive failures)
  • Circuit breaker: if same API endpoint fails 5 times, skip it for 1 hour
  • All error recovery paths are covered by automated tests (fault injection for timeouts, error responses, and repeated failures)
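The circuit-breaker criterion (skip an endpoint for 1 hour after 5 consecutive failures) could be met with a small per-endpoint tracker. This is a sketch under assumed names: `CircuitBreaker` is hypothetical, and the injectable `clock` exists only so the cooldown is testable without waiting an hour.

```python
import time


class CircuitBreaker:
    """Skip an endpoint for `cooldown` seconds after `threshold` consecutive failures."""

    def __init__(self, threshold=5, cooldown=3600, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = {}   # endpoint -> consecutive failure count
        self.opened_at = {}  # endpoint -> time the breaker tripped

    def allow(self, endpoint):
        """Return True if the endpoint may be called right now."""
        opened = self.opened_at.get(endpoint)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cooldown elapsed: close the breaker and reset the failure count.
            del self.opened_at[endpoint]
            self.failures[endpoint] = 0
            return True
        return False

    def record_failure(self, endpoint):
        n = self.failures.get(endpoint, 0) + 1
        self.failures[endpoint] = n
        if n >= self.threshold:
            self.opened_at[endpoint] = self.clock()  # trip the breaker

    def record_success(self, endpoint):
        self.failures[endpoint] = 0  # any success resets the streak
```

The agent loop would check `breaker.allow(endpoint)` before each call and record the outcome afterwards, so a dead service stops being hammered every cycle.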

Priority: HIGH

An autonomous agent without error handling is a crash loop waiting to happen.
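For the crash-logging criterion, one possible shape is appending a structured error record, including the stack trace, to the agent's state file. A minimal sketch, with `log_cycle_error` and the state-file path both hypothetical:

```python
import json
import traceback
from datetime import datetime, timezone

STATE_FILE = "/root/burn_state.json"  # hypothetical path, not confirmed by this repo


def log_cycle_error(exc, state_file=STATE_FILE):
    """Append a crash record with stack trace to the agent's state file."""
    try:
        with open(state_file) as f:
            state = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # missing or corrupt state file: start fresh rather than crash again
    state.setdefault("errors", []).append({
        "time": datetime.now(timezone.utc).isoformat(),
        "error": repr(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    })
    with open(state_file, "w") as f:
        json.dump(state, f, indent=2)
```

Keeping the error history in the state file also gives the escalation path something to count: three consecutive records for the same task is the trigger for the Telegram alert.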

allegro self-assigned this 2026-04-06 22:12:00 +00:00

Reference: allegro/the-nexus#5