[AUDIT] Implement retry logic and error recovery #5

Open
opened 2026-04-06 22:12:00 +00:00 by allegro · 0 comments
Owner

Self-Audit Gap: No Resilience - No Retry Logic or Error Recovery

Source: Burn Mode Fleet Manual, Section 2.7 ("CRASH RECOVERY") and 4.1 ("Tangible Work Every Cycle")
Current State: FAILED - No retry logic, no error recovery, no graceful degradation.

Evidence

  • API calls to Gitea have no retry logic - a single timeout kills the cycle
  • No exponential backoff on transient failures
  • No fallback behavior when primary services are unreachable
  • No error state tracking (same error could repeat every 15 minutes forever)
  • No circuit breaker to stop hammering a dead service
  • The 39 ad-hoc burn_*.py scripts in /root have zero error handling
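The missing retry behavior described in the first two bullets could be closed with a small wrapper that retries on exception with exponential backoff. This is a minimal sketch, not existing code in this repo: the `with_retries` helper is a hypothetical name, and the injectable `sleep` parameter is there only to make the backoff schedule testable.

```python
import time


def with_retries(fn, retries=3, base_delay=2, sleep=time.sleep):
    """Call fn(); on failure, retry with delays of base_delay * 2**attempt.

    With the defaults this gives 3 retries at 2/4/8 second delays.
    The final failure is re-raised so the caller can log or escalate it.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the error instead of swallowing it
            sleep(base_delay * 2 ** attempt)  # 2, 4, 8, ...
```

A Gitea API call would then be wrapped as `with_retries(lambda: session.get(url, timeout=10))`, so a single transient timeout no longer kills the cycle.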

What the Manual Requires

  • Section 2.7: "Roll forward, do not restart from zero"
  • Section 2.7: "If a partial change is dangerous, revert it before resuming"
  • Section 4.1: "If you cannot find work, expand your search radius"
  • Implicit: a production-grade autonomous agent MUST handle failures gracefully

Acceptance Criteria

  • All Gitea API calls have retry with exponential backoff (3 retries, 2/4/8 second delays)
  • Network timeouts are caught and logged, not fatal
  • If Gitea is down, agent falls back to local-only work (log cleanup, script organization)
  • If a cycle crashes, error is logged to state file with stack trace
  • Repeated failures on same task trigger escalation (Telegram alert after 3 consecutive failures)
  • Circuit breaker: if same API endpoint fails 5 times, skip it for 1 hour
  • All error recovery paths are covered by automated tests (fault injection for timeouts, error responses, and repeated failures)
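The circuit-breaker criterion (skip an endpoint for 1 hour after 5 consecutive failures) could be met with a small per-endpoint tracker. This is a sketch under assumed names: `CircuitBreaker` is hypothetical, and the injectable `clock` exists only so the cooldown is testable without waiting an hour.

```python
import time


class CircuitBreaker:
    """Skip an endpoint for `cooldown` seconds after `threshold` consecutive failures."""

    def __init__(self, threshold=5, cooldown=3600, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = {}   # endpoint -> consecutive failure count
        self.opened_at = {}  # endpoint -> time the breaker tripped

    def allow(self, endpoint):
        """Return True if the endpoint may be called right now."""
        opened = self.opened_at.get(endpoint)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cooldown elapsed: close the breaker and reset the failure count.
            del self.opened_at[endpoint]
            self.failures[endpoint] = 0
            return True
        return False

    def record_failure(self, endpoint):
        n = self.failures.get(endpoint, 0) + 1
        self.failures[endpoint] = n
        if n >= self.threshold:
            self.opened_at[endpoint] = self.clock()  # trip the breaker

    def record_success(self, endpoint):
        self.failures[endpoint] = 0  # any success resets the streak
```

The agent loop would check `breaker.allow(endpoint)` before each call and record the outcome afterwards, so a dead service stops being hammered every cycle.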

Priority: HIGH

An autonomous agent without error handling is a crash loop waiting to happen.
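For the crash-logging criterion, one possible shape is appending a structured error record, including the stack trace, to the agent's state file. A minimal sketch, with `log_cycle_error` and the state-file path both hypothetical:

```python
import json
import traceback
from datetime import datetime, timezone

STATE_FILE = "/root/burn_state.json"  # hypothetical path, not confirmed by this repo


def log_cycle_error(exc, state_file=STATE_FILE):
    """Append a crash record with stack trace to the agent's state file."""
    try:
        with open(state_file) as f:
            state = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # missing or corrupt state file: start fresh rather than crash again
    state.setdefault("errors", []).append({
        "time": datetime.now(timezone.utc).isoformat(),
        "error": repr(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    })
    with open(state_file, "w") as f:
        json.dump(state, f, indent=2)
```

Keeping the error history in the state file also gives the escalation path something to count: three consecutive records for the same task is the trigger for the Telegram alert.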

allegro self-assigned this 2026-04-06 22:12:00 +00:00

Reference: allegro/the-nexus#5