Hippo MVP Design Document

AI-Generated Salient Insights - Minimal Viable Prototype

Core Hypothesis

Can AI-generated insights, combined with reinforcement learning and a tolerance for messiness, surface more valuable knowledge than traditional structured memory systems?

The key ideas:

Generate insights cheaply and frequently - let AI create many insights without perfect organization
Let natural selection through reinforcement determine what survives - user feedback shapes what becomes prominent
Embrace the mess - don't try to create highly structured taxonomies or perfect categorization
Trust temporal scoring - let time, usage patterns, and reinforcement naturally organize knowledge

This approach contrasts with traditional knowledge management that emphasizes upfront structure, careful categorization, and manual curation. Instead, Hippo bets that organic emergence through usage patterns can be more effective than imposed structure.

MVP Scope

What It Does

Automatic Insight Generation: AI generates insights continuously during conversation at natural moments (consolidation, "make it so", "ah-ha!" moments, pattern recognition)
Simple Storage: Single JSON file with configurable path
Natural Decay: Insights lose relevance over time unless reinforced
Reinforcement: During consolidation moments, user can upvote/downvote insights
Context-Aware Search: Retrieval considers both content and situational context with fuzzy matching

What It Doesn't Do (Yet)

Graph connections between insights
Complex reinforcement algorithms
Cross-session learning
Memory hierarchy (generic vs project-specific)
Automatic insight detection triggers

Temporal Scoring System

Core Concept

Insights are ranked by a composite relevance score that combines four factors (recency, frequency, importance, and context), an approach informed by research in information retrieval systems. The goal is for recently accessed, frequently used, and important insights to surface first while staying relevant to the current context.

Composite Relevance Formula

relevance = 0.30 × recency + 0.20 × frequency + 0.35 × importance + 0.15 × context

Weighting Rationale:

Importance (35%): Highest weight - user feedback through reinforcement learning
Recency (30%): Second highest - recently accessed insights are more likely relevant
Frequency (20%): Regular usage indicates ongoing value
Context (15%): Situational matching for query relevance
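
The weighted sum above can be sketched in Python as follows. This is a minimal illustration rather than Hippo's actual implementation: the names `WEIGHTS` and `composite_relevance` are hypothetical, and each factor is assumed to be already normalized to the 0-1 range.

```python
# Weights copied from the formula above; the names are illustrative only.
WEIGHTS = {
    "recency": 0.30,
    "frequency": 0.20,
    "importance": 0.35,
    "context": 0.15,
}


def composite_relevance(recency: float, frequency: float,
                        importance: float, context: float) -> float:
    """Combine four factor scores (each assumed to lie in [0, 1])."""
    factors = {
        "recency": recency,
        "frequency": frequency,
        "importance": importance,
        "context": context,
    }
    return sum(WEIGHTS[name] * value for name, value in factors.items())
```

Because the weights sum to 1.0, the result stays in the 0-1 range whenever every factor does.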

Temporal Factors

Recency Score

Exponential decay based on active days since last access (see Active Day System):

recency = exp(-0.05 × days_since_last_access)

Recent access (day 0): score ≈ 1.0
One week old: score ≈ 0.7
One month old: score ≈ 0.2
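
As a quick check of the figures above, the decay can be written as a small Python function. The function and constant names are illustrative, not taken from the Hippo source.

```python
import math

RECENCY_DECAY_RATE = 0.05  # per day, from the formula above


def recency_score(days_since_last_access: float) -> float:
    """Exponential decay: 1.0 at day 0, approaching 0 as days increase."""
    return math.exp(-RECENCY_DECAY_RATE * days_since_last_access)


# Print the score at the three reference points listed above.
for days in (0, 7, 30):
    print(days, round(recency_score(days), 2))
```

With a rate of 0.05 the score halves roughly every 14 days (ln 2 / 0.05 ≈ 13.9).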

Frequency Score

Uses a 30-day sliding window (measured in active days) to prevent dilution from ancient history:

frequency = total_accesses_in_last_30_days / 30

Normalized to the 0-1 range using the maximum reasonable frequency cap (10 accesses per day; see System Constants)
Prevents "funny frequency behavior" in which long gaps in an insight's history reduce its score
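
A sketch of the windowed calculation, under two assumptions that the text does not spell out: access history is a list of (day, count) pairs as in the Data Model, and normalization divides by the cap of 10 accesses per day and clamps at 1.0. All names are hypothetical.

```python
FREQUENCY_WINDOW_DAYS = 30
MAX_REASONABLE_FREQUENCY = 10.0  # accesses per day, used for normalization


def frequency_score(daily_access_counts, current_day):
    """Score accesses inside the sliding window on a 0-1 scale.

    daily_access_counts is a list of (day, count) pairs.
    """
    window_start = current_day - FREQUENCY_WINDOW_DAYS
    # Only accesses inside the window count; older history is ignored.
    recent = sum(count for day, count in daily_access_counts
                 if window_start < day <= current_day)
    per_day = recent / FREQUENCY_WINDOW_DAYS
    # Assumption: divide by the cap and clamp, so the result is at most 1.0.
    return min(1.0, per_day / MAX_REASONABLE_FREQUENCY)
```

Accesses older than the window contribute nothing, so the score reflects only recent usage, however long the insight has existed.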

Active Day System

Time advances only when the system is actively used, making scoring "vacation-proof":

Calendar days without usage don't advance temporal calculations
Ensures insights don't decay during periods of non-use
Maintains relevance relationships based on actual usage patterns
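
One way to picture the active day counter is the sketch below. It assumes the counter advances by exactly one the first time the system is used on a new calendar date, no matter how many idle days came before; the class and method names are hypothetical.

```python
from datetime import date
from typing import Optional


class ActiveDayCounter:
    """Counts days on which the system was used, ignoring idle calendar days."""

    def __init__(self, counter: int = 0, last_date_used: Optional[date] = None):
        self.counter = counter
        self.last_date_used = last_date_used

    def record_usage(self, today: date) -> int:
        """Advance by one if this is the first use on a new calendar date."""
        if self.last_date_used is None or today > self.last_date_used:
            self.counter += 1
            self.last_date_used = today
        return self.counter
```

Under this assumption, a three-week gap between two sessions costs an insight one active day of decay, not twenty-one.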

Reinforcement Learning

Importance Modification

Upvote: new_importance = min(1.0, current_importance × 1.5)
Downvote: new_importance = current_importance × 0.5
Decay: current_importance = base_importance × 0.9^days_since_reinforcement
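
The three rules above translate directly into Python. The function and constant names below are illustrative only.

```python
UPVOTE_MULTIPLIER = 1.5
DOWNVOTE_MULTIPLIER = 0.5
IMPORTANCE_DECAY_BASE = 0.9  # applied per day since the last reinforcement


def upvote(current_importance: float) -> float:
    """Boost importance by 1.5x, capped at 1.0."""
    return min(1.0, current_importance * UPVOTE_MULTIPLIER)


def downvote(current_importance: float) -> float:
    """Halve importance."""
    return current_importance * DOWNVOTE_MULTIPLIER


def decayed_importance(base_importance: float,
                       days_since_reinforcement: float) -> float:
    """Importance after decay since the last reinforcement."""
    return base_importance * IMPORTANCE_DECAY_BASE ** days_since_reinforcement
```

Note that the two votes are not inverses: an upvote followed by a downvote leaves importance at 0.75 times its starting value when the cap is not reached.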

Learning Principle

User feedback (upvotes/downvotes) directly modifies importance, which has the highest weight in relevance calculation. This creates a feedback loop where valuable insights become more prominent over time.

Search Architecture

Two-Phase Process

Scoring Phase: Compute relevance for all insights with minimal filtering
Filtering Phase: Apply user-specified relevance ranges and pagination
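
The two phases can be sketched as a single function. This is a simplified illustration: the parameter names (`min_relevance`, `max_relevance`, `offset`, `limit`) are assumptions, and the scoring function is passed in rather than computed here.

```python
from typing import Callable, List, Tuple


def search(insights: List[dict],
           score: Callable[[dict], float],
           min_relevance: float = 0.0,
           max_relevance: float = 1.0,
           offset: int = 0,
           limit: int = 10) -> List[Tuple[float, dict]]:
    # Phase 1: score every insight, without filtering anything out yet.
    scored = [(score(insight), insight) for insight in insights]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Phase 2: keep only the requested relevance range, then paginate.
    in_range = [pair for pair in scored
                if min_relevance <= pair[0] <= max_relevance]
    return in_range[offset:offset + limit]
```

Scoring everything first means the full set of scores is available before any filtering happens.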

Distribution Metadata

Search returns relevance distribution across all insights for the given query/situation, helping clients understand what additional data exists beyond filtered results.
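
The document does not specify the shape of the distribution metadata. One plausible form, shown here purely as an assumption, is a histogram of relevance scores over equal-width buckets:

```python
from typing import Iterable, List


def relevance_distribution(scores: Iterable[float],
                           buckets: int = 10) -> List[int]:
    """Count scores per equal-width bucket over [0, 1].

    Entry i covers scores from i/buckets up to (i+1)/buckets;
    a score of exactly 1.0 is placed in the top bucket.
    """
    counts = [0] * buckets
    for score in scores:
        index = int(score * buckets)
        index = max(0, min(index, buckets - 1))  # clamp out-of-range scores
        counts[index] += 1
    return counts
```

A client that asked only for results above 0.8 could read the lower buckets to see how many weaker matches were left out.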

Semantic Matching

Content: Uses sentence transformers for semantic similarity with substring boost
Situation: Combines exact matching (high score) with semantic similarity fallback
Thresholds: Content and situation relevance must exceed 0.4 to be considered matches
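
The situation-matching rule can be sketched without pulling in a model. In the example below, `difflib.SequenceMatcher` from the standard library is a stand-in for sentence-transformer similarity: it compares characters, not meaning, and is used only to keep the example self-contained. The exact-match score of 1.0 and all function names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List

MATCH_THRESHOLD = 0.4  # scores must exceed this to count as a match


def string_similarity(a: str, b: str) -> float:
    """Stand-in for embedding similarity; character-based, NOT semantic."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def situation_score(query: str, situations: List[str],
                    similarity: Callable[[str, str], float] = string_similarity
                    ) -> float:
    """Exact match scores 1.0 (assumed); otherwise use the best similarity."""
    normalized = query.strip().lower()
    if any(normalized == s.strip().lower() for s in situations):
        return 1.0
    return max((similarity(query, s) for s in situations), default=0.0)


def is_match(score: float) -> bool:
    return score > MATCH_THRESHOLD
```

In a real setup the `similarity` argument would be replaced by a function that compares sentence embeddings.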

Data Model

{
  "active_day_counter": 15,
  "last_calendar_date_used": "2025-07-26",
  "insights": [
    {
      "uuid": "abc123-def456-789",
      "content": "User prefers dialogue format over instruction lists",
      "situation": ["design discussion", "collaboration patterns"],
      "base_importance": 0.8,
      "created_at": "2025-07-23T17:00:00Z",
      "importance_last_modified_at": "2025-07-25T10:30:00Z",
      "daily_access_counts": [
        [1, 3],
        [5, 2],
        [15, 1]
      ]
    }
  ]
}

Each daily_access_counts entry is an [active_day, access_count] pair: the insight above was accessed 3 times on active day 1, 2 times on active day 5, and once on active day 15.
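
For illustration, the structure above could be loaded into typed Python objects as follows. The class names `Insight` and `Store` and the function `load_store` are hypothetical. Note that Python's standard `json` module does not accept `//` comments, so a stored file must contain plain JSON.

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Insight:
    uuid: str
    content: str
    situation: List[str]
    base_importance: float
    created_at: str
    importance_last_modified_at: str
    # (active_day, access_count) pairs
    daily_access_counts: List[Tuple[int, int]] = field(default_factory=list)


@dataclass
class Store:
    active_day_counter: int
    last_calendar_date_used: str
    insights: List[Insight]


def load_store(text: str) -> Store:
    """Parse the JSON document into dataclasses."""
    raw = json.loads(text)
    insights = []
    for item in raw["insights"]:
        item = dict(item)  # copy so the parsed input is not mutated
        pairs = item.get("daily_access_counts", [])
        item["daily_access_counts"] = [tuple(pair) for pair in pairs]
        insights.append(Insight(**item))
    return Store(raw["active_day_counter"],
                 raw["last_calendar_date_used"],
                 insights)
```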

Key Design Principles

Active Day System: Time advances only when the system is used, preventing decay during vacations or other periods of non-use.

Bounded Storage: Access history limited to recent entries (typically 90) to prevent unbounded growth while maintaining sufficient data for frequency calculations.

Reinforcement Decay: Importance modifications decay over time, requiring ongoing reinforcement to maintain high relevance.

Situational Context: Multi-element situation arrays enable flexible matching against various contextual filters.

System Constants

Core parameters that tune the temporal scoring behavior:

Recency decay rate: 0.05 per active day
Frequency window: 30 active days
Upvote multiplier: 1.5×
Downvote multiplier: 0.5×
Relevance weights: 30% recency, 20% frequency, 35% importance, 15% context
Match thresholds: 0.4 for content and situation relevance
Maximum reasonable frequency: 10 accesses per day (for normalization)

Philosophy: Embracing Messiness

Traditional knowledge management systems emphasize structure: taxonomies, categories, tags, hierarchies. Hippo takes the opposite approach - embrace the mess and let value emerge organically.

Why Embrace Messiness:

Cognitive overhead: Structured systems require constant categorization decisions
Premature optimization: We often don't know what will be valuable until later
Natural emergence: Usage patterns reveal value better than upfront planning
Reduced friction: No need to "file" insights perfectly before storing them

How Messiness Works in Hippo:

Situational context instead of rigid categories - insights tagged with when/where they occurred
Fuzzy matching - "debugging React" can surface "debugging authentication" insights
Temporal scoring - let time and usage naturally separate wheat from chaff
Reinforcement learning - user feedback shapes what becomes prominent over time

The bet: A messy system with good search and temporal scoring will outperform a perfectly organized system that's too expensive to maintain.

Implementation Architecture

MCP Server Interface

Hippo implements the Model Context Protocol (MCP) providing tools for:

record_insight: Create new insights with content, situation, and importance
search_insights: Query insights with semantic and situational filters
modify_insight: Update content or apply reinforcement (upvote/downvote)

Storage Layer

JSON file storage: Single configurable file for persistence
In-memory operations: All temporal calculations performed in memory
Bounded growth: Access history automatically pruned to prevent unbounded storage
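
A minimal sketch of recording an access while keeping history bounded. It assumes the history is a list of [active_day, count] pairs sorted by day, and uses the "typically 90" entry limit mentioned under Key Design Principles; the function name is hypothetical.

```python
MAX_ACCESS_ENTRIES = 90  # "typically 90" entries are retained


def record_access(daily_access_counts, active_day,
                  max_entries=MAX_ACCESS_ENTRIES):
    """Return a new history with one more access on active_day, pruned.

    Assumes daily_access_counts is sorted by day in ascending order.
    """
    history = [list(pair) for pair in daily_access_counts]  # copy the input
    if history and history[-1][0] == active_day:
        history[-1][1] += 1              # same day: bump the existing count
    else:
        history.append([active_day, 1])  # new day: start a fresh entry
    return history[-max_entries:]        # drop the oldest entries beyond the limit
```

Since the frequency window spans 30 active days, keeping 90 entries leaves ample room for the frequency calculation.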

Search Engine

Semantic similarity: Uses sentence transformers for content matching
Situational matching: Combines exact and semantic matching for context
Composite scoring: Real-time relevance calculation using temporal factors
Distribution metadata: Reports the relevance distribution so clients can see what exists beyond the filtered results

Testing Strategy

Integration Testing Philosophy

Tests validate behavior through stable MCP interfaces rather than internal implementation details:

Temporal scenarios: Create insights, advance time, verify scoring changes
Controllable time: Test time controller allows arbitrary day advancement
In-memory storage: Tests run without disk I/O for speed and isolation
Realistic workflows: Tests mirror actual usage patterns
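
The temporal-scenario pattern might look like the sketch below. `FakeClock` is a hypothetical stand-in for the test time controller, and the recency function is inlined here, whereas the actual tests go through the MCP interface.

```python
import math


class FakeClock:
    """Hypothetical controllable clock, measured in active days."""

    def __init__(self, start_day: int = 1):
        self.day = start_day

    def advance(self, days: int) -> int:
        if days < 0:
            raise ValueError("time cannot move backwards")
        self.day += days
        return self.day


def recency(last_access_day: int, clock: FakeClock,
            decay_rate: float = 0.05) -> float:
    """Exponential recency decay relative to the clock's current day."""
    return math.exp(-decay_rate * (clock.day - last_access_day))


def test_recency_decays_as_time_advances():
    clock = FakeClock()
    created_on = clock.day
    fresh = recency(created_on, clock)

    clock.advance(7)
    week_old = recency(created_on, clock)

    clock.advance(23)  # 30 days after creation in total
    month_old = recency(created_on, clock)

    assert fresh == 1.0
    assert fresh > week_old > month_old
    assert abs(month_old - math.exp(-1.5)) < 1e-9
```

Driving time explicitly keeps such tests deterministic and fast, since nothing depends on the wall clock.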

Key Test Coverage

Recency decay: Validates exponential decay over time
Frequency windows: Confirms 30-day sliding window prevents dilution
Reinforcement learning: Verifies upvote/downvote effects on importance
Search distribution: Ensures metadata accurately reflects available data

Future Considerations

Potential Enhancements

Graph connections: Link related insights for enhanced discovery
Automatic triggers: Detect natural insight generation moments
Cross-session learning: Adapt scoring based on usage patterns
Memory hierarchy: Separate generic vs project-specific insights

Key Design Decisions

Active Day System

Time advances only when the system is actively used, making all temporal calculations "vacation-proof". This ensures insights don't decay during periods of non-use while maintaining meaningful temporal relationships.

Composite Relevance Scoring

Rather than simple recency or frequency ranking, Hippo uses a research-based weighted formula combining multiple factors. This provides more nuanced ranking that reflects actual insight value.

Reinforcement Learning Integration

User feedback directly modifies importance scores, which carry the highest weight in relevance calculation. This creates a feedback loop where valuable insights become more prominent over time.

Situational Context Matching

Insights include multi-element situation arrays enabling flexible contextual search. This allows matching against various aspects of when/where insights occurred.

Bounded Storage Growth

Access history is automatically pruned to prevent unbounded growth while maintaining sufficient data for accurate frequency calculations.

Research Foundation

The temporal scoring system is based on established research in information retrieval systems, specifically the principle that relevance should combine:

Temporal factors: Recency and frequency of access
Content factors: Semantic similarity and importance
Context factors: Situational relevance to current query

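The specific weighting (30% recency, 20% frequency, 35% importance, 15% context) reflects the relative importance of these factors for knowledge management systems, where user feedback (importance) should carry more weight than any single temporal factor. Note that recency and frequency together still account for 50% of the score.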

Validation Approach

The system includes comprehensive integration tests that validate temporal behavior through realistic scenarios:

Create insights with known characteristics
Advance time using controllable test infrastructure
Verify that relevance scores change as expected
Confirm that reinforcement learning affects ranking appropriately

This testing approach checks that the temporal scoring system behaves correctly over time. That gives a stable foundation for evaluating the core hypothesis that AI-generated insights plus user reinforcement can surface valuable knowledge effectively.

For detailed API specifications and implementation details, consult the source code and test suite.
