🆘 Flooding Rescue Voice Assistant - Personal Diary

Course: COMP4461 Human-Computer Interaction
Project: Project 2 - Flooding Rescue Voice Assistant
Role: Full-stack Developer & Chat Agent Designer

📋 Project Overview

This project is a Flooding Rescue Voice Assistant designed to help emergency dispatchers efficiently collect critical information from flood victims while providing real-time guidance and emotional support. The system integrates Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS) technologies to create a natural, conversational voice interface.

Key Features:

  • 🎤 Real-time voice recognition and transcription
  • 🤖 Intelligent conversation management with context awareness
  • 🔊 Natural voice synthesis for system responses
  • 📊 Automatic information extraction and prioritization
  • 🚨 Dynamic urgency status assessment
  • 🗺️ ETA calculation and safe location recommendations

🎯 My Contributions

1. Complete Chat Agent Architecture Design (100% ownership)

I was responsible for the entire conversational AI system design, from the initial concept to the final implementation. This included:

  • System architecture design: Designing the multi-layered conversation flow
  • Prompt engineering: Crafting effective system prompts for different scenarios
  • Information extraction logic: Implementing intelligent data parsing from natural language
  • State management: Building a robust session management system
  • Error handling: Ensuring graceful degradation when components fail

2. Frontend-Backend Integration (100% ownership)

I integrated the frontend user interface with the backend AI services, including:

  • WebSocket implementation for real-time audio streaming
  • REST API design for message handling and status updates
  • CORS configuration for cross-origin requests
  • SSL certificate generation for secure HTTPS connections (CORS and HTTPS setup are sketched after this list)
  • Audio file management for TTS output delivery

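A minimal sketch of how the CORS and HTTPS pieces fit together in FastAPI (the allowed origin and certificate file names are illustrative, not the project's exact configuration):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import uvicorn

app = FastAPI()

# Allow the separately served frontend to call the backend API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://localhost:3000"],  # illustrative frontend origin
    allow_methods=["*"],
    allow_headers=["*"],
)

if __name__ == "__main__":
    # Self-signed certificates let the browser grant microphone access over HTTPS
    uvicorn.run(app, host="0.0.0.0", port=8000,
                ssl_keyfile="key.pem", ssl_certfile="cert.pem")
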
🔄 Chat Agent Design: The Iteration Journey

Version 1.0: Basic Streamlit Interface (chat.py)

Design Philosophy: Start simple with a GUI-based prototype

Key Features:

# Initial approach: Direct streaming TTS
def voice_chat_worker(config, stop_event):
    # Simple conversation loop
    while not stop_event.is_set():
        text = asr.get_result()
        if text:
            response = llm.chat(text)
            tts.streaming_synthesize(response)

User Need Consideration:

  • Visual feedback: Users need to see their conversation history
  • Control interface: Simple START/STOP buttons for easy operation (sketched after this list)
  • Status indicators: Real-time priority score and information collection status

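A minimal sketch of how the START/STOP controls could drive the worker in Streamlit (layout details are illustrative; voice_chat_worker and config are the ones from the snippet above):

import threading
import streamlit as st

# Keep the stop flag across Streamlit reruns
if "stop_event" not in st.session_state:
    st.session_state.stop_event = threading.Event()

col_start, col_stop = st.columns(2)

if col_start.button("START"):
    st.session_state.stop_event.clear()
    threading.Thread(
        target=voice_chat_worker,
        args=(config, st.session_state.stop_event),
        daemon=True,
    ).start()

if col_stop.button("STOP"):
    st.session_state.stop_event.set()

# Status indicator (value assumed to be written into session state by the worker)
st.metric("Priority score", st.session_state.get("priority_score", "N/A"))
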
Challenges Identified:

  1. Rigid conversation flow: No context awareness between turns
  2. Redundant questions: System repeatedly asks for already-provided information
  3. Limited scalability: Single-session only, no multi-user support

Version 2.0: Intelligent Context Management (chat.py - Enhanced)

Design Evolution: Add memory and context tracking

Key Improvements:

def extract_rescue_info(text: str) -> Dict[str, Any]:
    """Extract rescue-related information from user text."""
    # Comprehensive keyword matching for:
    # - Number of people (with "alone" detection)
    # - Injury status (with negation handling)
    # - Water level indicators
    # - Vulnerable populations
    # - Available resources

User Need Consideration:

  • Avoid repetition: Track what information has been collected
  • Smart inference: If user says “I’m alone”, infer num_people = 1
  • Negative detection: Understand “no one injured” vs “someone injured”

Example:

# Before: Naive extraction
if "injured" in text:
    info["has_injury"] = True  # Wrong if the user says "no one injured"

# After: Smart extraction with negation handling
text_lower = text.lower()
injury_keywords = ["injured", "hurt", "bleeding", "wound"]  # illustrative keyword list
no_injury_patterns = [
    "no one injured", "not injured", "we're fine", "not hurt"
]
if any(pattern in text_lower for pattern in no_injury_patterns):
    info["has_injury"] = False  # ✅ Correctly handles negation
elif any(keyword in text_lower for keyword in injury_keywords):
    info["has_injury"] = True

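The "alone" inference from the smart-inference bullet above can be handled the same way; a small sketch (the pattern list is illustrative):

# Smart inference: "I'm alone" implies exactly one person
alone_patterns = ["i'm alone", "i am alone", "just me", "by myself"]  # illustrative list
if any(pattern in text_lower for pattern in alone_patterns):
    info["num_people"] = 1
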
Challenges Identified:

  1. Pattern matching limitations: Can’t handle complex linguistic variations
  2. No web access: Limited to localhost
  3. Missing frontend separation: Hard to customize UI

Version 3.0: LLM-Powered Conversation (final_version_with_frontend.py)

Design Revolution: Replace rule-based logic with LLM intelligence

Architectural Shift:

def get_llm_rescue_response(user_text: str) -> Dict[str, Any]:
    """Use the LLM to handle conversation and information extraction."""
    
    # Build the conversation history, then append the current user turn
    messages = [{"role": "system", "content": system_prompt}]
    recent_history = session.conversation_history[-10:]  # Keep recent context only
    for msg in recent_history:
        messages.append(msg)
    messages.append({"role": "user", "content": user_text})
    
    # Let the LLM maintain the conversation naturally
    response = llm.generate_from_messages(messages)
    # (information extraction and the return value are omitted in this excerpt)

System Prompt Engineering:

system_prompt = """You are a professional flood rescue assistant.

CONVERSATION STYLE:
- Ask ONLY 1-2 questions at a time, like a real person would
- Keep your response brief (2-3 sentences maximum)
- Wait for the user's answer before asking the next question

INFORMATION TO GATHER (ask separately, ONE topic at a time):
1. Number of people
2. Injuries
3. Water level
4. Location safety

IMPORTANT RULES:
- NEVER ask multiple questions in one response
- Remember what the user already told you
- If user says "I'm alone", don't ask about children/elderly
"""

User Need Consideration:

  • Natural conversation: LLM creates human-like dialogue flow
  • Context retention: System remembers previous answers
  • Adaptive questioning: Adjusts questions based on user’s situation
  • Emotional intelligence: Balances information gathering with empathy

Separate Information Analysis:

def analyze_people_and_urgency(conversation_history):
    """Use an independent LLM instance for background analysis."""
    
    analysis_prompt = f"""Extract from the conversation below:
    1. Number of adults
    2. Number of children
    3. Number of elderly
    4. Urgency status: LESS URGENT / URGENT / EMERGENT
    
    RULES:
    - If a value is already known (not N/A), KEEP IT unless the user corrects it
    - Only update N/A values when new information is found
    - Return 0 (not N/A) when the user explicitly states none
    
    CONVERSATION:
    {conversation_history}
    """
    
    # Create a separate LLM instance to avoid polluting the main conversation
    analysis_llm = create_llm(model="qwen-plus")
    response = analysis_llm.chat(analysis_prompt)

Why Separate LLM Instance?

  • 🎯 Clean separation: Analysis doesn’t interfere with conversation
  • 🎯 Incremental updates: Only update when new information appears
  • 🎯 Avoid confusion: Main conversation LLM stays focused on user

Example Interaction:

User: "Yes, I need help"
Assistant: "Copy that. How many adults are with you?"

User: "5 adults"
Assistant: "Understood, 5 adults. Are there any children or elderly with you?"

User: "Yes, 1 child"
Assistant: "Got it, 1 child. Any elderly people?"

User: "No elderly"
Assistant: "Copy that. Is anyone injured or in immediate danger?"

Challenges Identified:

  1. Information drift: LLM sometimes “forgets” confirmed numbers
  2. Over-asking: Still asks about children/elderly even when user said “I’m alone”

Version 4.0: Staged Information Collection (flooding_rescue_api_version.py)

Design Refinement: Implement explicit conversation stages

Staged Collection Logic:

class RescueSession:
    def __init__(self):
        # Stage tracking flags
        self.adult_count_asked = False
        self.adult_count_confirmed = False
        self.children_elderly_asked = False
        self.children_elderly_confirmed = False
        self.info_confirmed = False  # Stop asking when complete

Stage 1: Adult Count

if not session.adult_count_confirmed:
    analysis_prompt = """Extract ONLY the number of ADULTS.
    
    RULES:
    - If the user says "I'm alone": return ADULT=1
    - If they mention a specific number: return that number
    - If they say "we" without a number: return N/A
    
    Respond: ADULT: [number or N/A]
    """
    
    # Once the reply is parsed into adult_count and confirmed, move to Stage 2
    if adult_count == 1:
        # Special case: if the caller is alone, auto-set children=0, elderly=0
        session.children_elderly_confirmed = True

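The structured reply format makes the result easy to parse. A hypothetical helper (not the project's exact code) might look like this:

import re

def parse_adult_count(reply: str):
    """Parse the analysis LLM's 'ADULT: ...' line into an int, or None if N/A."""
    match = re.search(r"ADULT:\s*(\d+|N/A)", reply, re.IGNORECASE)
    if not match or match.group(1).upper() == "N/A":
        return None  # nothing usable yet, keep asking
    return int(match.group(1))

adult_count = parse_adult_count(response)  # response from the analysis LLM
if adult_count is not None:
    session.adult_count_confirmed = True
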
Stage 2: Children and Elderly

if session.adult_count_confirmed and not session.children_elderly_confirmed:
    analysis_prompt = """Extract CHILDREN and ELDERLY.
    
    CRITICAL RULES:
    - If mentions children but NOT elderly: CHILDREN=[n], ELDERLY=0
    - If mentions elderly but NOT children: ELDERLY=[n], CHILDREN=0
    - If says "no children": CHILDREN=0
    - Return N/A only if not mentioned at all
    """

User Need Consideration:

  • Efficiency: Stop analyzing once information is complete
  • Intelligence: Infer implicit information (alone → no children/elderly)
  • Consistency: Lock confirmed values to prevent drift
  • Performance: Skip unnecessary LLM calls with info_confirmed flag

Optimization:

def analyze_people_and_urgency(conversation_history):
    # STOP unnecessary processing once everything is known
    if session.info_confirmed:
        print("✅ All information confirmed, skipping LLM analysis")
        return current_info
    
    # Stage-based analysis: only run the prompt for the current stage
    if not session.adult_count_confirmed:
        ...  # Stage 1: only analyze the adult count
    elif not session.children_elderly_confirmed:
        ...  # Stage 2: only analyze children/elderly
    else:
        ...  # Final stage: only update the urgency status

Version 5.0: Frontend-Backend Separation (Final Version)

Architectural Components:

  1. Backend (FastAPI):
    • WebSocket for real-time audio streaming
    • REST API for message handling
    • Session management (sketched after this list)
    • AI model orchestration
  2. Frontend (HTML/CSS/JavaScript):
    • Modern responsive UI
    • Web Audio API for microphone access
    • Real-time transcription display
    • Status panels and visualizations

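Session management on the backend can be sketched as a plain in-memory registry keyed by call ID (a hypothetical shape, not the exact implementation):

from typing import Dict, Optional, Tuple
from uuid import uuid4

# One RescueSession per connected caller
sessions: Dict[str, RescueSession] = {}

def get_or_create_session(session_id: Optional[str] = None) -> Tuple[str, RescueSession]:
    """Look up an existing session, or start a new one for a new call."""
    if session_id is None or session_id not in sessions:
        session_id = str(uuid4())
        sessions[session_id] = RescueSession()
    return session_id, sessions[session_id]
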
Integration Points:

# Backend WebSocket endpoint
@app.websocket("/ws/audio")
async def websocket_audio_endpoint(websocket: WebSocket):
    await websocket.accept()
    
    # Create ASR instance for this connection
    local_asr = create_asr(model="paraformer-realtime-v2")
    local_asr.start()
    
    while not ws_closed:
        audio_data = await websocket.receive()
        local_asr.recognition.send_audio_frame(audio_data["bytes"])
        
        text = local_asr.get_result()
        if text:
            await websocket.send_json({
                "type": "transcript",
                "text": text
            })

// Frontend: Connect to WebSocket
const ws = new WebSocket('wss://localhost:8000/ws/audio');

// Send audio stream
navigator.mediaDevices.getUserMedia({ audio: true })
    .then(stream => {
        const audioContext = new AudioContext();
        const source = audioContext.createMediaStreamSource(stream);
        const processor = audioContext.createScriptProcessor(4096, 1, 1);
        
        processor.onaudioprocess = (e) => {
            const audioData = e.inputBuffer.getChannelData(0);
            ws.send(audioData);  // Send raw samples to the backend
        };
        
        // Wire the audio graph so onaudioprocess actually fires
        source.connect(processor);
        processor.connect(audioContext.destination);
    });

// Receive transcriptions
ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === 'transcript') {
        displayTranscription(data.text);
    }
};

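The REST side mentioned earlier (message handling and TTS audio delivery) can be sketched roughly as below; generate_reply, synthesize_to_file, and the response shape are assumptions for illustration:

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel

app = FastAPI()
# Serve generated TTS clips so the browser can fetch and play them
app.mount("/audio", StaticFiles(directory="tts_output"), name="audio")

class MessageIn(BaseModel):
    session_id: str
    text: str

@app.post("/api/message")
async def handle_message(msg: MessageIn):
    # Hypothetical flow: LLM reply -> TTS file -> URL handed back to the frontend
    reply_text = generate_reply(msg.session_id, msg.text)  # assumed conversation helper
    audio_file = synthesize_to_file(reply_text)            # assumed TTS helper
    return {"reply": reply_text, "audio_url": f"/audio/{audio_file}"}
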
User Need Consideration:

  • Accessibility: Works on any device with a browser
  • Real-time feedback: Instant transcription display
  • Visual clarity: Separate panels for conversation, status, and actions
  • Network resilience: Handles disconnections gracefully

🖼️ Frontend User Interface

Main Interface Screenshot

[Screenshot: Frontend interface]


Key UI Components

  1. Conversation Area:
    • Real-time message display
    • User and assistant messages clearly distinguished
    • Auto-scroll to latest message
  2. Status Panel:
    • People count (Adults / Children / Elderly / Total)
    • Urgency status indicator (LESS URGENT / URGENT / EMERGENT)
    • Information collection progress
  3. Control Panel:
    • Microphone toggle button
    • Volume indicator
    • Connection status
  4. Guidance Panel:
    • Safety tips
    • Nearby safe locations
    • ETA information (when available)

🤖 AI Usage Acknowledgment

AI Tools Used

  1. Claude AI (Anthropic) - Used for:
    • Code debugging and optimization suggestions
    • Documentation writing assistance
    • System prompt refinement ideas
  2. LLM APIs - Integrated into the system:
    • Qwen-Plus (Alibaba): Main conversation LLM
    • Paraformer (Alibaba): Real-time ASR
    • CosyVoice (Alibaba): Text-to-speech synthesis

Human Contributions

All core design decisions, architecture, and implementation were done by me:

  • ✅ System architecture and conversation flow design
  • ✅ State management and session handling logic
  • ✅ All prompt engineering and testing
  • ✅ Frontend-backend integration
  • ✅ All code implementation and debugging
  • ✅ User testing and iteration based on feedback

AI was used only as a coding assistant, not as the primary designer or implementer.


💭 Reflection and Learnings

What Went Well

  1. Iterative Design Approach:
    • Each version addressed specific shortcomings of the previous one
    • Gradual complexity increase prevented overwhelm
    • Regular testing at each stage ensured stability
  2. User-Centered Thinking:
    • Considered different user scenarios (alone vs. with family, injured vs. safe)
    • Designed for high-stress situations (brevity, clarity, reassurance)
    • Balanced information gathering with emotional support
  3. Technical Integration:
    • Successfully integrated multiple AI services (ASR, LLM, TTS)
    • Achieved real-time performance with WebSocket streaming
    • Handled edge cases and error states gracefully

Challenges Faced

  1. LLM Hallucination:
    • Problem: LLM sometimes “made up” information not stated by user
    • Solution: Separate analysis LLM with strict extraction rules
    • Lesson: Don’t trust LLM memory; use explicit state management
  2. Context Window Limits:
    • Problem: Long conversations exceeded LLM context limits
    • Solution: Keep only recent 10 messages in conversation history
    • Lesson: Design for realistic conversation lengths
  3. Information Consistency:
    • Problem: LLM changed confirmed values in later analysis
    • Solution: Lock confirmed values with flags (adult_count_confirmed)
    • Lesson: Human-like memory requires explicit tracking, not just prompts
  4. Network Latency:
    • Problem: TTS generation delay created awkward pauses
    • Solution: Use streaming synthesis and audio buffering (a generic sketch follows this list)
    • Lesson: Real-time systems need careful latency management

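A generic way to decouple synthesis from playback is a producer/consumer buffer; this is a sketch of the idea rather than the project's exact code (play_audio_chunk is an assumed playback helper):

import queue
import threading

audio_chunks = queue.Queue()  # TTS produces chunks, playback consumes them

def playback_worker():
    """Drain the buffer continuously so short TTS delays do not cause audible gaps."""
    while True:
        chunk = audio_chunks.get()
        if chunk is None:          # sentinel: end of stream
            break
        play_audio_chunk(chunk)    # assumed playback helper

threading.Thread(target=playback_worker, daemon=True).start()

# The streaming TTS callback only needs to enqueue data as it arrives
def on_tts_chunk(chunk: bytes):
    audio_chunks.put(chunk)
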
What I Would Do Differently

  1. Earlier User Testing:
    • I focused too much on technical implementation early on
    • Should have tested with real users after Version 2.0 to identify usability issues
  2. More Structured State Machine:
    • The staged collection approach came late in the project
    • A formal state machine design from the start would have saved iteration time
  3. Better Error Messages:
    • Current error handling is functional but not user-friendly
    • Would add more specific, actionable error messages for users
  4. Conversation Repair Strategies:
    • Need better handling when LLM misunderstands user input
    • Should implement confirmation questions for critical information

Future Improvements

  1. Offline Mode: Cache essential functionality for no-internet scenarios
  2. Visual Aids: Add map integration to show rescue team location
  3. Database Integration: Store conversations for analysis and improvement
  4. Multi-User Dashboard: Allow dispatchers to manage multiple rescue calls