A production-ready AI orchestration system for CLT-aware signal/noise classification, featuring a Mixture of Experts (MoE) architecture, first-principles model selection, and cognitive-load-aware processing optimized for therapeutic journaling and neurodivergent care.
A modular pipeline that consumes user text and produces intelligent, context-aware responses.
Deterministic pipeline that consumes user text and produces a WorkOrder. Uses heuristics and pattern matching before any LLM is called.
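A minimal sketch of what that deterministic stage could look like. The `WorkOrder` fields, the regex rules, and `build_work_order` are all hypothetical illustrations of "heuristics and pattern matching before any LLM is called", not the project's actual schema:

```python
from dataclasses import dataclass, field
import re

@dataclass
class WorkOrder:
    """Hypothetical WorkOrder: the pipeline's structured output contract."""
    text: str
    task_type: str                      # e.g. "code" or "write"
    risk_flags: list = field(default_factory=list)

# Toy keyword heuristics; a real ruleset would be far richer.
_CODE_PATTERN = re.compile(r"\b(def |class |import |traceback)\b")
_RISK_PATTERN = re.compile(r"\b(hopeless|self-harm)\b", re.IGNORECASE)

def build_work_order(text: str) -> WorkOrder:
    """Classify input deterministically, with no model call."""
    task = "code" if _CODE_PATTERN.search(text) else "write"
    flags = ["risk"] if _RISK_PATTERN.search(text) else []
    return WorkOrder(text=text, task_type=task, risk_flags=flags)
```

Because the stage is pure pattern matching, identical input always yields an identical WorkOrder, which keeps the pre-LLM path testable and cheap.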
First-principles measurement-based system that assigns models to experts using deterministic probe suites (JSON adherence, coding sanity).
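One way such a probe could be scored, sketched under assumptions: `json_adherence_score` and `assign_expert` are invented names, and the 0.9 threshold is illustrative, but the shape (deterministic scoring, then a deterministic argmax) matches the measurement-based idea:

```python
import json

def json_adherence_score(model_outputs):
    """Fraction of probe responses that parse as valid JSON objects."""
    ok = 0
    for raw in model_outputs:
        try:
            ok += isinstance(json.loads(raw), dict)
        except json.JSONDecodeError:
            pass
    return ok / len(model_outputs) if model_outputs else 0.0

def assign_expert(scores, threshold=0.9):
    """Deterministically pick the best-scoring model above a threshold."""
    eligible = {m: s for m, s in scores.items() if s >= threshold}
    return max(eligible, key=eligible.get) if eligible else None
```

Running the same probe suite against each candidate model gives a comparable score table, so assignments are reproducible rather than vibes-based.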
ModelProvider abstraction layer supporting HuggingFace, llama.cpp (GGUF), vLLM, Ollama, and cloud APIs (Groq, OpenAI, Anthropic).
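The abstraction layer presumably reduces each backend to one generation interface. A minimal sketch, assuming a single `generate` method (the real interface likely carries more parameters); `EchoProvider` is a stand-in so the example runs without model weights:

```python
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    """Hypothetical provider interface shared by all backends
    (HuggingFace, llama.cpp, vLLM, Ollama, cloud APIs)."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class EchoProvider(ModelProvider):
    """Trivial concrete backend used only to make the sketch runnable."""

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return prompt[:max_tokens]
```

Experts then depend on `ModelProvider`, so swapping a local GGUF model for a cloud API is a construction-time choice rather than a code change.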
Consumes WorkOrder and routes tasks through specialized Experts (Planner, Code, Writer) assigned by the Selection Engine.
Stores long-term memory, user profiles, and session history in SQLite. Thread-safe with memory-aware LRU eviction.
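A toy version of a thread-safe SQLite session store. The `SessionStore` class, its schema, and its method names are assumptions for illustration; the point is the lock around every database access:

```python
import sqlite3
import threading

class SessionStore:
    """Minimal sketch of a thread-safe, SQLite-backed session history."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        # check_same_thread=False lets one connection serve many threads;
        # the lock serializes access to keep that safe.
        self._db = sqlite3.connect(path, check_same_thread=False)
        self._db.execute("CREATE TABLE IF NOT EXISTS history (user TEXT, msg TEXT)")

    def append(self, user, msg):
        with self._lock:
            self._db.execute("INSERT INTO history VALUES (?, ?)", (user, msg))
            self._db.commit()

    def history(self, user):
        with self._lock:
            rows = self._db.execute(
                "SELECT msg FROM history WHERE user = ?", (user,)
            ).fetchall()
        return [r[0] for r in rows]
```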
Surfaces risk signals and compresses clinical context for mental health professionals. HIPAA-aligned data handling.
Enterprise-grade features for real-world deployment.
Unlike monolithic LLMs, tasks are split into a graph. Specialists handle planning, coding, and writing with optimized model assignments.
HF Transformers for dev, llama.cpp for efficiency, vLLM for serving, cloud APIs for scale. Hybrid local/cloud routing for privacy.
Adaptive streaming of text based on real-time cognitive load metrics. Information delivered at the pace you can process.
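The pacing idea can be sketched as chunk sizing driven by a load score. `paced_chunks` and its word counts are illustrative assumptions, not the project's actual tuning:

```python
def paced_chunks(text, load):
    """Split text into delivery chunks; higher cognitive load -> smaller chunks.

    `load` is a score in [0, 1]. The 12-word ceiling and 2-word floor
    are arbitrary values chosen for this sketch.
    """
    words = text.split()
    size = max(2, int(12 * (1 - load)))
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

At low load the reader gets near-paragraph chunks; at high load the same text arrives a few words at a time.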
Cognitive-load-aware notification clustering by theme. Delivery scheduled based on user's current mental bandwidth.
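The clustering half of this can be shown in a few lines. `cluster_notifications` and the `(theme, message)` tuple shape are assumed for illustration; scheduling against the user's bandwidth would sit on top of this:

```python
from collections import defaultdict

def cluster_notifications(notes):
    """Group (theme, message) pairs so each theme can be delivered as one batch."""
    clusters = defaultdict(list)
    for theme, msg in notes:
        clusters[theme].append(msg)
    return dict(clusters)
```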
Thread-safe ModelCache with memory-aware LRU eviction. Prevents 14GB reloads and manages VRAM/RAM lifecycle.
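A sketch of the memory-aware LRU idea (locking omitted for brevity). The class shape and byte-budget accounting are assumptions; the eviction policy is the standard one: evict least-recently-used entries until the new model fits:

```python
from collections import OrderedDict

class ModelCache:
    """Illustrative memory-aware LRU cache; sizes tracked in bytes."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self._items = OrderedDict()   # name -> (model, size); order = recency
        self._used = 0

    def put(self, name, model, size):
        if name in self._items:
            self._used -= self._items.pop(name)[1]
        while self._items and self._used + size > self.budget:
            _, (_, evicted_size) = self._items.popitem(last=False)  # drop LRU
            self._used -= evicted_size
        self._items[name] = (model, size)
        self._used += size

    def get(self, name):
        if name not in self._items:
            return None
        self._items.move_to_end(name)   # mark as most recently used
        return self._items[name][0]
```

Keeping a loaded model resident is what avoids the multi-gigabyte reloads; the byte budget is what keeps residency from exhausting VRAM/RAM.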
Custom exception hierarchy for granular error recovery. Retry transient timeouts, escalate crisis-level content.
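The retry-vs-escalate split falls out of the hierarchy naturally: catch only the retryable branch. The class and function names below are hypothetical stand-ins for the project's actual hierarchy:

```python
import time

class CLEError(Exception):
    """Hypothetical base of the exception hierarchy."""

class TransientTimeout(CLEError):
    """Retryable: a backend was slow or briefly unavailable."""

class CrisisContent(CLEError):
    """Never retried: escalate to the safety path immediately."""

def with_retries(fn, attempts=3, delay=0.0):
    """Retry only transient failures; everything else propagates at once."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientTimeout:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

Because `CrisisContent` is not caught, it bypasses the retry loop entirely, which is exactly the behavior you want for safety-critical signals.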
Choose the right engine for your deployment scenario.
HF Transformers (Development): Best for development and experimentation. 4-bit/8-bit quantization support.
llama.cpp (Efficiency): Maximum efficiency and broad model support. Optimized for local inference.
vLLM (Serving): Optimized for serving and local server integration. High-throughput inference.
Cloud APIs (Scale): Groq (~500 tok/sec), OpenAI, Anthropic. Hybrid routing between local and cloud.
Modular installation via pyproject.toml for flexible dependency management.
[project.optional-dependencies]
core = [
"transformers",
"torch",
"accelerate"
]
api = [
"fastapi",
"uvicorn",
"pydantic",
"httpx"
]
llamacpp = ["llama-cpp-python"]
vllm = ["httpx"]
full = ["cle[core,api,llamacpp,vllm]"]
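Assuming the package name `cle` (as the `full` extra above suggests) and a standard pip install from this repository, selecting extras would look like:

```shell
# Install only what your deployment needs; extras map to the table above.
pip install "cle[core,api]"     # pipeline plus the HTTP API
pip install "cle[full]"         # everything, including llama.cpp and vLLM support
```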
The CLE is instrumented with the CLE Observer for real-time monitoring and analysis.
Real-time tracking of expert selection for specific inputs
Processing time across the pipeline for bottleneck identification
Safety-triggered refusal analysis to tune policy strictness
Cost and performance tracking across model versions
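The metrics above could be collected by something as small as the following. This `Observer` class is a hypothetical sketch, not the CLE Observer's real API:

```python
from collections import defaultdict

class Observer:
    """Minimal instrumentation sketch: per-stage timings plus event counters."""

    def __init__(self):
        self.timings = defaultdict(list)   # stage name -> list of seconds
        self.counts = defaultdict(int)     # counter name -> occurrences

    def record(self, stage, seconds):
        self.timings[stage].append(seconds)

    def bump(self, counter):
        self.counts[counter] += 1

    def mean_latency(self, stage):
        samples = self.timings[stage]
        return sum(samples) / len(samples) if samples else 0.0
```

Per-stage means surface pipeline bottlenecks, and counters such as safety refusals or per-model costs accumulate in the same structure.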
The brain behind the PRJCT LAZRUS ecosystem. Contact us to learn more.