
Cognitive Load Engine "Intelligence That Respects Your Mind"

A production-ready AI orchestration system for CLT-aware (Cognitive Load Theory) signal/noise classification. It features a Mixture of Experts (MoE) architecture, first-principles model selection, and cognitive-load-aware processing optimized for therapeutic journaling and neurodivergent care.

MoE Architecture
5+ Model Backends
REST + WebSocket API
Core pattern: User Input → CLE Frontend → WorkOrder → Experts (Planner, Code, Writer)

System Architecture

A modular pipeline that consumes user text and produces intelligent, context-aware responses.

1

CLE Frontend

Deterministic pipeline that consumes user text and produces a WorkOrder. Uses heuristics and pattern matching before any LLM is called.
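A minimal sketch of this frontend stage, assuming a WorkOrder shaped roughly as below; the class name comes from the source, but the fields, patterns, and `build_work_order` helper are illustrative, not the project's actual code:

```python
# Hypothetical sketch: deterministic heuristics produce a WorkOrder
# before any LLM call. Field names and patterns are illustrative.
import re
from dataclasses import dataclass, field

@dataclass
class WorkOrder:
    raw_text: str
    task_type: str                      # "plan" | "code" | "write"
    signals: list = field(default_factory=list)

CODE_PATTERN = re.compile(r"\b(def|import|traceback|stack trace)\b", re.IGNORECASE)
PLAN_PATTERN = re.compile(r"\b(plan|steps|roadmap|schedule)\b", re.IGNORECASE)

def build_work_order(text: str) -> WorkOrder:
    """Classify user text with cheap pattern matching; no model involved."""
    if CODE_PATTERN.search(text):
        task = "code"
    elif PLAN_PATTERN.search(text):
        task = "plan"
    else:
        task = "write"
    return WorkOrder(raw_text=text, task_type=task)
```

Because this stage is pure pattern matching, it is fast, deterministic, and testable in isolation.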

2

Model Selection Engine

First-principles measurement-based system that assigns models to experts using deterministic probe suites (JSON adherence, coding sanity).
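One of the named probes, JSON adherence, could be scored along these lines; the function below is a hedged sketch where `generate` stands in for any backend's text-generation call:

```python
# Illustrative probe: score a model's JSON adherence by checking how many
# probe replies parse as valid JSON. `generate` is any backend callable.
import json
from typing import Callable

def json_adherence_score(generate: Callable[[str], str], prompts: list) -> float:
    """Return the fraction of probe prompts whose replies parse as JSON."""
    passed = 0
    for prompt in prompts:
        try:
            json.loads(generate(prompt))
            passed += 1
        except (json.JSONDecodeError, TypeError):
            pass  # malformed reply: probe failed
    return passed / len(prompts)
```

A deterministic score like this lets the engine rank candidate models per expert without any subjective benchmarking.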

3

Local Model Runtime

ModelProvider abstraction layer supporting HuggingFace, llama.cpp (GGUF), vLLM, Ollama, and cloud APIs (Groq, OpenAI, Anthropic).
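The abstraction might look like the interface below; `ModelProvider` is named in the source, but this signature and the `EchoProvider` stand-in are assumptions for illustration:

```python
# Minimal sketch of a ModelProvider interface. Concrete providers for
# HuggingFace, llama.cpp, vLLM, Ollama, or cloud APIs would subclass it.
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        """Produce a completion for the given prompt."""

class EchoProvider(ModelProvider):
    """Stand-in backend used here only to demonstrate the contract."""
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return prompt[:max_tokens]
```

The rest of the pipeline depends only on this interface, so swapping llama.cpp for a cloud API is a configuration change, not a code change.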

4

Orchestrator

Consumes WorkOrder and routes tasks through specialized Experts (Planner, Code, Writer) assigned by the Selection Engine.
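The routing step can be sketched as a simple dispatch table keyed by task type; the expert assignments and fallback behavior below are assumptions, not the project's actual logic:

```python
# Hypothetical router: dispatch a WorkOrder's task type to the expert
# assigned by the Selection Engine, falling back to the Writer.
from typing import Callable, Dict

class Orchestrator:
    def __init__(self, experts: Dict[str, Callable[[str], str]]):
        # e.g. {"plan": planner, "code": coder, "write": writer}
        self.experts = experts

    def run(self, task_type: str, text: str) -> str:
        expert = self.experts.get(task_type, self.experts["write"])
        return expert(text)
```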

5

Persistence Layer

Stores long-term memory, user profiles, and session history in SQLite. Thread-safe with memory-aware LRU eviction.
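A sketch of the session-history side of this layer using Python's built-in sqlite3; the schema and function names are assumptions, not the project's actual tables:

```python
# Illustrative persistence sketch: session history in SQLite.
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        " id INTEGER PRIMARY KEY, user TEXT, entry TEXT,"
        " ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def add_entry(conn: sqlite3.Connection, user: str, entry: str) -> None:
    conn.execute("INSERT INTO sessions (user, entry) VALUES (?, ?)", (user, entry))
    conn.commit()

def history(conn: sqlite3.Connection, user: str) -> list:
    rows = conn.execute(
        "SELECT entry FROM sessions WHERE user = ? ORDER BY id", (user,)
    )
    return [r[0] for r in rows]
```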

6

Therapist Dashboard

Surfaces risk signals and compresses clinical context for mental health professionals. HIPAA-aligned data handling.

Key Features

Production-Ready Intelligence

Enterprise-grade features for real-world deployment.

Mixture of Experts

Unlike a monolithic LLM, the engine splits tasks into a graph: specialists handle planning, coding, and writing with optimized model assignments.

Pluggable Runtimes

HF Transformers for dev, llama.cpp for efficiency, vLLM for serving, cloud APIs for scale. Hybrid local/cloud routing for privacy.

Progressive Disclosure

Adaptive streaming of text based on real-time cognitive load metrics. Information delivered at the pace you can process.
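One way such pacing could work, as a sketch: chunk the outgoing text by a load score in [0, 1], with higher load producing smaller chunks. The function and thresholds are hypothetical, not the engine's actual policy:

```python
# Illustrative pacing policy: higher cognitive load -> smaller chunks.
def disclose(text: str, load: float, base: int = 80) -> list:
    """Split text into chunks whose size shrinks as load rises."""
    size = max(20, int(base * (1.0 - load)))  # floor of 20 chars per chunk
    return [text[i:i + size] for i in range(0, len(text), size)]
```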

Attention Management

Cognitive-load-aware notification clustering by theme. Delivery scheduled based on user's current mental bandwidth.

Model Caching

Thread-safe ModelCache with memory-aware LRU eviction. Prevents 14GB reloads and manages VRAM/RAM lifecycle.
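A memory-aware LRU eviction policy of this kind can be sketched as below; `ModelCache` is named in the source, while the budget mechanism and method names here are illustrative:

```python
# Sketch: evict least-recently-used models when a memory budget is
# exceeded; a lock keeps operations thread-safe.
import threading
from collections import OrderedDict

class ModelCache:
    def __init__(self, budget_gb: float):
        self.budget = budget_gb
        self._items = OrderedDict()       # name -> (model, size_gb)
        self._lock = threading.Lock()

    def put(self, name: str, model, size_gb: float) -> None:
        with self._lock:
            self._items[name] = (model, size_gb)
            self._items.move_to_end(name)
            # Evict LRU entries until we fit the budget again.
            while sum(s for _, s in self._items.values()) > self.budget:
                self._items.popitem(last=False)

    def get(self, name: str):
        with self._lock:
            if name not in self._items:
                return None
            self._items.move_to_end(name)  # mark as recently used
            return self._items[name][0]
```

Keeping a hot model resident under a budget like this is what avoids the multi-gigabyte reload penalty on every request.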

Resilience Patterns

Custom exception hierarchy for granular error recovery. Retry transient timeouts, escalate crisis-level content.
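The two behaviors named above could be expressed with a hierarchy like this; the exception names and retry helper are hypothetical illustrations of the pattern:

```python
# Hypothetical hierarchy: transient timeouts are retried, crisis-level
# content escalates immediately instead of being retried.
class CLEError(Exception): ...
class TransientTimeout(CLEError): ...
class CrisisContentError(CLEError): ...

def with_retry(fn, retries: int = 2):
    """Retry transient failures; let crisis-level errors propagate at once."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientTimeout:
            if attempt == retries:
                raise
        # CrisisContentError is deliberately NOT caught here.
```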

Supported Runtimes

Flexible Model Backends

Choose the right engine for your deployment scenario.

HuggingFace Transformers

Best for development and experimentation. 4-bit/8-bit quantization support.

Development

llama.cpp (GGUF)

Maximum efficiency and broad model support. Optimized for local inference.

Efficiency

vLLM / Ollama

Optimized for serving and local server integration. High-throughput inference.

Serving

Cloud APIs

Groq (~500 tok/sec), OpenAI, Anthropic. Hybrid routing between local and cloud.

Scale

Installation

Get Started

Modular installation via pyproject.toml for flexible dependency management.

pip install -e ".[api]"       # FastAPI + Uvicorn
pip install -e ".[llamacpp]"  # llama.cpp backend
pip install -e ".[full]"      # All backends + API
pyproject.toml
[project.optional-dependencies]
core = [
    "transformers",
    "torch",
    "accelerate"
]
api = [
    "fastapi",
    "uvicorn",
    "pydantic",
    "httpx"
]
llamacpp = ["llama-cpp-python"]
vllm = ["httpx"]
full = ["cle[core,api,llamacpp,vllm]"]

Observability

Integrated with W33KND

The CLE is instrumented with the CLE Observer for real-time monitoring and analysis.

Router Decisions

Real-time tracking of expert selection for specific inputs

Latency Tracking

Processing time across the pipeline for bottleneck identification

Refusal Analysis

Safety-triggered refusal analysis to tune policy strictness

Model Registry

Cost and performance tracking across model versions

Power Your AI with CLE

The brain behind the PRJCT LAZRUS ecosystem. Contact us to learn more.