
    How Does Browser-Based AI Work? WebGPU, WASM, and Whisper Explained

    Published: March 16, 2026 · 10 min read

    A few years ago, running a neural network in a browser sounded absurd. Today, tools like TalkToTextly run the Whisper speech recognition model — a 74–300MB neural network — entirely in your browser tab, with no server involved. How is this possible? This article explains the technology stack: WebAssembly, WebGPU, transformer models, and how they come together to make browser-based AI real.

    The Short Version

    Browser-based AI works by compiling AI models to run as WebAssembly (compiled binary code) or using the browser's GPU via WebGPU. The model is downloaded once and cached, then all computation happens on your device — just like a native app, but inside a browser tab.

    The Traditional Problem: AI Needs Servers

    Historically, AI inference — running a model on new data — required expensive server infrastructure:

    • A user uploads audio to a server
    • The server (with GPU) runs the AI model
    • The result is sent back to the user

    This model has three problems: privacy (your data leaves your device), cost (GPU servers are expensive), and latency (network round-trip adds delay).

    Browser-based AI solves all three: computation moves to the user's device, eliminating data uploads, server costs, and network delays.

    WebAssembly (WASM): The Foundation

    WebAssembly is a binary instruction format that runs in all modern browsers. Think of it as a universal bytecode — code written in C, C++, Rust, or Go can be compiled to WASM and run in the browser at near-native speed.

    Why does this matter for AI? Neural networks are computationally intensive C++ programs at their core. Frameworks like llama.cpp and whisper.cpp are highly optimized C++ implementations of popular AI models. These can be compiled to WASM and run in the browser with minimal overhead.
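    To make that concrete, here is a minimal TypeScript sketch of loading and calling a compiled WASM module in the browser. The add.wasm file and its exported add function are hypothetical stand-ins for a real module such as whisper.cpp's WASM build:

    ```typescript
    // Minimal sketch: load and call a WASM module from the browser.
    // "add.wasm" and its exported "add" function are hypothetical
    // stand-ins for a real compiled module like whisper.cpp's WASM build.
    async function runWasm(): Promise<void> {
      const { instance } = await WebAssembly.instantiateStreaming(
        fetch("/add.wasm"), // compiles while the bytes stream in
        {}                  // import object (empty for this toy module)
      );
      const add = instance.exports.add as (a: number, b: number) => number;
      console.log(add(2, 3)); // => 5, computed inside the WASM sandbox
    }

    runWasm();
    ```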

    WASM Advantages

    • Runs in all modern browsers (no plugins)
    • Near-native execution speed
    • Sandboxed — can't access the OS directly
    • Deterministic — same input always gives same output
    • Supports SIMD instructions for parallel computation

    WASM Limitations

    • CPU only (no GPU access) — WebGPU fills this gap
    • Limited memory (historically 4GB max)
    • Slower than native for some operations
    • Large models are slow to download

    TalkToTextly uses Whisper compiled to WASM via the @xenova/transformers library (Transformers.js), which ports Hugging Face's transformers to JavaScript/WASM. The model runs on your CPU using WASM SIMD instructions for acceleration.
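    In code, using Transformers.js looks roughly like the sketch below, following the library's pipeline API; Xenova/whisper-base.en is one of its pre-converted Whisper models, and the audio URL is a placeholder:

    ```typescript
    import { pipeline } from "@xenova/transformers";

    // Build an automatic-speech-recognition pipeline backed by Whisper.
    // The weights are downloaded on first use, then cached by the browser.
    const transcriber = await pipeline(
      "automatic-speech-recognition",
      "Xenova/whisper-base.en"
    );

    // Accepts a URL or a Float32Array of 16kHz mono PCM samples.
    const output = await transcriber("https://example.com/meeting.wav");
    console.log(output.text);
    ```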

    WebGPU: Bringing GPU Power to the Browser

    WebGPU is the successor to WebGL and gives web applications direct, low-level access to the device's GPU. Shipped in Chrome and Edge in 2023, with Firefox and Safari support still rolling out, WebGPU enables GPU-accelerated AI in the browser.

    Neural networks consist largely of matrix multiplications — exactly what GPUs are optimized for. A model that takes 60 seconds on CPU might take 5–10 seconds on GPU. WebGPU makes this possible without any server.
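    Because support varies by browser, a typical pattern is to probe for WebGPU at runtime and fall back to WASM. A sketch (real code would use the @webgpu/types definitions instead of the any cast):

    ```typescript
    // Sketch: choose the fastest available backend at runtime and fall
    // back to WASM (CPU) when WebGPU is missing or the GPU is blocklisted.
    async function pickBackend(): Promise<"webgpu" | "wasm"> {
      const gpu = (navigator as any).gpu; // typed properly via @webgpu/types
      if (!gpu) return "wasm";            // browser has no WebGPU at all
      const adapter = await gpu.requestAdapter();
      return adapter ? "webgpu" : "wasm"; // requestAdapter() can resolve to null
    }

    console.log(`Running Whisper via ${await pickBackend()}`);
    ```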

    WASM vs WebGPU for Whisper

    Factor          | WASM (CPU)              | WebGPU (GPU)
    Speed           | Moderate                | Fast (3–5×)
    Browser Support | Universal               | Chrome/Edge (stable), Firefox/Safari (experimental)
    Model Size      | Base/Small preferred    | Medium/Large feasible
    Battery Impact  | Higher (CPU intensive)  | Lower (GPU more efficient for this)

    How Whisper Runs in the Browser: Step-by-Step

    1. You open TalkToTextly

    The browser loads the JavaScript application. No model is loaded yet.

    2. First use: Model download

    When you first transcribe, the browser downloads the Whisper model weights from a CDN. The base English model is ~40MB; the multilingual base is ~74MB. This is cached in your browser's IndexedDB for subsequent uses.
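    Conceptually, the caching logic looks like the sketch below. The database and store names are made up for illustration; Transformers.js manages its own cache internally:

    ```typescript
    // Illustrative sketch of caching model weights in IndexedDB so the
    // download happens only once. Database/store names are hypothetical.
    function openDb(): Promise<IDBDatabase> {
      return new Promise((resolve, reject) => {
        const req = indexedDB.open("model-cache", 1);
        req.onupgradeneeded = () => req.result.createObjectStore("weights");
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }

    async function getOrFetchModel(url: string): Promise<ArrayBuffer> {
      const db = await openDb();
      const cached = await new Promise<ArrayBuffer | undefined>((resolve) => {
        const req = db.transaction("weights").objectStore("weights").get(url);
        req.onsuccess = () => resolve(req.result as ArrayBuffer | undefined);
        req.onerror = () => resolve(undefined);
      });
      if (cached) return cached; // cache hit: no network request at all

      const buf = await (await fetch(url)).arrayBuffer(); // first use: download
      db.transaction("weights", "readwrite").objectStore("weights").put(buf, url);
      return buf;
    }
    ```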

    3. Audio preprocessing

    Your audio file is decoded by the Web Audio API into a raw PCM waveform. Whisper requires 16kHz mono audio, so the browser resamples your file automatically.
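    This step maps directly onto the Web Audio API, as in this sketch (channel averaging for stereo files is omitted for brevity):

    ```typescript
    // Decode any audio file to 16kHz mono PCM, the input Whisper expects.
    // An AudioContext constructed with a sampleRate resamples during decode.
    async function toWhisperInput(file: File): Promise<Float32Array> {
      const ctx = new AudioContext({ sampleRate: 16000 });
      const decoded = await ctx.decodeAudioData(await file.arrayBuffer());
      await ctx.close();
      return decoded.getChannelData(0); // first channel; a stereo file would
                                        // normally be averaged down to mono
    }
    ```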

    4. Log-Mel spectrogram computation

    The raw audio waveform is converted into a log-Mel spectrogram — a 2D representation of frequency content over time. This is the actual input to Whisper's encoder.
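    The sketch below shows the core of that transformation, assuming the power spectrogram and mel filterbank matrix are already computed; the real pipeline uses 80 mel bins over 25ms windows with a 10ms hop:

    ```typescript
    // Conceptual sketch of the log-Mel step: multiply a precomputed power
    // spectrogram by a mel filterbank matrix, then apply log compression.
    function logMel(
      power: number[][],   // [fftBins][frames] power spectrogram
      filters: number[][]  // [melBins][fftBins] triangular mel filters
    ): number[][] {
      const eps = 1e-10; // floor to avoid log(0)
      return filters.map((filter) =>
        power[0].map((_, t) => {
          let energy = 0; // weighted sum of FFT bins for this mel band
          for (let k = 0; k < filter.length; k++) energy += filter[k] * power[k][t];
          return Math.log10(energy + eps);
        })
      );
    }
    ```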

    5. Encoder (feature extraction)

    Whisper's transformer encoder processes the spectrogram and extracts audio features as a sequence of embedding vectors. This runs as WASM code on your CPU (or WebGPU on your GPU).

    6. Decoder (text generation)

    The transformer decoder generates text tokens one by one, using the encoder's output and previously generated tokens as context. This is autoregressive — each token depends on all previous ones.
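    A greedy version of that loop looks like the sketch below; decoderStep is a hypothetical stand-in for one forward pass of the decoder, and the vocabulary constants are Whisper's multilingual values:

    ```typescript
    const EOT = 50257; // <|endoftext|> id in Whisper's multilingual vocabulary

    // Hypothetical stand-in for one decoder forward pass: returns one logit
    // per vocabulary entry. The real pass runs the transformer in WASM or
    // WebGPU; this stub just signals "done" immediately.
    function decoderStep(encoderOutput: Float32Array, tokens: number[]): number[] {
      const logits = new Array(51865).fill(0); // multilingual vocab size
      logits[EOT] = 1;
      return logits;
    }

    function greedyDecode(encoderOutput: Float32Array, start: number[]): number[] {
      const tokens = [...start];
      while (tokens.length < 448) { // Whisper's decoder context limit
        const logits = decoderStep(encoderOutput, tokens);
        let next = 0; // greedy: take the argmax over the vocabulary
        for (let i = 1; i < logits.length; i++) if (logits[i] > logits[next]) next = i;
        if (next === EOT) break; // model signals end of transcript
        tokens.push(next);       // feed the token back in as context
      }
      return tokens;
    }
    ```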

    7. Text returned to page

    The generated tokens are mapped back to UTF-8 text by Whisper's tokenizer and displayed in the browser. No server is involved at any point.

    Web Workers: Keeping the Browser Responsive

    Running a neural network in the browser's main thread would freeze the UI — the page would become unresponsive during transcription. To prevent this, TalkToTextly runs Whisper inside a Web Worker: a background thread that can do heavy computation without blocking the main UI thread.

    The Web Worker communicates with the main page via message passing: "start transcription", "progress update (30%)", "transcription complete, here is the text." This is why you see a live progress bar while the AI works.
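    Stripped to its essentials, that protocol looks like the sketch below; runWhisper, updateProgressBar, showTranscript, and pcm are hypothetical stand-ins for the real inference wrapper and UI code:

    ```typescript
    // main.ts — the UI thread spawns the worker and reacts to its messages.
    const worker = new Worker(new URL("./whisper-worker.ts", import.meta.url), {
      type: "module",
    });
    worker.onmessage = (e: MessageEvent) => {
      if (e.data.type === "progress") updateProgressBar(e.data.percent); // hypothetical UI helper
      if (e.data.type === "done") showTranscript(e.data.text);           // hypothetical UI helper
    };
    worker.postMessage({ type: "transcribe", audio: pcm }); // pcm: Float32Array from step 3

    // whisper-worker.ts — runs in a background thread, so the UI never freezes.
    self.onmessage = async (e: MessageEvent) => {
      // runWhisper is a hypothetical wrapper around the WASM/WebGPU inference.
      const text = await runWhisper(e.data.audio, (percent: number) =>
        self.postMessage({ type: "progress", percent })
      );
      self.postMessage({ type: "done", text });
    };
    ```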

    Why This Matters for Privacy

    The entire pipeline described above happens on your device. Your audio file is:

    • Read locally by the browser's File API
    • Processed by WASM code running in the browser sandbox
    • Never serialized or transmitted over the network
    • Discarded when you close the tab

    This is fundamentally different from cloud transcription services where your audio is encrypted in transit but still processed on someone else's servers. See also: why local processing matters for privacy.

    Where Browser AI Is Heading

    WebGPU maturation

    As WebGPU becomes stable in Firefox and Safari, larger Whisper models will be feasible in-browser, closing the accuracy gap with server-side inference.

    Model quantization

    INT4 and INT8 quantized models are 4–8× smaller than their FP32 originals with minimal accuracy loss. A 4-bit quantized Whisper large drops from roughly 3GB in FP16 to well under 1GB, making in-browser deployment of large models practical.
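    The core idea is to store each weight as a small integer plus a shared scale factor. A minimal sketch of symmetric INT8 quantization (per-tensor scaling; production schemes typically quantize per-channel or per-block):

    ```typescript
    // Sketch of symmetric INT8 quantization: each weight becomes one byte
    // plus a shared per-tensor scale, cutting FP32 storage by 4x.
    function quantizeInt8(weights: Float32Array): { q: Int8Array; scale: number } {
      let maxAbs = 0;
      for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
      const scale = maxAbs > 0 ? maxAbs / 127 : 1; // map [-max, max] onto [-127, 127]
      const q = new Int8Array(weights.length);
      for (let i = 0; i < weights.length; i++) q[i] = Math.round(weights[i] / scale);
      return { q, scale };
    }

    // Dequantize on load: the recovered weights are close, not identical,
    // which is why quantization costs a little accuracy.
    function dequantizeInt8(q: Int8Array, scale: number): Float32Array {
      return Float32Array.from(q, (v) => v * scale);
    }
    ```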

    Prompt APIs

    The Chrome Prompt API proposal would let browsers expose local LLMs to web apps, removing even the model download step by using models built into the browser or OS.

    Experience Browser-Based AI Transcription

    TalkToTextly puts the power of Whisper AI right in your browser. No server, no sign-up, no data upload.
