Web LLM lets you run LLMs natively in your frontend using the new WebGPU standard.


I recently discovered a new project called Web LLM, which provides a runtime inside the browser for running LLMs. It’s great because it directly accesses the GPU via the new WebGPU standard, and it’s a huge step forward for privacy and cost savings, among other advantages.

Using Web LLM in your SaaS or enterprise project gives you several benefits:

  1. LLMs no longer run on expensive servers.
  2. Inference is done locally on users’ machines.
  3. It’s private because user data never leaves their browser.
  4. There is no vendor lock-in.

This is because the LLM weights are downloaded directly to the user’s machine, the LLM is run directly inside the browser, and inference happens in the browser itself. There is no need to purchase and maintain a subscription to an LLM API like OpenAI or Claude. There is no need to self-host your own LLM on your own servers, either. Nor does the user need to install any software on their machine, as is the case with LM Studio and Ollama. The user just needs a modern browser that supports WebGPU and a machine powerful enough to run LLMs, and they’re good to go.

Best of all, it’s free open source software (FOSS) with an Apache 2.0 license, so you can use it without needing to pay for a license.

Diagrams: Before and After

Here are a couple of sequence diagrams to illustrate the difference between the current state and the new capability:

Baseline state diagram: Here’s a typical example of current LLM usage:

Here, the browser sends an inference request to the backend, which forwards it to the LLM API. The LLM API returns the response to the backend, which relays it back to the browser.

sequenceDiagram
    Browser->>Backend: Inference Request
    Backend->>LLM API: Inference Request
    LLM API->>Backend: Inference Response
    Backend->>Browser: Inference Response

New capability diagram: Here’s how Web LLM works:

Now, using Web LLM, the browser sends an inference request to… itself! The LLM runs on the user’s own laptop, using the user’s own GPU. There is not even a need for a backend server, let alone an LLM API.

sequenceDiagram
    Browser->>Browser: Inference Request
    Browser->>Browser: Inference Response

Implications

This is a huge step forward for privacy and cost savings. It means that you can run LLMs on your users’ machines, without needing to send data to a server.

Huge cost savings for SaaS companies & unlimited scalability

The thing I’m most excited about is the cost savings for SaaS companies.

As long as the user has a machine that can run LLMs, the cost of running LLMs is now $0 for the company. The cost is now borne by the user (and is effectively $0 for them, too, since our SaaS will now be using the user’s own GPU).

This is a huge improvement from the current state, where we have to pay for the server costs of running LLMs.

This means that we can provide LLM-powered services to many more users without worrying about inference costs. Scaling up is as simple as adding more users.

Privacy

There are also privacy implications. Since the data never leaves the user’s machine, it’s much more private than the current state, where data is sent to a server for processing.

No vendor lock-in

Also, since all the models are open source, there is no vendor lock-in. You can run any model you want, without needing to pay for a license. You can even provide users with the ability to choose which model they want to use, or proactively switch between models based on the user’s needs.
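
As a rough illustration, here is a minimal sketch of what model switching could look like, assuming the same ChatModule API used later in this post; the exact model IDs depend on your web-llm version and its prebuilt model list:

// switch-model.ts (sketch)
import * as webllm from "@mlc-ai/web-llm";

const chat = new webllm.ChatModule();

async function switchModel(modelId: string) {
  // reload() unloads the current model (if any) and loads the requested one
  await chat.reload(modelId);
}

// e.g. pick a model based on the user's choice or the task at hand:
// await switchModel("RedPajama-INCITE-Chat-3B-v1-q4f32_1");
// await switchModel("Llama-2-7b-chat-hf-q4f32_1"); // example ID, may differ per version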

No more need for Ollama to run local LLMs

This also means that we no longer need Ollama to run local LLMs. We can run them natively in the browser, without sending data to a server: the WebGPU standard lets the LLM punch through the browser sandbox and access the GPU directly.
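
If you want to check up front whether the user’s browser can do this at all, you can feature-detect WebGPU. A minimal sketch using the standard navigator.gpu entry point (the cast just avoids pulling in WebGPU type definitions):

// detect-webgpu.ts (sketch)
async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu; // undefined in browsers without WebGPU
  if (!gpu) return false;
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}

hasWebGPU().then((ok) => {
  if (!ok) {
    console.warn("WebGPU not available - fall back to a server-side LLM or show a notice.");
  }
});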

How to install and use Web LLM

You can add Web LLM to your project with a simple npm install:

npm i @mlc-ai/web-llm 

Add this to your HTML (this assumes a bundler such as Vite that compiles index.ts for the browser):

<!-- index.html -->
<div id="progress"></div>
<script type="module" src="index.ts"></script>

And then, in your JavaScript code:

// index.ts
import * as webllm from "@mlc-ai/web-llm";

// execute on page load
document.addEventListener("DOMContentLoaded", async () => {
  // get the progress reporter element
  const progressReporter = document.getElementById("progress")!;

  // create a new chat module. tell it to print the progress of the initialization.
  const chat = new webllm.ChatModule();
  chat.setInitProgressCallback((report) => progressReporter.innerText = report.text);

  // load the RedPajama-INCITE-Chat-3B model. (There are others available, too.)
  // this is several gigabytes of data, so it may take a while to load. 
  // don't do this on a mobile connection...
  try {
    await chat.reload("RedPajama-INCITE-Chat-3B-v1-q4f32_1");
  } catch (e: any) {
    console.error(e);
    progressReporter.innerText = e.message || "An unexpected error occurred.";
    return;
  }

  // finally, generate a response to a question
  const QUESTION = "What is the capital of Canada?";
  await chat.generate(QUESTION, (_step, message) => {
    progressReporter.innerText = `${QUESTION} -> ${message}`;
  });
});

Results

(Animated demo GIF of Web LLM running in the browser.)

Current limitations

Since it’s a new project, Web LLM has a few limitations:

  1. Model weights are several gigabytes per model, so the first load is slow and unfriendly to metered connections.
  2. The user’s machine needs to be powerful enough, with a capable GPU, to run LLMs locally.
  3. Browser support is still limited: as of February 2024, it works only on the latest desktop browsers, and not on Firefox.

FAQs

Isn’t this useless for RAG?

This is a semi-valid concern: a client-side library can’t do inference close to the database. However, you can still run a RAG-like architecture on the client, for example by using a local SQLite database, or by surfacing search results from a server-side database and then running the LLM on the client. This gives you the best of both worlds: free LLM inference on the client side, and a full RAG pipeline with embeddings on the server side.
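
Here is a sketch of that hybrid approach, assuming a hypothetical /api/search endpoint that returns relevant passages as a JSON array of strings, and the same ChatModule set up as in the install example above:

// hybrid-rag.ts (sketch)
import * as webllm from "@mlc-ai/web-llm";

async function answerWithContext(chat: webllm.ChatModule, question: string): Promise<string> {
  // 1. retrieval stays on the server: embeddings + vector search behind a hypothetical endpoint
  const res = await fetch(`/api/search?q=${encodeURIComponent(question)}`);
  const passages: string[] = await res.json();

  // 2. generation happens locally, for free, on the user's GPU
  const prompt =
    "Answer the question using only the context below.\n\n" +
    `Context:\n${passages.join("\n---\n")}\n\n` +
    `Question: ${question}`;
  return chat.generate(prompt);
}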

Also, RAG is not the only capability or architecture that LLMs enable. For example, think about the situation where you want a UI that is commanded by a chatbot assistant. “Hey Gmail, compose an email about _____ for me.” Or, another example, “Hey Figma, create a button that’s blue and has a border radius of 5px.” These things can be free for the company if they run on the client side.
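
As a sketch of that pattern (the command schema and the applyCommand helper are hypothetical; the idea is just to have the local model emit structured output that the app then executes):

// ui-command.ts (sketch)
import * as webllm from "@mlc-ai/web-llm";

// hypothetical command shape for a "create a button" instruction
type UICommand = { action: "create_button"; label: string; color: string; borderRadiusPx: number };

function applyCommand(cmd: UICommand) {
  // placeholder: a real app would create or update the UI element here
  console.log("Applying UI command:", cmd);
}

async function runUICommand(chat: webllm.ChatModule, instruction: string) {
  const prompt =
    "Translate the instruction into JSON with keys action, label, color, borderRadiusPx. " +
    "Respond with JSON only.\n" +
    `Instruction: ${instruction}`;
  const reply = await chat.generate(prompt);
  try {
    applyCommand(JSON.parse(reply) as UICommand);
  } catch {
    console.warn("Model did not return valid JSON:", reply);
  }
}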

How is this different from Ollama?

Ollama is a great project that exposes an LLM via an OpenAI-compatible API. It can be deployed either on the server side or locally on your laptop. If deployed on the server side, the company has to bear the cost of a powerful server instance, which can run hundreds to thousands of dollars per month. If deployed on the user’s laptop, Ollama requires a local installation, which is a hard “no” for most SaaS companies.

Web LLM is different because it runs LLMs entirely in the browser via the new WebGPU standard. It downloads the model weights directly from Hugging Face, and caches them in browser application storage. This means you get the best of both worlds: any average user with a powerful enough laptop will be able to run LLMs without installing any additional software; and, the company doesn’t need to bear the cost of running LLMs on a server, either.

Does the user need to install anything?

No. The user just needs a modern browser that supports WebGPU and a machine powerful enough to run LLMs. Web LLM is a JavaScript library that can be added to any project with a simple npm install, and the weights are downloaded directly from Hugging Face onto the user’s machine on first use. There is no need for the user to install any software, and the user experience is not affected at all. Traditional SaaS companies will love this, as will enterprise companies.
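
Since those weights land in browser storage, one practical precaution is to check the storage quota before kicking off a multi-gigabyte download. A minimal sketch using the standard StorageManager.estimate() API (the 4 GB threshold is just an illustrative number):

// storage-check.ts (sketch)
async function enoughRoomForWeights(requiredBytes = 4 * 1024 ** 3): Promise<boolean> {
  if (!navigator.storage?.estimate) return true; // API unavailable: proceed optimistically
  const { quota = 0, usage = 0 } = await navigator.storage.estimate();
  return quota - usage >= requiredBytes;
}

enoughRoomForWeights().then((ok) => {
  if (!ok) {
    console.warn("The browser may not have enough storage for the model weights.");
  }
});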

Is there a live demo?

Yes! You can try out the live demo at https://webllm.mlc.ai. Note that right now (Feb 2024), it only supports the latest desktop browsers, excluding Firefox (Chrome definitely works).