Cloud-hosted large language models (LLMs) are incredibly powerful, but relying on them can be expensive and raises privacy concerns. This is where Ollama + Llama 3 shine: they let you run models locally on your own hardware, without sending data to external servers. This tutorial shows you how to build an Ollama Llama 3 REST API using Node.js, TypeScript, and Express. By following along, you’ll also learn how to stream AI responses in real time.

The streamdots repo is a simple Node.js + TypeScript + Express wrapper around a Llama 3 model served by Ollama; the fully working repo is linked in the conclusion below. Its goal is to make it easy for developers (especially web engineers) to spin up a local REST API for LLM completions, streaming responses, and prompt chaining.

In other words, it solves this problem:

  • Run LLM inference locally, reducing latency, cost, and data privacy risk
  • Expose a developer-friendly API over HTTP, so you can integrate the LLM into your own apps
  • Use modern TypeScript + Express, making it maintainable, testable, and scalable

Why Use Ollama + Llama 3 + Node + TypeScript + Express?

Before jumping into code, let’s talk about the stack and motivations.

What Is Ollama?

  • Ollama is a runtime for local LLMs. It lets you pull models (e.g. Llama 3) and run them locally via a built-in HTTP API.
  • It’s designed for open-source, on-device models, giving you privacy, lower cost, and control.
  • By default, you can run it via the CLI (ollama run llama3) or as a persistent service (ollama serve) so you can call it from your own applications, as in the quick sketch below.
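
For example, once the service is running you can call its HTTP API directly. Here is a minimal sketch, assuming Ollama on its default port 11434, the llama3 model already pulled, and Node 18+ so the global fetch is available:

// ollama-direct.ts: call Ollama's built-in HTTP API without any wrapper
async function generateOnce(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // stream: false asks for a single JSON object instead of newline-delimited chunks
    body: JSON.stringify({ model: "llama3", prompt, stream: false }),
  });
  if (!res.ok) {
    throw new Error(`Ollama returned ${res.status}`);
  }
  const data = (await res.json()) as { response: string };
  return data.response;
}

generateOnce("Say hello in five words").then(console.log).catch(console.error);
TypeScript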

Why Llama 3?

  • Llama 3 (from Meta) is a modern, powerful open language model, suited for chat, instruct, or code-style generation.
  • It’s compatible with Ollama’s local inference. (You pull the model via Ollama, then serve it.)

Why Node.js + TypeScript + Express?

  • Node.js is great for building lightweight, asynchronous HTTP servers.
  • TypeScript adds type-safety, making it easier to maintain and refactor.
  • Express is a minimalist, flexible web framework, perfect for exposing LLM endpoints without too much boilerplate.

Ollama Llama 3 REST API Tutorial: How to Build the Local LLM API

Here’s a step-by-step guide on how to build (or understand) what the streamdots repo does. (If you already have code, I’ll walk through the main parts.)

Prerequisites

  1. Install Ollama from the official site and start it.
  2. Pull the Llama 3 model: ollama pull llama3 (You might need a variant, e.g. llama3:instruct, depending on the model card.)
  3. Start the Ollama server: ollama serve (the desktop install usually runs this as a background service already). It hosts a REST API on the default port, http://localhost:11434, and loads models on demand as requests come in.
  4. Make sure you have Node.js installed (v18 or later is recommended, so the global fetch used in some snippets below is available) along with npm or yarn. A quick check that Ollama is reachable is sketched right after this list.
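
Before writing any server code, it is worth confirming that Ollama is actually reachable. A quick sanity check (a sketch; the root URL of a running Ollama server responds with a short status message):

// check-ollama.ts: confirm the local Ollama server is up before going further
fetch("http://localhost:11434/")
  .then((res) => res.text())
  .then((body) => console.log(body)) // typically prints "Ollama is running"
  .catch(() => console.error("Could not reach Ollama on port 11434. Is it running?"));
TypeScript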

Step 1: Initialize the Node / TypeScript Project

mkdir streamdots
cd streamdots
npm init -y
npm install express typescript ts-node @types/node @types/express
Bash

Then, set up tsconfig.json (simplest version):

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "strict": true,
    "outDir": "./dist",
    "esModuleInterop": true
  },
  "include": ["src"]
}
JSON

Create a folder src/ and then create src/index.ts.


Step 2: Create a Basic Express Server

In src/index.ts:

import express, { Request, Response } from 'express';

const app = express();
const port = process.env.PORT || 3000;

app.use(express.json());

app.post('/generate', async (req: Request, res: Response) => {
  const { prompt } = req.body;
  if (!prompt) {
    return res.status(400).json({ error: 'Missing prompt' });
  }

  // We'll call Ollama here (next step)
  res.json({ result: `You said: ${prompt}` });
});

app.listen(port, () => {
  console.log(`LLM API listening at http://localhost:${port}`);
});
TypeScript

Run it (in dev) with:

npx ts-node src/index.ts
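
With the server running, you can exercise the endpoint with a quick script (a sketch using Node 18+'s global fetch; curl or Postman work just as well):

// test-generate.ts: POST a prompt to the basic /generate route
async function main() {
  const res = await fetch("http://localhost:3000/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: "Hello there" }),
  });
  console.log(await res.json()); // { result: 'You said: Hello there' }
}

main().catch(console.error);
TypeScript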

Step 3: Connect to Ollama’s API

To call Ollama from your Node server, you can use fetch against its HTTP API directly (as in the earlier sketch) or use a client library. This tutorial uses ollama-js-client, a JavaScript client for Ollama.

Install it:

npm install ollama-js-client
Bash

Then modify your /generate route to call the local Llama 3 server:

import Ollama from "ollama-js-client";

const ollama = new Ollama({
  model: "llama3",
  url: "http://127.0.0.1:11434/api/",  // or correct base URL
});

app.post("/generate", async (req: Request, res: Response) => {
  const { prompt } = req.body;
  if (!prompt) {
    return res.status(400).json({ error: "Missing prompt" });
  }

  try {
    const response = await ollama.prompt(prompt);
    res.json({ result: response });
  } catch (err) {
    console.error("Error calling Ollama:", err);
    res.status(500).json({ error: "LLM error" });
  }
});
TypeScript

This uses the non-streaming version. If you want streaming (chunked responses), you can also do:

app.post("/generate-stream", (req: Request, res: Response) => {
  const { prompt } = req.body;
  const chunks: string[] = [];

  ollama.prompt_stream(prompt, (error, chunk) => {
    if (error) {
      console.error("Stream error:", error);
      res.write(`{"error":"${error.message}"}`);
      res.end();
      return;
    }
    if (chunk.done) {
      // end of stream
      const full = chunks.join("");
      res.write(JSON.stringify({ result: full }));
      res.end();
    } else {
      // a chunk of content
      chunks.push(chunk.content);
      // optionally flush to client
      res.write(chunk.content);
    }
  });
});
TypeScript
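
On the client side, the response arrives as raw text chunks. A minimal consumer sketch (assuming Node 18+ and the server above on port 3000):

// stream-client.ts: read the chunked text from /generate-stream as it arrives
async function streamPrompt(prompt: string): Promise<void> {
  const res = await fetch("http://localhost:3000/generate-stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok || !res.body) {
    throw new Error(`Request failed with status ${res.status}`);
  }

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Each value holds one or more chunks written by res.write() on the server
    process.stdout.write(decoder.decode(value, { stream: true }));
  }
  process.stdout.write("\n");
}

streamPrompt("Explain streams in one paragraph").catch(console.error);
TypeScript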

Step 4: Error Handling, Configuration, and Environment Variables

It’s good practice to make your code more robust and configurable.

  • Use .env (with the dotenv package: npm install dotenv) to configure OLLAMA_URL, MODEL_NAME, TEMPERATURE, etc.
  • Wrap the Ollama client creation so you can swap the model or host easily (a small factory sketch appears at the end of this step).
  • Add error handling for common scenarios (e.g. Ollama is not running, or the model is not found).

// config.ts
import dotenv from "dotenv";
dotenv.config();

export const OLLAMA_URL = process.env.OLLAMA_URL || "http://127.0.0.1:11434/api/";
export const MODEL_NAME = process.env.MODEL_NAME || "llama3";
export const TEMPERATURE = parseFloat(process.env.TEMPERATURE || "1.0");
TypeScript

Then use these in your server code when initializing the client.
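
For example, a small factory module (the file and function names here are illustrative, not part of the streamdots repo) keeps that initialization in one place:

// ollamaClient.ts: build the Ollama client from config so the model/host can be swapped easily
import Ollama from "ollama-js-client";
import { OLLAMA_URL, MODEL_NAME, TEMPERATURE } from "./config";

export function createOllamaClient(model: string = MODEL_NAME) {
  return new Ollama({
    model,
    url: OLLAMA_URL,
    options: {
      temperature: TEMPERATURE,
    },
  });
}
TypeScript

With this in place, routes only ever call createOllamaClient(), so pointing the server at a different model or a remote Ollama host becomes a one-line .env change.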


Step 5: Putting It All Together – Final index.ts

import express, { Request, Response } from "express";
import Ollama from "ollama-js-client";
import { OLLAMA_URL, MODEL_NAME, TEMPERATURE } from "./config";

const app = express();
const port = process.env.PORT || 3000;

app.use(express.json());

const ollama = new Ollama({
  model: MODEL_NAME,
  url: OLLAMA_URL,
  options: {
    temperature: TEMPERATURE,
  },
});

app.post("/generate", async (req: Request, res: Response) => {
  const { prompt } = req.body;
  if (!prompt) {
    return res.status(400).json({ error: "Missing prompt" });
  }

  try {
    const response = await ollama.prompt(prompt);
    res.json({ result: response });
  } catch (err: any) {
    console.error("Error calling Ollama:", err);
    res.status(500).json({ error: err.message ?? "LLM error" });
  }
});

app.post("/generate-stream", (req: Request, res: Response) => {
  const { prompt } = req.body;
  const chunks: string[] = [];

  ollama.prompt_stream(prompt, (error, chunk) => {
    if (error) {
      console.error("Stream error:", error);
      res.write(JSON.stringify({ error: error.message }));
      res.end();
      return;
    }
    if (chunk.done) {
      const full = chunks.join("");
      res.write(JSON.stringify({ result: full }));
      res.end();
    } else {
      chunks.push(chunk.content);
      // optionally flush as you go
      res.write(chunk.content);
    }
  });
});

app.listen(port, () => {
  console.log(`🎯 LLM API listening at http://localhost:${port}`);
});
TypeScript
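
If you want the "Ollama is not running" and "model not found" cases from Step 4 to surface at startup rather than on the first request, one option is to probe Ollama's /api/tags endpoint before listening. A sketch (assertOllamaReady is a hypothetical helper, not part of streamdots) that would replace the bare app.listen call above:

// Verify Ollama and the configured model before accepting traffic (hypothetical helper)
async function assertOllamaReady(): Promise<void> {
  const base = OLLAMA_URL.replace(/\/api\/?$/, ""); // strip the trailing /api/ from the base URL
  const res = await fetch(`${base}/api/tags`).catch(() => {
    throw new Error(`Cannot reach Ollama at ${base}. Is "ollama serve" running?`);
  });
  const data = (await res.json()) as { models: { name: string }[] };
  const found = data.models.some(
    (m) => m.name === MODEL_NAME || m.name.startsWith(`${MODEL_NAME}:`)
  );
  if (!found) {
    throw new Error(`Model "${MODEL_NAME}" not found. Run: ollama pull ${MODEL_NAME}`);
  }
}

// Use this in place of the plain app.listen(...) call
assertOllamaReady()
  .then(() => {
    app.listen(port, () => {
      console.log(`🎯 LLM API listening at http://localhost:${port}`);
    });
  })
  .catch((err) => {
    console.error(err.message);
    process.exit(1);
  });
TypeScript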

Further Enhancements (Beyond streamdots)

Once you have the basic API working, you can enhance it in several ways:

  1. RAG (Retrieval-Augmented Generation)
    • Integrate embeddings + a vector database (e.g. Pinecone, Qdrant)
    • On every prompt, fetch relevant context and prepend it to the prompt to give the model more knowledge (a tiny sketch of this prompt assembly follows this list)
  2. Rate-limiting / Guardrails
    • Use safety models or prompt filters (e.g., Llama Guard) via Ollama or Llama Stack
    • Implement usage monitoring and rate limits so a single client can’t monopolize your compute
  3. Multi-model Support
    • Configure your server to support more than one model (e.g. llama3:instruct, or smaller / larger variants)
    • Expose an API endpoint that lists the available models
  4. Frontend / UI
    • Build a simple React / Next.js UI that calls your /generate or /generate-stream endpoints
    • Use WebSockets or Server-Sent Events to stream responses to the UI in real time
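
For the RAG idea in particular, the prompt assembly can be as simple as the sketch below (retrieveContext is a placeholder for whatever embedding + vector-store lookup you wire in; it is not part of streamdots):

// ragPrompt.ts: prepend retrieved passages to the user's question (hypothetical helper)
export async function buildRagPrompt(
  userPrompt: string,
  retrieveContext: (query: string) => Promise<string[]>
): Promise<string> {
  const passages = await retrieveContext(userPrompt);
  return [
    "Answer using the context below. If the context is not relevant, say so.",
    "",
    "Context:",
    ...passages.map((p, i) => `[${i + 1}] ${p}`),
    "",
    `Question: ${userPrompt}`,
  ].join("\n");
}
TypeScript

The resulting string is then passed to ollama.prompt() exactly like a plain prompt.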

Why is this useful for you?

  • You don’t need to know anything about container orchestration or deep ML — you just run Ollama and call HTTP.
  • Using TypeScript + Express makes the code accessible to most web developers.
  • You’re keeping your data local, which is a huge win for privacy.
  • This architecture is extendable: once you have the core, you can add context, tools, or more advanced AI features.

Conclusion

Building an Ollama Llama 3 REST API with Node.js, TypeScript, and Express is more than just a coding exercise: it’s a gateway into creating responsive, real-time AI applications that you fully control and self-host. You can download the fully working repo from Github. By following this guide, you’ve learned how to set up Ollama, integrate Llama 3 locally, and stream outputs chunk by chunk over a simple REST API, which gives you everything you need to connect a frontend for a seamless user experience.

This architecture empowers developers to experiment with AI chatbots, research assistants, and creative tools without relying on expensive cloud services, while keeping data private and costs manageable. From here, you can extend your project with richer UIs, metrics logging, authentication, and deployment strategies to scale your app into production.

Keep exploring, keep iterating, and check out related posts on AI Agents Explained, Streaming LLM Responses in Real Time, and Building Chatbots with OpenAI to continue your journey into modern AI development.

If you enjoyed this tutorial, you might also like these posts from mydaytodo.com/blog:

  • How to run Llama 3 locally with Ollama: step-by-step guide
  • Using ollama-js-client in Node.js to talk to the Ollama API
  • Implementing AI guardrails / Llama Stack with Node.js + Ollama
  • A general intro to working with LLMs via Ollama in local workflows

