Cloud-hosted large language models (LLMs) are incredibly powerful, but relying on them can be expensive and raises privacy concerns. This is where Ollama + Llama 3 shine: they let you run models locally on your own hardware, without sending data to external servers. This tutorial shows you how to build an Ollama Llama 3 REST API using Node.js, TypeScript, and Express. By following along, you’ll learn how to stream AI responses in real time.
The streamdots repo is a simple Node.js + TypeScript + Express wrapper around a Llama 3 model served by Ollama. Read on to learn how it works and to find the link to the fully working repo. Its goal is to make it easy for developers (especially web engineers) to spin up a local REST API for LLM completions, streaming responses, and prompt chaining.
In other words, it lets you:
- Run LLM inference locally, reducing latency, cost, and data privacy risk
- Expose a developer-friendly API over HTTP, so you can integrate the LLM into your own apps
- Use modern TypeScript + Express, making it maintainable, testable, and scalable
Why Use Ollama + Llama 3 + Node + TypeScript + Express?
Before jumping into code, let’s talk about the stack and motivations.
What Is Ollama?
- Ollama is a runtime for local LLMs. It allows you to pull models (e.g. Llama 3) and run them locally via a built-in HTTP API.
- It’s designed for open-source, on-device models, giving you privacy, lower cost, and control.
- By default, you run it via the CLI (ollama run llama3), or you can run it as a persistent service (ollama serve) so you can call it from your own applications.
Why Llama 3?
- Llama 3 (from Meta) is a modern, powerful open language model, suited for chat, instruct, or code-style generation.
- It’s compatible with Ollama’s local inference. (You pull the model via Ollama, then serve it.)
Why Node.js + TypeScript + Express?
- Node.js is great for building lightweight, asynchronous HTTP servers.
- TypeScript adds type-safety, making it easier to maintain and refactor.
- Express is a minimalist, flexible web framework, perfect for exposing LLM endpoints without too much boilerplate.
Ollama Llama 3 REST API Tutorial: How to Build the Local LLM API
Here’s a step-by-step guide to building what the streamdots repo does, or to understanding it if you already have the code. I’ll walk through the main parts.
Prerequisites
- Install Ollama from the official site and start it.
- Pull the Llama 3 model: ollama pull llama3 (you may need a variant, e.g. llama3:instruct, depending on the model card).
- Start the Ollama server: ollama serve (the server takes no model name; you specify the model with each request). This hosts a REST API on the default port, http://localhost:11434.
- Make sure you have Node.js (v18+ is recommended so the built-in fetch API is available) and npm or yarn installed.
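Before moving on, it helps to confirm that Ollama is reachable and that the model responds. Here is a minimal sanity-check sketch (not part of the streamdots repo) that calls Ollama’s /api/generate endpoint with Node 18+’s built-in fetch; the file name is just a suggestion, and the exact response fields can vary slightly between Ollama versions:
// check-ollama.ts - a quick sanity check; run with npx ts-node check-ollama.ts once Step 1 below is done
async function checkOllama(): Promise<void> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      prompt: "Say hello in one short sentence.",
      stream: false, // ask for a single JSON response instead of a chunk stream
    }),
  });
  if (!res.ok) {
    throw new Error(`Ollama responded with HTTP ${res.status}`);
  }
  const data = (await res.json()) as { response?: string };
  console.log("Model reply:", data.response); // "response" holds the generated text
}
checkOllama().catch((err) => {
  console.error("Could not reach Ollama. Is `ollama serve` running?", err);
});
If this prints a short greeting, the local LLM is up and you’re ready to build the API around it.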
Step 1: Initialize the Node / TypeScript Project
mkdir streamdots
cd streamdots
npm init -y
npm install express typescript ts-node @types/node @types/express
Then, set up tsconfig.json (simplest version):
{
"compilerOptions": {
"target": "ES2020",
"module": "commonjs",
"strict": true,
"outDir": "./dist",
"esModuleInterop": true
},
"include": ["src"]
}
Create a folder src/ and then create src/index.ts.
Step 2: Create a Basic Express Server
In src/index.ts:
import express, { Request, Response } from 'express';
const app = express();
const port = process.env.PORT || 3000;
app.use(express.json());
app.post('/generate', async (req: Request, res: Response) => {
const { prompt } = req.body;
if (!prompt) {
return res.status(400).json({ error: 'Missing prompt' });
}
// We'll call Ollama here (next step)
res.json({ result: `You said: ${prompt}` });
});
app.listen(port, () => {
console.log(`LLM API listening at http://localhost:${port}`);
});
Run it (in dev) with:
npx ts-node src/index.ts
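To check that the server responds as expected, you can call the endpoint from a tiny client script in a second terminal while the server is running. This is a throwaway sketch; the file name is arbitrary and it assumes Node 18+ for the built-in fetch:
// test-generate.ts - quick client for the /generate endpoint
async function main(): Promise<void> {
  const res = await fetch("http://localhost:3000/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: "Hello there" }),
  });
  // Expected for now: 200 { result: 'You said: Hello there' }
  console.log(res.status, await res.json());
}
main().catch(console.error);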
Step 3: Connect to Ollama’s API
To call Ollama from your Node server, you can use fetch or a library. There is a JavaScript client for Ollama: ollama-js-client.
Install it:
npm install ollama-js-client
Then modify your /generate route to call the local Llama 3 server:
import Ollama from "ollama-js-client";
const ollama = new Ollama({
model: "llama3",
url: "http://127.0.0.1:11434/api/", // or correct base URL
});
app.post("/generate", async (req: Request, res: Response) => {
const { prompt } = req.body;
if (!prompt) {
return res.status(400).json({ error: "Missing prompt" });
}
try {
const response = await ollama.prompt(prompt);
res.json({ result: response });
} catch (err) {
console.error("Error calling Ollama:", err);
res.status(500).json({ error: "LLM error" });
}
});
This uses the non-streaming version. If you want streaming (chunked responses), you can also do:
app.post("/generate-stream", (req: Request, res: Response) => {
const { prompt } = req.body;
const chunks: string[] = [];
ollama.prompt_stream(prompt, (error, chunk) => {
if (error) {
console.error("Stream error:", error);
res.write(JSON.stringify({ error: error.message }));
res.end();
return;
}
if (chunk.done) {
// end of stream
const full = chunks.join("");
res.write(JSON.stringify({ result: full }));
res.end();
} else {
// a chunk of content
chunks.push(chunk.content);
// optionally flush to client
res.write(chunk.content);
}
});
});
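On the client side, /generate-stream can be consumed by reading the response body as it arrives. The sketch below uses Node 18+’s fetch and a stream reader, and it assumes the format produced by the handler above (raw text chunks followed by a final JSON summary); the file name is arbitrary:
// consume-stream.ts - print chunks from /generate-stream as they arrive
async function streamPrompt(prompt: string): Promise<void> {
  const res = await fetch("http://localhost:3000/generate-stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.body) throw new Error("No response body to stream");
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break; // the server called res.end()
    // Print each chunk as it lands; the final JSON summary the server writes will also appear here.
    process.stdout.write(decoder.decode(value, { stream: true }));
  }
  process.stdout.write("\n");
}
streamPrompt("Tell me a short joke").catch(console.error);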
Step 4: Error Handling, Configuration, and Environment Variables
It’s good practice to make your code more robust and configurable.
- Use .env (with the dotenv package; install it with npm install dotenv) to configure OLLAMA_URL, MODEL_NAME, TEMPERATURE, etc.
- Wrap the Ollama client creation so you can swap the model / host easily.
- Add error handling for common scenarios (e.g. Ollama is not running, or the model is not found); see the availability-check sketch after the config example below.
// config.ts
import dotenv from "dotenv";
dotenv.config();
export const OLLAMA_URL = process.env.OLLAMA_URL || "http://127.0.0.1:11434/api/";
export const MODEL_NAME = process.env.MODEL_NAME || "llama3";
export const TEMPERATURE = parseFloat(process.env.TEMPERATURE || "1.0");
Then use these in your server code when initializing the client.
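For the “Ollama is not running” and “model not found” cases, one option is a small startup check that pings Ollama’s /api/tags endpoint (which lists the models pulled locally) and fails fast with a readable message. This helper is a sketch of my own, not part of the streamdots repo, and it assumes OLLAMA_URL keeps the trailing /api/ path from config.ts:
// health.ts - fail fast if Ollama isn't reachable or the model hasn't been pulled
import { OLLAMA_URL, MODEL_NAME } from "./config";
export async function assertOllamaAvailable(): Promise<void> {
  // /api/tags lists the models Ollama has pulled locally
  const tagsUrl = new URL("tags", OLLAMA_URL).toString();
  const res = await fetch(tagsUrl).catch(() => {
    throw new Error(`Cannot reach Ollama at ${OLLAMA_URL}. Is "ollama serve" running?`);
  });
  const data = (await res.json()) as { models?: { name: string }[] };
  // Model tags usually look like "llama3:latest", so a prefix match is good enough here
  if (!data.models?.some((m) => m.name.startsWith(MODEL_NAME))) {
    throw new Error(`Model "${MODEL_NAME}" not found. Try: ollama pull ${MODEL_NAME}`);
  }
}
// Usage (e.g. at the top of index.ts, before app.listen):
// assertOllamaAvailable().catch((err) => { console.error(err.message); process.exit(1); });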
Step 5: Putting It All Together – Final index.ts
import express, { Request, Response } from "express";
import Ollama from "ollama-js-client";
import { OLLAMA_URL, MODEL_NAME, TEMPERATURE } from "./config";
const app = express();
const port = process.env.PORT || 3000;
app.use(express.json());
const ollama = new Ollama({
model: MODEL_NAME,
url: OLLAMA_URL,
options: {
temperature: TEMPERATURE,
},
});
app.post("/generate", async (req: Request, res: Response) => {
const { prompt } = req.body;
if (!prompt) {
return res.status(400).json({ error: "Missing prompt" });
}
try {
const response = await ollama.prompt(prompt);
res.json({ result: response });
} catch (err: any) {
console.error("Error calling Ollama:", err);
res.status(500).json({ error: err.message ?? "LLM error" });
}
});
app.post("/generate-stream", (req: Request, res: Response) => {
const { prompt } = req.body;
const chunks: string[] = [];
ollama.prompt_stream(prompt, (error, chunk) => {
if (error) {
console.error("Stream error:", error);
res.write(JSON.stringify({ error: error.message }));
res.end();
return;
}
if (chunk.done) {
const full = chunks.join("");
res.write(JSON.stringify({ result: full }));
res.end();
} else {
chunks.push(chunk.content);
// optionally flush as you go
res.write(chunk.content);
}
});
});
app.listen(port, () => {
console.log(`🎯 LLM API listening at http://localhost:${port}`);
});
Further Enhancements (Beyond streamdots)
Once you have the basic API working, you can enhance it in several ways:
- RAG (Retrieval-Augmented Generation)
  - Integrate embeddings + a vector database (e.g. Pinecone, Qdrant)
  - On every prompt, fetch relevant context and prepend it to the prompt to give the model more knowledge
- Rate-limiting / Guardrails
  - Use safety models or prompt filters (e.g. Llama Guard) via Ollama or Llama Stack
  - Implement usage monitoring so that not every request burns expensive compute
- Multi-model Support
  - Configure your server to support more than one model (e.g. llama3:instruct, or smaller / larger variants)
  - Expose an API endpoint to list the available models (see the sketch after this list)
- Frontend / UI
  - Build a simple React / Next.js UI that calls your /generate or /generate-stream endpoints
  - Use WebSockets to stream responses to the UI in real time
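For the multi-model idea above, a models-listing endpoint can simply proxy Ollama’s /api/tags. Here’s a minimal sketch that reuses the Express app and OLLAMA_URL from earlier; it’s an illustration, not something the streamdots repo ships with:
// GET /models - list the models Ollama has available locally
app.get("/models", async (_req: Request, res: Response) => {
  try {
    const response = await fetch(new URL("tags", OLLAMA_URL).toString());
    const data = (await response.json()) as { models?: { name: string }[] };
    // Return just the model names, e.g. ["llama3:latest", "llama3:instruct"]
    res.json({ models: (data.models ?? []).map((m) => m.name) });
  } catch {
    res.status(502).json({ error: "Could not reach Ollama to list models" });
  }
});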
Why is this useful for you?
- You don’t need to know anything about container orchestration or deep ML — you just run Ollama and call HTTP.
- Using TypeScript + Express makes the code accessible to most web developers.
- You’re keeping your data local, which is a huge win for privacy.
- This architecture is extendable: once you have the core, you can add context, tools, or more advanced AI features.
Conclusion
Building an Ollama Llama 3 REST API with Node.js, TypeScript, and Express is more than just a coding exercise: it’s a gateway into creating responsive, real-time AI applications that you can fully control and self-host. You can download the fully working repo from GitHub. By following this guide, you’ve learned how to set up Ollama, integrate Llama 3 locally, stream outputs to the client chunk by chunk, and structure an API that a frontend can plug into for a seamless user experience.
This architecture empowers developers to experiment with AI chatbots, research assistants, and creative tools without relying on expensive cloud services, while keeping data private and costs manageable. From here, you can extend your project with richer UIs, metrics logging, authentication, and deployment strategies to scale your app into production.
Keep exploring, keep iterating, and check out related posts on AI Agents Explained, Streaming LLM Responses in Real Time, and Building Chatbots with OpenAI to continue your journey into modern AI development.
If you enjoyed this tutorial, you might also like these posts from mydaytodo.com/blog, along with a few related resources:
- How to Build a Neural Network with Brain.js
- Understanding Genetic Algorithms in Machine Learning
- Building an AI Model Using TensorFlow.js — Step-by-Step Guide
- Bots Trained to Play Like Humans Are More Fun — AI and Gaming
- How to run Llama 3 locally with Ollama: step-by-step guide (Code Unboxing)
- Using ollama-js-client in Node.js to talk to the Ollama API (GitHub)
- Implementing AI guardrails with Llama Stack, Node.js, and Ollama (Red Hat Developer)
- A general intro to working with LLMs via Ollama in local workflows (Medium)