ScyllaDB Docs Vector Search Retrieval-Augmented Generation (RAG) Get started with RAG

Get started with RAG¶

This tutorial shows you how to run a Retrieval-Augmented Generation (RAG) chatbot powered by ScyllaDB Vector Search and Groq.

What you’ll build¶

A movie chatbot that retrieves semantically relevant movie plots from ScyllaDB using vector similarity search and uses an LLM to generate responses grounded in that context. The retrieval step uses ScyllaDB’s local vector indexes to scope ANN queries to a specific genre partition, returning only relevant context to the LLM.

How filtered retrieval works¶

Without filtering, an ANN query scans all stored embeddings and returns the most similar results regardless of metadata. For RAG, this means the LLM may receive irrelevant context. For example, a query about 1980s horror movies could surface results from unrelated genres or eras.

ScyllaDB solves this with local vector indexes: a per-partition HNSW index that restricts the search space to rows matching a partition key before running ANN. The result is fast, scoped retrieval that gives the LLM only the context it needs.

	Global index	Local index
Scope	All rows in the table	Rows within one partition
WHERE clause	Requires `ALLOW FILTERING`	Uses the partition key — no `ALLOW FILTERING`
Performance	Slower — scans the entire index	Fast — searches only the target partition’s HNSW index
RAG use case	Full-corpus search	Per-genre, per-user, or per-category scoped retrieval

Prerequisites¶

You’ve read the Quick Start Guide to Vector Search
ScyllaDB Cloud cluster with vector search enabled
Python 3.11 or newer installed
Groq API key (free tier is sufficient)
Git installed

Clone the repository¶

Clone the repository and navigate to the project folder:

git clone https://github.com/scylladb/vector-search-examples.git
cd vector-search-examples/rag-movie-chatbot

Install dependencies¶

Install and sync dependencies with uv:

uv sync

This creates a virtual environment and installs:

scylla-driver — connects to ScyllaDB with DC-aware load balancing
sentence-transformers — generates 384-dimensional embeddings from text
groq — fast LLM inference for the generation step
streamlit — interactive web UI

If you don’t have uv installed, follow the uv installation guide.

Configure credentials¶

Copy the example environment file and fill in your credentials:

cp .env.example .env

Open .env and fill in your ScyllaDB Cloud connection details and Groq API key:

SCYLLADB_HOST=node-0.aws-us-east-1.xxxxxxxx.clusters.scylla.cloud
SCYLLADB_PORT=9042
SCYLLADB_USERNAME=scylla
SCYLLADB_PASSWORD=your-password
SCYLLADB_DATACENTER=AWS_US_EAST_1
SCYLLADB_KEYSPACE=recommend
GROQ_API_KEY=your-groq-api-key

Find your ScyllaDB Cloud credentials in the ScyllaDB Cloud console under your cluster’s Connect tab.

Set up the database¶

Create the schema¶

Run the migration script to create the keyspace, table, and vector index:

python db/migrate.py

You should see:

Creating keyspace and tables...
Migration completed.

Load sample data¶

With the schema in place, load the sample movie dataset. The data is shared with the movie recommendation example — run its loader from the repository root:

cd ../movie-recommendation
cp ../rag-movie-chatbot/.env .env
docker build -t movies-app .
docker run -d --rm -p 8000:8000 --network host --env-file .env --name movie-container movies-app
docker exec movie-container python src/load_data.py

This ingests approximately 30,000 movies from the TMDB dataset, generating 384-dimensional embeddings for each plot using the all-MiniLM-L6-v2 model.

Once the data is loaded, stop the container and return to the chatbot folder:

docker stop movie-container
cd ../rag-movie-chatbot

Run the application¶

Start the Streamlit app:

uv run streamlit run app.py

Then open your browser to http://localhost:8501.

You can now enter movie plots or descriptions, and the chatbot will:

Convert your input to a vector embedding
Run an ANN query against ScyllaDB to retrieve semantically similar movie plots
Pass the retrieved plots as context to the Groq LLM
Stream back a response grounded in actual movies from the database

Check the source code of the application for more details: https://github.com/scylladb/vector-search-examples/tree/main/rag-movie-chatbot

Understanding the database schema¶

The migration script creates the following schema in your ScyllaDB cluster:

CREATE KEYSPACE recommend 
WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'};

CREATE TABLE recommend.movies (
    id           INT,
    release_date TIMESTAMP,
    title        TEXT,
    tagline      TEXT,
    genre        TEXT,
    imdb_id      TEXT,
    poster_url   TEXT,
    plot         TEXT,
    plot_embedding VECTOR<FLOAT, 384>,
    PRIMARY KEY (id)
);

CREATE INDEX IF NOT EXISTS ann_index 
  ON recommend.movies(plot_embedding) 
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Keyspace:

NetworkTopologyStrategy with replication_factor: 3 replicates data across three nodes for high availability.

Table:

plot_embedding VECTOR<FLOAT, 384> stores a 384-dimensional float vector for each movie plot, generated using all-MiniLM-L6-v2 from Sentence Transformers.
PRIMARY KEY (id) distributes rows evenly across the cluster.

Vector index:

CREATE INDEX IF NOT EXISTS ann_index 
  ON recommend.movies(plot_embedding) 
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

This creates an Approximate Nearest Neighbor (ANN) index on the plot_embedding column using HNSW, enabling fast similarity queries. DOT_PRODUCT is used because all-MiniLM-L6-v2 produces normalized (unit-length) vectors — for normalized vectors, dot product is equivalent to cosine similarity but computationally faster.

The retrieval query looks like this:

SELECT * FROM recommend.movies
ORDER BY plot_embedding ANN OF [0.12, -0.34, ...]
LIMIT 5;

For filtered retrieval scoped to a single genre, the schema uses a second table with genre as the partition key and a local vector index:

CREATE TABLE recommend.movies_by_genre (
    genre        TEXT,
    id           INT,
    release_date TIMESTAMP,
    title        TEXT,
    plot         TEXT,
    plot_embedding VECTOR<FLOAT, 384>,
    PRIMARY KEY (genre, id)
);

CREATE CUSTOM INDEX IF NOT EXISTS ann_index_by_genre
  ON recommend.movies_by_genre((genre), plot_embedding)
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

The local index syntax ((genre), plot_embedding) tells ScyllaDB to build a separate HNSW index per genre value. Queries on this table specify the partition key and need no ALLOW FILTERING:

SELECT * FROM recommend.movies_by_genre
WHERE genre = 'Science Fiction'
ORDER BY plot_embedding ANN OF [0.12, -0.34, ...]
LIMIT 5;

ScyllaDB routes the query to the correct shard and searches only that partition’s index — keeping retrieval fast as the dataset grows.

Next steps¶

The RAG tutorial walks through building the full filtering pipeline step by step: schema design, embedding-based retrieval, genre and decade filters, the RAG module, and the Groq integration.

Was this page helpful?