Get started with semantic search¶
This tutorial shows you how to build a semantic search application with ScyllaDB Vector Search.
What you’ll build¶
A movie recommendation chatbot that uses ScyllaDB Vector Search to perform semantic similarity search between user queries and movie plot descriptions.
Prerequisites¶
You’ve read Quick Start Guide to Vector Search
ScyllaDB Cloud cluster with vector search enabled
Docker installed
Git installed
Clone the repository¶
Clone the repository and navigate to the project folder:
git clone https://github.com/scylladb/vector-search-examples.git
cd vector-search-examples/movie-recommendation
Configure database credentials¶
Create a .env file from the example template:
cp .env.example .env
Edit .env and add your ScyllaDB Cloud credentials:
SCYLLADB_HOST=node-0.aws-us-east-1.xxxxxxxx.clusters.scylla.cloud
SCYLLADB_PORT=9042
SCYLLADB_USERNAME=scylla
SCYLLADB_PASSWORD=xxxxxxxxxxxxxx
SCYLLADB_DATACENTER=AWS_US_EAST_1
SCYLLADB_KEYSPACE=recommend
The SCYLLADB_KEYSPACE variable sets the keyspace name that will be created in your cluster.
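The application reads these values when it starts. If you want to experiment with the connection outside the container, a minimal sketch using the Python driver might look like the following (an illustration, not the repository's code; it assumes the scylla-driver and python-dotenv packages, and your cluster may additionally require TLS options):

# Minimal connection sketch using the values from .env (illustration only).
import os
from dotenv import load_dotenv
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

load_dotenv()  # read the .env file created above

# Route requests to the datacenter named in SCYLLADB_DATACENTER
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc=os.environ["SCYLLADB_DATACENTER"])
    )
)

cluster = Cluster(
    contact_points=[os.environ["SCYLLADB_HOST"]],
    port=int(os.environ["SCYLLADB_PORT"]),
    auth_provider=PlainTextAuthProvider(
        username=os.environ["SCYLLADB_USERNAME"],
        password=os.environ["SCYLLADB_PASSWORD"],
    ),
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()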
Run the application¶
Build the Docker image¶
Build the Docker image from the project directory:
docker build -t movies-app .
This command builds a containerized version of the application with all dependencies. The Dockerfile uses Python 3.11, installs the CPU-only build of PyTorch (avoiding the large NVIDIA/CUDA packages), and sets up the FastAPI server.
Start the container¶
Run the container with your environment variables:
docker run -d --rm -p 8000:8000 --env-file .env --name movie-container movies-app
Breaking down this command:
--rm - Automatically removes the container when it stops
-p 8000:8000 - Maps port 8000 from the container to your local machine
--env-file .env - Loads your ScyllaDB credentials from the .env file
--name movie-container - Names the container for easy reference
movies-app - The image name from the previous build step
When the container starts, it automatically:
Runs the migration script (src/migrate.py) to create the keyspace, table, and vector index.
Starts the FastAPI server on port 8000
Load sample data¶
With the container running, load the sample movie dataset:
docker exec movie-container python src/load_data.py
This ingests approximately 30,000 movies from the TMDB dataset. The data loading process:
Reads movie data from CSV files in src/data/
Generates 384-dimensional embeddings for each movie plot using the all-MiniLM-L6-v2 model
Inserts movies with their embeddings into ScyllaDB
You’ll see progress output:
⏳ Ingestion started...
📄 Ingesting sample data 1/3 ...
55%|█████▍ | 5450/9999 [00:14<00:08, 518.68req/s]
The ingestion process takes a few minutes.
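For context, the core of src/load_data.py boils down to a loop like the following simplified sketch (an illustration, not the script's exact code; it assumes the sentence-transformers package, the session from the connection sketch above, a driver version with vector type support, and a hypothetical CSV file name and column layout):

# Simplified sketch of the ingestion loop (illustration only).
# Column names follow the schema shown later on this page; the CSV
# file name and its column names are hypothetical.
import csv
import os
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings

session.set_keyspace(os.environ["SCYLLADB_KEYSPACE"])  # keyspace created by the migration
insert = session.prepare(
    "INSERT INTO movies (id, title, plot, plot_embedding) VALUES (?, ?, ?, ?)"
)

with open("src/data/movies.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        embedding = model.encode(row["plot"]).tolist()  # 384 floats per plot
        session.execute(insert, (int(row["id"]), row["title"], row["plot"], embedding))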
Access the application¶
Once data loading completes, open your browser to http://127.0.0.1:8000.
You can now:
Enter text descriptions to get movie recommendations (e.g., “a thriller about artificial intelligence”)
Browse the interactive API documentation at http://127.0.0.1:8000/docs
Check application health at http://127.0.0.1:8000/health
The application uses vector similarity search to find movies whose plot embeddings are closest to your query embedding, returning semantically relevant recommendations.
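Conceptually, that search is two steps: embed the user's text with the same all-MiniLM-L6-v2 model, then ask ScyllaDB for the nearest stored embeddings through the ANN index. A hedged sketch, reusing the model and session from the earlier sketches and assuming the ORDER BY … ANN OF query syntax from the Vector Search documentation:

# Sketch of a semantic search query (illustration only).
query_text = "a thriller about artificial intelligence"
query_embedding = model.encode(query_text).tolist()  # same 384-dim space as the stored plots

ann_query = session.prepare(
    "SELECT title, plot FROM movies ORDER BY plot_embedding ANN OF ? LIMIT 5"
)
for row in session.execute(ann_query, (query_embedding,)):
    print(row.title)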
Check the source code of the application for more details: https://github.com/scylladb/vector-search-examples
Or browse the movie-recommendation folder specifically:
https://github.com/scylladb/vector-search-examples/tree/main/movie-recommendation
Understanding the database schema¶
When you run the Docker container, the migration script automatically creates the database schema. Here's what gets created in your ScyllaDB cluster (the statements below use example_ks as a placeholder keyspace name; the actual name comes from the SCYLLADB_KEYSPACE variable in your .env file):
CREATE KEYSPACE IF NOT EXISTS example_ks
WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'}
AND TABLETS = {'enabled': 'true'};
CREATE TABLE IF NOT EXISTS example_ks.movies (
    id INT,
    release_date TIMESTAMP,
    title TEXT,
    tagline TEXT,
    genre TEXT,
    imdb_id TEXT,
    poster_url TEXT,
    plot TEXT,
    plot_embedding VECTOR<FLOAT, 384>,
    PRIMARY KEY (id)
) WITH cdc = {'enabled': 'true'};
CREATE INDEX IF NOT EXISTS ann_index ON example_ks.movies(plot_embedding)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
Keyspace:
CREATE KEYSPACE IF NOT EXISTS example_ks - Creates a keyspace
'class': 'NetworkTopologyStrategy' - Replication strategy
'replication_factor': '3' - Data is replicated across 3 nodes for high availability
TABLETS = {'enabled': 'true'} - Enables Tablets, required for Vector Search functionality
Table:
CREATE TABLE IF NOT EXISTS example_ks.movies - Creates the movies table
plot_embedding VECTOR<FLOAT, 384> - Vector column storing 384-dimensional float embeddings of movie plots (generated using all-MiniLM-L6-v2 from Sentence Transformers)
PRIMARY KEY (id) - Sets id as the primary key for data distribution
cdc = {'enabled': 'true'} - Enables Change Data Capture for streaming changes
Vector Index:
CREATE INDEX IF NOT EXISTS ann_index ON example_ks.movies(plot_embedding)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
This creates an Approximate Nearest Neighbor (ANN) index on the plot_embedding vector column, enabling fast similarity search queries. The index uses DOT_PRODUCT as the similarity function to measure how closely vectors match.
DOT_PRODUCT is a good fit for this use case (and for many embedding/LLM projects) because the all-MiniLM-L6-v2 model produces normalized embeddings (vectors with a length of 1). For normalized vectors, the dot product is equivalent to cosine similarity but cheaper to compute. Other available similarity functions include COSINE and EUCLIDEAN.
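As a quick numeric sanity check (NumPy used only for illustration): for unit-length vectors the dot product and cosine similarity produce the same value, so DOT_PRODUCT skips the normalization step without changing the ranking.

# For normalized (unit-length) vectors, dot product == cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(384)
a /= np.linalg.norm(a)  # unit length, like all-MiniLM-L6-v2 output
b = rng.random(384)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(abs(dot - cosine) < 1e-9)  # True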