Exploring Verba, the Golden RAGtriever
Verba is an open-source application designed to offer a user-friendly interface for Retrieval-Augmented Generation (RAG). In a few easy steps, you can explore your datasets and extract insights with ease, either locally with HuggingFace and Ollama or through LLM providers such as OpenAI, Cohere, and Google.

Verba is an AI-powered tool that lets you chat with your own data: it retrieves the passages relevant to a question and generates grounded answers from them. Designed to help with everything from exploring a dataset to drafting answers, it enables users to enhance their productivity. Whether you’re a researcher, developer, or content creator, Verba brings fresh possibilities to streamline your workflow.
Introduction
Welcome to an exciting journey into the world of vector search with Verba!
Whether you’re a beginner eager to explore new tools or a seasoned developer looking to expand your toolkit, this blog post is crafted just for you. We’ll dive deep into what Verba is, its use cases, how to use it with Python, and how it compares to a couple of popular alternatives.
So, grab a cup of coffee, and let’s get started!
Why Vector Search?
Before diving into the specifics of Verba, let’s first understand vectorization and vector search.
Vectorization is the transformation of input data (e.g. text or image) into vectors of real numbers that are understandable to machine learning models. There are many vectorization techniques, from simply counting the frequency of a term to using features that take complex context into account.
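For instance, the simplest vectorization technique, a term-frequency (bag-of-words) vector, just counts how often each vocabulary word appears in a text. A minimal sketch (with a made-up toy vocabulary):

```python
from collections import Counter

# A tiny fixed vocabulary; real systems use thousands of terms or learned embeddings
vocabulary = ["machine", "learning", "deep", "search"]

def term_frequency_vector(text):
    # Count word occurrences and project onto the fixed vocabulary
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

print(term_frequency_vector("Deep learning is deep"))  # [0, 1, 2, 0]
```

Context-aware techniques (like the embedding models Verba uses) replace these raw counts with dense vectors that capture meaning, but the principle — text in, numbers out — is the same.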
Vector search is a method of finding similar objects based on their vectors (numerical representations). It is particularly effective for tasks such as searching large text data sets, finding similar images, or even recommending products.
Why is vector search important? Here are a few reasons:
- Scalability: Traditional search methods may not be able to handle large data sets. Vector search scales well when dealing with large amounts of data.
- Accuracy: Through the use of embeddings, vector search can capture semantic meaning, resulting in more accurate search results.
- Universality: Vector search can be used for images, audio and any other data that can be represented as vectors.
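To make the idea concrete, here is a minimal sketch of vector search in plain NumPy: represent each item as a vector, then rank items by cosine similarity to a query vector. The toy vectors below are made up for illustration; in practice they would come from an embedding model.

```python
import numpy as np

# Toy "embeddings" (in practice these come from an embedding model)
docs = {
    "cat photo": np.array([0.9, 0.1, 0.0]),
    "dog photo": np.array([0.8, 0.2, 0.1]),
    "tax report": np.array([0.0, 0.1, 0.95]),
}

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two normalized vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.85, 0.15, 0.05])  # e.g., the embedding of "pet picture"

# Rank documents by similarity to the query, most similar first
ranked = sorted(docs, key=lambda name: cosine_similarity(docs[name], query), reverse=True)
print(ranked[0])  # the semantically closest document
```

Tools like Weaviate do exactly this ranking, but with optimized approximate-nearest-neighbor indexes instead of a brute-force scan.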
What is Verba?
Verba is an open-source tool developed by Weaviate to make vector search easier and more efficient. Verba leverages the capabilities of Weaviate, an open source vector database capable of storing both data objects and vector embeddings. This feature allows for highly sophisticated, seamless and efficient search. Verba combines the power of machine learning models to generate embeddings (vectors) and the robustness of Weaviate to perform fast and accurate searches.
Supported Models
Verba allows you to select different LLM (Large Language Models) providers depending on your specific use case:
- Ollama: Local Embedding and Generation Models, e.g., Llama3
- HuggingFace: Local Embedding Models, e.g., MiniLMEmbedder
- Cohere: Embedding and Generation Models, e.g., Command R+
- Google: Embedding and Generation Models, e.g., Gemini
- OpenAI: Embedding and Generation Models, e.g., GPT-4o
Supported Data
Verba supports working with the following data types:
- PDF: Import PDFs into Verba
- CSV/XLSX: Import Table Data into Verba
- Unstructured: Import Data through Unstructured.io
- Multi-Modal: Import of Multi-Modal Data into Verba is planned.
Use Cases of Verba
Document Search
One of the most common use cases for Weaviate Verba is document search. Whether you have a collection of research papers, news articles, or blog posts, Verba can help you find relevant documents based on a query.
Image Similarity Search
If you have a dataset of images, you can use Verba to find visually similar images. This is particularly useful for applications like image-based recommendations or duplicate image detection.
Semantic Text Search
Verba excels at semantic text search, where the goal is to find documents that are semantically similar to a given query. This goes beyond keyword matching and understands the context and meaning of the query.
Getting Started with Verba
Let’s break down how to get started with Weaviate Verba and use it in a Python project. We’ll walk through installation, basic usage, and a practical example.
Installation
First things first, you need to install the Weaviate client and Verba. (Note that Verba’s official distribution ships as the `goldenverba` package; the `weaviate-verba` package and the `weaviate_verba` wrapper API used in the examples below are simplified and illustrative.) You can do this using pip:

```shell
pip install weaviate-client
pip install weaviate-verba
```

Setting Up
To use Weaviate Verba, you need to have a Weaviate instance running. Local deployment is the most straightforward way to launch your Weaviate database for prototyping and testing. You can run Weaviate locally using Docker:

```shell
docker run -p 8080:8080 -p 50051:50051 semitechnologies/weaviate:latest
```

Detailed instructions on deploying Weaviate with Docker can be found in the Weaviate Docker Guide.
If you prefer a cloud-based solution, Weaviate Cloud Service (WCS) offers a scalable, managed environment. Learn how to set up a cloud cluster and get the API keys by following the Weaviate Cluster Setup Guide.
Using Verba
Before starting Verba, you’ll need to configure access to the components you’ve chosen, such as OpenAI, Cohere, and HuggingFace, via an .env file. Create this .env file in the same directory you want to start Verba in. 👉 Example .env.
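The exact keys depend on which providers you enable; a minimal .env might look like the following (the variable names below are illustrative placeholders — check Verba’s documentation for the ones your setup needs):

```
OPENAI_API_KEY=your-openai-api-key
COHERE_API_KEY=your-cohere-api-key
```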
Now, let’s see how to use Weaviate Verba with Python code. We’ll start with a simple example of indexing and searching text data.
Indexing Data:

```python
import weaviate
from weaviate_verba import Verba

# Connect to Weaviate instance
client = weaviate.Client("http://localhost:8080")

# Initialize Verba
verba = Verba(client)

# Define a schema
schema = {
    "classes": [
        {
            "class": "Article",
            "properties": [
                {"name": "title", "dataType": ["string"]},
                {"name": "content", "dataType": ["text"]}
            ]
        }
    ]
}

# Create the schema
client.schema.create(schema)

# Index some data
articles = [
    {"title": "Introduction to Machine Learning", "content": "Machine learning is a field of AI..."},
    {"title": "Deep Learning Basics", "content": "Deep learning is a subset of machine learning..."},
]
for article in articles:
    client.data_object.create(article, "Article")
```

Searching Data:

```python
# Search for similar articles
query = "Basics of AI"
result = verba.search("Article", query)
for item in result:
    print(f"Title: {item['title']}, Content: {item['content']}")
```

Verba and Local Docs
Let’s expand on how to use Weaviate Verba to index and search locally saved documents. We’ll walk through the process of loading documents from your local file system, indexing them with Weaviate Verba, and then performing searches.
Make sure you have Weaviate and Verba installed, and a Weaviate instance running.
Loading and Indexing
Let’s assume you have a directory of text files you want to index. We’ll write a script to load these files, create a schema in Weaviate, and index the documents.
```python
import os
import weaviate
from weaviate_verba import Verba

# Connect to Weaviate instance
client = weaviate.Client("http://localhost:8080")

# Initialize Verba
verba = Verba(client)

# Define a schema
schema = {
    "classes": [
        {
            "class": "Document",
            "properties": [
                {"name": "title", "dataType": ["string"]},
                {"name": "content", "dataType": ["text"]}
            ]
        }
    ]
}

# Create the schema
client.schema.create(schema)

# Function to load documents from a directory
def load_documents_from_directory(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(directory_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            documents.append({"title": filename, "content": content})
    return documents

# Load documents
directory_path = "/path/to/your/documents"  # Change this to your directory path
documents = load_documents_from_directory(directory_path)

# Index documents
for doc in documents:
    client.data_object.create(doc, "Document")
```

Searching Documents
Now that we have indexed our local documents, we can perform searches using Verba.
```python
# Define a search query
query = "Machine learning concepts"

# Perform the search
results = verba.search("Document", query)

# Display the results
for result in results:
    # Print the first 200 characters of the content
    print(f"Title: {result['title']}, Content: {result['content'][:200]}...")
```

Full Script Example
Here’s the full script combining loading, indexing, and searching:
```python
import os
import weaviate
from weaviate_verba import Verba

# Connect to Weaviate instance
client = weaviate.Client("http://localhost:8080")

# Initialize Verba
verba = Verba(client)

# Define a schema
schema = {
    "classes": [
        {
            "class": "Document",
            "properties": [
                {"name": "title", "dataType": ["string"]},
                {"name": "content", "dataType": ["text"]}
            ]
        }
    ]
}

# Create the schema
client.schema.create(schema)

# Function to load documents from a directory
def load_documents_from_directory(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(directory_path, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            documents.append({"title": filename, "content": content})
    return documents

# Load documents
directory_path = "/path/to/your/documents"  # Change this to your directory path
documents = load_documents_from_directory(directory_path)

# Index documents
for doc in documents:
    client.data_object.create(doc, "Document")

# Define a search query
query = "Machine learning concepts"

# Perform the search
results = verba.search("Document", query)

# Display the results
for result in results:
    # Print the first 200 characters of the content
    print(f"Title: {result['title']}, Content: {result['content'][:200]}...")
```

Explanation
- Loading Documents: The load_documents_from_directory function reads text files from a specified directory and loads their content into a list of dictionaries.
- Indexing Documents: Each document is indexed in Weaviate with the class name “Document”.
- Searching: The search query is processed, and the results are printed, displaying the title and the first 200 characters of each document’s content.
By following these steps, you can efficiently index and search through locally saved documents using Weaviate Verba. This method can be adapted for various types of text data, enabling powerful and scalable search capabilities in your projects.
Pros and Cons of Verba
So, what is good and what could be better about this tool?
Pros of Verba
- Ease of Use: Verba provides a high-level API that makes it easy to integrate vector search into your applications.
- Flexibility: It supports various data types, including text, images, and more.
- Performance: Built on Weaviate, Verba ensures fast and efficient searches, even with large datasets.
- Open Source: Being open source means you can contribute to its development and customize it to your needs.
Cons of Verba
- Setup Complexity: Initial setup, especially configuring Weaviate, can be complex for beginners.
- Resource Intensive: Running Weaviate and performing vector operations can be resource-intensive, requiring significant computational power.
- Learning Curve: Understanding vector search concepts and effectively using Verba may require a learning curve.
👉 Want to see Verba in action? Check out the live demo of Verba.
👉 Official blog post about Verba by Weaviate.
👉 YouTube video with features and capabilities of Verba.
Some Alternatives
Faiss
Faiss (Facebook AI Similarity Search) is a popular library for efficient similarity search and clustering of dense vectors.
Pros of Faiss:
- Highly Optimized: Faiss is optimized for performance, handling large datasets efficiently.
- Versatility: It supports a variety of indexing methods and search strategies.
Cons of Faiss:
- Complexity: Faiss can be complex to set up and use, especially for beginners.
- Limited Scope: It’s primarily focused on vector search and doesn’t provide a full-fledged search engine like Weaviate.
Elasticsearch
Elasticsearch is a widely used search engine that can be extended with the k-NN plugin to support vector search.
Pros of Elasticsearch with k-NN:
- Scalability: Elasticsearch is highly scalable and can handle large-scale search applications.
- Rich Features: Beyond vector search, it offers a wide range of search and analytics features.
Cons of Elasticsearch with k-NN:
- Complexity: Setting up and configuring Elasticsearch with the k-NN plugin can be challenging.
- Performance: While powerful, it may not match the performance of specialized vector search tools like Faiss or Weaviate.
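As a sketch of what that setup involves (assuming Elasticsearch 8.x, where dense_vector fields and the knn search option are built in; the index name, field names, and dimensionality below are illustrative), you first define a mapping with a dense_vector field, then query it with an approximate k-NN search:

```
PUT /articles
{
  "mappings": {
    "properties": {
      "title":     { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 3,
                     "index": true, "similarity": "cosine" }
    }
  }
}

POST /articles/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.2, 0.1, 0.9],
    "k": 5,
    "num_candidates": 50
  }
}
```

Generating the embeddings for both documents and queries remains your responsibility, whereas Verba handles that step through its configured model providers.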
Conclusion
Verba is a powerful tool that brings the capabilities of vector search to your fingertips. Its ease of use, flexibility, and performance make it a great choice for a variety of applications. While it has some setup complexity and resource requirements, the benefits it offers are well worth the effort.
For beginners, starting with Weaviate Verba can be an excellent introduction to the world of vector search. The sample code provided in this post should help you get started on your journey. As you become more familiar with the tool, you’ll discover its full potential and the wide range of applications it can support.
If you’re exploring alternatives, Faiss and Elasticsearch with the k-NN plugin are also excellent choices, each with its own strengths and weaknesses. Ultimately, the best tool for your needs will depend on your specific use case, performance requirements, and familiarity with the technology.
Happy coding, and may your searches always be efficient and accurate!
😎