Codegen - Manipulate Code at Scale

Codegen provides semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren’t present.

This is under active development. Interested in an application? Reach out to the team!

Basic Usage

Here’s how to create and use a semantic code search index:

# Parse a codebase
codebase = Codebase.from_repo('fastapi/fastapi', language='python')

# Create index
index = FileIndex(codebase)
index.create() # computes per-file embeddings

# Save index to .pkl
index.save('index.pkl')

# Load index into memory
index.load('index.pkl')

# Update index after changes
codebase.files[0].edit('# 🌈 Replacing File Content 🌈')
codebase.commit()
index.update() # re-computes 1 embedding

Searching Code

Once you have an index, you can perform semantic searches:

# Search with natural language
results = index.similarity_search(
    "How does FastAPI handle dependency injection?",
    k=5  # number of results
)

# Print results
for file, score in results:
    print(f"\nScore: {score:.3f} | File: {file.filepath}")
    print(f"Preview: {file.content[:200]}...")

The FileIndex returns tuples of (File, score)

The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.

Available Indices

Codegen provides two types of semantic indices:

FileIndex

The FileIndex operates at the file level:

Indexes entire files, splitting large files into chunks
Best for finding relevant files or modules
Simpler and faster to create/update

from codegen.extensions.index.file_index import FileIndex

index = FileIndex(codebase)
index.create()

SymbolIndex (Experimental)

The SymbolIndex operates at the symbol level:

Indexes individual functions, classes, and methods
Better for finding specific code elements
More granular search results

from codegen.extensions.index.symbol_index import SymbolIndex

index = SymbolIndex(codebase)
index.create()

How It Works

The semantic indices:

Process code at either file or symbol level
Split large content into chunks that fit within token limits
Use OpenAI’s text-embedding-3-small model to create embeddings
Store embeddings efficiently for similarity search
Support incremental updates when code changes

When searching:

Your query is converted to an embedding
Cosine similarity is computed with all stored embeddings
The most similar items are returned with their scores

Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.

Example Searches

Here are some example semantic searches:

# Find authentication-related code
results = index.similarity_search(
    "How is user authentication implemented?",
    k=3
)

# Find error handling patterns
results = index.similarity_search(
    "Show me examples of error handling and custom exceptions",
    k=3
)

# Find configuration management
results = index.similarity_search(
    "Where is the application configuration and settings handled?",
    k=3
)

The semantic search can understand concepts and return relevant results even when the exact terms aren’t present in the code.

Introduction

Tutorials

Building with Codegen

Semantic Code Search

Basic Usage

Searching Code

Available Indices

FileIndex

SymbolIndex (Experimental)

How It Works

Example Searches

Introduction

Tutorials

Building with Codegen

​Basic Usage

​Searching Code

​Available Indices

​FileIndex

​SymbolIndex (Experimental)

​How It Works

​Example Searches

Basic Usage

Searching Code

Available Indices

FileIndex

SymbolIndex (Experimental)

How It Works

Example Searches