Codegen provides semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren’t present.
Basic Usage
Here’s how to create and use a semantic code search index:
# Parse a codebase
codebase = Codebase.from_repo('fastapi/fastapi', language='python')
# Create index
index = FileIndex(codebase)
index.create() # computes per-file embeddings
# Save index to .pkl
index.save('index.pkl')
# Load index into memory
index.load('index.pkl')
# Update index after changes
codebase.files[0].edit('# 🌈 Replacing File Content 🌈')
codebase.commit()
index.update() # re-computes 1 embedding
Searching Code
Once you have an index, you can perform semantic searches:
# Search with natural language
results = index.similarity_search(
"How does FastAPI handle dependency injection?",
k=5 # number of results
)
# Print results
for file, score in results:
print(f"\nScore: {score:.3f} | File: {file.filepath}")
print(f"Preview: {file.content[:200]}...")
The
FileIndex
returns tuples of (
File,
score
)
The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
Available Indices
Codegen provides two types of semantic indices:
FileIndex
The FileIndex
operates at the file level:
- Indexes entire files, splitting large files into chunks
- Best for finding relevant files or modules
- Simpler and faster to create/update
from codegen.extensions.index.file_index import FileIndex
index = FileIndex(codebase)
index.create()
SymbolIndex (Experimental)
The SymbolIndex
operates at the symbol level:
- Indexes individual functions, classes, and methods
- Better for finding specific code elements
- More granular search results
from codegen.extensions.index.symbol_index import SymbolIndex
index = SymbolIndex(codebase)
index.create()
How It Works
The semantic indices:
- Process code at either file or symbol level
- Split large content into chunks that fit within token limits
- Use OpenAI’s text-embedding-3-small model to create embeddings
- Store embeddings efficiently for similarity search
- Support incremental updates when code changes
When searching:
- Your query is converted to an embedding
- Cosine similarity is computed with all stored embeddings
- The most similar items are returned with their scores
Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
Example Searches
Here are some example semantic searches:
# Find authentication-related code
results = index.similarity_search(
"How is user authentication implemented?",
k=3
)
# Find error handling patterns
results = index.similarity_search(
"Show me examples of error handling and custom exceptions",
k=3
)
# Find configuration management
results = index.similarity_search(
"Where is the application configuration and settings handled?",
k=3
)
The semantic search can understand concepts and return relevant results even when the exact terms aren’t present in the code.