View the full code in our examples repository
This example works with both Python and TypeScript repositories without modification.
## Overview
The process involves three main steps:

1. Finding all functions in the codebase
2. Extracting their implementations, dependencies, and usages
3. Generating structured training data
## Step 1: Finding Functions and Their Context
First, we perform a “graph expansion” for each function: we grab the function’s source, the full source of every symbol it depends on, and the full source of every place it is used. See dependencies and usages to learn more about navigating the code graph.
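Here is a minimal sketch of that expansion. It assumes the SDK's function objects expose `.source`, `.filepath`, `.dependencies`, and `.usages` (with each usage carrying a `usage_symbol`); the helper name `get_function_context` is ours:

```python
def get_function_context(function) -> dict:
    """Collect a function's source plus the sources of its dependencies and usages."""
    context = {
        "implementation": {"source": function.source, "filepath": function.filepath},
        "dependencies": [],
        "usages": [],
    }

    # Dependencies may be imports that re-export the real symbol,
    # so we hop through them first (see hop_through_imports below).
    for dep in function.dependencies:
        dep = hop_through_imports(dep)
        context["dependencies"].append({"source": dep.source, "filepath": dep.filepath})

    # Record every place the function is used.
    for usage in function.usages:
        context["usages"].append({
            "source": usage.usage_symbol.source,
            "filepath": usage.usage_symbol.filepath,
        })

    return context
```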
We use `hop_through_imports` to resolve dependencies. When working with imports, symbols can be re-exported multiple times: a helper function might be imported and re-exported through several files before being used. We need to follow this chain to find the actual implementation.
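A minimal sketch of that hop, assuming the SDK's import objects expose the symbol they import via an `imported_symbol` attribute:

```python
def hop_through_imports(symbol):
    """Follow a chain of re-exports until reaching the concrete symbol."""
    # Each import object is assumed to expose what it imports via
    # `imported_symbol`; plain symbols lack the attribute, ending the loop.
    while hasattr(symbol, "imported_symbol"):
        symbol = symbol.imported_symbol
    return symbol
```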
## Step 2: Processing the Codebase

Next, we process all functions in the codebase to generate our training data.
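A sketch of the processing pass, reusing `get_function_context` from Step 1 (the filtering rule is our own choice, not part of the API):

```python
def generate_training_data(codebase) -> dict:
    """Build one training record per function in the codebase."""
    training_data = {
        "functions": [],
        "metadata": {"total_functions": len(codebase.functions), "processed": 0},
    }

    for function in codebase.functions:
        context = get_function_context(function)
        # Skip functions with no graph context; there is nothing to predict from.
        if not (context["dependencies"] or context["usages"]):
            continue
        training_data["functions"].append(context)

    training_data["metadata"]["processed"] = len(training_data["functions"])
    return training_data
```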
## Step 3: Running the Generator

Finally, we can run our training data generator on any codebase. See parsing codebases to learn more. The generator will, as sketched after this list:
- Load the target codebase
- Process all functions
- Save the structured training data to a JSON file
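A sketch of the end-to-end runner; `Codebase.from_repo` comes from the SDK (the import path is our assumption), while `generate_training_data` is the helper above and the output filename is arbitrary:

```python
import json

from codegen import Codebase  # assumed import path for the SDK


def run(repo_url: str, output_path: str = "training_data.json") -> None:
    # 1. Load the target codebase.
    codebase = Codebase.from_repo(repo_url)

    # 2. Process all functions into training records.
    data = generate_training_data(codebase)

    # 3. Save the structured training data to a JSON file.
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)


if __name__ == "__main__":
    run("fastapi/fastapi")  # hypothetical example repository
```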
You can use any Git repository as your source codebase by passing the repo URL to `Codebase.from_repo(…)`.
## Using the Training Data
The generated data can be used to train LLMs in several ways:

- Masked Function Prediction: Hide a function’s implementation and predict it from dependencies and usages
- Code Embeddings: Generate embeddings that capture semantic relationships between functions
- Dependency Prediction: Learn to predict which functions are likely to be dependencies
- Usage Pattern Learning: Train models to understand common usage patterns
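For example, a masked-prediction pair can be assembled straight from the saved JSON (field names follow the sketches above; the prompt wording is illustrative):

```python
import json


def build_masked_example(record: dict) -> dict:
    """Turn one training record into a (prompt, completion) pair."""
    deps = "\n\n".join(d["source"] for d in record["dependencies"])
    usages = "\n\n".join(u["source"] for u in record["usages"])
    prompt = (
        f"Dependencies:\n{deps}\n\n"
        f"Usages:\n{usages}\n\n"
        "Write the function these refer to."
    )
    return {"prompt": prompt, "completion": record["implementation"]["source"]}


with open("training_data.json") as f:
    data = json.load(f)

examples = [build_masked_example(record) for record in data["functions"]]
```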