Extract repo metadata using the GitHub API
When you inherit a large, unfamiliar codebase, the hardest part is not reading the code.
It’s knowing where to start.
What Do We Actually Want to Know?
Knowledge Graphs (KGs) are built in stages, so starting with basic repo questions lays the foundation for the next level of granularity, akin to chain-of-thought reasoning:
- How big is the repo (number of files, size in kB)?
- What's the directory structure?
- What types of files are in this repository?
- Which language(s) likely contain the application logic?
- Can I detect a framework?
Answers to these questions come from two sources:
- Direct queries against repository metadata via the GitHub API, and
- Traversing the repo directory
Querying the Repo's Metadata
GitHub already knows a lot about a repository:
- Default branch
- Primary language
- Repo size
- Directory tree
Not only can we extract this info programmatically, we can also start to build the App KG schema.
Diagram: Schematic of what the metadata CLI commands do
```mermaid
sequenceDiagram
    participant C as PYTHON CLI
    participant E as GITHUB
    participant F as NEO4J
    C->>E: 1. Query repo for metadata
    E->>C: 2. Traverse repo, download directory + file names
    C->>C: 3. Classify file types
    C->>F: 4. Create base KG (directories, file names and file types)
```
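Step 1 of the diagram maps to a single GitHub REST API call, `GET /repos/{owner}/{repo}`, whose response includes the `size` (in kB), `default_branch`, `language`, and `visibility` fields. A minimal sketch, independent of the kgtools implementation:

```python
import json
import urllib.request


def fetch_repo_metadata(owner: str, repo: str) -> dict:
    """Query the GitHub REST API for a repository's metadata."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarise(metadata: dict) -> dict:
    """Keep only the fields that seed the base KG schema."""
    return {
        "default_branch": metadata.get("default_branch"),
        "language": metadata.get("language"),
        "size_kb": metadata.get("size"),  # GitHub reports size in kB
        "visibility": metadata.get("visibility"),
    }
```

Unauthenticated calls are rate-limited, so a production tool would also pass an `Authorization` header with a token.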
Supporting GitHub Repository
Source code for all the KG tools in this post is in the GitHub repo Knowledge Graph Tools, which provides a range of CLI tools for creating a codebase KG. To illustrate the functionality of each of these tools, I'll tear down the large open-source content management application Zotero and its underlying codebase.
How big is the Zotero repo?
Repo size, measured in tokens, gives a first indication of how feasible it is to feed an entire repo into an LLM in one shot.
Although tokenising a codebase is LLM specific, we can get a preliminary estimate of the number of tokens by using the following rule-of-thumb:
tokens ≈ repo_size_kb × 250
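The rule of thumb is a one-liner; a sketch, with the Zotero figures from later in this post plugged in:

```python
def estimate_tokens(repo_size_kb: int, tokens_per_kb: int = 250) -> int:
    """Rule-of-thumb token estimate from repo size in kB."""
    return repo_size_kb * tokens_per_kb


# Zotero: 231,586 kB -> roughly 57.9 million tokens
print(f"{estimate_tokens(231586) / 1e6:.1f}m")  # → 57.9m
```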
The kB size of any repo is readily available from the repo metadata - which can be extracted using the following CLI tool command:
kgtools repo <repo> size
Running this for the Zotero repo gives us:
```
> kgtools repo zotero/zotero size
231586
Estimated tokens ≈ 57.9m
```
Clearly, no LLM can ingest the entire repo in a single-shot prompt.
Token estimation is non-trivial
Using LLMs in any code-development workflow has cost implications. While some LLM vendors provide their own token estimators, Peta Muir shows how to build your own in her post Counting Claude Tokens Without a Tokenizer.
Other repo metadata stats
Extracting other metadata stats can add more context to the term repo size:
```
> kgtools repo --help
Usage: kgtools repo [OPTIONS] REPOSITORY COMMAND [ARGS]...

  Repository operations.

  Example: kgtools repo zotero/zotero description

Options:
  --help  Show this message and exit.

Commands:
  default_branch
  description
  directory_count
  file_count
  file_types
  get_directory_file_count
  language
  size
  stars
  stats
  visibility
```
What does the repo file structure look like?
By traversing the repository tree, we can:
- Count file types, classifying files by extension
- Identify dominant languages
- Detect framework hints (e.g., package.json, pom.xml, requirements.txt)
- Identify directories likely containing application logic
The CLI tool command kgtools repo <repo> file_types gives two outputs:
- Terminal output of file type counts for manual review, and
- A CSV file of File Type, Directory, and File Name, ready to be uploaded into Neo4j
Running this CLI command for the Zotero repo gives us:
```
> kgtools repo zotero/zotero file_types

FILE_TYPES
----------
total_files: 3407
directory_count: 271
output_csv: repo_files.csv

File Type Distribution:
  SVG: 764
  FreeMarker_Java_template: 625
  Javascript: 557
  DTD_XML_def: 482
  CSS: 199
  Java_prop: 146
  Other: 108
  Header: 107
  C++: 77
  JSON: 63
  XHTML: 52
  PNG: 41
  ...
```
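Under the hood, the classification step is an extension lookup plus a CSV writer. A sketch of the idea, where the extension map is illustrative rather than the actual kgtools table:

```python
import csv
from pathlib import PurePosixPath

# Hypothetical extension-to-type mapping; the real table is much larger.
EXTENSION_TYPES = {
    ".js": "Javascript",
    ".css": "CSS",
    ".svg": "SVG",
    ".json": "JSON",
    ".png": "PNG",
}


def classify(path: str) -> str:
    """Map a repo path to a coarse file type via its extension."""
    return EXTENSION_TYPES.get(PurePosixPath(path).suffix.lower(), "Other")


def write_csv(paths: list[str], out_file: str) -> None:
    """Emit the File Type / Directory / File Name rows for Neo4j."""
    with open(out_file, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["File Type", "Directory", "File Name"])
        for p in paths:
            pp = PurePosixPath(p)
            writer.writerow([classify(p), str(pp.parent), pp.name])
```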
What This Reveals
Even before parsing individual files, repo metadata alone can tell us:
- Whether this is frontend-heavy (many .js/.ts)
- Whether this is backend-heavy (many .py/.java)
- Whether it has a monorepo structure
- Whether multiple frameworks coexist
- Whether tests dominate the codebase
- Whether build tooling is complex
This immediately narrows down where the actual application logic likely lives.
For example:
- Many .js files + a chrome/ directory? Possibly browser-extension architecture.
- Many .java files + pom.xml? Likely Maven-based.
- Many .ts files + angular.json? Likely Angular.
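These heuristics reduce to a marker-file lookup. A sketch, where the marker table is illustrative and easily extended:

```python
# Hypothetical marker-file heuristics; extend to suit your repos.
FRAMEWORK_MARKERS = {
    "package.json": "Node.js / npm ecosystem",
    "pom.xml": "Maven (Java)",
    "angular.json": "Angular",
    "requirements.txt": "Python",
}


def framework_hints(paths: list[str]) -> list[str]:
    """Scan a list of repo paths for well-known framework marker files."""
    names = {p.rsplit("/", 1)[-1] for p in paths}
    return sorted(
        hint for marker, hint in FRAMEWORK_MARKERS.items() if marker in names
    )
```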
This is chain-of-thought reasoning — not reading code.
For Zotero, Javascript dominates the list of high-level, object-oriented languages (557 files). Hence, there is most likely a package.json that will tell us about any frameworks and packages used:
```json
"dependencies": {
    "prop-types": "^15.8.0",
    "react": "^18.3.1",
    ...
},
```
Zotero uses React, but exactly how is the subject of the next level of
metadata detail.
Why This Matters
When documentation is incomplete — which is common in large projects — the source code becomes the documentation.
But raw source code is overwhelming.
Metadata extraction gives you:
- A map before you explore the terrain
- Quantitative insight into structure
- A deterministic starting point
It’s similar in spirit to static analysis tools like:
- Semgrep
- CodeQL
- Sourcegraph
Except here, we’re building our own structural layer.
Knowing which frameworks are used helps us anticipate the components we are likely to come across when we traverse the repo.
Ingesting what we know into a Knowledge Graph - Neo4j
A repository knowledge graph is built incrementally.
We don’t wait until everything is extracted. We start with what we know.
As metadata is collected, it is ingested into the graph. This allows us to immediately test whether the schema is useful.
Can it answer the questions we care about? Does it reveal structure? Does it expose relationships?
If not, the schema evolves.
The graph is both the model and the feedback loop.
The CLI tool command kgtools repo <repo> file_types outputs repo_files.csv.
The exact steps for ingesting this file depend on the type of your Neo4j instance.
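One possible shape for that ingestion, assuming the CSV is reachable by the database and that node property names follow the CSV columns, is a Cypher LOAD CSV statement. A sketch that builds the statement in Python (the property names path and name are my assumption, not a kgtools convention):

```python
def load_csv_cypher(csv_url: str) -> str:
    """Build a Cypher LOAD CSV statement for the repo_files.csv schema.

    Column names match the CSV header: File Type, Directory, File Name.
    Backticks quote the column names because they contain spaces.
    """
    return f"""
LOAD CSV WITH HEADERS FROM '{csv_url}' AS row
MERGE (d:Directory {{path: row.`Directory`}})
MERGE (t:FileType {{name: row.`File Type`}})
MERGE (f:FileName {{name: row.`File Name`, path: row.`Directory`}})
MERGE (d)-[:CONTAINS]->(f)
MERGE (f)-[:IS_TYPE_OF]->(t)
""".strip()
```

Passing the resulting string to your Neo4j instance (browser, cypher-shell, or the Python driver) creates the base graph in one pass.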
This step turns raw metadata into something navigable.
A flat file becomes a graph.
A graph becomes structure.
And structure is what allows us to move from exploration to understanding. For example, a schema like the following:
```mermaid
flowchart TD
    A((Directory))
    B((FileName))
    C((FileType))
    A-->|CONTAINS|B
    B-->|IS_TYPE_OF|C
```
lets us answer the following question:
"What files are co-located in the same folder? Tightly coupled files indicate a particular design philosophy"
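Against that schema, the co-location question becomes a short Cypher query. A sketch, assuming Directory nodes carry a path property and FileName nodes a name property (my assumption, mirroring the CSV columns):

```python
# Cypher to answer: which files share a directory (co-location)?
COLOCATED_QUERY = """
MATCH (d:Directory)-[:CONTAINS]->(f:FileName)
WITH d, collect(f.name) AS files
WHERE size(files) > 1
RETURN d.path AS directory, files
ORDER BY size(files) DESC
"""

# Assuming the official `neo4j` Python driver and a local instance
# (connection details are illustrative):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687",
#                           auth=("neo4j", "password")) as driver:
#     for record in driver.execute_query(COLOCATED_QUERY).records:
#         print(record["directory"], record["files"])
```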
We explore software pattern recognition in subsequent posts.
Next: AST Traversal
Metadata tells us:
Where the logic probably lives.
The Abstract Syntax Tree tells us:
What the logic actually does.
In the next post, I’ll show how to:
- Parse likely application files
- Extract functions, imports, and call relationships
- Convert that into graph-ready data
- Prepare it for ingestion into Neo4j
That’s where structural intelligence really begins.
Closing Thought
When facing an unfamiliar codebase:
Don’t start by reading files.
Start by extracting structure.
Structure becomes a graph.
The graph becomes navigable intelligence.
And that’s how you turn a black box into something explorable.