Extract repo metadata using the GitHub API
When you inherit a large, unfamiliar codebase, the hardest part is not reading the code.
It’s knowing where to start.
What Do We Actually Want to Know?
Knowledge Graphs (KGs) are built in stages, so starting with basic repo questions lays the foundation for the next level of granularity, akin to chain-of-thought reasoning:
- How big is the repo (number of files, size in kB)?
- What's the directory structure?
- What types of files are in this repository?
- Which language(s) likely contain the application logic?
- Can I detect a framework?
Answers to these questions come from two sources:
- Direct queries against repository metadata via the GitHub API, and
- Traversing the repo directory
Querying the Repo's Metadata
GitHub already knows a lot about a repository:
- Default branch
- Primary language
- Repo size
- Directory tree
Not only can we extract this info programmatically, we can also start to build the App KG schema.
Diagram: Schematic of what the metadata CLI commands do
```mermaid
sequenceDiagram
    participant C as PYTHON CLI
    participant E as GITHUB
    participant F as NEO4J
    C->>E: 1. Query repo for metadata
    E->>C: 2. Traverse repo, download directory + file names
    C->>C: 3. Classify file types
    C->>F: 4. Create base KG (directories, file names and file types)
```
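Step 1 of the diagram maps to a single GitHub REST API call, `GET /repos/{owner}/{repo}`, whose response includes the `size` (in kB), `default_branch`, `language`, and `visibility` fields. A minimal sketch, independent of the kgtools implementation:

```python
import json
import urllib.request


def fetch_repo_metadata(owner: str, repo: str) -> dict:
    """Query the GitHub REST API for a repository's metadata."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarise(metadata: dict) -> dict:
    """Keep only the fields that seed the base KG schema."""
    return {
        "default_branch": metadata.get("default_branch"),
        "language": metadata.get("language"),
        "size_kb": metadata.get("size"),  # GitHub reports size in kB
        "visibility": metadata.get("visibility"),
    }
```

Unauthenticated calls are rate-limited, so a production tool would also pass an `Authorization` header with a token.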
Supporting GitHub Repository
Source code for all the KG tools in this post is in the GitHub repo Knowledge Graph Tools, which provides a range of CLI tools for creating a codebase KG. To illustrate the functionality of each of these tools, I'll tear down the large open-source content management application Zotero and its underlying codebase.
How big is the Zotero repo?
Repo size, measured in tokens, gives a first indication of how feasible it is to feed an entire repo into an LLM in one shot.
Although tokenising a codebase is LLM specific, we can get a preliminary estimate of the number of tokens by using the following rule-of-thumb:
tokens ≈ repo_size_kb × 250
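The rule of thumb is a one-liner; a sketch, with the Zotero figures from later in this post plugged in:

```python
def estimate_tokens(repo_size_kb: int, tokens_per_kb: int = 250) -> int:
    """Rule-of-thumb token estimate from repo size in kB."""
    return repo_size_kb * tokens_per_kb


# Zotero: 231,586 kB -> roughly 57.9 million tokens
print(f"{estimate_tokens(231586) / 1e6:.1f}m")  # → 57.9m
```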
The kB size of any repo is readily available from the repo metadata - which can be extracted using the following CLI tool command:
kgtools repo <repo> size
Running this for the Zotero repo gives us:
```
> kgtools repo zotero/zotero size
231586
Estimated tokens ≈ 57.9m
```
Clearly, no LLM can ingest the entire repo in a single-shot prompt.
Token estimation is non-trivial
Using LLMs in any code-development workflow has cost implications. While some LLM vendors provide their own token estimators, Peta Muir shows how to build your own in her post Counting Claude Tokens Without a Tokenizer.
Other repo metadata stats
Extracting other metadata stats can add more context to the term repo size:
```
> kgtools repo --help
Usage: kgtools repo [OPTIONS] REPOSITORY COMMAND [ARGS]...

  Repository operations.

  Example: kgtools repo zotero/zotero description

Options:
  --help  Show this message and exit.

Commands:
  default_branch
  description
  directory_count
  file_count
  file_types
  get_directory_file_count
  language
  size
  stars
  stats
  visibility
```
What does the repo file structure look like?
By traversing the repository tree, we can:
- Count file types, classifying files by extension
- Identify dominant languages
- Detect framework hints (e.g., package.json, pom.xml, requirements.txt)
- Identify directories likely containing application logic
The CLI tool command kgtools repo <repo> file_types gives two outputs:
- Terminal output of file type counts for manual review, and
- A CSV file of File Type, Directory, and File Name, ready to be uploaded into Neo4j
Running this CLI command for the Zotero repo gives us:
```
> kgtools repo zotero/zotero file_types

FILE_TYPES
----------
total_files: 3407
directory_count: 271
output_csv: repo_files.csv

File Type Distribution:
  SVG: 764
  FreeMarker_Java_template: 625
  Javascript: 557
  DTD_XML_def: 482
  CSS: 199
  Java_prop: 146
  Other: 108
  Header: 107
  C++: 77
  JSON: 63
  XHTML: 52
  PNG: 41
  ...
```
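Under the hood, the classification step is an extension lookup plus a CSV writer. A sketch of the idea, where the extension map is illustrative rather than the actual kgtools table:

```python
import csv
from pathlib import PurePosixPath

# Hypothetical extension-to-type mapping; the real table is much larger.
EXTENSION_TYPES = {
    ".js": "Javascript",
    ".css": "CSS",
    ".svg": "SVG",
    ".json": "JSON",
    ".png": "PNG",
}


def classify(path: str) -> str:
    """Map a repo path to a coarse file type via its extension."""
    return EXTENSION_TYPES.get(PurePosixPath(path).suffix.lower(), "Other")


def write_csv(paths: list[str], out_file: str) -> None:
    """Emit the File Type / Directory / File Name rows for Neo4j."""
    with open(out_file, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["File Type", "Directory", "File Name"])
        for p in paths:
            pp = PurePosixPath(p)
            writer.writerow([classify(p), str(pp.parent), pp.name])
```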
What This Reveals
Even before parsing individual files, repo metadata alone can tell us:
- Whether this is frontend-heavy (many .js/.ts)
- Whether this is backend-heavy (many .py/.java)
- Whether it has a monorepo structure
- Whether multiple frameworks coexist
- Whether tests dominate the codebase
- Whether build tooling is complex
This immediately narrows down where the actual application logic likely lives.
For example:
- Many .js files + a chrome/ directory? Possibly browser-extension architecture.
- Many .java files + pom.xml? Likely Maven-based.
- Many .ts files + angular.json? Likely Angular.
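These heuristics reduce to a marker-file lookup. A sketch, where the marker table is illustrative and easily extended:

```python
# Hypothetical marker-file heuristics; extend to suit your repos.
FRAMEWORK_MARKERS = {
    "package.json": "Node.js / npm ecosystem",
    "pom.xml": "Maven (Java)",
    "angular.json": "Angular",
    "requirements.txt": "Python",
}


def framework_hints(paths: list[str]) -> list[str]:
    """Scan a list of repo paths for well-known framework marker files."""
    names = {p.rsplit("/", 1)[-1] for p in paths}
    return sorted(
        hint for marker, hint in FRAMEWORK_MARKERS.items() if marker in names
    )
```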
This is chain-of-thought reasoning — not reading code.
For Zotero, Javascript dominates the list of high-level, object-oriented languages (557 files). Hence, there is most likely a package.json that will tell us about any frameworks and packages used:
```json
"dependencies": {
    "prop-types": "^15.8.0",
    "react": "^18.3.1",
    ...
},
```
Zotero uses React, but exactly how is the subject of the next level of
metadata detail.
Why This Matters
When documentation is incomplete — which is common in large projects — the source code becomes the documentation.
But raw source code is overwhelming.
Metadata extraction gives you:
- A map before you explore the terrain
- Quantitative insight into structure
- A deterministic starting point
It’s similar in spirit to static analysis tools like:
- Semgrep
- CodeQL
- Sourcegraph
Except here, we’re building our own structural layer.
Knowing which frameworks are used helps us anticipate the components we are likely to come across when we traverse the repo.
Ingesting what we know into a Knowledge Graph - Neo4j
A repository knowledge graph is built incrementally.
We don’t wait until everything is extracted. We start with what we know.
As metadata is collected, it is ingested into the graph. This allows us to immediately test whether the schema is useful.
Can it answer the questions we care about? Does it reveal structure? Does it expose relationships?
If not, the schema evolves.
The graph is both the model and the feedback loop.
The CLI tool command kgtools repo <repo> file_types outputs repo_files.csv.
The exact steps for ingesting this file depend on the type of your Neo4j instance.
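One possible shape for that ingestion, assuming the CSV is reachable by the database and that node property names follow the CSV columns, is a Cypher LOAD CSV statement. A sketch that builds the statement in Python (the property names path and name are my assumption, not a kgtools convention):

```python
def load_csv_cypher(csv_url: str) -> str:
    """Build a Cypher LOAD CSV statement for the repo_files.csv schema.

    Column names match the CSV header: File Type, Directory, File Name.
    Backticks quote the column names because they contain spaces.
    """
    return f"""
LOAD CSV WITH HEADERS FROM '{csv_url}' AS row
MERGE (d:Directory {{path: row.`Directory`}})
MERGE (t:FileType {{name: row.`File Type`}})
MERGE (f:FileName {{name: row.`File Name`, path: row.`Directory`}})
MERGE (d)-[:CONTAINS]->(f)
MERGE (f)-[:IS_TYPE_OF]->(t)
""".strip()
```

Passing the resulting string to your Neo4j instance (browser, cypher-shell, or the Python driver) creates the base graph in one pass.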
This step turns raw metadata into something navigable.
A flat file becomes a graph.
A graph becomes structure.
And structure is what allows us to move from exploration to understanding. For example, a schema like the following:
```mermaid
flowchart TD
    A((Directory))
    B((FileName))
    C((FileType))
    A-->|CONTAINS|B
    B-->|IS_TYPE_OF|C
```
lets us answer the following question:
"What files are co-located in the same folder? Tightly coupled files indicate a particular design philosophy"
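Against that schema, the co-location question becomes a short Cypher query. A sketch, assuming Directory nodes carry a path property and FileName nodes a name property (my assumption, mirroring the CSV columns):

```python
# Cypher to answer: which files share a directory (co-location)?
COLOCATED_QUERY = """
MATCH (d:Directory)-[:CONTAINS]->(f:FileName)
WITH d, collect(f.name) AS files
WHERE size(files) > 1
RETURN d.path AS directory, files
ORDER BY size(files) DESC
"""

# Assuming the official `neo4j` Python driver and a local instance
# (connection details are illustrative):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687",
#                           auth=("neo4j", "password")) as driver:
#     for record in driver.execute_query(COLOCATED_QUERY).records:
#         print(record["directory"], record["files"])
```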
We explore software pattern recognition in subsequent posts.
Next: AST Traversal
Metadata tells us:
Where the logic probably lives.
The Abstract Syntax Tree tells us:
What the logic actually does.
In the next post, I’ll show how to:
- Parse likely application files
- Extract functions, imports, and call relationships
- Convert that into graph-ready data
- Prepare it for ingestion into Neo4j
That’s where structural intelligence really begins.
Closing Thought
When facing an unfamiliar codebase:
Don’t start by reading files.
Start by extracting structure.
Structure becomes a graph.
The graph becomes navigable intelligence.
And that’s how you turn a black box into something explorable.