Confignet

Confignet is a lightweight, pluggable configuration file classifier built in Rust. It identifies CI/CD-related configuration files in a project by matching file names against a labeled CSV dataset using Levenshtein distance.

Built for integration into larger systems like dodo, Confignet allows intelligent automation pipelines to skip irrelevant files and focus only on what matters: CI/CD infrastructure.

  • 🧠 Zero-network AI
  • ⚡ Fast, accurate lookup
  • 🧩 Simple CSV-based extensibility
  • 📦 Available as a library or a CLI tool

Confignet is ideal for:

  • Classifying files detected by file-type scanners (e.g. Magika)
  • Filtering config files before parsing them
  • Auto-generating structured project pipelines

Getting Started

Installation

Add Confignet to your project by putting this line in the [dependencies] section of your Cargo.toml:

confignet = "0.1"

Or install the CLI tool from a local checkout of the repository:

cargo install --path .

CLI Usage

confignet <file_path> <mime_type>

Example:

confignet ./Cargo.toml toml

This will output:

{
  "file_name": "Cargo.toml",
  "file_path": "./Cargo.toml",
  "is_ci_cd": true
}

How It Works

Confignet is powered by a simple but effective heuristic system:

  1. The ConfigClassifier is built from a CSV of known config files with associated MIME types and their labels (e.g., ci_cd or non_config).
  2. When a file is passed to Confignet:
    • It extracts the filename from the full path.
    • It compares it against the CSV using Levenshtein distance on MIME-matched entries.
  3. If a best match is found, the classifier returns:
    • file_name: matched entry name from CSV
    • file_path: reconstructed absolute or relative path
    • is_ci_cd: boolean indicating whether the file is related to CI/CD
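
The steps above can be sketched with the standard library alone. This is a minimal illustration and not the crate's actual implementation: the Record type, the classify_sketch function, and the absence of any distance threshold are assumptions made for the example.

```rust
use std::path::Path;

// Hypothetical record type mirroring the three CSV columns.
struct Record {
    file_name: &'static str,
    mime_label: &'static str,
    config_type: &'static str,
}

/// Two-row dynamic-programming Levenshtein distance over chars.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)   // deletion
                .min(curr[j] + 1);      // insertion
            curr.push(best);
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

/// Returns (matched file name, is_ci_cd) for the closest MIME-matched record.
fn classify_sketch<'a>(
    records: &'a [Record],
    path: &str,
    mime_label: &str,
) -> Option<(&'a str, bool)> {
    // Step 1: extract the file name from the full path.
    let name = Path::new(path).file_name()?.to_str()?;
    records
        .iter()
        // Step 2: only consider entries with the same MIME label.
        .filter(|r| r.mime_label == mime_label)
        // Step 3: pick the entry with the smallest edit distance.
        .min_by_key(|r| levenshtein(name, r.file_name))
        .map(|r| (r.file_name, r.config_type == "ci_cd"))
}

fn main() {
    let records = [
        Record { file_name: "Cargo.toml", mime_label: "toml", config_type: "ci_cd" },
        Record { file_name: ".dockerignore", mime_label: "text", config_type: "non_config" },
    ];
    println!("{:?}", classify_sketch(&records, "./Cargo.toml", "toml"));
}
```

Because this sketch applies no distance threshold, any file whose MIME label appears in the dataset matches something. A real classifier may reject matches that are too far away.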

It is designed for speed, accuracy, and pluggability in environments like local inference pipelines.

Integration in Projects

Confignet is designed to be embedded easily.

As a Library

Import it in your Rust project:

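use confignet::ConfigClassifier;

// Assumes the crate's error type converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the labeled CSV, then classify a file name with its MIME label.
    let classifier = ConfigClassifier::from_csv("data/labeled/ci_cd.csv")?;
    let result = classifier.classify("Cargo.toml", "toml");
    Ok(())
}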

As a CLI in Automation Pipelines

Use Magika (or a similar tool) to detect file types:

magika path/to/file | jq '.mimetype'

Then pass the result to Confignet:

confignet path/to/file toml

Pipe JSON output to your parser or decision logic.

In dodo

Confignet is integrated directly into dodo to:

  • Skip non-CI/CD files
  • Send CI/CD-related configs to parsers
  • Build dodo.toml incrementally

Classifier Format

The classifier CSV should follow this format:

file_name,mime_label,config_type
Cargo.toml,toml,ci_cd
.github/workflows/ci.yml,yaml,ci_cd
.dockerignore,text,non_config

Columns:

  • file_name: file name to match against
  • mime_label: MIME label from a scanner
  • config_type: a label such as ci_cd or non_config
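
As an illustration of the row format, the sketch below parses such a file into records using only the standard library. It is a simplification under stated assumptions: the local ConfigRecord type is a stand-in with the three documented columns, parse_rows is a hypothetical helper, and the naive comma split does not handle quoted fields the way a real CSV parser would.

```rust
// Local stand-in with the three documented columns.
#[derive(Debug, PartialEq)]
struct ConfigRecord {
    file_name: String,
    mime_label: String,
    config_type: String,
}

/// Parses header + rows; returns an error naming the first malformed line.
fn parse_rows(csv: &str) -> Result<Vec<ConfigRecord>, String> {
    let mut records = Vec::new();
    // Skip the header row (index 0); idx + 1 is the 1-based line number.
    for (idx, line) in csv.lines().enumerate().skip(1) {
        if line.trim().is_empty() {
            continue;
        }
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        match fields.as_slice() {
            [name, mime, kind] => records.push(ConfigRecord {
                file_name: name.to_string(),
                mime_label: mime.to_string(),
                config_type: kind.to_string(),
            }),
            _ => {
                return Err(format!(
                    "line {}: expected 3 fields, found {}",
                    idx + 1,
                    fields.len()
                ))
            }
        }
    }
    Ok(records)
}

fn main() {
    let csv = "file_name,mime_label,config_type\n\
               Cargo.toml,toml,ci_cd\n\
               .dockerignore,text,non_config\n";
    match parse_rows(csv) {
        Ok(rows) => println!("parsed {} rows", rows.len()),
        Err(e) => eprintln!("invalid CSV: {e}"),
    }
}
```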

Tips:

  • Avoid duplicate file names unless necessary
  • Normalize paths (e.g. .github/workflows/*.yml)
  • Keep MIME labels lowercase and simplified

The CSV is extensible. The more diverse your dataset, the more robust your classification becomes.

API Reference

This page documents the public API of the Confignet library. If you are embedding Confignet into another tool (like dodo), you’ll primarily interact with the ConfigClassifier type.


Structs

ConfigRecord

A deserialized record from the classifier CSV.

pub struct ConfigRecord {
    pub file_name: String,
    pub mime_label: String,
    pub config_type: String,
}

Fields:

  • file_name: The canonical file name for comparison (e.g. Cargo.toml)
  • mime_label: The MIME type label assigned to the file (e.g. toml, yaml)
  • config_type: A label such as ci_cd, build, or non_config

This struct is used internally by the classifier.


ConfigClassifier

The main classifier struct that loads and queries classification rules.

pub struct ConfigClassifier {
    // Hidden internals
}

Constructor

pub fn from_csv<P: AsRef<Path>>(path: P) -> Result<Self>

Loads a ConfigClassifier from a given CSV file.

  • path: The path to the .csv file
  • Returns: Result<ConfigClassifier>

Usage:

let classifier = ConfigClassifier::from_csv("data/labeled/ci_cd.csv")?;

Method

pub fn classify(&self, file_name: &str, mime_label: &str) -> Option<ClassifiedResult>

Attempts to classify a file given its name and MIME type label.

  • file_name: Name of the file (e.g., main.rs, Dockerfile)
  • mime_label: MIME type label from tools like Magika (e.g., toml, json)
  • Returns: Some(ClassifiedResult) for the best match, or None if no suitable match is found

Example:

let result = classifier.classify("Cargo.toml", "toml");

Structs

ClassifiedResult

Returned from classify() if a match is found.

pub struct ClassifiedResult {
    pub file_name: String,
    pub is_ci_cd: bool,
}

Fields:

  • file_name: The best-matching canonical file name (e.g., from CSV)
  • is_ci_cd: Whether this file is used for CI/CD based on config_type

Internal Utilities

Confignet also includes a Levenshtein distance utility for fuzzy file matching:

fn levenshtein(a: &str, b: &str) -> usize

This is used internally in classify() to find the closest file name in the dataset when multiple candidates share the same MIME label.


Example Integration

use confignet::ConfigClassifier;

// Assumes the crate's error type converts into Box<dyn Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let classifier = ConfigClassifier::from_csv("data/labeled/ci_cd.csv")?;
    let result = classifier.classify("Dockerfile.ci", "text");

    // classify() returns None when no suitable match exists.
    match result {
        Some(r) => println!("File: {}, Is CI/CD? {}", r.file_name, r.is_ci_cd),
        None => println!("Unrecognized file"),
    }
    Ok(())
}

Troubleshooting

❌ Error: No match found

  • Ensure the MIME type is correct
  • Add more diverse entries to the CSV
  • Normalize file names

❌ Panic: Failed to extract file name

  • Ensure you are passing valid paths
  • Use Path::file_name (also available on PathBuf) to extract names reliably
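
A minimal sketch of extracting a file name without panicking, using std::path::Path. The helper name file_name_of is hypothetical. It returns None instead of panicking when the path has no final component or is not valid UTF-8.

```rust
use std::path::Path;

/// Returns the final path component as &str, or None if there is none.
fn file_name_of(path: &str) -> Option<&str> {
    Path::new(path).file_name()?.to_str()
}

fn main() {
    // Handle the None case explicitly rather than unwrapping.
    for p in ["./Cargo.toml", ".github/workflows/ci.yml", ".."] {
        match file_name_of(p) {
            Some(name) => println!("{p} -> {name}"),
            None => println!("{p} -> no file name"),
        }
    }
}
```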

❌ Invalid CSV format

  • Check for unescaped commas or quotes
  • All rows must follow file_name,mime_label,config_type

❌ All results return is_ci_cd: false

  • Check your config_type column values
  • Add more known CI/CD examples to improve accuracy

✅ Tip

Use tools like magika, file, or xdg-mime to generate MIME labels.
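
Some of these tools emit full MIME types (for example application/x-yaml) rather than short labels. The sketch below shows one possible way to reduce such values to lowercase, simplified labels. The normalize_mime_label helper and its rules are assumptions for illustration, and the result still has to match whatever labels your CSV actually uses.

```rust
/// Reduces a raw MIME string to a short lowercase label (illustrative rules).
fn normalize_mime_label(raw: &str) -> String {
    // Drop parameters such as "; charset=utf-8".
    let base = raw.split(';').next().unwrap_or("").trim();
    // Keep only the subtype of "type/subtype" values.
    let subtype = base.rsplit('/').next().unwrap_or(base);
    let lower = subtype.to_ascii_lowercase();
    // Strip the experimental "x-" prefix if present.
    lower.strip_prefix("x-").unwrap_or(lower.as_str()).to_string()
}

fn main() {
    for raw in ["application/x-yaml", "Text/Plain; charset=utf-8", "TOML"] {
        println!("{raw} -> {}", normalize_mime_label(raw));
    }
}
```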