ErgoCat Project Update: 1/27/26
Clarifying architecture, hardening the edges
It’s been a while since I’ve given an honest project update. In my most recent post, I reported on a major pivot with the UploadHandler module of ErgoCat. To recap, this is the “front door” of the app, the place where the user supplies the image files that are surrogates for the notated music resource being cataloged. Everything that happens in the app stems from the image file input. After moving UploadHandler from a Power Apps screen to a separate desktop applet, this module now works reliably as intended, though the UI needs some polish. Most of my work since the last project update has focused on solidifying how the modules perform in concert: how they interact with the underlying data, where failures can occur, and what those interactions imply for overall performance. This has been a “99% perspiration” work phase, for sure.
Today’s post covers the rest of the pipeline post-upload: an updated explanation of how the layers of the app currently work together, and how chunks of text form the functional throughline of the entire ErgoCat transaction. I’ll also share a more detailed preview of the last major module to be built: AuthorityAgent.
Text Chunks: ErgoCat’s Unit of Currency
At its heart, ErgoCat is an attempt to automate the intellectual process of cataloging and speed up the human’s work by delegating to a machine what it can do reliably faster (and just as, or nearly as, accurately) than the human cataloger can on their own. To do this, the intellectual process has to be broken down into its constituent steps, and the human’s and machine’s roles must be clearly defined. Through this process of designing and building an app from scratch, I’ve discovered that the throughline is the text chunk. This is the “atom” of cataloging physics, if you will. Allow me to elaborate…
Library resources present themselves to the user primarily through text[1], in more or less predictable ways. For typical notated music resources (scores), the venues for this presentation include the title page, cover, caption/first page of music, contents page, etc. Each source has one or more blocks of text, which are the raw material for the cataloger’s intellectual work. Cataloging standards guide the cataloger as to which pieces of information, from which sources, should form the basis of a viable bibliographic description (i.e., the metadata package ingested into an online catalog). But it is up to the cataloger to perform a “technical read” of the resource and detect what those pieces are. Traditional cataloging workflows (even in the age of networked computers) are extremely hands-on, requiring hand keying, navigating around a fielded interface, etc. ErgoCat proposes a new division of labor between human and machine:
UploadHandler ingests images of these source venues. The human cataloger uploads them and determines which images belong to which source.
LayoutDetector uses OCR technology to read and transcribe the chunks of text on each source. The cataloger then reviews these transcribed chunks, decides which are bibliographically relevant, and merges or splits them to form coherent intellectual units.
ChunkClassifier takes those human-filtered chunks of text and assigns them their bibliographic function (title, statement of responsibility, and so on), using a pre-trained transformer model[2], the same architecture that underlies large language models (LLMs); the cataloger reviews and corrects the model’s classification decisions.
MARCBuilder takes those classified and approved chunks and encodes them in a MARC (sub)field based on a detailed set of heuristics, taking into account the function and source of each chunk. This is where the “magic” of the machine side of the transaction really happens, and it’s currently handled in a deterministic way. No LLMs are involved here, so there are no hallucinations. The heuristics in the MARCBuilder script are informed by years of music cataloging expertise and can be refined and adapted as needed to fit the cataloger’s environment. This includes accommodating other resource formats as well as other metadata encoding formats: a BIBFRAMEBuilder could be built with the same logic, just different endpoints. (That’s a mountain to climb some other time.)
AuthorityAgent will read the same chunks processed by MARCBuilder, identify entities that can be represented by controlled terms and access points, and propose them to the cataloger for approval/rejection. The approved authority data is then run back through MARCBuilder and added to the draft MARC-friendly JSON representation, which the cataloger can then download (as proper MARC) and import into their production environment of choice. (More on AuthorityAgent’s design below; a sketch of the chunk record that travels through all of these stages follows this list.)
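To make the chunk-as-throughline idea concrete, here is a minimal sketch, in Python (the language of the Machine Layer), of the kind of record a single chunk might carry as it moves through these stages. The field names are illustrative assumptions, not ErgoCat’s actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    """One block of transcribed text as it travels through the pipeline."""
    text: str                         # transcription from LayoutDetector (OCR)
    source: str                       # e.g., "title page", "cover", "caption"
    keep: bool = False                # set by the cataloger during chunk review
    function: Optional[str] = None    # set by ChunkClassifier, e.g., "title"
    confidence: float = 0.0           # classifier's confidence in that label
    marc_field: Optional[str] = None  # set by MARCBuilder, e.g., "245 ǂa"

# A chunk fresh from OCR...
chunk = Chunk(text="Sonate für Klavier", source="title page")

# ...after human review, machine classification, and MARC mapping:
chunk.keep = True
chunk.function = "title"
chunk.confidence = 0.97
chunk.marc_field = "245 ǂa"
```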
If you’re a regular reader of this blog, please excuse the retreading of things I’ve covered in earlier posts, but I thought it helpful to reiterate the design of ErgoCat through the lens of the text chunk as organizing principle, and of both the human’s and the machine’s roles in interpreting and processing it. This lens makes it easier to understand and justify the architecture of ErgoCat.
The Three Layers of ErgoCat
As someone new to software development, through this project I’m learning the fundamentals as I go. What I describe here may be 101-level stuff to the professional developer, but for my librarian-folk who are less versed and are interested in learning alongside me, perhaps this will be of some use.
My initial vision for ErgoCat was a more-or-less self-contained Power Apps solution that ingested image files and spit out marked-up metadata (through the steps described above, of course). Data was to be created natively and manipulated in that environment until ready for user export. As I have built the app piece by piece, however, I have learned that its structure actually resolves into a three-layered architecture. Here are the layers, from top to bottom.
UI Layer: this is where the human user’s intellectual decisions are made. A series of screens in Power Apps leads the user through a sequence of decisions about which text chunks to keep, what their function is, and where they are represented in the final metadata product. Crucially, the data the user reacts to and annotates/corrects/amends does not actually “live” here, as I naively assumed it would at the outset. Rather, the UI Layer comprises a set of “projections” of data from the Data Layer below, plus several touch points where the Data Layer can be updated with the user’s input. Any text that the user interacts with here is ephemeral and not persisted to the Data Layer until a save button or other mechanism makes that happen.
Data Layer: this is the substrate where all those text chunks actually “live.” The processes within the Machine Layer create and manipulate an array of artifacts, some ephemeral and some durable. The primary pipeline for chunk data is an Excel table that stores all chunks for a given “batch” (cataloging transaction), along with annotations made by the UI Layer and Machine Layer. The data in this table persists until the user (so far, just me) decides it should age out. It is not the final home for the ErgoCat-created metadata, however: the MARCBuilder process creates a version-controlled JSON document from the table’s data, making the table a critical staging area. Any further modifications after MARCBuilder’s first run are made to the JSON, which is the canonical source for MARC data extraction.
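As a concrete illustration of that Excel-to-JSON handoff, here is a minimal sketch assuming a chunks table with columns like batch_id, keep, text, source, and function. The paths, file names, and column names are hypothetical stand-ins, not ErgoCat’s actual artifacts.

```python
import json
from pathlib import Path

import pandas as pd  # reading .xlsx also requires the openpyxl package

# Hypothetical locations; ErgoCat's actual names differ.
CHUNKS_XLSX = Path("data/chunks.xlsx")
STAGING_DIR = Path("data/staging")

def stage_batch(batch_id: str) -> Path:
    """Project one batch's approved chunks out of the Excel table into a
    versioned JSON artifact for MARCBuilder to consume."""
    table = pd.read_excel(CHUNKS_XLSX)
    # "keep" is assumed to be a boolean column set during chunk review.
    batch = table[(table["batch_id"] == batch_id) & (table["keep"])]
    artifact = {
        "batch_id": batch_id,
        "chunks": batch[["text", "source", "function"]].to_dict("records"),
    }
    # Never overwrite: each run writes the next version of the artifact.
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    version = len(list(STAGING_DIR.glob(f"{batch_id}_v*.json"))) + 1
    out = STAGING_DIR / f"{batch_id}_v{version}.json"
    out.write_text(json.dumps(artifact, indent=2, ensure_ascii=False),
                   encoding="utf-8")
    return out
```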
Machine Layer: this is the venue where the application really adds value. Within this layer are scripts that carry out the contracts of UploadHandler, LayoutDetector, ChunkClassifier, MARCBuilder, and (shortly) AuthorityAgent. These scripts are written in Python and are orchestrated by a series of “helper” scripts (in Python or PowerShell) that launch the application, detect and respond to activity in the UI Layer and Data Layer at various points, and shut down the app.
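To give a flavor of that helper-script choreography, here is a minimal sketch of a polling “watcher,” assuming a file-arrival convention like the one described above; the file name is hypothetical.

```python
import time
from pathlib import Path

def wait_for_file(path: Path, timeout: float = 120.0, poll: float = 0.5) -> bool:
    """Block until `path` appears (e.g., a stage's output file), or give up
    after `timeout` seconds. Returns True if the file arrived in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if path.exists():
            return True
        time.sleep(poll)
    return False

# Example: a helper waits on ChunkClassifier's output before nudging the UI.
if wait_for_file(Path("data/staging/batch42_classified.json")):
    print("classification complete; UI may refresh")
else:
    print("timed out; log the failure instead of hanging forever")
```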
This spatial metaphor need not be vertical (like a three-story building), though that’s the most salient image I hold in my head. The metaphor could also be that of a table with pieces of data on it, the human on one side, and the team of robots on the other, all collaboratively shaping those pieces into a coherent whole. Forgive the fanciful imagery; this blog is a space for thinking about the possibilities of automation in the age of AI, so feel free to invent your own imagery! (Extra credit if you decide to describe it in the comments below.)
Work in recent weeks has touched all three of these layers, with a view to tightening up the connections between them, clarifying the data model governing the Data Layer, and debugging/hardening potential points of failure. To summarize a few hard-won lessons from this phase (with some help from ChatGPT):
“Several bugs ultimately traced back to the same root cause: Power Apps galleries and Excel-backed tables were being treated—implicitly—as authoritative state. Reloads, race conditions, and partial writes made it clear that UI state is inherently fragile. This led to a clearer rule: Power Apps may display and edit, but all durable truth lives in versioned JSON artifacts outside the UI layer.”
“As more scripts, flows, and review steps came online, it became obvious that ‘who is allowed to write what, and when’ mattered more than raw performance. Tightening write permissions—Watcher scripts owning machine updates, ReviewUI owning human edits, MARCBuilder owning MARC rendering—eliminated entire classes of silent failure and made the system’s behavior predictable and auditable.”
“Several bugs traced back to implicit assumptions about when one step was ‘done’ and another could safely begin. Introducing explicit stage markers and handoff points—rather than relying on file arrival or UI timing alone—made each transition observable and debuggable, even at the cost of some added latency.”
“When failures occurred inside Power Apps or Power Automate, diagnostic information was often opaque or incomplete. Shifting core logic into scripts that emit plain-text logs and JSON artifacts made it possible to debug the system with standard tools, reinforcing the idea that the UI should never be the only window into system state.”
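The “explicit stage markers” lesson is easiest to see in code. Here is a minimal sketch of one way a marker convention could work, with hypothetical paths and stage names; ErgoCat’s actual mechanism may differ.

```python
import json
import time
from pathlib import Path

MARKER_DIR = Path("data/markers")  # hypothetical location

def mark_done(batch_id: str, stage: str) -> None:
    """Record that a stage finished, separately from its output file, so
    downstream steps don't have to infer readiness from file arrival alone."""
    MARKER_DIR.mkdir(parents=True, exist_ok=True)
    marker = MARKER_DIR / f"{batch_id}.{stage}.done.json"
    marker.write_text(json.dumps({"stage": stage, "finished_at": time.time()}))

def is_done(batch_id: str, stage: str) -> bool:
    """Check whether a stage has explicitly declared itself finished."""
    return (MARKER_DIR / f"{batch_id}.{stage}.done.json").exists()

# ChunkClassifier would call mark_done("batch42", "classify") on success;
# MARCBuilder's watcher checks is_done("batch42", "classify") before running.
```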
Indeed, latency remains a chronic challenge. The UI Layer hangs far too long at certain steps while it waits for a file overwrite to clear or a new file to appear. Addressing this is more of that “99% perspiration” stuff that will consume much of my time for the foreseeable future. I have some ideas in the works, and a trusty ChatGPT assistant who’s helping me troubleshoot. I would be totally at sea without it.
Preview: AuthorityAgent and its Role as Proposer
Perhaps the most significant value proposition of ErgoCat is its projected ability to save the user time in querying authority sources for entities like composers, work titles, genre/form terms, and medium of performance terms. It accomplishes this through a fourfold process:
Scan candidate text chunks already selected for the MARC record earlier in the pipeline, and identify entities of interest. This constraint is intentional: AuthorityAgent never invents data. It works exclusively from descriptive evidence the cataloger has already accepted.
Derive search terms from those entity-like text strings and query the appropriate authority file. For example, “W.A. Mozart” should match the LCNAF authority record with LCCN n80022788 and authorized access point Mozart, Wolfgang Amadeus, ǂd 1756-1791. (See the lookup sketch after this list.)
Present the user (in the UI Layer) with candidate matches, their supporting evidence (i.e., the text chunk itself, its source, and other pertinent details), confidence scores, etc. If multiple matches for a particular entity exceed the confidence threshold, all of them are presented for the user to choose among. The user approves or rejects each match.
Write the selected entities back to the reviewed JSON file that already contains the descriptive fields the user has approved. The MARCBuilder script runs in a special “authority mode” that encodes the entity fields (100, 240, 700, 382, 655, etc.) using the appropriate syntax. Existing fields in the JSON file are left unchanged.
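For the lookup step, here is a minimal sketch against id.loc.gov’s public left-anchored “suggest” service, which returns candidates in OpenSearch Suggestions format. This endpoint is just one plausible stand-in; ErgoCat’s actual query logic, thresholds, and sources may differ.

```python
import requests

SUGGEST_URL = "https://id.loc.gov/authorities/names/suggest/"

def lcnaf_candidates(term: str, limit: int = 5) -> list[dict]:
    """Query the LCNAF suggest service and return candidate authorized
    access points paired with their id.loc.gov URIs."""
    resp = requests.get(SUGGEST_URL, params={"q": term}, timeout=10)
    resp.raise_for_status()
    # OpenSearch Suggestions format: [query, [labels], [descriptions], [uris]]
    _, labels, _, uris = resp.json()
    return [{"label": lbl, "uri": uri} for lbl, uri in zip(labels, uris)][:limit]

# "W.A. Mozart" as transcribed won't left-anchor match; a derived form such
# as "Mozart, Wolfgang Amadeus" is the kind of term step 2 would produce.
for cand in lcnaf_candidates("Mozart, Wolfgang Amadeus"):
    print(cand["label"], "->", cand["uri"])
```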
ChatGPT would like to add: “What matters most here is not the automation itself, but the boundary it respects. AuthorityAgent accelerates lookup and comparison work—the most time-consuming parts of authority control—while preserving the cataloger’s role as the final arbiter. Authority decisions remain explicit, inspectable, and reversible, exactly where they belong.”
The user can then review the amended draft MARC representation on a final Power Apps screen, make any further desired changes, and then call a script that extracts a true MARC encoding (derived from the MARC-like JSON). And that concludes the entire ErgoCat transaction!
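For that final extraction step, a minimal sketch using the pymarc library (assuming its 5.x API) might look like the following; the JSON shape and file names are hypothetical stand-ins for ErgoCat’s MARC-like draft.

```python
import json
from pymarc import Record, Field, Subfield  # assuming pymarc 5.x

def json_to_marc(path: str) -> bytes:
    """Render a MARC-like JSON draft as a binary MARC21 record."""
    with open(path, encoding="utf-8") as fh:
        draft = json.load(fh)
    record = Record()
    # Each draft field is assumed to look like:
    # {"tag": "245", "ind": "10", "subfields": {"a": "Sonate für Klavier"}}
    for f in draft["fields"]:
        record.add_field(Field(
            tag=f["tag"],
            indicators=list(f["ind"]),
            subfields=[Subfield(code=c, value=v)
                       for c, v in f["subfields"].items()],
        ))
    return record.as_marc()

with open("batch42.mrc", "wb") as out:
    out.write(json_to_marc("batch42_reviewed.json"))
```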
I hope to be able to report a successful integration of AuthorityAgent into the production ErgoCat pipeline soon.
AI usage note: for this post, my text and ChatGPT’s text are presented distinctly. I also solicited and incorporated some light editing suggestions for my text and made slight edits to ChatGPT’s text (to remove inaccuracies, redundancies, etc.).
[1] Music resources also present musical notation, which is analogous to text in this framing. A future version of ErgoCat might be able to read the notation and extract bibliographic meaning; for now, this is squarely in the human’s intellectual domain.
[2] The current preferred model for ErgoCat is DistilBERT Base Cased, an open-weight model available on the Hugging Face platform. I have fine-tuned the model with a corpus of labeled examples from real-world catalog records.
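For the curious, here is a minimal sketch of how a fine-tuned DistilBERT checkpoint can be invoked for chunk classification with the Hugging Face transformers library; the model path and label set are hypothetical.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint (path and labels are hypothetical).
classifier = pipeline("text-classification",
                      model="./models/ergocat-distilbert")

chunk = "Sonate für Klavier zu vier Händen"
result = classifier(chunk)[0]  # e.g., {'label': 'title', 'score': 0.97}
print(result["label"], round(result["score"], 2))
```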
