
modern-tar

Why?

The NPM ecosystem has a dependency problem. If you look at the most popular libraries in the registry, many carry immense legacy baggage that all developers have to pay the cost for.

e18e.dev is a movement that aims to clean up and modernize the ecosystem. Huge dependency trees slow down applications, degrade install times, and drastically increase the surface area for supply chain attacks.

As part of my contribution to the community, I wrote modern-tar to demonstrate that we can build a faster, leaner, standards-compliant TAR parser on modern primitives, with zero dependencies and full browser support.

The Architecture

Most historical TAR libraries in the JS ecosystem were tightly coupled to Node.js streams. While powerful, Node streams are a heavy abstraction that ties your business logic to a single runtime.

The Core Engine

To support both browsers and alternative runtimes that interact with the filesystem, I decoupled the parsing logic from the I/O layer entirely.

The core engine is a pure, synchronous state machine. It has no concept of I/O or runtimes. It simply consumes raw Uint8Array chunks and manages the internal parsing states of various TAR implementations.
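To make the idea concrete, here is a minimal sketch of what such a pure push-parser can look like. The names (TarEngine, onHeader) and the simplified buffering are illustrative assumptions, not modern-tar's actual API:

```typescript
// Hypothetical sketch of a pure, synchronous push-parser core.
// No I/O, no runtime-specific types: it only consumes Uint8Array chunks.
type Header = { name: string; size: number };

class TarEngine {
  private pending = new Uint8Array(0);

  constructor(private onHeader: (h: Header) => void) {}

  // Consume a raw chunk and emit a header for each complete 512-byte block.
  push(chunk: Uint8Array): void {
    // Simplified buffering via concatenation; see the ring buffer
    // section below for how allocations are actually avoided.
    const merged = new Uint8Array(this.pending.length + chunk.length);
    merged.set(this.pending);
    merged.set(chunk, this.pending.length);
    let offset = 0;
    while (merged.length - offset >= 512) {
      const block = merged.subarray(offset, offset + 512);
      offset += 512;
      if (block.every((b) => b === 0)) continue; // end-of-archive padding
      this.onHeader(parseHeader(block));
    }
    this.pending = merged.subarray(offset);
  }
}

// Minimal USTAR header fields: name at offset 0 (100 bytes)
// and the octal file size at offset 124 (12 bytes).
function parseHeader(block: Uint8Array): Header {
  const ascii = (start: number, len: number): string => {
    const bytes = block.subarray(start, start + len);
    const end = bytes.indexOf(0);
    return new TextDecoder().decode(end === -1 ? bytes : bytes.subarray(0, end));
  };
  return { name: ascii(0, 100), size: parseInt(ascii(124, 12).trim(), 8) || 0 };
}
```

Because push() is synchronous and takes plain byte arrays, the same engine can be driven by any chunk source, regardless of runtime.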

[Diagram: runtime wrappers feed Uint8Array chunks into the core engine (a pure state machine with no I/O knowledge) from Node.js streams (fs.ReadStream) and from the Web Streams API (ReadableStream).]

This architectural decoupling is what makes the library so flexible. Moving the I/O responsibility to external wrappers allows us to easily maintain code for multiple runtimes such as the browser Web Streams API alongside the Node.js Writable API without code duplication or additional overhead.
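As an illustration of how thin such a wrapper can be, here is a sketch that adapts a synchronous core to the Web Streams API. SyncCore and toWritableStream are hypothetical names standing in for the real engine, not modern-tar's exports:

```typescript
// Stand-in for the pure engine: synchronous, runtime-agnostic.
interface SyncCore {
  push(chunk: Uint8Array): void;
  finish(): void;
}

// The wrapper owns all I/O; the core never sees a stream type.
function toWritableStream(core: SyncCore): WritableStream<Uint8Array> {
  return new WritableStream<Uint8Array>({
    write(chunk) {
      core.push(chunk); // pure, synchronous parsing
    },
    close() {
      core.finish();
    },
  });
}
```

A Node wrapper would look the same with a Writable in place of WritableStream, which is why supporting both runtimes does not duplicate any parsing code.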

Zero-Copy Ring Buffer

One of the biggest performance bottlenecks in parsing is memory allocation and fragmentation. A naive approach to handling incoming stream chunks is to constantly concatenate and reallocate buffers; however, that severely thrashes the garbage collector and spikes CPU usage.

[Diagram: the TAR parser requests a 512-byte block from the ring buffer (an Array<Uint8Array> with head/tail pointers), which returns a zero-copy .subarray() view.]

To solve this, I implemented a custom ring buffer that stores references to incoming memory rather than copying it.

When the engine needs to read a 512-byte header, it calculates the boundaries and returns a zero-copy .subarray() view of the underlying memory. It only allocates new memory if a read request explicitly crosses a chunk boundary. This keeps the memory footprint flat and minimizes GC pressure.
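The fast-path/slow-path split can be sketched as follows. This is a simplified illustration of the idea, assuming a ChunkQueue class that is not modern-tar's actual implementation:

```typescript
// Sketch of a zero-copy chunk queue: stores references to incoming
// memory and only copies when a read spans a chunk boundary.
class ChunkQueue {
  private chunks: Uint8Array[] = [];
  private headOffset = 0; // bytes already consumed from chunks[0]
  private total = 0;

  push(chunk: Uint8Array): void {
    this.chunks.push(chunk); // store a reference, no copy
    this.total += chunk.length;
  }

  // Read `n` bytes, or null if not enough data is buffered yet.
  read(n: number): Uint8Array | null {
    if (this.total < n) return null;
    const first = this.chunks[0];
    if (first.length - this.headOffset >= n) {
      // Fast path: the request fits inside one chunk, so return a
      // zero-copy view of the underlying memory.
      const view = first.subarray(this.headOffset, this.headOffset + n);
      this.headOffset += n;
      if (this.headOffset === first.length) {
        this.chunks.shift();
        this.headOffset = 0;
      }
      this.total -= n;
      return view;
    }
    // Slow path: the request crosses a chunk boundary, so allocate
    // once and assemble the bytes.
    const out = new Uint8Array(n);
    let written = 0;
    while (written < n) {
      const cur = this.chunks[0];
      const take = Math.min(cur.length - this.headOffset, n - written);
      out.set(cur.subarray(this.headOffset, this.headOffset + take), written);
      written += take;
      this.headOffset += take;
      if (this.headOffset === cur.length) {
        this.chunks.shift();
        this.headOffset = 0;
      }
    }
    this.total -= n;
    return out;
  }
}
```

Since TAR blocks are 512 bytes and network chunks are typically much larger, the fast path dominates in practice and most reads allocate nothing at all.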

Concurrency Model

The biggest challenge when packing or extracting thousands of files is that asynchronous I/O is non-deterministic, while the TAR format is strictly sequential. We want to process files in parallel to saturate the CPU, but the archive must still be written and extracted in a well-defined order.

Failure to do so can lead to security vulnerabilities such as TOCTOU (Time-of-Check to Time-of-Use) race conditions, where a malicious archive might attempt to replace a safe directory with a symlink between the time it is validated and the time a file is written into it.

[Diagram: a promise cache map keyed by path ('a/b' → Promise<void>). Worker 1, extracting a/b/file1, misses the cache and caches its promise. Worker 2, extracting a/b/file2, hits the cache and awaits Worker 1's promise. Worker 3, extracting c/file3, misses the cache and executes in parallel.]

To eliminate these path traversal vulnerabilities without sacrificing parallel performance, modern-tar implements a path-caching mechanism backed by a map of promise chains. When multiple workers process files within the same directory tree, the first worker caches its directory creation promise. Subsequent workers operating on overlapping paths retrieve and await that exact promise chain.
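The core of the mechanism can be sketched in a few lines. This is a simplified illustration (the names dirCache and ensureDir are mine, and the symlink validation that the real extractor performs is omitted):

```typescript
import { mkdir } from "node:fs/promises";

// One creation promise per directory path, shared by all workers.
const dirCache = new Map<string, Promise<void>>();

function ensureDir(path: string): Promise<void> {
  let p = dirCache.get(path);
  if (!p) {
    // The first worker to touch this path caches its creation promise...
    p = mkdir(path, { recursive: true }).then(() => undefined);
    dirCache.set(path, p);
  }
  // ...and every later worker on the same path awaits that exact
  // promise, serializing work inside one directory tree while
  // unrelated trees still run fully in parallel.
  return p;
}
```

Because the map is consulted before any filesystem call, a directory is validated and created exactly once, closing the check-then-use window that TOCTOU attacks rely on.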

Promises are highly optimized in modern JavaScript engines, so the result is a parallelized implementation with strict serialization for intersecting directory trees, at barely any overhead.

Status

modern-tar is stable and actively maintained. The library has been adopted by several behemoths in the industry, and its small footprint keeps the maintenance burden low.

As for results, the benchmarks demonstrate that for archives with many small files, we are 1.4x faster than node-tar and 2.6x faster than tar-fs, all while maintaining a zero-dependency footprint and having full browser support.