Performance

ecsia ships a reproducible benchmark harness and publishes its actual output — including the cases where a single-purpose library beats us. The tables below are generated by pnpm bench:report, which writes both bench/RESULTS.json (the machine-readable artifact, with a full environment block) and the markdown this page includes. Re-running the command regenerates both, so the page can never disagree with the artifact.

Numbers will vary

These are wall-clock measurements (real elapsed time) on one machine at one moment. Your CPU, Node version, thermal state, and background load will move them — sometimes a lot. Treat the shapes (which loop is faster, where the worker curve crosses 1×) as the durable story; treat the absolute milliseconds as a snapshot. The environment block above each table records exactly what produced it.

Methodology

What each loop does. The single-thread iteration bench is the classic ECS hot loop — the small stretch of code that runs over every entity, every frame, and therefore dominates the cost. Here it adds each entity's Velocity to its Position, one frame per timed op. The inner loop is allocation-free — it creates no objects while running — so the measurement is storage and iteration cost, not garbage collection.

ecsia .each — the ergonomic accessor path: per-row proxy objects, write-log aware. Its readable e.position.x += … body can be compiled to the fast column loop with compile — same syntax, near-eachChunk speed.
ecsia eachChunk — the opt-in column cursor. ecsia stores each component field in its own contiguous array (a column), a layout called Structure-of-Arrays (SoA). eachChunk hands you those raw typed-array columns plus a row span, and the loop indexes Float32Array directly — bypassing the per-row accessor and the reactivity write-log push.
ecsia bindColumns — the bind-once fast path: the same raw columns as eachChunk, but resolved once up front instead of on every call, so the engine can compile the loop with them baked in. See Bind your loop once below.
miniplex — array-of-objects iteration.
bitECS — a raw SoA loop over a flat query result. The reference single-thread baseline.
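To make the storage difference between these rows concrete, here is a minimal sketch of the same integrate step over an array of objects and over Structure-of-Arrays columns. It is independent of ecsia, miniplex, and bitECS; the names `Particle`, `Columns`, `integrateAoS`, and `integrateSoA` are illustrative assumptions, not any library's API.

```typescript
// Array-of-objects: one heap object per entity, fields reached through property loads.
interface Particle { x: number; y: number; dx: number; dy: number }

function integrateAoS(particles: Particle[]): void {
  for (const p of particles) {
    p.x += p.dx
    p.y += p.dy
  }
}

// Structure-of-Arrays: one contiguous typed array (a column) per field.
interface Columns {
  x: Float32Array
  y: Float32Array
  dx: Float32Array
  dy: Float32Array
  count: number
}

function integrateSoA(c: Columns): void {
  const { x, y, dx, dy, count } = c
  for (let i = 0; i < count; i++) {
    x[i] = x[i]! + dx[i]!
    y[i] = y[i]! + dy[i]!
  }
}

// The same three entities in both layouts.
const particles: Particle[] = [0, 1, 2].map((i) => ({ x: i, y: 2 * i, dx: 1, dy: 0.5 }))
const cols: Columns = {
  x: Float32Array.from([0, 1, 2]),
  y: Float32Array.from([0, 2, 4]),
  dx: Float32Array.from([1, 1, 1]),
  dy: Float32Array.from([0.5, 0.5, 0.5]),
  count: 3,
}

integrateAoS(particles)
integrateSoA(cols)
```

Both loops compute the same positions; what the iteration bench measures is how much the layout and the access path cost around that arithmetic.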

Before timing, the harness cross-validates one full step of eachChunk and of bindColumns against .each at the bench's own N (which crosses the 1024-row column-growth boundary). A fast-but-wrong loop fails the run instead of silently reporting a misleading number.
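The shape of such a guard can be sketched as follows. This is an assumption about what a cross-validation step looks like, not the harness's actual code: `crossValidate`, `referenceStep`, and `fastStep` are hypothetical names, and the comparison is deliberately bit-exact.

```typescript
type Step = (pos: Float32Array, vel: Float32Array) => void

// Reference path: a plain, obviously correct loop.
const referenceStep: Step = (pos, vel) => {
  for (let i = 0; i < pos.length; i++) pos[i] = pos[i]! + vel[i]!
}

// Candidate fast path: unrolled by two; it must produce the same values.
const fastStep: Step = (pos, vel) => {
  const n = pos.length
  let i = 0
  for (; i + 1 < n; i += 2) {
    pos[i] = pos[i]! + vel[i]!
    pos[i + 1] = pos[i + 1]! + vel[i + 1]!
  }
  if (i < n) pos[i] = pos[i]! + vel[i]!
}

// Run one full step of each path on identical inputs and compare every slot.
function crossValidate(candidate: Step, reference: Step, n: number): void {
  const vel = new Float32Array(n)
  for (let i = 0; i < n; i++) vel[i] = (i % 7) * 0.25
  const a = new Float32Array(n).fill(1)
  const b = new Float32Array(n).fill(1)
  candidate(a, vel)
  reference(b, vel)
  for (let i = 0; i < n; i++) {
    if (!Object.is(a[i], b[i])) {
      throw new Error(`row ${i}: candidate ${a[i]} != reference ${b[i]}`)
    }
  }
}

// Validate at a size that crosses the 1024-row boundary before timing anything.
crossValidate(fastStep, referenceStep, 1025)
```

Throwing, rather than logging, is the point: a divergent fast path aborts the run before any number can be published.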

Why per-entity work matters for the speedup bench. The worker-pool bench is not a toy doing trivial arithmetic per entity. Each entity runs an iterated damped-oscillator integrator (a small spring-physics simulation): hundreds of sub-steps of expensive math calls (sin/cos/exp) per frame. That is deliberate. Parallel speedup only appears once each wave — a batch of systems that can safely run at the same time because none writes data another touches — amortizes the coordination cost, meaning the per-frame work is large enough that the dispatch overhead (the fixed cost of handing work to worker threads) and the synchronization between waves stop mattering. A trivial body would be pure overhead and would never beat single-thread — which is exactly why benchmarks that show "linear scaling" on trivial work are misleading. We make the work heavy enough to be honest about where the crossover actually is.
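The notion of a wave can be made concrete with a small sketch. It is not ecsia's scheduler: `SystemDecl`, `conflicts`, and `planWaves` are hypothetical names, and the greedy placement below is one simple way to group systems so that no two systems in a wave touch data the other writes.

```typescript
interface SystemDecl { id: string; reads: string[]; writes: string[] }

// Two systems conflict when either one writes a component the other reads or writes.
function conflicts(a: SystemDecl, b: SystemDecl): boolean {
  const touches = (s: SystemDecl, c: string) => s.reads.includes(c) || s.writes.includes(c)
  return a.writes.some((c) => touches(b, c)) || b.writes.some((c) => touches(a, c))
}

// Greedy placement that preserves declaration order between conflicting systems:
// each system lands in the first wave after the last wave holding a conflicting system.
function planWaves(systems: SystemDecl[]): SystemDecl[][] {
  const waves: SystemDecl[][] = []
  for (const s of systems) {
    let target = 0
    waves.forEach((wave, w) => {
      if (wave.some((other) => conflicts(s, other))) target = w + 1
    })
    if (target === waves.length) waves.push([])
    waves[target]!.push(s)
  }
  return waves
}

const plan = planWaves([
  { id: 'integrateA', reads: ['VelocityA'], writes: ['PositionA'] },
  { id: 'integrateB', reads: ['VelocityB'], writes: ['PositionB'] },
  { id: 'bounds', reads: ['PositionA', 'PositionB'], writes: ['Bounds'] },
])
// plan[0] holds integrateA and integrateB (independent); plan[1] holds bounds.
```

Every wave boundary is a synchronization point, which is why the per-wave work has to be large before running a wave in parallel pays for itself.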

Single-process discipline. Every bucket runs sequentially in one process. The iteration bench uses tinybench with a fixed time budget; the worker-pool bench is the only thing that spawns OS threads, and it does so one configuration at a time. No two measurements compete for cores.

Environment disclosure. bench/RESULTS.json and the table header carry the Node version, CPU model and logical core count, the date, and the commit SHA. A number without its machine is not a number.

Results

Environment. AMD Ryzen 9 7950X3D 16-Core Processor (32 logical cores) · Node v24.11.0 · 2026-06-07 · commit 9ffc021 · bitECS 0.4.0 · miniplex 2.0.0 · tinybench 6.0.2

Single-thread iteration

Each loop adds every entity's velocity to its position, over 50,000 entities per op. ns per entity is mean op time divided by entity count (nanoseconds per entity; lower is faster). ratio vs bitECS is bitECS ops/s ÷ this row's ops/s, so a value below 1.00x means the row is faster than bitECS. The ecsia bindColumns row compiles a specialized loop per matched archetype (re-evaluating the factory into a fresh function so V8 keeps it on the fast path), which holds through storage growth with no pre-sizing; where dynamic compilation is forbidden (strict CSP / locked sandbox) it falls back to a plain interpreted loop.

| loop | ops/s | ms/op | ns per entity | ratio vs bitECS |
| --- | --- | --- | --- | --- |
| ecsia .each | 1,974 | 0.5070 | 10.14 | 7.53x |
| ecsia eachChunk | 13,663 | 0.0733 | 1.47 | 1.09x |
| ecsia bindColumns | 20,569 | 0.0487 | 0.97 | 0.72x |
| miniplex | 1,661 | 0.6076 | 12.15 | 8.95x |
| bitECS | 14,871 | 0.0673 | 1.35 | 1.00x (baseline) |
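As a worked check of how the derived columns relate to the measured ones, the sketch below recomputes them from the published values. The helper names are illustrative. Because the table shows rounded ops/s and ms/op, recomputing from those rounded values can differ from the published figure in the last digit.

```typescript
const ENTITIES = 50_000

// ns per entity = mean op time / entity count (1 ms = 1e6 ns).
function nsPerEntity(msPerOp: number, entities: number): number {
  return (msPerOp * 1e6) / entities
}

// ratio vs baseline = baseline ops/s / this row's ops/s.
// Below 1.00x means the row is faster than the baseline.
function ratioVsBaseline(baselineOpsPerSec: number, rowOpsPerSec: number): number {
  return baselineOpsPerSec / rowOpsPerSec
}

// Inputs copied from the published single-thread iteration table.
const eachNs = nsPerEntity(0.5070, ENTITIES)      // about 10.14 ns
const eachRatio = ratioVsBaseline(14_871, 1_974)  // about 7.53x (slower than bitECS)
const bindRatio = ratioVsBaseline(14_871, 20_569) // about 0.72x (faster than bitECS)
```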

Tracked-write cost — the same .each loop with a .changed() filter attached and drained each frame (the change-tracking overhead you opt into for reacting to changes). The frame is scheduler-driven and the harness asserts the filter drains every one of the frame's writes before a number is published:

| loop | ops/s | ms/op | ns per entity | ratio vs bitECS |
| --- | --- | --- | --- | --- |
| ecsia .each + .changed() | 124 | 8.0494 | 160.99 | 119.47x |

Worker-pool speedup

Real node:worker_threads + Atomics. 8 independent Body groups × 1,024 entities (8,192 total), 512 sub-steps of expensive math (sin/cos/exp) per entity per frame, 60 frames. Speedup is single-thread wall-clock time ÷ this row's. byte-identical confirms the threaded run's sum-of-fields checksum equals the single-thread run's.

Single-thread baseline: 12001.4 ms.

| workers | wall ms | speedup vs 1 thread | byte-identical |
| --- | --- | --- | --- |
| 1 | 12163.3 | 0.99x | yes |
| 2 | 6303.3 | 1.90x | yes |
| 4 | 3319.9 | 3.62x | yes |
| 8 | 1879.8 | 6.38x | yes |
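The speedup column is a single division, shown here as a sketch over the published wall-clock values. The `efficiency` field (speedup divided by worker count) is a derived figure added for illustration and is not part of the published table. Recomputing from the rounded milliseconds can differ from the published speedup in the last digit.

```typescript
const BASELINE_MS = 12001.4

// [workers, wall-clock ms] copied from the published worker-pool table.
const rows: Array<[number, number]> = [
  [1, 12163.3],
  [2, 6303.3],
  [4, 3319.9],
  [8, 1879.8],
]

const report = rows.map(([workers, wallMs]) => {
  const speedup = BASELINE_MS / wallMs // single-thread time / this row's time
  return { workers, speedup, efficiency: speedup / workers }
})
```

At one worker the speedup is just under 1x, which is the dispatch and wave-sync overhead showing up with nothing to offset it.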

Honest analysis

bitECS wins the default-path comparison. Its flat SoA loop out-iterates both .each and eachChunk, and we do not pretend otherwise. If your entire workload is one tight integrate loop on a single thread and you never reach for bindColumns, bitECS is the fastest tool here.
ecsia bindColumns beats bitECS on this bench (~0.7×). It compiles a specialized loop per matched archetype — the same raw-typed-array shape bitECS uses with one less indirection (ecsia walks its rows densely where bitECS indexes through an entity list), kept on V8's fast path by re-evaluating your factory into a fresh function per archetype. Because that recompiles on growth, the edge holds with no pre-sizing and no after-growth penalty; the one rule is a self-contained factory (deps via ctx). See Bind your loop once.
ecsia .each beats miniplex. The ergonomic accessor path — proxies, write-log awareness, and all — still out-iterates miniplex's array-of-objects walk. You do not pay for ecsia's ergonomics by dropping below the closest ergonomic competitor.
ecsia eachChunk lands within ~1.1× of bitECS on a modern V8. The column cursor re-resolves its columns every call — that re-resolution is what keeps it safe under storage growth with zero setup; bindColumns is the version that compiles the loop and pulls ahead of bitECS.
The tracked-write row is the cost you opt into. Attaching a .changed() filter and draining it each frame is markedly more expensive than the bare integrate loop — that is the write-log doing real work so reactivity, deltas, and change observers are available. You pay it only when you ask for it; the plain .each and eachChunk rows are what you get when you don't.
The worker curve is the capability nobody else ships. No mainstream JS ECS ships a real worker_threads + Atomics auto-parallel scheduler. The speedup column crosses 1× between 1 and 2 workers: at one worker you pay dispatch and wave-sync overhead for nothing (it is slower than the single-thread executor — dispatch overhead is real and we show it), and the parallel win only materializes once a second worker shares the load. The byte-identical column confirms every threaded run produces byte-for-byte the same result as the single-thread run; the speedup is genuine parallelism, not a relaxed-correctness shortcut.
This holds at any column size. In a threaded world, each column lives in a SharedArrayBuffer — memory several threads can read and write at once — with address space reserved up front for INITIAL_ROWS 64 × GROWTH_RESERVE_FACTOR 16 = 1024 rows. When a column grows past that reservation, it re-backs: it moves to a new, larger SharedArrayBuffer. The pool then re-wraps every worker's view of that column at the wave fence (the synchronization point between waves) before the next dispatch; when nothing grew, this costs one generation check per wave. The earlier 0.1.0 pre-release per-column growth cap is retired — growing past 1024 rows is covered directly by packages/scheduler/test/worker-growth-boundary.test.ts (1024 in-place grow + 1025/1040 re-backing) and the above-reservation case in the heavy-pool smoke test.
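The re-backing and generation check described in the last item can be sketched in a single thread. This is a simplified model under stated assumptions, not ecsia's implementation: `SharedColumn` and `ColumnView` are hypothetical names, the doubling growth policy is an assumption, and a real pool would hand the new buffer and generation to each worker across threads rather than through a shared object.

```typescript
const INITIAL_ROWS = 64
const GROWTH_RESERVE_FACTOR = 16
const RESERVED_ROWS = INITIAL_ROWS * GROWTH_RESERVE_FACTOR // 1024

// Owner side: a shared f32 column that re-backs when it outgrows its reservation.
class SharedColumn {
  buffer: SharedArrayBuffer = new SharedArrayBuffer(RESERVED_ROWS * Float32Array.BYTES_PER_ELEMENT)
  data: Float32Array = new Float32Array(this.buffer)
  generation = 0 // bumped only when the backing buffer is replaced

  ensureRows(rowsNeeded: number): void {
    if (rowsNeeded <= this.data.length) return // fits the reservation: no re-back
    let capacity = this.data.length
    while (capacity < rowsNeeded) capacity *= 2 // assumed growth policy
    const next = new SharedArrayBuffer(capacity * Float32Array.BYTES_PER_ELEMENT)
    const nextData = new Float32Array(next)
    nextData.set(this.data) // carry the existing rows over
    this.buffer = next
    this.data = nextData
    this.generation++
  }
}

// Worker side: a cached view that is re-wrapped only when the generation moved.
class ColumnView {
  private seen = -1
  data: Float32Array = new Float32Array(0)

  // Called at the wave fence; returns true when the view had to be re-wrapped.
  syncAtFence(column: SharedColumn): boolean {
    if (this.seen === column.generation) return false // nothing grew: one check
    this.data = new Float32Array(column.buffer)
    this.seen = column.generation
    return true
  }
}

const column = new SharedColumn()
const view = new ColumnView()
view.syncAtFence(column) // first fence: wraps the initial buffer
column.ensureRows(1024)  // still inside the reservation
column.ensureRows(1025)  // past it: re-backs into a larger buffer
view.syncAtFence(column) // the generation moved, so the view is re-wrapped
```

The design point is that the common case (no growth) costs a single integer comparison per wave, and the expensive re-wrap happens only after an actual re-back.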

Bind your loop once: `bindColumns`

eachChunk looks its columns up again on every call. That re-lookup is the safe default — a column's array can be replaced when storage grows — but it stops V8 from compiling your loop with the arrays baked in as constants, and that compilation is worth about 30% on the iteration bench. bindColumns gets it back: you hand the query the columns you want and a factory function; ecsia resolves the columns once and compiles a specialized loop per matched archetype (it re-evaluates your factory into a fresh function so V8 keeps each loop on its fast path). The result lands at ~0.7× bitECS on the iteration bench — faster than bitECS — and stays there even after storage grows, with no pre-sizing required. Where a runtime forbids dynamic compilation (a strict Content-Security-Policy, a locked sandbox), it transparently falls back to a plain interpreted loop; the codegen path is used only when it provably matches that interpreted result, so it can never change what your loop computes.

import { createWorld, defineComponent, write } from '@ecsia/kit'

const Position = defineComponent({ x: 'f32', y: 'f32' }, { name: 'position' })
const Velocity = defineComponent({ dx: 'f32', dy: 'f32' }, { name: 'velocity' })
const world = createWorld({ components: [Position, Velocity], maxEntities: 1 << 16 })

const q = world.query(write(Position), write(Velocity))

const run = q.bindColumns(
  [Position, 'x'], [Position, 'y'], [Velocity, 'dx'], [Velocity, 'dy'],
  ([px, py, dx, dy], meta) => (ctx: { dt: number }) => {
    const dt = ctx.dt        // per-frame inputs arrive via ctx — hoist them out of the loop
    const count = meta.count // the live entity count: read it here in the runner on every call, not in the factory
    for (let i = 0; i < count; i++) {
      px[i] = px[i]! + dx[i]! * dt
      py[i] = py[i]! + dy[i]! * dt
    }
  },
)

run({ dt: 1 / 60 }) // call once per frame

For a vec field the view is the raw flat array — row r's axes live at [r * stride, (r+1) * stride), where the stride is the arity you declared (vec3 → 3). Read it once from meta.strides[specIndex] rather than hardcoding the number, so a later vec3 → vec4 change can't silently mis-index:

const s = meta.strides[0]            // the first spec's slots-per-row
for (let r = 0; r < meta.count; r++) pos[r * s] += dx[r] * dt

One rule makes the codegen path kick in, and it is the natural shape anyway:

Your factory must be self-contained: it may close over nothing from the surrounding scope. ecsia re-evaluates your factory's own source to compile each archetype's loop, and that fresh copy only sees globals. So pass per-frame inputs through the runner's ctx argument (hoist them to a local before the loop, as above) and define fixed constants inside the factory body. A factory that reaches outside itself still works; it just runs the plain interpreted loop instead of the compiled one. Read meta.count inside the runner on every call, not once in the factory: it is the live entity count, and spawning and despawning never re-invoke the factory.
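Why the self-contained rule exists is easiest to see in a sketch of source re-evaluation. The code below is an illustration of the general technique, not ecsia's code path: `freshCopy`, `Factory`, and `Runner` are hypothetical names. Unlike the behavior described above, this sketch does not detect a factory that reaches outside itself; it only shows that a re-evaluated copy has no access to the original closure.

```typescript
type Runner = (ctx: { dt: number }) => void
type Factory = (cols: Float32Array[], meta: { count: number }) => Runner

// Re-evaluate a factory's own source into a fresh function. The copy sees only
// globals. Where dynamic compilation is blocked (for example by a strict CSP),
// `new Function` throws and we fall back to the original function.
function freshCopy(factory: Factory): { factory: Factory; compiled: boolean } {
  try {
    const copy = new Function(`"use strict"; return (${factory.toString()})`)() as Factory
    return { factory: copy, compiled: true }
  } catch (_err) {
    return { factory, compiled: false }
  }
}

// Self-contained factory: nothing captured from outside, per-frame input via ctx.
const integrate: Factory = (cols, meta) => (ctx) => {
  const px = cols[0]!
  const dx = cols[1]!
  const dt = ctx.dt
  const count = meta.count // read on every call
  for (let i = 0; i < count; i++) px[i] = px[i]! + dx[i]! * dt
}

const px = Float32Array.from([0, 10])
const dx = Float32Array.from([2, 4])
const bound = freshCopy(integrate)
bound.factory([px, dx], { count: 2 })({ dt: 0.5 }) // px is now [1, 12]
```

Had `integrate` referenced a variable declared outside its own body, the fresh copy would fail with a ReferenceError when called, which is why per-frame inputs travel through ctx.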

The trade-offs are the same as eachChunk: writes through the bound arrays bypass the write log, so .changed() filters and observers will not see them, and structural changes during run() follow the same collect-first, mutate-after rule as every other loop.
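The bypass is easier to see side by side. The sketch below is a toy model with hypothetical names (`xColumn`, `writeLog`, `accessor`); it makes no claim about how ecsia's write log is actually structured, only about why a raw typed-array write cannot be observed by something that listens on a setter.

```typescript
const xColumn = new Float32Array(4)
const writeLog: number[] = [] // rows written through the accessor this frame

// Accessor path: every write goes through a setter that records the row.
function accessor(row: number) {
  return {
    get x(): number {
      return xColumn[row]!
    },
    set x(value: number) {
      xColumn[row] = value
      writeLog.push(row)
    },
  }
}

accessor(1).x = 5 // recorded: a change filter draining writeLog sees row 1
xColumn[2] = 9    // raw column write: the data changes, the log does not
```

After these two writes the column holds both values, but only row 1 appears in the log.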

Compile the ergonomic path: `compile`

bindColumns is fast but makes you name every column and rewrite your loop against raw arrays. compile recovers most of that speed from the readable .each body you would write anyway: hand it a callback, and it reads that callback's source, rewrites each e.<component>.<field> to direct column indexing, and generates the same kind of specialized per-archetype loop. The result lands near eachChunk, roughly 6× faster than the proxy .each, while you keep writing e.position.x += ….

import { createWorld, defineComponent, write, read } from '@ecsia/kit'

const Position = defineComponent({ x: 'f32', y: 'f32' }, { name: 'position' })
const Velocity = defineComponent({ dx: 'f32', dy: 'f32' }, { name: 'velocity' })
const world = createWorld({ components: [Position, Velocity], maxEntities: 1 << 16 })

const q = world.query(write(Position), read(Velocity))

const run = q.compile<{ dt: number }>((e, ctx) => {
  e.position.x += e.velocity.dx * ctx.dt
  e.position.y += e.velocity.dy * ctx.dt
})

run({ dt: 1 / 60 }) // call once per frame

Two things set it apart from bindColumns:

It preserves reactivity. Unlike bindColumns and eachChunk, a component your body writes is recorded in the write log exactly as the accessor setter would record it, so .changed() filters and observers still fire. That bookkeeping is free when no consumer is registered (the same gate the accessor uses) and costs the write-log push only when one is — so compile is the fast path you can reach for even when you depend on change tracking.
It is a pure speedup that can never change your result. The analyzer is deliberately conservative: it compiles only straight-line numeric-scalar bodies, and for anything it cannot prove safe it transparently runs your unchanged callback through the normal .each. So a body with control flow (if/?/&&/return/a loop/a nested function), a string or comment, a non-numeric-scalar field (vec/bool/eid/string/object), a component the query does not require, any use of e other than e.<component>.<field>, a per-row write to ctx, a row-filtered query, or a runtime that blocks new Function (a strict Content-Security-Policy) all keep working — they just run the proxy loop. A property test asserts the compiled loop is byte-identical to .each under random spawn / despawn / write / growth churn, with and without a change consumer.
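The rewrite-or-fall-back idea can be illustrated with a toy. Everything here is an assumption made for illustration: `compileLoop`, `interpreted`, the regex-based checks, and the `"component.field"` column keys are hypothetical and far cruder than a real analyzer (for instance, this toy does not detect per-row writes to ctx). It shows only the shape: rewrite `e.<component>.<field>` to column indexing when the body is straight-line, otherwise run the unchanged callback.

```typescript
type Cols = Record<string, Float32Array>
type RowFn = (e: any, ctx: any) => void
type Loop = (cols: Cols, count: number, ctx: unknown) => void

// Anything beyond straight-line arithmetic sends the body down the fallback path.
const UNSAFE = /\b(if|for|while|do|switch|return|function|try|new)\b|=>|[?&|"'`]|\/\/|\/\*/

// Fallback: call the unchanged callback once per row through accessor objects.
function interpreted(fn: RowFn): Loop {
  return (cols, count, ctx) => {
    let row = 0
    const e: Record<string, Record<string, number>> = {}
    for (const key of Object.keys(cols)) {
      const [comp, field] = key.split('.') as [string, string]
      const col = cols[key]!
      Object.defineProperty((e[comp] ??= {}), field, {
        get: () => col[row]!,
        set: (v: number) => { col[row] = v },
      })
    }
    for (row = 0; row < count; row++) fn(e, ctx)
  }
}

function compileLoop(fn: RowFn, known: string[]): { loop: Loop; compiled: boolean } {
  const fallback = { loop: interpreted(fn), compiled: false }
  const m = /^\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*=>\s*\{([\s\S]*)\}\s*$/.exec(fn.toString())
  if (!m) return fallback
  const [, eName, ctxName, body] = m as unknown as [string, string, string, string]
  if (UNSAFE.test(body)) return fallback
  let ok = true
  const access = new RegExp(`\\b${eName}\\.(\\w+)\\.(\\w+)`, 'g')
  const rewritten = body.replace(access, (_m: string, comp: string, field: string) => {
    const key = `${comp}.${field}`
    if (!known.includes(key)) ok = false // a column the caller did not provide
    return `cols[${JSON.stringify(key)}][i]`
  })
  // Any remaining use of the row parameter is something we cannot rewrite.
  if (!ok || new RegExp(`\\b${eName}\\b`).test(rewritten)) return fallback
  try {
    const src = `for (let i = 0; i < count; i++) {${rewritten}}`
    return { loop: new Function('cols', 'count', ctxName, src) as Loop, compiled: true }
  } catch (_err) {
    return fallback // dynamic compilation blocked
  }
}

const cols: Cols = {
  'position.x': Float32Array.from([0, 1]),
  'velocity.dx': Float32Array.from([2, 4]),
}
const move = compileLoop((e, ctx) => {
  e.position.x += e.velocity.dx * ctx.dt
}, Object.keys(cols))
move.loop(cols, 2, { dt: 0.5 }) // position.x is now [1, 3] on either path
```

Whether `move.compiled` is true depends on how the callback's source looks at runtime (a transpiler may change it); the result is the same on both paths, which is the property that matters.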

Call compile once and reuse the returned run every frame, the same as bindColumns. Structural changes during run() follow the same collect-first, mutate-after rule as every other loop.

Inside a system

A defineSystem has only a per-frame run — no separate setup step — so build the runner on the first frame and cache it in the system's closure (the same pattern applies to bindColumns). The query you need is the one run receives, so the lazy build is the natural place for it:

let move: ((ctx: { dt: number }) => void) | null = null

const Movement = defineSystem({
  name: 'Movement',
  read: [Velocity],
  write: [Position],
  run({ query, dt }) {
    move ??= query(read(Velocity), write(Position)).compile<{ dt: number }>((e, ctx) => {
      e.position.x += e.velocity.dx * ctx.dt
      e.position.y += e.velocity.dy * ctx.dt
    })
    move({ dt })
  },
})

The runner is built once, on the first frame, then reused — and because compile preserves the write log, a .changed()/observer system later in the schedule still sees the writes. (A worker-eligible system runs its separately-authored kernel on worker threads; compile is a main-thread run-body tool, exactly like each and bindColumns.)

Warm a cold archetype: `warm`

Every loop on this page is fastest over a hot archetype — one backed by contiguous typed-array columns. There's a cap on how many archetypes stay hot (maxHotArchetypes, default max(256, maxEntities >>> 8)). Once you exceed it, new archetypes are created cold: their rows live in a shared overflow store keyed by (entity, component) rather than in their own columns. Cold archetypes still iterate correctly — they just pay an extra indirection per field instead of a straight column walk.
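As a quick worked example of the default cap stated above, the helper below (a hypothetical name, written only to evaluate the formula) shows where the floor of 256 applies and where the entity-based term takes over.

```typescript
// Default cap as stated above: max(256, maxEntities >>> 8).
function defaultMaxHotArchetypes(maxEntities: number): number {
  return Math.max(256, maxEntities >>> 8)
}

defaultMaxHotArchetypes(10_000)  // 256: 10,000 >>> 8 is 39, so the floor wins
defaultMaxHotArchetypes(1 << 16) // 256: 65,536 >>> 8 is exactly 256
defaultMaxHotArchetypes(1 << 20) // 4096: 1,048,576 >>> 8 is 4,096
```

So a world sized like the examples on this page (maxEntities of 1 << 16) keeps up to 256 archetypes hot by default.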

If a profile shows a hot loop landing on a cold archetype, promote it once with world.warm. Pass the exact component set that names the archetype; warm allocates its columns and migrates the resident rows into contiguous storage:

import { createWorld, defineComponent } from '@ecsia/kit'

const Position = defineComponent({ x: 'f32', y: 'f32' }, { name: 'position' })
const Velocity = defineComponent({ dx: 'f32', dy: 'f32' }, { name: 'velocity' })

const world = createWorld({ components: [Position, Velocity] })

// Make {Position, Velocity} hot before the first heavy frame.
world.warm(Position, Velocity)

warm moves rows, so it's a structural operation — run it at a flush point (during setup or between frames), never inside a loop. It's idempotent: warming an already-hot archetype is a no-op. Most apps never approach the cap and never need it; reach for warm only when profiling points at a cold archetype on a hot path.

Reproduce

pnpm build            # the harness imports the BUILT package dist (tsx-free)
pnpm bench:report     # regenerates bench/RESULTS.json + website/guide/_perf-tables.md

bench:report runs the bounded config the published tables use: iterate at N=50,000 (3 reps, 300 ms budget per task) and the worker pool at 1,024 entities/group across [1, 2, 4, 8] workers (clamped to your logical core count), 60 frames, fixed seed.

Heavier, longer-running variants drive the same builders at larger sizes for deeper console detail:

pnpm bench:macro        # cross-library macro-benches (iterate + relations + parallel), full sizes
pnpm bench:macro:pool   # the worker-pool speedup sweep on its own, with the printed table

Regression guard

CI does not assert milliseconds — wall-clock time on a shared runner is noise. It guards performance two ways instead. First, correctness: the worker-pool smoke test asserts every threaded configuration is byte-identical to single-thread (the byte-identical column above), and the bench builders are cross-checked so neither eachChunk nor bindColumns can silently diverge from .each. Second, a dedicated bench job asserts each iteration path's ns/entity ratio against a same-run bitECS control stays under a committed ceiling — the ratio cancels runner drift, so it catches a real regression (say, bindColumns deopting) without flapping on a slow neighbor. The absolute published numbers still come from bench:report on a fixed machine.
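The ratio guard can be sketched as below. The ceilings, the `Sample` shape, and `checkRatios` are hypothetical; the actual committed ceilings and the bench job's code are not shown on this page. The sketch only demonstrates why a ratio against a same-run control is stable when the whole runner slows down.

```typescript
interface Sample { path: string; nsPerEntity: number }

// Hypothetical committed ceilings: the largest allowed ratio against the control.
const CEILINGS: Record<string, number> = { eachChunk: 1.5, bindColumns: 1.2 }

// Returns one message per path whose ratio exceeds its ceiling.
function checkRatios(control: Sample, samples: Sample[]): string[] {
  const failures: string[] = []
  for (const s of samples) {
    const ceiling = CEILINGS[s.path]
    if (ceiling === undefined) continue // no ceiling committed for this path
    const ratio = s.nsPerEntity / control.nsPerEntity
    if (ratio > ceiling) failures.push(`${s.path}: ${ratio.toFixed(2)}x > ${ceiling}x`)
  }
  return failures
}

// Same code on a fast runner and on a runner that is uniformly 2x slower:
const fastRunner = checkRatios(
  { path: 'bitECS', nsPerEntity: 1.35 },
  [{ path: 'bindColumns', nsPerEntity: 0.97 }],
)
const slowRunner = checkRatios(
  { path: 'bitECS', nsPerEntity: 2.7 },
  [{ path: 'bindColumns', nsPerEntity: 1.94 }],
)
// A real regression: bindColumns slows down while the control does not.
const regressed = checkRatios(
  { path: 'bitECS', nsPerEntity: 1.35 },
  [{ path: 'bindColumns', nsPerEntity: 2.0 }],
)
```

The first two calls pass because the ratio is unchanged by uniform slowdown; only the third reports a failure.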

Bundle size

The kernel — typed data, systems, queries, and the scheduler — is about 42 KB min+gzip; just a world and components (@ecsia/core alone) is ~32 KB, and importing everything from the umbrella is ~57 KB. ecsia is batteries-included, so it is larger than a minimal core like bitECS (~5 KB); the packages are sideEffects: false, so a bundler ships only the subsystems you import — relations, serialization, and topics drop unless used. A CI budget (pnpm size) holds these numbers in place so a change can't quietly inflate the install.
