Arbor — Control 3D Generation with Explicit Geometry
Text and image conditioned 3D models now generate convincing assets, but they still offer little direct control over the space an object should occupy or avoid. In authoring, this spatial intent is often known before generation starts. Arbor introduces constraint meshes as a native 3D control interface. Hull regions mark where geometry should exist, avoidance regions mark space that should remain empty, and touch regions mark surfaces the generated object should contact. These meshes are not target evidence. They are local typed requirements that can include regions where no surface should appear.
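
To make the three roles concrete, the following is a minimal sketch of a typed constraint object and a check against a generated occupancy set. It is not Arbor's implementation or its evaluation metric: the names `TypedConstraint` and `check_constraint` are hypothetical, regions are simplified to sets of voxel indices rather than meshes, and "contact" is assumed to mean that a touch voxel is occupied or has an occupied 6-neighbor.

```python
from dataclasses import dataclass

Voxel = tuple[int, int, int]

# 6-connected neighborhood used for the assumed contact test.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


@dataclass(frozen=True)
class TypedConstraint:
    """Hypothetical voxelized stand-in for hull, avoidance, and touch meshes."""
    hull: frozenset = frozenset()       # geometry should exist here
    avoidance: frozenset = frozenset()  # space that should stay empty
    touch: frozenset = frozenset()      # surfaces the object should contact


def _fraction(hits: int, total: int) -> float:
    # An empty region imposes no requirement, so it counts as satisfied.
    return hits / total if total else 1.0


def check_constraint(constraint: TypedConstraint, occupied) -> dict:
    """Score a generated occupancy set against each typed region."""
    occupied = set(occupied)

    def contacted(v: Voxel) -> bool:
        if v in occupied:
            return True
        return any((v[0] + dx, v[1] + dy, v[2] + dz) in occupied
                   for dx, dy, dz in NEIGHBORS)

    return {
        "hull_coverage": _fraction(len(constraint.hull & occupied),
                                   len(constraint.hull)),
        "avoidance_hits": len(constraint.avoidance & occupied),
        "touch_contact": _fraction(sum(contacted(v) for v in constraint.touch),
                                   len(constraint.touch)),
    }
```

In this sketch an avoidance region is satisfied only when it stays empty, which is what separates a typed requirement from a shape guide that can only say where geometry should be.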

Arbor keeps the text conditioned generator and the geometry encoders frozen. Given a text prompt and a typed constraint object, Arbor fuses the constraint meshes into one surface signal and uses the frozen TRELLIS.2 OVoxel encoders to map shape, normals, and binary hull, avoidance, and touch channels into compact tokens. The shape and signal streams are projected into a geometry memory with 3D positions. A router gives each local region of the TRELLIS sparse structure grid nearby constraint tokens and learned global summary tokens. Inside each frozen denoising block, a learned residual geometry branch attends to this routed memory after text cross attention and before the feed forward update. Only the geometry projection, positional embedding, summary modules, and grounding adapters are trained.
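
The routing and residual steps can be illustrated with a small, dependency-free sketch. Several parts are assumptions rather than details from Arbor: the router is approximated by k-nearest-neighbor selection on 3D positions, the attention is single-head with no learned projections, and the scalar `gate` is a hypothetical stand-in for whatever the trained adapters learn. The function names are illustrative only.

```python
import math


def route(cell_centers, token_positions, k=2):
    """For each grid cell, return indices of its k nearest constraint tokens."""
    return [
        sorted(range(len(token_positions)),
               key=lambda i: math.dist(center, token_positions[i]))[:k]
        for center in cell_centers
    ]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over memory rows."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * m for q, m in zip(query, row)) for row in memory]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(memory[0])
    return [sum(w * row[d] for w, row in zip(weights, memory)) / total
            for d in range(dim)]


def geometry_residual(hidden, local_tokens, summary_tokens, gate):
    """Add a gated attention readout of the routed memory to a hidden state.

    The routed memory is the cell's local tokens followed by the global
    summary tokens. With gate == 0 the frozen block's output is unchanged.
    """
    update = attend(hidden, local_tokens + summary_tokens)
    return [h + gate * u for h, u in zip(hidden, update)]
```

Because the branch is residual, a gate of zero leaves the hidden state of the frozen generator untouched, which is one common way to add a control path without disturbing a pretrained model at the start of training.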

Arbor is evaluated as a control interface for text conditioned 3D generation on automatic and artist curated Toys4K benchmarks. The videos use the paper color scheme: hull marks desired geometry, touch marks contact surfaces, avoidance marks empty space, and missing hull marks hull regions that the output leaves unfilled in the following views.
Controlled Generation. The main comparison tests whether a method can generate a plausible object from text while respecting typed 3D constraints. TRELLIS keeps a strong object prior, but has no mechanism for the typed geometry and only matches the guide by chance. Gradient and SpaceControl move mass toward the control object, yet this often comes at the cost of noisy geometry, missing structure, or a collapsed shape. Spice-E can preserve recognizable form, but treats the guide as a shape signal and does not reliably separate hull, avoidance, and touch roles. Arbor keeps both requirements visible. The outputs remain readable assets, and the following views show local roles respected in the intended regions.
The metric trends mirror this qualitative behavior. Arbor variants separate from baselines outside the Arbor family on the manual and automatic splits, and the Arbor family wins 59.2 percent of pairwise user choices in a preference study with 27 participants and 404 trials.
Variation Under Fixed Constraints. A useful control interface should not collapse the generator to one fixed output. We fix one control object and vary the seed. Arbor changes unconstrained parts of the object while still satisfying the constraint. The variation track also compares to Point-E, SPAR3D, and Hunyuan3D-Omni, which receive image cues that Arbor does not. In this setting, stronger visual evidence is not enough by itself to combine control and variation.
Constraint Sweeps. Constraint sweeps test whether control stays smooth outside the discrete benchmark. The prompt is fixed while one region moves through position, scale, or orientation. Arbor follows the deformation without snapping to a small set of canonical layouts.
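
A sweep of this kind can be described as a sequence of constraint regions in which exactly one attribute is interpolated. The sketch below is only an illustration of that setup, not the benchmark code: the `Region` fields (`center`, `scale`, `yaw_degrees`) and the linear interpolation schedule are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region:
    """Hypothetical pose of one constraint region."""
    center: tuple = (0.0, 0.0, 0.0)
    scale: float = 1.0
    yaw_degrees: float = 0.0


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def sweep(region, attribute, start, end, steps):
    """Return `steps` copies of `region` with one attribute moved start -> end."""
    if steps < 2:
        raise ValueError("a sweep needs at least two steps")
    frames = []
    for i in range(steps):
        t = i / (steps - 1)
        if attribute == "center":
            # Positions are interpolated per coordinate.
            value = tuple(lerp(a, b, t) for a, b in zip(start, end))
        else:
            value = lerp(start, end, t)
        frames.append(replace(region, **{attribute: value}))
    return frames
```

Each frame would be paired with the same prompt, so any change in the output can be attributed to the moved region alone.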
The authors thank Stability AI for hosting Jan-Niklas Dihlmann as an intern during this work. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC number 2064/1 - Project number 390727645. This work was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP 02, project number: 276693517. This work was supported by the Tübingen AI Center. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Jan-Niklas Dihlmann.