Development
PTC Benchmark Coverage
Last updated 2026-04-14
PTC Benchmark Coverage
This file is the checked-in phase-2 coverage matrix for the realistic
programmatic tool-calling gallery under
examples/programmatic-tool-calls/.
The machine-readable source of truth lives in
benchmarks/ptc-portfolio.ts. The benchmark harness uses that metadata to
define the headline, broad, holdout, and sentinel layers.
Panel Summary
- headline panel:
6medium lanes, balanced2 / 2 / 2across analytics, operations, and workflows - broad panel:
12medium lanes, balanced4 / 4 / 4 - holdout panel: the remaining
12audited medium lanes - durable panel: the synthetic vendor durable lane plus real audited
plan-database-failoverandprivacy-erasure-orchestrationcheckpoints - gallery canary: all
24audited lanes with lower iteration count - public lane:
ptc_website_demo_small
Headline Seed Matrix
The release harness now keeps the nominal audited headline lanes and a checked-in skewed companion for each one. The skew lanes use the same guest source with a second deterministic seed that distorts cardinality, string noise, or writeback fanout without turning into a synthetic microbench.
| Headline Lane | Skew Metric | Skew Patterns |
|---|---|---|
analytics_revenue_quality |
ptc_analytics_revenue_quality_medium_skewed |
hotspot cardinality, duplicate-heavy collections, longer deal strings |
analytics_fraud_ring |
ptc_analytics_fraud_ring_medium_skewed |
hotspot cardinality, duplicate-heavy joins, noisier identity/note strings |
triage-multi-region-auth-outage |
ptc_triage-multi-region-auth-outage_medium_skewed |
duplicate-heavy alert dedupe, regional hotspot skew, longer error samples |
analyze-queue-backlog-regression |
ptc_analyze-queue-backlog-regression_medium_skewed |
hotspot shard skew, larger intermediate payloads, noisier dead-letter strings |
vendor-compliance-renewal |
ptc_vendor-compliance-renewal_medium_skewed |
larger intermediate payloads, duplicate-heavy evidence sets, lower signal-to-noise |
privacy-erasure-orchestration |
ptc_privacy-erasure-orchestration_medium_skewed |
writeback fanout skew, retention-hold hotspot, larger intermediate payloads |
Coverage Matrix
| Use Case | Category | Panel | Async Shape | Collections | Strings | Time | Writeback | Durable | Compaction |
|---|---|---|---|---|---|---|---|---|---|
analytics_revenue_quality |
analytics | headline+broad | fanout+derived | Map |
no | no | no | no | lower |
analytics_fraud_ring |
analytics | headline+broad | fanout+derived | Map+Set |
yes | yes | no | no | moderate |
analytics_supplier_disruption |
analytics | broad | fanout+derived | Map+Set |
no | no | no | no | moderate |
analytics_model_regression |
analytics | broad | fanout+derived | none | no | yes | no | no | lower |
analytics_market_event_brief |
analytics | holdout | fanout | Map |
yes | yes | no | no | lower |
analytics_enterprise_renewal |
analytics | holdout | sequential resume | none | no | no | no | yes | lower |
analytics_market_abuse_review |
analytics | holdout | sequential resume | none | yes | yes | no | yes | moderate |
analytics_capital_allocation |
analytics | holdout | fanout | Map |
no | no | no | no | lower |
triage-multi-region-auth-outage |
operations | headline+broad | fanout+derived | Map+Set |
yes | yes | no | no | moderate |
reconcile-marketplace-payouts |
operations | broad | fanout+derived | Map |
no | no | no | no | lower |
analyze-queue-backlog-regression |
operations | headline+broad | fanout+derived | Map |
yes | yes | no | no | moderate |
plan-database-failover |
operations | broad | sequential resume | none | no | yes | yes | yes | moderate |
guard-payments-rollout |
operations | holdout | fanout+derived | none | no | no | no | no | lower |
stabilize-oncall-handoff |
operations | holdout | fanout | Map |
yes | yes | no | no | moderate |
coordinate-warehouse-exception |
operations | holdout | fanout | Map |
no | yes | no | no | moderate |
assess-global-deployment-freeze |
operations | holdout | fanout+derived | none | no | yes | no | no | moderate |
security-access-recertification |
workflows | broad | fanout+derived | Map+Set |
no | no | yes | no | moderate |
vendor-compliance-renewal |
workflows | headline+broad | fanout | none | no | no | yes | no | moderate |
privacy-erasure-orchestration |
workflows | headline+broad | sequential resume | none | no | yes | yes | yes | moderate |
chargeback-evidence-assembly |
workflows | broad | fanout | none | no | yes | yes | no | moderate |
approval-exception-routing |
workflows | holdout | sequential resume | Set |
no | no | yes | yes | moderate |
vip-support-escalation |
workflows | holdout | fanout | none | yes | yes | yes | no | moderate |
payout-batch-release-review |
workflows | holdout | fanout | Map+Set |
no | no | yes | no | moderate |
enterprise-renewal-save-plan |
workflows | holdout | fanout | Map |
no | no | yes | no | moderate |
Sentinel Delegations
The audited gallery is intentionally the primary source of truth, but three shape families are still tracked separately instead of being hidden inside the real-gallery panels:
code_mode_search: large preloaded typed-API search surfaces, preload footprint, and result-size sensitivityresult_materialization: cases where guest work stays mostly fixed but boundary output expansion dominates the totallow_compaction_fanout: realistic fanout with weaker final compaction than the main gallery lanes
Those families stay out of the primary gallery scorecards so the benchmark portfolio does not stop reflecting real workloads, but they remain required evidence whenever a change claims to improve generic interpreter, boundary, or result-materialization behavior.
Durable Panel
The durable benchmark panel now covers three resume paths:
ptc_vendor_review_durable_{small,medium,large}: checkpoint atcheckpoint_vendor_reviewptc_plan-database-failover_durable_medium: checkpoint atrequest_operator_approvalptc_privacy-erasure-orchestration_durable_medium: checkpoint at the firstqueue_erasure_job
Those checkpoints keep the durable panel tied to real resumable workflow shapes instead of one vendor-review-only synthetic path.