Evaluation Protocol
STS 2026 is organized into three tasks: metal artifact teeth segmentation in cone-beam computed tomography (CBCT), registration between CBCT and intraoral scans (IOS), and MMDental multimodal analysis. Together, they cover robust CBCT tooth analysis, cross-modal dental geometry fusion and multimodal clinical reasoning.
Evaluation Tasks
| Task | Submission | Assessment |
|---|---|---|
| Task 1: Metal Artifact CBCT Teeth Segmentation | Tooth segmentation masks for CBCT scans affected by metal artifacts. | DSC and HD95 measure segmentation overlap and boundary accuracy. |
| Task 2: CBCT-IOS Registration | Rigid transformation matrix aligning the IOS crown surface to the CBCT volume. | MTE and MRE measure geometric alignment accuracy. |
| Task 3: MMDental Multimodal Analysis | Model outputs generated from tooth CBCT images and expert medical records. | Task-specific multimodal diagnosis, reporting and reasoning metrics will be released with the protocol. |
Metrics
- Dice Similarity Coefficient (DSC): segmentation overlap.
- 95th-percentile Hausdorff Distance (HD95): segmentation boundary accuracy.
- Mean Translation Error (MTE): registration translation accuracy.
- Mean Rotation Error (MRE): registration rotation accuracy.
- Task 3 multimodal metrics will assess consistency between CBCT evidence and expert clinical records.
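
The list above names MTE and MRE without giving formulas. As a rough sketch, assuming 4x4 homogeneous rigid matrices, a translation error taken as the Euclidean distance between translation vectors, and a rotation error taken as the geodesic angle between rotation blocks, the two registration metrics could be computed as follows. The function names, the degree unit and the plain averaging over cases are illustrative assumptions, not the official evaluation code.

```python
import numpy as np

def translation_error(T_pred, T_gt):
    """Euclidean distance between the translation parts of two 4x4 rigid transforms."""
    T_pred, T_gt = np.asarray(T_pred, float), np.asarray(T_gt, float)
    return float(np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3]))

def rotation_error_deg(T_pred, T_gt):
    """Geodesic angle in degrees between the rotation parts of two 4x4 rigid transforms."""
    T_pred, T_gt = np.asarray(T_pred, float), np.asarray(T_gt, float)
    # Relative rotation between prediction and reference.
    R_rel = T_pred[:3, :3] @ T_gt[:3, :3].T
    # trace(R) = 1 + 2*cos(theta); clip guards against round-off outside [-1, 1].
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def mte_mre(pred_transforms, gt_transforms):
    """Average both errors over paired cases (assumed aggregation)."""
    pairs = list(zip(pred_transforms, gt_transforms))
    te = [translation_error(p, g) for p, g in pairs]
    re = [rotation_error_deg(p, g) for p, g in pairs]
    return float(np.mean(te)), float(np.mean(re))
```

Note that a translation error defined this way depends on the coordinate frame in which the matrices are expressed, so submissions and references would need to share the same frame and units.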
DSC and HD95 are intentionally reported separately for Task 1. HD95 is emphasized because blooming from metal artifacts mainly corrupts anatomical boundaries, and boundary fidelity is clinically important when analyzing artifact-affected CBCT. DSC reports overall overlap quality.
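
To make the two segmentation metrics concrete, here is a minimal sketch for binary masks using NumPy and SciPy. It assumes surfaces are extracted by one-voxel erosion and that HD95 is the 95th percentile of the pooled surface distances in both directions; other toolkits take the maximum of the two directed percentiles instead, and the official evaluation may differ. The function names, the `spacing` parameter and the handling of empty masks are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

def surface(mask):
    """Boundary voxels: the mask minus its one-voxel erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric surface distance, in the units of `spacing`."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    if not pred.any() or not gt.any():
        return float("inf")  # one mask empty: worst possible boundary score
    s_pred, s_gt = surface(pred), surface(gt)
    # Distance from every voxel to the nearest surface voxel of each mask.
    dist_to_gt = ndimage.distance_transform_edt(~s_gt, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~s_pred, sampling=spacing)
    # Pool both directions, then take the 95th percentile.
    distances = np.concatenate([dist_to_gt[s_pred], dist_to_pred[s_gt]])
    return float(np.percentile(distances, 95))
```

For multi-label tooth masks, one option is to apply these functions per tooth label and average the results, although the aggregation rule is not specified here.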
Ranking Rules
- Task 1 reports two separate rankings, Rank_DSC and Rank_HD95, for metal artifact CBCT teeth segmentation.
- Task 2 ranks teams by registration accuracy using MTE and MRE.
- Task 3 will use a separate leaderboard based on the released MMDental multimodal protocol.
- Missing or failed test-case results receive the worst possible score, such as DSC=0 or HD95=infinity.
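
The rules above can be sketched in a few lines. This example assumes that each team is scored by its mean DSC and mean HD95 over all test cases and that tied teams share the best position; neither the aggregation nor the tie rule is stated in the protocol, and the data layout and function names are hypothetical.

```python
import math

# Worst-case values taken from the ranking rules: DSC=0, HD95=infinity.
WORST = {"dsc": 0.0, "hd95": math.inf}

def team_means(results, case_ids):
    """Mean DSC and HD95 per team; cases absent from a team's results get WORST."""
    means = {}
    for team, cases in results.items():
        filled = [cases.get(cid, WORST) for cid in case_ids]
        means[team] = {m: sum(c[m] for c in filled) / len(filled) for m in WORST}
    return means

def rank(scores, higher_is_better):
    """Competition ranking (1, 2, 2, 4): tied teams share the best position."""
    ordered = sorted(scores.values(), reverse=higher_is_better)
    return {team: ordered.index(value) + 1 for team, value in scores.items()}

def task1_ranks(results, case_ids):
    """Return (Rank_DSC, Rank_HD95) as two separate team-to-rank mappings."""
    means = team_means(results, case_ids)
    rank_dsc = rank({t: m["dsc"] for t, m in means.items()}, higher_is_better=True)
    rank_hd95 = rank({t: m["hd95"] for t, m in means.items()}, higher_is_better=False)
    return rank_dsc, rank_hd95
```

Under this mean-based assumption, a single missing case makes a team's mean HD95 infinite, which is one reason an evaluation might instead cap the penalty or aggregate per-case ranks.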
Dataset Split
| Split | Cases | Provided Data |
|---|---|---|
| Training (Labeled) | 40 | CBCT, IOS, segmentation masks and registration matrices. |
| Training (Unlabeled) | 219 | Raw CBCT and IOS data for semi-supervised learning. |
| Validation | 20 | Raw CBCT and IOS data; ground truth withheld for server evaluation. |
| Test | 100 | Raw CBCT and IOS data; hidden ground truth for final ranking. |
| Total | 379 | All cases include metallic restorations and metal artifacts. |
The labeled training set and test set share the same metal artifact severity stratification: 30% mild, 40% moderate and 30% severe. This mirrored distribution keeps the task definition stable and evaluates robustness to metal artifacts rather than unexpected domain shift.
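
As a quick arithmetic check, the stated percentages can be turned into per-stratum case counts for the two split sizes in the table. The sketch assumes the percentages apply exactly to each split; the function name is hypothetical.

```python
SEVERITY_PERCENT = {"mild": 30, "moderate": 40, "severe": 30}

def severity_counts(n_cases, percent=SEVERITY_PERCENT):
    """Split n_cases into severity strata using integer percentages."""
    if sum(percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {level: n_cases * p // 100 for level, p in percent.items()}
    if sum(counts.values()) != n_cases:
        raise ValueError("n_cases does not divide evenly into the strata")
    return counts

print(severity_counts(40))   # {'mild': 12, 'moderate': 16, 'severe': 12}
print(severity_counts(100))  # {'mild': 30, 'moderate': 40, 'severe': 30}
```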
Statistical Analysis
The organizers will estimate 95% confidence intervals using bootstrap analysis, compare top teams with paired Wilcoxon signed-rank tests, and report variability with standard deviation, interquartile range and box-and-whisker plots. Additional analyses will include artifact severity stratification, semi-supervised learning efficacy and inter-rater reliability by artifact subgroup.
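
The first two analyses can be illustrated with NumPy and SciPy. The sketch below assumes a percentile bootstrap over per-case scores for the confidence interval of the mean and a two-sided paired test on per-case scores of two teams; the resample count, seed and function names are illustrative choices, not the organizers' settings.

```python
import numpy as np
from scipy.stats import wilcoxon

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-case scores."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample cases with replacement, n_boot times, and average each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

def compare_teams(scores_a, scores_b):
    """Paired Wilcoxon signed-rank test on per-case scores of two teams."""
    result = wilcoxon(scores_a, scores_b)
    return float(result.statistic), float(result.pvalue)
```

Both functions expect one score per test case, with the cases in the same order for the two teams being compared.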
The latest public materials for STS 2026 will also be linked from the official GitHub repository.