Graphical User Interface (GUI) datasets are crucial for various downstream
tasks. However, GUI datasets are typically annotated through automatic labeling, which often produces inaccurate bounding box (BBox) annotations for GUI elements, including missing, duplicate, or meaningless BBoxes. These issues
can degrade the performance of models trained on these datasets, limiting their
effectiveness in real-world applications. Additionally, existing GUI datasets provide only BBox-level visual annotations, which restricts the development of visually oriented downstream GUI tasks. To address these issues, we introduce
PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web
pages. PixelWeb is constructed using a novel automatic annotation approach that
integrates visual feature extraction and Document Object Model (DOM) structure
analysis through two core modules: channel derivation and layer analysis.
Channel derivation extracts BGRA four-channel bitmap annotations to localize GUI elements accurately even when elements are occluded or overlapping. Layer analysis uses the DOM to determine the visibility and stacking order of elements, yielding precise BBox annotations.
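To make the channel-derivation idea concrete, the sketch below shows how a tight BBox could be recovered from a per-element BGRA bitmap: the alpha channel marks the pixels an element actually occupies, so its non-zero region directly gives the element's visible extent even under partial occlusion. This is an illustrative reconstruction, not the paper's code; the array layout and the `alpha_threshold` parameter are assumptions.

```python
import numpy as np

def bbox_from_bgra(bitmap: np.ndarray, alpha_threshold: int = 0):
    """Derive a tight (x_min, y_min, x_max, y_max) BBox from a per-element
    BGRA bitmap (H x W x 4, uint8). Pixels with alpha > alpha_threshold are
    treated as belonging to the element; returns None if nothing is visible.

    Illustrative sketch only -- the actual PixelWeb pipeline may differ.
    """
    alpha = bitmap[:, :, 3]                       # 4th channel of BGRA = alpha
    ys, xs = np.nonzero(alpha > alpha_threshold)  # visible pixel coordinates
    if len(xs) == 0:                              # fully occluded / invisible
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Example: a 10x10 canvas where one 4x3 element is visible
canvas = np.zeros((10, 10, 4), dtype=np.uint8)
canvas[2:5, 3:7, 3] = 255        # mark the element's pixels as opaque
print(bbox_from_bgra(canvas))    # -> (3, 2, 6, 4)
```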
Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations.
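As a hedged illustration of how the mask and contour metadata relate, the snippet below derives a contour polygon from a binary element mask using standard OpenCV calls; the mask layout is an assumption rather than PixelWeb's actual on-disk format.

```python
import cv2
import numpy as np

# Illustrative only: a binary mask for one GUI element (1 = element pixel).
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 3:7] = 1

# External contour of the element, as a polygon in pixel coordinates.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
polygon = contours[0].reshape(-1, 2)   # (N, 2) array of (x, y) points
print(polygon)
```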
Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb's annotations. Experimental results on
GUI element detection tasks show that PixelWeb achieves 3-7 times better performance on the mAP95 metric than existing datasets. We believe that
PixelWeb has great potential for performance improvement in downstream tasks
such as GUI generation and automated user interaction.