Flaky tests produce inconsistent outcomes without code changes, creating
major challenges for software developers. An industrial case study reported
that developers spend 1.28% of their time repairing flaky tests at a monthly
cost of $2,250. We discovered that flaky tests often exist in clusters, with
co-occurring failures that share the same root causes, which we call systemic
flakiness. This suggests that developers can reduce repair costs by addressing
shared root causes, enabling them to fix multiple flaky tests at once rather
than tackling them individually. This study represents an inflection point by
challenging the deep-seated assumption that flaky test failures are isolated
occurrences. We used an established dataset of 10,000 test suite runs from 24
Java projects on GitHub, spanning domains from data orchestration to job
scheduling. It contains 810 flaky tests, which we leveraged to perform a
mixed-method empirical analysis of co-occurring flaky test failures. Systemic
flakiness is significant and widespread. We performed agglomerative clustering
of flaky tests based on their failure co-occurrence, finding that 75% of flaky
tests across all projects belong to a cluster, with a mean cluster size of 13.5
flaky tests. Instead of requiring 10,000 test suite runs to identify systemic
flakiness, we demonstrated a lightweight alternative by training machine
learning models based on static test case distance measures. Through manual
inspection of stack traces, conducted independently by four authors and
resolved through negotiated agreement, we identified intermittent networking
issues and instabilities in external dependencies as the predominant causes of
systemic flakiness.
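
To make the clustering step concrete, the following is a minimal sketch of agglomerative clustering over failure co-occurrence. It assumes Jaccard distance between the sets of runs in which each test failed, average linkage, and a cut-off of 0.5; these choices, the function names, and the toy data are illustrative assumptions rather than the configuration used in the study.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of failing run IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_flaky_tests(failures, threshold=0.5):
    """Average-linkage agglomerative clustering of flaky tests.

    failures:  dict mapping test name -> set of run IDs in which it failed.
    threshold: assumed cut-off; merging stops once the closest pair of
               clusters is farther apart than this distance.
    """
    # Start with every test in its own singleton cluster.
    clusters = [[name] for name in sorted(failures)]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(failures[x], failures[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return [sorted(c) for c in clusters]


# Hypothetical failure history: run IDs in which each test failed.
failures = {
    "testConnect": {3, 17, 42},
    "testUpload": {3, 17, 42, 58},
    "testDownload": {17, 42},
    "testParseDate": {9},
}
clusters = cluster_flaky_tests(failures)
print(sorted(clusters))
```

In this toy history the three network-related tests fail in overlapping runs and end up in one cluster, while `testParseDate` stays on its own. A cluster of this kind is what would point to a shared root cause worth fixing once.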