In conventional deep speaker embedding frameworks, the pooling layer
aggregates all frame-level features over time and computes their mean and
standard deviation statistics as inputs to subsequent segment-level layers.
This statistics pooling strategy produces fixed-length representations from
variable-length speech segments. However, this method treats all
frame-level features equally and discards covariance information. In this
paper, we propose the Semi-orthogonal parameter pooling of Covariance matrix
(SoCov) method. The SoCov pooling computes the covariance matrix from the
self-attentive frame-level features and compresses it into a vector using the
semi-orthogonal parametric vectorization, which is then concatenated with the
weighted standard deviation vector to form inputs to the segment-level layers.
The deep embedding based on SoCov is called the “sc-vector”. The proposed sc-vector
is compared to several different baselines on the SRE21 development and
evaluation sets. The sc-vector system significantly outperforms the
conventional x-vector system, with a relative reduction in EER of 15.5% on
SRE21Eval. When using self-attentive deep features, SoCov reduces EER on
SRE21Eval by about 30.9% relative to the conventional “mean + standard
deviation” statistics.
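
To make the contrast between the two pooling strategies concrete, the following is a minimal NumPy sketch, not the authors' implementation. The baseline `stats_pooling` follows the "mean + standard deviation" description above. Everything in `socov_like_pooling` beyond "attention-weighted covariance, compressed with a semi-orthogonal matrix, concatenated with the weighted standard deviation" is an assumption made for illustration: the linear attention scorer `w`, the random semi-orthogonal matrix `P`, and in particular the choice to vectorize the compressed matrix `P.T @ C @ P` by its upper triangle. All function and variable names are hypothetical.

```python
import numpy as np

def stats_pooling(H):
    """Baseline: unweighted mean and standard deviation over time.
    H has shape (T, D): T frames, D feature dimensions."""
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])

def attention_weights(H, w):
    """Softmax over frames of a linear score (a stand-in attention scorer)."""
    scores = H @ w
    scores = scores - scores.max()          # numerical stability
    a = np.exp(scores)
    return a / a.sum()                      # shape (T,), sums to 1

def semi_orthogonal(D, K, rng):
    """Random D x K matrix with orthonormal columns, so P.T @ P = I_K."""
    Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
    return Q

def socov_like_pooling(H, a, P):
    """Assumed SoCov-style pooling: weighted covariance, compressed by P,
    concatenated with the weighted standard deviation."""
    mu = a @ H                              # weighted mean, shape (D,)
    Hc = H - mu                             # centered frames
    C = (Hc * a[:, None]).T @ Hc            # weighted covariance, (D, D)
    sigma = np.sqrt(np.clip(np.diag(C), 1e-12, None))  # weighted std, (D,)
    S = P.T @ C @ P                         # compressed covariance, (K, K)
    iu = np.triu_indices(P.shape[1])        # S is symmetric: keep upper triangle
    return np.concatenate([S[iu], sigma])

rng = np.random.default_rng(0)
D, K = 16, 4
w = rng.standard_normal(D)
P = semi_orthogonal(D, K, rng)

# Segments of different lengths map to vectors of the same length.
for T in (120, 300):
    H = rng.standard_normal((T, D))
    v = socov_like_pooling(H, attention_weights(H, w), P)
    print(T, stats_pooling(H).shape, v.shape)   # shapes: (32,) and (26,)
```

With `K` much smaller than `D`, the compressed covariance contributes `K * (K + 1) / 2` values rather than the `D * (D + 1) / 2` of the full matrix, which is the motivation for compressing before vectorizing in this sketch.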
arXiv: 2504.16441v1