Restoring Natural Sound from Clipped Signals
Early in 2012, Srdjan Kitić joined NXP Software on an industrial placement for his Master’s project. He was in the second year of a two-year Master’s degree: the first year was spent at KTH in Stockholm, and the second year was to be based at UCL in Louvain-La-Neuve. His chosen project was on unsupervised de-clipping of voice and music signals. From the outset, Srdjan proved to be an excellent student: bright, highly motivated and full of ideas. The project also benefited from expert supervision at UCL under the auspices of Professors Christophe De Vleeschouwer and Laurent Jacques. Within NXP Software we had an equally enthusiastic team overseeing the project, comprising myself, Nilesh Madhu and Ann Spriet. Our regular project meetings were lively and challenging, and helped steer the project to a successful conclusion.
Simply put, the problem is the following: we have a signal that has been clipped, i.e. its peak levels have been limited as shown in the diagram below. This can occur when a very loud signal is captured with a microphone; the result sounds distorted and unpleasant. To restore naturalness to the recording, we have to remove such distortions. De-clipping is the general term given to the process of removing the distortions introduced by such peak limiting. Usually, all we have to work with is the clipped signal itself. We have no further information, either on the characteristics of the recording system or on the recorded signal in its unclipped state. The de-clipping is thus ‘blind’. In such a case, the best we can do is the following:
- Signal portions that we reliably know are not clipped should be preserved
- Signal portions that are peak limited along the positive y-axis should be reconstructed such that the reconstruction lies above the positive clipping threshold
- Signal portions that are peak limited along the negative y-axis should be reconstructed such that the reconstruction lies below the negative clipping threshold
The key challenge is the reconstruction of the clipped regions.
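The setup above is easy to simulate. The following is a minimal NumPy sketch (all function names are illustrative, not from the paper) that applies symmetric hard clipping and then partitions the samples into the three sets the constraints refer to: reliable samples, and samples clipped on the positive and negative sides.

```python
import numpy as np

def hard_clip(x, threshold):
    """Simulate symmetric peak limiting: samples beyond +/- threshold
    are flattened to the threshold value."""
    return np.clip(x, -threshold, threshold)

def clipping_masks(y, threshold):
    """Partition the samples of the clipped signal y into the three sets
    used by the constraints above: reliable (unclipped), clipped on the
    positive side, and clipped on the negative side."""
    pos = y >= threshold
    neg = y <= -threshold
    reliable = ~(pos | neg)
    return reliable, pos, neg

# A loud sine that exceeds the clipping level of 0.5:
t = np.linspace(0, 1, 1000, endpoint=False)
x = 0.9 * np.sin(2 * np.pi * 5 * t)
y = hard_clip(x, 0.5)
reliable, pos, neg = clipping_masks(y, 0.5)
```

Only the samples in `reliable` are known exactly; everything flagged in `pos` or `neg` must be reconstructed, subject to the inequality constraints listed above.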
In Srdjan’s work, we solve this problem using a novel sparse coding approach. While the full mathematical treatment of this approach can be found in Srdjan’s paper, below you can find a layman’s explanation of the underlying method.
For many signals of interest that occur in nature (e.g. music, voice), there is a lot of structure in the signal, for example the horizontal stripes you see in the diagram below. Furthermore, such signals contain a lot of redundancy and can be efficiently coded using only a few elements from a suitable dictionary. As an example: a sinusoidal wave can be described either by specifying all its time samples for the given duration, or by simply specifying its amplitude and frequency. The latter representation encodes all the time domain information in only two elements. Thus, for the sinusoidal signal, the amplitude and frequency form a ‘sparse’ representation. We exploit this sparsity and structure of the signals to help restore the damaged portions.
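The sinusoid example is easy to verify numerically. In the sketch below (an illustration of the sparsity idea, not code from the paper), a 256-sample sinusoid that has 256 values in the time domain is represented by only two non-negligible coefficients in the DFT domain:

```python
import numpy as np

N = 256
k = 8                      # frequency, in DFT bins
n = np.arange(N)
x = np.sin(2 * np.pi * k * n / N)   # 256 time-domain samples

X = np.fft.fft(x)
# Count spectral coefficients that are not numerical noise:
significant = np.sum(np.abs(X) > 1e-6 * N)
```

Here `significant` comes out as 2 (the bin at frequency k and its conjugate mirror): the entire 256-sample waveform is captured by two spectral elements. This is the sparse representation that the restoration method exploits.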
The first step is to recode the signal based on the structure:
- First we create a suitable dictionary for music and voice audio signals, in a domain where these signals can be represented in a sparse manner (usually, this is the spectral representation of the signal). This audio dictionary contains simple elements that, when combined, can recreate the signal.
- We then iteratively look at the input (clipped) signal to see which part of its structure best matches one of our dictionary entries. Having decided which of the dictionary entries to use, we modify our signal by removing this part of the structure. In selecting the dictionary entries we impose the constraints mentioned above, namely that the reconstructed signal in the clipped portions must lie outside the limits of the clipped signal.
- We repeat this many times, looking for the best (constrained) matches between the signal structure and our dictionary entries, until we obtain the optimal sparse representation of our signal. A major contribution of our approach is the online determination of the optimal sparsity of the signal (i.e. the determination of the maximum number of dictionary entries used in the sparse representation). This is a non-trivial problem that had not been tackled before, with most state-of-the-art approaches assuming that the sparsity is known beforehand.
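To make the steps above concrete, here is a heavily simplified toy version in NumPy: a constrained greedy matching pursuit over a DCT dictionary. This is a sketch of the general idea only, not the algorithm from the paper; the dictionary choice, the stopping rule and all names are illustrative assumptions. It matches atoms against the reliable samples, adds one atom per iteration, and stops once the reconstruction satisfies the clipping constraints, a crude stand-in for the paper's online sparsity selection.

```python
import numpy as np

def mp_declip(y, reliable, pos, neg, threshold, max_atoms=64):
    """Toy constrained matching pursuit (illustrative only).
    y        : clipped signal
    reliable : boolean mask of unclipped samples
    pos, neg : boolean masks of positively / negatively clipped samples
    """
    N = len(y)
    # DCT-II dictionary: column k is a cosine atom, normalised to unit norm.
    D = np.array([[np.cos(np.pi * (n + 0.5) * k / N) for k in range(N)]
                  for n in range(N)])
    D /= np.linalg.norm(D, axis=0)

    residual = np.where(reliable, y, 0.0)
    x_hat = np.zeros(N)
    for _ in range(max_atoms):
        # Match atoms against the residual on the reliable samples only.
        corr = D[reliable].T @ residual[reliable]
        k = np.argmax(np.abs(corr))
        coef = corr[k] / (D[reliable, k] @ D[reliable, k])
        x_hat += coef * D[:, k]
        residual -= coef * D[:, k]
        # Stop once the clipping constraints hold: the reconstruction
        # exceeds the threshold on every clipped sample.
        if np.all(x_hat[pos] >= threshold) and np.all(x_hat[neg] <= -threshold):
            break
    # Reliable samples are known exactly, so keep them unchanged.
    x_hat[reliable] = y[reliable]
    return x_hat

# Usage: clip a cosine, then reconstruct it.
n = np.arange(256)
x = 0.9 * np.cos(np.pi * (n + 0.5) * 3 / 256)
y = np.clip(x, -0.5, 0.5)
pos, neg = y >= 0.5, y <= -0.5
reliable = ~(pos | neg)
x_hat = mp_declip(y, reliable, pos, neg, 0.5)
```

The number of iterations actually used is the sparsity of the reconstruction; letting the constraints decide when to stop, rather than fixing this number in advance, is the toy analogue of the online sparsity determination described above.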
We compared our approach against state-of-the-art approaches (both commercial and academic) and found our algorithm to significantly outperform the competition, both on instrumental measures as well as in blind listening tests.
In the following audio files you can hear the effect of the processing and the obvious improvement it brings. Three files are presented: the original signal, the clipped version (artificially created by hard limiting) and the restored version produced by Srdjan’s processing algorithm.
The result of this work was a paper submitted to ICASSP 2013 in Vancouver, where Srdjan was invited to present his work as a poster. The paper can be downloaded from http://arxiv.org/abs/1303.1023
Srdjan is continuing his studies at Inria in Rennes where he is studying for his doctorate. We wish him all the best and look forward to following his future work.