Presented to ICA99, January 1999.
In: J.-F. Cardoso, C. Jutten and P. Loubaton (eds.),
Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation: ICA'99,
Aussois, France, Jan. 1999, pp. 283-288.
A Bayesian Approach to Source Separation
Kevin H. Knuth
INTRODUCTION
"La théorie des probabilités n'est autre que le sens commun fait calcul".
Probability theory is nothing but common sense reduced to calculation
- Pierre-Simon de Laplace 1819
The problem of source separation is by its very nature an inference problem. There is not enough information to deduce the solution, so one must use any available information to infer the most probable solution. We demonstrate that source separation problems are well suited for the Bayesian approach, which provides a natural and logically consistent method by which one can incorporate prior information to estimate the most probable solution.
Cox (1946) showed that the probability calculus developed by Laplace (1812), which consists of the familiar sum and product rules of probability and "Bayes' Theorem" (derived by Laplace), is the unique method by which logically consistent inferences can be performed when plausibility is represented by real numbers. As will be demonstrated, these techniques allow one to incorporate prior knowledge about a problem into the inference procedure. Several techniques have been developed to aid in the derivation of objective prior probabilities that represent such knowledge: Jaynes' Principle of Maximum Entropy (Jaynes 1978), Jaynes' Principle of Group Invariance (Jaynes 1968), and marginalization.
The Bayesian methodology has several advantages. First, all of the assumptions that go into finding a solution are made explicit. This is essential, since many ad hoc algorithms have implicit assumptions that may restrict the usefulness of the technique in unpredictable ways. It also forces the researcher designing the algorithm to consider the validity of each assumption.
Second, all of the prior knowledge about a specific problem is expressed in terms of prior probabilities that must be evaluated. This provides one with the means to incorporate any additional relevant information into a problem. Often something as simple as symmetry can provide a valuable constraint. It is sometimes the case that inclusion of what appears to be minimal information, such as the form of the source amplitude density, can be a powerful asset. In these cases, a simple algorithm can exhibit surprising success.
The explicit nature of the assumptions and priors also facilitates the generalization of algorithms to new domains of application. This is extremely important since it is often difficult to generalize ad hoc algorithms to other applications.
We derive the Bell-Sejnowski ICA algorithm from first principles, i.e. Bayes' Theorem, and demonstrate how the Bayesian methodology makes explicit the underlying assumptions. We then further demonstrate the relative ease and power of the Bayesian approach by deriving two separation algorithms that incorporate additional prior information. One algorithm separates signals that are known a priori to be decorrelated and the other utilizes information about the signal propagation through the medium from the sources to the detectors.
BAYESIAN INFERENCE
"Observation always involves theory." - Edwin Hubble
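Bayes’ Theorem
$$P(\text{model} \mid \text{data}, I) = \frac{P(\text{model} \mid I)\, P(\text{data} \mid \text{model}, I)}{P(\text{data} \mid I)}$$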
The posterior probability is the answer to the question. It represents the degree to which we believe a given model, perhaps described by some model parameters, accurately describes the physical situation given the available data and all of our prior information.
The prior probability describes the degree to which we believe the model accurately describes reality based on all of our prior information.
The likelihood describes how well the model predicts the data.
The evidence describes the predictive power of the data based on the specific questions or hypotheses we are posing.
Bayes’ Theorem describes how our prior knowledge is modified by the acquisition of new information or data.
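As a small numerical illustration of how the four terms fit together, the sketch below applies Bayes' Theorem to a discrete set of candidate models. The three models and their prior and likelihood values are hypothetical numbers chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def posterior(prior, likelihood):
    """Apply Bayes' Theorem over a discrete set of candidate models."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior * likelihood      # prior times likelihood, one entry per model
    evidence = joint.sum()          # P(data | I), the normalizing constant
    return joint / evidence, evidence

# Hypothetical example: three models with equal prior belief.
prior = [1 / 3, 1 / 3, 1 / 3]
likelihood = [0.10, 0.30, 0.60]     # how well each model predicts the data
post, evidence = posterior(prior, likelihood)
print(post, evidence)
```

With equal priors the posterior simply follows the likelihood; an informative prior would shift it toward the models favored beforehand.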
FORMULATING THE PROBLEM
We consider a mixing problem

which is assumed to be linear, stationary, and instantaneous; and in which the source signals are assumed to be independent. We can describe the mixing process by
x(t) = A s(t)
where A is called the mixing matrix.
Since we have far fewer knowns than unknowns we do not have enough information to deduce a solution. We must do our best to infer a solution.
We choose a model that consists of the source signals, s(t) and the matrix A. The data consists of x(t).
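The mixing model can be simulated in a few lines. This is a minimal sketch: the mixing matrix values, the Laplacian sources, and the sample count are assumptions made for illustration. It also shows why the problem is underdetermined: the data supply N x T numbers, while the model has N x T source values plus N x N matrix elements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_samples = 2, 1000

# Independent source signals s(t), one row per source (assumed Laplacian).
S = rng.laplace(size=(n_sources, n_samples))

# Hypothetical mixing matrix A; in a real problem it is unknown.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])

# Linear, stationary, instantaneous mixing: x(t) = A s(t).
X = A @ S

n_knowns = X.size               # recorded data values
n_unknowns = S.size + A.size    # source values plus mixing matrix elements

# If A were known and invertible, the sources would follow directly.
S_recovered = np.linalg.solve(A, X)
```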
Bayes’ theorem reads:
$$P(A, s(t) \mid x(t), I) = \frac{P(x(t) \mid A, s(t), I)\, P(A, s(t) \mid I)}{P(x(t) \mid I)}$$
We could try to find the most probable model.
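In this case some simplifications can be made. The evidence $P(x(t) \mid I)$ does not depend on the model parameters, so it can be absorbed into a proportionality constant:
$$P(A, s(t) \mid x(t), I) \propto P(x(t) \mid A, s(t), I)\, P(A, s(t) \mid I)$$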
Since the mixing process does not depend on the source signals, we can factor the prior probability into two terms:
$$P(A, s(t) \mid x(t), I) \propto P(x(t) \mid A, s(t), I)\, P(A \mid I)\, P(s(t) \mid I)$$
The prior P(A | I) describes our prior knowledge regarding the form of the mixing matrix. This can include information about the propagation of the signals through the medium, the geometric arrangement of the detectors, or anything else that is known about the signal propagation and detection process.
The prior P(s(t) | I) describes what is known about the source signals, such as the amplitude density, frequency content, and dynamical behavior.
In a noise-free situation, one needs only to estimate A (or its inverse, $W = A^{-1}$). We treat the source signals as nuisance parameters and marginalize by integrating over all possible values of the source signals:
$$P(A \mid x(t), I) \propto P(A \mid I) \int ds\, P(x(t) \mid A, s(t), I)\, P(s(t) \mid I)$$
Marginalization results in a posterior probability of the only model parameter of interest, the mixing matrix.
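It is often easier to estimate the values of the model parameters that maximize the logarithm of the posterior probability:
$$\log P(A \mid x(t), I) = \log P(A \mid I) + \log \int ds\, P(x(t) \mid A, s(t), I)\, P(s(t) \mid I) + C$$
where $C$ is a constant that does not depend on A.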
From this, one can derive all sorts of source separation algorithms by simply assigning probabilities that accurately represent one’s knowledge about a particular source separation problem.
In the absence of a mixing matrix prior, the equation above is equivalent to MacKay's (1996) maximum likelihood derivation.
BLIND SOURCE SEPARATION
Being blind to the details of the mixing is equivalent to a lack of knowledge about the form of the mixing matrix or values of its elements. We represent this ignorance by assigning a prior that is constant for all possible matrices A and zero otherwise:
$$P(A \mid I) = \text{constant}$$
The assignment of a delta-function likelihood expresses our belief that our linear, stationary, instantaneous model perfectly describes the physical situation:
$$P(x \mid A, s, I) = \prod_i \delta\Big(x_i - \sum_k A_{ik} s_k\Big)$$
Finally, if one has information regarding the form of the source amplitude density, one can use that function for the source prior:
$$P(s \mid I) = \prod_l p_l(s_l)$$
where $p_l$ is the amplitude density assigned to source $l$.
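By assigning these probabilities we obtain
$$P(A \mid x, I) \propto \int ds \prod_i \delta\Big(x_i - \sum_k A_{ik} s_k\Big) \prod_l p_l(s_l)$$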
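With a change of variables, the delta functions allow us to evaluate the integrals:
$$P(A \mid x, I) \propto \frac{1}{|\det A|} \prod_l p_l\Big(\sum_k A^{-1}_{lk} x_k\Big)$$
Note: $W = A^{-1}$, and let $u_i = \sum_j W_{ij} x_j$. In terms of the separation matrix, the logarithm of the posterior is
$$\log P(A \mid x, I) = \log |\det W| + \sum_l \log p_l(u_l) + C$$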
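To find the maximum of the logarithm of the posterior probability with respect to the separation matrix, we look at the derivative
$$\frac{\partial}{\partial W_{ij}} \log P(A \mid x, I) = A_{ji} + x_j\, \frac{p_i'(u_i)}{p_i(u_i)}$$
where $p_i$ is the amplitude density assigned to source $i$ and $p_i'$ is its derivative.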
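Writing this in matrix form and post-multiplying by $W^T W$ to make the equation covariant (Amari 1996), we obtain the Bell-Sejnowski stochastic gradient update rule (Bell & Sejnowski 1995):
$$\Delta W \propto \big[I + \phi(u)\, u^T\big] W, \qquad \phi_i(u_i) = \frac{p_i'(u_i)}{p_i(u_i)}$$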
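The update rule can be sketched in NumPy as follows. This is an illustrative batch version, not the author's code. It assumes a super-Gaussian source prior p(u) proportional to 1/cosh(u), for which p'(u)/p(u) = -tanh(u); the function names, learning rate, iteration count, and test mixture are all hypothetical choices.

```python
import numpy as np

def log_posterior(W, X):
    """Per-sample log posterior, up to a constant, for p(u) ~ 1/cosh(u)."""
    U = W @ X
    _, logabsdet = np.linalg.slogdet(W)
    return logabsdet - np.mean(np.sum(np.log(np.cosh(U)), axis=0))

def natural_gradient_step(W, X, lr=0.01):
    """One batch update W <- W + lr * (I + phi(u) u^T) W with phi = -tanh."""
    n, n_samples = X.shape
    U = W @ X                               # current source estimates u = W x
    phi = -np.tanh(U)                       # p'(u) / p(u) for the assumed prior
    delta = (np.eye(n) + (phi @ U.T) / n_samples) @ W
    return W + lr * delta

# Hypothetical test mixture of two Laplacian (super-Gaussian) sources.
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

W = np.eye(2)
before = log_posterior(W, X)
for _ in range(200):
    W = natural_gradient_step(W, X)
after = log_posterior(W, X)
print(before, after)
```

For a sufficiently small learning rate each step moves along an ascent direction of the log posterior, so `after` should exceed `before`. Averaging over the batch replaces the per-sample stochastic update with its expectation.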

SEPARATION OF DECORRELATED SOURCES
To demonstrate the use of the mixing matrix prior, we consider mixtures that are known to be decorrelated. This knowledge implies that the mixing matrix must be orthogonal. We assign a Gaussian density to the mixing matrix prior:

The standard deviation of this density represents our uncertainty about how precisely orthogonal the mixing matrix is.
This assignment, in addition to the likelihood and source prior assignments for BSS above, yields a stochastic gradient algorithm similar to Bell-Sejnowski ICA, but with an extra term that tends to enforce orthogonality:
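One way such an orthogonality term could arise is sketched here. The exact form of the Gaussian prior is not recoverable from this text, so the sketch assumes a log prior equal to -||W W^T - I||^2 / (2 sigma^2) with the Frobenius norm, written in terms of the separation matrix W (the inverse of an orthogonal matrix is also orthogonal). The function names and the value of sigma are hypothetical.

```python
import numpy as np

def log_orthogonality_prior(W, sigma=0.1):
    """Assumed Gaussian log prior penalizing departure of W from orthogonality."""
    R = W @ W.T - np.eye(W.shape[0])
    return -np.sum(R ** 2) / (2.0 * sigma ** 2)

def grad_log_orthogonality_prior(W, sigma=0.1):
    """Gradient of the assumed log prior with respect to W."""
    R = W @ W.T - np.eye(W.shape[0])
    return -2.0 * (R @ W) / sigma ** 2

# A rotation matrix is exactly orthogonal, so the extra term vanishes there.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(grad_log_orthogonality_prior(Q))
```

Adding this gradient to the gradient of the blind-separation log posterior gives an update that pulls W back toward the set of orthogonal matrices, with a strength set by sigma.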

INVERSE SQUARE MIXTURES
Finally, we look at a more complicated example (Knuth 1998) where it is known that the signal amplitude is attenuated as the inverse square of the distance between the source and the detector. In this case we expect that the elements of the mixing matrix will have the particular form:

where
is the amplitude of source j and rij is the distance between detector i and source j. We assume some prior knowledge about the range of amplitudes of source j, specifically from b1j to b2j and represent the prior probability for the source amplitudes with a uniform prior over that range.
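The dependence of the mixing matrix on geometry can be illustrated with a short sketch. It assumes each element is the source amplitude divided by the squared detector-source distance, with any constant geometric factor omitted; the function name and the positions are hypothetical.

```python
import numpy as np

def inverse_square_mixing(source_pos, detector_pos, amplitudes):
    """Mixing matrix with element (i, j) = amplitude_j / r_ij**2 (assumed form)."""
    src = np.asarray(source_pos, dtype=float)      # shape (n_sources, 3)
    det = np.asarray(detector_pos, dtype=float)    # shape (n_detectors, 3)
    diff = det[:, None, :] - src[None, :, :]       # detector i minus source j
    r2 = np.sum(diff ** 2, axis=-1)                # squared distances r_ij^2
    return np.asarray(amplitudes, dtype=float)[None, :] / r2

# Hypothetical geometry: two sources and two detectors on a line.
sources = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
detectors = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
A = inverse_square_mixing(sources, detectors, amplitudes=[1.0, 1.0])
print(A)
```

Each detector records its nearer source at full strength and the farther one attenuated by a factor of four, which shows how knowledge of the geometry constrains the mixing matrix.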
We also assume knowledge of the expected mean value for the possible location of source j and a corresponding expected squared deviation from the mean. For simplicity, we assign a Gamma density as the prior on the detector-source distance

where the mean distance between the detector and the source is given by
and the expected squared deviation of the distance from the mean is given by
.
The prior for an element of the mixing matrix is

where
is the Gamma function,
is the incomplete Gamma function

This prior can be incorporated in the general equation for the posterior probability and one can derive the stochastic gradient update rule

where

with
,
and
.
INVERSE SQUARE GEOMETRY
[Figure: geometry of the sources and detectors]
Sources - Red
Detectors - Blue
The spheres are centered on the mean expected source positions and their radii denote the standard deviations.
Original Waveforms are emitted by the Sources
Mixed Waveforms are recorded by the Detectors
Signals propagate according to an inverse-square law.
ICA refers to Bell-Sejnowski ICA with a super-Gaussian source amplitude density prior.
BSL refers to the Bayesian method also using a super-Gaussian source amplitude density prior.
Sources 1 and 2 have sub-Gaussian amplitude densities and are not separated by ICA using the super-Gaussian source prior.
The Bayesian method, using additional information about the signal propagation, can better separate the signals although separation is still not perfect due to the inappropriate source density assumptions.
More information helps, but inaccurate prior information produces inaccurate results.
FURTHER EXPLANATION CAN BE FOUND ON:
http://huginn.com/knuth/bse.html
REFERENCES
Amari, S. 1996. Natural gradient works efficiently in learning. Neural Comp. 10:251-276.
Bell, A.J. and Sejnowski, T.J. 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Comp. 7:1129-1159.
Cox, R.T. 1946. Probability, frequency and reasonable expectation. Am. J. Phys. 14:1-13. Expanded in The Algebra of Probable Inference, Johns Hopkins University Press, Baltimore (1961).
Jaynes, E.T. 1968. Prior probabilities. In: R. D. Rosenkrantz (ed.), Papers on Probability, Statistics and Statistical Physics, Dordrecht: D. Reidel Publishing Co., 1983, pp. 114-130.
Jaynes, E.T. 1978. Where do we stand on maximum entropy? In: R. D. Rosenkrantz (ed.), Papers on Probability, Statistics and Statistical Physics, Dordrecht: D. Reidel Publishing Co., 1983, pp. 210-314.
Knuth, K.H. 1998. Bayesian source separation and localization. In: A. Mohammad-Djafari (ed.), SPIE'98 Proceedings: Bayesian Inference for Inverse Problems, San Diego, July 1998, pp. 147-158.
Laplace, P.S. 1812. Théorie analytique des probabilités, 2 vols, Courcier, Paris. Reprints of this work are available from: Editions, Culture et Civilisation, 115 Ave. Gabriel Lebron, 1160 Brussels, Belgium.
MacKay, D.J.C. 1996. Maximum likelihood and covariant algorithms for independent component analysis, Draft Paper, http://wol.ra.phy.cam.ac.uk/mackay/
ACKNOWLEDGEMENTS
Special thanks to Herbert G. Vaughan, Jr. for his support and guidance. This work was supported by NIH 5P50 DC00223, NIH 5P30 HD01799, and NIH NIDCD 5 T32 DC00039-05.