Please Note that this is an ARCHIVED Site
while it does contain useful information, it is not up-to-date.
Please click here for NEW information.

Bayesian Source Estimation (BSE)

Kevin H. Knuth, Ph.D. and Herbert G. Vaughan, Jr., M.D.
Dynamic Brain Imaging Laboratory
Departments of Neuroscience and Otolaryngology
Albert Einstein College of Medicine
Bronx NY 10461

Last Updated: 7 July 1999

My interest in the problem of source estimation has arisen from work performed with Herbert G. Vaughan, Jr. and others in the Dynamic Brain Imaging Laboratory on combining information from several different non-invasive human brain imaging techniques to delineate the neural processes that underlie human cognition and sensorimotor function. These techniques include electroencephalography (EEG), magnetoencephalography (MEG), anatomical magnetic resonance imaging (MRI) and functional MRI (MRI). While each of these techniques is useful, there is no single technique that provides the spatial and temporal resolution necessary to make inferences about the intracranial sources of activity. We have recently demonstrated (Knuth and Vaughan 1998) that exciting new algorithms in the area of blind source separation (BSS) are related to those currently used in electromagnetic source localization (ESL) by expressing both in a Bayesian inferential framework. This framework provides a methodology by which several different types of information can be combined to aid in making inferences about a problem.



This html manuscript contains the following basic information on Bayesian Source Estimation:


Description of the Problem

To understand human neurophysiology, we rely on several types of non-invasive neuroimaging techniques. The techniques used in this laboratory are primarily high-density (128 channel) electroencephalography (EEG) and anatomical and functional magnetic resonance imaging (MRI and fMRI). Through collaborations with other neuroimaging laboratories we often are able to obtain magnetoencephalographic (MEG) recordings in addition to the other imaging techniques. The difficulty arises when one has data from several different imaging techniques and wishes to use all of the available data to make inferences about the underlying neurophysiology of the human subject.

We begin by describing the situation for EEG or MEG recordings.
Please keep in mind that this is a brief and simplified explanation.
More details can be found in our publications at the end of this html document.
More recent material can be found here, and my complete publications here.

When a region of neural tissue (consisting of about 100,000 neurons) is synchronously active, detectable extracellular electric currents and magnetic fields are generated. These regions of activity can be modeled as "current dipoles" because they generate a dipolar electric current field in the surrounding volume of the head. These extracellular currents flow throughout the volume of the head and create potential differences on the surface of the head that can be detected with surface electrodes in a procedure called electroencephalography (EEG). One can also place superconducting coils above the head and detect the magnetic fields generated by the activity in a procedure called magnetoencephalography (MEG).

If one knows the positions and orientations of the sources in the brain, one can calculate the patterns of electric potentials or magnetic fields on the surface of the head. This is called the forward problem.

If instead all one has are the patterns of electric potential or magnetic fields then one must try to calculate the locations and orientations of the sources. This is called the inverse problem. Inverse problems are notoriously more difficult to solve than forward problems. In this case, given only the electric potentials and magnetic fields on the surface there is no unique solution to the problem. The only hope is that there is additional information available that can be used to constrain the infinte set of possible solutions to a single unique solution. This is where Bayesian statistics will be useful.

The idea is that one must use all the information available to solve the problem. We will demonstrate this by focusing on the inverse problem where we have information from one technique only, say EEG. (MEG is very similar to EEG and is actually easier to deal with in the sense that the forward problem is easier to deal with for MEG).

In the cartoon below, we show three neural sources, represented in this case by equivalent current dipoles (blue arrows), in the cortical grey matter of the brain. The electrodes on the surface of the head detect the potential differences due to the extracellular currents generated by these active sources. The purple arrows merely demonstrate that each electrode detects some of the current flow from each neural source. (The currents do not flow directly from the sources to the electrodes, but instead flow throughout the volume of the entire head.)

Neural sources in the brain

The problem of electromagnetic source localization deals with using the spatial patterns of potential difference on the scalp to estimate the most likely positions and orientations of the neural sources. This works quite well when there is only one source active. As soon as several sources are simultaneously active, the spatial patterns can become very complicated and can suggest many possible combinations of neural sources.

There is some information that is neglected by ESL techniques, and that is that the neural sources exhibit temporal behavior. In fact, each electrode detects a mixture of the time-varying activity of each neural source. If one succeeds in localizing the sources, one has also succeeded in separating their signals. Similarly, if one succeeds in separating their signals, localization will have been made much easier. This suggests that the problem of electromagnetic source localization (ESL) is related to the problem of blind source separation (BSS) where one separates the signals using the recorded mixtures and a little a priori information about the source signals. This relationship is described in more detail in Knuth and Vaughan 1998.

The diagram below demonstrates the similarity of our problem above to the problem of BSS. We have neural sources which exhibit a time course of activity described by s(t). This neural activity generates electric currents and magnetic fields which propagate through the volume of the head and are recorded by the detectors . The three detectors in this cartoon record mixtures of those source signals, described by x(t). These mixtures are related to the original source signals by the linear transformation A, called the mixing matrix, such that x(t) = A s(t). The linear transformation A depends on the geometry of the sources and detectors and the laws governing the propagation of the signals in the medium. The problem of BSS is to separate those mixed signals by finding an inverse transformation W, called the separation matrix into the estimates of the original source waveforms, described by y(t).

BSS Diagram

BSS utilizes the higher-order statistics of the signals themselves to estimate the mixing matrix A, up to a permutation and rescaling. On the other hand, ESL utilizes knowledge about the signal propagation and the geometry to estimate the parameters such as source locations and orientations, on which the elements of the mixing matrix explicitly depend.

Below we describe the methodology, based on Bayesian inference, with which we combine the problems of BSS and ESL into a single problem which we call Bayesian Source Estimation (BSE).

Back to Top

Description of the Methodology

The problems of source separation and localization are by their very natures problems of inductive inference. There is not enough information available to deduce a solution, so one must infer the most probable solution. The natural starting point to solving an inference problem is the application of Bayes' Theorem:

(1) .

where, model describes the model or model parameters used to describe the physical situation, data represents any data or new information obtained in an experiment, and I denotes any additional prior information one may have about the problem. Bayes' Theorem is easily derived from the relationship between joint and conditional probabilities. The probability on the left-hand side of the equation is called the posterior probability. It describes the degree to which one believes the model accurately describes the physical situation given the data and all prior knowledge. The first probability in the numerator is the likelihood. It describes how well the model can predict the data. The second probability in the numerator is the prior probability. It describes one's prior knowledge regarding the probability that the model adequately describes the physical situation of interest. Finally, in the denominator is the evidence. It describes the predicting power of the data.

Bayes' theorem can be viewed as describing how our prior knowledge changes due to the acquisition of new information (in this case the recording of some data).

We now recast Bayes' Theorem into a form useful for the source estimation problem. Our model of the physical situation will consist of the mixing matrix A and the original source waveforms s(t). The data we record consists of the sampled mixtures, x(t). We rewrite Bayes' Theorem as a proportionality by absorbing the evidence into a proportionality constant:

(2) .

Since the amplitude of the source signals do not affect the signal propagation, the prior probability, P(A, s(t) | I ), can be factored into two terms:

(3) .

Remember the term on the left-hand side of the equation is the posterior probability. It describes the probability that the given mixing matrix A and source waveforms s(t) accurately describe the physical situation. This posterior is proportional to the product of the likelihood that the data can be predicted by the model and the prior probabilities (called priors for short) of both the mixing matrix and the source waveforms. These priors are very important in that they encode our prior knowledge about the structure of the mixing matrix and the temporal behavior of the sources. In the examples to come we will demonstrate their usefulness.

For now we will look at the problem of blind source separation (BSS). The term "blind" refers to the fact that nothing is known about the mixing process. This is a statement of ignorance about the possible values and relationships among values of the elements of the mixing matrix. To express this ignorance, we assign a uniform probability density to the mixing matrix prior P(A | I ). This assignment reflects the fact that we believe any matrix is as likely to be the correct matrix as any other. The proportionality above reduces to:

(4) .

The next question is: What is known about the source behavior?
The answer to this question depends on the signals we are interested in. One may have very little information about the source behavior, or quite a lot. The difficulty lies in encoding this information into a probability distribution. Perhaps the most simple information consists of prior knowledge regarding the amplitude density of the source signals. This amplitude density can be approximated by a histogram of the digitized amplitudes of a signal. Below are a set of signals and their histograms.

Histograms of Digitized Source Waveform Amplitudes
Source Amplitude Densities
Super-Gaussian Source Density Model
Sub-Gaussian Source Density Model

Speech sounds, as shown in the figure above, have a tendency to have sharply peaked amplitude densities. These can be well-modeled by a Lapacian or other super-Gaussian functions, such as sigmoidal (or logistic), inverse tangent or hyperbolic tangent functions. The lower frequency waveforms are more difficult to model because they tend to have multi-modal probability densities. Modelling the individual peaks in these densities, especially when one does not know where they are or how many there are, proves to be very difficult. In these cases, a relatively flat sub-Gaussian model best describes one's prior knowledge. To deal with cases where sources may have either super- or sub-Gaussian amplitude densities, one effective solution utilizes a source amplitude model parameterized by the sign of the kurtosis, which is a binary-valued parameter. A positive kurtosis indexes a super-Gaussian density, and a negative kurtosis indexes a sub-Gaussian density. This was demonstrated by Lee and Sejnowski [1].

The difficulty is to find a probability distribution that provides the "most honest" representation of one's knowledge. There are techniques for doing so, namely Jaynes' principles of Group Invariance [2] and Maximum Entropy [3].

Once one's prior knowledge has been expressed as a prior probability, this can be substituted into the Equation above. The next trick is to assign the likelihood function. Assignment of a Gaussian could be used to express one's doubt as to the predictive power of the model in the case where it is known that there is additional noise or where simplifying approximations have been made. In the case of the Bell and Sejnowski Independent Component Analysis (ICA) algorithm [4], one assumes that the signals are noise-free. This is equivalent to having complete confidence that the linear mixing model, x(t) = A s(t), accurately describes the physical situation. To express this expectation as a probability, we use the delta function:

(5)
where its value is 1 if xi(t) = Aik sk(t) and 0 otherwise.

The final step, we will discuss here is marginalization. In our example, the model consists of a description of both the mixing matrix A and the source waveforms s(t). Because our model is linear, we don't really need to determine both aspects of the model. The mixing matrix is easiest to obtain, and we can eliminate the source waveforms from the model by marginalizing over them. This can be performed using the fact that for two hypotheses, A and B, we can write:

(6)

where the sum is performed over all possible states of the hypothesis, B. This results in an expression describing the probability of a hypothesis A, independent of the considerations of hypothesis B.

Marginalizing our model, Eqn. (3) over all possible waveforms s(t), we obtain

(7)

Note that I derived this from the equation that did not yet take into account our prior and likelihood assignments. This is a good general starting equation for the derivation of informed source separation algorithms. We now have an expression for the probability that any matrix A is an accurate description of our physical problem based on all of our prior information and data.

It is often more convenient to work with the logarithm of the probability. Taking the logarithm of the equation above results in

(8)

This is rather nice, since it neatly separates the probabilities into a sum of terms. The first is the logarithm of the integral of the likelihood with the source amplitude prior. This term acts as the logarithm of the likelihood function for the the posterior probability of the mixing matrix by itself. The second term is the logarithm of the mixing matrix prior. The constant C is the logarithm of the proporationality constant implicit in the Eqn. 7. It is the logarithm of the inverse of the evidence mentioned briefly in the beginning of the section.

Now we can substitute in our likelihood and priors. Since the mixing matrix prior is a constant for the BSS problem, we are left with
(9)

where we ignore the new constant, since it simply normalizes the posterior probability. Recall that the term on the right can be interpreted as the logarithm of the likelihood of the model when the model consists solely of the mixing matrix. This in fact is the approach taken by David MacKay in his maximum likelihood derivation of ICA [5]. Our assumption of statistical independence of the source signals, which up until now has not appearred in the formalism comes in by factorizing the source amplitude prior into a product of priors for each source l.

The final step is to find the matrix that maximizes the posterior probability. This is called the Maximum A Posteriori or MAP criterion. It results in a single solution. However, it is not always the best way to solve the problem, since the posterior probability is the real answer to our question. For simple problems, one can sometimes plot or write a formula for the posterior probability. This provides a complete description of the solution. Monte Carlo sampling of the posterior probability is useful when the dimensionality of the model space is too large to visualize.

Most BSS applications necessitate a single solution, and while it is not necessarily the best thing to do, we can provide one using the MAP criterion. The remainder of the work presented here falls along these lines, although we are currently working on utilizing Monte Carlo techniques to answer more difficult questions.

There are many ways to find a maximum of the posterior probability or, equivalently, its logarithm in Eqn. 9. The method implemented in the Bell-Sejnowski ICA is that of stochastic gradient ascent. The derivative of the logarithm of the probability is used to create an update rule for the matrix A or its inverse W. The data is presented to the update rule and corrections to the current guess are made which constantly increase the probability that the guess is correct. We will not discuss this derivation here since it is discussed in Bell and Sejnowski's paper [4], MacKay's paper [5], as well as several of ours [6-9].

The main idea here is that these problems are inference problems and are most effectively approached using Bayesian inference. Through the Bayesian methodology we are provided with a natural and logically consistent method by which we can include all the information available about a problem. Blind separation techniques work nicely in a wide varienty of cases, but if we have additional information, it would be very unwise not to utilize it. In the context of speech separation, of which several demonstrations are provided below, our brains utilize many types of addtional information. This information includes knowledge of the format frequencies of the speaker's voice, phonetic and syntactic information about the language being spoken, content of the conversation and visual information about the motions and positions of the mouth and lips. In addition, it is well known that primary auditory cortex is not provided with time-series data of the amplitudes of the sound waveforms. These considerations demonstrate the necessity of a formalism that allows for the incorporation of any type of information.

References for the Methodology

  1. Lee, T.-W. & Sejnowski, T.J. 1997. Independent Component Analysis for Sub-Gaussian and Super-Gaussian Mixtures, In: Proceedings of the 4th Joint Symposium on Neural Computation, Vol. 7, Institute for Neural Computation, pp. 132-140.
  2. Jaynes, E.T. 1968. Prior Probabilities In: R. D. Rosenkrantz (ed.), Papers on Probability, Statistics and Statistical Physics, Dordrecht: D. Reidel Publishing Co., 1983:114-130.
  3. Jaynes, E.T. 1978. Where do we stand on maximum entropy? In: R. D. Rosenkrantz (ed.), Papers on Probability, Statistics and Statistical Physics, Dordrecht: D. Reidel Publishing Co., 1983, pp. 210-314.
  4. Bell, A.J. & Sejnowski, T.J. 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.
  5. MacKay, D.J.C. 1996. Maximum likelihood and covariant algorithms for independent component analysis. Draft Paper.
  6. Knuth, K.H. 1998. Difficulties applying recent blind source separation techniques to EEG and MEG. Presented at the MaxEnt97 workshop in Boise Idaho, August 1997. In: G.J. Erickson, J.T. Rychert and C.R. Smith (eds.), Maximum Entropy and Bayesian Methods, Boise 1997, Kluwer, Dordrecht, pp. 209-222.
  7. Knuth, K.H. 1998. Bayesian source separation and localization. SPIE'98 Proceedings: Bayesian Inference for Inverse Problems, A. Mohammad-Djafari (ed.), San Diego, July 1998, pp.147-158.
  8. Knuth, K.H. and Vaughan, H.G. Jr. 1998. Convergent Bayesian formulations of blind source separation and electromagnetic source estimation. Presented at the MaxEnt98 workshop in Munich, July 1998. To be published in: V. Dose, R. Fischer, W. von der Linden, and R. Preuss (eds.), Maximum Entropy and Bayesian Methods, Munich 1998, Dordrecht, Kluwer, 1998.
  9. Knuth, K.H. A Bayesian approach to source separation. In: J.-F. Cardoso, C. Jutten and P. Loubaton (eds.), Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation: ICA'99, Aussios, France, Jan. 1999, pp. 283-288.

Back to Top


Examples of the Application of BSS and BSE

This section contains a brief but growing example of the Bayesian Source Estimation algorithm.

The algorithm utilizes information about the signal propagation through the medium in addition to some basic information about the source locations to aid in the estimation of the original source signals. In this case, we assume that we know that the signals propagate according to an inverse square attenuation law. Information about the propagation of the sound signals equates to information about the possible form of the mixing matrix or values of its elements. Specifically, we expect that the elements of the mixing matrix will have the particular form:

(10)

where is the amplitude of source j and is the distance between detector i and source j.

In a recent paper [1], I derive the following prior probability for the mixing matrix elements Aij based on assumptions regarding the minimum and maximum possible amplitudes of the sources, mean expectations of the source positions and the corresponding expected square deviation from the mean:

(11)

where is the Gamma function, is the incomplete Gamma function, b1j denotes the minimum expected amplitude of source j, b2j denotes the maximum expected amplitude of source j and the mean distance between the source and detector is given by with an expected square deviation from the mean given by .

A final approximation consisting of an assumption of independence of the matrix elements simplifies the prior probability of the entire mixing matrix, A to:

(12)

Note that this assumption neglects all the information we may have about the source and detector geometry. This additional information can be accounted for by a more complicated expression by modifying the equation above.

By substituting Eqn. (12) above into the general Eqn. (3) for the posterior probability. One can derive a stochastic gradient algorithm analagous to the Bell-Sejnowski ICA algorithm, but utilizing our additional information about the signal propagation.



Demonstration

The true source locations are illustrated in the figure below by the RED numbered points.
The detector locations are illustrated by the BLUE numbered points.
We assume some knowledge about the source locations. The believed positions of the sources are described by the centers of the spheres in the figure, where the sphere radius describes the expected standard deviation of the believed locations from the true locations.
Note that in this example, we have rather precise knowledge of the location of source 3.

Geometry used in the Inverse Square Mixture Simulation
Source Geomtery Click on the waveform to play the sound file ...
Imagemap of Waveforms of Emitted and Recorded Signals.

The figure above shows the original waveforms emitted by sources 1 through 5.
The mixed waveforms represent the waveforms recorded by the detectors 1 through 5.

The results of applying the Bell-Sejnowski ICA algorithm with a super-Gaussian (sigmoidal) source amplitude density and the Bayesian Source Estimation algorithm with the same source amplitude density are shown in the next figure. ICA using the super-Gaussian density cannot separate the sources 1 and 2 because, as demonstrated above for source 1, "Hail", the super-Gaussian density does not accurately describe the source's true amplitude density. This lack of accurate information results in a diagonal solution in which the two separated signals consist of the sum and difference of re-scaled versions of the original source waveforms.

However, the additional accurate information allowed BSE algorithm to separate the two source signals even though the source amplitude density prior presumed incorrect information about the source signals. Overall the separated signals contained slightly more noise than ICA, but the diagonal solution was averted. We stress that additional prior information about the mixing is not a substitute for an inaccurate representation of prior information regarding source amplitude densities.

Click on the waveform to play the sound file ...
Imagemap of waveforms comparing ICA and BSE

Examples utilizing both super-Gaussian and sub-Gaussian source density models will be included soon.

References for the Example

  1. Knuth, K.H. 1998. Bayesian source separation and localization. SPIE'98 Proceedings: Bayesian Inference for Inverse Problems, A. Mohammad-Djafari (ed.), San Diego, July 1998, pp.147-158.

Back to Top

Our Publications on Bayesian Source Estimation

Knuth K.H. 1999. A Bayesian approach to source separation. In: J.-F. Cardoso, C. Jutten and P. Loubaton (eds.), Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation: ICA'99, Aussios, France, Jan. 1999, pp. 283-288. [arXiv link] [pdf (218 kb)]

Knuth K.H., Vaughan H.G., Jr. 1999. Convergent Bayesian formulations of blind source separation and electromagnetic source estimation. In: W. von der Linden, V. Dose, R. Fischer and R. Preuss (eds.), Maximum Entropy and Bayesian Methods, Munich 1998, Dordrecht. Kluwer, pp. 217-226. [pdf (204 kb)]

Knuth K.H. 1998. Bayesian source separation and localization. In: A. Mohammad-Djafari (ed.), SPIE'98 Proceedings: Bayesian Inference for Inverse Problems, San Diego, July 1998, pp. 147-158. [arXiv link] [pdf (363 kb)]

Knuth K.H. 1998. Difficulties applying recent blind source separation techniques to EEG and MEG. In: G.J. Erickson, J.T. Rychert and C.R. Smith (eds.), Maximum Entropy and Bayesian Methods, Boise 1997, Kluwer, Dordrecht, pp. 209-222. [pdf (421 kb)]

Knuth K.H. and Schwartz B.J. 1996. Independent component analysis applied to auditory evoked gamma band and 40 Hz steady state fields, Society for Neuroscience Abstracts 1996, Vol. 22.

Back to Top

Links to other Researchers involved in BSS, ESL and Bayesian Statistics

BSS and ICA Links
(This is a short list of meta-links)

Bayesian Links
(Again a short list)

ESL Links
(Again a short list)

Back to Top

Kevin H. Knuth's Home Page