April 19, 2003

results w/ new musicseer data

the table is really big and ugly, so click on the extended entry to see it:

Continue reading "results w/ new musicseer data" »

April 11, 2003

disk full; expediency

Rob filled up blush2, and I just got 3 albums from Brian for Nelly Furtado, Simple Minds, & Liz Phair, so I have to put them somewhere else for the time being.

I'll leave them in my home directory, redirect DATAROOT there, and hand-create a .list file. When space frees up, I should put them in data/music, and run filldb on them.

April 09, 2003

New Artist Set

Ta-da, presenting the new artist set, christened aset400.3. It has 400 artists, chosen to be in the intersection of all the following sets:
  • Musicseer survey data (all-responses, in particular)
  • Opennap, erdos, n2 SIM matrices
  • AOTM lists (enough to be useful)
  • Audio we have (or can get soon)

It's basically the old topset I had at NEC and used for the ICME paper, but I dropped 5 artists (Eros Ramazzotti, Anouk, Laibach, Rockapella, and SR-71) because there wasn't enough of something (AOTM or all-responses), and added 5 more (Liz Phair, Simple Minds, PJ Harvey, Nelly Furtado, and The Verve) that fit the bill. Sorry Dan, Tori is forever banished because she's not in the opennap/erdos set.

Beth, the htk cepstra files for all of these except the following 3 are in the htk directory: furtado_nelly phair_liz simple_minds

Brian, when you come up for air, can you send us audio for these guys? I have some Liz Phair but I think you probly have more. Also I'd still like to fill in the gaps with the missing artists I sent last time; I have enough songs to get by, but I'd like to have the full albums if you got em. And good luck tomorrow!!! Dan, be gentle.

April 03, 2003

another (minor) disaster

In trying to evaluate the old non-audio SIMs (opennap, erdos, etc) against the AOTM data for the ADVENT talk, I found that I had an ID mapping problem. The audio SIMs use topset-400, but the mapping I had from those artists to the old opennap artist IDs was wrong for 67 of the artists (because at some point last summer I merged database entries for these artists, and the wrong ID got kept).

But why are the results for the old stuff - erdos, opennap, n2 - different? Esp. n2 is now worse than random!!?

I figured out the first part: the numbers for erdos etc changed when I replaced the buggy topset-400 id mapping because of another bug in the scoring code. When the code saw a response that it didn't have a sim value for, it was returning from the function Response() instead of "next'ing" to the next SIM-type for the same user judgment. So if erdos was after one of the ank14 things, e.g., then a bunch of judgments would get ignored. Now that it's fixed, the numbers look like the numbers from the old "Quest for ground truth" paper, which is good.
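
The return-versus-next bug is easy to reproduce in miniature. This is only a Python sketch of the control flow, not the actual scoring code: the function name, the dict layout, and the values are all made up for illustration.

```python
def score_judgment(judgment, sims):
    """Collect one judgment's sim value under every SIM type that covers it.

    `sims` maps a SIM-type name to a dict of (target, choice) -> value.
    Both the layout and the names are hypothetical.
    """
    key = (judgment["target"], judgment["choice"])
    scores = {}
    for name, sim in sims.items():
        if key not in sim:
            # The buggy version effectively did `return scores` here, which
            # dropped every SIM type listed after the first missing value.
            continue
        scores[name] = sim[key]
    return scores


sims = {
    "ank14": {},                                # no value for this pair
    "erdos": {("aaliyah", "madonna"): 0.4},     # made-up value
}
result = score_judgment({"target": "aaliyah", "choice": "madonna"}, sims)
```

With `continue`, the erdos value still gets scored even though ank14 comes first and has nothing for the pair.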

But what about n2? Looking more closely at the SIM file, I'm not sure that it's wrong after all. It looks horrible; maybe the question is, how was it getting decent scores before?! Under AOTM eval, it also does extremely badly: .12 while the other ones do about .36 (random is .07). Maybe I have the wrong SIM file somehow? I'll get Beth to run it and see what she comes up with...

I fixed the ICME paper and resubmitted it.

April 02, 2003

bug fixes.

Of course everything was wrong. In the AOTM eval I found 2 bugs, and some other issues.

Bug 1: I wasn't treating distance matrices differently from similarity matrices.
Bug 2: I didn't understand what I was doing with some index sorting business (cdix) in the ranking eval, so I was getting basically random results.
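
For Bug 1, the whole difference is the sort direction when a matrix row is turned into a ranking. A minimal Python sketch, assuming a row is just a list of numbers (the function name and the flag are hypothetical):

```python
def rank_row(row, is_distance):
    """Return column indices ordered from most similar to least similar."""
    # Distances: smaller means closer, so sort ascending.
    # Similarities: larger means closer, so sort descending.
    return sorted(range(len(row)), key=lambda j: row[j], reverse=not is_distance)


row = [0.9, 0.1, 0.5]
as_similarity = rank_row(row, is_distance=False)
as_distance = rank_row(row, is_distance=True)
```

Reading a distance row as a similarity row exactly reverses the ranking, so the closest artists end up at the bottom of the list.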

Now fixed, the results make sense. I threw out the ranking agreement eval and went with the Information Retrieval style eval: treat the top N most probable co-occurring artists in AOTM playlists as ground truth similar hits, and treat the ranked SIM row as "retrieval results". The rank of each hit gets an exponentially decreasing score with halflife 20, then take the mean. So for N=10, optimal is .8598. This gives me a score for each row, and then I take the mean for an overall score.
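
Here is a rough Python sketch of that per-row score. It assumes ranks are counted from 0 (so the top slot scores exactly 1) and that a hit missing from the ranked list scores 0; both are my reading of the description, and the function name is made up. Under those assumptions the best case for N=10 comes out at about .8598.

```python
def ir_score(truth, ranked, halflife=20.0):
    """Mean exponentially decaying score of the ground-truth hits.

    truth  -- the top-N co-occurring artist ids (the "relevant" set)
    ranked -- artist ids ordered by the SIM row, most similar first
    """
    position = {artist: k for k, artist in enumerate(ranked)}  # 0-based rank
    total = sum(0.5 ** (position[a] / halflife) for a in truth if a in position)
    return total / len(truth)


# Best case: the 10 ground-truth hits occupy the first 10 ranks.
optimal = ir_score(set(range(10)), list(range(400)))

# Worst case among complete rankings: the hits sit at the very bottom.
worst = ir_score(set(range(390, 400)), list(range(400)))
```

The overall score is then just the mean of `ir_score` over all rows of the SIM matrix.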

Using N=10, random permutations score like this histogram. It's not Gaussian (because it bumps up against zero on the left), but I treat it like a Gaussian and do the same confidence test that I tried before. (If I make the cutoff larger, e.g. 20, it starts to look Gaussian, and as cutoff gets bigger the variance shrinks.)

So here are the results using the AOTM IR-type evaluation with N=10 and halflife=20, and normalizing by the prior:

             ank14   mfc20   pca14
D-ALA        .2520   .2157   .2287
D-centroid   .2862   .2462   .2457

The 95% significance level is .1644 and the 99.5% significance level is about .2157, the lowest score we see. So that's comforting. Of course, being significantly better than random is the least we should expect, in fact we want a much more demanding evaluation. But at least we're sane now.

Related to a suggestion of Dan's, I wanted to see if the score was correlated with the prior probability of the artist (i.e., popularity). So I made some plots, and it looks pretty much uncorrelated.

AOTM @ NEC

To get this done by Friday, I decided to try the AOTM eval on the NEC data, since the models and sim matrices are already done. I had to compute a new conditional density matrix, and first figure out the mapping between AOTM regularized IDs (Brian's artist.hash) and the topset IDs 1:400. I used basically the same code as before (plistprint), but instead of looking up canon_name in the DB, I just get that from the file topset-400.artists. It put out plid.artists and then I ran cmn to get cond_density.

The good thing about this is that the SIM matrix and the cond density matrix will have corresponding IDs, so no more of that tuna2plid nastiness. I just have to run it in scoreCD now.

April 01, 2003

PCA

The PCA experiment is done. Here are the results:

             ank14v1   mfc20   mfc-pca14
D-centroid   4.07      4.79    4.80
D-ALA        4.37      4.57    4.68

PCA doesn't help, in fact it makes it worse. So it's not just the dimensionality reduction that makes ank14 better. More support for the anchor model approach.

Also, it's interesting that D-ALA does better on mfc's but worse on anchor space. Not sure what to make of that. I guess it's consistent with what I found looking at the anchorspace plots: anchor space distributions look singly Gaussian, while cepstra-space distributions have a bit more shape, especially the first few coefficients.

back to NEC

I went back to the machines and the data at NEC, because I was revising the ICME paper. The re-ripping process ended up with a slightly different data set than what's at NEC, so I wanted to stick with that. Also those machines rock.

I moved the ALA code over there, and added some code to evaluate-responses to handle ALA SIM matrices, and ran the eval. Turns out that ALA does worse than centroids! The numbers are all in the revised ICME paper. I suspect that it's because the distributions are actually pretty unimodal, as I've seen making the anchorspace visualization slides. So maybe training GMMs isn't the thing to do after all. Why the distributions should be so single Gaussian-like is a mystery at this point.

Then I did what we should have done the first time: trained models on the cepstra directly. They did worse than the anchor models, and even a little worse than the random anchors. So that's good news. But it's not quite a fair test: the anchor models have 14 dimensions, and the cepstra models have 20. So I'm in the middle of reducing the 20 cepstra dimensions to 14 via PCA, and then I'll train models on that and evaluate.

March 24, 2003

First test: failed

I implemented a rank-agreement score to compare the ranking-by-audio-similarity to the ranking-by-AOTM-cooccurrence. Say we have N items, and two rankings of those items, where a ranking is just a permutation of the items. One of the rankings, say A, is the reference ranking, and the other, say B, is the ranking to be evaluated. For example, for each artist X, A is a list of the artists in decreasing order of conditional co-occurrence probability given artist X. B is the artists ordered by similarity to X under whatever metric we're testing. Note that B doesn't have to be a complete ordering, we could take the top 10 hits or something.

The rank-agreement score R is:
R = w'*r
where w is a vector of weights that sum to one and are exponentially decreasing, and r is the "rank permutation" vector: r(i) is the index of element B(i) in A. For example, if the sim-based distance puts Eden's Crush as 2nd-closest to reference Aaliyah, but ranked by AOTM co-occurrence, Eden's Crush is 100th closest to Aaliyah, then r(2)=100.

The weights are calculated according to the exponential
w = exp(-(log(2)/halflife)*(0:(length(A)-1)));
w = w / sum(w);   % normalize so the weights sum to one
I used halflife=20 for now.

The optimal value of this metric is when A and B agree completely, i.e. are equal, so then r=1:N. For 414 artists and halflife=20, then optimal_R=29.4. So for the evaluation, I find the rank-agreement score R conditioned on each of the 414 artists, and take opt_R/R for each, and average. So perfection is 1.
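
To check the arithmetic, here is a small Python sketch of R = w'*r. It assumes the weights are normalized to sum to one (as stated above) and uses 1-based ranks; the function names are made up, and the weights are sized to the candidate ranking B, which matches length(A) whenever B is a complete ordering. With 414 items and halflife 20, the optimal value comes out at about 29.4.

```python
import math


def halflife_weights(n, halflife=20.0):
    """Exponentially decreasing weights, normalized to sum to one."""
    raw = [math.exp(-(math.log(2) / halflife) * k) for k in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


def rank_agreement(reference, candidate, halflife=20.0):
    """R = w'*r, where r[i] is the 1-based position of candidate[i] in reference."""
    pos = {item: i + 1 for i, item in enumerate(reference)}
    w = halflife_weights(len(candidate), halflife)
    return sum(wi * pos[item] for wi, item in zip(w, candidate))


items = list(range(414))
optimal_r = rank_agreement(items, items)                    # A and B agree
reversed_r = rank_agreement(items, list(reversed(items)))   # B is A backwards
score_reversed = optimal_r / reversed_r                     # opt_R / R, at most 1
```

Smaller R is better, which is why the per-artist score is taken as opt_R/R rather than R itself.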

To do a statistical significance test (I hope I passed my midterm), we need to know what this score will do under a null hypothesis that the ranking to be evaluated (B) is a random permutation. So I created 10,000 random permutations r of 1:414 and computed w'*r for each. The histogram of that, with the mean in red, is shown here. It looks roughly normal (but a little right-tailed), so I use a basic one-sided significance test with the Normal distribution (probly should use t-test since both mean and variance are unknown, but close enough for now.) The 95% significance point for this test is .1590, but the mean score under the ALA-based SIM metric that I'm using now was .143. Way below 95% significance. In fact, the p-value is about .54, which is completely insignificant.
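
A Python sketch of that null-hypothesis simulation, under some assumptions: the score for each random permutation is opt_R divided by w'*r, the weights are normalized, and only 2,000 permutations are drawn (instead of 10,000) to keep it quick. The function names and the seed are arbitrary, and this may not match the original test in every detail.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def halflife_weights(n, halflife=20.0):
    raw = [math.exp(-(math.log(2) / halflife) * k) for k in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


def null_scores(n_items=414, n_perms=2000, halflife=20.0, seed=0):
    """Scores opt_R / R for random permutations r of 1..n_items."""
    rng = random.Random(seed)
    w = halflife_weights(n_items, halflife)
    ranks = list(range(1, n_items + 1))
    opt_r = sum(wi * ri for wi, ri in zip(w, ranks))
    scores = []
    for _ in range(n_perms):
        perm = ranks[:]
        rng.shuffle(perm)
        scores.append(opt_r / sum(wi * ri for wi, ri in zip(w, perm)))
    return scores


scores = null_scores()
null = NormalDist(mean(scores), stdev(scores))   # treat the null as Gaussian
cutoff_95 = null.inv_cdf(0.95)                   # one-sided 95% significance point
p_value = 1.0 - null.cdf(0.143)                  # 0.143 is the ALA score quoted above
```

A metric has to score above `cutoff_95` to beat random at the 95% level; an observed score sitting near the null mean gives a p-value around one half, which is the situation described here.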

I also tried normalizing the conditional probability by the prior, to adjust for popularity effects. The test also failed, with p-value .52, even worse than the non-normalized case.

Next:

  • I'm not sure that I should be using the entire rank ordering, perhaps only the top 10 or 20 are important? But the halflife should take care of that, downweighting the later ranks. I'll play around with the halflife.
  • There are a number of artists that don't appear in the AOTM lists, so the conditional densities for those are empty, and I need to deal with that.
  • Finally, I can retrain the GMMs with more iterations to see if that helps, and also perhaps try it at the song level.

March 21, 2003

ALA implemented, AOTM eval started

Yay. I implemented Vasconcelos' Asymptotic Likelihood Approximation to the KL-divergence between GMMs. Then I created a distance matrix for the 414 "tuna.artists" (really the artists in the playola DB after re-ripping), using the ALA on GMMs fit to the anchorspace points. One caveat is that I only trained the models with 20 EM iterations because I was impatient, so I'll have to do it again with longer training later.

Then I compared the distance metric to the AOTM data by plotting the conditional co-occurrence densities (conditioned on each artist), but sorted according to the audio-based ALA distance.

Some example results:
http://blush.ee.columbia.edu/~madadam/tmp/cd-sorted-ala-aguilera.jpg
http://blush.ee.columbia.edu/~madadam/tmp/cd-sorted-ala-coldplay.jpg
http://blush.ee.columbia.edu/~madadam/tmp/cd-sorted-ala-abdul.jpg

In the plots, the list of artists in the upper right corner is the top 5 artists sorted by ALA-distance. The top 5 sorted by co-occurrence probability are also labeled.

I think I need to normalize these to get rid of the popularity effect. If we really want to compare the distance-based ranking with the co-occurrence-based ranking, which is what this plot essentially does, then I should normalize the probabilities by popularity. Otherwise, e.g., Radiohead often has a high probability, which doesn't necessarily mean that Radiohead is similar to the conditioned artist.
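
One way to do that normalization, sketched in Python: divide the conditional co-occurrence probability P(b | a) by the prior P(b). The definitions here are assumptions (probabilities are counted per playlist, and an artist counts once per list), the function name is made up, and the toy playlists are invented just to show the effect.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_lift(playlists):
    """Return {(a, b): P(b | a) / P(b)} for every ordered pair that co-occurs."""
    n_lists = len(playlists)
    appears = Counter()    # number of playlists each artist appears in
    together = Counter()   # number of playlists containing both a and b
    for playlist in playlists:
        artists = set(playlist)
        appears.update(artists)
        together.update(permutations(artists, 2))
    lift = {}
    for (a, b), n_ab in together.items():
        p_b_given_a = n_ab / appears[a]
        p_b = appears[b] / n_lists
        lift[(a, b)] = p_b_given_a / p_b
    return lift


playlists = [
    ["radiohead", "coldplay"],
    ["radiohead", "aaliyah", "abdul"],
    ["radiohead", "aaliyah", "abdul"],
    ["radiohead", "coldplay", "aaliyah"],
]
lift = cooccurrence_lift(playlists)
```

In this toy data Radiohead is on every list, so its normalized value given Aaliyah is exactly 1 (no information), while Abdul given Aaliyah comes out above 1.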

next:
- musicseer eval on ALA
- try to quantify this AOTM eval. not sure about the fitting-the-exponential idea I had before, there's really not much reason to believe it should behave exponentially. what I really want to do, I guess is ranking comparison between the distance-based ranking and the conditional density ranking, but normalized as I mentioned.
- retrain the GMMs with longer EM iterations

March 20, 2003

AOTM & audio

I matched the regularized Art of the Mix lists to the 414-artist playola DB
to see what kind of overlap we have. The results:

16% of the songs are by playola artists
7% of the songs are in our DB
35% of the lists have two or more songs in our DB.
346/417 "new playola" artists are represented.

This is good news! I think with these numbers, we have enough data to
explore the relationship between the audio-based sim metric and the AOTM
lists. Here's what I'm planning on doing:

Let's assume that songs that co-occur in a playlist are similar, i.e., the
probability of co-occurrence is some function of similarity. So I'd like to
see a plot of similarity vs. (empirical) conditional probability. I'm
hoping it looks like an exponential density - probability of seeing
something very similar is high, and it quickly falls off as dissimilarity
(distance) increases. The question is, how to use this as a quantitative
measure of how good the similarity metric is? Perhaps we fit an exponential
to the plot, and look at the rate of decay - a faster rate means that
co-occurrence probability falls off faster with distance, so the similarity
metric is better.
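
A quick Python sketch of what fitting the exponential could look like: a least-squares line through log-probability vs. distance, with the negated slope as the decay rate. This is just one possible fitting method (zero probabilities are simply dropped), the function name is made up, and the data below is synthetic.

```python
import math


def fit_decay_rate(distances, probs):
    """Least-squares fit of log p = log c - rate * d; returns the rate."""
    pts = [(d, math.log(p)) for d, p in zip(distances, probs) if p > 0]
    n = len(pts)
    mean_d = sum(d for d, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in pts)
    return -sxy / sxx


# Synthetic check: probabilities that decay exactly like 0.5 * exp(-0.3 * d).
distances = list(range(10))
probs = [0.5 * math.exp(-0.3 * d) for d in distances]
rate = fit_decay_rate(distances, probs)
```

On real data the fit would be noisy, and the rate is only comparable between metrics whose distances are on the same scale.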

anyway, it's something to try.

Cheating

I was just going through Steve's code that does the musicseer evaluation and I discovered something bad: the results I had, which showed that randomly-trained anchor models do as badly as a random sim metric, are wrong. It turns out that Steve's code looks for the word "rand" in the name of a SIM file, and creates its own random SIM metric if it finds it. So my file SIM_ankrand12 was hitting this case and the actual contents of the file were being overridden with random numbers. So obviously this metric did as badly as random: it was random.
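
The collision is easy to see in a couple of lines of Python. This is not Steve's code, only an illustration: the substring test is my guess at how the check behaved, and the reserved name SIM_rand in the stricter version is hypothetical.

```python
import os


def looks_random_by_substring(sim_path):
    """The kind of check that caused the problem: any name containing 'rand'."""
    return "rand" in os.path.basename(sim_path)


def looks_random_by_exact_name(sim_path, reserved="SIM_rand"):
    """A stricter check: only one reserved (hypothetical) file name triggers it."""
    return os.path.basename(sim_path) == reserved


hit_by_substring = looks_random_by_substring("SIM_ankrand12")
hit_by_exact_name = looks_random_by_exact_name("SIM_ankrand12")
```

Renaming the file so that it doesn't contain "rand" would work just as well as changing the check.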

When I fix it, the real results aren't so good. The random anchors do almost as well as the "true" genre-based anchors, in some cases:


Mode & ank14v1Centroid & ankrand12Centroid & erdos & rand \\ \hline
Survey, all (6102 resp, 8.97 av.choices) & 3.9736 & 4.3296 & 3.8270 & 5.4193 \\ \hline
Survey, known (4739 resp, 3.59 av.choices) & 4.4577 & 4.7374 & 4.0704 & 5.4425 \\ \hline
Game, all (7124 resp, 11.10 av.choices) & 4.4532 & 4.6012 & 4.4940 & 5.4964 \\ \hline
Game, known (6244 resp, 4.72 av.choices) & 4.8654 & 4.9094 & 4.8661 & 5.4522 \\

What does this mean? Where does the improvement over randomness come from, when you train anchor models on random training labels? some thoughts:
- Perhaps the neural nets are learning something useful, even though they were given random training labels. I trained the random anchors by selecting several random artists and giving each net several songs by that artist. perhaps there is a bias in this process, and some anchors actually learn characteristics of a "dominant" artist in their training set.
- It may be an effect of the centroid. These results use the highly sophisticated "centroid" method of comparing distributions in anchor space, if you recall. Even on mfcc features that would probably do better than random. That should really be the experiment - modeling the distribution in mfcc space the same way as I model the anchor space distribution and comparing those. which, actually, is basically what we decided to do anyway, i.e. comparing to Beth's method.

March 05, 2003

GMM training: memory issues

there's a lot of data. To train a single artist GMM, i load all the anchorspace data points into matlab, and let it rip. there are 300k - 3M points per artist, average about 500k. i'm doing this on cerise, and it doesn't like it, even with 1G memory. 500k x 14 dimensions = 7M numbers. that should only be about 56M of RAM with double precision (8 bytes per number), but the matlab processes are growing to over a gig of swap. Each process is only using ~ 15% CPU, which makes me think it's thrashing the swap. Does about 7 artists per hour.

- trying with netlab implementation: less memory, fuller CPU usage. but then madonna kicks in - back to thrashing, it seems. here are some options i can think of:
  • downsampling: training on 1/5 of the anchorspace points, randomly sampled.
  • train at song-level; combine song-level GMMs into an artist-level model somehow?
  • batch training for EM?
any other ideas?
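
For the downsampling option, a small Python sketch (the real training is in matlab; this only shows the idea). The helper names, the fixed seed, and the toy data are made up. The second helper just redoes the raw-size arithmetic from above: 500k points x 14 dims x 8 bytes is 56 MB.

```python
import random


def downsample(points, fraction=0.2, seed=0):
    """Keep a random subset of the anchorspace points, without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(points) * fraction))
    return rng.sample(points, k)


def raw_megabytes(n_points, n_dims=14, bytes_per_value=8):
    """Size of the bare data matrix in MB, ignoring any working copies."""
    return n_points * n_dims * bytes_per_value / 1e6


points = [[float(i)] * 14 for i in range(1000)]   # stand-in for real data
subset = downsample(points)

average_artist_mb = raw_megabytes(500_000)
biggest_artist_mb = raw_megabytes(3_000_000)
```

Even the largest artist (3M points) is only a few hundred MB of raw data, so the gig-plus process sizes have to come from intermediate copies made during training rather than from the data itself.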