
GMM training: memory issues

there's a lot of data. To train a single artist GMM, i load all the anchorspace data points into matlab and let it rip. there are 300k - 3M points per artist, about 500k on average. i'm doing this on cerise, and it doesn't like it, even with 1 GB of memory. 500k points x 14 dimensions = 7M numbers. that should only be about 56 MB of RAM at double precision (8 bytes per number), but the matlab processes are growing to over a gig of swap. Each process is only using ~15% CPU, which makes me think it's thrashing the swap. Throughput is about 7 artists per hour.
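a quick sanity check of that arithmetic, as a rough python sketch rather than anything from the actual matlab setup. the function name `matrix_megabytes` is made up for illustration, and it only counts the raw data matrix (1 MB taken as 10^6 bytes), not whatever temporaries the training code allocates:

```python
# Back-of-the-envelope size of one artist's data matrix.
# Assumption: every value is stored as an 8-byte double.
BYTES_PER_DOUBLE = 8


def matrix_megabytes(n_points, n_dims, bytes_per_number=BYTES_PER_DOUBLE):
    """Raw storage for an n_points x n_dims matrix, in MB (1 MB = 1e6 bytes)."""
    return n_points * n_dims * bytes_per_number / 1e6


# Point counts taken from the post: smallest, average, and largest artist.
for label, n_points in [("smallest", 300_000),
                        ("average", 500_000),
                        ("largest", 3_000_000)]:
    print(f"{label}: {matrix_megabytes(n_points, 14):.1f} MB")
```

even the largest artist is only a few hundred MB of raw data, so the extra memory has to be coming from somewhere else in the training.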

- trying with netlab implementation: less memory, fuller CPU usage. but then madonna kicks in - back to thrashing, it seems. here are some options i can think of:
  • downsampling: training on 1/5 of the anchorspace points, randomly sampled.
  • train at song-level; combine song-level GMMs into an artist-level model somehow?
  • batch training for EM?
any other ideas?
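the downsampling option could look roughly like this. it's a minimal python sketch, not the matlab code actually used: `subsample_rows`, the `fraction` and `seed` parameters, and the toy data are all hypothetical, and it assumes the points are sampled uniformly without replacement.

```python
import random


def subsample_rows(points, fraction=0.2, seed=0):
    """Return a uniform random subset of rows, sampled without replacement.

    points   -- list of feature vectors (one row per anchorspace point)
    fraction -- share of rows to keep (0.2 = 1/5, as in the post)
    seed     -- fixed seed so the same subset can be reproduced
    """
    rng = random.Random(seed)
    # Keep at least one row, but never more than are available.
    n_keep = min(len(points), max(1, round(len(points) * fraction)))
    # Sort the chosen indices so the kept rows stay in their original order.
    keep = sorted(rng.sample(range(len(points)), n_keep))
    return [points[i] for i in keep]


# Toy stand-in for one artist: 1000 points in 14 dimensions.
toy_rng = random.Random(1)
toy_points = [[toy_rng.random() for _ in range(14)] for _ in range(1000)]
subset = subsample_rows(toy_points)
print(len(toy_points), "->", len(subset), "points")
```

sampling across the whole artist, rather than taking the first 1/5 of the rows, keeps every song represented in the training set.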

Comments

I think subsampling is fine - there's a huge amount of redundancy in
the features from any given song.

It would be interesting to understand why it's thrashing. I think my
GMM code is full-covariance, which is generally NOT what you do.

Also, the intermediate variables include p(k|X) i.e. nmix x nsamples,
which could be getting quite large.

Dan.
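To put rough numbers on the two points in the comment above, here is a small Python sketch. The number of mixture components is not stated anywhere in this thread, so `n_mix = 16` is purely a placeholder, and the helper names are made up for illustration. The sketch only counts one copy of the posterior matrix and the free parameters of each covariance matrix.

```python
def posterior_megabytes(n_mix, n_samples, bytes_per_number=8):
    """Size in MB of one n_mix x n_samples matrix of p(k|X) in doubles."""
    return n_mix * n_samples * bytes_per_number / 1e6


def covariance_params(n_dims, full=True):
    """Free parameters in one component's covariance matrix.

    A full symmetric matrix has d*(d+1)/2 free entries,
    a diagonal one has d.
    """
    return n_dims * (n_dims + 1) // 2 if full else n_dims


n_mix = 16  # hypothetical value, not taken from the post
for n_samples in (500_000, 3_000_000):
    size = posterior_megabytes(n_mix, n_samples)
    print(f"{n_samples} samples: {size:.0f} MB for p(k|X)")

print("full covariance params per component:", covariance_params(14))
print("diagonal covariance params per component:", covariance_params(14, full=False))
```

Under that assumed component count, a single posterior matrix is already larger than the data itself, and any additional temporaries of the same shape would multiply that.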

subsampling works fine. I don't think it should affect the densities that much. I could do an experiment on a mid-size artist to see: train on all points vs. subsampled, and compare results (w/ ALA?)

I'm subsampling by 1/5.

now the question is how many iterations to train for?

with 20 iterations, it takes ~14 hours (7x2) to train the tuna set (414 artists)
