Dominion Strategy Forum

Miscellaneous => General Discussion => Topic started by: GeoLib on April 04, 2015, 01:02:57 am

Title: Sample Uniformly from ~Normal Distribution
Post by: GeoLib on April 04, 2015, 01:02:57 am
So I'm trying to do this thing for my research and struggling with finding the best way to do it. But I thought maybe some of the people here might know of a good way (especially since we seem to have so many statisticians).

So I have some data that is approximately uniformly distributed wrt a variable x. I would like a sample of that data that is approximately uniformly distributed wrt x on the range from the minimum to maximum. Specifically, I want to sample about a third of this data without replacement. I tried to bin things, but in order to have enough samples in each bin, I can only use 3 bins, which is not ideal. Is there a good way to do this? Obviously, since I want to draw without replacement I can't really get a uniform distribution, since the ends will quickly be depleted, but something "sort of uniform" would be good. Any thoughts?
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: blueblimp on April 04, 2015, 02:47:52 am
So I have some data that is approximately uniformly distributed wrt to a variable x.
I don't understand what this means. Specifically, I'm not familiar with the "distributed with respect to a variable" terminology. Could you be a bit more specific about the motivating problem maybe?
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: Titandrake on April 04, 2015, 04:27:39 am
If I'm understanding it right...you have a list of values in your variable x, and want to pick a third of it without replacement? (I'm assuming you have a list because it otherwise doesn't make much sense to sample without replacement.)

Assuming that's true, you could look into the Knuth shuffle: http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle. Assuming you can modify the given list, you can run the permutation generation but stop after picking the first third of the elements.

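Code:
import random

def partial_shuffle_sample(a, k):
    # Partial Fisher-Yates shuffle: modifies a in place, returns k samples.
    n = len(a)
    for i in range(k):                # while you want more samples
        r = random.randrange(i, n)    # random index from i to n-1
        a[i], a[r] = a[r], a[i]       # swap elements a[i] and a[r]
    return a[:k]                      # the first k elements are the samples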

Depending on the language you're using, there's probably a library that does this for you as well. I know Python in particular has random.shuffle to get a random permutation of a list, so you can do that then chop off the first third as your samples.
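
As a rough illustration of the library route mentioned above, here is a minimal sketch using Python's standard random.sample, which draws without replacement directly. The function name random_third and the toy data are made up for the example. Note that this gives a uniform random sample of the list, so the sample keeps roughly the same shape as the original data.

```python
import random

def random_third(values, seed=None):
    """Return about a third of `values`, drawn uniformly without replacement."""
    rng = random.Random(seed)          # seeded generator so runs are repeatable
    k = len(values) // 3               # "about a third" of the data
    return rng.sample(values, k)       # random.sample never repeats an element

# Toy data, purely for illustration.
data = list(range(30))
picked = random_third(data, seed=0)
```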
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: DStu on April 04, 2015, 04:57:24 am
So I have some data that is approximately uniformly distributed wrt to a variable x.
I don't understand what this means. Specifically, I'm not familiar with the "distributed with respect to a variable" terminology. Could you be a bit more specific about the motivating problem maybe?

I would guess he means he has data whose distribution is the distribution of the variable x.

But in this case I'm a bit sceptical that what you want is to draw without replacement. This is of course only guessing without knowing the problem.
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: GeoLib on April 04, 2015, 11:21:49 am
So obviously my explanation was unclear.

I have a bunch of chemical compounds with a specific real-valued property I'm trying to predict. The set of compounds is approximately normally distributed in this variable x (not sure I'm saying this right: if I were to bin them and plot frequency vs. x, I'd get something that looks approximately like a normal distribution). I want to select a test set of these compounds, but I would like it to be approximately uniformly distributed in this variable (so if I plotted frequency vs. x of the test set, it would be approximately uniformly distributed in the range of possible values). This isn't actually possible to do, since I want a third of the compounds, and there aren't enough at the ends, but something closer would be good.

So I think Titandrake's solution gives me a uniform random sample of the list, which is what I was doing previously.

The reason for wanting a different set is that the compounds we're actually interested in are the ones at the extreme values, so we want to make sure that when we say our model is effective, it isn't just good at predicting things with middling values because those are the bulk of the test set.

Does that make sense?

The solution I came up with last night was to generate a uniform random variable and select the compound with the x value closest to that value, which I think is probably fine. If someone has a better idea though, that would also be good since it's somewhat awkward as is.
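
A minimal sketch of that nearest-value idea, assuming the x values are held in a plain Python list. The function name nearest_to_uniform_sample and the synthetic Gaussian data are hypothetical; in practice you would carry compound IDs along with the values.

```python
import bisect
import random

def nearest_to_uniform_sample(xs, k, seed=None):
    """Pick k distinct values from xs, each the closest remaining value
    to a fresh uniform random target on [min(xs), max(xs)]."""
    if k > len(xs):
        raise ValueError("cannot pick more values than are available")
    rng = random.Random(seed)
    remaining = sorted(xs)                 # sorted so bisect can find neighbours
    if not remaining:
        return []
    lo, hi = remaining[0], remaining[-1]   # range of the original data
    chosen = []
    for _ in range(k):
        target = rng.uniform(lo, hi)
        pos = bisect.bisect_left(remaining, target)
        if pos == 0:
            best = 0                       # target is below every remaining value
        elif pos == len(remaining):
            best = pos - 1                 # target is above every remaining value
        else:
            left_gap = target - remaining[pos - 1]
            right_gap = remaining[pos] - target
            best = pos - 1 if left_gap <= right_gap else pos
        chosen.append(remaining.pop(best)) # pop = sampling without replacement
    return chosen

# Synthetic, roughly normal data standing in for the real property values.
data_rng = random.Random(42)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
test_set = nearest_to_uniform_sample(xs, 100, seed=1)
```

Because values are removed once picked, later targets near the sparse ends fall back to whatever is still closest, which is why the result is only "sort of uniform".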
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: DStu on April 04, 2015, 04:10:18 pm
What you can do if you want to be exact is http://en.wikipedia.org/wiki/Rejection_sampling. But might be inefficient as hell, also because there are not enough values at the edge.

RS is a technique for sampling from a target distribution by starting from another distribution that is easy to sample, given that you know the density ratio between the two and that this ratio has an upper bound.
In your case, you can easily sample from your distribution, which is approximately Gaussian, by the technique Titandrake already mentioned: just draw the sample at a random index.
Your target distribution is uniform.
You also know the (approximate) density ratio between the two: your data has density phi(x) := 1/sqrt(2*pi*sigma²) * exp(-x²/(2*sigma²)) (with x measured from the mean) and the uniform target has constant density, so the ratio is proportional to 1/phi(x). On a bounded interval, it is also bounded (but the constant might be bad, which would make the algo inefficient). Call the constant M, i.e. the maximum of 1/phi(x) on the interval.

So what you do to draw one sample:

1.) Draw a sample z uniformly at random from your data (i.e. from the distribution of x).
2.) Accept z with probability 1/(M*phi(z)). Otherwise, goto 1.)

To draw N/3 samples, you do this N/3 times.
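
A rough sketch of rejection sampling toward a uniform target, under some assumptions that are not in the post: the data is treated as Gaussian with mean and sigma estimated from the list, a candidate z is accepted with probability (smallest density in the data) / (density at z), so rare extreme values are almost always kept and common middling values are mostly rejected, and accepted items are removed from the pool to get sampling without replacement. The names normal_pdf and rejection_sample_uniform are made up for the example.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def rejection_sample_uniform(xs, k, seed=None):
    """Pick k distinct values from xs, flattening an approximately normal shape."""
    if k > len(xs):
        raise ValueError("cannot pick more values than are available")
    rng = random.Random(seed)
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)           # needs at least two data points
    dens = [normal_pdf(x, mu, sigma) for x in xs]
    d_min = min(dens)                      # density at the most extreme point
    pool = list(range(len(xs)))            # indices still available
    chosen = []
    while len(chosen) < k:
        j = rng.randrange(len(pool))       # step 1: uniform draw from the data
        idx = pool[j]
        if rng.random() < d_min / dens[idx]:   # step 2: accept or retry
            chosen.append(xs[idx])
            pool[j] = pool[-1]             # remove the accepted index in O(1)
            pool.pop()
    return chosen

# Synthetic, roughly normal data standing in for the real property values.
data_rng = random.Random(7)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
sample = rejection_sample_uniform(xs, 100, seed=3)
```

The acceptance rate is low for values near the mean, so the loop can run for many iterations once the tails are used up, which is the inefficiency mentioned above.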
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: Witherweaver on April 04, 2015, 04:34:39 pm
Have you looked into pseudorandom or quasirandom sequences?  The idea is that they're low discrepancy, so you better span the space.  I think that's what you're trying to do with the bins.
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: GeoLib on April 04, 2015, 04:51:45 pm
What you can do if you want to be exact is http://en.wikipedia.org/wiki/Rejection_sampling. But might be inefficient as hell, also because there are not enough values at the edge.

RS is a technique for sampling from a target distribution by starting from another distribution that is easy to sample, given that you know the density ratio between the two and that this ratio has an upper bound.
In your case, you can easily sample from your distribution, which is approximately Gaussian, by the technique Titandrake already mentioned: just draw the sample at a random index.
Your target distribution is uniform.
You also know the (approximate) density ratio between the two: your data has density phi(x) := 1/sqrt(2*pi*sigma²) * exp(-x²/(2*sigma²)) (with x measured from the mean) and the uniform target has constant density, so the ratio is proportional to 1/phi(x). On a bounded interval, it is also bounded (but the constant might be bad, which would make the algo inefficient). Call the constant M, i.e. the maximum of 1/phi(x) on the interval.

So what you do to draw one sample:

1.) Draw a sample z uniformly at random from your data (i.e. from the distribution of x).
2.) Accept z with probability 1/(M*phi(z)). Otherwise, goto 1.)

To draw N/3 samples, you do this N/3 times.

Hmmm... This does look like approximately what I want, though I think the method I came up with is actually easier to implement. Thank you though!

Have you looked into pseudorandom or quasirandom sequences?  The idea is that they're low discrepancy, so you better span the space.  I think that's what you're trying to do with the bins.

Well python's random module uses pseudorandom numbers, but I think quasirandom is actually what you're talking about with low discrepancy. That might be worth checking out.
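
For a concrete feel of what a quasirandom (low-discrepancy) sequence looks like, here is a small sketch of the base-2 van der Corput sequence, one standard example. The function name van_der_corput is just a label for this example. Points like these could stand in for the uniform random targets in a nearest-value selection, after rescaling from [0, 1) to the range of x.

```python
def van_der_corput(n, base=2):
    """Return the n-th term (n >= 1) of the van der Corput sequence in [0, 1).

    The digits of n in the given base are mirrored around the radix point.
    """
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)   # peel off the lowest digit of n
        denom *= base
        value += digit / denom       # place it just after the radix point
    return value

points = [van_der_corput(i) for i in range(1, 8)]
# points == [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```

Each new point lands in the largest gap left by the earlier ones, so the interval is covered evenly even after only a few draws.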


I've gotten a sample with x approximately uniformly distributed from the method I mentioned. I'm grumpy though because it means our model can't train on anything very far from the mean because they're all in the test set >:(

Thank you to everyone who made a suggestion
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: DStu on April 04, 2015, 04:54:11 pm
Hmmm... This does look like approximately what I want, though I think the method I came up with is actually easier to implement. Thank you though!

It might even be that your method does the right thing.  I'm not in the mood to calculate that though...
Title: Re: Sample Uniformly from ~Normal Distribution
Post by: Witherweaver on April 04, 2015, 05:11:28 pm
Yeah, I mean quasi.