Quantiles with Rscript
Working with Rscript without a strong statistics background can be a little daunting. Some of its functions seem to have pretty mysterious properties, and the documentation can be a challenge to grok without prior knowledge. One such function I’ve recently encountered is quantile.
The Wikipedia article defines quantiles as:
cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
If you think of a dataset as a stream, quantiles are the points that divide the stream into segments holding roughly equal shares of the data. There are several kinds, but perhaps the most common is the quartile. Quartiles are the three cutpoints that divide a dataset into four parts. So, if you have a dataset of [1, 2, 3, 4, 5, 6, 7, 8], the first, second, and third quartiles, as R’s quantile function computes them by default, are 2.75, 4.5, and 6.25, respectively.
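To make the quartile idea concrete, here is a small sketch that asks R for the three quartiles of the eight-element dataset above. It relies on quantile's default method, which interpolates between neighbouring values; other methods (selected through the type argument) can give slightly different numbers.

```r
# Quartiles of a small dataset using quantile's default method.
data <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Ask for the 25%, 50%, and 75% cutpoints in one call.
quartiles <- quantile(data, probs = c(0.25, 0.5, 0.75))

print(quartiles)
# 25% = 2.75, 50% = 4.50, 75% = 6.25
```

The middle quartile (4.5) is the median, so it matches what median(data) returns.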
I ran into an issue where the code would process and filter out records from data that I was trying to seed for testing. The codebase was very opaque about the reason; all it would tell me was that it was trying to perform processing on an empty dataset. After some digging, I ran into code that looked something like this:
doFoo(quantile(data, 0.98))
Playing with the REPL didn’t make things much clearer. I’d create a dataset of one element, and quantile would return the same value:
data = c(1)
print(quantile(data, 0.98))
# 1
I then figured that, if a quantile’s a cutpoint in the data, it only makes sense that the 0.98 quantile would be the sole element in the data. So I added more datapoints; true enough, calling the quantile function led to a different result:
data = c(1, 2, 3, 4, 5, 6, 7, 8)
print(quantile(data, 0.98))
# 7.86
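The 7.86 is not one of the datapoints, which can look odd at first. As far as I can tell, it comes from the linear interpolation that quantile uses by default (its type 7 method). The sketch below reproduces that calculation by hand; the variable names are my own and only there for illustration.

```r
# Reproduce quantile's default (type 7) result by hand.
data <- c(1, 2, 3, 4, 5, 6, 7, 8)
p <- 0.98

sorted <- sort(data)
n <- length(sorted)

# Fractional position of the cutpoint within the sorted data.
h <- (n - 1) * p + 1          # 7 * 0.98 + 1 = 7.86
lower <- floor(h)             # index of the value just below the cutpoint
upper <- ceiling(h)           # index of the value just above it

# Interpolate linearly between the two neighbouring values.
by_hand <- sorted[lower] + (h - lower) * (sorted[upper] - sorted[lower])

print(by_hand)                      # 7.86
print(unname(quantile(data, p)))    # 7.86
```

With a single element, n is 1, so h is always 1 and the result is always that one element, no matter what p is.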
If a quantile is a cutpoint in the data, then a dataset of size 1 has only one possible cutpoint: its sole element, whatever probability you ask for. With a larger dataset, the quantile becomes something more distinct, since there are more datapoints to interpolate between. So it turns out that all I had to do was seed more data.
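To illustrate why a one-element dataset could end up empty, here is a rough sketch. It assumes the downstream code kept only values strictly above the 0.98 quantile; that is a guess for the sake of the example, and keep_above is a hypothetical stand-in, not the real doFoo.

```r
# Hypothetical filter: keep only values strictly above the p-th quantile.
# This is an assumed stand-in for the real processing step, which is not shown.
keep_above <- function(values, p = 0.98) {
  threshold <- quantile(values, p)
  values[values > threshold]
}

# One element: the threshold equals the element, so nothing survives.
single <- keep_above(c(1))
print(length(single))   # 0

# Eight elements: the threshold is 7.86, so only 8 survives.
many <- keep_above(c(1, 2, 3, 4, 5, 6, 7, 8))
print(many)             # 8
```

Under that assumption, seeding more data is what gives the filter something to let through.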