Performs random sampling and permuting of vectors, dataframes and matrices

Draws a pseudorandom sample from a vector, dataframe or matrix on the serverside or, as a special case, randomly permutes a vector, dataframe or matrix.

ds.sample(
  x = NULL,
  size = NULL,
  seed.as.integer = NULL,
  replace = FALSE,
  prob = NULL,
  newobj = NULL,
  datasources = NULL,
  notify.of.progress = FALSE
)

Arguments

x: Either a character string providing the name of the serverside vector, matrix or data.frame to be sampled or permuted, or an integer/numeric scalar (e.g. 923) indicating that one should create a new vector on the serverside that is a randomly permuted sample of the vector 1:923 or (if [replace] = FALSE and [size] is undefined or equal to 923) a full random permutation of that same vector. For further details of using ds.sample with x set as an integer/numeric, please see help for the sample function in native R. If x is set as a character string denoting a vector, matrix or data.frame on the serverside, please note that although ds.sample effectively calls sample on the serverside, it behaves somewhat differently to sample, for the reasons identified at the top of 'details', and so help for sample should be used as a guide only.
size: a numeric/integer scalar indicating the size of the sample to be drawn. If the [x] argument is a vector, matrix or data.frame on the serverside, [size] is set either to 0 or to the length of the object to be 'sampled', and [replace] is FALSE, then ds.sample will draw a random sample that includes all rows of the input object but will randomly permute them. If the [x] argument is numeric (e.g. 923) and [size] is either undefined or set equal to 923, the output on the serverside will be a vector of length 923 permuted into a random order. If the [replace] argument is FALSE then the value of [size] must be no greater than the length of the object to be sampled; if this is violated an error message will be returned.
seed.as.integer: this is precisely equivalent to the [seed.as.integer] arguments for the pseudo-random number generating functions (e.g. also see help for ds.rBinom, ds.rNorm, ds.rPois and ds.rUnif). In other words, the seed.as.integer argument is either a numeric scalar or a NULL which primes the random seed in each data source. If <seed.as.integer> is a numeric scalar (e.g. 938) the seed is set as 938*1 in the first study in the set of data sources being used, 938*2 in the second, and so on up to 938*N in the Nth study. If <seed.as.integer> is set as 0 all sources will start with the seed value 0 and all the random number generators will therefore start from the same position. If you want to use the same starting seed in all studies but do not wish it to be 0, you can specify a non-zero scalar value for <seed.as.integer> and then use the <datasources> argument to generate the random number vectors one source at a time (e.g. datasources=default.opals[2] to generate the random vector in source 2). As an example, if the <seed.as.integer> value is 78326 then the seed in each source will be set at 78326*1 = 78326, because the vector of datasources being used in each call to the function will always be of length 1 and so the source-specific seed multiplier will also be 1. The function ds.rUnif.o calls the serverside assign function setSeedDS.o to create the random seeds in each source.
replace: a Boolean indicator (TRUE or FALSE) specifying whether the sample should be drawn with or without replacement. Default is FALSE so the sample is drawn without replacement. For further details see help for sample in native R.
prob: a character string containing the name of a numeric vector of probability weights on the serverside that is associated with each of the elements of the vector to be sampled enabling the drawing of a sample with some elements given higher probability of being drawn than others. For further details see help for sample in native R.
newobj: This is a character string providing a name for the output data.frame, which defaults to 'newobj.sample' if no name is specified.
datasources: specifies the particular opal object(s) to use. If the <datasources> argument is not specified the default set of opals will be used. The default opals are called default.opals and the default can be set using the function ds.setDefaultOpals. If the <datasources> is to be specified, it should be set without inverted commas: e.g. datasources=opals.em or datasources=default.opals. If you wish to apply the function solely to e.g. the second opal server in a set of three, the argument can be specified as: e.g. datasources=opals.em[2]. If you wish to specify the first and third opal servers in a set you specify: e.g. datasources=opals.em[c(1,3)]
notify.of.progress: specifies whether console output should be produced to indicate progress. The default value for notify.of.progress is FALSE.
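
The seed rule described for seed.as.integer can be sketched in native R. This is only an illustration of the documented arithmetic: study.seeds is a hypothetical helper written for this example, it is not part of DataSHIELD, and it does not contact any server.

```r
# Hypothetical helper (not a DataSHIELD function): reproduces the documented
# rule that the i-th data source receives the seed seed.as.integer * i.
study.seeds <- function(seed.as.integer, n.sources) {
  seed.as.integer * seq_len(n.sources)
}

# Three sources with seed.as.integer = 938: seeds 938*1, 938*2, 938*3
seeds.three <- study.seeds(938, 3)

# seed.as.integer = 0: every source starts from the same seed value 0
seeds.zero <- study.seeds(0, 3)

# One source per call (e.g. datasources=default.opals[2]): multiplier is 1
seeds.single <- study.seeds(78326, 1)

print(seeds.three)
print(seeds.zero)
print(seeds.single)
```

Calling the function once per source with a datasources vector of length 1 corresponds to the last case, which is why every source then starts from the same non-zero seed.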

Value

The object specified by the <newobj> argument (or default name 'newobj.sample'), which is written to the serverside. In addition, two validity messages are returned indicating whether <newobj> has been created in each data source and, if so, whether it is in a valid form. If its form is not valid in at least one study (e.g. because a disclosure trap was tripped and creation of the full output object was blocked) ds.sample also returns any studysideMessages that may explain the error in creating the full output object. We are currently working to extend the information that can be returned to the clientside when an error occurs.

Details

Clientside function ds.sample calls serverside assign function sampleDS. It is based on the native R function sample() but deals slightly differently with data.frames and matrices. Specifically, the sample() function in R identifies the length of an object and then samples n components of that length. But length(data.frame) in native R returns the number of columns, not the number of rows. So if you have a data.frame with 71 rows and 10 columns, the sample() function will select 10 columns at random, which is often not what is required. In contrast, ds.sample(x="data.frame",size=10) in DataSHIELD will sample 10 rows at random (with or without replacement depending on whether the [replace] argument is TRUE or FALSE, with FALSE being the default). If x is a simple vector or a matrix it is first coerced to a data.frame on the serverside and so is dealt with in the same way (i.e. random selection of 10 rows).
If x is an integer not expressed as a character string, it is dealt with in exactly the same way as in native R. That is, if x = 923 and size=117, DataSHIELD will draw a random sample in random order of size 117 from the vector 1:923 (i.e. 1, 2, ... ,923) with or without replacement depending on whether [replace] is TRUE or FALSE. If the [x] argument is numeric (e.g. 923) and size is either undefined or set equal to 923, the output on the serverside will be a vector of length 923 permuted into a random order.
If the [x] argument is a vector, matrix or data.frame on the serverside, the [size] argument is set either to 0 or to the length of the object to be 'sampled', and [replace] is FALSE, then ds.sample will draw a random sample that includes all rows of the input object but will randomly permute them. This is how ds.sample enables random permuting as well as random sub-sampling.
When a serverside vector, matrix or data.frame is sampled using ds.sample, 3 new columns are appended to the right of the output object. These are: 'in.sample', 'ID.seq', and 'sampling.order'.
The first of these is set to 1 whenever a row enters the sample and, as a QA test, all values in that column in the output object should be 1. 'ID.seq' is a sequential numeric ID appended to the right of the object to be sampled during the running of ds.sample. It runs from 1 to the length of the object and will be appended even if there is already an equivalent sequential ID in the object. The output object is stored in the same order as the original object before sampling, so if the first four elements of 'ID.seq' are 3, 4, 6, 15 ... it means that rows 1 and 2 were not included in the random sample, but rows 3 and 4 were. Row 5 was not included, row 6 was included, rows 7-14 were not, etc. The 'sampling.order' vector is of class numeric and indicates the order in which the rows entered the sample: 1 indicates the first row sampled, 2 the second, etc. The lines of code that follow create an output object of the same length as the input object (PRWa), but its rows join the sample in random order. By sorting the output object (in this case with the default name 'newobj.sample') using ds.dataFrameSort with the 'sampling.order' vector as the sort key, the output object is rendered equivalent to PRWa but with the rows randomly permuted (so the column reflecting the vector 'sampling.order' now runs from 1 to the length of the object, while the column reflecting 'ID.seq', denoting the original order, is now randomly ordered).
If you need to return to the original order you can simply use ds.dataFrameSort again, using the column reflecting 'ID.seq' as the sort key. The lines of code that create and permute the output object are:
(1) ds.sample('PRWa',size=0,seed.as.integer = 256)
(2) ds.make("newobj.sample$sampling.order","sortkey")
(3) ds.dataFrameSort("newobj.sample","sortkey",newobj="newobj.permuted")
The only additional detail to note is that the original name of the sort key ("newobj.sample$sampling.order") is 28 characters long and, because its length is tested to check for disclosure risk, this original name will fail using the usual value for 'nfilter.stringShort' (i.e. 20). This is why step (2) is inserted to create a copy with a shorter name.
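
The behaviour described above can be imitated locally in native R. The sketch below is an assumption-laden emulation and not the sampleDS implementation: emulate.row.sample is a hypothetical helper, PRWa is replaced by simulated data, and base R order() stands in for ds.dataFrameSort.

```r
# Toy stand-in for the serverside data.frame PRWa: 71 rows, 10 columns
set.seed(256)
PRWa <- as.data.frame(matrix(rnorm(71 * 10), nrow = 71, ncol = 10))

n.cols <- length(PRWa)  # number of columns: what native sample() would use
n.rows <- nrow(PRWa)    # number of rows: what ds.sample is described as using

# Hypothetical local emulation of row sampling without replacement
emulate.row.sample <- function(df, size) {
  n <- nrow(df)
  if (size == 0) size <- n  # size = 0 requests a full permutation
  if (size > n) stop("size must be no greater than the number of rows")
  drawn <- sample.int(n, size)  # drawn[k] is the k-th row to enter the sample
  kept <- sort(drawn)           # output is stored in the original row order
  out <- df[kept, , drop = FALSE]
  out$in.sample <- 1                       # QA column: always 1 in the output
  out$ID.seq <- kept                       # original row positions
  out$sampling.order <- match(kept, drawn) # order in which rows were drawn
  out
}

# Sub-sample of 10 rows, stored in original order
sub.sample <- emulate.row.sample(PRWa, size = 10)

# Full permutation: sample all rows, then sort by sampling.order
newobj.sample <- emulate.row.sample(PRWa, size = 0)
newobj.permuted <- newobj.sample[order(newobj.sample$sampling.order), ]

# Return to the original order by sorting on ID.seq
restored <- newobj.permuted[order(newobj.permuted$ID.seq), ]

# Length of the original sort key name, compared with nfilter.stringShort = 20
key.length <- nchar("newobj.sample$sampling.order")
```

After the first sort, 'sampling.order' runs from 1 to 71 and 'ID.seq' is in random order; after the second sort, 'ID.seq' runs from 1 to 71 again and the first ten columns match PRWa.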

Author

Paul Burton, for DataSHIELD Development Team, 15/4/2020