ds.tapply.Rd
Apply one of a selected range of functions to summarize an outcome variable over one or more indexing factors. The resultant summary is written to the client-side.
ds.tapply(
X.name = NULL,
INDEX.names = NULL,
FUN.name = NULL,
datasources = NULL
)
X.name
a character string specifying the name of the variable to be summarized.
INDEX.names
a character string specifying the name of a single factor, or a list or vector of names of up to two factors, to index the variable to be summarized. For more information see Details.
FUN.name
a character string specifying the name of one of the allowable summarizing functions. This can be set as: "N" (or "length"), "mean", "sd", "sum", or "quantile". For more information see Details.
datasources
a list of DSConnection-class objects obtained after login. If the datasources argument is not specified, the default set of connections will be used: see datashield.connections_default.
ds.tapply returns to the client-side an array of the summarized values. The array has one dimension for each indexing factor named in INDEX.names.
This function is similar to the native R function tapply(). It applies one of a selected range of functions to each cell of a ragged array, that is, to each (non-empty) group of values given by each unique combination of a series of indexing factors.
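For orientation, the behaviour of the native function can be sketched in base R on a small made-up data set. The values and variable names below are illustrative assumptions and are not DataSHIELD data.

```r
# Illustrative data only: an outcome and two indexing factors.
outcome <- c(4, 6, 5, 7, 9, 3)
sex     <- factor(c("F", "F", "M", "M", "M", "F"))
smoker  <- factor(c("no", "yes", "no", "no", "yes", "no"))

# One indexing factor: a one-dimensional array with one cell per level.
mean_by_sex <- tapply(outcome, sex, mean)

# Two indexing factors: a two-dimensional array with one cell per
# combination of levels.
mean_by_sex_smoker <- tapply(outcome, list(sex, smoker), mean)

print(mean_by_sex)
print(mean_by_sex_smoker)
```

The number of dimensions of the result equals the number of indexing factors, which is the same shape that ds.tapply returns for each source.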
The range of allowable summarizing functions for the DataSHIELD ds.tapply function is much more restrictive than for the native R tapply function. The reason for this is to protect against disclosure risk. Should other functions be required in the future then, provided they are non-disclosive, the DataSHIELD development team could work on them if requested.
To protect against disclosure, the number of observations in each summarizing group in each source is calculated, and if any of these falls below the value of nfilter.tab (the minimum allowable non-zero count in a contingency table) the tapply analysis of that source will return only an error message. The value of nfilter.tab can be set and modified only by the data custodian. If an analytic team wishes the value to be reduced (e.g. to 1, which will allow any output from tapply to be returned), this needs to be formally discussed and agreed with the data custodian.
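The logic of this check can be sketched in base R. This is only an illustration of the idea and not the tapplyDS implementation: the threshold value, the data and the variable names (nfilter_tab, group_counts, too_small) are all assumptions made for the example.

```r
# Hypothetical threshold; on a real server only the data custodian sets nfilter.tab.
nfilter_tab <- 3

# Illustrative data only: group "C" contains a single observation.
outcome <- c(4, 6, 5, 7, 9, 3, 8)
group   <- factor(c("A", "A", "A", "B", "B", "B", "C"))

# Count the observations in each summarizing group.
group_counts <- tapply(outcome, group, length)

# Flag the analysis if any non-empty group is smaller than the threshold.
too_small <- any(group_counts > 0 & group_counts < nfilter_tab, na.rm = TRUE)

# Return only a message when the check fails, otherwise the summary.
result <- if (too_small) {
  "Blocked: at least one group is below the disclosure threshold"
} else {
  tapply(outcome, group, mean)
}

print(result)
```

With a threshold of 1 the check could never fail for a non-empty group, which is why lowering nfilter.tab to 1 allows any output to be returned.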
If the reason for the tapply analysis is, for example, to break a dataset down into a small number of values for each individual and then to flag up which individuals have at least one positive value for a binary outcome variable, then that flagging does not have to be overtly returned to the client-side. Rather, it can be written as a vector to the server-side at each source (which, like any other server-side object, cannot then be seen, abstracted or copied). This can be done using ds.tapply.assign, which writes the results as a newobj to the server-side and does not test the number of observations in each group against nfilter.tab. For more information see the help for the ds.tapply.assign function.
The native R tapply function has optional arguments such as na.rm = TRUE for FUN = mean, which will exclude any NAs from the outcome variable to be summarized. However, in order to keep DataSHIELD's ds.tapply and ds.tapply.assign functions straightforward, the server-side functions tapplyDS and tapplyDS.assign both start by stripping any observations which have missing (NA) values in either the outcome variable or in any one of the indexing factors. In consequence, the resultant analyses are always based on complete cases.
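The effect of the complete-case rule can be illustrated in base R. The sketch below is an assumption about how the behaviour looks from the outside, using made-up data. It is not the server-side code.

```r
# Illustrative data only: one missing outcome and one missing index value.
outcome <- c(4, NA, 5, 7, 9, 3)
index   <- factor(c("F", "F", "M", NA, "M", "F"))

# Native tapply without na.rm: the NA outcome makes the "F" mean NA.
mean_raw <- tapply(outcome, index, mean)

# Complete-case approach: drop rows with NA in the outcome or the index first.
keep       <- complete.cases(outcome, index)
outcome_cc <- outcome[keep]
index_cc   <- index[keep]
mean_cc    <- tapply(outcome_cc, index_cc, mean)

print(mean_raw)
print(mean_cc)
```

Because the rows are removed before any summary is computed, counts returned by "N" or "length" also refer to complete cases only.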
For the INDEX.names argument, the native R tapply function can coerce non-factor vectors into factors. However, this does not always work when using the DataSHIELD ds.tapply or ds.tapply.assign functions, so if you are concerned that an indexing vector is not being treated correctly as a factor, please first declare it explicitly as a factor using ds.asFactor.
For the FUN.name argument, the allowable functions are: N or length (the number of non-missing observations in the group defined by each combination of indexing factors); mean; sd (standard deviation); sum; quantile (with quantile probabilities set at c(0.05,0.1,0.2,0.25,0.3,0.33,0.4,0.5,0.6,0.67,0.7,0.75,0.8,0.9,0.95)).
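To show what a grouped quantile summary at these fixed probabilities looks like, here is a base R sketch on made-up data. It uses the default settings of the native quantile function. Whether the server-side function uses the same quantile type is an assumption.

```r
# The fixed probabilities listed above.
probs <- c(0.05, 0.1, 0.2, 0.25, 0.3, 0.33, 0.4, 0.5,
           0.6, 0.67, 0.7, 0.75, 0.8, 0.9, 0.95)

# Illustrative data only: two groups of six observations each.
outcome <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
group   <- factor(rep(c("A", "B"), each = 6))

# Each cell of the result holds a named vector of 15 quantiles.
q_by_group <- tapply(outcome, group, quantile, probs = probs)

print(q_by_group[["A"]])
```

Because each group yields a vector rather than a single number, the native tapply returns a list-like array with one vector of quantiles per cell.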
Server function called: tapplyDS
if (FALSE) { # \dontrun{
## Version 6, for version 5 see the Wiki
# connecting to the Opal servers
require('DSI')
require('DSOpal')
require('dsBaseClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",
               url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2",
               url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM2", driver = "OpalDriver")
builder$append(server = "study3",
               url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM3", driver = "OpalDriver")
logindata <- builder$build()
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
# Apply a Function Over a Server-Side Ragged Array
ds.assign(toAssign = "D$LAB_TSC",
          newobj = "LAB_TSC",
          datasources = connections)
ds.assign(toAssign = "D$GENDER",
          newobj = "GENDER",
          datasources = connections)
ds.tapply(X.name = "LAB_TSC",
          INDEX.names = c("GENDER"),
          FUN.name = "mean",
          datasources = connections)
# Clear the DataSHIELD R sessions and logout
datashield.logout(connections)
} # }