Converts a server-side numeric vector into a factor

This function converts a server-side numeric vector to the factor class.

ds.asFactor(
  input.var.name = NULL,
  newobj.name = NULL,
  forced.factor.levels = NULL,
  fixed.dummy.vars = FALSE,
  baseline.level = 1,
  datasources = NULL
)

Arguments

input.var.name: a character string which provides the name of the variable to be converted to a factor.
newobj.name: a character string that provides the name for the output variable that is stored on the data servers. Default asfactor.newobj.
forced.factor.levels: the levels into which the user wants to split the input variable. If NULL (default), a vector of all unique levels across all studies is created.
fixed.dummy.vars: boolean. If TRUE the input variable is converted to a factor but presented as a matrix of dummy variables. If FALSE (default) the input variable is converted to a factor and assigned as a vector.
baseline.level: an integer indicating the baseline level to be used in the creation of the matrix of dummy variables. If fixed.dummy.vars is set to FALSE, the value of baseline.level is ignored.
datasources: a list of DSConnection-class objects obtained after login. If the datasources argument is not specified the default set of connections will be used: see datashield.connections_default.

Value

ds.asFactor returns to the client side the unique levels of the converted variable in ascending order, together with a validity message that gives the name of the created object. The output matrix or vector is stored on the server side.

Details

Converts a numeric vector into a factor type which is represented either as a vector or as a matrix of dummy variables depending on the argument fixed.dummy.vars. The matrix of dummy variables also depends on the argument baseline.level.

ds.asFactor.R and its associated server-side functions asFactorDS1 and asFactorDS2 are to be used when you have a variable that has up to 40 unique levels across all sources combined. If one of the sources does not contain any subjects at a particular level, that level will still be created as an empty category. In the end, all sources therefore include a factor variable with consistent factor levels: one level for every unique value that occurs in at least one source. This is important when you wish to fit models using ds.glm, because the factor levels must be consistent across all studies or the model will not fit.
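The effect of harmonised levels can be illustrated locally in base R. This is only a sketch and not the DataSHIELD implementation: `study1` and `study2` are made-up vectors standing in for the server-side data, and pooling the unique values is an assumption about what the client does with the levels each source shares.

```r
# Hypothetical stand-ins for the same variable held in two sources.
study1 <- c(1, 2, 2, 3)
study2 <- c(2, 3, 3, 5)

# Pool the unique values from every source and sort them in ascending order.
all.levels <- sort(unique(c(study1, study2)))

# Convert each source using the pooled levels, so both factors share them.
f1 <- factor(study1, levels = all.levels)
f2 <- factor(study2, levels = all.levels)

levels(f1)  # identical to levels(f2)
table(f1)   # level 5 exists in study1 but has a count of zero
```

Because both factors carry the same level set, a model fitted across the two sources sees the same categories in each, even where a category is empty.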

For this to be possible, every source has to share all of the unique values it holds for the variable. This allows the client to create a single vector containing all of the unique factor levels across ALL sources. However, this is potentially disclosive if there are too many levels. There are therefore two checks on the number of levels in each source. The first is a test of whether the number of levels exceeds the value of the R option 'nfilter.max.levels', which is set to 40 by default, although the data custodian for the source can choose any alternative value. The second is a test of whether the levels are too dense, i.e. whether the number of levels exceeds a specified proportion of the full length of the relevant vector in that source. The maximum density is set by the R option 'nfilter.levels', which takes the default value 0.33 but can again be modified by the data custodian.

In combination, these two checks mean that if a factor has 35 levels in a given study where the total length of the variable to be converted to a factor is 1000 individuals, the ds.asFactor function will process that variable appropriately. But if it had had 45 levels it would have been blocked by 'nfilter.max.levels' and if the total length of the variable in that study had only been 70 subjects it would have been blocked by the density criterion held in 'nfilter.levels'.
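The arithmetic behind these three cases can be sketched in base R. The helper `levels.allowed` below is hypothetical and is not part of DataSHIELD; its argument names and its use of strict ">" comparisons are assumptions, and the real server-side checks may differ in detail.

```r
# Hypothetical helper that mirrors the two checks described above.
levels.allowed <- function(n.levels, n.obs,
                           max.levels = 40,       # default of nfilter.max.levels
                           max.density = 0.33) {  # default of nfilter.levels
  too.many <- n.levels > max.levels               # check 1: absolute number of levels
  too.dense <- n.levels > max.density * n.obs     # check 2: levels relative to length
  !(too.many || too.dense)
}

levels.allowed(35, 1000)  # passes both checks
levels.allowed(45, 1000)  # fails check 1: more than 40 levels
levels.allowed(35, 70)    # fails check 2: 35 levels exceed 0.33 * 70 = 23.1
```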

If you have a factor with more than 40 levels in each source (most commonly an ID of some sort that you need to pass as an argument to, for example, a tapply function), you cannot use ds.asFactor. Typically in these circumstances you simply want to create a factor that is appropriate for each source, and you do not need to ensure that all levels are consistent across all sources. In that case, you can use the ds.asFactorSimple function, which does no more than coerce a numeric or character variable to a factor. Because you do not need to share unique factor levels between sources, there is then no disclosure issue.
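As a local analogue of this per-source coercion, the sketch below uses base R `as.factor` on made-up data. The vectors `id` and `value` are hypothetical, and treating `as.factor` as comparable to what happens on the server is an assumption.

```r
# Hypothetical data for a single source: a subject ID and a measurement.
id <- c(101, 101, 102, 103, 103, 103)
value <- c(5, 7, 4, 1, 2, 3)

# Coerce the ID to a factor using only the values seen in this source.
id.f <- as.factor(id)

# The factor can then serve as a grouping variable, for example in tapply.
means <- tapply(value, id.f, mean)
means
```

The levels are derived from this source alone, so nothing has to be shared with the client or with other sources to build the factor.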

To understand how the matrix of dummy variables is created, assume that we have the vector (1, 2, 1, 3, 4, 4, 1, 3, 4, 5) of ten integers. If we set fixed.dummy.vars = TRUE, baseline.level = 1 and forced.factor.levels = c(1,2,3,4,5), the input vector is converted to the following matrix of dummy variables:

DV2	DV3	DV4	DV5
0	0	0	0
1	0	0	0
0	0	0	0
0	1	0	0
0	0	1	0
0	0	1	0
0	0	0	0
0	1	0	0
0	0	1	0
0	0	0	1

For the same example if the baseline.level = 3 then the matrix is:

DV1	DV2	DV4	DV5
1	0	0	0
0	1	0	0
1	0	0	0
0	0	0	0
0	0	1	0
0	0	1	0
1	0	0	0
0	0	0	0
0	0	1	0
0	0	0	1

In the first instance the first row of the matrix has zeros in all entries indicating that the first data point belongs to level 1 (as the baseline level is equal to 1). The second row has 1 at the first (DV2) column and zeros elsewhere, indicating that the second data point belongs to level 2. In the second instance (second matrix) where the baseline level is equal to 3, the first row of the matrix has 1 at the first (DV1) column and zeros elsewhere, indicating again that the first data point belongs to level 1. Also as we can see the fourth row of the second matrix has all its elements equal to zero indicating that the fourth data point belongs to level 3 (as the baseline level, in that case, is 3).

If the baseline.level is set to be equal to a value that is not one of the levels of the factor then a matrix of dummy variables is created having as many columns as the number of levels. In that case in each row there is a unique entry equal to 1 at a certain column indicating the level of each data point. So, for the above example where the vector has five levels if we set the baseline.level equal to a value that does not belong to those five levels (baseline.level=8) the matrix of dummy variables is:

DV1	DV2	DV3	DV4	DV5
1	0	0	0	0
0	1	0	0	0
1	0	0	0	0
0	0	1	0	0
0	0	0	1	0
0	0	0	1	0
1	0	0	0	0
0	0	1	0	0
0	0	0	1	0
0	0	0	0	1
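The three matrices above can be reproduced locally with a few lines of base R. The helper `make.dummies` is hypothetical and only illustrates the coding scheme; it is not the code run by asFactorDS2, and naming the columns after the level values is an assumption.

```r
# Hypothetical helper: build the dummy-variable matrix for a vector x.
make.dummies <- function(x, levels, baseline.level) {
  # One indicator column per level: 1 where x equals that level, 0 otherwise.
  m <- outer(x, levels, "==") + 0L
  colnames(m) <- paste0("DV", levels)
  # Drop the baseline column; if the baseline is not a level, all columns stay.
  m[, levels != baseline.level, drop = FALSE]
}

x <- c(1, 2, 1, 3, 4, 4, 1, 3, 4, 5)

make.dummies(x, levels = 1:5, baseline.level = 1)  # columns DV2, DV3, DV4, DV5
make.dummies(x, levels = 1:5, baseline.level = 3)  # columns DV1, DV2, DV4, DV5
make.dummies(x, levels = 1:5, baseline.level = 8)  # columns DV1 to DV5
```

In the first two calls, rows that belong to the baseline level contain only zeros. In the third call, every row contains exactly one 1.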

Server functions called: asFactorDS1 and asFactorDS2

Author

DataSHIELD Development Team

Examples

if (FALSE) { # \dontrun{

  ## Version 6, for version 5 see Wiki
  # Connecting to the Opal servers

  require('DSI')
  require('DSOpal')
  require('dsBaseClient')

  builder <- DSI::newDSLoginBuilder()
  builder$append(server = "study1", 
                 url = "http://192.168.56.100:8080/", 
                 user = "administrator", password = "datashield_test&", 
                 table = "CNSIM.CNSIM1", driver = "OpalDriver")
  builder$append(server = "study2", 
                 url = "http://192.168.56.100:8080/", 
                 user = "administrator", password = "datashield_test&", 
                 table = "CNSIM.CNSIM2", driver = "OpalDriver")
  builder$append(server = "study3",
                 url = "http://192.168.56.100:8080/", 
                 user = "administrator", password = "datashield_test&", 
                 table = "CNSIM.CNSIM3", driver = "OpalDriver")
  logindata <- builder$build()

  # Log onto the remote Opal training servers
  connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D") 

  # Example 1: derive the levels from all studies and create dummy variables
  ds.asFactor(input.var.name = "D$PM_BMI_CATEGORICAL",
              newobj.name = "fact.obj",
              forced.factor.levels = NULL, # use all unique levels from all studies
              fixed.dummy.vars = TRUE,     # create a matrix of dummy variables
              baseline.level = 1,
              datasources = connections)   # use all three Opal servers defined above

  # Example 2: force the levels and use only the first server
  ds.asFactor(input.var.name = "D$PM_BMI_CATEGORICAL",
              newobj.name = "fact.obj",
              forced.factor.levels = c(2,3), # split the variable into 2 levels
              fixed.dummy.vars = TRUE,       # create a matrix of dummy variables
              baseline.level = 1,
              datasources = connections[1])  # only the first Opal server ("study1")

  # Clear the DataSHIELD R sessions and log out
  datashield.logout(connections)
} # }