Function Repository Resource:

LongFormDataset

Source Notebook

Converts tabular data into a long-form dataset

Contributed by: Anton Antonov

ResourceFunction["LongFormDataset"][data]

converts the argument data into a long-form dataset with columns "AutomaticKey","Variable" and "Value".

ResourceFunction["LongFormDataset"][data, idcols]

converts the argument data into a long-form dataset with columns idcols and columns "Variable" and "Value" derived from the columns of data that are not in idcols.

ResourceFunction["LongFormDataset"][data, idcols, valcols]

converts the argument data into a long-form dataset with columns idcols and columns "Variable" and "Value" derived from valcols.

Details and Options

The so called "long form" and "wide form" conversions of tabular data are fundamental "data wrangling" operations.
In long form conversion, the columns of the original dataset are seen in three groups: identifier columns, variable columns and ignored columns.
The conversion into long form converts the set of specified variable columns into two columns: one with variable names and the other with variable values.
The first argument of ResourceFunction["LongFormDataset"] can be a two-dimensional (non-hierarchical) dataset, a two dimensional full array (matrix), a list of associations or an association of associations.
The conversion of a dataset into a long form allows column names (of variables) to be treated as data.
If no identifier columns are given, ResourceFunction["LongFormDataset"] uses the order indexes of the rows as identifiers.
ResourceFunction["LongFormDataset"] takes the following options:
"AutomaticKeysTo""AutomaticKey"the column name if automatic keys are used
"VariablesTo""Variable"the column name of the column that has the variable names as values
"ValuesTo""Value"the column name of the column that has the variable values as values

Examples

Basic Examples (4) 

Here is a simple dataset:

In[1]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
   All, Prepend[#, "ID" -> Increment[k]]& ]]
Out[1]=
Image

Here is a long form conversion with a specified identifier column and variable columns:

In[2]:=
ResourceFunction["LongFormDataset"][dataset, "ID", {"a", "b"}]
Out[2]=
Image

Here is another long form derived with an automatically determined identifier column and variable columns:

In[3]:=
ResourceFunction["LongFormDataset"][dataset]
Out[3]=
Image

Here is a conversion to long form with identifier columns "a" and "b":

In[4]:=
ResourceFunction["LongFormDataset"][dataset, {"a", "b"}]
Out[4]=
Image

Scope (5) 

The first argument can be a matrix:

In[5]:=
mat = RandomReal[{100, 200}, {3, 6}];
ResourceFunction["LongFormDataset"][mat]
Out[2]=
Image

The first argument can be a list of associations:

In[6]:=
alist = AssociationThread[Range[Length[#]], #] & /@ mat;
ResourceFunction["LongFormDataset"][alist]
Out[2]=
Image

The second and third arguments can be Automatic:

In[7]:=
ResourceFunction["LongFormDataset"][dataset, Automatic, {"a", "b"}]
Out[7]=
Image
In[8]:=
ResourceFunction["LongFormDataset"][dataset, "a", Automatic]
Out[8]=
Image

The options "IdentifierColumns" and "VariableColumns" can be used instead of a second argument and third argument respectively:

In[9]:=
ResourceFunction["LongFormDataset"][dataset, "IdentifierColumns" -> "ID", "VariableColumns" -> {"a", "b"}]
Out[9]=
Image

The second and third arguments can be column indexes:

In[10]:=
ResourceFunction["LongFormDataset"][dataset, 2, {1, 3}]
Out[10]=
Image

Automatic keys are derived if the dataset argument has row keys:

In[11]:=
SeedRandom[12];
dataset2 = Dataset[AssociationThread[RandomWord[Length[#]], Normal[#]] &@
   dataset]
Out[8]=
Image

Here is an automatic conversion to long form:

In[12]:=
ResourceFunction["LongFormDataset"][dataset2]
Out[12]=
Image

The row keys are ignored if the identifier column is specified:

In[13]:=
ResourceFunction["LongFormDataset"][dataset2, "ID"]
Out[13]=
Image

Options (3) 

AutomaticKeysTo (1) 

The option "AutomaticKeysTo" can be used to specify the name of the column that corresponds to the automatically determined identifier:

In[14]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
    All, Prepend[#, "ID" -> Increment[k]]& ]];
ResourceFunction["LongFormDataset"][dataset, "AutomaticKeysTo" -> "SpecialID"]
Out[8]=
Image

VariablesTo (1) 

The option "VariablesTo" specifies the name of long form's column that has, as values, the names of the variable columns:

In[15]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
    All, Prepend[#, "ID" -> Increment[k]]& ]];
ResourceFunction["LongFormDataset"][dataset, "VariablesTo" -> "VAR"]
Out[8]=
Image

ValuesTo (1) 

The option "ValuesTo" specifies the name of the column that has the values in the variable columns:

In[16]:=
dataset = Block[{ds, k}, ds = Dataset[{
Association["a" -> "x", "b" -> 5], 
Association["a" -> "y", "b" -> 6], 
Association["a" -> "x", "b" -> 10], 
Association["a" -> "y", "b" -> 100], 
Association["a" -> "z", "b" -> Missing[]]}]; k = 1; ds[
    All, Prepend[#, "ID" -> Increment[k]]& ]];
ResourceFunction["LongFormDataset"][dataset, "ValuesTo" -> "VAL"]
Out[8]=
Image

Applications (9) 

Column names as data (5) 

The main advantage of the long form conversion is that variable column names become data, that is, values in a certain column. This can be demonstrated by making an association of time series objects for the rows of multiple time series data given in wide form.

Here is a randomly generated table (dataset) with multiple time series data:

In[17]:=
dsTSWide = Dataset[
Association[
  2018 -> Association[
    "Apr" -> 83.49, "Aug" -> 0, "Dec" -> 84.59, "Feb" -> 0, "Jan" -> 0, "Jul" -> 0, "Jun" -> 93.32000000000001, "Mar" -> 215.34, "May" -> 89.65, "Nov" -> 168.51999999999998`, "Oct" -> 0, "Sep" -> 116.71000000000001`], 2019 -> Association[
    "Apr" -> 23.61, "Aug" -> 0, "Dec" -> 133.88, "Feb" -> 12.74, "Jan" -> 152.55, "Jul" -> 0, "Jun" -> 93.34, "Mar" -> 59.69, "May" -> 126.05000000000001`, "Nov" -> 94.46000000000001, "Oct" -> 69.95, "Sep" -> 37.27], 2020 -> Association[
    "Apr" -> 94.48, "Aug" -> 38.480000000000004`, "Dec" -> 77.44, "Feb" -> 12.64, "Jan" -> 33.86, "Jul" -> 141.04000000000002`, "Jun" -> 50.36, "Mar" -> 0, "May" -> 0, "Nov" -> 12.65, "Oct" -> 0, "Sep" -> 26.13]]]
Out[17]=
Image

Each row of that random time series dataset corresponds to a yearly sequence of monthly values. The interpretation of the dataset reveals heterogeneous semantics of its values: (1) each row corresponds to a year specified as a key and (2) the columns have short month names.

Convert the dataset into long form using the row keys as identifiers:

In[18]:=
dsLong = ResourceFunction["LongFormDataset"][dsTSWide, "AutomaticKeysTo" -> "Year", "VariablesTo" -> "Month"]
Out[18]=
Image

Add the column "ObservationTime" derived from the columns "Year" and "Month":

In[19]:=
dsLong = dsLong[All, Join[#, <|"ObservationTime" -> AbsoluteTime[ToString[#Year] <> "-" <> #Month]|>] &]
Out[19]=
Image

Split the long form dataset by year and make time series with the columns "ObservationTime" and "Value":

In[20]:=
timeSeriesCollection = GroupBy[Normal@dsLong, #Year &, TimeSeries[Values /@ #[[All, {"ObservationTime", "Value"}]]] &]
Out[20]=
Image

Plot the obtained time series:

In[21]:=
DateListPlot /@ timeSeriesCollection
Out[21]=
Image

Combinations of the heterogenous data (4) 

Using long forms simplifies the programmatic manipulation of heterogenous data.

Get datasets with different numbers of rows and columns that correspond to items and variables of different kinds:

In[22]:=
datasets = Association@
   Map[# -> ResourceFunction[
        "ExampleDataSpecToDataset"][{"Statistics", #}] &, \
{"AnimalWeights", "EmployeeAttitude", "OrangeTreeGrowth"}];
Magnify[#, 0.6] & /@ datasets
Out[18]=
Image

Convert all datasets into long form datasets:

In[23]:=
datasets = ResourceFunction["LongFormDataset"] /@ datasets;
Magnify[#, 0.6] & /@ datasets
Out[20]=
Image

For each long form dataset, change the automatic key column to include the name of that dataset:

In[24]:=
datasets = KeyValueMap[
   Function[{k, v}, v[All, Prepend[#, "AutomaticKey" -> k <> "-" <> ToString[#AutomaticKey]] &]], datasets];
Magnify[#, 0.6] & /@ datasets
Out[25]=
Image

Join all long form datasets into one dataset and show a sample:

In[26]:=
SeedRandom[321];
Sort@RandomSample[Join @@ datasets, 8]
Out[27]=
Image

Properties and Relations (2) 

Here is a random sparse matrix:

In[28]:=
SeedRandom[12];
smat = SparseArray[
  RandomReal[{0, 1}, {4, 5}] /. (x_?NumberQ /; x < 0.5) -> 0]
Out[29]=
Image

Compare the array rules of this sparse matrix with its long-form representation:

In[30]:=
ArrayRules[smat]
Out[30]=
Image

Here is the long-form representation:

In[31]:=
ResourceFunction["LongFormDataset"][Normal@smat][Select[#Value > 0 &]]
Out[31]=
Image

Note that since using Normal on the sparse matrix makes a dense matrix with zeroes, those zeroes are filtered out from the displayed long form.


Cross tabulation can be seen under certain conditions as conversion to wide form. Therefore, certain long form conversions can be reverted into the original forms using cross tabulation.

Here is an example using the resource function over a subset of the Lake Mead elevation levels data:

In[32]:=
dsLakeData = ResourceFunction["ExampleDataSpecToDataset"][{"Statistics", "LakeMeadLevels"}][[1 ;; 4, 1 ;; 4]];
dsLong = ResourceFunction["LongFormDataset"][dsLakeData, "Year", "VariablesTo" -> "Month", "ValuesTo" -> "Elevation"];
dsCrossTbl = ResourceFunction["CrossTabulate"][dsLong];
In[33]:=
AssociationThread[{"Orignal", "Long form", "Cross tabulation"}, Magnify[#, 0.8] & /@ {dsLakeData, dsLong, dsCrossTbl}]
Out[33]=
Image

Possible Issues (2) 

If the first argument is a dataset without column names, the second and third arguments are expected to consist of column indices.

Here is a correct specification:

In[34]:=
ResourceFunction["LongFormDataset"][dataset[Values], 2, {1, 3}]
Out[34]=
Image

Here are some incorrect specifications:

In[35]:=
ResourceFunction["LongFormDataset"][dataset[Values], 2, {1, 5}]
Image
Out[35]=
Image
In[36]:=
ResourceFunction["LongFormDataset"][dataset[Values], "a"]
Image
Out[36]=
Image

If the identifier columns produce multiple corresponding rows for a given combination of identifier values, then only the last row is put in the long-form result:

In[37]:=
dataset
Out[37]=
Image

The row <|"ID"1, "a" x, "b" 5|> is lost:

In[38]:=
ResourceFunction["LongFormDataset"][dataset, "a"]
Out[38]=
Image

Publisher

Anton Antonov

Version History

  • 1.0.0 – 02 December 2020

Related Resources

Author Notes

Conversions to long format and wide format are fundamental data transformation techniques.

License Information