Ch. 6 – Data Organisation

This notebook demonstrates Data Diagnostics and Summaries.

This section is based on an R template for analysing data from experiments and surveys, together with the justification for following certain conventions and structures. The template is available as rendered HTML at https://tuomaseerola.github.io/R_template/.

Organisation

For each project, you should establish one repository that you can clone and share using an appropriate service (such as GitHub, GitLab, or other services; even Dropbox works for collaborations). Give the repository a compact but informative name (chord_priming or emotion_recognition5) that separates it from your other projects.

Within this repository, it is a good idea to establish separate folders for separate elements:

  • /data Data in read-only format (preferably CSV or TSV format)
  • /munge All operations to pre-process, recode, or trim the data
  • /scr All actual R scripts used in the analysis
  • /figures Outputs from the scripts (optional if you use a reporting language)
  • /docs Outputs from the reports (optional if you use a reporting language)

In this repository, contents.R is the file that compiles the full analysis and allows you to reproduce everything in R. Alternatively, this file can be an R Markdown file, which is a neat analysis and reporting format, or even a Quarto document, which is a more advanced version of the same idea. In any case, the summary document contains all the stages, structures, and processes of the project. It is structured to be executed in a coherent order and manner (i.e., loading, transforming, and screening the data, then visualising, applying statistical analyses, and creating figures and tables).
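To make this order concrete, a contents.R built along these lines might look like the following sketch (it uses the script names introduced later in this chapter):

# contents.R -- master script that reproduces the full analysis
## INITIALISE: LOAD LIBRARIES
library(tidyverse, quietly = TRUE)

## LOAD AND MUNGE THE DATA
source('scr/read_data_survey.R')       # Read raw data into v
source('munge/rename_variables.R')     # Rename the columns of v
source('munge/recode_instruments.R')   # Produce long-form df from v

## DIAGNOSE AND SUMMARISE
source('scr/demographics_info.R')      # N, age, gender
source('scr/interrater_reliability.R') # Agreement across participants
source('scr/visualise.R')              # Distributions and correlations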

report.Rmd will create the report that incorporates comments alongside the actual analyses and produces either an html or a pdf file (report.html, report.pdf) in the docs folder.
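If you prefer to render the report from the R console rather than the editor, something along these lines should work (a sketch; the choice of output format is up to you):

rmarkdown::render('report.Rmd', output_format = 'html_document', output_dir = 'docs')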

Data

Typically the data is in CSV (comma-separated values) or TSV (tab-separated values) format, as this is the output of most experiment software and is also easily exported from Qualtrics, psychophysiological measurement systems, and so on. Sometimes the data might be in Excel format, which can also be read easily into R, but I would advise against making extensive edits in Excel, as you lose the ability to tell what has been changed and why. The rule is that we never edit, manipulate, fix, or alter the raw data, no matter what the format is.
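For reference, each of these formats can be read into R with one line (the file names here are hypothetical):

v <- readr::read_csv('data/experiment.csv')   # comma-separated values
v <- readr::read_tsv('data/survey.tsv')       # tab-separated values
v <- readxl::read_excel('data/ratings.xlsx')  # Excel, read without editing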

It is good practice to store the raw original data with timestamps in the data folder; if you receive newer datasets or more observations, add a new data file with a new timestamp and keep the old one for reference. There are situations where the data contain excess observations (pilot participants), typos, and other issues, but it is easier and more transparent to handle these at the processing (munging) stage.
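One simple convention is to embed the export date in the file name and let the reading script pick the most recent file; the naming pattern below is an assumption, not part of the template:

# Read the newest timestamped export, keeping older ones for reference
files <- list.files('data', pattern = 'survey_\\d{4}-\\d{2}-\\d{2}\\.tsv$', full.names = TRUE)
v <- readr::read_tsv(max(files))  # ISO dates sort correctly as strings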

Munging

Munging refers to preprocessing the raw data into a form useful for the actual analysis. Often this means relabelling the names of the variables (columns) and possibly recoding the observations (as numeric responses or as factors). It is also very typical to pivot the data from wide format (numerous variables in columns) to long format, so that the key variables contain all the manipulations.
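As an illustration, tidyr's pivot_longer does this kind of reshaping; the data and column names below are hypothetical:

library(tidyr)
# Hypothetical wide data: one column per rated emotion
df_wide <- tibble::tibble(id = 1:2, happy = c(4, 5), sad = c(2, 1))
# Long form: one row per rating, with the emotion as a key variable
df <- pivot_longer(df_wide, cols = c(happy, sad),
                   names_to = 'emotion', values_to = 'rating')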

Scripts

Often you develop the analysis in stages, starting with some form of quality control and moving on to descriptives and then inferential statistics. For clarity and easier debugging, it is a good idea to develop these as separate scripts (and possibly as functions) and store them in the scr folder. The production of tables and figures can also be done explicitly with separate scripts.

In the end, you should have one file (which I call contents.R) that can produce the full analysis: reading the data, preprocessing it, calculating quality control indicators, summarising the data, running the analyses, and creating tables and figures.

Example repository and template analysis

Proceed to the fuller explanation of the process at https://tuomaseerola.github.io/R_template/, or check the Quarto slides about R_template in action to explore the steps of the analysis process.

Initialise the analysis

Start R and open the contents.R file in your preferred editor. Check that the directory given to the first command, setwd, points to the location of your analysis directory, and run the first lines of the code:

# contents.R
## INITIALISE: LOAD LIBRARIES
library(tidyverse, quietly = TRUE) # Loads the necessary R libraries

If you get errors at this stage with a new installation of R, they likely refer to the additional libraries that are loaded or installed in libraries.R. This script should install the required libraries (such as ggplot2) for you, but there may be issues with your particular setup.
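If libraries.R fails on your setup, its defensive pattern can be reproduced manually; this is a sketch of the general idiom, not the template's actual file:

# Install any missing packages, then load them all
packages <- c('tidyverse', 'ggplot2')
missing <- setdiff(packages, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(packages, library, character.only = TRUE))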

Load, preprocess and diagnose the data

Next, it is time to load the data with the scripts. The first one, read_data_survey.R, simply reads a TSV file exported from Qualtrics and stored in the data folder. I have taken the second, descriptive header row out of the data to simplify the process, but different datasets will have slightly different structures.

source('scr/read_data_survey.R') # Produces data frame v 
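Internally, read_data_survey.R amounts to something like the following (a sketch; the file name is hypothetical):

# Read the Qualtrics TSV export from the data folder into v
v <- readr::read_tsv('data/qualtrics_export.tsv')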

This should retrieve the data into a variable called v in R, which contains a complex data frame. In the next step this raw data will be munged, that is, pre-processed in several ways. Pre-processing can involve multiple steps; here it has been broken into two:

The first operation carries out a long list of variable (column) renamings (rename_variables.R). This can be avoided if the data already has meaningful names, and it is quite useful to try to embed meaningful variable names at the data-collection stage (experiment, survey, or manual coding).
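The renaming itself is plain dplyr; the Qualtrics-style column names below are hypothetical:

# Give generic export columns meaningful names
v <- dplyr::rename(v, age = Q2, gender = Q3, musicianship = Q4)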

Recoding the instruments (recode_instruments.R) has several steps, and it might be useful to study them separately. Finally, the responses are reshaped into long form, which is better suited for the analyses. The resulting data frame will be called df.

source('munge/rename_variables.R')   # Renames the columns of v
source('munge/recode_instruments.R') # Produces df (long-form) from v

After the munging, it is prudent to check various aspects of the data.

  1. Descriptives such as N, age, and gender are echoed in order to remind us of the dataset's properties (demographics_info.R).

  2. We can also explore the consistency of the ratings across participants to check whether they agreed on the ratings and generally understood the task (interrater_reliability.R).

  3. We also want to look at the distributions of the collected data in order to learn whether certain operations (transformations, or resorting to non-parametric statistics) are needed in the subsequent analyses (visualise.R). This step also includes displaying correlations between the emotion scales, which is a useful way to learn about the overlap of the concepts used in the tasks; a sketch of such a check follows below.

source('scr/demographics_info.R')  # Reports N, Age and other details
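The remaining checks can be approximated along the lines of visualise.R; this sketch assumes df contains emotion and rating columns from the munging step:

library(ggplot2)
# Distribution of ratings for each emotion scale
ggplot(df, aes(x = rating)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ emotion)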