R_template in action

MSL meeting

Tuomas Eerola

February 28, 2023

Preliminaries

Preliminaries (1)

This process assumes that you have

  • RStudio installed on your computer (it’s free)

  • Basic understanding of RStudio workspace, path, and environment (Chapter 6.5 in Scientific Musicology)

  • the materials downloaded from https://github.com/tuomaseerola/R_template

    • Tip 1: use the green “Code” button to “Download zip” file

    • Tip 2: extract this into a convenient location on your computer and open contents.R with RStudio

Preliminaries (2)

White slides refer to contents.R

Light blue slides refer to questions to you

These slides are available at: https://tuomaseerola.github.io/R_template/report_presentation.html

Example Data

Initialise the analysis

We take a raw dataset collected via Qualtrics by Annaliese Micallef-Grimaud, who set out to validate her EmoteControl system. She collected ratings of multiple expressed emotions for various musical excerpts that she had created and used in an expression production paradigm. This data is Experiment 2 of the study (Micallef Grimaud & Eerola, 2022).

## INITIALISE: SET PATH, CLEAR MEMORY & LOAD LIBRARIES
rm(list=ls(all=TRUE))             # Cleans the R memory, just in case
source('scr/load_libraries.R')    # Loads the necessary R libraries
The following libraries were loaded: ggplot2 psych dplyr reshape2 stringr tidyr stringr lme4 lmerTest emmeans knitr pbkrtest 

Question: What are packages? (1)

  • Why does R utilise separate packages (libraries on your system)?

  • There are 18,419 of these on CRAN

  • Popular ones are

    • ggplot2, A grammar of graphics in R
    • dplyr, a grammar of data manipulation
    • tidyr, tools to create tidy data
    • foreign, read data stored by Minitab, S, SAS, SPSS

Question: What are packages? (2)

To check what you might already have in your R, type:

search()

If you don’t have a library installed, just type

install.packages("ggplot2")

After installation, you can load the library for your project with

library(ggplot2)

Installation vs library (Analogy and image credit to Dianne Cook of Monash University.)
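A common pattern (a sketch, not part of the template) is to install a package only if it is missing, and then load it:

```r
# Install ggplot2 only if it is not already present, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```

Remember the distinction: install.packages() is needed once per machine, library() once per session.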

Load, preprocess and diagnose

All these operations are sometimes called data carpentry.

## READ data
source('scr/read_data_survey.R')      # Produces data frame v 
N x Variables:119 131
head(v,3)
            StartDate             EndDate Status      IPAddress Progress
1 2019-05-10 10:44:03 2019-05-10 10:54:31      0   151.49.102.8      100
2 2019-05-10 10:42:42 2019-05-10 11:27:42      0 129.234.21.185      100
3 2019-05-10 11:48:53 2019-05-10 12:04:15      0   86.142.18.72      100
  Duration..in.seconds. Finished        RecordedDate        ResponseId
1                   628        1 2019-05-10 10:54:31 R_2zBKHnkPpp1VJNQ
2                  2700        1 2019-05-10 11:27:44 R_1NERfgwV6F9lmUM
3                   922        1 2019-05-10 12:04:16 R_3Get29azuhuBYBo
  RecipientLastName RecipientFirstName RecipientEmail ExternalReference
1                NA                 NA             NA                NA
2                NA                 NA             NA                NA
3                NA                 NA             NA                NA
  LocationLatitude LocationLongitude DistributionChannel UserLanguage Q1 Q3 Q4
1         45.43710         12.332703           anonymous        EN-GB  1 26  2
2         54.76840         -1.563095           anonymous        EN-GB  1 22  1
3         54.00011         -1.535004           anonymous        EN-GB  1 55  2
   Q5 Q6        Q7 Q8                           Q9 Q10 Q12_1 Q12_2 Q12_3 Q12_4
1 107  4 Classical 35                        Piano  13    16    15    15    15
2 185  1 Indie Pop 36                                     19    16    15    15
3 185  6 Classical 35 Piano, flute, voice, violin   50    19    17    15    15
  Q12_5 Q12_6 Q12_7 Q14_1 Q14_2 Q14_3 Q14_4 Q14_5 Q14_6 Q14_7 Q15_1 Q15_2 Q15_3
1    15    15    15     6     4     1     1     3     1     1     1     3     6
2    15    15    15     6     1     1     1     3     1     1     1     5     6
3    15    15    15     5     4     1     1     1     1     3     1     3     6
  Q15_4 Q15_5 Q15_6 Q15_7 Q16_1 Q16_2 Q16_3 Q16_4 Q16_5 Q16_6 Q16_7 Q17_1 Q17_2
1     1     3     3     4     1     6     4     1     1     3     1     3     1
2     1     1     1     1     3     4     1     1     1     1     1     1     1
3     1     1     5     3     3     6     3     1     1     1     1     1     1
  Q17_3 Q17_4 Q17_5 Q17_6 Q17_7 Q18_1 Q18_2 Q18_3 Q18_4 Q18_5 Q18_6 Q18_7 Q19_1
1     1     5     4     5     3     1     1     4     3     5     6     3     3
2     1     6     3     3     1     1     1     1     1     3     5     1     1
3     1     4     3     6     3     3     1     1     6     4     5     4     1
  Q19_2 Q19_3 Q19_4 Q19_5 Q19_6 Q19_7 Q20_1 Q20_2 Q20_3 Q20_4 Q20_5 Q20_6 Q20_7
1     1     1     4     4     4     1     4     3     4     3     3     4     4
2     1     3     1     1     4     3     4     5     1     1     1     1     4
3     1     3     4     1     5     1     3     3     3     4     3     6     5
  Q21_1 Q21_2 Q21_3 Q21_4 Q21_5 Q21_6 Q21_7 Q22_1 Q22_2 Q22_3 Q22_4 Q22_5 Q22_6
1     5     3     1     1     3     5     1     3     3     5     1     1     4
2     1     3     3     1     1     3     1     1     3     6     1     1     1
3     6     5     1     1     1     4     1     1     3     5     3     1     5
  Q22_7 Q23_1 Q23_2 Q23_3 Q23_4 Q23_5 Q23_6 Q23_7 Q24_1 Q24_2 Q24_3 Q24_4 Q24_5
1     4     4     6     5     1     1     6     1     3     1     3     3     4
2     3     1     5     1     1     1     3     1     1     1     1     3     1
3     3     1     5     4     1     1     4     1     4     3     1     3     5
  Q24_6 Q24_7 Q25_1 Q25_2 Q25_3 Q25_4 Q25_5 Q25_6 Q25_7 Q26_1 Q26_2 Q26_3 Q26_4
1     4     1     3     3     5     4     5     5     4     1     3     6     1
2     6     1     1     1     1     1     6     1     1     1     1     1     3
3     4     4     4     4     3     1     6     4     3     1     4     5     1
  Q26_5 Q26_6 Q26_7 Q27_1 Q27_2 Q27_3 Q27_4 Q27_5 Q27_6 Q27_7
1     1     4     3     3     1     6     1     3     5     4
2     1     4     1     1     1     6     1     1     1     3
3     1     4     1     3     1     1     1     1     5     6

Question: What is the script doing?

This is the only line that matters in read_data_survey.R:

v <- read.csv('data/Emotion_Identification_N119_noheader.tsv', header=TRUE, sep = "\t")
  • note how the folder and filename are specified. Adopt a good file-naming convention for your files: no spaces, and clearly label the status of the data (date or N).
  • see header
  • separator (if not specified, comma separation is assumed)
  • DANGER: file encoding can be a problem (UTF-8 is safe)
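Putting these points together, a defensive version of the read call might look like this (the explicit fileEncoding argument is an optional addition, not part of the template script):

```r
# Tab-separated file, first row as header, explicit UTF-8 encoding
v <- read.csv('data/Emotion_Identification_N119_noheader.tsv',
              header = TRUE, sep = "\t", fileEncoding = "UTF-8")
```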

Question: What if the data is in Excel?

If you have the data in an Excel file, you can either:

  • convert it into CSV or TSV, or
  • utilise the readxl library and then run:

library(readxl)
v <- read_excel('my_data.xlsx')
  • read_excel has a lot of options (read specific sheets etc.)
  • DANGER: Excel files have known deficiencies (date conventions, support for a limited number of columns, version differences, and it is a proprietary format)
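For example, to read a specific sheet while skipping an initial row (hypothetical file and sheet, as a sketch of the options):

```r
library(readxl)
# Read only the second sheet, skipping the first row of the file
v <- read_excel('my_data.xlsx', sheet = 2, skip = 1)
```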

Question: What is the structure of the data at this point?

  1. What is the size of the data?

  2. What variables do we have?

  3. What type of variables do we have?

  4. What do we infer from the column names?

Useful commands: dim, head, str, is.na, View.

head(v)
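A quick diagnostic sequence using the commands above might look like this (a sketch):

```r
dim(v)          # number of rows and columns
str(v[, 1:10])  # types of the first ten variables
sum(is.na(v))   # total count of missing values
```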

Data munging

Data munging

In the next step this raw data will be munged, that is, pre-processed in several ways. Pre-processing can have multiple steps; here these have been broken into two:

  1. The first operation carries out a long list of variable renamings (columns in the data, rename_variables.R). This can be avoided if the data already has these names, and it is quite useful to embed meaningful variable names at data collection (coding them into the experiment/survey).
source('munge/rename_variables.R')        # Renames the columns of the v

What happens in this munging?

  • Can you explain the main changes from the raw data?

Here’s an extract of rename_variables.R

#### 1. Rename variable headers ---------------------------------------------------
colnames(v)[colnames(v)=="Duration..in.seconds."]<-"Time"
colnames(v)[colnames(v)=="Q3"]<-"Age"
colnames(v)[colnames(v)=="Q4"]<-"Gender"
# .... 

# Note:
# Track rating renamed, where OG = original track and PT = participants' track
# Middle number is number of track, and emotion names at end are the different emotion rating scales
colnames(v)[colnames(v)=="Q14_1"]<-"OG_01_SADNESS"
colnames(v)[colnames(v)=="Q14_2"]<-"OG_01_CALMNESS"
colnames(v)[colnames(v)=="Q14_3"]<-"OG_01_JOY"
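The same renamings could also be written more compactly with dplyr::rename (an alternative sketch, not what rename_variables.R actually does):

```r
# New name on the left, old name on the right
v <- dplyr::rename(v,
                   Time   = Duration..in.seconds.,
                   Age    = Q3,
                   Gender = Q4)
```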

Recode instruments (1): Trim unnecessary variables and tidy up

source('munge/recode_instruments.R')      # Produces df (long-form) from v
[1] "Raw data: N= 119 and 108 variables"
[1] "Trimmed data: N= 91"

So plenty of things happen in recode_instruments.R. Let’s look inside the script.

1. Trim variables

The script will drop all columns mentioned in the select command; they are prefixed with a minus sign (-), which means they are dropped.

# eliminating unnecessary columns
v <- dplyr::select(v,-StartDate,-EndDate,-Status,-IPAddress,-Progress,-RecordedDate,-ResponseId,-RecipientLastName,-RecipientFirstName,-RecipientEmail,-ExternalReference,-LocationLatitude,-LocationLongitude,-DistributionChannel,-UserLanguage,-Q1,-Q12_1, -Q12_2, -Q12_3, -Q12_4, -Q12_5, -Q12_6, -Q12_7)

2. Recode instruments: Add IDs

For convenience, add participant IDs

v$ID <- c(1:length(v$Age)) # Running number from 1 to the number of rows
v$PID <- paste("S",sprintf("%03d", v$ID),sep="")
v$PID<-factor(v$PID)
ind<-colnames(v)!='ID'
v <- v[, ind]  ## Delete ID and just retain PID
head(v)
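The sprintf("%03d", ...) call zero-pads the running number so that the IDs sort correctly as text:

```r
# Zero-padded participant IDs from a vector of running numbers
paste("S", sprintf("%03d", c(1, 12, 103)), sep = "")
# "S001" "S012" "S103"
```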

3. Recode instruments: Turn categorical vars into labelled factors

This is done to increase clarity and this will help future analyses as well.

v$Gender <- factor(v$Gender,levels=c(1,2,3),labels = c('Male','Female','Other'))
v$MusicalExpertise <- factor(v$MusicalExpertise,levels = c(1,2,3,4,5,6),
                             labels = c("NonMusician","Music-Loving NonMusician",
                                        "Amateur","Serious Amateur Musician","Semi-Pro","Pro"))
v$MusicalExpertiseBinary<-factor(v$MusicalExpertise,
                                 levels = levels(v$MusicalExpertise),
                                 labels=c('Nonmusician','Nonmusician','Musician','Musician','Musician','Musician'))
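A cross-tabulation is a quick way to confirm that the six expertise levels collapsed into the binary factor as intended (a diagnostic sketch, not part of the script):

```r
# Rows: six-level expertise factor; columns: binary recoding
table(v$MusicalExpertise, v$MusicalExpertiseBinary)
```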

4. Recode instruments: Eliminate incomplete responses

Here we first count the NAs (missing values) per row of the dataset, and then apply a threshold of 95% completion rate: if a participant completed more than 95% of the survey, we keep them.

v$NAS <- rowSums(is.na(v[, 11:108]))
NAS <- 100 - (v$NAS/nrow(v))*100
threshold <- 95
good_ones <- NAS >= threshold
v <- v[good_ones, ]
v <- v[,1:110] # drop NAS 
print(paste('Trimmed data: N=',nrow(v)))
head(v)

5. Recode instruments: Convert into long-format

Pull out emotions and tracks from the data (convert to long-form) and collapse across all 14 tracks.

df <- pivot_longer(v,cols = 11:108) # These are the columns with ratings
df$Track<-df$name
df$Track <- gsub("[A-Z][A-Z]_", "", df$Track) # strip the source prefix (e.g. "OG_")
df$Track <- gsub("_[A-Z]+$", "", df$Track)    # strip the emotion suffix (e.g. "_POWER")

df$Source <- gsub("_[0-9][0-9]_[A-Z]+$", "", df$name) # take out source (OG and PTs ie own vs participant generated)
df$Scale <- gsub("[A-Z][A-Z]_[0-9][0-9]_", "", df$name) # take out scale

df$Track<-factor(df$Track,levels = c('01','02','03','04','05','06','07'),labels = c('Sadness','Joy','Calmness','Anger','Fear','Power','Surprise'))
df$Source<-factor(df$Source,levels = c('OG','PT'),labels = c('Exp1','Exp2'))

colnames(df)[colnames(df)=='value']<-'Rating'
df$Rating <- dplyr::recode(df$Rating, `1` = 1L, `3` = 2L, `4` = 3L, `5` = 4L, `6` = 5L) # map the raw response codes 1,3,4,5,6 onto an ordinal 1-5 scale
df$Scale<-factor(df$Scale)
df$PreferredGenre<-factor(df$PreferredGenre)
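The two gsub calls strip the source prefix and the emotion suffix from a column name; applied to a single example name they work like this:

```r
x <- "OG_01_SADNESS"
x <- gsub("[A-Z][A-Z]_", "", x)  # removes the source prefix: "01_SADNESS"
x <- gsub("_[A-Z]+$", "", x)     # removes the emotion suffix: "01"
```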

6. Recode instruments: Drop unnecessary columns

Finally we have a clean final data frame in long format.

df <- dplyr::select(df,-Country,-Finished,-InstrumentPlayer,-Instrument,-MusicalTraining,-name,-MusicalExpertise,-PreferredGenre)
head(df)

Question: What is the structure of the data at this point?

  • Give a short explanation of what do we have now in df?

Descriptives

Checking the data: Descriptives

After the munging, it is prudent to check various aspects of the data such as the N, age, and gender …

source('scr/demographics_info.R')     # Reports N, Age and other details
[1] "N = 91"
[1] "Mean age 34.99"
[1] "SD age 15.86"
[1] "Youngest 18 years"
[1] "Oldest 71 years"

  Male Female  Other 
    23     67      1 

             NonMusician Music-Loving NonMusician                  Amateur 
                      13                       44                       15 
Serious Amateur Musician                 Semi-Pro                      Pro 
                      11                        6                        2 

Nonmusician    Musician 
         57          34 

Checking the data: Descriptives

Summaries are easily created with a few commands such as mean, sd, or table:

mean(v$Age)
[1] 34.98901
round(mean(v$Age),2)
[1] 34.99
print(table(v$Gender)) # gender distribution

  Male Female  Other 
    23     67      1 

Question: Can you describe….

  • … how many emotion scales there are in the data?
  • … how many tracks there are in the data?
  • … how many ratings per emotion scales and tracks there are in the data?

tip: the table command works well here. You can also cross-tabulate multiple columns at once: table(df$Source, df$Track)
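For instance, the counts can be obtained along these lines (a sketch of the commands, not output from the actual data):

```r
table(df$Scale)             # ratings per emotion scale
table(df$Track)             # ratings per track
table(df$Source, df$Track)  # ratings cross-tabulated by source and track
```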

Checking the data (2): Consistency

We can explore the consistency of the ratings across participants. This calculates Cronbach’s \(\alpha\) reliability coefficient for internal consistency across participants for each concept.

source('scr/interrater_reliability.R')
[1] "Fastest response 7.17 mins"
[1] "Slowest response 8291.48 mins"
[1] "Median response 14.9 mins"


Table: Inter-rater reliability (Cronbach’s alphas)

| SADNESS| CALMNESS|   JOY| ANGER| FEAR| POWER| SURPRISE|
|-------:|--------:|-----:|-----:|----:|-----:|--------:|
|   0.995|    0.994| 0.995|  0.99| 0.99| 0.962|    0.978|
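For reference, Cronbach’s alpha can be computed with the psych package (loaded earlier); this is a sketch, and interrater_reliability.R may organise the data differently. Here ratings_wide is a hypothetical matrix for one emotion scale, with tracks as rows and participants as columns:

```r
library(psych)
# Columns are treated as "items" (here: participants), rows as observations (tracks)
a <- psych::alpha(ratings_wide)
round(a$total$raw_alpha, 3)
```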

Checking the data (3): Distributions

We also want to look at the distributions of the collected data in order to learn whether certain operations (transformations, or resorting to non-parametric statistics) are needed in the subsequent analyses (visualise.R). This step also displays correlations between the emotion scales, which is a useful way to learn about the overlap of the concepts used in the tasks.

Checking the data (3): Distributions

source('scr/visualise.R')             # Visualise few aspects of the data

Figures 1–4: distribution and correlation plots produced by visualise.R (captions missing in the source).

Checking the data (4): Look at the distributions manually

Let’s do some basic plotting to look at the distributions.

hist(df$Age,col='yellow')
boxplot(Age ~ Gender, data=df,col='pink')

(a) Age

(b) Age by Gender

Figure 5: Age histogram and distribution across gender.
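For comparison, the same two plots could be drawn with ggplot2, which load_libraries.R has already loaded (a sketch):

```r
library(ggplot2)
ggplot(df, aes(x = Age)) +
  geom_histogram(fill = 'yellow', colour = 'black')
ggplot(df, aes(x = Gender, y = Age)) +
  geom_boxplot(fill = 'pink')
```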

Conclusion

Conclusion

  • We have now mastered the data carpentry and descriptives
    • reading Excel or CSV data into R
    • the data was labelled badly in Qualtrics and contained incomplete responses, which we fixed
    • we converted from wide format to long format
    • we have explicit coding of factors and clear variable names
    • all of these operations are saved in scripts and can be replicated; any analysis starts by running these preprocessing scripts on the raw data
    • we never manually touch the raw data

Next

The actual inferential analysis (Linear Mixed Models) is the next set of operations needed in this project.

Thanks to Annaliese Micallef-Grimaud for sharing this data. This is Experiment 2 of the published study (Micallef Grimaud & Eerola, 2022).