We take a raw dataset collected via Qualtrics by Annaliese Micallef Grimaud, who set out to validate her EmoteControl system. She collected ratings of multiple expressed emotions for various musical excerpts that she had created and used in an expression production paradigm. This data is Experiment 2 of the study (Micallef Grimaud & Eerola, 2022).
## INITIALISE: SET PATH, CLEAR MEMORY & LOAD LIBRARIES
rm(list=ls(all=TRUE))            # Cleans the R memory, just in case
source('scr/load_libraries.R')   # Loads the necessary R libraries
The following libraries were loaded: ggplot2 psych dplyr reshape2 stringr tidyr lme4 lmerTest emmeans knitr pbkrtest
Question: What are packages? (1)
Why does R use separate packages (installed as libraries on your system)?
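A package is a bundle of functions, documentation, and sometimes data that extends base R; it is installed once per machine and then loaded in each session. A minimal sketch:

```r
# Packages are installed once, then loaded whenever they are needed:
# install.packages('ggplot2')   # download and install (run once)
library(ggplot2)                # load the package for the current session
```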
This is the only line that matters in read_data_survey.R:
v <- read.csv('data/Emotion_Identification_N119_noheader.tsv', header = TRUE, sep = "\t")
note how the folder and filename are specified. Adopt a good file-naming convention for your files: no spaces, and clearly label the status of the data (date or N).
header indicates that the first row contains the variable names
sep gives the separator (if not specified, read.csv assumes comma separation)
DANGER: file-encoding can be a problem (UTF-8 is safe)
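If the encoding does cause trouble, it can be declared explicitly when reading; a sketch using the same file as above:

```r
# Reading the same TSV file, but declaring the file encoding explicitly
v <- read.csv('data/Emotion_Identification_N119_noheader.tsv',
              header = TRUE, sep = "\t", fileEncoding = "UTF-8")
```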
Question: What if the data is in Excel?
If you have the data in an Excel file, you can either
* convert it into CSV or TSV, or
* use the readxl library (this comes with tidyverse) and then run:
library(readxl)
v <- read_excel('my_data.xlsx')
read_excel has a lot of options (read specific sheets etc.)
DANGER: Excel files have known deficiencies (date conventions, support for a limited number of columns, version differences, and it is a proprietary format)
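For instance, a particular sheet can be read by name or by position; a sketch where the workbook and the sheet name 'Responses' are hypothetical:

```r
library(readxl)
# Read one sheet of a hypothetical workbook, either by name or by index
v <- read_excel('my_data.xlsx', sheet = 'Responses')   # or sheet = 2
```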
Question: What is the structure of the data at this point?
What is the size of the data?
What variables do we have?
What types of variables do we have?
What do we infer from the column names?
Useful commands: dim, head, str, is.na, View.
head(v)
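The other suggested commands give a quick overview; for example:

```r
dim(v)           # number of rows and columns
str(v)           # names and types of the variables
sum(is.na(v))    # total number of missing values
```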
Data munging
In the next step this raw data will be munged, that is, pre-processed in several ways. Pre-processing can have multiple steps; here these have been broken into two:
The first operation carries out a long list of renamings of the variables (the columns of the data, rename_variables.R). This can be avoided if the data already has these names, and it is quite useful to embed meaningful variable names in the data collection itself (coding them into the experiment/survey).
source('munge/rename_variables.R') # Renames the columns of v
What happens in this munging?
Can you explain the main changes from the raw data?
Here’s an extract of rename_variables.R
#### 1. Rename variable headers ---------------------------------------------------
colnames(v)[colnames(v)=="Duration..in.seconds."] <- "Time"
colnames(v)[colnames(v)=="Q3"] <- "Age"
colnames(v)[colnames(v)=="Q4"] <- "Gender"
# ....
# Note:
# Track rating renamed, where OG = original track and PT = participants' track
# Middle number is number of track, and emotion names at end are the different emotion rating scales
colnames(v)[colnames(v)=="Q14_1"] <- "OG_01_SADNESS"
colnames(v)[colnames(v)=="Q14_2"] <- "OG_01_CALMNESS"
colnames(v)[colnames(v)=="Q14_3"] <- "OG_01_JOY"
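As an aside, the same renaming could be written more compactly with dplyr::rename. This is only a sketch (not the course script), assuming the raw column names shown above:

```r
library(dplyr)
v <- rename(v,
            Time   = Duration..in.seconds.,
            Age    = Q3,
            Gender = Q4)
```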
Recode instruments (1): Trim unnecessary variables and tidy up
source('munge/recode_instruments.R') # Produces df (long-form) from v
v$ID <- c(1:length(v$Age))    # Status = dataframe length
v$PID <- paste("S", sprintf("%03d", v$ID), sep="")
v$PID <- factor(v$PID)
ind <- colnames(v) != 'ID'
v <- v[, ind]                 ## Delete ID and just retain PID
head(v)
3. Recode instruments: Turn categorical vars into labelled factors
This is done to increase clarity, and it will help future analyses as well.
Here we first create a column counting the NAs (missing values) for each participant, and then apply a completion threshold of 95%: participants who completed at least 95% of the survey are kept.
v$NAS <- rowSums(is.na(v[, 11:108]))
NAS <- 100 - (v$NAS/nrow(v))*100
threshold <- 95
good_ones <- NAS >= threshold
v <- v[good_ones, ]
v <- v[, 1:110]               # drop NAS
print(paste('Trimmed data: N=', nrow(v)))
head(v)
5. Recode instruments: Convert into long-format
Pull out emotions and tracks from the data (convert to long-form) and collapse across all 14 tracks.
df <- pivot_longer(v, cols = 11:108)   # These are the columns with the ratings
df$Track <- df$name
df$Track <- gsub("[A-Z][A-Z]_", "", df$Track)            # strip the source prefix (e.g. "OG_")
df$Track <- gsub("_[A-Z]+$", "", df$Track)               # strip the emotion suffix (e.g. "_POWER")
df$Source <- gsub("_[0-9][0-9]_[A-Z]+$", "", df$name)    # take out the source (OG vs PT, i.e. original vs participant-generated)
df$Scale <- gsub("[A-Z][A-Z]_[0-9][0-9]_", "", df$name)  # take out the emotion scale
df$Track <- factor(df$Track, levels = c('01','02','03','04','05','06','07'),
                   labels = c('Sadness','Joy','Calmness','Anger','Fear','Power','Surprise'))
df$Source <- factor(df$Source, levels = c('OG','PT'), labels = c('Exp1','Exp2'))
colnames(df)[colnames(df)=='value'] <- 'Rating'
df$Rating <- dplyr::recode(df$Rating, `1`=1L, `3`=2L, `4`=3L, `5`=4L, `6`=5L)  # map raw response codes 1, 3, 4, 5, 6 onto 1-5
df$Scale <- factor(df$Scale)
df$PreferredGenre <- factor(df$PreferredGenre)
6. Recode instruments: Drop unnecessary columns
Finally, we have a clean final data frame in long format.
Question: What is the structure of the data at this point?
Give a short explanation of what we now have in df.
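The same quick checks as before can be applied to the long-format data, e.g.:

```r
dim(df)     # many more rows than v: one rating per row
str(df)     # Track, Source, and Scale are now factors
head(df)
```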
Descriptives
Checking the data: Descriptives
After the munging, it is prudent to check various aspects of the data such as the N, age, and gender …
source('scr/demographics_info.R') # Reports N, Age and other details
[1] "N = 91"
[1] "Mean age 34.99"
[1] "SD age 15.86"
[1] "Youngest 18 years"
[1] "Oldest 71 years"
Male Female Other
23 67 1
NonMusician                 13
Music-Loving NonMusician    44
Amateur                     15
Serious Amateur Musician    11
Semi-Pro                     6
Pro                          2
Nonmusician Musician
57 34
Checking the data: Descriptives
Summaries are easily created with a few commands, such as mean, sd, or table:
mean(v$Age)
[1] 34.98901
round(mean(v$Age),2)
[1] 34.99
print(table(v$Gender)) # gender distribution
Male Female Other
23 67 1
Question: Can you describe….
… how many emotion scales there are in the data?
… how many tracks there are in the data?
… how many ratings per emotion scales and tracks there are in the data?
Tip: the table command works well here. You can also cross-tabulate multiple columns just by passing several of them, e.g. table(df$Source, df$Track).
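For example (a sketch using the df created above):

```r
table(df$Scale)               # ratings per emotion scale
table(df$Track)               # ratings per track
table(df$Source, df$Track)    # ratings per source and track
```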
Checking the data (2): Consistency
We can explore the consistency of the ratings across participants. This calculates Cronbach's \(\alpha\) reliability coefficient for internal consistency across participants for each concept.
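A minimal sketch of how such a coefficient could be computed with psych::alpha for one scale, treating participants as "items" rated across the 14 excerpts (this is an illustration, not the course script):

```r
library(psych)
library(tidyr)
sad  <- subset(df, Scale == 'SADNESS')                          # one emotion scale
wide <- pivot_wider(sad[, c('PID', 'Track', 'Source', 'Rating')],
                    names_from = PID, values_from = Rating)     # excerpts x participants
alpha_sad <- psych::alpha(wide[, -(1:2)], check.keys = FALSE)   # drop Track and Source columns
alpha_sad$total$raw_alpha
```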
We also want to look at the distributions of the collected data in order to learn whether certain operations (transformations, or resorting to non-parametric statistics) are needed in the subsequent analyses (visualise.R). This step also includes displaying correlations between the emotion scales, which is a useful way to learn about the overlap of the concepts used in the tasks.
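Correlations between the emotion scales can be obtained by spreading the scales into columns; again a sketch rather than visualise.R itself:

```r
ratings <- pivot_wider(df[, c('PID', 'Track', 'Source', 'Scale', 'Rating')],
                       names_from = Scale, values_from = Rating)
round(cor(ratings[, -(1:3)], use = 'pairwise.complete.obs'), 2)  # drop PID, Track, Source
```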
Checking the data (3): Distributions
source('scr/visualise.R') # Visualise few aspects of the data
[Figures 1-5: distributions of the ratings and correlations between the emotion scales, produced by visualise.R]
Checking the data (4): Look at the distributions manually
Let’s do some basic plotting to look at the distributions.
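A possible way to produce such a figure with ggplot2 (a sketch; the plotting choices here are assumptions, not the original code):

```r
library(ggplot2)
ggplot(v, aes(x = Age, fill = Gender)) +
  geom_histogram(binwidth = 5, colour = 'white') +   # age distribution in 5-year bins
  facet_wrap(~Gender) +                              # one panel per gender
  theme_minimal()
```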
Figure 6: Age histogram and distribution across gender.
Conclusion
We have now mastered the data carpentry and descriptives:
reading excel or CSV data into R
the data was labelled badly in Qualtrics and contained incomplete responses, which we fixed
we converted from wide-format to long-format
explicit coding of factors and clear variable names
All of these operations are saved in scripts, and can be **replicated**
Alternative strategies
This analysis relied on raw R scripts
You can utilise Rmarkdown to combine text and R code
outputs to HTML, PDF, or Word documents
great tool for reports and even for full papers
see example “report_customised.Rmd” in “R_template”
You can also use Quarto, a newer publishing system from Posit:
More output formats (these slides!)
Use both Python and R, and more advanced features
see “R_template_in_action.qmd” (this document)
Next
The actual inferential analysis (linear mixed models) is the next set of operations needed in this project.
Thanks to Annaliese Micallef Grimaud for sharing this data. This is Experiment 2 of the published study:
Micallef Grimaud, A. & Eerola, T. (2022). An Interactive Approach to Emotional Expression Through Musical Cues. Music & Science, 5, 1-23. https://doi.org/10.1177/20592043211061745