MSL meeting
February 28, 2023
This process assumes that you have
RStudio installed on your computer (it’s free)
Basic understanding of RStudio workspace, path, and environment (Chapter 6.5 in Scientific Musicology)
the materials downloaded from https://github.com/tuomaseerola/R_template
Tip 1: use the green “Code” button to “Download zip” file
Tip 2: extract this into a convenient location on your computer and open contents.R with RStudio
White slides refer to contents.R
Light blue slides are questions for you
These slides are available at: https://tuomaseerola.github.io/R_template/report_presentation.html
We take a raw dataset collected via Qualtrics by Annaliese Micallef Grimaud, who set out to validate her EmoteControl
system. She collected ratings of multiple expressed emotions for various musical excerpts that she created and used in an expression production paradigm. These data come from Experiment 2 of the study (Micallef Grimaud & Eerola, 2022).
## INITIALISE: SET PATH, CLEAR MEMORY & LOAD LIBRARIES
rm(list=ls(all=TRUE)) # Cleans the R memory, just in case
source('scr/load_libraries.R') # Loads the necessary R libraries
The following libraries were loaded: ggplot2 psych dplyr reshape2 stringr tidyr lme4 lmerTest emmeans knitr pbkrtest
Why does R utilise separate packages (libraries on your system)?
There are 18,419 of these on CRAN
Popular ones are:
ggplot2, a grammar of graphics in R
dplyr, a grammar of data manipulation
tidyr, tools to create tidy data
foreign, reads data stored by Minitab, S, SAS, and SPSS
To check what you might already have in your R, type installed.packages(). If you don’t have a library installed, just type install.packages() with the package name. After installation, you can load the library for your project with library().
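For example, using ggplot2 (any package name works the same way; these are standard base R commands):

installed.packages()          # list the packages already on your system
install.packages("ggplot2")   # fetch a missing package from CRAN
library(ggplot2)              # load it for the current session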
All these operations are sometimes called data carpentry.
N x Variables: 119 131
StartDate EndDate Status IPAddress Progress
1 2019-05-10 10:44:03 2019-05-10 10:54:31 0 151.49.102.8 100
2 2019-05-10 10:42:42 2019-05-10 11:27:42 0 129.234.21.185 100
3 2019-05-10 11:48:53 2019-05-10 12:04:15 0 86.142.18.72 100
Duration..in.seconds. Finished RecordedDate ResponseId
1 628 1 2019-05-10 10:54:31 R_2zBKHnkPpp1VJNQ
2 2700 1 2019-05-10 11:27:44 R_1NERfgwV6F9lmUM
3 922 1 2019-05-10 12:04:16 R_3Get29azuhuBYBo
RecipientLastName RecipientFirstName RecipientEmail ExternalReference
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
LocationLatitude LocationLongitude DistributionChannel UserLanguage Q1 Q3 Q4
1 45.43710 12.332703 anonymous EN-GB 1 26 2
2 54.76840 -1.563095 anonymous EN-GB 1 22 1
3 54.00011 -1.535004 anonymous EN-GB 1 55 2
Q5 Q6 Q7 Q8 Q9 Q10 Q12_1 Q12_2 Q12_3 Q12_4
1 107 4 Classical 35 Piano 13 16 15 15 15
2 185 1 Indie Pop 36 19 16 15 15
3 185 6 Classical 35 Piano, flute, voice, violin 50 19 17 15 15
Q12_5 Q12_6 Q12_7 Q14_1 Q14_2 Q14_3 Q14_4 Q14_5 Q14_6 Q14_7 Q15_1 Q15_2 Q15_3
1 15 15 15 6 4 1 1 3 1 1 1 3 6
2 15 15 15 6 1 1 1 3 1 1 1 5 6
3 15 15 15 5 4 1 1 1 1 3 1 3 6
Q15_4 Q15_5 Q15_6 Q15_7 Q16_1 Q16_2 Q16_3 Q16_4 Q16_5 Q16_6 Q16_7 Q17_1 Q17_2
1 1 3 3 4 1 6 4 1 1 3 1 3 1
2 1 1 1 1 3 4 1 1 1 1 1 1 1
3 1 1 5 3 3 6 3 1 1 1 1 1 1
Q17_3 Q17_4 Q17_5 Q17_6 Q17_7 Q18_1 Q18_2 Q18_3 Q18_4 Q18_5 Q18_6 Q18_7 Q19_1
1 1 5 4 5 3 1 1 4 3 5 6 3 3
2 1 6 3 3 1 1 1 1 1 3 5 1 1
3 1 4 3 6 3 3 1 1 6 4 5 4 1
Q19_2 Q19_3 Q19_4 Q19_5 Q19_6 Q19_7 Q20_1 Q20_2 Q20_3 Q20_4 Q20_5 Q20_6 Q20_7
1 1 1 4 4 4 1 4 3 4 3 3 4 4
2 1 3 1 1 4 3 4 5 1 1 1 1 4
3 1 3 4 1 5 1 3 3 3 4 3 6 5
Q21_1 Q21_2 Q21_3 Q21_4 Q21_5 Q21_6 Q21_7 Q22_1 Q22_2 Q22_3 Q22_4 Q22_5 Q22_6
1 5 3 1 1 3 5 1 3 3 5 1 1 4
2 1 3 3 1 1 3 1 1 3 6 1 1 1
3 6 5 1 1 1 4 1 1 3 5 3 1 5
Q22_7 Q23_1 Q23_2 Q23_3 Q23_4 Q23_5 Q23_6 Q23_7 Q24_1 Q24_2 Q24_3 Q24_4 Q24_5
1 4 4 6 5 1 1 6 1 3 1 3 3 4
2 3 1 5 1 1 1 3 1 1 1 1 3 1
3 3 1 5 4 1 1 4 1 4 3 1 3 5
Q24_6 Q24_7 Q25_1 Q25_2 Q25_3 Q25_4 Q25_5 Q25_6 Q25_7 Q26_1 Q26_2 Q26_3 Q26_4
1 4 1 3 3 5 4 5 5 4 1 3 6 1
2 6 1 1 1 1 1 6 1 1 1 1 1 3
3 4 4 4 4 3 1 6 4 3 1 4 5 1
Q26_5 Q26_6 Q26_7 Q27_1 Q27_2 Q27_3 Q27_4 Q27_5 Q27_6 Q27_7
1 1 4 3 3 1 6 1 3 5 4
2 1 4 1 1 1 6 1 1 1 3
3 1 4 1 3 1 1 1 1 5 6
This is the only line that matters in read_data_survey.R: it specifies the folder and the filename. Adopt a good file-naming convention for your files: no spaces, and clearly label the status of the data (date or N). The header argument tells R that the first row of the file contains the variable names.
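A minimal sketch of such a line (the folder and filename here are illustrative assumptions, not the template’s actual values):

v <- read.csv('data/main_data_2019-05_N119.csv', header = TRUE)  # folder + filename + header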
If you have the data in an Excel file, you can either convert it into CSV or TSV, or utilise the readxl library and then run read_excel, which has a lot of options (reading specific sheets, etc.). A sketch follows.
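A minimal sketch with readxl (the filename is again an illustrative assumption):

library(readxl)                          # install.packages("readxl") if missing
v <- read_excel('data/main_data.xlsx')   # reads the first sheet by default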
What is the size of the data?
What variables do we have?
What types of variables do we have?
What do we infer from the column names?
Useful commands: dim, head, str, is.na, View.
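A quick sketch applying these commands to the raw data frame v:

dim(v)          # rows (participants) by columns (variables)
head(v)         # first six rows
str(v)          # structure: variable names and types
sum(is.na(v))   # total number of missing values
View(v)         # spreadsheet-style viewer in RStudio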
In the next step this raw data will be munged, that is, pre-processed in several ways. Pre-processing can have multiple steps; here they have been broken into two:
Renaming the variables (rename_variables.R): this can be avoided if the data has these names already, and it is quite useful to try to embed meaningful variable names into the data collection itself (coding them into the experiment/survey).
Recoding the instruments (recoding_instruments.R), covered further below.
Here’s an extract of rename_variables.R:
#### 1. Rename variable headers ---------------------------------------------------
colnames(v)[colnames(v)=="Duration..in.seconds."]<-"Time"
colnames(v)[colnames(v)=="Q3"]<-"Age"
colnames(v)[colnames(v)=="Q4"]<-"Gender"
# ....
# Note:
# Track rating renamed, where OG = original track and PT = participants' track
# Middle number is number of track, and emotion names at end are the different emotion rating scales
colnames(v)[colnames(v)=="Q14_1"]<-"OG_01_SADNESS"
colnames(v)[colnames(v)=="Q14_2"]<-"OG_01_CALMNESS"
colnames(v)[colnames(v)=="Q14_3"]<-"OG_01_JOY"
[1] "Raw data: N= 119 and 108 variables"
[1] "Trimmed data: N= 91"
So plenty of things happen in recoding_instruments.R. Let’s look inside the script.
The script will drop all columns mentioned in the select command; they are prefixed with a minus sign (-), which means dropping them.
# eliminating unnecessary columns
v <- dplyr::select(v,-StartDate,-EndDate,-Status,-IPAddress,-Progress,-RecordedDate,-ResponseId,-RecipientLastName,-RecipientFirstName,-RecipientEmail,-ExternalReference,-LocationLatitude,-LocationLongitude,-DistributionChannel,-UserLanguage,-Q1,-Q12_1, -Q12_2, -Q12_3, -Q12_4, -Q12_5, -Q12_6, -Q12_7)
For convenience, add participant IDs.
This is done to increase clarity, and it will help future analyses as well.
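A minimal sketch of one way to do this (the actual script may differ):

v$ID <- factor(seq_len(nrow(v)))  # one unique ID per participant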
v$Gender <- factor(v$Gender,levels=c(1,2,3),labels = c('Male','Female','Other'))
v$MusicalExpertise <- factor(v$MusicalExpertise,levels = c(1,2,3,4,5,6),
labels = c("NonMusician","Music-Loving NonMusician",
"Amateur","Serious Amateur Musician","Semi-Pro","Pro"))
v$MusicalExpertiseBinary<-factor(v$MusicalExpertise,
levels = levels(v$MusicalExpertise),
labels=c('Nonmusician','Nonmusician','Musician','Musician','Musician','Musician'))
Here we first created a variable identifying the NAs (missing values) for each participant, and then applied a threshold of 95% completion: if participants completed more than 95% of the survey, we keep them.
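A hedged sketch of that filter (the variable name completion is an assumption, not the template’s exact code):

v$completion <- rowMeans(!is.na(v))        # proportion of items answered per participant
v <- dplyr::filter(v, completion > 0.95)   # keep participants above 95% completion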
Pull out emotions and tracks from the data (convert to long-form) and collapse across all 14 tracks.
df <- pivot_longer(v,cols = 11:108) # These are the columns with ratings
df$Track<-df$name
df$Track <- gsub("[A-Z][A-Z]_", "", df$Track) # strip the two-letter source prefix (e.g. "OG_")
df$Track <- gsub("_[A-Z]+$", "", df$Track) # strip the trailing scale name (e.g. "_POWER")
df$Source <- gsub("_[0-9][0-9]_[A-Z]+$", "", df$name) # extract source (OG vs PT, i.e. original vs participant-generated)
df$Scale <- gsub("[A-Z][A-Z]_[0-9][0-9]_", "", df$name) # extract the emotion scale name
df$Track<-factor(df$Track,levels = c('01','02','03','04','05','06','07'),labels = c('Sadness','Joy','Calmness','Anger','Fear','Power','Surprise'))
df$Source<-factor(df$Source,levels = c('OG','PT'),labels = c('Exp1','Exp2'))
colnames(df)[colnames(df)=='value']<-'Rating'
df$Rating <- dplyr::recode(df$Rating, `1` = 1L, `3` = 2L, `4` = 3L, `5` = 4L, `6` = 5L) # remap the original response codes 1, 3, 4, 5, 6 onto a 1-5 scale
df$Scale<-factor(df$Scale)
df$PreferredGenre<-factor(df$PreferredGenre)
Finally we have a clean final data frame (df) in long format.
After the munging, it is prudent to check various aspects of the data such as the N, age, and gender …
[1] "N = 91"
[1] "Mean age 34.99"
[1] "SD age 15.86"
[1] "Youngest 18 years"
[1] "Oldest 71 years"
Male Female Other
23 67 1
NonMusician Music-Loving NonMusician Amateur
13 44 15
Serious Amateur Musician Semi-Pro Pro
11 6 2
Nonmusician Musician
57 34
Summaries are easily created with a few commands such as mean, sd, or table.
Tip: the table command works well here. You can also combine multiple columns into a table just by passing several of them, e.g. table(df$Source, df$Track).
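A short sketch of such summaries (using the renamed demographic columns):

mean(v$Age)                  # mean age
sd(v$Age)                    # standard deviation of age
table(v$Gender)              # counts per gender
table(df$Source, df$Track)   # cross-tabulation of two columns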
We can explore the consistency of the ratings across people. This calculates Cronbach’s \(\alpha\) reliability coefficient for internal consistency across participants for each concept.
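One hedged way to compute this for a single concept (not necessarily the template’s exact code; it assumes the participant ID column created earlier):

sad <- dplyr::filter(df, Scale == "SADNESS")                          # one concept at a time
m <- reshape2::acast(sad, Source + Track ~ ID, value.var = "Rating")  # tracks x participants
psych::alpha(m)$total$raw_alpha                                       # Cronbach's alpha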
[1] "Fastest response 7.17 mins"
[1] "Slowest response 8291.48 mins"
[1] "Median response 14.9 mins"
Table: Inter-reliability ratings (Cronbach alphas)
| SADNESS| CALMNESS| JOY| ANGER| FEAR| POWER| SURPRISE|
|-------:|--------:|-----:|-----:|----:|-----:|--------:|
| 0.995| 0.994| 0.995| 0.99| 0.99| 0.962| 0.978|
We also want to look at the distributions of the collected data in order to learn whether one needs to use certain operations (transformations or resorting to non-parametric statistics) in the subsequent analyses (visualise.R). This step will also include displaying correlations between the emotion scales, which is a useful operation for learning about the overlap of the concepts used in the tasks.
Let’s do some basic plotting to look at the distributions.
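A minimal ggplot2 sketch of the rating distributions (illustrative; the actual code in visualise.R may differ):

library(ggplot2)
ggplot(df, aes(x = Rating)) +
  geom_histogram(bins = 5) +   # ratings run from 1 to 5 after recoding
  facet_wrap(~Scale) +         # one panel per emotion scale
  theme_bw()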
The actual inferential analysis (linear mixed models) is the next set of operations needed in this project.
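A hedged sketch of what such a model could look like with lme4 and lmerTest (the specification here is an assumption, not the study’s actual model):

library(lme4)
library(lmerTest)                                            # adds p-values to lmer output
model <- lmer(Rating ~ Scale * Source + (1|ID), data = df)   # random intercept per participant
summary(model)                                               # fixed effects with Satterthwaite df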
Thanks to Annaliese Micallef Grimaud for sharing these data. This is Experiment 2 of the study published as Micallef Grimaud & Eerola (2022).