Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.
Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.
In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.
Walkthrough
Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.
This solution involves the following steps:
- Install and configure RStudio with Athena
- Log in to RStudio
- Install R packages
- Connect to Athena
- Create a dataset
- Create models
Install and configure RStudio with Athena
Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.
Launching this stack creates all required resources and prerequisites:
- Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
- Provisioning of the EC2 instance in an existing VPC and public subnet
- Installation of Java 8
- Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
- Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
- S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
- RStudio password (Note: username is rstudio)
- Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
- Amazon EC2 Systems Manager agent, which makes it easy to manage and patch the instance
All AWS resources are created in the us-east-1 Region. To avoid cross-region data transfer fees, keep your S3 data and the CloudFormation stack in the same Region. To check the availability of Athena in other regions, see Region Table.
Log in to RStudio
The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.
- In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
- Log in to RStudio with the username rstudio and the password you provided during setup.
Install R packages
Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.
#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 hours 42 minutes
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 4 months and 4 days !!!
## H2O cluster name: H2O_started_from_R_rstudio_hjx881
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.30 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
install_aws_sdk()
}
load_sdk()
## NULL
Connect to Athena
Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir=Sys.getenv("ATHENABUCKET"),
aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
Verify the connection. The results returned depend on your specific Athena setup.
con
## <JDBCConnection>
dbListTables(con)
## [1] "gdelt" "wikistats" "elb_logs_raw_native"
## [4] "twitter" "twitter2" "usermovieratings"
## [7] "eventcodes" "events" "billboard"
## [10] "billboardtop10" "elb_logs" "gdelthist"
## [13] "gdeltmaster" "twitter" "twitter3"
Create a dataset
For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data from the Million Song Dataset. Upload this dataset into your own S3 bucket. The following table describes the fields used in this dataset.

| Field | Description |
| --- | --- |
| year | Year that the song was released |
| songtitle | Title of the song |
| artistname | Name of the song artist |
| songid | Unique identifier for the song |
| artistid | Unique identifier for the song artist |
| timesignature | Variable estimating the time signature of the song |
| timesignature_confidence | Confidence in the estimate for timesignature |
| loudness | Continuous variable indicating the average amplitude of the audio in decibels |
| tempo | Variable indicating the estimated beats per minute of the song |
| tempo_confidence | Confidence in the estimate for tempo |
| key | Variable with twelve levels indicating the estimated key of the song (C, C#, ..., B) |
| key_confidence | Confidence in the estimate for key |
| energy | Variable representing the overall acoustic energy of the song, using a mix of features such as loudness |
| pitch | Continuous variable indicating the pitch of the song |
| timbre_0_min through timbre_11_min | Variables indicating the minimum values over all segments for each of the twelve values in the timbre vector |
| timbre_0_max through timbre_11_max | Variables indicating the maximum values over all segments for each of the twelve values in the timbre vector |
| top10 | Indicator for whether the song made it to the Top 10 of the Billboard charts (1 if it was in the Top 10, and 0 if not) |
Create an Athena table based on the dataset
In the Athena console, select the default database, sampledb, or create a new database.
Run the following create table statement.
create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;
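If you prefer to stay in RStudio rather than switch to the Athena console, you can submit the same DDL over the existing JDBC connection. This is a sketch only; it assumes the Athena JDBC driver accepts DDL statements sent through RJDBC's dbSendUpdate, and that ddl holds the full create table statement above as a single string.
#Optional sketch: run the DDL from RStudio instead of the Athena console
#(assumes `ddl` contains the full create table statement shown above)
ddl <- "create external table if not exists billboard ( ... )" #paste the full statement here
dbSendUpdate(con, ddl)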
Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.
dbGetQuery(con, "show create table sampledb.billboard")
## createtab_stmt
## 1 CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2 `year` int,
## 3 `songtitle` string,
## 4 `artistname` string,
## 5 `songid` string,
## 6 `artistid` string,
## 7 `timesignature` int,
## 8 `timesignature_confidence` double,
## 9 `loudness` double,
## 10 `tempo` double,
## 11 `tempo_confidence` double,
## 12 `key` int,
## 13 `key_confidence` double,
## 14 `energy` double,
## 15 `pitch` double,
## 16 `timbre_0_min` double,
## 17 `timbre_0_max` double,
## 18 `timbre_1_min` double,
## 19 `timbre_1_max` double,
## 20 `timbre_2_min` double,
## 21 `timbre_2_max` double,
## 22 `timbre_3_min` double,
## 23 `timbre_3_max` double,
## 24 `timbre_4_min` double,
## 25 `timbre_4_max` double,
## 26 `timbre_5_min` double,
## 27 `timbre_5_max` double,
## 28 `timbre_6_min` double,
## 29 `timbre_6_max` double,
## 30 `timbre_7_min` double,
## 31 `timbre_7_max` double,
## 32 `timbre_8_min` double,
## 33 `timbre_8_max` double,
## 34 `timbre_9_min` double,
## 35 `timbre_9_max` double,
## 36 `timbre_10_min` double,
## 37 `timbre_10_max` double,
## 38 `timbre_11_min` double,
## 39 `timbre_11_max` double,
## 40 `top10` int)
## 41 ROW FORMAT DELIMITED
## 42 FIELDS TERMINATED BY ','
## 43 STORED AS INPUTFORMAT
## 44 'org.apache.hadoop.mapred.TextInputFormat'
## 45 OUTPUTFORMAT
## 46 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47 LOCATION
## 48 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49 TBLPROPERTIES (
## 50 'transient_lastDdlTime'='1505484133')
Run a sample query
Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.
dbGetQuery(con, " SELECT songtitle,artistname,top10 FROM sampledb.billboard WHERE lower(artistname) = 'janet jackson' AND top10 = 1")
## songtitle artistname top10
## 1 Runaway Janet Jackson 1
## 2 Because Of Love Janet Jackson 1
## 3 Again Janet Jackson 1
## 4 If Janet Jackson 1
## 5 Love Will Never Do (Without You) Janet Jackson 1
## 6 Black Cat Janet Jackson 1
## 7 Come Back To Me Janet Jackson 1
## 8 Alright Janet Jackson 1
## 9 Escapade Janet Jackson 1
## 10 Rhythm Nation Janet Jackson 1
Determine how many songs in this dataset are specifically from the year 2010.
dbGetQuery(con, " SELECT count(*) FROM sampledb.billboard WHERE year = 2010")
## _col0
## 1 373
The sample dataset provides certain song properties of interest that can be analyzed to gauge their impact on the song’s overall popularity. Look at one such property, timesignature, and determine the value that is most frequent among songs in the dataset. The time signature measures the number of beats in each bar and the note value that receives one beat.
Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.
#t <- dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note: Running the preceding query results in the following error:
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested
#fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try
#again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
## 0 1 3 4 5 7
## 10 143 503 6787 112 19
nrow(dftimesignature)
## [1] 7574
From the results, observe that 6787 songs have a timesignature of 4.
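If you would rather see the same distribution as proportions, prop.table is a convenient follow-up; a minimal sketch using the data frame already fetched:
#Share of songs per time signature; 6787/7574 (about 0.896) have a value of 4
prop.table(table(dftimesignature))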
Next, determine the song with the highest tempo.
dbGetQuery(con, " SELECT songtitle,artistname,tempo FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
## songtitle artistname tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307
Create the training dataset
Your model needs to be trained so that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first. This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC fetch-size issue pointed out earlier, so this query again uses dbSendQuery with a reduced fetch size.
#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
#Unable to retrieve JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.
r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
## year songtitle artistname timesignature
## 1 2009 The Awkward Goodbye Athlete 3
## 2 2009 Rubik's Cube Athlete 3
## timesignature_confidence loudness tempo tempo_confidence
## 1 0.732 -6.320 89.614 0.652
## 2 0.906 -9.541 117.742 0.542
nrow(BillboardTrain)
## [1] 7201
Create the test dataset
BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
## year songtitle artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember 11
## 2 2010 Sticks & Bricks A Day to Remember 10
## key_confidence energy pitch timbre_0_min
## 1 0.453 0.9666556 0.024 0.002
## 2 0.469 0.9847095 0.025 0.000
nrow(BillboardTest)
## [1] 373
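As a quick sanity check, you can confirm that the two partitions cover the full table; a sketch using the row counts from above:
#7201 training rows + 373 test rows should equal the 7574 rows observed earlier
nrow(BillboardTrain) + nrow(BillboardTest)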
Convert the training and test datasets into H2O dataframes
train.h2o <- as.h2o(BillboardTrain)
##
|
| | 0%
|
|=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
##
|
| | 0%
|
|=================================================================| 100%
Inspect the column names in your H2O dataframes.
colnames(train.h2o)
## [1] "year" "songtitle"
## [3] "artistname" "songid"
## [5] "artistid" "timesignature"
## [7] "timesignature_confidence" "loudness"
## [9] "tempo" "tempo_confidence"
## [11] "key" "key_confidence"
## [13] "energy" "pitch"
## [15] "timbre_0_min" "timbre_0_max"
## [17] "timbre_1_min" "timbre_1_max"
## [19] "timbre_2_min" "timbre_2_max"
## [21] "timbre_3_min" "timbre_3_max"
## [23] "timbre_4_min" "timbre_4_max"
## [25] "timbre_5_min" "timbre_5_max"
## [27] "timbre_6_min" "timbre_6_max"
## [29] "timbre_7_min" "timbre_7_max"
## [31] "timbre_8_min" "timbre_8_max"
## [33] "timbre_9_min" "timbre_9_max"
## [35] "timbre_10_min" "timbre_10_max"
## [37] "timbre_11_min" "timbre_11_max"
## [39] "top10"
Create models
You need to designate the dependent and independent variables prior to applying your modeling algorithms. Because you’re trying to predict the top10 field, it is your dependent variable, and the remaining fields are candidates for independent variables.
Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping the non-numeric variables; only the variables that describe the numerical attributes of the song are used in the logistic regression model. You won’t use these variables: “year”, “songtitle”, “artistname”, “songid”, or “artistid”.
y.dep <- 39
x.indep <- c(6:38)
x.indep
## [1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38
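If you prefer not to hard-code column positions, you can derive the same selections from the column names; a sketch, assuming the column order shown earlier (H2O's modeling functions also accept column names for x and y):
#Drop the non-numeric identifier columns by name rather than by index
drop.cols <- c("year", "songtitle", "artistname", "songid", "artistid")
x.indep.names <- setdiff(colnames(train.h2o), c(drop.cols, "top10"))
y.dep.name <- "top10"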
Create Model 1: All numeric variables
Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.
modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
| | 0%
|
|===== | 8%
|
|=================================================================| 100%
Measure the performance of Model 1, using H2O’s built-in performance function.
h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.09924684
## RMSE: 0.3150347
## LogLoss: 0.3220267
## Mean Per-Class Error: 0.2380168
## AUC: 0.8431394
## Gini: 0.6862787
## R^2: 0.254663
## Null Deviance: 326.0801
## Residual Deviance: 240.2319
## AIC: 308.2319
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 255 59 0.187898 =59/314
## 1 17 42 0.288136 =17/59
## Totals 272 101 0.203753 =76/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.192772 0.525000 100
## 2 max f2 0.124912 0.650510 155
## 3 max f0point5 0.416258 0.612903 23
## 4 max accuracy 0.416258 0.879357 23
## 5 max precision 0.813396 1.000000 0
## 6 max recall 0.037579 1.000000 282
## 7 max specificity 0.813396 1.000000 0
## 8 max absolute_mcc 0.416258 0.455251 23
## 9 max min_per_class_accuracy 0.161402 0.738854 125
## 10 max mean_per_class_accuracy 0.124912 0.765006 155
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.auc(h2o.performance(modelh1,test.h2o))
## [1] 0.8431394
The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)
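To see this visually, you can plot the ROC curve from the test-set performance object; a minimal sketch, assuming your H2O version supports the plot method for binomial metrics:
#Plot the true positive rate against the false positive rate for Model 1
perf1 <- h2o.performance(modelh1, newdata = test.h2o)
plot(perf1, type = "roc")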
Next, inspect the coefficients of the variables in the dataset.
dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
## names coefficients sign
## 1 timbre_0_max 1.290938663 NEG
## 2 loudness 1.262941934 POS
## 3 pitch 0.616995941 NEG
## 4 timbre_1_min 0.422323735 POS
## 5 timbre_6_min 0.349016024 NEG
## 6 energy 0.348092062 NEG
## 7 timbre_11_min 0.307331997 NEG
## 8 timbre_3_max 0.302225619 NEG
## 9 timbre_11_max 0.243632060 POS
## 10 timbre_4_min 0.224233951 POS
## 11 timbre_4_max 0.204134342 POS
## 12 timbre_5_min 0.199149324 NEG
## 13 timbre_0_min 0.195147119 POS
## 14 timesignature_confidence 0.179973904 POS
## 15 tempo_confidence 0.144242598 POS
## 16 timbre_10_max 0.137644568 POS
## 17 timbre_7_min 0.126995955 NEG
## 18 timbre_10_min 0.123851179 POS
## 19 timbre_7_max 0.100031481 NEG
## 20 timbre_2_min 0.096127636 NEG
## 21 key_confidence 0.083115820 POS
## 22 timbre_6_max 0.073712419 POS
## 23 timesignature 0.067241917 POS
## 24 timbre_8_min 0.061301881 POS
## 25 timbre_8_max 0.060041698 POS
## 26 key 0.056158445 POS
## 27 timbre_3_min 0.050825116 POS
## 28 timbre_9_max 0.033733561 POS
## 29 timbre_2_max 0.030939072 POS
## 30 timbre_9_min 0.020708113 POS
## 31 timbre_1_max 0.014228818 NEG
## 32 tempo 0.008199861 POS
## 33 timbre_5_max 0.004837870 POS
## 34 NA <NA>
Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.
You can make the following observations from the results:
- The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
- The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
- The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.
These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.
cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067
This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, a value between -1.0 and -0.5 or between 0.5 and 1.0 indicates strong correlation, while a value between -0.1 and 0.1 indicates weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models. Before doing so, you can scan for other strongly correlated pairs, as shown in the sketch below.
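Here is a sketch that lists any other strongly correlated predictor pairs, assuming columns 6 through 38 are the numeric attributes used in x.indep:
#Compute pairwise correlations across the numeric predictors and list
#pairs with absolute correlation above 0.5
cors <- cor(as.data.frame(train.h2o[, 6:38]))
high <- which(abs(cors) > 0.5 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = cors[high])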
You build two variations of the original model:
- Model 2, in which you keep “energy” and omit “loudness”
- Model 3, in which you keep “loudness” and omit “energy”
You compare these two models and choose the model with a better fit for this use case.
Create Model 2: Keep energy and omit loudness
colnames(train.h2o)
## [1] "year" "songtitle"
## [3] "artistname" "songid"
## [5] "artistid" "timesignature"
## [7] "timesignature_confidence" "loudness"
## [9] "tempo" "tempo_confidence"
## [11] "key" "key_confidence"
## [13] "energy" "pitch"
## [15] "timbre_0_min" "timbre_0_max"
## [17] "timbre_1_min" "timbre_1_max"
## [19] "timbre_2_min" "timbre_2_max"
## [21] "timbre_3_min" "timbre_3_max"
## [23] "timbre_4_min" "timbre_4_max"
## [25] "timbre_5_min" "timbre_5_max"
## [27] "timbre_6_min" "timbre_6_max"
## [29] "timbre_7_min" "timbre_7_max"
## [31] "timbre_8_min" "timbre_8_max"
## [33] "timbre_9_min" "timbre_9_max"
## [35] "timbre_10_min" "timbre_10_max"
## [37] "timbre_11_min" "timbre_11_max"
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
## [1] 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
| | 0%
|
|======= | 10%
|
|=================================================================| 100%
Measure the performance of Model 2.
h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.09922606
## RMSE: 0.3150017
## LogLoss: 0.3228213
## Mean Per-Class Error: 0.2490554
## AUC: 0.8431933
## Gini: 0.6863867
## R^2: 0.2548191
## Null Deviance: 326.0801
## Residual Deviance: 240.8247
## AIC: 306.8247
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 280 34 0.108280 =34/314
## 1 23 36 0.389831 =23/59
## Totals 303 70 0.152815 =57/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.254391 0.558140 69
## 2 max f2 0.113031 0.647208 157
## 3 max f0point5 0.413999 0.596026 22
## 4 max accuracy 0.446250 0.876676 18
## 5 max precision 0.811739 1.000000 0
## 6 max recall 0.037682 1.000000 283
## 7 max specificity 0.811739 1.000000 0
## 8 max absolute_mcc 0.254391 0.469060 69
## 9 max min_per_class_accuracy 0.141051 0.716561 131
## 10 max mean_per_class_accuracy 0.113031 0.761821 157
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
## names coefficients sign
## 1 pitch 0.700331511 NEG
## 2 timbre_1_min 0.510270513 POS
## 3 timbre_0_max 0.402059546 NEG
## 4 timbre_6_min 0.333316236 NEG
## 5 timbre_11_min 0.331647383 NEG
## 6 timbre_3_max 0.252425901 NEG
## 7 timbre_11_max 0.227500308 POS
## 8 timbre_4_max 0.210663865 POS
## 9 timbre_0_min 0.208516163 POS
## 10 timbre_5_min 0.202748055 NEG
## 11 timbre_4_min 0.197246582 POS
## 12 timbre_10_max 0.172729619 POS
## 13 tempo_confidence 0.167523934 POS
## 14 timesignature_confidence 0.167398830 POS
## 15 timbre_7_min 0.142450727 NEG
## 16 timbre_8_max 0.093377516 POS
## 17 timbre_10_min 0.090333426 POS
## 18 timesignature 0.085851625 POS
## 19 timbre_7_max 0.083948442 NEG
## 20 key_confidence 0.079657073 POS
## 21 timbre_6_max 0.076426046 POS
## 22 timbre_2_min 0.071957831 NEG
## 23 timbre_9_max 0.071393189 POS
## 24 timbre_8_min 0.070225578 POS
## 25 key 0.061394702 POS
## 26 timbre_3_min 0.048384697 POS
## 27 timbre_1_max 0.044721121 NEG
## 28 energy 0.039698433 POS
## 29 timbre_5_max 0.039469064 POS
## 30 timbre_2_max 0.018461133 POS
## 31 tempo 0.013279926 POS
## 32 timbre_9_min 0.005282143 NEG
## 33 NA <NA>
h2o.auc(h2o.performance(modelh2,test.h2o))
## [1] 0.8431933
You can make the following observations:
- The AUC metric is 0.8431933.
- Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This matches expectations.
- Because H2O orders variables by importance, and energy appears near the bottom of the list, energy is not significant in this model.
You can conclude that Model 2 is not ideal for this use case, as energy is not significant.
Create Model 3: Keep loudness but omit energy
colnames(train.h2o)
## [1] "year" "songtitle"
## [3] "artistname" "songid"
## [5] "artistid" "timesignature"
## [7] "timesignature_confidence" "loudness"
## [9] "tempo" "tempo_confidence"
## [11] "key" "key_confidence"
## [13] "energy" "pitch"
## [15] "timbre_0_min" "timbre_0_max"
## [17] "timbre_1_min" "timbre_1_max"
## [19] "timbre_2_min" "timbre_2_max"
## [21] "timbre_3_min" "timbre_3_max"
## [23] "timbre_4_min" "timbre_4_max"
## [25] "timbre_5_min" "timbre_5_max"
## [27] "timbre_6_min" "timbre_6_max"
## [29] "timbre_7_min" "timbre_7_max"
## [31] "timbre_8_min" "timbre_8_max"
## [33] "timbre_9_min" "timbre_9_max"
## [35] "timbre_10_min" "timbre_10_max"
## [37] "timbre_11_min" "timbre_11_max"
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
## [1] 6 7 8 9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
| | 0%
|
|======== | 12%
|
|=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
##
## MSE: 0.0978859
## RMSE: 0.3128672
## LogLoss: 0.3178367
## Mean Per-Class Error: 0.264925
## AUC: 0.8492389
## Gini: 0.6984778
## R^2: 0.2648836
## Null Deviance: 326.0801
## Residual Deviance: 237.1062
## AIC: 303.1062
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 286 28 0.089172 =28/314
## 1 26 33 0.440678 =26/59
## Totals 312 61 0.144772 =54/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.273799 0.550000 60
## 2 max f2 0.125503 0.663265 155
## 3 max f0point5 0.435479 0.628931 24
## 4 max accuracy 0.435479 0.882038 24
## 5 max precision 0.821606 1.000000 0
## 6 max recall 0.038328 1.000000 280
## 7 max specificity 0.821606 1.000000 0
## 8 max absolute_mcc 0.435479 0.471426 24
## 9 max min_per_class_accuracy 0.173693 0.745763 120
## 10 max mean_per_class_accuracy 0.125503 0.775073 155
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
## names coefficients sign
## 1 timbre_0_max 1.216621e+00 NEG
## 2 loudness 9.780973e-01 POS
## 3 pitch 7.249788e-01 NEG
## 4 timbre_1_min 3.891197e-01 POS
## 5 timbre_6_min 3.689193e-01 NEG
## 6 timbre_11_min 3.086673e-01 NEG
## 7 timbre_3_max 3.025593e-01 NEG
## 8 timbre_11_max 2.459081e-01 POS
## 9 timbre_4_min 2.379749e-01 POS
## 10 timbre_4_max 2.157627e-01 POS
## 11 timbre_0_min 1.859531e-01 POS
## 12 timbre_5_min 1.846128e-01 NEG
## 13 timesignature_confidence 1.729658e-01 POS
## 14 timbre_7_min 1.431871e-01 NEG
## 15 timbre_10_max 1.366703e-01 POS
## 16 timbre_10_min 1.215954e-01 POS
## 17 tempo_confidence 1.183698e-01 POS
## 18 timbre_2_min 1.019149e-01 NEG
## 19 key_confidence 9.109701e-02 POS
## 20 timbre_7_max 8.987908e-02 NEG
## 21 timbre_6_max 6.935132e-02 POS
## 22 timbre_8_max 6.878241e-02 POS
## 23 timesignature 6.120105e-02 POS
## 24 key 5.814805e-02 POS
## 25 timbre_8_min 5.759228e-02 POS
## 26 timbre_1_max 2.930285e-02 NEG
## 27 timbre_9_max 2.843755e-02 POS
## 28 timbre_3_min 2.380245e-02 POS
## 29 timbre_2_max 1.917035e-02 POS
## 30 timbre_5_max 1.715813e-02 POS
## 31 tempo 1.364418e-02 NEG
## 32 timbre_9_min 8.463143e-05 NEG
## 33 NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389
You can make the following observations:
- The AUC metric is 0.8492389.
- From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
- Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
- Loudness is significant in this model.
Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:
- Desired model accuracy at a given threshold
- Number of correct predictions for top10 hits
- Tolerable number of false positives or false negatives
Next, make predictions using Model 3 on the test dataset.
predict.regh <- h2o.predict(modelh3, test.h2o)
##
|
| | 0%
|
|=================================================================| 100%
print(predict.regh)
## predict p0 p1
## 1 0 0.9654739 0.034526052
## 2 0 0.9654748 0.034525236
## 3 0 0.9635547 0.036445318
## 4 0 0.9343579 0.065642149
## 5 0 0.9978334 0.002166601
## 6 0 0.9779949 0.022005078
##
## [373 rows x 3 columns]
predict.regh$predict
## predict
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
##
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
##
## 0 1
## 312 61
The first set of output results specifies the probabilities associated with each predicted observation. For example, observation 1 is 96.54739% likely not to be a Top 10 hit and 3.4526052% likely to be one (predict=1 indicates a Top 10 hit; predict=0 indicates not a Top 10 hit). The second set of results lists the actual predictions made. From the third set of results, this model predicts that 61 songs will be Top 10 hits.
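The earlier h2o warnings noted that you can apply your own cutoff to a probability column instead of relying on the F1-optimal threshold. A sketch with a hypothetical cutoff of 0.5, which trades away some hits for fewer false positives:
#Label a song as a hit only when its predicted probability exceeds 0.5
dpr$predict_at_50pct <- ifelse(dpr$p1 > 0.5, 1, 0)
table(dpr$predict_at_50pct)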
Compute the baseline accuracy by assuming that the baseline model predicts the most frequent outcome, which is that most songs are not Top 10 hits.
table(BillboardTest$top10)
##
## 0 1
## 314 59
Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.
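You can also compute Model 3’s accuracy directly from the predictions, as a check on the confusion matrix; a sketch, assuming the prediction rows align one-to-one with BillboardTest:
#Proportion of test songs whose predicted label matches the actual top10 value
#(the confusion matrix above implies 1 - 54/373, or about 0.8552)
mean(dpr$predict_top10 == BillboardTest$top10)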
It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?
View the two models from an investment perspective:
- A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
- How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it correctly predicts 33 Top 10 hits at the optimal threshold, which is more than half of the 59 songs that actually made the Top 10 that year.
- It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
- The baseline model is not useful, as it simply does not label any song as a hit.
Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.
GBM model
H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.
Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.
train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
##
|
| | 0%
|
|=== | 5%
|
|===== | 7%
|
|====== | 9%
|
|======= | 10%
|
|====================== | 33%
|
|===================================== | 56%
|
|==================================================== | 79%
|
|================================================================ | 98%
|
|=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
##
## MSE: 0.09860778
## RMSE: 0.3140188
## LogLoss: 0.3206876
## Mean Per-Class Error: 0.2120263
## AUC: 0.8630573
## Gini: 0.7261146
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 266 48 0.152866 =48/314
## 1 16 43 0.271186 =16/59
## Totals 282 91 0.171582 =64/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.189757 0.573333 90
## 2 max f2 0.130895 0.693717 145
## 3 max f0point5 0.327346 0.598802 26
## 4 max accuracy 0.442757 0.876676 14
## 5 max precision 0.802184 1.000000 0
## 6 max recall 0.049990 1.000000 284
## 7 max specificity 0.802184 1.000000 0
## 8 max absolute_mcc 0.169135 0.496486 104
## 9 max min_per_class_accuracy 0.169135 0.796610 104
## 10 max mean_per_class_accuracy 0.169135 0.805948 104
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573
This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.
As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.
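One way to structure that experimentation is H2O’s grid search. The following is a sketch, assuming h2o.grid is available in your H2O version; the specific max_depth and learn_rate values are illustrative only:
#Train one GBM per combination of tree depth and learning rate
gbm.grid <- h2o.grid("gbm",
                     y = y.dep, x = x.indep,
                     training_frame = train.h2o,
                     ntrees = 500, seed = 1122,
                     hyper_params = list(max_depth = c(3, 4, 5),
                                         learn_rate = c(0.01, 0.05)))
#Rank the candidate models by AUC
h2o.getGrid(gbm.grid@grid_id, sort_by = "auc", decreasing = TRUE)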
H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.
Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose. For models that require more computation, consider using accelerated computing instances such as the P2 instance type.
system.time(
dlearning.modelh <- h2o.deeplearning(y = y.dep,
x = x.indep,
training_frame = train.h2o,
epochs = 250,
hidden = c(250,250),
activation = "Rectifier",
seed = 1122,
distribution="multinomial"
)
)
##
|
| | 0%
|
|=== | 4%
|
|===== | 8%
|
|======== | 12%
|
|========== | 16%
|
|============= | 20%
|
|================ | 24%
|
|================== | 28%
|
|===================== | 32%
|
|======================= | 36%
|
|========================== | 40%
|
|============================= | 44%
|
|=============================== | 48%
|
|================================== | 52%
|
|==================================== | 56%
|
|======================================= | 60%
|
|========================================== | 64%
|
|============================================ | 68%
|
|=============================================== | 72%
|
|================================================= | 76%
|
|==================================================== | 80%
|
|======================================================= | 84%
|
|========================================================= | 88%
|
|============================================================ | 92%
|
|============================================================== | 96%
|
|=================================================================| 100%
## user system elapsed
## 1.216 0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
##
## MSE: 0.1678359
## RMSE: 0.4096778
## LogLoss: 1.86509
## Mean Per-Class Error: 0.3433013
## AUC: 0.7568822
## Gini: 0.5137644
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 290 24 0.076433 =24/314
## 1 36 23 0.610169 =36/59
## Totals 326 47 0.160858 =60/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.826267 0.433962 46
## 2 max f2 0.000000 0.588235 239
## 3 max f0point5 0.999929 0.511811 16
## 4 max accuracy 0.999999 0.865952 10
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.000000 1.000000 326
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.999929 0.363219 16
## 9 max min_per_class_accuracy 0.000004 0.662420 145
## 10 max mean_per_class_accuracy 0.000000 0.685334 224
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822
The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyperparameters, such as the learning rate, the number of epochs, or the number of hidden layers.
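For example, here is a sketch of one such variation, with hypothetical values for the hidden layers and epochs; compare its AUC on the same test frame against the 0.7568822 above:
#A smaller network trained for fewer epochs
dlearning.modelh2 <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epochs = 50,
                                      hidden = c(100, 100),
                                      activation = "Rectifier",
                                      seed = 1122)
h2o.auc(h2o.performance(dlearning.modelh2, newdata = test.h2o))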
H2O’s built-in functions provide many key statistical measures that help evaluate model performance. Here are some of these key terms.

| Metric | Description |
| --- | --- |
| Sensitivity | Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall. |
| Specificity | Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate. |
| Threshold | Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives. |
| Precision | The fraction of retrieved instances that are relevant; for example, how many of the positively classified instances are true positives. |
| AUC | Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class. 0.90 – 1.0 = excellent (A); 0.80 – 0.90 = good (B); 0.70 – 0.80 = fair (C); 0.60 – 0.70 = poor (D); 0.50 – 0.60 = fail (F) |
Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

| Metric | Model 3 | GBM Model | Deep Learning Model |
| --- | --- | --- | --- |
| Accuracy (max) | 0.882038 (t=0.435479) | 0.876676 (t=0.442757) | 0.865952 (t=0.999999) |
| Precision (max) | 1.0 (t=0.821606) | 1.0 (t=0.802184) | 1.0 (t=1.0) |
| Recall (max) | 1.0 | 1.0 | 1.0 (t=0) |
| Specificity (max) | 1.0 | 1.0 | 1.0 (t=1) |
| Sensitivity (t=0.5) | 0.2033898 | 0.1355932 | 0.3898305 |
| AUC | 0.8492389 | 0.8630573 | 0.7568822 |

Note: ‘t’ denotes threshold.
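Many of these values can also be pulled programmatically from the performance objects rather than read off the printed output; a sketch using the accessor functions already seen (h2o.auc, h2o.sensitivity):
#Collect the three performance objects and extract comparable metrics
perf.list <- list(model3 = perfh3, gbm = perf.gbmh, deeplearning = perf.dl)
sapply(perf.list, h2o.auc)
sapply(perf.list, function(p) h2o.sensitivity(p, 0.5)[[1]])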
Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier. If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits. The AUC metric for the GBM model is also higher than that of Model 3.
Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.
Conclusion
In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.
If you have questions or suggestions, please comment below.
Additional Reading
Learn how to build and explore a simple GEOINT application using SparkR.
About the Authors
Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.
Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.