This tutorial teaches you how to create a
matrix factorization model
and train it on the Google Analytics 360 user session data in the public
GA360_test.ga_sessions_sample
table. You then use the matrix factorization model to generate content recommendations
for site users.
Using indirect customer preference information, like user session duration, to train the model is called training with implicit feedback. Matrix factorization models are trained using the Weighted-Alternating Least Squares algorithm when you use implicit feedback as training data.
Create a dataset
Create a BigQuery dataset to store your ML model.
Console
In the Cloud de Confiance console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create datasetOn the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
Leave the remaining default settings as they are, and click Create dataset.
bq
To create a new dataset, use the
bq mk
command
with the --location
flag. For a full list of possible parameters, see the
bq mk --dataset
command
reference.
Create a dataset named
bqml_tutorial
with the data location set toUS
and a description ofBigQuery ML tutorial dataset
:bq --location=US mk -d \ --description "BigQuery ML tutorial dataset." \ bqml_tutorial
Instead of using the
--dataset
flag, the command uses the-d
shortcut. If you omit-d
and--dataset
, the command defaults to creating a dataset.Confirm that the dataset was created:
bq ls
API
Call the datasets.insert
method with a defined dataset resource.
{ "datasetReference": { "datasetId": "bqml_tutorial" } }
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.
Prepare the sample data
Transform the data from the GA360_test.ga_sessions_sample
table into a better
structure for model training, and then write this data to a
BigQuery table. The following query calculates the session
duration for each user for each piece of content, which you can then use as
implicit feedback to infer the user's preference for that content.
Follow these steps to create the training data table:
In the Cloud de Confiance console, go to the BigQuery page.
Create the training data table. In the query editor, paste in the following query and click Run:
CREATE OR REPLACE TABLE `bqml_tutorial.analytics_session_data` AS WITH visitor_page_content AS ( SELECT fullVisitorID, ( SELECT MAX( IF( index = 10, value, NULL)) FROM UNNEST(hits.customDimensions) ) AS latestContentId, (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) - hits.time) AS session_duration FROM `cloud-training-demos.GA360_test.ga_sessions_sample`, UNNEST(hits) AS hits WHERE # only include hits on pages hits.type = 'PAGE' GROUP BY fullVisitorId, latestContentId, hits.time ) # aggregate web stats SELECT fullVisitorID AS visitorId, latestContentId AS contentId, SUM(session_duration) AS session_duration FROM visitor_page_content WHERE latestContentId IS NOT NULL GROUP BY fullVisitorID, latestContentId HAVING session_duration > 0 ORDER BY latestContentId;
View a subset of the training data. In the query editor, paste in the following query and click Run:
SELECT * FROM `bqml_tutorial.analytics_session_data` LIMIT 5;
The results should look similar to the following:
+---------------------+-----------+------------------+ | visitorId | contentId | session_duration | +---------------------+-----------+------------------+ | 7337153711992174438 | 100074831 | 44652 | +---------------------+-----------+------------------+ | 5190801220865459604 | 100170790 | 121420 | +---------------------+-----------+------------------+ | 2293633612703952721 | 100510126 | 47744 | +---------------------+-----------+------------------+ | 5874973374932455844 | 100510126 | 32109 | +---------------------+-----------+------------------+ | 1173698801255170595 | 100676857 | 10512 | +---------------------+-----------+------------------+
Create the model
Create a matrix factorization model and train it on the data in the
analytics_session_data
table. The model is trained to predict a confidence
rating for every visitorId
-contentId
pair. The confidence rating is created
with centering and scaling by the median session duration. Records where the
session duration is more than 3.33 times the median are filtered out
as outliers.
The following CREATE MODEL
statement uses these columns to generate
recommendations:
visitorId
—The visitor ID.contentId
—The content ID.rating
—The implicit rating from 0 to 1 calculated for each visitor-content pair, centered and scaled.
In the Cloud de Confiance console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.mf_implicit` OPTIONS ( MODEL_TYPE = 'matrix_factorization', FEEDBACK_TYPE = 'implicit', USER_COL = 'visitorId', ITEM_COL = 'contentId', RATING_COL = 'rating', L2_REG = 30, NUM_FACTORS = 15) AS SELECT visitorId, contentId, 0.3 * (1 + (session_duration - 57937) / 57937) AS rating FROM `bqml_tutorial.analytics_session_data` WHERE 0.3 * (1 + (session_duration - 57937) / 57937) < 1;
The query takes about 10 minutes to complete, after which the
mf_implicit
model appears in the Explorer pane. Because the query uses aCREATE MODEL
statement to create a model, you don't see query results.
Get training statistics
Optionally, you can view the model's training statistics in the Cloud de Confiance console.
A machine learning algorithm builds a model by creating many iterations of the model using different parameters, and then selecting the version of the model that minimizes loss. This process is called empirical risk minimization. The model's training statistics let you see the loss associated with each iteration of the model.
Follow these steps to view the model's training statistics:
In the Cloud de Confiance console, go to the BigQuery page.
In the left pane, click
Explorer:If you don't see the left pane, click
Expand left pane to open the pane.In the Explorer pane, expand your project and click Datasets.
Click the
bqml_tutorial
dataset. You can also use the search feature or filters to find the dataset.Click the Models tab.
Click the
mf_implicit
model and then click the Training tabIn the View as section, click Table. The results should look similar to the following:
+-----------+--------------------+--------------------+ | Iteration | Training Data Loss | Duration (seconds) | +-----------+--------------------+--------------------+ | 5 | 0.0027 | 47.27 | +-----------+--------------------+--------------------+ | 4 | 0.0028 | 39.60 | +-----------+--------------------+--------------------+ | 3 | 0.0032 | 55.57 | +-----------+--------------------+--------------------+ | ... | ... | ... | +-----------+--------------------+--------------------+
The Training Data Loss column represents the loss metric calculated after the model is trained. Because this is a matrix factorization model, this column shows the mean squared error.
Evaluate the model
Evaluate the performance of the model by using the ML.EVALUATE
function.
The ML.EVALUATE
function evaluates the predicted content ratings returned by
the model against the evaluation metrics calculated during training.
Follow these steps to evaluate the model:
In the Cloud de Confiance console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.mf_implicit`);
The results should look similar to the following:
+------------------------+-----------------------+---------------------------------------+---------------------+ | mean_average_precision | mean_squared_error | normalized_discounted_cumulative_gain | average_rank | +------------------------+-----------------------+---------------------------------------+---------------------+ | 0.4434341257478137 | 0.0013381759837648962 | 0.9433280547112802 | 0.24031636088594222 | +------------------------+-----------------------+---------------------------------------+---------------------+
For more information about the
ML.EVALUATE
function output, see Output.
Get the predicted ratings for a subset of visitor-content pairs
Use the ML.RECOMMEND
to get the predicted rating for each piece of content
for five site visitors.
Follow these steps to get predicted ratings:
In the Cloud de Confiance console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.RECOMMEND( MODEL `bqml_tutorial.mf_implicit`, ( SELECT visitorId FROM `bqml_tutorial.analytics_session_data` LIMIT 5 ));
The results should look similar to the following:
+-------------------------------+---------------------+-----------+ | predicted_rating_confidence | visitorId | contentId | +-------------------------------+---------------------+-----------+ | 0.0033608418060270262 | 7337153711992174438 | 277237933 | +-------------------------------+---------------------+-----------+ | 0.003602395397293956 | 7337153711992174438 | 158246147 | +-------------------------------+---------------------+-- -------+ | 0.0053197670652785356 | 7337153711992174438 | 299389988 | +-------------------------------+---------------------+-----------+ | ... | ... | ... | +-------------------------------+---------------------+-----------+
Generate recommendations
Use the predicted ratings to generate the top five recommended content IDs for each visitor ID.
Follow these steps to generate recommendations:
In the Cloud de Confiance console, go to the BigQuery page.
Write the predicted ratings to a table. In the query editor, paste in the following query and click Run:
CREATE OR REPLACE TABLE `bqml_tutorial.recommend_content` AS SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.mf_implicit`);
Select the top five results per visitor. In the query editor, paste in the following query and click Run:
SELECT visitorId, ARRAY_AGG( STRUCT(contentId, predicted_rating_confidence) ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec FROM `bqml_tutorial.recommend_content` GROUP BY visitorId;
The results should look similar to the following:
+---------------------+-----------------+---------------------------------+ | visitorId | rec:contentId | rec:predicted_rating_confidence | +---------------------+-----------------+------------------------- ------+ | 867526255058981688 | 299804319 | 0.88170525357178664 | | | 299935287 | 0.54699439944935124 | | | 299410466 | 0.53424780863188659 | | | 299826767 | 0.46949603950374219 | | | 299809748 | 0.3379991197434149 | +---------------------+-----------------+---------------------------------+ | 2434264018925667659 | 299824032 | 1.3903516407308065 | | | 299410466 | 0.9921995618196483 | | | 299903877 | 0.92333625294129218 | | | 299816215 | 0.91856701667757279 | | | 299852437 | 0.86973661454890561 | +---------------------+-----------------+---------------------------------+ | ... | ... | ... | +---------------------+-----------------+---------------------------------+