Connecting Databricks AWS to Churney Integration

Here you can read our guide for connecting your Databricks AWS data warehouse with Churney.

What kind of data is required?

The short answer is as much as possible. The long answer is that Churney requires data about:

Payments
Trials (if applicable)
User demographic (if available)
User activity

Additionally, we need to know the location (region) of your data warehouse.

Create hashed views

To create the views, we would need to create queries to hash the PII columns.

‍

For context, Facebook has a guide on how to hash contact information for their conversion api: https://developers.facebook.com/docs/marketing-api/conversions-api/parameters/customer-information-parameters. Google has a similar guide for enhanced conversions using their ads api https://developers.google.com/google-ads/api/docs/conversions/enhance-conversions. Basically, we want to hash columns that contain data which would allow one to determine the identity of a user: First name, last name, birth day, street address, phone number, email address etc.

‍

Example of creating hashed views

Create a dataset named churney .

If the raw data contains email, phone, birthday or other identifiers, the columns need to be excluded and hashed in the view.

Example, if the raw data lies under raw_data.sensitive and you would like to create a view churney.sensitive_data_view

‍

CREATE OR REPLACE <your_project>.churney.sensitive_data_view as ( 
select
SHA2(LOWER(TRIM(email)), 256) as email,
SHA2(LOWER(TRIM(name)), 256) as name,
SHA2(LTRIM('0', regexp_replace(phone, '[()-+-]', '')), 256) as phone,
SHA2(date_format(birthday, 'yyyyMMdd'), 256) as birthday,
* EXCEPT(email, name, phone, birthday)
from <your_project>.raw_data.sensitive
)

‍

If you want to test it how the script looks with some data:

‍

with sensitive as
(
 select 1 as id,
 "abc" as name,
 "abc@abc.com" as email,
 current_date() as birthday,
 '+4511111111' as phone
)
select
SHA2(LOWER(TRIM(email)), 256) as email,
SHA2(LOWER(TRIM(name)), 256) as name,
SHA2(LTRIM('0', regexp_replace(phone, '[()-+-]', '')), 256) as phone,
SHA2(date_format(birthday, 'yyyyMMdd'), 256) as birthday,
* EXCEPT(email, name, phone, birthday)
from sensitive

‍

Create a service principal for Churney

‍

Go to your account -> Settings, and then Identity and access -> Users and create the service principal:

‍

Then go to the service principal to generate a token as below. The maximum lifetime is 730 days, so please use that. Store the secret somewhere safe for now.

‍

Create a group for the churney service princial as below:

And add the service principal to the group:

Create an s3 bucket for Churney

Let's create an s3 bucket for Churney. Make sure it is in the same region as your Databricks and public access is blocked.

‍

In this example, we are creating a bucket named churney-databricks-unload.

‍

Create a storage credential

First, we need to create a role and assign policies for Churney in AWS.

Go in IAM -> Policies in your AWS account:

And create a policy for Churney to access the bucket we created previously

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<churney-s3-bucket>/*",
                "arn:aws:s3:::<churney-s3-bucket>"
            ]
        }
    ]
}

‍

Replace <churney-s3-bucket> with the name of the bucket you created above, in this example, with churney-databricks-unload

‍

Then go to Roles to create a new role for churney with Custom trust policy:

and put this in the JSON field:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "accounts.google.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:sub": [
                        "<id_1>",
                        "<id_2>"
                    ]
                }
            }
        }
    ]
}

‍

Replace <id_1> and <id_2> with the ids of the service accounts that Churney will give you.

Make sure you add the policy to the Churney role:

Go to your Databricks under Catalog -> Add a storage credential and create a storage credential, where you enter the arn of the churney role you added above. The output should look like below:

‍

Now, go back to the role in AWS and update the trust policy as below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "accounts.google.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:sub": [
                        "<id_1>",
                        "<id_2>"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<your-aws-account-id>:role/<this-role-name>"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "<external_id>"
                }
            }
        }


    ]
}

‍

where the <external_id> is in your storage credential details

‍

Go to the storage credential and grant the below permissions to yourself

And finally run this on your SQL Editor:

CREATE EXTERNAL LOCATION IF NOT EXISTS `churney_external`
URL '<s3_url>'
WITH (STORAGE CREDENTIAL `churney_storage_credential`);

where the url contains your s3 bucket. In this example, the sql will look like below:

CREATE EXTERNAL LOCATION IF NOT EXISTS `churney_external`
URL 's3://churney-databricks-unload/'
WITH (STORAGE CREDENTIAL `churney_storage_credential`);

‍

Then run:

GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION `churney_external` TO `Churney`;
GRANT READ FILES ON EXTERNAL LOCATION `churney_external` TO `Churney`;

‍

Grant permissions on schema

Let’s assume that you would like to share the views under churney_views with Churney.

Churney will create external tables pointing at the s3 bucket, so let’s create a schema for the churney_exports

CREATE SCHEMA aws.churney_exports;

Churney will maintain the exports here, so grant the permissions below:

For the churney_views schema, Churney only needs Data reader, so grant those permissions as below:

Grant permissions to the warehouse

Finally, go the your warehouse, and add the service principal as below:

What to share with Churney

‍

The token generated for the user (in a safe way) and the id of the service principal, which you can find it as below

The region of your Databricks (e.g. europe-west1)
The name of the export dataset create for Churney (in this example churney_exports)
The s3 bucket name (in this example churney-databricks-unload)
The arn of the role created for Churney (in this example arn:aws:iam::*******:role/churney-databricks-access-role)
The catalog name (in this example aws)
The connection details of the sql warehouse: server_hostname and http_path as below

‍

Connecting Databricks AWS to Churney