Connecting Athena to Churney

Suela Isaj, Jonas Dahlbæk | September 19, 2023 · 14 min

Here you can read our guide for connecting your Athena data warehouse with Churney.

How will Churney access the data?

We would need a role in AWS with the minimal permissions to access the data in Athena and export it to an S3 bucket. On a high level, we need access to:

  • Athena for reading and exporting the data 

  • AWS Glue for looking into table metadata, table partitions, and being able to get the tables (for the specific database the hashed view that is needed, and for the underlying table with the raw data)

  • S3 bucket for Churney to export the hashed data

  • S3 bucket containing the Athena data physically so we can read the data (limited to read access only through Athena)

The next steps explain the creation of the secure views for the data that you would like to share with Churney. If you would like Churney’s help to create the views of top of your tables and hash the private information, then follow Generating the hashed views, otherwise, jump to steps for Steps for creating the hashed views

Generating the hashed views

Go to the query editor in Athena and choose the context, the data source and the database where your raw data lies.

Run the following query where <your_schema_name_here> belongs to the schema with the tables you would like to share with Churney.

SELECT 
    t.table_catalog, 
    t.table_schema, 
    t.table_name, 
    c.column_name, 
    c.data_type
FROM information_schema.tables t 
INNER JOIN information_schema.columns c
ON t.table_catalog = c.table_catalog
    AND t.table_schema = c.table_schema
    AND t.table_name = c.table_name
WHERE t.table_schema = '<your_schema_name_here>'  
ORDER BY t.table_name;

Export the result by clicking on Download results as below,  and share the csv with ChurneyWe will come back to you with some create view scripts to automatically create the views and hash the data.

In the case of json fields, if there is private information in those fields like email, phone, last name, first name, etc, then we should skip the column from the view, and unpack the elements we need. The private information will also be included in the view for us, but we can modify the create view statement to unpack and hash the private field. In these cases, what we need from you is the name of the json column that has private information, and then the key that we should search for to unpack and hash. For the rest of the json fields that do not have any private user identifier, we can read them as they are.

Steps for creating the hashed views

For context, Facebook has a guide on how to hash contact information for their conversion api: https://developers.facebook.com/docs/marketing-api/conversions-api/parameters/customer-information-parameters. Google has a similar guide for enhanced conversions using their ads api https://developers.google.com/google-ads/api/docs/conversions/enhance-conversions. Basically, we want to hash columns that contain data which would allow one to determine the identity of a user: First name, last name, birth day, street address, phone number, email address, ip address.

Churney wants to have access only to hashed sensitive data, and not to the raw data of the private information of the users. We will go over limiting our permissions in the next section, and in this section, we will guide you to create views on top of the raw data, which hash the private information.

  1. Create a new database for Churney. In the snippet below, replace the <churney-database> with the name you would like to name this database. In this example, I will create a database named churney

CREATE DATABASE <churney-database>;

2. Create hashed views on top of the raw data. Let’s suppose that the raw data lies in a database called mydatabase and the table with the user information is called user_activity, which contains private information like first_name, last_name, email, etc. In the code snippet below, we will create a view that hashes the private information, and leaves the rest of the columns as raw. Note that if your table (e.g. event tables) do not contain any private information, your view can be just a select from the table.

CREATE OR REPLACE VIEW churney.user_activity AS 
SELECT
    id,
    LOWER(TO_HEX("sha256"("to_utf8"("trim"(first_name))))) first_name,
    LOWER(TO_HEX("sha256"("to_utf8"("trim"(last_name))))) last_name,
    LOWER(TO_HEX("sha256"("to_utf8"("trim"(email))))) email,
    amount_spent,
    activity_at
FROM
    mydatabase.user_activity;

After you are done with creating all the views that you would like to give access to Churney, do a simple test by running a select query to make sure everything you have hashed looks correctly, e.g.

select * from churney.user_activity

Steps for creating the Churney role


Create an S3 bucket for Churney

  1. Create an S3 bucket for Churney by going to this link https://s3.console.aws.amazon.com/s3/buckets and clicking on Create bucket

2. Choose a unique name for the bucket, and something self-explanatory, like churney-athena-unload. 

Be careful here to choose the region of this bucket the same as your Athena region. For example, if Athena is on us-east-1, then the region of the bucket should be us-east-1

3. Make sure the rest of the setting are the default ones and they look like below

And the s3 bucket doesn’t have public access

4. After the bucket is created, go to Management to create a lifecycle rule that cleans old objects from the bucket, so Churney manages your storage carefully.

Make sure you have the below settings

Create a workgroup for Churney

5. We will create a workgroup for Churney https://docs.aws.amazon.com/athena/latest/ug/workgroups-procedure.html. Replace {region-name} with the name of your region https://{region-name}.console.aws.amazon.com/athena/home?region={region-name}#/workgroups for example, if that is us-east-1 then the link should be https://us-east-1.console.aws.amazon.com/athena/home?region=us-east-1#/workgroups

6. Click on Create workgroup

7. Give the workgroup a unique name, we would suggest something simple like churney-workgroup

8. Expand the Additional Configurations section and fill them as below, where the default location of the query result of Churney would always be on our bucket, which is the bucket that was created in Step 1-3. In this example, we used churney-athena-unload, so the path to s3 is s3://churney-athena-unload/

Create the Churney role

9. Figure out the database parameters that you will need to provide. Churney will need the name of the table(s) in Athena to export from, the database, and the data source. You can find these parameters in Athena, for example, in the screenshot below, if the table that will contain the data is user_activity, then table-name = user_activity, database-name = mydatabase and catalog-name = AwsDataCatalog. Keep these parameters in mind because they will be needed for the steps below.

10. Go to https://{region-name}.console.aws.amazon.com/iamv2/home?region={region-name}#/policies  where region-name is the name of the region where your Athena lie, for example if the region is us-east-1, then the link would look like this https://us-east-1.console.aws.amazon.com/iamv2/home?region=us-east-1#/policies and click on Create policy

11. Create a policy for Athena as below, where you need to replace all in bold: region-name with the region where Athena lies, for example us-east-1, and account-id with the id of the Athena admin account. You will also need to provide the workgroup-name which is the name you provided in Step 6 and 7

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "athena:GetTableMetadata",
                "athena:StartQueryExecution",
                "athena:GetSession",
                "athena:GetQueryResults",
                "athena:GetDatabase",
                "athena:GetDataCatalog",
                "athena:GetQueryRuntimeStatistics",
                "athena:GetNamedQuery",
                "athena:GetPreparedStatement",
                "athena:GetWorkGroup",
                "athena:GetSessionStatus",
                "athena:StopQueryExecution",
                "athena:GetQueryExecution"

            ],
            "Resource": [
                "arn:aws:athena:<region-name>:<account-id>:datacatalog/*",
            "arn:aws:athena:<region-name>:<account-id>:workgroup/<workgroup-name>"
            ]
        }
    ]
}

12. Create a policy for the S3 bucket where Churney will export the data. Replace churney-bucket-name  with the name of the bucket you chose in Step 1-3. Note that is the bucket where Churney will unload the hashed data from the views you created above.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<churney-bucket-name>/*",
                "arn:aws:s3:::<churney-bucket-name>"
            ]
        }
    ]

13. Create a policy for AWS Glue to access the hashed views. Replace region-name with the region where Athena lies, for example us-east-1, and account-id with the id of the Athena admin account. Replace database-name with the named of the database with the hashed views, so in this example, that would be churney

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetPartition",
                "glue:GetTableVersion",
                "glue:GetTables",
                "glue:GetTableVersions",
                "glue:GetPartitions",
                "glue:GetColumnStatisticsForTable",
                "glue:GetDatabases",
                "glue:GetPartitionIndexes",
                "glue:GetTable",
                "glue:GetColumnStatisticsForPartition"
            ],
            "Resource": [
                "arn:aws:glue:<region-name>:<account-id>:catalog",
                "arn:aws:glue:<region-name>:<account-id>:database/<churney-database>",
                "arn:aws:glue:<region-name>:<account-id>:table/<churney-database>/*"
            ]
        }
    ]
}

14. Create a policy for AWS Glue to be able to read data from the hashed views. For this, we need access to the tables from which you created hashed views. Replace region-name with the region where Athena lies, for example us-east-1, and account-id with the id of the Athena admin account. Replace database-name with the name of the database as described in Step 10, and table_name_1 and table_name_2 with the names of the raw tables you created hashed views for. In our example, we had only 1 table user_activity and the database-name is mydatabase.

Note that by listing all the tables, we limit access only to those tables that you would like to share with Churney.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetPartitionIndexes",
                "glue:GetTable"

            ],
            "Resource": [
                "arn:aws:glue:<region-name>:<account-id>:catalog",
             "arn:aws:glue:<region-name>:<account-id>:database/<database-name>",
           "arn:aws:glue:<region-name>:<account-id>:table/<database-name>/<table_name_1>",
"arn:aws:glue:<region-name>:<account-id>:table/<database-name>/<table_name_2>"
            ]
        }
    ]
}

15. Create a policy for the S3 bucket where Athena queries tha data from. This is the location of the physical table data that Athena refers to. To find the bucket of the data, you can click on the three dots in the table you want to find the s3 bucket for, for example in the screenshot below the table is user_activity and click on View in Glue. in Location, highlighted in blue, you can see the s3 bucket. In this example, that is s3://one-time-athena-experiment/activity, so athena-bucket-name is one-time-athena-experiment

Note that we are adding a condition to limit the access to S3 only through Athena! This means that Churney cannot read the contents of your S3 bucket directly unless we access it through Athena.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-bucket-name>",
                "arn:aws:s3:::<athena-bucket-name>/*"
            ],
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:CalledVia": "athena.amazonaws.com"
                }
            }
        }
    ]
}

It can happen that your tables originate from different S3 buckets, so make sure to check the physical S3 location of each table that you will share with Churney through views. Let’s suppose you have two different buckets, then the above code would look like:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-bucket-name_1>",
                "arn:aws:s3:::<athena-bucket-name_1>/*",
                "arn:aws:s3:::<athena-bucket-name_2>",
                "arn:aws:s3:::<athena-bucket-name_2>/*"
            ],
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:CalledVia": "athena.amazonaws.com"
                }
            }
        }
    ]
}

16. Create a role by going to this link https://<region-name>.console.aws.amazon.com/iamv2/home#/roles where region-name is the region where Athena is and click on Create role. For the purpose of the example, the name of the role is churney-role

17. Choose a custom trust policy and add the trust policy as below, Churney will give you the ids needed for this step, so what you should fill in is id_1 and id_2 The way we will use this trust policy is all described in this link and follows the best practices https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "accounts.google.com"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "accounts.google.com:sub": [
                        "<id_1>",
                        "<id_2>"
                    ]
                }
            }
        }
    ]
}

On the next step, you will be required to add permissions, remember to add all permissions we created on Step 11, 12, 13, 14, and 15

And the role should finally look like this, where you can see the policies and if you click on Trust relationships, you should be able to see the trust relationship

What to send to Churney?

After you are done with the above steps, you will need to send these details to Churney:

1 .The arn of the role you created in Step 17. You can find that in this link https://<region-name>.console.aws.amazon.com/iamv2/home#/roles/details/<role-name>?section=permissions where region-name is the name of the region where Athena lies and role-name is the name of the role you created in Step 17. For our example, the link would be https://us-east-1.console.aws.amazon.com/iamv2/home#/roles/details/churney-role?section=permissions and the ARN is the one marked in red below

2 .The details of the Athena data (the hashed views), which would be 

catalog-name e.g. AwsDataCatalog

database-name e.g. churney

table-name e.g. user_activity

3 . The name of the S3 bucket where Churney will unload the data, this is the bucket you created in Step 1

  • churney-bucket-name e.g. churney-athena-unload

4 . The name of workgroup you created above in Step 6.