For Admins
This guide covers the initial setup required to use Databricks Labs in Vocareum, supported by resources from your own Databricks E2 and AWS accounts.
Alternative Approaches
You may also choose to connect an Azure account using the Cloud Labs: Bring Your Own Azure Account guide, then use the first-party Databricks offering within Azure Cloud Labs. In this case, Vocareum will help manage usage of your Azure account, but will not manage Databricks directly.
For higher education institutions, the Databricks University Alliance may be another option to support Databricks training within your courses. If you are interested in working with them, reach out through the Databricks Help Center (Contact Us).
Prerequisites/Considerations
Databricks SSO setup: Vocareum must be the identity provider
This means all UI access to Databricks has to go through Vocareum
Databricks resource limits:
maximum 3 active workspaces for standard
maximum 10 active workspaces for premium
maximum 50 active workspaces for enterprise
Configure your AWS account
For a standard setup, two IAM roles and one S3 bucket need to be created.
If you will be using a metastore with metastore-level managed storage in AWS, you will need to create two S3 buckets in total. For more information, refer to the Databricks on AWS documentation.
IAM Role: vocareumvm
The first role needs to be named vocareumvm. Use the following for the permissions policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VocDbEc2Access",
"Effect": "Allow",
"Action": [
"ec2:AllocateAddress",
"ec2:AssociateRouteTable",
"ec2:AttachInternetGateway",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:CreateInternetGateway",
"ec2:CreateNatGateway",
"ec2:CreateRoute",
"ec2:CreateRouteTable",
"ec2:CreateSecurityGroup",
"ec2:CreateSubnet",
"ec2:CreateTags",
"ec2:CreateVpc",
"ec2:CreateVpcEndpoint",
"ec2:DeleteInternetGateway",
"ec2:DeleteNatGateway",
"ec2:DeleteRoute",
"ec2:DeleteRouteTable",
"ec2:DeleteSecurityGroup",
"ec2:DeleteSubnet",
"ec2:DeleteVpc",
"ec2:DescribeAccountAttributes",
"ec2:DescribeAddresses",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeCustomerGateways",
"ec2:DescribeDhcpOptions",
"ec2:DescribeEgressOnlyInternetGateways",
"ec2:DescribeInstances",
"ec2:DescribeInternetGateways",
"ec2:DescribeNatGateways",
"ec2:DescribeNetworkAcls",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeRegions",
"ec2:DescribeRouteTables",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeTags",
"ec2:DescribeVpcAttribute",
"ec2:DescribeVpcEndpoints",
"ec2:DescribeVpcEndpointServiceConfigurations",
"ec2:DescribeVpcPeeringConnections",
"ec2:DescribeVpcs",
"ec2:DescribeVpnConnections",
"ec2:DescribeVpnGateways",
"ec2:DetachInternetGateway",
"ec2:DisassociateRouteTable",
"ec2:ModifySubnetAttribute",
"ec2:ModifyVpcAttribute",
"ec2:ReleaseAddress",
"ec2:RevokeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress"
],
"Resource": "*"
},
{
"Sid": "VocDbCfnAccess",
"Effect": "Allow",
"Action": [
"cloudformation:CreateStack",
"cloudformation:DeleteStack",
"cloudformation:DescribeStackEvents",
"cloudformation:DescribeStacks",
"cloudformation:GetStackPolicy",
"cloudformation:ListStacks",
"cloudformation:UpdateTerminationProtection"
],
"Resource": "*"
}
]
}
Use the following for the trust policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::{{your AWS account ID}}:role/vocareumvm",
"arn:aws:iam::{{our AWS account ID}}:root"
]
},
"Action": "sts:AssumeRole"
},
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
We will let you know what "our AWS account ID" is.
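The trust policy has two placeholders to fill in, and a stray character in hand-edited JSON is easy to miss. A minimal sketch of generating it instead, assuming both IDs are plain 12-digit AWS account IDs; the function name `vocareumvm_trust_policy` and the example IDs are made up for illustration:

```python
import json


def vocareumvm_trust_policy(your_account_id: str, vocareum_account_id: str) -> dict:
    """Return the vocareumvm trust policy with both account IDs filled in."""
    for account_id in (your_account_id, vocareum_account_id):
        # AWS account IDs are 12-digit numbers.
        if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
            raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        f"arn:aws:iam::{your_account_id}:role/vocareumvm",
                        f"arn:aws:iam::{vocareum_account_id}:root",
                    ]
                },
                "Action": "sts:AssumeRole",
            },
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            },
        ],
    }


if __name__ == "__main__":
    # Both IDs are placeholders; use your own ID and the one Vocareum gives you.
    print(json.dumps(vocareumvm_trust_policy("111111111111", "222222222222"), indent=2))
```

Serializing with `json.dumps` guarantees the output is valid JSON before it is pasted into the IAM console.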
IAM Role: vocareum-db
The second role can be named anything, but vocareum-db will work. Use the following for the permissions policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1403287045000",
"Effect": "Allow",
"Action": [
"ec2:AssociateIamInstanceProfile",
"ec2:AttachVolume",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:CancelSpotInstanceRequests",
"ec2:CreateTags",
"ec2:CreateVolume",
"ec2:DeleteTags",
"ec2:DeleteVolume",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeIamInstanceProfileAssociations",
"ec2:DescribeInstanceStatus",
"ec2:DescribeInstances",
"ec2:DescribeInternetGateways",
"ec2:DescribeNatGateways",
"ec2:DescribeNetworkAcls",
"ec2:DescribePrefixLists",
"ec2:DescribeReservedInstancesOfferings",
"ec2:DescribeRouteTables",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSpotInstanceRequests",
"ec2:DescribeSpotPriceHistory",
"ec2:DescribeSubnets",
"ec2:DescribeVolumes",
"ec2:DescribeVpcAttribute",
"ec2:DescribeVpcs",
"ec2:DetachVolume",
"ec2:DisassociateIamInstanceProfile",
"ec2:ReplaceIamInstanceProfileAssociation",
"ec2:RequestSpotInstances",
"ec2:RevokeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress",
"ec2:RunInstances",
"ec2:TerminateInstances",
"ec2:DescribeFleetHistory",
"ec2:ModifyFleet",
"ec2:DeleteFleets",
"ec2:DescribeFleetInstances",
"ec2:DescribeFleets",
"ec2:CreateFleet",
"ec2:DeleteLaunchTemplate",
"ec2:GetLaunchTemplateData",
"ec2:CreateLaunchTemplate",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeLaunchTemplateVersions",
"ec2:ModifyLaunchTemplate",
"ec2:DeleteLaunchTemplateVersions",
"ec2:CreateLaunchTemplateVersion",
"ec2:AssignPrivateIpAddresses",
"ec2:GetSpotPlacementScores"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"iam:CreateServiceLinkedRole",
"iam:PutRolePolicy"
],
"Resource": "arn:aws:iam::*:role/aws-service-role/spot.amazonaws.com/AWSServiceRoleForEC2Spot",
"Condition": {
"StringLike": {
"iam:AWSServiceName": "spot.amazonaws.com"
}
}
}
]
}
Use the following for the trust policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::414351767826:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "{{your Databricks E2 account ID}}"
}
}
}
]
}
S3 Bucket: vocareum-db-bucket
The S3 bucket can be named anything, but vocareum-db-bucket will work. Use the following for the bucket permissions policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Grant Databricks Access",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::414351767826:root"
},
"Action": [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::vocareum-db-bucket/*",
"arn:aws:s3:::vocareum-db-bucket"
],
"Condition": {
"StringEquals": {
"aws:PrincipalTag/DatabricksAccountId": "{{your Databricks E2 account ID}}"
}
}
}
]
}
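Because the bucket can be named anything, the name has to be changed in both `Resource` entries of the policy. A small sketch that derives the policy from the bucket name, so the two ARNs cannot drift apart; the function name `bucket_policy` and the example values are illustrative assumptions, while the principal and actions are copied from the policy above:

```python
import json

# Databricks principal exactly as it appears in the bucket policy above.
DATABRICKS_PRINCIPAL = "arn:aws:iam::414351767826:root"


def bucket_policy(bucket_name: str, databricks_account_id: str) -> dict:
    """Return the bucket policy for the given bucket and Databricks E2 account ID."""
    if not bucket_name or not databricks_account_id:
        raise ValueError("bucket name and Databricks account ID are both required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Grant Databricks Access",
                "Effect": "Allow",
                "Principal": {"AWS": DATABRICKS_PRINCIPAL},
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                # Object-level actions need the /* ARN; bucket-level ones need the bare ARN.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:PrincipalTag/DatabricksAccountId": databricks_account_id
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # "my-lab-bucket" and the account ID are placeholders.
    print(json.dumps(bucket_policy("my-lab-bucket", "00000000-0000-0000-0000-000000000000"), indent=2))
```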
Quotas
You do not need to request AWS service quota increases immediately, but you will need them as usage scales.
EC2-VPC Elastic IPs
Recommended value: 100
This is probably the most important quota, and the most difficult one to get increased
VPCs per Region
Recommended value: 150
This is another important quota that is difficult to get increased
Gateway VPC endpoints per Region
Recommended value: 150
NAT gateways per Availability Zone
Recommended value: 100
Storage for General Purpose SSD (gp3) volumes, in TiB
Recommended value: 400
Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances
Recommended value: 5000
All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests
Recommended value: 1000
Elastic IP address quota per NAT gateway
Recommended value: 8
Running On-Demand G and VT instances
Recommended value: 100
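To decide which increases to request first, it can help to compare your current quota values against the recommendations listed above. The sketch below does only the comparison; how the current values are collected (for example, by reading them from the Service Quotas console) is left open, and the function name `quotas_to_raise` and the sample values are assumptions for illustration:

```python
# Recommended values, copied from the list above.
RECOMMENDED_QUOTAS = {
    "EC2-VPC Elastic IPs": 100,
    "VPCs per Region": 150,
    "Gateway VPC endpoints per Region": 150,
    "NAT gateways per Availability Zone": 100,
    "Storage for General Purpose SSD (gp3) volumes, in TiB": 400,
    "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances": 5000,
    "All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests": 1000,
    "Elastic IP address quota per NAT gateway": 8,
    "Running On-Demand G and VT instances": 100,
}


def quotas_to_raise(current: dict) -> dict:
    """Map quota name -> (current, recommended) for every quota below its recommendation.

    Quotas missing from `current` are skipped, since there is nothing to compare.
    """
    shortfalls = {}
    for name, recommended in RECOMMENDED_QUOTAS.items():
        value = current.get(name)
        if value is not None and value < recommended:
            shortfalls[name] = (value, recommended)
    return shortfalls


if __name__ == "__main__":
    # Sample values are made up; replace them with what your account reports.
    sample = {"EC2-VPC Elastic IPs": 5, "VPCs per Region": 150}
    for name, (have, want) in quotas_to_raise(sample).items():
        print(f"{name}: {have} -> {want}")
```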
Configure your Databricks account
SSO
Configure the authentication with the following values:
Identity protocol: SAML 2.0
Single Sign-On URL: https://labs.vocareum.com/idp/databricks.php
Entity ID: https://labs.vocareum.com/idp/metadata.php
x.509 Certificate:
-----BEGIN CERTIFICATE-----
MIIDiTCCAnGgAwIBAgIBADANBgkqhkiG9w0BAQsFADBfMQswCQYDVQQGEwJVUzET
MBEGA1UECAwKQ2FsaWZvcm5pYTERMA8GA1UEBwwIU2FuIEpvc2UxETAPBgNVBAoM
CFZvY2FyZXVtMRUwEwYDVQQDDAx2b2NhcmV1bS5jb20wHhcNMjMwMTI0MjI1MDU3
WhcNMjUwMTIzMjI1MDU3WjBfMQswCQYDVQQGEwJVUzETMBEGA1UECAwKQ2FsaWZv
cm5pYTERMA8GA1UEBwwIU2FuIEpvc2UxETAPBgNVBAoMCFZvY2FyZXVtMRUwEwYD
VQQDDAx2b2NhcmV1bS5jb20wggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIB
AQDrKf0u2WbQ+R4utxEj0hD7Stgj6SGq207kCHI+XtIThgFZTMGyVoGyeDlgTNgZ
wG/+Qm45R7GOeRIq8gC1B4R6WidFg0xEURYE6kkqQ6CFHhqKIb144RQQyN3jfc3n
g8CxzZrS2j5BRTKy2oiYP16xXiWMjg5qL6gXDchM/VjN6+kgXf54WGc9TT98vQWC
yd8H2UaM43hujOlrprtr5PsQZhc9uDevcbj1YgIK+W4ox0QqNbUJPJgSrzFkukVy
rjZKSwLLCn6FtzCu3AfYmk0/+NqdqRsPQNuReiMVkuyVO5A+jPNjxchldDg/LQkF
KX/3lmLSybtNwfJdPtK6UKH9AgMBAAGjUDBOMB0GA1UdDgQWBBRPWM+O1uAzlxEN
/vRXZ4gTqnKRyTAfBgNVHSMEGDAWgBRPWM+O1uAzlxEN/vRXZ4gTqnKRyTAMBgNV
HRMEBTADAQH/MA0GCSqGSIb3DQEBCwUAA4IBAQDjRVBbPyTTCkQo8MVdEnL4Ou3w
tfnzFhWl69O6AEUyF7RKab0FE9kCPpwh/2/6lMG6dvtnFJDfeUIEluz2mho7UqGz
pDH72/6TDTootYvs01wSBMXof7F7ZFJ+lul7lA+4sjSrr6GcB6StaD3qENY7rG32
8Ty16bvUZLq11kvM+6NbqQdpe9dg+9N0Ju9krg63zoox4cQDe4JRd/dH7/yZr5DO
xcXrN7zR2QZ4duNOk/EZMNg6gLOBQ5Y+j2QcuWTZ3XtUO5j2wW6/C/AGSRhdhnon
wmj4ZDdUr3mTZvf03+77hAbyoIdjsdyhjiYyLth1FIP+ITnPGQZBykKIWeyz
-----END CERTIFICATE-----
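If the certificate ends up on a single line when copied, with spaces where the line breaks should be, some tools may not accept it. A minimal sketch, using only the Python standard library, that rewraps such text into the usual PEM layout of 64-character lines; the function name `normalize_pem` is an illustrative assumption, and the check only confirms that the body is valid base64, not that the certificate is trusted or unexpired:

```python
import base64

BEGIN = "-----BEGIN CERTIFICATE-----"
END = "-----END CERTIFICATE-----"


def normalize_pem(text: str) -> str:
    """Rewrap a PEM certificate whose line breaks were lost or replaced by spaces."""
    text = text.strip()
    if not (text.startswith(BEGIN) and text.endswith(END)):
        raise ValueError("missing BEGIN/END CERTIFICATE markers")
    # Drop the markers, then remove every space and newline from the base64 body.
    body = "".join(text[len(BEGIN):-len(END)].split())
    # Raises binascii.Error (a ValueError subclass) if the body is not valid base64.
    base64.b64decode(body, validate=True)
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return "\n".join([BEGIN, *lines, END]) + "\n"


if __name__ == "__main__":
    # Dummy payload for demonstration only; paste the real certificate text instead.
    dummy = base64.b64encode(bytes(range(100))).decode()
    print(normalize_pem(f"{BEGIN} {dummy} {END}"), end="")
```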
Service Principal
One service principal needs to be created.
Assign the "Account admin" role to it
Generate an OAuth secret for that service principal
Note down the secret, since you will need to send it to Vocareum
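Before handing over the secret, you may want to confirm that it works. The sketch below only builds the token request and does not send it. The endpoint URL, the `client_credentials` grant, and the `all-apis` scope reflect my understanding of the Databricks OAuth machine-to-machine flow and are not taken from this guide, so check them against the Databricks documentation. The function name `build_token_request` and the sample values are hypothetical:

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(account_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth client-credentials token request.

    client_id is the service principal's Application ID (UUID);
    client_secret is the OAuth secret generated for it.
    """
    # Assumed account-level token endpoint; verify against the Databricks docs.
    url = f"https://accounts.cloud.databricks.com/oidc/accounts/{account_id}/v1/token"
    form = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    # HTTP Basic authentication: base64("client_id:client_secret").
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        url,
        data=form,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


if __name__ == "__main__":
    # Placeholder values; never hard-code or log a real secret.
    req = build_token_request("my-account-id", "my-application-id", "my-secret")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

A successful response to the sent request would indicate that the Application ID and secret match. Avoid printing the `Authorization` header, because it contains the secret in reversible form.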
Metastore
If you will be using a metastore, you can also create it at this time (see https://docs.databricks.com/aws/en/data-governance/unity-catalog/create-metastore).
Information We Need
Some of this information is sensitive, so contact Vocareum to ask who it should be sent to.
AWS role ARN for the vocareum-db role
AWS S3 bucket name
Databricks E2 account ID
Databricks service principal's secret
Databricks service principal's secret expiration date
Databricks service principal's ID
Databricks service principal's UUID (also known as the Application ID)
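Since seven separate values have to be collected, a quick completeness check before sending them can save a round trip. A sketch under stated assumptions: the class `VocareumHandoff`, its field names, and the role ARN pattern are all made up for illustration, and the function reports only field names so that no secret is echoed:

```python
import re
from dataclasses import dataclass, fields


@dataclass
class VocareumHandoff:
    """The seven values from the list above (field names are illustrative)."""
    role_arn: str
    bucket_name: str
    databricks_account_id: str
    sp_secret: str
    sp_secret_expiration: str
    sp_id: str
    sp_application_id: str


# Loose shape check for an IAM role ARN: arn:aws:iam::<12 digits>:role/<name>
ROLE_ARN = re.compile(r"arn:aws:iam::\d{12}:role/[\w+=,.@/-]+")


def problems(info: VocareumHandoff) -> list:
    """Return a list of issues by field name, without including any values."""
    issues = [f"{f.name} is empty" for f in fields(info) if not getattr(info, f.name).strip()]
    arn = info.role_arn.strip()
    if arn and not ROLE_ARN.fullmatch(arn):
        issues.append("role_arn does not look like an IAM role ARN")
    return issues


if __name__ == "__main__":
    # All values are placeholders.
    draft = VocareumHandoff(
        role_arn="arn:aws:iam::111111111111:role/vocareum-db",
        bucket_name="vocareum-db-bucket",
        databricks_account_id="",
        sp_secret="placeholder",
        sp_secret_expiration="2030-01-01",
        sp_id="1234567890",
        sp_application_id="00000000-0000-0000-0000-000000000000",
    )
    print(problems(draft))
```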
