This document is intended to serve as a plan for migrating on-premises and/or cloud IaaS Oracle databases and tables to Snowflake.

Rationale for Migration

There are two main drivers: one is cost, and the second is continuous maintenance, which in turn adds to cost. Oracle is a relational database (RDBMS) that often comes with high licensing fees to host an enterprise’s most important data. Oracle also requires a significant amount of work to set up and maintain, both for the hardware (network, storage, OS patching, configuration) and the software (configuration, updates). If there is a need to utilize the data on these systems for analytical purposes, there will often be additional licensing fees for the OLAP counterparts, or the added risk of analytical workloads interfering with OLTP systems that are serving end users. All of these things can quickly add up to create a costly product that is taxing to both the budget and the IT workforce. There are reasons beyond licensing costs for wanting to migrate, some called out above. It should be noted that the migration plan laid out here will work for far more systems than just Oracle, but the context of this document is to focus on Oracle migrations.

Why Snowflake

Snowflake is a Data Warehouse as a Service solution that allows its users to interface with the software without having to worry about the infrastructure it runs on or how to install it. Between the reduction in operational complexity, the pay-for-what-you-use pricing model, and the ability to isolate compute workloads, there are numerous ways to reduce the costs associated with performing analytical tasks. Some other benefits and capabilities include:

  • Data sharing: Easily share data securely within your organization or externally with your customers
  • Zero copy cloning: Create multiple ‘copies’ of tables, schemas, or databases without actually copying the data. This saves on the time to copy and reduces data storage costs.
  • Separate compute and storage: Scale your compute and storage independent of one another, and isolate compute power for jobs that need their own dedicated warehouse.
  • No hardware provisioning: No hardware to provision, just a t-shirt sized warehouse available as needed within seconds.

Snowflake Considerations

Snowflake is built on public cloud infrastructure and can be deployed to AWS, Azure, or GCP. When moving to Snowflake, there are a couple of things to consider regarding which supported cloud platform should be used. For the purposes of this migration, AWS technologies will be used when options are available.

Migration Goals and Requirements

The primary goals of this migration are to reduce the costs and operational burden of running licensed OLAP database systems on-premises, in favor of a cloud-native, pay-for-what-you-use SaaS counterpart. It seems only natural that the migration process itself would have similar goals, and this will be a foundational concept applied throughout the migration plan. Further, it would be difficult to build a solution without some sort of requirements. There is a lot of value in knowing requirements up front, as they help drive the design of a system and can often be used to create SLAs against which the solution can be evaluated. This solution must:

  • Migrate specific databases/schemas/tables (defined by configuration) from the source relational database system, Oracle.
  • Migrate to an equivalent target database/schema/table in Snowflake.
  • Migrate both retrospective and prospective data.
  • Migrate prospective data from the source database to Snowflake with a latency of under five minutes.

Migration Plan

The data flow for migrating any source system to Snowflake can be broadly described as a three-step approach: data is extracted from the source database, landed in an external stage, and then loaded from the stage into the target Snowflake tables.

This flow is pretty straightforward, but it immediately introduces something that is not obvious: a small pit stop for the source data in an ‘external stage’. This is because Snowflake has no native tools for pulling data directly from a database. Instead, it uses the concept of a Stage, which can be either internal or external, as a point to load data from and unload data to.

Snowflake Stages are effectively pointers to public cloud object storage locations plus metadata about the files in those locations. These object storage locations are ideal for staging data because they scale nearly infinitely, and large enterprises often build data lakes on this storage, making it convenient to get data into Snowflake and allowing data in Snowflake to be easily contributed back to the lake.

These two data movement steps, source to stage and stage to Snowflake, are distinct processing steps of the migration and will be outlined in the coming sections. Per the migration goals, each of these steps should attempt to minimize costs by utilizing cloud infrastructure (preferring SaaS solutions to PaaS, PaaS to IaaS, and IaaS to on-premises) and allowing cost to scale with the migration.

Migration – Source to Stage

Snowflake has many technology partners that assist with moving data from a source database to an object storage location that can be used as an external stage. These include FiveTran, Attunity, Informatica, HVR, and Streamsets, all of which ScriptStory has past experience with, but this migration will instead focus on a cloud-native solution from AWS: the Database Migration Service (DMS).

AWS DMS is a data migration service that supports highly configurable data migrations and replications between various sources and targets. Oracle is included as an available source system, making this the perfect tool to handle data collection from our source Oracle database. Targets include Apache Kafka and multiple AWS services such as S3, DynamoDB, and Kinesis, but for the purposes of this paper we are going to focus on S3 for Snowflake consumption.

AWS DMS appears to meet the migration goals defined above. It is a SaaS solution by AWS that is primarily set up through configuration. Costs scale with the number and size of the replication instances needed for a migration. Its replication tasks have settings for bulk-load migration to support retrospective data, as well as CDC for prospective data. It also supports S3 as a write target, which is going to be used as the stage for Snowflake. The replication tasks achieve near real-time latency for CDC by using Oracle’s native API to read archived redo logs as they are written to.

Building this step of the migration is primarily done by configuring AWS DMS to produce the desired results, but that isn’t the only work involved. Ensuring that the DMS infrastructure is authorized to access both the Oracle database and the S3 target is important, as is setting up the source database to produce the data that is needed for the migration.

DMS Infrastructure

  • Replication Instance: This is the infrastructure that runs a replication task. Creation of this resource will require network configuration (VPCs, Subnets, Security Groups) and instance class/size details. Additional considerations include whether or not this instance is publicly available and whether or not it is highly available.

  • Replication Task: This is the definition of how data is replicated from the source to the target. The type of the migration is defined here, CDC and/or Full (bulk) Load, as well as settings that include (but are not limited to) logging, CDC tuning, error handling, bulk-load tuning, validation, and transformation. This is where the majority of the configuration for this step of the migration will be.
  • Source Endpoint: A holder for the connection details for the source of a replication task. This will be the Oracle connection details such as: username, password, hostname, port, and database.
  • Target Endpoint: Similar to the source endpoint, this is a holder of connection details for the target. For S3 this would include bucket, object key (folder) and IAM Role, but also includes additional configuration details such as encryption, compression, max file size and data format.
  • S3 Bucket: The storage location for the external stage for Snowflake.
  • IAM Role (not shown): An S3 target is authorized by an IAM role, rather than a database user, so it is necessary to include an IAM Role. This role must have access to write to the S3 location and have a trust policy that allows the DMS service to assume the role.

Authentication and Authorization

The replication task must authenticate against both the Oracle database and the S3 bucket to perform the first step of the migration. Authentication into both systems is specified via the respective endpoints. For Oracle, the database username and password provided would be for a user specifically created for use by AWS DMS. For the target, the IAM Role is the identity, and must have an attached trust policy that specifies AWS DMS as a valid principal.

Authorization for the Oracle database is specified via a series of GRANTS to allow access to the database redo logs. For the target, the IAM role must include a policy that allows it to write to the target S3 bucket, and the target S3 bucket’s policy must allow the IAM role to write to it.
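
To make this concrete, the following is a minimal sketch of the kind of user and grants involved; the user name, password, and table are placeholders, and the exact set of privileges depends on the DMS version and on whether LogMiner or Binary Reader is used to read the redo logs, so the AWS DMS documentation for Oracle sources remains the authoritative list.

    -- Hypothetical user created for AWS DMS (name and password are placeholders).
    CREATE USER dms_user IDENTIFIED BY StrongPassword123;
    GRANT CREATE SESSION TO dms_user;

    -- Read access to the tables selected for migration (hr.employees is an example).
    GRANT SELECT ON hr.employees TO dms_user;

    -- Representative grants for visibility into redo/archive log metadata used by CDC.
    GRANT SELECT ON V_$DATABASE TO dms_user;
    GRANT SELECT ON V_$LOG TO dms_user;
    GRANT SELECT ON V_$LOGFILE TO dms_user;
    GRANT SELECT ON V_$ARCHIVED_LOG TO dms_user;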

Replication tasks will output to S3 with a default data format of comma-separated values (.csv), but the format is configurable, with Parquet being recommended. This can be set via the extra connection attributes provided during Target Endpoint creation.

When CDC is involved, it is important to note a couple points about the data that is output.

  • The schema of the output for a given table will mirror that table’s schema, but will also include an additional ‘timestamp’ field and an ‘op’ field to reflect the point in time a change happened and the operation that the change represents: I[nsert], U[pdate], or D[elete] (see the sketch after this list).
  • The data is an append-only log of ongoing changes added to the end of the bulk load of data. This matters because this data will need to be reassembled in Snowflake for the tables to be reflective of the data in the source tables.
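
As a hedged illustration of what that means for the change table in Snowflake, consider a hypothetical source table ORDERS(ORDER_ID, STATUS, AMOUNT); the column names and types below are assumptions for the example, and the exact names of the operation and timestamp columns depend on the DMS task settings.

    -- Change table: mirrors the source table's columns plus the CDC metadata columns.
    CREATE TABLE orders_changes (
        op            VARCHAR(1),      -- 'I', 'U', or 'D'
        cdc_timestamp TIMESTAMP_NTZ,   -- point in time the change happened
        order_id      NUMBER,
        status        VARCHAR,
        amount        NUMBER(12,2)
    );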

Migration – Stage to Snowflake

Once data is in the external stage, the remainder of the migration can be achieved using Snowflake resources and capabilities.

Even though Snowflake is the primary requirement for this solution, it is still valid to evaluate it against the migration goals. Snowflake is a SaaS solution that builds data warehouse systems using SQL commands.

With Snowflake, costs accrue for storage use and compute use on a per-second basis. Snowflake’s pipe and task objects support building low-latency data pipelines. Finally, the data landing in S3 can be treated the same way through the Snowflake pipeline, whether retrospective or prospective.

Building this step of the migration involves configuring a couple of components to enable authorized access to the data in S3 and to ensure timely delivery to the Snowflake pipe. The remainder of this step is then built by executing DDL to create the pipe, tables, stream, and task.

Authentication and Authorization

Snowflake has the concept of a Storage Integration that it uses to provide credential-less access to AWS S3 external stages. These are set up using an IAM Role that is granted access to the S3 location. This role is then given a trust policy that allows a Snowflake-managed IAM principal to assume the original role, effectively delegating the role’s abilities to Snowflake.

Properly creating a storage integration involves a few back-and-forth steps, the net result of which is a long-term authentication grant to Snowflake via the trust policy. The policies attached to the IAM role and the S3 bucket define the authorization.
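
A minimal sketch of the Snowflake side of that setup is shown below; the integration name, stage name, role ARN, and bucket path are all placeholders. The DESC INTEGRATION output supplies the Snowflake-side principal and external ID that go into the IAM role’s trust policy, which is the back-and-forth portion of the process.

    -- Storage integration pointing at the S3 location that DMS writes to.
    CREATE STORAGE INTEGRATION dms_s3_integration
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-dms-stage-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://example-dms-bucket/oracle/');

    -- Returns the Snowflake-side IAM user ARN and external ID to place in the
    -- IAM role's trust policy.
    DESC INTEGRATION dms_s3_integration;

    -- External stage over the same location, used by the pipe in the next section.
    CREATE STAGE dms_stage
      URL = 's3://example-dms-bucket/oracle/'
      STORAGE_INTEGRATION = dms_s3_integration
      FILE_FORMAT = (TYPE = PARQUET);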

Migration Pipeline

This step of the migration involves two Snowflake tables. The first is a representation of the data that mirrors what is in S3: an append-only table that contains the operations performed against the original source database, the ‘change table’. The second reflects the (mostly) current state of the original source database, the ‘reporting table’, which is the final target for the data in the migration.

The pipe, often referred to as Snowpipe in Snowflake marketing materials, is used to keep the change table up to date with the latest data from S3. A pipe is effectively a COPY INTO statement that listens for changes to data in a stage. As new CDC data lands in S3 from the first step of the migration, S3 object notifications signal the pipe that new data has arrived. Upon receiving a signal, the pipe queues its COPY INTO command for a Snowflake-managed warehouse to copy the data into the change table. Charges for this warehouse are billed per second, with a small overhead for every 1,000 files.
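
A hedged sketch of such a pipe, reusing the placeholder stage and change table from the earlier examples (the path under the stage and the Parquet field names are assumptions about the DMS output layout):

    -- Pipe that loads new CDC files for the ORDERS table into the change table.
    -- AUTO_INGEST relies on the S3 event notifications described above.
    CREATE PIPE orders_changes_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO orders_changes (op, cdc_timestamp, order_id, status, amount)
      FROM (
        SELECT $1:Op::VARCHAR,
               $1:cdc_timestamp::TIMESTAMP_NTZ,
               $1:ORDER_ID::NUMBER,
               $1:STATUS::VARCHAR,
               $1:AMOUNT::NUMBER(12,2)
        FROM @dms_stage/ORDERS/
      )
      FILE_FORMAT = (TYPE = PARQUET);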

Table streams keep track of DML changes to a table, allowing action to be taken against the delta. Streams can be queried just like a table, and the changes a stream has captured are consumed (its offset advances) when the stream is used in a DML statement such as the MERGE described below. The stream on the change table keeps track of changes to the change table, which will only be insert operations for this migration, as the data coming from S3 is either the bulk-load data or the CDC data from the source database.
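
A minimal sketch of the stream on the placeholder change table; APPEND_ONLY is an assumption that matches the insert-only loads coming from the pipe.

    -- Stream that records rows inserted into the change table since it was last consumed.
    CREATE STREAM orders_changes_stream
      ON TABLE orders_changes
      APPEND_ONLY = TRUE;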

Tasks are scheduled executions of a specified SQL statement. Creating a task requires providing a warehouse with which the statement will execute. This step of the migration uses a task that starts up periodically, checks whether the stream has any new changes, and executes a MERGE INTO statement against the reporting table.
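
Continuing the placeholder names, a hedged sketch of such a task is shown below; the warehouse name and the reporting table ORDERS are assumptions, and the de-duplication and MERGE logic is illustrative rather than the only correct way to fold CDC changes in.

    -- Task that periodically folds pending changes from the stream into the reporting table.
    CREATE TASK orders_merge_task
      WAREHOUSE = migration_wh
      SCHEDULE = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_CHANGES_STREAM')
    AS
      MERGE INTO orders tgt
      USING (
        -- Keep only the most recent change per primary key from the stream.
        SELECT *
        FROM orders_changes_stream
        QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY cdc_timestamp DESC) = 1
      ) src
      ON tgt.order_id = src.order_id
      WHEN MATCHED AND src.op = 'D' THEN DELETE
      WHEN MATCHED THEN UPDATE SET tgt.status = src.status, tgt.amount = src.amount
      WHEN NOT MATCHED AND src.op <> 'D' THEN
        INSERT (order_id, status, amount) VALUES (src.order_id, src.status, src.amount);

Because the MERGE consumes the stream inside a DML statement, the stream’s offset advances only when the task’s transaction commits. Note also that a newly created task is suspended until it is resumed with ALTER TASK orders_merge_task RESUME.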

Pipeline Output

Together, these Snowflake resources take the CDC data as it lands in S3 and reassemble it into a representation that mirrors the source, with some degree of latency.