@yashkumar you've just about hit the nail on the head. Bulk insert is great when you have thousands or millions of records to push into Snowflake because it's more efficient (up to 1 MB of data per insert) and thus consumes fewer credits in the long run, you don't have to write any SQL, and you can use the mapper you're used to. However, with bulk insert you most likely need an additional flow to run your merge query so that you can dedupe your data. Outside of the bulk insert option, you have "run once per page of records" and "run once per record". These two options require you to write some SQL, but they let you avoid stringing two flows together, since the SQL can just be a merge query that inserts and updates your data. Of those two options, running once per page is preferred because you can run the query once for the entire page of data (20 records per page by default) instead of running a query for each individual record.
Bulk insert
Pros:
- Supports dropdown-style selection of table names
- Mapper support with a dropdown of available columns for the chosen table
- No SQL required for inserting
- More performant for getting data into Snowflake quickly, thus reducing warehouse compute time; it can insert up to 1 MB of data per query
Cons:
- You most likely need another flow to run your merge query so you can dedupe your data
- If an error occurs, it fails the entire batch sent to Snowflake, since the whole query has to succeed for Snowflake to accept it (this is typical for any database though)
Typical setup (the merge query run by the second flow to dedupe the staged data):
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
select
*
from {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}} as s
qualify
row_number() over (partition by s.primarykey||s.resource order by _updated_at desc) = 1) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
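For reference, the columns referenced in that merge imply staging and production tables shaped roughly like this (the names and types here are my assumption; parse_json() suggests the data column is a VARIANT):
create table if not exists my_database.my_schema.my_staging_table (
primarykey varchar
,resource varchar
,data variant
,_updated_at timestamp_ntz
,_created_at timestamp_ntz
)
;
The production table would have the same shape so the merge can compare and copy the columns one to one.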
Run once per page
Pros:
- One flow can get data from the source to Snowflake
- One query can be run for the whole page of data instead of a single record, saving you 20 times (by default) the number of query calls to Snowflake
Cons:
- Requires SQL to be written
- Slower than bulk insert for getting a lot of data into Snowflake
- If an error occurs, it fails the entire batch sent to Snowflake, since the whole query has to succeed for Snowflake to accept it (this is typical for any database though)
Typical setup:
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
{{#each batch_of_records}}
{{#if @last}}
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
{{else}}
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
union all
{{/if}}
{{/each}}
) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
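To make the handlebars a bit more concrete, the {{#each batch_of_records}} loop above would render the source subquery roughly like this for a hypothetical two-record page (the values are invented, and I'm assuming the string placeholders resolve to properly quoted SQL literals and {{timestamp}} to an epoch value):
select
'1001' as primarykey
,'customers' as resource
,parse_json('{"name":"Acme"}') as data
,1714000000000 as _updated_at
,1714000000000 as _created_at
union all
select
'1002' as primarykey
,'customers' as resource
,parse_json('{"name":"Globex"}') as data
,1714000000000 as _updated_at
,1714000000000 as _created_at
The last record on the page hits the {{#if @last}} branch, so it's the only select without a trailing union all.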
Run once per record
Pros:
- Great for single-step processing like running merge queries, table creates, table drops, clearing stage data, etc. (a cleanup example is sketched at the end of this post)
- If an error occurs, only that single record fails rather than the whole page/batch of records
Cons:
- The slowest of the three options if you're trying to get a lot of data in
- Requires SQL
Typical setup:
Same as run once per page, but with this query instead:
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
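And since run once per record also works well for single-step maintenance queries (the table creates, drops, and stage clearing mentioned in the pros), a step like this could follow the merge flow to clear out the staging table once the data has been deduped into production. This is just a sketch reusing the same settings placeholders:
truncate table if exists {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}}
;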