Snowflake Bulk Vs Once Query Difference

Hello everyone,

I would greatly appreciate your assistance in clarifying when to utilize bulk queries versus once per record queries.

My flow is supposed to run every 30 minutes, and I may encounter scenarios where there are more than 50 records or none at all. I'm unsure about the criteria for deciding which type of query to use in such cases.

Additionally, I'm curious if the selection of query type is determined solely by the quantity of data or if it depends on other factors. For instance, should we use the once per record query for custom queries and the bulk query for automatically generated queries?

Thank you in advance for your guidance and insights.

@yashkumar you've just about hit the nail on the head. Bulk insert is great when you have thousands or millions of records to push into Snowflake: it's more efficient (up to 1 MB of data per insert) and thus consumes fewer credits in the long run, you don't have to write any SQL, and you can use the mapper that you're used to. However, with bulk insert you most likely need an additional flow to run your merge query so that you can dedup your data. Outside of the bulk insert option, you have "run once per page of records" and "run once per record". These two options require you to write some SQL, but they let you avoid stringing two flows together, since the SQL can just be a merge query that inserts and updates your data. Of those two options, running once per page is preferred because you run the query once for the entire page of data (the default is 20 records per page) instead of having to run a query for each individual record.

Bulk insert
Pros:

  • Supports dropdown style selection for table names
  • Mapper support with a dropdown of available columns for the chosen table
  • No SQL required for inserting
  • More performant for getting data quickly into Snowflake, thus reducing warehouse compute time. Can insert 1 MB of data per query.

Cons:

  • You most likely need another flow to run your merge query so you can dedup your data.
  • If an error occurs it will fail the entire batch sent to Snowflake since the entire query would have to be successful for Snowflake to accept it (this is typical for any database though)

Typical setup (this is the merge query the follow-up flow runs to dedup the staged data into the production table):

-- merge the deduped staging rows into the production table; the qualify clause
-- keeps only the most recent staged row per primarykey + resource
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
select
*
from {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}} as s
qualify
row_number() over (partition by s.primarykey||s.resource order by _updated_at desc) = 1) as s on s.primarykey = t.primarykey and s.resource = t.resource
-- update existing rows only when the staged row is newer
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
-- insert rows that don't exist in production yet
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
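
For reference, the staging table that the bulk insert loads and the merge above reads could look something like this. This is just a sketch: the column names come from the merge query, but the types are assumptions (data is a variant since the merge parses JSON into it).

create table if not exists {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}} (
primarykey varchar -- type assumed
,resource varchar -- type assumed
,data variant -- raw record payload, copied straight into production by the merge
,_updated_at timestamp_ntz -- type assumed; could just as well be a number if you store epoch timestamps
,_created_at timestamp_ntz -- type assumed
)
;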



Run once per page

Pros:

  • One flow can be built to get data from the source into Snowflake
  • One query can be run for the entire page of data instead of a single record, cutting the number of query calls to Snowflake by a factor of 20 (the default page size)

Cons:

  • Requires SQL to be written
  • Slower to get a lot of data into Snowflake
  • If an error occurs it will fail the entire batch sent to Snowflake since the entire query would have to be successful for Snowflake to accept it (this is typical for any database though)

Typical setup:

merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
{{#each batch_of_records}}
{{#if @last}}
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
{{else}}
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at

union all

{{/if}}
{{/each}}
) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
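
To make the handlebars a bit more concrete: the {{#each batch_of_records}} / {{#if @last}} block emits one select per record in the page and glues them together with union all (the last record simply skips the trailing union all). For a two-record page, the using (...) source boils down to something roughly like this (made-up values, just for illustration):

select
1001 as primarykey
,'customers' as resource
,parse_json('{"id": 1001, "name": "Acme"}') as data
,1714000000000 as _updated_at
,1714000000000 as _created_at

union all

select
1002 as primarykey
,'customers' as resource
,parse_json('{"id": 1002, "name": "Globex"}') as data
,1714000000000 as _updated_at
,1714000000000 as _created_at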


Run once per record

Pros:

  • Great for single-step processing like running merge queries, table creates, table drops, clearing stage data, etc. (see the stage-clearing sketch after the query below)
  • If an error occurs, only that single record fails rather than the entire page/batch of records sent

Cons:

  • Will be the slowest of all the options if you're trying to get a lot of data in
  • Requires SQL

Typical setup:

Same as run once per page, but with this query.

merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
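
Since run once per record also works well for the housekeeping steps mentioned above, a stage-clearing query could look something like this (just a sketch reusing the staging settings placeholders from the earlier merge):

truncate table {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}}
;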

Thanks a lot, Tyler!

I wanted to ask about one more point: if I have to perform both an add and an update on the same table, does it require two separate steps?

@yashkumar the merge query for Snowflake handles both inserts and updates so you could use it like I showed above. The recommended approach is to have one flow that bulk inserts data into a table, then have another flow triggered to run afterwards that merges the staged data into some final production table.