@yashkumar you've just about hit the nail on the head. Bulk insert is great when you have thousands or millions of records to push into Snowflake because it's more efficient (up to 1 MB of data per insert) and thus consumes fewer credits in the long run, you don't have to write any SQL, and you can use the mapper you're used to. However, with bulk insert you most likely need an additional flow to run your merge query so that you can dedupe your data. Outside of the bulk insert option, you have "run once per page of records" and "run once per record". These two options require you to write some SQL, but they let you avoid stringing two flows together, since the SQL can just be a merge query that inserts and updates your data. Of those two options, running once per page is preferred because you can run the query once for the entire page of data (20 records per page by default) instead of running a query for each individual record.
Bulk insert
Pros:
- Supports dropdown-style selection of table names
- Mapper support with a dropdown of available columns for the chosen table
- No SQL required for inserting
- More performant for getting data into Snowflake quickly, thus reducing warehouse compute time; it can insert up to 1 MB of data per query
Cons:
- You most likely need another flow to run your merge query so you can dedupe your data
- If an error occurs, it fails the entire batch sent to Snowflake, since the whole query has to succeed for Snowflake to accept it (this is typical for any database though)
Typical setup (the merge query run by the second flow to dedupe the staged data):
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
select
*
from {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}} as s
qualify
row_number() over (partition by s.primarykey||s.resource order by _updated_at desc) = 1) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
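For reference, the columns referenced in that merge imply staging and production tables shaped roughly like this (the names and types here are my assumption; parse_json() suggests the data column is a VARIANT):
create table if not exists my_database.my_schema.my_staging_table (
primarykey varchar
,resource varchar
,data variant
,_updated_at timestamp_ntz
,_created_at timestamp_ntz
)
;
The production table would have the same shape so the merge can compare and copy the columns one to one.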
Run once per page
Pros:
- One flow can get data from the source to Snowflake
- One query can be run for the whole page of data instead of a single record, saving you 20 times (by default) the number of query calls to Snowflake
Cons:
- Requires SQL to be written
- Slower than bulk insert for getting a lot of data into Snowflake
- If an error occurs, it fails the entire batch sent to Snowflake, since the whole query has to succeed for Snowflake to accept it (this is typical for any database though)
Typical setup:
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
{{#each batch_of_records}}
{{#if @last}}
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
{{else}}
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
union all
{{/if}}
{{/each}}
) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
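To make the handlebars a bit more concrete, the {{#each batch_of_records}} loop above would render the source subquery roughly like this for a hypothetical two-record page (the values are invented, and I'm assuming the string placeholders resolve to properly quoted SQL literals and {{timestamp}} to an epoch value):
select
'1001' as primarykey
,'customers' as resource
,parse_json('{"name":"Acme"}') as data
,1714000000000 as _updated_at
,1714000000000 as _created_at
union all
select
'1002' as primarykey
,'customers' as resource
,parse_json('{"name":"Globex"}') as data
,1714000000000 as _updated_at
,1714000000000 as _created_at
The last record on the page hits the {{#if @last}} branch, so it's the only select without a trailing union all.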
Run once per record
Pros:
- Great for single-step processing like running merge queries, table creates, table drops, clearing stage data, etc. (a cleanup example is sketched at the end of this post)
- If an error occurs, only that single record fails rather than the whole page/batch of records
Cons:
- The slowest of the three options if you're trying to get a lot of data in
- Requires SQL
Typical setup:
Same as run once per page, but with this query instead:
merge into {{{settings.flowGrouping.productionDatabase}}}.{{{settings.flowGrouping.productionSchema}}}.{{{settings.flowGrouping.productionTable}}} as t using (
select
{{record.id}} as primarykey
,{{record.resource}} as resource
,parse_json({{jsonSerialize record.data}}) as data
,{{timestamp}} as _updated_at
,{{timestamp}} as _created_at
) as s on s.primarykey = t.primarykey and s.resource = t.resource
when matched and s._updated_at > t._updated_at then
update set
t.data = s.data
,t._updated_at = s._updated_at
when not matched then
insert
(primarykey
,resource
,data
,_updated_at
,_created_at
)
values
(s.primarykey
,s.resource
,s.data
,s._updated_at
,s._created_at
)
;
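And since run once per record also works well for single-step maintenance queries (the table creates, drops, and stage clearing mentioned in the pros), a step like this could follow the merge flow to clear out the staging table once the data has been deduped into production. This is just a sketch reusing the same settings placeholders:
truncate table if exists {{{settings.flowGrouping.stagingDatabase}}}.{{{settings.flowGrouping.stagingSchema}}}.{{{settings.flowGrouping.stagingTable}}}
;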