Function Folding in #PowerQuery

Looking at a typical Power Query query you will noticed that it's made up of a number of small steps. As an example take a look at the query I did in my previous post about joining a fact table to a slowly changing dimension. It was roughly built up of the following steps:

  1. Get all records from the fact table
  2. Get all records from the dimension table
  3. do an outer join between these two tables on the business key (resulting in an increase in the row count as there are multiple records in the dimension table for each business key)
  4. Filter out the excess rows introduced in step 3
  5. remove extra columns that are not required in the final result set.

If Power Query was to execute a query like this literally, following the same steps in the same order it would not be overly efficient. Particularly if your two source tables were quite large. However Power Query has a feature called function folding where it can take a number of these small steps and push them down to the data source. The degree of function folding that can be performed depends on the data source, As you might expect, relational data sources like SQL Server, Oracle and Teradata support folding, but so do some of the other sources like OData, Exchange and Active Directory.

To explore how this works I took the data from my previous post and loaded it into a SQL database. Then I converted my Power Query expression to source it's data from that database. Below is the resulting Power Query which I edited by hand so that the whole thing can be shown in a single expression:

    SqlSource = Sql.Database("localhost", "PowerQueryTest"),
    BU = SqlSource{[Schema="dbo",Item="BU"]}[Data],
    Fact = SqlSource{[Schema="dbo",Item="fact"]}[Data],
    Source = Table.NestedJoin(Fact,{"BU_Code"},BU,{"BU_Code"},"NewColumn"),
    LeftJoin = Table.ExpandTableColumn(Source, "NewColumn"
                                  , {"BU_Key", "StartDate", "EndDate"}
                                  , {"BU_Key", "StartDate", "EndDate"}),
    BetweenFilter = Table.SelectRows(LeftJoin, each (([Date] >= [StartDate]) and ([Date] <= [EndDate])) ),
    RemovedColumns = Table.RemoveColumns(BetweenFilter,{"StartDate", "EndDate"})

If the above query was run step by step in a literal fashion you would expect it to run two queries against the SQL database doing "SELECT * …" from both tables. However a profiler trace shows just the following single SQL query:

select [_].[BU_Code],
    select [$Outer].[BU_Code],
    from [dbo].[fact] as [$Outer]
    left outer join
        select [_].[BU_Key] as [BU_Key],
            [_].[BU_Code] as [BU_Code2],
            [_].[BU_Name] as [BU_Name],
            [_].[StartDate] as [StartDate],
            [_].[EndDate] as [EndDate]
        from [dbo].[BU] as [_]
    ) as [$Inner] on ([$Outer].[BU_Code] = [$Inner].[BU_Code2] or [$Outer].[BU_Code] is null and [$Inner].[BU_Code2] is null)
) as [_]
where [_].[Date] >= [_].[StartDate] and [_].[Date] <= [_].[EndDate]

The resulting query is a little strange, you can probably tell that it was generated programmatically. But if you look closely you'll notice that every single part of the Power Query formula has been pushed down to SQL Server. Power Query itself ends up just constructing the query and passing the results back to Excel, it does not do any of the data transformation steps itself.

So now you can feel a bit more comfortable showing Power Query to your less technical Colleagues knowing that the tool will do it's best fold all the  small steps in Power Query down the most efficient query that it can against the source systems.

Print | posted on Friday, May 16, 2014 7:40 AM