Skip to main content

TPCH Query Example

In this section, we will demonstrate how Accio can significantly enhance your scenario by utilizing TPCH as an example. You will witness the practical application of Accio in improving your data management and analysis processes.

Accio MDL

Firstly, we should define Accio MDL for TPCH schema.

tpch-schema

Q3 - Shipping Priority Query

This query retrieves the 10 unshipped orders with the highest value.

Business Question

The Shipping Priority Query retrieves the shipping priority and potential revenue, defined as the sum of l_extendedprice * (1-l_discount), of the orders having the largest revenue among those that had not been shipped as of a given date. Orders are listed in decreasing order of revenue. If more than 10 unshipped orders exist, only the 10 orders with the largest revenue are listed.

How to apply Accio

In this case, we want to know the top 10 high priority orders. revenue is the sorting standard. So we can start from Orders table.

SELECT * FROM Orders

The formula of revenue is sum of l_extendedprice * (1-l_discount). We can get the associated Lineitem object using linitems field in Orders. We know the relationship of Orders to Lineitem is a ONE_TO_MANY relationship. So we can apply transform, filter and array_sum function to get the revenue of each order.

filter is used to get the data after the specific shipping date. In this case, the given shipping date is 1995-03-15.

transform is used to apply the formula of revenue to get the income of each order.

array_sum can sum the income of all lineitem to get the final revenue.

The full SQL is:

SELECT
orderkey,
array_sum(
transform(
filter(
lineitems,
l -> l.shipdate > DATE '1995-03-15'),
l -> l.extendedprice * (1 - l.discount))) as revenue
FROM
Orders

Now that the most important question has been resolved, we can enhance the completeness of this result by incorporating additional filter conditions, desired item selection, and sorting expressions.

SELECT
orderkey,
array_sum(
transform(
filter(
lineitems,
l -> l.shipdate > DATE '1995-03-15'),
l -> l.extendedprice * (1 - l.discount))) as revenue,
orderdate,
shippriority
FROM Orders
WHERE
customer.mktsegment = 'BUILDING'
AND orderdate < DATE '1995-03-15'
ORDER BY revenue DESC, orderdate
LIMIT 10

Q7 - Volume Shipping Query

This query determines the value of goods shipped between certain nations to help in the re-negotiation of shipping contracts.

Business Question

The Volume Shipping Query finds, for two given nations, the gross discounted revenues derived from lineitems in which parts were shipped from a supplier in either nation to a customer in the other nation during 1995 and 1996. The query lists the supplier nation, the customer nation, the year, and the revenue from shipments that took place in that year. The query orders the answer by Supplier nation, Customer nation, and year (all ascending).

How to Apply Accio

Let's approach this question by starting with the Lineitem table. We want to determine the discounted revenues generated from shipping parts between FRANCE and GERMANY. Here's how we can solve it:

  1. Filter the specific line items: We can use the partsupp.supplier.nation.name field to identify the nation name of the supplier and the order_field.customer.nation.name field to identify the nation name of the customer.

    SELECT *
    FROM Lineitem
    WHERE
    ((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
    OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
  2. Calculate the revenue: We can obtain the revenue by performing a simple aggregation on the line items.

    SELECT
    sum(extendedprice * (1 - discount)) AS volume
    FROM
    Lineitem
    WHERE
    ((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
    OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
  3. Grouping and sorting: Add the necessary grouping keys (supplier_nation, customer_nation, and shipment_year), filter conditions (e.g., for the years 1995 and 1996), and sorting keys (supplier_nation, customer_nation, and shipment_year) to organize the results.

    SELECT
    partsupp.supplier.nation.name AS supp_nation,
    order_field.customer.nation.name AS cust_nation,
    date_trunc('YEAR', shipdate) AS l_year,
    sum(extendedprice * (1 - discount)) AS volume
    FROM
    Lineitem
    WHERE
    ((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
    OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
    AND shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
    GROUP BY 1, 2, 3
    ORDER BY 1, 2, 3

By utilizing the relation fields, performing aggregations, and applying filters and sorting, we can directly access the required data and calculate the necessary results.

Accio enhances the readability of SQL

Let’s see the original TPCH query. This is an intriguing case involving two tables, namely n1 and n2, which are referred to as the nation table. Based on the join condition, it becomes apparent that n1 represents the nation of the supplier, while n2 represents the nation of the customer. In traditional SQL, this relationship is not explicitly evident in the select item. To fully comprehend the relationship, one must carefully examine the join condition, table relationships, and column connections.

However, by utilizing Accio, we can introduce semantic clarity. Instead of using n1.name, we can use partsupp.supplier.nation.name, which directly indicates that it refers to the nation of the supplier. This makes it more accessible for anyone to understand the intended meaning.

SELECT
supp_nation,
cust_nation,
l_year,
SUM(volume) AS revenue
FROM
(
SELECT
n1.n_name AS supp_nation,
n2.n_name AS cust_nation,
YEAR(l_shipdate) AS l_year,
l_extendedprice * (1 - l_discount) AS volume
FROM
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
WHERE
s_suppkey = l_suppkey
AND o_orderkey = l_orderkey
AND c_custkey = o_custkey
AND s_nationkey = n1.n_nationkey
AND c_nationkey = n2.n_nationkey
AND (
(n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
)
AND l_shipdate BETWEEN MDY(1,1,1995) AND MDY(12,31,1996)
) AS shipping
GROUP BY
supp_nation,
cust_nation,
l_year
ORDER BY
supp_nation,
cust_nation,
l_year
SELECT
supp_nation,
cust_nation,
l_year,
SUM(volume) AS revenue
FROM
(
SELECT
partsupp.supplier.nation.name AS supp_nation,
order_field.customer.nation.name AS cust_nation,
date_trunc('YEAR', shipdate) AS l_year,
extendedprice * (1 - discount) AS volume
FROM
Lineitem
WHERE
(
(partsupp.supplier.nation.name = 'FRANCE'AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
AND shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
) AS shipping
GROUP BY
supp_nation,
cust_nation,
l_year
ORDER BY
supp_nation,
cust_nation,
l_year

Q15 - Top Supplier Query

This query determines the top supplier so it can be rewarded, given more business, or identified for special recognition.

Business Question

The Top Supplier Query finds the supplier who contributed the most to the overall revenue for parts shipped during a given quarter of a given year. In case of a tie, the query lists all suppliers whose contribution was equal to the maximum, presented in supplier number order.

How to apply Accio

In this particular scenario, our objective is to identify the top supplier who has made the most significant contribution to the total revenue generated from parts shipped during a specific quarter of a given year. To accomplish this, we can begin our query from the Supplier table.

SELECT * FROM Supplier

Next, we aim to determine the revenue of each supplier. To obtain this information, we can employ filter, transform and array_sum .

SELECT
array_sum(
transform(
filter(Supplier.lineitems,
item -> item.shipdate >= date '1995-04-01' AND item.shipdate < date '1995-04-01' + interval '3' month),
item -> item.extendedprice * (1 - item.discount))) as revenue
FROM
Supplier

To determine the top supplier, we will sort all the data in descending order based on the revenues and select only the top result. This single supplier represents the final answer to our question.

SELECT 
suppkey,
name,
address,
phone,
array_sum(
transform(
filter(Supplier.lineitems, item -> item.shipdate >= date '1995-04-01' AND item.shipdate < date '1995-04-01' + interval '3' month),
item -> item.extendedprice * (1 - item.discount))) as revenue
FROM Supplier
ORDER BY revenue DESC
LIMIT 1;

Reusable Expression Revenue for TPCH dataset

In the TPCH query, we can observe that the concept of revenue is utilized in multiple scenarios. Revenue is a commonly used metric, which can be calculated as l_extendedprice * (1 - l_discount). This formula is applied in all three cases mentioned earlier to calculate the income for each lineitem.

In Expression part, we introduced a feature of Accio, expression. Using expression, we can define a column which is a reusable formula, then user can access it without compose the formula by themself.

Given a Lineitem model definition,

Model Lineitem @sql('select * from lineitem') {
orderkey: INTEGER
linenumber: INTEGER
extendedprice: REAL
discount: REAL

// others columns of Lineitem
...

revenue: REAL @expr(extendedprice * (1 - discount))
}

We can modify Q15 to

SELECT 
suppkey,
name,
address,
phone,
array_sum(
transform(
filter(Supplier.lineitems, item -> item.shipdate >= date '1995-04-01' AND item.shipdate < date '1995-04-01' + interval '3' month),
item -> item.revenue)) as revenue
FROM Supplier
ORDER BY revenue DESC
LIMIT 1;

Q7 can be modified to

SELECT
partsupp.supplier.nation.name AS supp_nation,
order_field.customer.nation.name AS cust_nation,
date_trunc('YEAR', shipdate) AS l_year,
sum(revenue) AS volume
FROM
Lineitem
WHERE
((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
AND shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3

The beauty of reusing the same concept in different scenarios lies in its ability to ensure that people do not make mistakes when dealing with a common metric. By employing this consistent approach, there is a reduced risk of confusion or error when working with the metric across various contexts.