TPCH Query Example
In this section, we will demonstrate how Accio can significantly enhance your scenario by utilizing TPCH as an example. You will witness the practical application of Accio in improving your data management and analysis processes.
Accio MDL
Firstly, we should define Accio MDL for TPCH schema.
Q3 - Shipping Priority Query
This query retrieves the 10 unshipped orders with the highest value.
Business Question
The Shipping Priority Query retrieves the shipping priority and potential revenue, defined as the sum of l_extendedprice * (1-l_discount)
, of the orders having the largest revenue among those that had not been shipped as of a given date. Orders are listed in decreasing order of revenue. If more than 10 unshipped orders exist, only the 10 orders with the largest revenue are listed.
How to apply Accio
In this case, we want to know the top 10 high priority orders. revenue
is the sorting standard. So we can start from Orders
table.
SELECT * FROM Orders
The formula of revenue
is sum of l_extendedprice * (1-l_discount)
. We can get the associated Lineitem
object using linitems
field in Orders
. We know the relationship of Orders
to Lineitem
is a ONE_TO_MANY
relationship. So we can apply transform
, filter
and array_sum
function to get the revenue
of each order
.
filter
is used to get the data after the specific shipping date. In this case, the given shipping date is 1995-03-15
.
transform
is used to apply the formula of revenue
to get the income of each order.
array_sum
can sum the income of all lineitem
to get the final revenue.
The full SQL is:
SELECT
orderkey,
array_sum(
transform(
filter(
lineitems,
l -> l.shipdate > DATE '1995-03-15'),
l -> l.extendedprice * (1 - l.discount))) as revenue
FROM
Orders
Now that the most important question has been resolved, we can enhance the completeness of this result by incorporating additional filter conditions, desired item selection, and sorting expressions.
SELECT
orderkey,
array_sum(
transform(
filter(
lineitems,
l -> l.shipdate > DATE '1995-03-15'),
l -> l.extendedprice * (1 - l.discount))) as revenue,
orderdate,
shippriority
FROM Orders
WHERE
customer.mktsegment = 'BUILDING'
AND orderdate < DATE '1995-03-15'
ORDER BY revenue DESC, orderdate
LIMIT 10
Q7 - Volume Shipping Query
This query determines the value of goods shipped between certain nations to help in the re-negotiation of shipping contracts.
Business Question
The Volume Shipping Query finds, for two given nations, the gross discounted revenues derived from lineitems in which parts were shipped from a supplier in either nation to a customer in the other nation during 1995 and 1996. The query lists the supplier nation, the customer nation, the year, and the revenue from shipments that took place in that year. The query orders the answer by Supplier nation, Customer nation, and year (all ascending).
How to Apply Accio
Let's approach this question by starting with the Lineitem
table. We want to determine the discounted revenues generated from shipping parts between FRANCE
and GERMANY
. Here's how we can solve it:
Filter the specific line items: We can use the
partsupp.supplier.nation.name
field to identify the nation name of the supplier and theorder_field.customer.nation.name
field to identify the nation name of the customer.SELECT *
FROM Lineitem
WHERE
((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))Calculate the revenue: We can obtain the revenue by performing a simple aggregation on the line items.
SELECT
sum(extendedprice * (1 - discount)) AS volume
FROM
Lineitem
WHERE
((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))Grouping and sorting: Add the necessary grouping keys (
supplier_nation
,customer_nation
, andshipment_year
), filter conditions (e.g., for the years 1995 and 1996), and sorting keys (supplier_nation
,customer_nation
, andshipment_year
) to organize the results.SELECT
partsupp.supplier.nation.name AS supp_nation,
order_field.customer.nation.name AS cust_nation,
date_trunc('YEAR', shipdate) AS l_year,
sum(extendedprice * (1 - discount)) AS volume
FROM
Lineitem
WHERE
((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
AND shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
By utilizing the relation fields, performing aggregations, and applying filters and sorting, we can directly access the required data and calculate the necessary results.
Accio enhances the readability of SQL
Let’s see the original TPCH query. This is an intriguing case involving two tables, namely n1
and n2
, which are referred to as the nation
table. Based on the join condition, it becomes apparent that n1
represents the nation of the supplier, while n2
represents the nation of the customer. In traditional SQL, this relationship is not explicitly evident in the select item. To fully comprehend the relationship, one must carefully examine the join condition, table relationships, and column connections.
However, by utilizing Accio, we can introduce semantic clarity. Instead of using n1.name
, we can use partsupp.supplier.nation.name
, which directly indicates that it refers to the nation of the supplier. This makes it more accessible for anyone to understand the intended meaning.
SELECT
supp_nation,
cust_nation,
l_year,
SUM(volume) AS revenue
FROM
(
SELECT
n1.n_name AS supp_nation,
n2.n_name AS cust_nation,
YEAR(l_shipdate) AS l_year,
l_extendedprice * (1 - l_discount) AS volume
FROM
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
WHERE
s_suppkey = l_suppkey
AND o_orderkey = l_orderkey
AND c_custkey = o_custkey
AND s_nationkey = n1.n_nationkey
AND c_nationkey = n2.n_nationkey
AND (
(n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
)
AND l_shipdate BETWEEN MDY(1,1,1995) AND MDY(12,31,1996)
) AS shipping
GROUP BY
supp_nation,
cust_nation,
l_year
ORDER BY
supp_nation,
cust_nation,
l_year
SELECT
supp_nation,
cust_nation,
l_year,
SUM(volume) AS revenue
FROM
(
SELECT
partsupp.supplier.nation.name AS supp_nation,
order_field.customer.nation.name AS cust_nation,
date_trunc('YEAR', shipdate) AS l_year,
extendedprice * (1 - discount) AS volume
FROM
Lineitem
WHERE
(
(partsupp.supplier.nation.name = 'FRANCE'AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
AND shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
) AS shipping
GROUP BY
supp_nation,
cust_nation,
l_year
ORDER BY
supp_nation,
cust_nation,
l_year
Q15 - Top Supplier Query
This query determines the top supplier so it can be rewarded, given more business, or identified for special recognition.
Business Question
The Top Supplier Query finds the supplier who contributed the most to the overall revenue for parts shipped during a given quarter of a given year. In case of a tie, the query lists all suppliers whose contribution was equal to the maximum, presented in supplier number order.
How to apply Accio
In this particular scenario, our objective is to identify the top supplier who has made the most significant contribution to the total revenue generated from parts shipped during a specific quarter of a given year. To accomplish this, we can begin our query from the Supplier
table.
SELECT * FROM Supplier
Next, we aim to determine the revenue of each supplier. To obtain this information, we can employ filter
, transform
and array_sum
.
SELECT
array_sum(
transform(
filter(Supplier.lineitems,
item -> item.shipdate >= date '1995-04-01' AND item.shipdate < date '1995-04-01' + interval '3' month),
item -> item.extendedprice * (1 - item.discount))) as revenue
FROM
Supplier
To determine the top supplier, we will sort all the data in descending order based on the revenues and select only the top result. This single supplier represents the final answer to our question.
SELECT
suppkey,
name,
address,
phone,
array_sum(
transform(
filter(Supplier.lineitems, item -> item.shipdate >= date '1995-04-01' AND item.shipdate < date '1995-04-01' + interval '3' month),
item -> item.extendedprice * (1 - item.discount))) as revenue
FROM Supplier
ORDER BY revenue DESC
LIMIT 1;
Reusable Expression Revenue
for TPCH dataset
In the TPCH query, we can observe that the concept of revenue is utilized in multiple scenarios. Revenue is a commonly used metric, which can be calculated as l_extendedprice * (1 - l_discount)
. This formula is applied in all three cases mentioned earlier to calculate the income for each lineitem
.
In Expression part, we introduced a feature of Accio, expression
. Using expression
, we can define a column which is a reusable formula, then user can access it without compose the formula by themself.
Given a Lineitem
model definition,
Model Lineitem @sql('select * from lineitem') {
orderkey: INTEGER
linenumber: INTEGER
extendedprice: REAL
discount: REAL
// others columns of Lineitem
...
revenue: REAL @expr(extendedprice * (1 - discount))
}
We can modify Q15 to
SELECT
suppkey,
name,
address,
phone,
array_sum(
transform(
filter(Supplier.lineitems, item -> item.shipdate >= date '1995-04-01' AND item.shipdate < date '1995-04-01' + interval '3' month),
item -> item.revenue)) as revenue
FROM Supplier
ORDER BY revenue DESC
LIMIT 1;
Q7 can be modified to
SELECT
partsupp.supplier.nation.name AS supp_nation,
order_field.customer.nation.name AS cust_nation,
date_trunc('YEAR', shipdate) AS l_year,
sum(revenue) AS volume
FROM
Lineitem
WHERE
((partsupp.supplier.nation.name = 'FRANCE' AND order_field.customer.nation.name = 'GERMANY')
OR (partsupp.supplier.nation.name = 'GERMANY' AND order_field.customer.nation.name = 'FRANCE'))
AND shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
The beauty of reusing the same concept in different scenarios lies in its ability to ensure that people do not make mistakes when dealing with a common metric. By employing this consistent approach, there is a reduced risk of confusion or error when working with the metric across various contexts.