- edits to the subqueryload ordering merge

This commit is contained in:
Mike Bayer
2014-11-05 15:11:13 -05:00
parent 9f592d6481
commit 8200c2cd35
3 changed files with 80 additions and 50 deletions
+57 -39
View File
@@ -603,62 +603,78 @@ The same idea applies to all the other arguments, such as ``foreign_keys``::
foo = relationship(Dest, foreign_keys=[foo_id, bar_id])
.. _faq_subqueryload_sort:
.. _faq_subqueryload_limit_sort:
Why must I always ``ORDER BY`` a unique column when using ``subqueryload``?
----------------------------------------------------------------------------
Why is ``ORDER BY`` required with ``LIMIT`` (especially with ``subqueryload()``)?
---------------------------------------------------------------------------------
The SQL standard prescribes that RDBMSs are free to return rows in any order it
deems appropriate, if no ``ORDER BY`` clause is specified. This even extends to
the case where the ``ORDER BY`` clause is not unique across all rows, i.e. rows
with the same value in the ``ORDER BY`` column(s) will not necessarily be
returned in a deterministic order.
A relational database can return rows in any
arbitrary order, when an explicit ordering is not set.
While this ordering very often corresponds to the natural
order of rows within a table, this is not the case for all databases and
all queries. The consequence of this is that any query that limits rows
using ``LIMIT`` or ``OFFSET`` should **always** specify an ``ORDER BY``.
Otherwise, it is not deterministic which rows will actually be returned.
SQLAlchemy implements :func:`.orm.subqueryload` by issuing a separate query
(where the table specified in the relationship is joined to the original query)
and then attempting to match up the results in Python. This works fine
normally:
When we use a SQLAlchemy method like :meth:`.Query.first`, we are in fact
applying a ``LIMIT`` of one to the query, so without an explicit ordering
it is not deterministic what row we actually get back.
While we may not notice this for simple queries on databases that usually
returns rows in their natural
order, it becomes much more of an issue if we also use :func:`.orm.subqueryload`
to load related collections, and we may not be loading the collections
as intended.
SQLAlchemy implements :func:`.orm.subqueryload` by issuing a separate query,
the results of which are matched up to the results from the first query.
We see two queries emitted like this:
.. sourcecode:: python+sql
>>> session.query(User).options(subqueryload(User.addresses)).all()
{opensql}# the "main" query
{opensql}-- the "main" query
SELECT users.id AS users_id
FROM users
{stop}
{opensql}# the "load" query issued by subqueryload
SELECT addresses.id AS addresses_id, addresses.user_id AS addresses_user_id, anon_1.users_id AS anon_1_users_id
FROM (SELECT users.id AS users_id
FROM users) AS anon_1 JOIN addresses ON anon_1.users_id = addresses.user_id ORDER BY anon_1.users_id
{opensql}-- the "load" query issued by subqueryload
SELECT addresses.id AS addresses_id,
addresses.user_id AS addresses_user_id,
anon_1.users_id AS anon_1_users_id
FROM (SELECT users.id AS users_id FROM users) AS anon_1
JOIN addresses ON anon_1.users_id = addresses.user_id
ORDER BY anon_1.users_id
Notice how the main query is a subquery in the load query. When an
``OFFSET``/``LIMIT`` is involved, however, things get a bit tricky:
The second query embeds the first query as a source of rows.
When the inner query uses ``OFFSET`` and/or ``LIMIT`` without ordering,
the two queries may not see the same results:
.. sourcecode:: python+sql
>>> user = session.query(User).options(subqueryload(User.addresses)).first()
{opensql}# the "main" query
{opensql}-- the "main" query
SELECT users.id AS users_id
FROM users
LIMIT 1
{stop}
{opensql}# the "load" query issued by subqueryload
SELECT addresses.id AS addresses_id, addresses.user_id AS addresses_user_id, anon_1.users_id AS anon_1_users_id
FROM (SELECT users.id AS users_id
FROM users
LIMIT 1) AS anon_1 JOIN addresses ON anon_1.users_id = addresses.user_id ORDER BY anon_1.users_id
{opensql}-- the "load" query issued by subqueryload
SELECT addresses.id AS addresses_id,
addresses.user_id AS addresses_user_id,
anon_1.users_id AS anon_1_users_id
FROM (SELECT users.id AS users_id FROM users LIMIT 1) AS anon_1
JOIN addresses ON anon_1.users_id = addresses.user_id
ORDER BY anon_1.users_id
The main query is still a subquery in the load query, but *it may return a
different set of results in the second query from the first* because it does
not have a deterministic sort order! Depending on database internals, there is
a chance we may get the following resultset for the two queries::
Depending on database specifics, there is
a chance we may get the a result like the following for the two queries::
-- query #1
+--------+
|users_id|
+--------+
| 1|
+--------+
-- query #2
+------------+-----------------+---------------+
|addresses_id|addresses_user_id|anon_1_users_id|
+------------+-----------------+---------------+
@@ -667,26 +683,28 @@ a chance we may get the following resultset for the two queries::
| 4| 2| 2|
+------------+-----------------+---------------+
From SQLAlchemy's point of view, it didn't get any addresses back for user 1,
so ``user.addresses`` is empty. Oops.
Above, we receive two ``addresses`` rows for ``user.id`` of 2, and none for
1. We've wasted two rows and failed to actually load the collection. This
is an insidious error because without looking at the SQL and the results, the
ORM will not show that there's any issue; if we access the ``addresses``
for the ``User`` we have, it will emit a lazy load for the collection and we
won't see that anything actually went wrong.
The solution to this problem is to always specify a deterministic sort order,
so that the main query always returns the same set of rows. This generally
means that you should :meth:`.Query.order_by` on a unique column on the table,
usually the primary key::
means that you should :meth:`.Query.order_by` on a unique column on the table.
The primary key is a good choice for this::
session.query(User).options(subqueryload(User.addresses)).order_by(User.id).first()
You can get away with not doing a sort if the ``OFFSET``/``LIMIT`` does not
throw away any rows at all, but it's much simpler to remember to always ``ORDER
BY`` the primary key::
session.query(User).options(subqueryload(User.addresses)).filter(User.id == 1).first()
Note that :func:`.joinedload` does not suffer from the same problem because
only one query is ever issued, so the load query cannot be different from the
main query.
.. seealso::
:ref:`subqueryload_ordering`
Performance
===========
+18 -7
View File
@@ -120,21 +120,32 @@ query options:
# set children to load eagerly with a second statement
session.query(Parent).options(subqueryload('children')).all()
.. _subquery_loading_tips:
.. _subqueryload_ordering:
Subquery Loading Tips
^^^^^^^^^^^^^^^^^^^^^
The Importance of Ordering
--------------------------
If you have ``LIMIT`` or ``OFFSET`` in your query, you **must** ``ORDER BY`` a
unique column, generally the primary key of your table, in order to ensure
correct results (see :ref:`faq_subqueryload_sort`)::
A query which makes use of :func:`.subqueryload` in conjunction with a
limiting modifier such as :meth:`.Query.first`, :meth:`.Query.limit`,
or :meth:`.Query.offset` should **always** include :meth:`.Query.order_by`
against unique column(s) such as the primary key, so that the additional queries
emitted by :func:`.subqueryload` include
the same ordering as used by the parent query. Without it, there is a chance
that the inner query could return the wrong rows::
# incorrect
# incorrect, no ORDER BY
session.query(User).options(subqueryload(User.addresses)).first()
# incorrect if User.name is not unique
session.query(User).options(subqueryload(User.addresses)).order_by(User.name).first()
# correct
session.query(User).options(subqueryload(User.addresses)).order_by(User.name, User.id).first()
.. seealso::
:ref:`faq_subqueryload_limit_sort` - detailed example
Loading Along Paths
-------------------
+5 -4
View File
@@ -1631,11 +1631,12 @@ very easy to use:
>>> jack.addresses
[<Address(email_address='jack@google.com')>, <Address(email_address='j25@yahoo.com')>]
.. warning::
.. note::
If you use :func:`.subqueryload`, you should generally
:meth:`.Query.order_by` on a unique column in order to ensure correct
results. See :ref:`subquery_loading_tips`.
:func:`.subqueryload` when used in conjunction with limiting such as
:meth:`.Query.first`, :meth:`.Query.limit` or :meth:`.Query.offset`
should also include :meth:`.Query.order_by` on a unique column in order to
ensure correct results. See :ref:`subqueryload_ordering`.
Joined Load
-------------