- edits to the subqueryload ordering merge

2026-06-01 21:38:55 -04:00 · 2014-11-05 15:11:13 -05:00
parent 9f592d6481
commit 8200c2cd35
3 changed files with 80 additions and 50 deletions
@@ -603,62 +603,78 @@ The same idea applies to all the other arguments, such as ``foreign_keys``::

        foo = relationship(Dest, foreign_keys=[foo_id, bar_id])

-.. _faq_subqueryload_sort:
+.. _faq_subqueryload_limit_sort:

-Why must I always ``ORDER BY`` a unique column when using ``subqueryload``?
----------------------------------------------------------------------------
+Why is ``ORDER BY`` required with ``LIMIT`` (especially with ``subqueryload()``)?
+---------------------------------------------------------------------------------

-The SQL standard prescribes that RDBMSs are free to return rows in any order it
-deems appropriate, if no ``ORDER BY`` clause is specified. This even extends to
-the case where the ``ORDER BY`` clause is not unique across all rows, i.e. rows
-with the same value in the ``ORDER BY`` column(s) will not necessarily be
-returned in a deterministic order.
+A relational database can return rows in any
+arbitrary order, when an explicit ordering is not set.
+While this ordering very often corresponds to the natural
+order of rows within a table, this is not the case for all databases and
+all queries.   The consequence of this is that any query that limits rows
+using ``LIMIT`` or ``OFFSET`` should **always** specify an ``ORDER BY``.
+Otherwise, it is not deterministic which rows will actually be returned.

-SQLAlchemy implements :func:`.orm.subqueryload` by issuing a separate query
-(where the table specified in the relationship is joined to the original query)
-and then attempting to match up the results in Python. This works fine
-normally:
+When we use a SQLAlchemy method like :meth:`.Query.first`, we are in fact
+applying a ``LIMIT`` of one to the query, so without an explicit ordering
+it is not deterministic what row we actually get back.
+While we may not notice this for simple queries on databases that usually
+returns rows in their natural
+order, it becomes much more of an issue if we also use :func:`.orm.subqueryload`
+to load related collections, and we may not be loading the collections
+as intended.
+
+SQLAlchemy implements :func:`.orm.subqueryload` by issuing a separate query,
+the results of which are matched up to the results from the first query.
+We see two queries emitted like this:

 .. sourcecode:: python+sql

    >>> session.query(User).options(subqueryload(User.addresses)).all()
-    {opensql}# the "main" query
+    {opensql}-- the "main" query
    SELECT users.id AS users_id
    FROM users
    {stop}
-    {opensql}# the "load" query issued by subqueryload
-    SELECT addresses.id AS addresses_id, addresses.user_id AS addresses_user_id, anon_1.users_id AS anon_1_users_id
-    FROM (SELECT users.id AS users_id
-    FROM users) AS anon_1 JOIN addresses ON anon_1.users_id = addresses.user_id ORDER BY anon_1.users_id
+    {opensql}-- the "load" query issued by subqueryload
+    SELECT addresses.id AS addresses_id,
+           addresses.user_id AS addresses_user_id,
+           anon_1.users_id AS anon_1_users_id
+    FROM (SELECT users.id AS users_id FROM users) AS anon_1
+    JOIN addresses ON anon_1.users_id = addresses.user_id
+    ORDER BY anon_1.users_id

-Notice how the main query is a subquery in the load query. When an
-``OFFSET``/``LIMIT`` is involved, however, things get a bit tricky:
+The second query embeds the first query as a source of rows.
+When the inner query uses ``OFFSET`` and/or ``LIMIT`` without ordering,
+the two queries may not see the same results:

 .. sourcecode:: python+sql

    >>> user = session.query(User).options(subqueryload(User.addresses)).first()
-    {opensql}# the "main" query
+    {opensql}-- the "main" query
    SELECT users.id AS users_id
    FROM users
     LIMIT 1
    {stop}
-    {opensql}# the "load" query issued by subqueryload
-    SELECT addresses.id AS addresses_id, addresses.user_id AS addresses_user_id, anon_1.users_id AS anon_1_users_id
-    FROM (SELECT users.id AS users_id
-    FROM users
-     LIMIT 1) AS anon_1 JOIN addresses ON anon_1.users_id = addresses.user_id ORDER BY anon_1.users_id
+    {opensql}-- the "load" query issued by subqueryload
+    SELECT addresses.id AS addresses_id,
+           addresses.user_id AS addresses_user_id,
+           anon_1.users_id AS anon_1_users_id
+    FROM (SELECT users.id AS users_id FROM users LIMIT 1) AS anon_1
+    JOIN addresses ON anon_1.users_id = addresses.user_id
+    ORDER BY anon_1.users_id

-The main query is still a subquery in the load query, but *it may return a
-different set of results in the second query from the first* because it does
-not have a deterministic sort order! Depending on database internals, there is
-a chance we may get the following resultset for the two queries::
+Depending on database specifics, there is
+a chance we may get the a result like the following for the two queries::

+    -- query #1
    +--------+
    |users_id|
    +--------+
    |       1|
    +--------+

+    -- query #2
    +------------+-----------------+---------------+
    |addresses_id|addresses_user_id|anon_1_users_id|
    +------------+-----------------+---------------+
@@ -667,26 +683,28 @@ a chance we may get the following resultset for the two queries::
    |           4|                2|              2|
    +------------+-----------------+---------------+

-From SQLAlchemy's point of view, it didn't get any addresses back for user 1,
-so ``user.addresses`` is empty. Oops.
+Above, we receive two ``addresses`` rows for ``user.id`` of 2, and none for
+1.  We've wasted two rows and failed to actually load the collection.  This
+is an insidious error because without looking at the SQL and the results, the
+ORM will not show that there's any issue; if we access the ``addresses``
+for the ``User`` we have, it will emit a lazy load for the collection and we
+won't see that anything actually went wrong.

 The solution to this problem is to always specify a deterministic sort order,
 so that the main query always returns the same set of rows. This generally
-means that you should :meth:`.Query.order_by` on a unique column on the table,
-usually the primary key::
+means that you should :meth:`.Query.order_by` on a unique column on the table.
+The primary key is a good choice for this::

    session.query(User).options(subqueryload(User.addresses)).order_by(User.id).first()

-You can get away with not doing a sort if the ``OFFSET``/``LIMIT`` does not
-throw away any rows at all, but it's much simpler to remember to always ``ORDER
-BY`` the primary key::
-
-    session.query(User).options(subqueryload(User.addresses)).filter(User.id == 1).first()
-
 Note that :func:`.joinedload` does not suffer from the same problem because
 only one query is ever issued, so the load query cannot be different from the
 main query.

+.. seealso::
+
+    :ref:`subqueryload_ordering`
+
 Performance
 ===========

@@ -120,21 +120,32 @@ query options:
    # set children to load eagerly with a second statement
    session.query(Parent).options(subqueryload('children')).all()

-.. _subquery_loading_tips:
+.. _subqueryload_ordering:

-Subquery Loading Tips
-^^^^^^^^^^^^^^^^^^^^^
+The Importance of Ordering
+--------------------------

-If you have ``LIMIT`` or ``OFFSET`` in your query, you **must** ``ORDER BY`` a
-unique column, generally the primary key of your table, in order to ensure
-correct results (see :ref:`faq_subqueryload_sort`)::
+A query which makes use of :func:`.subqueryload` in conjunction with a
+limiting modifier such as :meth:`.Query.first`, :meth:`.Query.limit`,
+or :meth:`.Query.offset` should **always** include :meth:`.Query.order_by`
+against unique column(s) such as the primary key, so that the additional queries
+emitted by :func:`.subqueryload` include
+the same ordering as used by the parent query.  Without it, there is a chance
+that the inner query could return the wrong rows::

-    # incorrect
+    # incorrect, no ORDER BY
+    session.query(User).options(subqueryload(User.addresses)).first()
+
+    # incorrect if User.name is not unique
    session.query(User).options(subqueryload(User.addresses)).order_by(User.name).first()

    # correct
    session.query(User).options(subqueryload(User.addresses)).order_by(User.name, User.id).first()

+.. seealso::
+
+    :ref:`faq_subqueryload_limit_sort` - detailed example
+
 Loading Along Paths
 -------------------

@@ -1631,11 +1631,12 @@ very easy to use:
    >>> jack.addresses
    [<Address(email_address='jack@google.com')>, <Address(email_address='j25@yahoo.com')>]

-.. warning::
+.. note::

-   If you use :func:`.subqueryload`, you should generally
-   :meth:`.Query.order_by` on a unique column in order to ensure correct
-   results. See :ref:`subquery_loading_tips`.
+   :func:`.subqueryload` when used in conjunction with limiting such as
+   :meth:`.Query.first`, :meth:`.Query.limit` or :meth:`.Query.offset`
+   should also include :meth:`.Query.order_by` on a unique column in order to
+   ensure correct results.  See :ref:`subqueryload_ordering`.

 Joined Load
 -------------