Null values: RFC on a different method for paging of large result set, please comment

Summary : I think of two methods for the same need. I expose them here and ask the forum for hints and comments.

Here is the context :

SQL Server 2005 database called from a C# program. Based on user interaction the program has to issue a query that returns a variable number of records. Sometimes the number is small, sometimes it may be bigger (thousands) and will eventually become even bigger (hundreds of thousands). The query may have several JOIN and ORDER BY (no GROUP BY). The columns to retrieve depend on user input. There may be a lot of columns to fetch. The queried tables are append-only and do not change often. The C# program probably runs on the same machine as the database but the user GUI is on a remote computer.

All the record data is never needed at once. Instead, only some slices (pages) of the whole result set are needed, on demand. They are displayed in a virtual grid in a Windows Forms app.

So far this looks relatively common. Searching on SQL Server docs, on the web and on the forum shows one recommended way, which is basically explained on http://technet.microsoft.com/fr-fr/library/ms186734(SQL.90).aspx : make a first SELECT that includes a computed column based on ROW_NUMBER() and a certain ORDER BY. That SELECT is wrapped in a second select which restricts the result set to a certain range of rows.

Although this method is better than some previous before ROW_NUMBER() was available, it looks like it makes the SQL Server perform the same query again and again for each slice/page needed.

I see another method and wonder if anyone has hints or advice about it.

The method is :

Perform the whole query once (no limit on returned rows) but only request the ID of the records, not any other column. Fetch all those IDs from the database and keep them at C# level. When any page is needed, the C# level only has to look in the fetched table to find the IDs actually needed. It request a set of objects by their IDs, like this :
SELECT some, columns IN my_table WHERE ID IN (someId1, someId2 etc...)
(a real example may still have some JOIN, the order is more or less irrelevant because the client know the order by IDs). So only one SELECT does all the hard work. All subsequent SELECTs (one per page) only fetch a set of records using simple queries (still with joins).

Let's call that second method the "list of IDs" method.

The supposed differences are :

on one hand, list of IDs seems better because the subsequent queries are simpler and only involve few objects, making them much lighter and faster. On the contrary, ROW_NUMBER() method reperforms the whole initial query each time a page is needed (only fetching a different limited range of rows). It seems that in that case the database engine has to basically redo the same tough work (considering the whole thing, perform the order, generate the row_number column), right ? on the other hand, list of IDs method fetches all the IDs even when very little of them are needed. ROW_NUMBER() method only sends from database to client the smallest needed data. Can this be an argument in favor of ROW_NUMBER() method ?

So I'm wondering why I could not find anything mentioning this "list of ids" method or anything similar neither in the documentation, nor on the web.

Thank you for your insightful comments.

The first questions I have are about returning "hundreds of thousands" of ID_LIST records:

What are the bandwidth standards for your network? What is the bandwidth that will be required for you to return "hundreds of thousands" of ID_LIST records? How frequently might the network need to service such requests? How sensitive do you need to be to changes in the list?|||

Here are the clarifications you asked for :

What is the bandwidth that will be required for you to return "hundreds of thousands" of ID_LIST records?

For a start we can assume that the database client lives on the same machine as the SQL Server.

The "hundreds of thousands" are only IDs, not any ID_LIST record. I mean : the ID column of the involved database table (which has other columns and joins to other tables).

So, that big one-column-of-IDs result set will only travel from the SQL Server 2005 process to the .NET process on the same machine.

For completeness, that C# program (the client of the SQL Server process) is in turn a remoting server to which a C# GUI client is connected, through a, say, 128kbps to 1Mbps link. As far as the latter user process is concerned, the situation is the same in the two methods because exactly the same data will flow through the slow link. The difference between the two methods only change the interaction between the SQL Server 2005 process and its direct C# client which we can assume to live on the same machine.

I guess this answers your questions about bandwidth. As for the frequency, well, several remote C# clients may connect simultaneously to the server machine, all to the same C# process. That only server C# process is the only client connected to the involved dabatase. Requests are triggered by human interaction so they are not numerous but they should quickly serve the first page and any other requested page (by scrolling the virtual grid). Often there won't be many pages requested, but if they are they have to be served quickly.

How sensitive do you need to be to changes in the list?

As written above "The queried tables are append-only and do not change often.". Not often means here a few times a day. Moreover I can be notified of a change by other means and invalidate the display so that it refreshes correctly.

Thank you for your attention.

|||I have used both methods you describe in different situations. There are pros and cons to each. The simple answer is whatever works for your situation.

Yes, the ROW_NUMBER method does the same query many times. However, in most situations, the data is cached, so the performance is very fast and it should be only searching the indexes. This method also gives instant results of changes. If someone adds a record on page 10, now your query shows it on page 10 and shifts everything down.

The other method, involves either running multiple individual queries, or parsing a string of 50 IDs to get at once or dynamic SQL. You have basically cached the query results. So you need a method to rerun the query and update the list, so when someone adds a record you get it. Also, you need to handle if someone deletes one of your keys in the database. This works best for short lists which do not change often.

Most applications do not want "cached" results. Which is why you don't see this often. You will get calls, "Joe just added a record and I don't see it".

Either method works fine.

|||

Thank you for your answer. So, there is apparently no hidden gotcha about the "list of ids" method. Fine !

In our application a notification mechanism already ensures that when Joe adds a record all clients immediately get an updated display, so both methods are acceptable in our case without further complication.

Those who will read that thread may find interesting that link that I found on another thread in those forums : Arrays and Lists in SQL Server 2005 . It enumerates various techniques on how to pass a list of values from client to SQL Server.

Wednesday, March 21, 2012

RFC on a different method for paging of large result set, please comment

No comments:

Post a Comment

Null values

Blog Archive

About Me