New Posts New Posts RSS Feed: GUID Keys and DataBase Performance
  FAQ FAQ  Forum Search   Calendar   Register Register  Login Login

GUID Keys and DataBase Performance

 Post Reply Post Reply
Author
Customer View Drop Down
Senior Member
Senior Member
Avatar
User Submitted Questions to Support

Joined: 30-May-2007
Location: United States
Posts: 260
Post Options Post Options   Quote Customer Quote  Post ReplyReply Direct Link To This Post Topic: GUID Keys and DataBase Performance
    Posted: 06-Jun-2007 at 11:48am

Question:

What are the impacts of Guid keys on database performance? 

 



Edited by IdeaBlade - 06-Jun-2007 at 12:04pm
Back to Top
IdeaBlade View Drop Down
Moderator Group
Moderator Group
Avatar

Joined: 30-May-2007
Location: United States
Posts: 353
Post Options Post Options   Quote IdeaBlade Quote  Post ReplyReply Direct Link To This Post Posted: 06-Jun-2007 at 12:11pm

“Here is my annotated, Universal Time version of the C# Guid.Comb described in the email that begins as shown below.

I've tweaked the Audit in Cabana to use it.

I still suspect that an Audit table that uses a bigint (long) identity column might be better from a space perspective at least (although you could eliminate the ModTs now and recover that info from the Guid.Comb itself !)

I've kept that version around (Audit2) if anyone cares to see it.”

Ward Bell

Answer (by Ward Bell):

I said that I thought the impact would be marginal. From a space perspective, it is the difference between an 8 byte id (long, bigint) that we typically recommend and the 16 bytes of a Guid.

But I'm not a SQL expert by any means so I thought I'd do a little more research.

WARNING: The following is a long and rambling ensemble of my thoughts and bits clipped from around the net. Read on only if you are interested.

 If I had to sum it up: Guids are tempting but could endanger performance. It appears that Guid.Combs, which are "COMBinations of Guid and current datetime) may be a better approach. Guid.Combs are easily calculated on the client side.

OK - read on if you want. This is more stream of consciousness than I intended as it captures info as I discovered it. I hope to clean it up later.

====

It seems that the performance story is a little more complicated. The way I read it, Guids are often OK; you just have to make sure your application is aligned with Guid strengths and weaknesses.

 [See below for a discussion of pros and cons]

Their intended purpose is to support records that need global uniqueness and/or in replication scenarios. Any application which expects to operate off-line for a period of time is made much easier to program and maintain by the use of Guids.

On the other hand, performance concerns become serious if you are going to have millions of rows or lots of inserts. Sure we have disk space to burn but disk I/O remains precious and shouldn't be squandered. Guid columns should not be in a clustered index but it is easy to forget that when minting new tables as clustered index is the default for primary keys.

In particular, I am regretting my use of Guids for Audit records because (a) they are never germane to off-line scenarios and (b) we insert them in droves, and (c) we hardly ever do anything with them. We also neglected to make the Guid pk a non-clustered index in our Tutorial database which is an absolutely critical error. Every way I look at them it seems they are the wrong choice for an Audit table. They might make sense if there was a replication issue (e.g., writing to multiple databases and then consolidating them through replication) but even here I'd look to some other trick (e.g, a composite key of {autoincrementing bigint, small-int-marker-for-the-db}.

I have a revised Audit table (called Audit2) and companion mechanism that uses SQL Server's autoincrement and does not rely upon our fix-up logic for such columns. It involves a hack (I'll ask Jay for something more proper here) and it still awaits the mechanism needed to block re-read but it works.

 [THIS JUST IN] I just learned about COMBS (see below). Now I'm thinking of going back to that approach. Stay tuned.

There is a technique for generating GUID equivalents on a client that ensures monotonically increasing values that are globally unique nonetheless. They are called COMBs. You'll see my discussion of them way below.

Finally, I think it is reasonable to have a mix of Guid and non-Guid primary keys.

Anyway, on with the story

 ====

Here is an article with some actual metrics:
http://www.sql-server-performance.com/zn_guid_performance.asp

The author starts out terrified of Guids and, after analysis, comes around.

Certainly the most horrifying statistic is an insert comparison in which timing difference between Guid and non-Guid inserts is 45,191 seconds to 1,630!  That means integer inserts are 45 times faster than Guid inserts.

On closer inspection we see that the problem stems from doing 4,000,000 individual inserts. The 4,000,000 integer inserts involved consecutive ids; the 4,000,000 Guid inserts were random Guids. The randomness of the Guids causes tremendous page fragmentation - although some tuning might have helped. The consecutive integers slipped right into place. Note that when the Guids were forced to be consecutive (the 3rd case tested), the time was 2,025 - an insignificant difference from the integer 1,630. [BTW, I don't know if he was using clustered index for his Guid test; that would be a killer]

I don't know if your application is likely to make 4,000,000 inserts in a short amount of time; if so, Guids of this kind are not for you - but SEQUENTIAL GUID might be.

More interesting is that the read statistics were a wash until you start fetching a ton (e.g., 70,000 records).

He concludes: "So in short, it looks as though if you plan carefully and design wisely, then the use of the uniqueidentifier can perform as well in SELECT statements as an integer. Inserts are a different story, and we already discussed, that as long as they are sequential values, it really shouldn't matter. "

Aside: He doesn't do a straight Guid to single, integer Id comparison. The scenario calls for a solution to a situation in which clients talk to different databases and the results are later pooled in a central db. A Guid Id is perfect for the case; an simple integer id won't work because there would be id collisions when the different dbs merged data. So he suggests a composite id consisting of {integerId, databaseId}. I think this is a bad idea because composite keys are dreadful for foreign key lookups. I would have turned first to a reserved range approach in which the high (or low) digits of a long were used to distinguish the databases.

None of this really effects the analysis I don't think because the differences in insert performance have to do with the randomness of the Guid inserts more than the size of Guids themselves.

I should add Jim's findings about Guids in DevForce applications. The statistics quoted concern SQL performance but what of the impact on clients? Jim observes that with Guids there is no id-fixup processing before inserts. Every other id generation approach requires id fixup.

His experience (unmeasured as yet) is that client save performance is much improved when there is no id-fixup. This is not surprising because any id-fixup means a trip to the server - and any trip to the server hurts client responsiveness, especially over poor connections.

===

A Guid column should never be in a clustered index because a clustered index is only good for monotonically increasing narrow columns and a Guid is neither. Use your one-clustered-index-per-table allowance on something that fits.

====== PROS AND CONS OF GUIDS =====

First, here's a summary tidbit from Scott Bellware who is an MVP with lots of database and object model credentials

(http://codebetter.com/user/Profile.aspx?UserID=2342)

I typically prefer GUIDs for ID's in place of integers if the integer ID's have to be generated by a database server. [emphasis mine; this leaves open the question of integer ids generated on the client ... which is how we do it until the moment we actually insert.]

I've never really been comfortable with the idea that an abstraction like a  business object - something that can be used for more than persistence to the application's database  - must get it's ID from an external, and very likely, distributed system.  Since a GUID's uniqueness is ensured algorithmically, and since it can be done in the same local memory space as the object that it identifies, it seems a bit off to me that we'd defer to a remote dependency for something that is easily had locally, and that can be kept close to the object that is its primary client, and that is most concerned with it.

Since SQL Server 2005 has addressed the index fragmentation issue with GUID-based indexes, and since the COMB GUID generation code is easily had, GUIDs are a much easier decision to make. [emphasis mine.]

That said, there are still compelling reasons to use integer ID's, but the vast majority of database apps aren't of the ilk that require integer ID's specifically.

August 22, 2006 12:15 AM

----

Now for an article by Jimmy Nilsson that lays it out.

Pros and Cons of GUIDs as Primary Keys

http://(www.informit.com/articles/article.asp?p=25862&seqNum=5&rl=1)

Let's say that we plan to use GUIDs for the primary keys. What are the advantages and disadvantages when compared to INTEGERs? We should always start with the positive, so here goes:

·         We get a more object ID–like datatype because each primary key typically will be unique for the complete database. This gives good effects, for example, when designing a custom solution for pessimistic locking (but that's another long story).

·         @@IDENTITY has been a source of problems when INSERT triggers have INSERTed rows into other tables with IDENTITYs. (You can use SCOPE_IDENTITY() in SQL Server 2000 to avoid this problem.)

·         You can move the creation processing of the values from the data tier to the business or consumer tiers. For example, the consumer can create complete structures of orders and details that can be sent all the way through the business tier to the data tier in one go, which can reduce the number of round trips. A more basic way of saying this is that you can get the ID value before you use it, not just afterward, as is the case with IDENTITYs.

NOTE

In the case of .NET and ADO.NET, it's a big advantage to know the ID value at the time rows are added to the DataSet. Otherwise, for example, autogenerated INSERTs can't be used for master-detail relationships within the DataSet.

·         The span of values for GUIDs is enough for most scenarios. INTEGERs are not big enough for some scenarios. (By the way, with INTEGERs, you don't have to start the seed on 1. You can just as easily start it on the smallest negative INTEGER and thereby double the span of values.)

·         You can prepare for merge replication. When merge replication is to be used, each table to be replicated must have a GUID column (called ROWGUID in SQL Server). You also don't have to give each server in a merge replication scenario a unique set of values to use for the keys. The risk of duplicate GUIDs is so small that this shouldn't be a problem.

·         IDENTITY can't be used for tables in distributed partitioned views.

·         A GUID will not mean anything to the users. An INTEGER can be read, understood, and remembered. It's common to see INTEGER keys in the UI, and that might cause problems later.

·         You know what a nightmare it can be when doing a manual merge between two tables with INTEGERs as primary keys. Not only do you have to create a new sequence for the union of both tables, but you also must change all the foreign keys for all the dependent tables. If GUIDs are used, this sort of problem does not arise.

NOTE

The algorithm for generating GUIDs doesn't use the MAC address of the network cards in recent Windows versions anymore. Instead, it just creates a random value. In theory, this presents a risk of getting duplicate GUIDs, but, in practice, it shouldn't be a problem.

The reason for excluding the use of the MAC address is that it not only couples users to GUIDs, but some network cards don't use unique MAC addresses.

I will discuss this further when we reach the section about the COMBs. Stay tuned.

I'm not a salesman, so here are a few drawbacks:

·         There is a huge overhead in size. As I said, a GUID is four times larger than an INT. This is very important when it comes to indexes.

·         Portability can be a problem because not every database has GUID as a native datatype.

·         You can't use GUIDs with aggregates such as MIN. This is unfortunate when it comes to using WHILE plus an aggregate function, such as MIN or MAX for looping instead of a CURSOR. (Well, a workaround is to use a CONVERT or a CAST to CHAR(36) or BINARY(16) of the GUID before it's used in the aggregate, but watch out for performance problems with large tables.)

·         It's detrimental to developer-friendliness because it's really hard to edit GUIDs manually, to remember them when you browse the database, and so on. You have to get used to navigating the database by using SQL scripts, but this is not usually that much of a problem.

So, the overhead related to size is the main problem.

He goes on to test performance as described in the next section, GUID.COMBS

==== GUID.COMBS =========

COMBS are a transform on ordinary GUIDs that result in a sequential Guids, thus overcoming the SQL insert problems caused by random Guids.

GUID.COMB stands for "COMBination of Guid and the current datetime".

COMBS are still unique but they are not random; they always increase and that means there will be far fewer occasions when SQL has to shuffle the pages to insert a record between two others.

BTW, it appears that SQL Server 2005's newsequentialid() appears to be like a COMB. Unfortunately, it is determined on the data tier, not the client, which means it is no help in off-line scenarios.

[For more on SQL Server 2005's newsequentialid(), see

http://sqljunkies.com/Article/4067A1B1-C31C-4EAF-86C3-80513451FC03.scuk]

Jimmy Nilsson's article http://(www.informit.com/articles/article.asp?p=25862&seqNum=7&rl=1)

describes tests performed on the data tier that compared GUIDs, COMBs, and Identity+Int. These were tests for:

  • Inserts of 500,000 records
  • Fetches by pk
  • Joins on the 16-byte Guid/Comb for. key v. the 4 byte integer

In sum, the COMBs are very competitive on all three tests - about 1/10th slower than the integer in each test. BTW, the GUID insert was 30x the integer insert - which aligns with the described in the other article above for the 4 million rows.

-----------------

Here are three ways to calculate COMBS on the client:

Here's some C# code to create a COMB, I think. One issue is that while the resolution for a DATETIME in SQL Server is 1/300th of a second, the resolution for a DateTime in .NET is 1/10000th of a second (100 nanoseconds). This means that the sequential part of the COMB will change more frequently than in the SQL Server version, but I'm not sure if this is a good or a bad thing. If it could have a negative impact, it would be a relatively simple matter to use the six bytes corresponding to SQL Server's 1/300th of a second instead of the last six bytes of the .NET DateTime object.

public static Guid NewComb() {
byte[] dateBytes = BitConverter.GetBytes(DateTime.Now.Ticks);
byte[] guidBytes = Guid.NewGuid().ToByteArray();
// copy the last six bytes from the date to the last six bytes of the GUID
Array.Copy(dateBytes, dateBytes.Length - 7, guidBytes, guidBytes.Length - 7, 6);
return new Guid(guidBytes);
}

C# Code to mirror SQL ServerBy glapointe, Aug 16, 2002 08:10 AM

The following code creates a comb that will match exactly (or pretty darn close) the combs created in SQL Server. I also included a little method to retrieve the date from the comb.

public static Guid NewComb() {
byte[] guidArray = System.Guid.NewGuid().ToByteArray();

DateTime baseDate = new DateTime(1900,1,1);
DateTime now = DateTime.Now;

// Get the days and milliseconds which will be used to build the byte string
TimeSpan days = new TimeSpan(now.Ticks - baseDate.Ticks);
TimeSpan msecs = new TimeSpan(now.Ticks - (new DateTime(now.Year, now.Month, now.Day).Ticks));

// Convert to a byte array
// Note that SQL Server is accurate to 1/300th of a millisecond so we divide by 3.333333
byte[] daysArray = BitConverter.GetBytes(days.Days);
byte[] msecsArray = BitConverter.GetBytes((long)(msecs.TotalMilliseconds/3.333333));

// Reverse the bytes to match SQL Servers ordering
Array.Reverse(daysArray);
Array.Reverse(msecsArray);

// Copy the bytes into the guid
Array.Copy(daysArray, daysArray.Length - 2, guidArray, guidArray.Length - 6, 2);
Array.Copy(msecsArray, msecsArray.Length - 4, guidArray, guidArray.Length - 4, 4);

return new System.Guid(guidArray);
}

public static DateTime GetDateFromComb(System.Guid guid) {
DateTime baseDate = new DateTime(1900,1,1);
byte[] daysArray = new byte[4];
byte[] msecsArray = new byte[4];
byte[] guidArray = guid.ToByteArray();

// Copy the date parts of the guid to the respective byte arrays.
Array.Copy(guidArray, guidArray.Length - 6, daysArray, 2, 2);
Array.Copy(guidArray, guidArray.Length - 4, msecsArray, 0, 4);

// Reverse the arrays to put them into the appropriate order
Array.Reverse(daysArray);
Array.Reverse(msecsArray);

// Convert the bytes to ints
int days = BitConverter.ToInt32(daysArray, 0);
int msecs = BitConverter.ToInt32(msecsArray, 0);

DateTime date = baseDate.AddDays(days);
date = date.AddMilliseconds(msecs * 3.333333);

return date;
}

 

Here is one in VB:

Public Shared Function NewComb() As Guid

        'Descr/Purpose: Create a COMB. (A COMBination of an

        '               ordinary GUID and the current datetime.

        '

        '               The current datetime should be in

        '               SQL Server format. That is days after

        '               1st of Jan 1900 and the no of milliseconds

        '               after midnight, divided by 3.3333...

        '

        '               It is the lowest six bytes of the GUID

        '               that get exchanged for the current datetime.

        'Parameters:    -

        'Return:        A new GUID (COMB)

        'Date, who:     2002-04-04, Jimmy Nilsson

 

        'Revisions:

        '2002-04-07, Jimmy Nilsson

        'Changed the solution to use a bytearray instead

        'of a string.

 

        Dim i As Short

        Dim theNow As DateTime = System.DateTime.Now

 

        'Calculate the days after 1st of Jan 1900.

        Dim theDays As Integer = theNow.Subtract _

        (New System.DateTime(1900, 1, 1)).Days

 

        'Calculate the number of milliseconds after

        'midnight, divided by 3.3333...

        Dim theMilliSeconds As Integer = CType(theNow.Subtract _

        (System.DateTime.Today).TotalMilliseconds _

        / (10 / 3), Integer)

 

        'Create a new GUID and save it as a bytearray.

        Dim aBuffer As Byte() = Guid.NewGuid().ToByteArray()

 

        'Exchange byte 10 and 11 for the current days from

        '1900-01-01.

        Dim theDaysAsBytes As Byte() = BitConverter.GetBytes(theDays)

        For i = 0 To 1

            aBuffer(10 + i) = theDaysAsBytes(1 - i)

        Next

 

        'Exchange byte 12-15 for the no of milliseconds after

        'midnight.

        '(Divided by 3.3333...)

        Dim theMilliSecondsAsBytes As Byte() = _

        BitConverter.GetBytes(theMilliSeconds)

        For i = 0 To 3

            aBuffer(12 + i) = theMilliSecondsAsBytes(3 - i)

        Next

 

        'Return the bytearray as a new guid.

        Return New Guid(aBuffer)

    End Function

=========================

Other sources on this topic:

http://www.yafla.com/papers/SQL_Server_Performance/high_performance_sql_server_pt2_1.htm



Edited by DFFadmin - 06-Jun-2007 at 5:43pm
Back to Top
 Post Reply Post Reply

Forum Jump Forum Permissions View Drop Down