Tuesday, August 16, 2005
Identity Crisis
This evening, I had a discussion with a coworker, whom I admire greatly, about identity columns and primary keys in SQL database design. While I have a great deal of respect for this gentleman I was surprised when he said that primary keys should always be unrelated to the data and should be identity columns.So, when I got to the hotel this evening, I had to get online and find the documentation to prove my side of the debate. Boy was I in for a shock. Not that I am wrong - I am not - but that there is so much debate, and in one poll I found a full 60% of database developers are staunch supporters of identity columns as primary keys.
It seems that Joe Celko (search Microsoft Newsgroups and Amazon for his name; buy and read his books) and I are the only ones in the debate who understand that identity columns should almost never be used as primary keys.
A primary key, by definition, uniquely identifies an entity, not a data row. Mr. Celko (I have never spoken to him so he's not Joe) correctly states that using an identity column for primary key is actually trying to emulate a file position pointer in a file based database. It is not a relational database functionality by any stretch of the imagination.
http://msdn.microsoft.com/newsgroups/default.aspx?dg=microsoft.public.sqlserver.programming&mid=1b5bf5aa-ef12-4c00-a4d6-2a31b0cdd2b5&sloc=en-us
No, I don't do TinyUrl. Why should I give them free access to your and my internet usage data?
The one point I tried to make tonight was that, above all the other reasons for not using an identity column, identity columns destroy data integrity. If I have a person table, and I put Jane in the table, what prevents you from putting Jane in the table again tomorrow? There is nothing to stop you.
So let's say we create a unique constraint on SSN, or for the SSN-is-not-necessarily-unique naysayers, on SSN, LastName, FirstName, and DOB. And, of course, we add an appropriate index for performance. Now when we try to add Jane again, we check for the existence of a row matching Jane's SSN, last name, first name, and date of birth. Finding the previous row, we reject the new insertion.
Guess what? No matter what column we marked as primary key in our handy-dandy database administration tool when we created the database, our unique constraint has just become the defacto primary key. It is the primary key because that is what we are using to uniquely identify the data in the row - not the row, but the data in the row! Uniquely identifying the data when we perform search and retrieval operations is the function of the primary key. We could not perform this integrity check with the identity column faux-key.
This data integrity problem, as far as I am concerned, trumps all other arguments by itself. A primary key that does not guarantee unique data is less than a worthless primary key... it is just not a primary key. Period. Let alone the fact that our database physical structure does not match our logical model or even our usage. Therefore what we have created is not a relational database.
The identity column value serves, at best, as a unique key for foreign key relationships but actually should not even do that. It just shouldn't exist.
To a great many developers-turned-database-designers, identity columns are synonymous with primary key, so much so that the terms are often, and incorrectly, used inter-changeably. As I searched for signs of SQL-intelligent life tonight, a common theme in the newsgroups was, "If I don't use an identity column for a primary key, then what is the best way to generate a primary key for my tables?"
That one left me shaking my head in disbelief over and over again. If you are asking yourself the same question, the answer is you don't generate a primary key. You CAN'T generate a primary key. It is not possible. Primary keys exist in the properties of an entity that make it just that: an entity. Period. Your task is to identify the candidate keys in your data and then from those, identify the primary key from the data.
The definition of 3rd Normal Form is that all non-key columns depend on the primary key. If I have an employee table with an identity column as primary key, which other columns, such as payrate, department, or phone number, depend on the identity column value? None! If I use an employee ID number as the primary key, then all those columns do depend on the primary key. Even though both columns look similar, there IS a difference. I will go so far as to say that I don't believe you can achieve 3rd normal form in a table using an identity column as an artificial primary key.
How can I say that, when an identity column may look just like the employee ID? It's easy. For one thing, I type fast. Besides that, let's say you decide to copy your data to a new server and re-generate the identity columns in order to fill gaps in the sequence for deleted employees. You may say, "But I never do that, so that is meaningless" But it is not meaningless.
The reason it is not meaningless is that you can do it if you wanted to. You can re-order all your identity column data in the parent tables of a relationship and in all of the child tables, and no one would ever be the wiser. That is because surrogate keys, by definition, should never be visible to the user. Thus, the dependency of the data on the primary key is made up.
If, on the other hand, you used an employee ID number as the primary key, you cannot easily change it. It has identifying traits outside of the context of the database table structure. If you don't believe me, try it sometime but you may want to be out of town before the next payroll.
I have read many comments tonight from database developers espousing the virtues of the identity column as primary key because, since it is unrelated to the data (violation of 3rd normal form), it doesn't ever change. That's not the point. The point is that it can change without changing the meaning of the data. Therefore, even if the accident of time never sees it change, it is not immutable. The employee ID number, though, has serious potential repercussions if someone were to arbitrarily change it. Therefore, it is really the immutable choice. It is the item that should never change in the life of the employee.
Ok... So employee numbers, social security numbers, and even birthdates change. Those are extremely rare conditions that have to be handled as exceptions. In my discussion this evening, we debated using DNA code as a primary key and my coworker dismissed it because apparently, as he told me, your DNA can change over time. Well, we're not talking about using the circumference of a tree as an identifier here. This is immutable and unique data. Ask any forensic scientist. So there are some extremely unlikely exceptions to the uniqueness. Those do not change the fact that you cannot go into the database and change the data for the DNA code and still be talking about the same person. If I go in the database and change the identity column, the only thing changed is the identity column. We're still talking about the same person.
So, that's my identity crisis, or to be more specific, my identity-column crisis.
Comments:
<< Home
Dale, are we talking logical or physical? I know you saw my bolg "Surrogate Keys….The Devil’s spawn" http://weblogs.sqlteam.com/brettk/archive/2004/06/09/1530.aspx, and the only thing I mention where a surrogate may be a good idea in a physical implementation is an address...you know, street, city, state, zip. Physically creating a primary key on this would not be a very good idea for join purposes. It's just to large. I mentioned that an IDENTITY might be a good idea for this. Don't like surrogates at all? THen maybe lattitude and longitude would fly be me...as long as you could encode every single address...otherwise it would be another problem.
As far as your respected co-worker? Better look more closely. IDENTITIES on everything. Did they mention that they also would put unique constraints/indexes on the natural keys as well? If not, the call the funny farm. If they did, they're looking for paper thin indexing anf they would realize a performance boost. BUT! This is at the sake of having to increase the number of joins...so where is the gain really. I mean a StateID (IDENTITY) is just plain silly.
Good luck, Thanks for reading.
As far as your respected co-worker? Better look more closely. IDENTITIES on everything. Did they mention that they also would put unique constraints/indexes on the natural keys as well? If not, the call the funny farm. If they did, they're looking for paper thin indexing anf they would realize a performance boost. BUT! This is at the sake of having to increase the number of joins...so where is the gain really. I mean a StateID (IDENTITY) is just plain silly.
Good luck, Thanks for reading.
I am not sure why you would use the primary key in an address for a join. Wouldn’t the address table have a foreign key relationship to the person, business, or other entity to which the address belonged? In a foreign key relationship, it is the child table that has the reference to the parent, not the other way around. And that holds true in a one to one, or a one to many relationship. So the address table would have a person column. The person table would not have a reference to the address table.
The unique constraint and index in your example are the proper items to be your primary key and, as I stated in my blog, they are the defacto primary key, regardless of what you configured in your table structure, because they are what you actually use to maintain uniqueness in your data.
If you really need to narrow your joins, and you have done performance testing to prove that you have to violate good design to eek out a little better performance (am I editorializing too much here? *LOL*) - and there are times that has to be done - then don't use the identity column as your primary key; use your real primary key, the one that guarantees uniqueness, as the primary key and use the identity column to point your foreign key relationship in the child tables to. To use the two (the identity column and the unique constraint) in any other order is just plain backwards.
I did work on a project for an organization that had an address table which the people table had a foreign key relationship to. Their idea was that if multiple people in their database shared the same address, they only had to store the address once. As I pointed out to them, multiple people share the first name Mike, too. Perhaps they should have a FirstName table and the people table could have a foreign key relationship to it.
This whole thing about narrow indexes is just ludicrous. My coworker went on about the effects of wide indexes in joins and the effect on the indexes in those joins. My response to that should have been, but I didn't think of it until too late, “What is the purpose of a join?” The answer, almost without a doubt would have been, “to combine data.”
Well, that is absolutely backwards from the purpose of a join. The purpose of a join is not at all to combine data. The purpose of the join is to allow you to separate data and, therefore, not have to store multiple copies. If all you want to do is to combine data, then put it in a flat file, or in a single, but wide, table. There you have it all combined, and no skinny little indexes to worry about.
If you take a million row person table and a separate table of what kind of car they drive, the cars table only has a few thousand rows. Violating data integrity rules and making the data unreadable in order to shave a disk access or two off of reading a table of a few thousand rows is just not necessary. The leaf tables in a join, by the very definition of a relational database, will usually contain significantly less rows than the root table except in the case where the join is just a vertical partition - to which I go back to my first answer, put it back in a flat table if you are worried about disk access. So what difference does it make? Other than in perhaps a few very large databases, practically none.
In my experience in troubleshooting and optimizing databases for clients, it is poorly written procedures, poorly designed tables, and no indexing or poor indexing, that almost always solve the performance problems.
And most databases in the world are much too small to ever need to worry about whether or not they can get a few percent improvement in performance by narrowing keys or by merging tables. All this bad design is in order to solve a problem that almost never exists.
Thanks for your reply.
Dale
The unique constraint and index in your example are the proper items to be your primary key and, as I stated in my blog, they are the defacto primary key, regardless of what you configured in your table structure, because they are what you actually use to maintain uniqueness in your data.
If you really need to narrow your joins, and you have done performance testing to prove that you have to violate good design to eek out a little better performance (am I editorializing too much here? *LOL*) - and there are times that has to be done - then don't use the identity column as your primary key; use your real primary key, the one that guarantees uniqueness, as the primary key and use the identity column to point your foreign key relationship in the child tables to. To use the two (the identity column and the unique constraint) in any other order is just plain backwards.
I did work on a project for an organization that had an address table which the people table had a foreign key relationship to. Their idea was that if multiple people in their database shared the same address, they only had to store the address once. As I pointed out to them, multiple people share the first name Mike, too. Perhaps they should have a FirstName table and the people table could have a foreign key relationship to it.
This whole thing about narrow indexes is just ludicrous. My coworker went on about the effects of wide indexes in joins and the effect on the indexes in those joins. My response to that should have been, but I didn't think of it until too late, “What is the purpose of a join?” The answer, almost without a doubt would have been, “to combine data.”
Well, that is absolutely backwards from the purpose of a join. The purpose of a join is not at all to combine data. The purpose of the join is to allow you to separate data and, therefore, not have to store multiple copies. If all you want to do is to combine data, then put it in a flat file, or in a single, but wide, table. There you have it all combined, and no skinny little indexes to worry about.
If you take a million row person table and a separate table of what kind of car they drive, the cars table only has a few thousand rows. Violating data integrity rules and making the data unreadable in order to shave a disk access or two off of reading a table of a few thousand rows is just not necessary. The leaf tables in a join, by the very definition of a relational database, will usually contain significantly less rows than the root table except in the case where the join is just a vertical partition - to which I go back to my first answer, put it back in a flat table if you are worried about disk access. So what difference does it make? Other than in perhaps a few very large databases, practically none.
In my experience in troubleshooting and optimizing databases for clients, it is poorly written procedures, poorly designed tables, and no indexing or poor indexing, that almost always solve the performance problems.
And most databases in the world are much too small to ever need to worry about whether or not they can get a few percent improvement in performance by narrowing keys or by merging tables. All this bad design is in order to solve a problem that almost never exists.
Thanks for your reply.
Dale
Dale,
I can understand and accept most of your argument but there are some issues.
My understading is that some database systems require all Foreign Keys to refer to the Primary Key in the source table. Foreign Keys cannot refer to unique indexes.
If your Primary Key is SSN, First Name, Last Name and Date of Birth, then all tables which refer to this must also have SSN, First Name, Last Name and Date of Birth in their foreign keys.
If one of the women gets married and chooses to change her last name to her married name, this will require an update across all tables that use this as the Foreign Key.
Not all data can be clearily identified as unique by it's attributes.
In Spain, they have strict rules about identifying their SSN equivalent as a unique individual (made up of First Name, (Father's) Surname, Mother's Surname, DOB, Home Town at Birth, Hospital). However, I met two cousins for whom all these attributes were the same and this caused genuine difficulties for the authorities.
One of the reason for normalisation was to ensure managability by keeping an attribute in one location.
If the facts about a person are spread across your tables as foreign keys because they form the 'correct' primary key , then the database design might be worse not better by application of the full normalisation rules.
My wife was expecting identical twins once - I was told that their DNA, like the rest of them, would be identical. Apart from tattooing them, how would one know which baby was which?
From the other side, I have concerns about identity values as Primary Keys for performance reasons. They tend to create hot spots. New records are usually an increment of the last new record and, therefore, cause the writes to occur at all the same physical locations. These writes have to queue up to get access to those locations.
Niall.
I can understand and accept most of your argument but there are some issues.
My understading is that some database systems require all Foreign Keys to refer to the Primary Key in the source table. Foreign Keys cannot refer to unique indexes.
If your Primary Key is SSN, First Name, Last Name and Date of Birth, then all tables which refer to this must also have SSN, First Name, Last Name and Date of Birth in their foreign keys.
If one of the women gets married and chooses to change her last name to her married name, this will require an update across all tables that use this as the Foreign Key.
Not all data can be clearily identified as unique by it's attributes.
In Spain, they have strict rules about identifying their SSN equivalent as a unique individual (made up of First Name, (Father's) Surname, Mother's Surname, DOB, Home Town at Birth, Hospital). However, I met two cousins for whom all these attributes were the same and this caused genuine difficulties for the authorities.
One of the reason for normalisation was to ensure managability by keeping an attribute in one location.
If the facts about a person are spread across your tables as foreign keys because they form the 'correct' primary key , then the database design might be worse not better by application of the full normalisation rules.
My wife was expecting identical twins once - I was told that their DNA, like the rest of them, would be identical. Apart from tattooing them, how would one know which baby was which?
From the other side, I have concerns about identity values as Primary Keys for performance reasons. They tend to create hot spots. New records are usually an increment of the last new record and, therefore, cause the writes to occur at all the same physical locations. These writes have to queue up to get access to those locations.
Niall.
Having done it both ways, identity columns are hugely convenient. Social security numbers can be entered incorrectly, same with birthdays. Dependent table relationships get very nasty. Joe (who I have met so I can call him Joe) lives in somewhat of an ivory tower on this one. (IMHO)
http://peterkellner.net
http://peterkellner.net
Thanks for replying, Niall, I appreciate your comment.
I agree with your first point. In fact, Oracle will not (as I understand it) allow you to use other than the primary key in a foreign key relationship and, in fact, the SQL standard also forbids it. And my answer to that is, don't use identity columns, use the primary key. Problem solved.
Your second point, is accurate. All tables which refer to the primary key would then have to hold all the key items.
Your third point, about changing the name of a woman when she marries is correct. In that case you do have to update all the tables. It is a one time operation and not a performance issue. In this case, the identity of the person actually changes. So update the tables.
Fourth and beyond: all entities are unique by definition. An entity is a distinct instance of a thing. There are unique attributes that define every entity. The question here is whether or not you otherwise require those attributes in your application and, therefore, have them stored in your data.
All of the points you make from the third point on are rare exceptions and should generally be handled as excpetions. In no way does an identity column solve any of those problems. In each case, the world around those persons or objects has to establish a definition that separates one similar item from another. And in their worlds, none of the definitions will be based upon the identity column of your database. If you need to establish an identifier, it would naturally have to exist outside of the database and that identifier is what you should use inside your database.
Consider your Spain example. Would an identity column in the school's database do any good for identifying the cousins in the drivers license bureau? Not at all. It doesn't identify anything except the row. So both of these government agencies have no way of knowing they are dealing with the same person. The uniqueness has to exist outside of the confines of the database as well.
If you can add a unique constraint to prevent duplicates, then you have a candidate key for use as a primary key and obviously don't require the identity column.
I agree with your first point. In fact, Oracle will not (as I understand it) allow you to use other than the primary key in a foreign key relationship and, in fact, the SQL standard also forbids it. And my answer to that is, don't use identity columns, use the primary key. Problem solved.
Your second point, is accurate. All tables which refer to the primary key would then have to hold all the key items.
Your third point, about changing the name of a woman when she marries is correct. In that case you do have to update all the tables. It is a one time operation and not a performance issue. In this case, the identity of the person actually changes. So update the tables.
Fourth and beyond: all entities are unique by definition. An entity is a distinct instance of a thing. There are unique attributes that define every entity. The question here is whether or not you otherwise require those attributes in your application and, therefore, have them stored in your data.
All of the points you make from the third point on are rare exceptions and should generally be handled as excpetions. In no way does an identity column solve any of those problems. In each case, the world around those persons or objects has to establish a definition that separates one similar item from another. And in their worlds, none of the definitions will be based upon the identity column of your database. If you need to establish an identifier, it would naturally have to exist outside of the database and that identifier is what you should use inside your database.
Consider your Spain example. Would an identity column in the school's database do any good for identifying the cousins in the drivers license bureau? Not at all. It doesn't identify anything except the row. So both of these government agencies have no way of knowing they are dealing with the same person. The uniqueness has to exist outside of the confines of the database as well.
If you can add a unique constraint to prevent duplicates, then you have a candidate key for use as a primary key and obviously don't require the identity column.
Dale,
Interesting blog! This may be trivial, but I like the idea of using integer keys because the database engine can join on them more efficiently (or so I have been led to believe), whether these keys were generated by an identity column or not. That being said, the identity column is just a convenience. Also I have had some good results using integer surrogate keys in my data warehouse tables design. I will concede I am not a purist when it comes to database design theory, but I do respect your viewpoint!
Norm
Interesting blog! This may be trivial, but I like the idea of using integer keys because the database engine can join on them more efficiently (or so I have been led to believe), whether these keys were generated by an identity column or not. That being said, the identity column is just a convenience. Also I have had some good results using integer surrogate keys in my data warehouse tables design. I will concede I am not a purist when it comes to database design theory, but I do respect your viewpoint!
Norm
Niall: Primary keys do not have to be clustered. You CAN make them non-clustered and choose a different clustering index. That would alleviate the "hot spot" situation you describe.
Dale: I use artificial primary keys via IDENTITY for my primary keys. Why? Because it reduces the number of index pages. Having a wide index like SSN/Last/First/DOB would increase the number of index pages needed to find data. At the end of the day, it's all about performance.
Dale: I use artificial primary keys via IDENTITY for my primary keys. Why? Because it reduces the number of index pages. Having a wide index like SSN/Last/First/DOB would increase the number of index pages needed to find data. At the end of the day, it's all about performance.
You are mistaken. Using identity columns as primary keys increases the number of index pages.
Now you have to maintain two indexes, one on what should have been the primary key, so that you can maintain that second unique constraint/index for data integrity purposes, and one for the faux primary key.
It doesn't decrease the number of index pages read when writing data. Both indexes must be checked for uniqueness and both indexes must be updated when doing inserts or if SQL Server moves the data in the file.
It doesn't decrease the number of index pages for your users. Your users would not search for me in your database by looking for 242984298238923. They're going to enter in Preston, Dale (or something like that). You're going to search by Preston, Dale (or something like that), using the wider LastName_FirstName index. If there are multiple Preston, Dale's in your database you're going to retrieve all of the Preston, Dale's plus their birthdate, or other identifying information so that the user can select me out of all of the Preston, Dale's in the world – and there are a lot of them. Don't you wish you had that wider covering index now?
Identity columns represent how computers use computers. Natural keys represent how people use computers. And it's not all about performance; it's all about people. And by making your database work how people use it, you get better performance in real life situations.
Narrowing the index is only one aspect of performance. If how many index records were returned in a single 8K page was the only thing that effected SQL Server performance, you would be right. But it is not.
I agree that you don't have to use the primary key as the clustered index. In fact, I almost never use the primary key as the clustered index. Clustered indexes should consist of colums that both exist and are used in large groups. That will significantly increase the performance when you are selecting by state, or by last name, for instance.
Now you have to maintain two indexes, one on what should have been the primary key, so that you can maintain that second unique constraint/index for data integrity purposes, and one for the faux primary key.
It doesn't decrease the number of index pages read when writing data. Both indexes must be checked for uniqueness and both indexes must be updated when doing inserts or if SQL Server moves the data in the file.
It doesn't decrease the number of index pages for your users. Your users would not search for me in your database by looking for 242984298238923. They're going to enter in Preston, Dale (or something like that). You're going to search by Preston, Dale (or something like that), using the wider LastName_FirstName index. If there are multiple Preston, Dale's in your database you're going to retrieve all of the Preston, Dale's plus their birthdate, or other identifying information so that the user can select me out of all of the Preston, Dale's in the world – and there are a lot of them. Don't you wish you had that wider covering index now?
Identity columns represent how computers use computers. Natural keys represent how people use computers. And it's not all about performance; it's all about people. And by making your database work how people use it, you get better performance in real life situations.
Narrowing the index is only one aspect of performance. If how many index records were returned in a single 8K page was the only thing that effected SQL Server performance, you would be right. But it is not.
I agree that you don't have to use the primary key as the clustered index. In fact, I almost never use the primary key as the clustered index. Clustered indexes should consist of colums that both exist and are used in large groups. That will significantly increase the performance when you are selecting by state, or by last name, for instance.
Dale,
I respect your view only consider it too radical.
I have tried to summarise pros and cons and here is the list.
Many thanks,
Dragan
• Identity is not the best way to introduce the surrogate key, but is acceptable in many cases.
• Nothing stops us from having surrogate key and unique constraint for candidate key in the same table.
• Using surrogate key in the parent table can make child table smaller and joins faster and easier to handle.
Exp1: CreditCard table and Transaction table. Credit card number is treated as character string and can be quite long. We do need to search Transaction table for certain card number, hence the index. If we have CardID as int (or bigint) as surrogate key in CreditCard table instead of CardNumber the index seek in Transaction table will be faster circa 10 times or more (for SQL Server, DB2 and Sybase but not Oracle) especially if we have transactionId bigint in transaction table as PK and clustered index. Of course we find the cardID in CreditCard table first but remember that table is 10, 20 or over 100 times smaller than transaction table.
Exp2: Let’s say we have four columns that uniquely identify the entity and have the business meaning. One or two of columns are long varchar or even worse nvarchar. If we make them PK we will have to drag those four columns to child table that can be significant in size. For every join we will have to write the complex join condition. Combine that with joining three more tables designed in the same manner. Having the surrogate key makes both the inserts and joins way more efficient. We can (and shall) still have unique index on those four columns in order to reflect business requirement.
• It is a bad idea to use identity for generating IDs in lookup tables. These values are often hardcoded and if we re-generate look-up tables (during migration or conversion process) different IDs can be assigned creating complete chaos in the database.
• If we want to hide surrogate key from users we can use views or better stored procedures. There are many other columns that we don’t want every user to see.
• Reordering identities in the table and all child tables sounds pretty ridicules to me, but would leave the database in consistent state. If I change the PK and update underlining foreign keys (why on the earth would someone do that?) I still have same children being in relation with same parents.
• Surrogate key has as much to do with employee name as the employee ID, which is just a number assigned to particular employee. If Employee ID can identify the employee in the database so can the surrogate key. Only the surrogate key is more convenient for the RDBMS to handle. Employee Id can be wrong in the database due to data entry error. Correcting this error is easier if it wasn’t primary key. (Whoever worked on data cleansing knows what I am talking about).
• It is almost impossible to horizontally partition table with identity column.
• It is not possible at all to insert data in the table with identity column as PK that has children as a set operation. It has to be done one row at the time.
• It is the child table where we can see the most benefit of surrogate key.
• If not handled properly identity can be the reason of data corruption
• Identity can be used to make clustered index key monotonously increasing integer. It makes all inserts faster and for tables that are rarely updated save us of data pages fragmentation. It doesn’t need to be primary key for this purpose. It also improves date-time range queries because data pages are physically sorted in chronological order.
• It is true that identity column that is clustered index is de facto rowid. That is the real purpose of clustered index to let us choose what is going to be rowid otherwise the system generates one for us. Making rowid primary key truly seems like heresy from logical design point of view, only it gives us a tool to compete relational database against mainframe. It is simply quantum leap in performance.
I respect your view only consider it too radical.
I have tried to summarise pros and cons and here is the list.
Many thanks,
Dragan
• Identity is not the best way to introduce the surrogate key, but is acceptable in many cases.
• Nothing stops us from having surrogate key and unique constraint for candidate key in the same table.
• Using surrogate key in the parent table can make child table smaller and joins faster and easier to handle.
Exp1: CreditCard table and Transaction table. Credit card number is treated as character string and can be quite long. We do need to search Transaction table for certain card number, hence the index. If we have CardID as int (or bigint) as surrogate key in CreditCard table instead of CardNumber the index seek in Transaction table will be faster circa 10 times or more (for SQL Server, DB2 and Sybase but not Oracle) especially if we have transactionId bigint in transaction table as PK and clustered index. Of course we find the cardID in CreditCard table first but remember that table is 10, 20 or over 100 times smaller than transaction table.
Exp2: Let’s say we have four columns that uniquely identify the entity and have the business meaning. One or two of columns are long varchar or even worse nvarchar. If we make them PK we will have to drag those four columns to child table that can be significant in size. For every join we will have to write the complex join condition. Combine that with joining three more tables designed in the same manner. Having the surrogate key makes both the inserts and joins way more efficient. We can (and shall) still have unique index on those four columns in order to reflect business requirement.
• It is a bad idea to use identity for generating IDs in lookup tables. These values are often hardcoded and if we re-generate look-up tables (during migration or conversion process) different IDs can be assigned creating complete chaos in the database.
• If we want to hide surrogate key from users we can use views or better stored procedures. There are many other columns that we don’t want every user to see.
• Reordering identities in the table and all child tables sounds pretty ridicules to me, but would leave the database in consistent state. If I change the PK and update underlining foreign keys (why on the earth would someone do that?) I still have same children being in relation with same parents.
• Surrogate key has as much to do with employee name as the employee ID, which is just a number assigned to particular employee. If Employee ID can identify the employee in the database so can the surrogate key. Only the surrogate key is more convenient for the RDBMS to handle. Employee Id can be wrong in the database due to data entry error. Correcting this error is easier if it wasn’t primary key. (Whoever worked on data cleansing knows what I am talking about).
• It is almost impossible to horizontally partition table with identity column.
• It is not possible at all to insert data in the table with identity column as PK that has children as a set operation. It has to be done one row at the time.
• It is the child table where we can see the most benefit of surrogate key.
• If not handled properly identity can be the reason of data corruption
• Identity can be used to make clustered index key monotonously increasing integer. It makes all inserts faster and for tables that are rarely updated save us of data pages fragmentation. It doesn’t need to be primary key for this purpose. It also improves date-time range queries because data pages are physically sorted in chronological order.
• It is true that identity column that is clustered index is de facto rowid. That is the real purpose of clustered index to let us choose what is going to be rowid otherwise the system generates one for us. Making rowid primary key truly seems like heresy from logical design point of view, only it gives us a tool to compete relational database against mainframe. It is simply quantum leap in performance.
Dale,
I respect your view only consider it too radical.
I have tried to summarise pros and cons and here is the list.
Many thanks,
Dragan
• Identity is not the best way to introduce the surrogate key, but is acceptable in many cases.
• Nothing stops us from having surrogate key and unique constraint for candidate key in the same table.
• Using surrogate key in the parent table can make child table smaller and joins faster and easier to handle.
Exp1: CreditCard table and Transaction table. Credit card number is treated as character string and can be quite long. We do need to search Transaction table for certain card number, hence the index. If we have CardID as int (or bigint) as surrogate key in CreditCard table instead of CardNumber the index seek in Transaction table will be faster circa 10 times or more (for SQL Server, DB2 and Sybase but not Oracle) especially if we have transactionId bigint in transaction table as PK and clustered index. Of course we find the cardID in CreditCard table first but remember that table is 10, 20 or over 100 times smaller than transaction table.
Exp2: Let’s say we have four columns that uniquely identify the entity and have the business meaning. One or two of columns are long varchar or even worse nvarchar. If we make them PK we will have to drag those four columns to child table that can be significant in size. For every join we will have to write the complex join condition. Combine that with joining three more tables designed in the same manner. Having the surrogate key makes both the inserts and joins way more efficient. We can (and shall) still have unique index on those four columns in order to reflect business requirement.
• It is a bad idea to use identity for generating IDs in lookup tables. These values are often hardcoded and if we re-generate look-up tables (during migration or conversion process) different IDs can be assigned creating complete chaos in the database.
• If we want to hide surrogate key from users we can use views or better stored procedures. There are many other columns that we don’t want every user to see.
• Reordering identities in the table and all child tables sounds pretty ridicules to me, but would leave the database in consistent state. If I change the PK and update underlining foreign keys (why on the earth would someone do that?) I still have same children being in relation with same parents.
• Surrogate key has as much to do with employee name as the employee ID, which is just a number assigned to particular employee. If Employee ID can identify the employee in the database so can the surrogate key. Only the surrogate key is more convenient for the RDBMS to handle. Employee Id can be wrong in the database due to data entry error. Correcting this error is easier if it wasn’t primary key. (Whoever worked on data cleansing knows what I am talking about).
• It is almost impossible to horizontally partition table with identity column.
• It is not possible at all to insert data in the table with identity column as PK that has children as a set operation. It has to be done one row at the time.
• It is the child table where we can see the most benefit of surrogate key.
• If not handled properly identity can be the reason of data corruption
• Identity can be used to make clustered index key monotonously increasing integer. It makes all inserts faster and for tables that are rarely updated save us of data pages fragmentation. It doesn’t need to be primary key for this purpose. It also improves date-time range queries because data pages are physically sorted in chronological order.
• It is true that identity column that is clustered index is de facto rowid. That is the real purpose of clustered index to let us choose what is going to be rowid otherwise the system generates one for us. Making rowid primary key truly seems like heresy from logical design point of view, only it gives us a tool to compete relational database against mainframe. It is simply quantum leap in performance.
I respect your view only consider it too radical.
I have tried to summarise pros and cons and here is the list.
Many thanks,
Dragan
• Identity is not the best way to introduce the surrogate key, but is acceptable in many cases.
• Nothing stops us from having surrogate key and unique constraint for candidate key in the same table.
• Using surrogate key in the parent table can make child table smaller and joins faster and easier to handle.
Exp1: CreditCard table and Transaction table. Credit card number is treated as character string and can be quite long. We do need to search Transaction table for certain card number, hence the index. If we have CardID as int (or bigint) as surrogate key in CreditCard table instead of CardNumber the index seek in Transaction table will be faster circa 10 times or more (for SQL Server, DB2 and Sybase but not Oracle) especially if we have transactionId bigint in transaction table as PK and clustered index. Of course we find the cardID in CreditCard table first but remember that table is 10, 20 or over 100 times smaller than transaction table.
Exp2: Let’s say we have four columns that uniquely identify the entity and have the business meaning. One or two of columns are long varchar or even worse nvarchar. If we make them PK we will have to drag those four columns to child table that can be significant in size. For every join we will have to write the complex join condition. Combine that with joining three more tables designed in the same manner. Having the surrogate key makes both the inserts and joins way more efficient. We can (and shall) still have unique index on those four columns in order to reflect business requirement.
• It is a bad idea to use identity for generating IDs in lookup tables. These values are often hardcoded and if we re-generate look-up tables (during migration or conversion process) different IDs can be assigned creating complete chaos in the database.
• If we want to hide surrogate key from users we can use views or better stored procedures. There are many other columns that we don’t want every user to see.
• Reordering identities in the table and all child tables sounds pretty ridicules to me, but would leave the database in consistent state. If I change the PK and update underlining foreign keys (why on the earth would someone do that?) I still have same children being in relation with same parents.
• Surrogate key has as much to do with employee name as the employee ID, which is just a number assigned to particular employee. If Employee ID can identify the employee in the database so can the surrogate key. Only the surrogate key is more convenient for the RDBMS to handle. Employee Id can be wrong in the database due to data entry error. Correcting this error is easier if it wasn’t primary key. (Whoever worked on data cleansing knows what I am talking about).
• It is almost impossible to horizontally partition table with identity column.
• It is not possible at all to insert data in the table with identity column as PK that has children as a set operation. It has to be done one row at the time.
• It is the child table where we can see the most benefit of surrogate key.
• If not handled properly identity can be the reason of data corruption
• Identity can be used to make clustered index key monotonously increasing integer. It makes all inserts faster and for tables that are rarely updated save us of data pages fragmentation. It doesn’t need to be primary key for this purpose. It also improves date-time range queries because data pages are physically sorted in chronological order.
• It is true that identity column that is clustered index is de facto rowid. That is the real purpose of clustered index to let us choose what is going to be rowid otherwise the system generates one for us. Making rowid primary key truly seems like heresy from logical design point of view, only it gives us a tool to compete relational database against mainframe. It is simply quantum leap in performance.
I'm the guy that Dale Preston argued with about PK usage. Recently meeting Dale, I wasn't aware that I was baited into wasting 2 hours (not including this posting) on this debate. This isn't a new argument. It is simply surrogate vs. natural/intelligent keys. I'll let you do the web and newsgroup searches on it. This is one of those issues that has a life of its own, kind of like an urban legend. It also has an internet gang, led by Celko, who help each other in forums like this one. It has divided academia, but has been long settled in the "real" business world. You have people who occassionally can't let go of natural keys, especially if they highly admire the book and/or professor they learned them from. Especially if they are a member of the above gang. But for the most part, they are only used in specialized situations, with "eyes wide open" to their disadvantages.
At Dale's current employer, we have had to clean up some messes created by an "individual" who continues to use natural keys. Unfortunately, the applications broke down as the identifying data, business rules, or requirements grew/changed. ("Oh! So that's why they invented surrogate keys!") As the database was poorly architected, the fix to the above changes were almost always expensive, and difficult to justify with the non-technical customer base (why should they have to pay for it?). Especially if the customer developed a false confidence in the previous so-called programmer/architect/dba/analyst.
There are too many usefull articles and newsgroup pile-ons to mention. Depending on your "real world" vs. academia background, some may be more readable than others. Your own web search and/or experience should justify.
If, after you have searched the issue out yourself, you still are unsure as to whether to use surrogate or natural keys, post a question to the forum of the database product you are using (i.e. MS SQL, Oracle, Sybase, DB2). If someone really needs it, I'll spend the time debunking and/or answering questions on this forum. But you'll have to post your request within the next two weeks, as I'll drop it from my radar after then.
I apologize if my annoyance is coming through. My goal is simply to walk you through this cycle quickly, so that you won't waste too much more of your time.
At Dale's current employer, we have had to clean up some messes created by an "individual" who continues to use natural keys. Unfortunately, the applications broke down as the identifying data, business rules, or requirements grew/changed. ("Oh! So that's why they invented surrogate keys!") As the database was poorly architected, the fix to the above changes were almost always expensive, and difficult to justify with the non-technical customer base (why should they have to pay for it?). Especially if the customer developed a false confidence in the previous so-called programmer/architect/dba/analyst.
There are too many usefull articles and newsgroup pile-ons to mention. Depending on your "real world" vs. academia background, some may be more readable than others. Your own web search and/or experience should justify.
If, after you have searched the issue out yourself, you still are unsure as to whether to use surrogate or natural keys, post a question to the forum of the database product you are using (i.e. MS SQL, Oracle, Sybase, DB2). If someone really needs it, I'll spend the time debunking and/or answering questions on this forum. But you'll have to post your request within the next two weeks, as I'll drop it from my radar after then.
I apologize if my annoyance is coming through. My goal is simply to walk you through this cycle quickly, so that you won't waste too much more of your time.
I think I have violated a principle that is probably more important than primary keys. People are more important than computers. Facts are facts, but that right there is the truth.
I want to apologize to Dave, and everyone viewing this blog. I'll see Dave tomorrow at work, and apologize in person there. I let my frustration and annoyance get the better of me, and it was wrong. Again I'm sorry.
I want to apologize to Dave, and everyone viewing this blog. I'll see Dave tomorrow at work, and apologize in person there. I let my frustration and annoyance get the better of me, and it was wrong. Again I'm sorry.
I came from a DEC Rdb and Oracle background before starting to work with Access and SQL Server a few years back. I was quite surprised at the insistance/preference from Microsoft's software that rows have ids rather than real primary keys.
It seemed to take the database one more step away from reality and violate normalization rules.
It took me quite a while to grok the clash between Oracle tradition Microsoft tradition. (I think these traditions may grow out of Microsoft's roots in small, easy to use, non-rigorous databases like Access, while Oracle came from high-end, purists.)
I have to agree that identity columns should not be added just to provide unique row identifiers, though they often are.
However, I believe that there is sometimes an argument for adding identifiers to business entities that don't otherwise have concise natural keys. But, these have to be reflected back in the business rules so that they become real primary keys, albeit artificial.
It seemed to take the database one more step away from reality and violate normalization rules.
It took me quite a while to grok the clash between Oracle tradition Microsoft tradition. (I think these traditions may grow out of Microsoft's roots in small, easy to use, non-rigorous databases like Access, while Oracle came from high-end, purists.)
I have to agree that identity columns should not be added just to provide unique row identifiers, though they often are.
However, I believe that there is sometimes an argument for adding identifiers to business entities that don't otherwise have concise natural keys. But, these have to be reflected back in the business rules so that they become real primary keys, albeit artificial.
Very interesting reading, Dale, and everybody else.
In general, I agree that using "real" data as the PK is the way to go, but I am uncomfortable making it as global a statement as Dale is making. In all of Celko's rants on this subject, IMHO, it all boils down to the data integrity issue. Using an IDENTITY column does not prohibit duplicate data being entered. And if you have duplicate data, then you have a problem. What if the rows are different? Which one is correct? This is the main and only legitimate argument to NOT use IDENTITY columns.
However, I can see situations where the above situation may not necessarily be undesired. I do a lot of work in the area of conversions, interfaces and integration.
Consider the following business scenario. Let's say we have a legacy system that pushes personnel data to a SQL database on a nightly basis. Let's say the mechanism to push the data is FTP and once the file is successfully written to the FTP site, the legacy system is done (it has no ability to accept an error message back if the actual import into the database fails because of bad data).
Let's further say that the client instructs you that the most important part of this integration is that "no messages get lost...EVER". This is not a fictitious scenario, by the way.
In this situation, if the PK were defined using SSN or anything other real data and a duplicate personnel record were sent. The ftp site would accept the message & the legacy system would be happy. The database import process would then error and legacy system would have no knowledge of said failure. You could argue that the import process should be smart enough to move the failed records to some error table or that local IT personnel should review error logs. Those are valid points but if, in our design, the SQL database should accurately reflect the data in the legacy system at all times, then the correct thing to do is import the second row even if it is duplicated. Then some error report would trigger the removal of one of the records from the source – the LEGACY system and the delete would then flow to the SQL system. If we reject the record upon import, then the systems are not in-synch.
Guys, this situation is not at all far fetched. Old, mainframe based legacy systems sending flat files to SQL systems is pretty common in my experience. And most of the time, the legacy system is still king and SQL must simply reflect what is in the old system. And…most legacy systems like this are far from relational databases.
Like I said in the beginning…I generally agree that INDENTITY columns shouldn’t be used as PK’s, but I also feel that the client requirements dictate the database design, not the other way around. A dba should not be telling the functional managers how their business should run in order to support the databases relational model.
Thanks,
Chris
In general, I agree that using "real" data as the PK is the way to go, but I am uncomfortable making it as global a statement as Dale is making. In all of Celko's rants on this subject, IMHO, it all boils down to the data integrity issue. Using an IDENTITY column does not prohibit duplicate data being entered. And if you have duplicate data, then you have a problem. What if the rows are different? Which one is correct? This is the main and only legitimate argument to NOT use IDENTITY columns.
However, I can see situations where the above situation may not necessarily be undesired. I do a lot of work in the area of conversions, interfaces and integration.
Consider the following business scenario. Let's say we have a legacy system that pushes personnel data to a SQL database on a nightly basis. Let's say the mechanism to push the data is FTP and once the file is successfully written to the FTP site, the legacy system is done (it has no ability to accept an error message back if the actual import into the database fails because of bad data).
Let's further say that the client instructs you that the most important part of this integration is that "no messages get lost...EVER". This is not a fictitious scenario, by the way.
In this situation, if the PK were defined using SSN or anything other real data and a duplicate personnel record were sent. The ftp site would accept the message & the legacy system would be happy. The database import process would then error and legacy system would have no knowledge of said failure. You could argue that the import process should be smart enough to move the failed records to some error table or that local IT personnel should review error logs. Those are valid points but if, in our design, the SQL database should accurately reflect the data in the legacy system at all times, then the correct thing to do is import the second row even if it is duplicated. Then some error report would trigger the removal of one of the records from the source – the LEGACY system and the delete would then flow to the SQL system. If we reject the record upon import, then the systems are not in-synch.
Guys, this situation is not at all far fetched. Old, mainframe based legacy systems sending flat files to SQL systems is pretty common in my experience. And most of the time, the legacy system is still king and SQL must simply reflect what is in the old system. And…most legacy systems like this are far from relational databases.
Like I said in the beginning…I generally agree that INDENTITY columns shouldn’t be used as PK’s, but I also feel that the client requirements dictate the database design, not the other way around. A dba should not be telling the functional managers how their business should run in order to support the databases relational model.
Thanks,
Chris
I have argued with Mr Celko quite often about this subject in the newsgroups, as well as others and I can never get any answer other than they are phyiscal locators, you also seems to hold. There is nothing truly physical about them other than how they are generated. Just because they make get used like this in the clustered index is a physical issue that means didley when we are using them in SQL. You can cluster on something else if you want to and still have an identity column. By this logic, does the clustered natural key is now a physical locator? For practical reasons we do have to worry with the physical nature of performance on occasion (like most days in a production environment.)
Every value has to come from somewhere, as as a surrogate value, identities come from some magic built into SQL Server. Better they implement the magic than me. The funny thing is I agree with most everything that is said about having verifiable natural keys, I just think that having a surrogate key ALSO is incredible valuable.
I extend this a bit on my blog in this post (http://spaces.msn.com/members/drsql/Blog/cns!1pqLLlQgw9hZKLV1K4eLatKw!440.entry) but the gist is that if you have an enforced natural key, then it works for me from a what is important point of view.
Every value has to come from somewhere, as as a surrogate value, identities come from some magic built into SQL Server. Better they implement the magic than me. The funny thing is I agree with most everything that is said about having verifiable natural keys, I just think that having a surrogate key ALSO is incredible valuable.
I extend this a bit on my blog in this post (http://spaces.msn.com/members/drsql/Blog/cns!1pqLLlQgw9hZKLV1K4eLatKw!440.entry) but the gist is that if you have an enforced natural key, then it works for me from a what is important point of view.
This blog was presented to the SSWUG.Org group. I sent this as a response to the editorial.
Of Keys, Indexes, Primary Keys and Identities...
Mr. Preston’s article is one of true DBA spirit. His statements, “A primary key, by definition, uniquely identifies an entity, not a data row” and “It is the primary key because that is what we are using to uniquely identify the data in the row - not the row, but the data in the row!” are so very true. As a Development DBA, I have tried to instill the mantra “if it does not describe the key, don’t put it in the table” theory with my own development teams. While some of these individuals are very senior level application programmers, it takes a bit of handholding to digest this rule. I would like to share an example of my primary key design for an Employee structure.
Currently, I am involved in a HR project to provide employee data to operational systems. The HR data is stored in a hosted solution and the company would like to use it for reporting, populating CRM and network systems, and automation of certain business processes. The company needs to protect, and rightly so, sensitive employee data from the curious and the malicious. To facilitate this, I have created an Employee schema that covers data from departments, teams and individuals to employee actions taken or milestones achieved. We have scheduled a pull of HR data using DTS from the hosted solution to this schema. My example, relating to Mr. Preston’s, is the Employee table itself. The hosted HR solution assigns each employee a number. So I have a decision to make. I could, if I followed the words of Mr. Preston’s co-worker, create an Identity column in the table that has no real relationship to the data and a second column or even table to hold the Employee number from the HR system. But why would I do this, to simply have an Identity? That type of logic is almost laughable. You have a unique key already, the Employee number. The Identity column would only serve as a row location and nothing else which would render it utterly useless. The Employee number is one of the most important pieces of data in that row, next to the person’s date of birth and SSN. It relates the Employee to payroll, benefits and in my integration all tasks, notes, and data changes to customer records. The data in the row describes the key; it is not an autonomous or unrelated value. As Mr. Preston said, try changing the Employee numbers before the next payroll and see what happens.
This same scenario is used for customer account numbers, sales codes, marketing tracking and many other identifiers by countless systems I have worked on and helped create.
Here is the key (sorry for the pun)
The data in your table should describe the key. If you cannot find a unique value in your table, it should probably be broken into more than one table.
I do not want to preach that Identity columns should never be used; indeed they can be very helpful. Truly, there are many times that I use and find value in them. Instead, I want to impart that there may be a more appropriate key in the table and that it should be used.
Of Keys, Indexes, Primary Keys and Identities...
Mr. Preston’s article is one of true DBA spirit. His statements, “A primary key, by definition, uniquely identifies an entity, not a data row” and “It is the primary key because that is what we are using to uniquely identify the data in the row - not the row, but the data in the row!” are so very true. As a Development DBA, I have tried to instill the mantra “if it does not describe the key, don’t put it in the table” theory with my own development teams. While some of these individuals are very senior level application programmers, it takes a bit of handholding to digest this rule. I would like to share an example of my primary key design for an Employee structure.
Currently, I am involved in a HR project to provide employee data to operational systems. The HR data is stored in a hosted solution and the company would like to use it for reporting, populating CRM and network systems, and automation of certain business processes. The company needs to protect, and rightly so, sensitive employee data from the curious and the malicious. To facilitate this, I have created an Employee schema that covers data from departments, teams and individuals to employee actions taken or milestones achieved. We have scheduled a pull of HR data using DTS from the hosted solution to this schema. My example, relating to Mr. Preston’s, is the Employee table itself. The hosted HR solution assigns each employee a number. So I have a decision to make. I could, if I followed the words of Mr. Preston’s co-worker, create an Identity column in the table that has no real relationship to the data and a second column or even table to hold the Employee number from the HR system. But why would I do this, to simply have an Identity? That type of logic is almost laughable. You have a unique key already, the Employee number. The Identity column would only serve as a row location and nothing else which would render it utterly useless. The Employee number is one of the most important pieces of data in that row, next to the person’s date of birth and SSN. It relates the Employee to payroll, benefits and in my integration all tasks, notes, and data changes to customer records. The data in the row describes the key; it is not an autonomous or unrelated value. As Mr. Preston said, try changing the Employee numbers before the next payroll and see what happens.
This same scenario is used for customer account numbers, sales codes, marketing tracking and many other identifiers by countless systems I have worked on and helped create.
Here is the key (sorry for the pun)
The data in your table should describe the key. If you cannot find a unique value in your table, it should probably be broken into more than one table.
I do not want to preach that Identity columns should never be used; indeed they can be very helpful. Truly, there are many times that I use and find value in them. Instead, I want to impart that there may be a more appropriate key in the table and that it should be used.
Louis,
By the way, I have your WROX book. And when I discussed this point first with my coworker, I said that I have never read anywhere about using disassociated values as the primary key.
I guess I had forgotten that from your book. Strange, too, that I would forget that since I always considered your book one of the strong influences in my always-follow-proper-design-rules way of thinking.
If your new book is as good, I recommend it.
By the way, I have your WROX book. And when I discussed this point first with my coworker, I said that I have never read anywhere about using disassociated values as the primary key.
I guess I had forgotten that from your book. Strange, too, that I would forget that since I always considered your book one of the strong influences in my always-follow-proper-design-rules way of thinking.
If your new book is as good, I recommend it.
I used to be a "business key" kind of guy. But I've evolved into a surrogate (identity) key person over the past 5-7 years. The biggest problem I have w/business keys is not at the high-level, they make some sense there (high level is things like customer, employee, item, etc.)
The problem comes in on the lower-level child rows. From a business point of view, the child rows will have the parent keys in their business keys. Grow this model out and you end up with business keys that are 15 to 20 fields long. Not only is that inefficient, but it's downright annoying from a usage point of view.
Using surrogate keys solves this problem. Sure, you _can_ access the row using the business key, but if you know what row you're looking for, you only need one column -- mich nicer!
-Walden
The problem comes in on the lower-level child rows. From a business point of view, the child rows will have the parent keys in their business keys. Grow this model out and you end up with business keys that are 15 to 20 fields long. Not only is that inefficient, but it's downright annoying from a usage point of view.
Using surrogate keys solves this problem. Sure, you _can_ access the row using the business key, but if you know what row you're looking for, you only need one column -- mich nicer!
-Walden
Walden,
Thanks for your comment.
You state that by using an identity column value you only need one column versus potentially 15 to 20 columns wide using natural or, as you call them, business columns.
Perhaps I am missing something. What do you mean by growing it out to 15 or 20 columns? Are you referring to a single parent table with a 15 to 20 column wide primary key?
If you're referring to having multiple parents referring to the same child, since it is the parent key that has to be in the child table and the parents would have different identity values, the number of columns required would not change.
Thanks for your comment.
You state that by using an identity column value you only need one column versus potentially 15 to 20 columns wide using natural or, as you call them, business columns.
Perhaps I am missing something. What do you mean by growing it out to 15 or 20 columns? Are you referring to a single parent table with a 15 to 20 column wide primary key?
If you're referring to having multiple parents referring to the same child, since it is the parent key that has to be in the child table and the parents would have different identity values, the number of columns required would not change.
I'm a bit confused by this surrogate example that has been used.
The distinction that is being made between a surrogate ID in a DB and an Employee ID in a business doesn't seem very distinct to me. Isn't an employee ID just a surrogate ID that was created outside the DB and then imported into it? It relates to payroll and benefits, because payroll and benefits didn't want to store someone's payment information by referencing their first name, last name, DOB and SSN, right? Isn't that the exact same situation we're facing here?
The distinction that is being made between a surrogate ID in a DB and an Employee ID in a business doesn't seem very distinct to me. Isn't an employee ID just a surrogate ID that was created outside the DB and then imported into it? It relates to payroll and benefits, because payroll and benefits didn't want to store someone's payment information by referencing their first name, last name, DOB and SSN, right? Isn't that the exact same situation we're facing here?
Hi Dale,
As an illustration of Walden's point, consider a sales reporting solution that stores quotas by time, geography, user role, customer type, and product category, and the quota can be either dollar or unit based. The natural key for this table has 6 keys:
TimePeriodID,
GeographyID,
RoleID,
CustomerTypeID,
ProductCategoryID
QuotaUnitTypeID
Walden is saying that he's seen databases where "pushing the keys down" has resulted in tables with 15 or 20 fields in the key, like if I needed to join the Quotas table to an equally complex UserRoleCustomerRank table (UserID, RoleID, CustomerTypeID, CustomerRankID) and that in those situations writing queries becomes a heck of a chore.
I agree with Walden from experience, but I would also say that having all those keys protects you in ways that are hard to appreciate at first. It's very hard to have data integrity errors when you're using natural keys, and in very complex systems where you can lose track of whether a certain insert is going to affect a table seven joins away, you're suddenly very thankful for the protection. The queries are definitely a hassle, but the system stops you from making subtle mistakes down the road, which is worth a fortune.
Post a Comment
As an illustration of Walden's point, consider a sales reporting solution that stores quotas by time, geography, user role, customer type, and product category, and the quota can be either dollar or unit based. The natural key for this table has 6 keys:
TimePeriodID,
GeographyID,
RoleID,
CustomerTypeID,
ProductCategoryID
QuotaUnitTypeID
Walden is saying that he's seen databases where "pushing the keys down" has resulted in tables with 15 or 20 fields in the key, like if I needed to join the Quotas table to an equally complex UserRoleCustomerRank table (UserID, RoleID, CustomerTypeID, CustomerRankID) and that in those situations writing queries becomes a heck of a chore.
I agree with Walden from experience, but I would also say that having all those keys protects you in ways that are hard to appreciate at first. It's very hard to have data integrity errors when you're using natural keys, and in very complex systems where you can lose track of whether a certain insert is going to affect a table seven joins away, you're suddenly very thankful for the protection. The queries are definitely a hassle, but the system stops you from making subtle mistakes down the road, which is worth a fortune.
<< Home





