Monday, September 8, 2008

what is Surrogate Keys ?

the following articale from this link
http://decipherinfosys.wordpress.com/2007/02/01/surrogate-keys-vs-natural-keys-for-primary-key/

Surrogate Keys vs Natural Keys for Primary Key?

This topic probably is one of those that you cannot get any two database developers/DBAs to agree upon. Everyone has their own opinion about this and it is also one of the most discussed topics over the web when it comes to data modeling. Rather than taking any side :-), we are just listing out our experiences when it comes to chosing between a surrogate key vs the natural keys for the tables.0

Surrogate Key:

Surrogate keys are keys that have no “business” meaning and are solely used to identify a record in the table. Such keys are either database generated (example: Identity in SQL Server, Sequence in Oracle, Sequence/Identity in DB2 UDB etc.) or system generated values (like generated via a table in the schema).

Natural Key:

Keys are natural if the attribute it represents is used for identification independently of the database schema. What this basically means is that the keys are natural if people use them example: Invoice-Numbers, Tax-Ids, SSN etc.

Design considerations for choosing the Primary Key:
Primary Key should meet the following requirements:

- It should be not null, Unique and should apply to all rows.

- It should be minimal (i.e. less number of columns in the PK: ideally it should be 1, if using composite keys, then make sure that those are surrogates and using integer family data-types).

- It should be stable over a period of time (should not change i.e. update to the PK columns should not happen).

Keeping these in mind, here are the pros and cons of Surrogate vs. Natural keys:

Surrogate Key

I prefer surrogate keys to be DB controlled rather than being controlled via a next-up table in the schema since that is a more scalable approach.

Pros:

- Business Logic is not in the keys.

- Small 4-byte key (the surrogate key will most likely be an integer and SQL Server for example requires only 4 bytes to store it, if a bigint, then 8 bytes).

- Joins are very fast.

- No locking contentions because of unique constraint (this refers to the waits that get developed when two sessions are trying to insert the same unique business key) as the surrogates get generated by the DB and are cached - very scalable.

Cons:

- An additional index is needed. In SQL Server, the PK constraint will always creates a unique index, in Oracle, if an index already exists, PK creation will use that index for uniqueness enforcement (not a con in Oracle).

- Cannot be used as a search key.

- If it is database controlled, for products that support multiple databases, different implementations are needed, example: identity in SS2k, before triggers and sequences in Oracle, identity/sequence in DB2 UDB.

- Always requires a join when browsing the child table(s).

Natural Key

Pros:

- No additional Index.

- Can be used as a search key.

Cons:

- If not chosen wisely (business meaning in the key(s)), then over a period of time additions may be required to the PK and modifications to the PK can occur.

- If using strings, joins are a bit slower as compared to the int data-type joins, storage is more as well. Since storage is more, less data-values get stored per index page. Also, reading strings is a two step process in some RDBMS: one to get the actual length of the string and second to actually perform the read operation to get the value.

- Locking contentions can arise if using application driven generation mechanism for the key.

- Can’t enter a record until value is known since the value has some meaning.

Choosing Surrogate vs. Natural Keys:

There is no rule of thumb in this case. It has to be evaluated table by table:

- If we can identify an appropriate natural key that meets the three criteria for it to be a PK column, we should use it. Look-up tables and configuration tables are typically ok.

- Data-Type for the PK: the smaller the better, choose an integer or a short-character data type. It also ensures that the joins will be faster. This becomes even more important if you are going to make the PK as a clustered index since non-clustered indexes are built off the clustered index. RDBMS processes integer data values faster than the character data values because it converts characters to ASCII equivalent values before processing, which is an extra step.

No comments: