Search This Blog & Web

Tuesday, May 29, 2012

Hash functions to improve string comparison and DB desgin


Scenario

I have gone through some difficult experience to compare strings much time during database design and Querying and think about we need to get some different solution to perform same operation. Many times there are requirements to create Varchar field as Primary column like Account Number in Chart of Account table in Accounts software or Comments column in any table with varchar(max) or (100) etc.

Problem

When we try to join string based column or use in our where clause even to find a single record with like or equals clause it increase our query cost. I came across same scenario in recent time to compare a string based column to just check whether it equals to parameter string value or not and it took 76% of total query cost. SQL SERVER Optimizer hints to create a cover index using Primary Key plus this column which improve query performance by at least 50% but at the same time I came across Hash Key function that satisfied my needs, storage and optimization. So I am going to discuss what is Hash key and how can we implement it using SQL SERVER 2008 and above. I have gathered information from many sources and msdn is one of them.

What is Hash Function

Hash is the value (int or varbinary) result of an algorithm (known as a hash function) applied to a given string. You just need to provide your string as input and you will get a unique hash value as an output. If we provide a complete page string to this function and then change just a character to pass value to it. It will return different values. There is small possibility to repeat same value and this will known as hash collision.

Where can I use Hash Function?

·         Security implementation as encryption of string data.
·         Reduce network traffic for cross DB queries because of small size required to compare instead of string values.
·         Less space requires to store and campare values because it returns int and binary values instead of string characters.
·         Performance increase due to small db types and indexes uses if we implement index on it.
·         Avoid whole string comparisons and requires comparing just int values as checksum.
·         If we create hash column in our DB design along with string columns then it is easy for us to implement joins on these column that will act like string joins as these are hash values of string values.

Careful using Hash Function

·         Hash collision may return more than your expected result set by returning same hash value. For this reason, CheckSum might return more values then you expected but if you want to use this column to identify your column changes then consider HashBytes. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
·          
Types of Hash function

SQL SERVER provides multiple hash functions like CheckSum, CheckSum_Binary, HashBytes etc. I am going to put example against CheckSum and HashBytes to just shows the function of hash values in logic.

Hash Keys in Database Design

You can efficiently design your table to use Hash key columns along with regular string column but it depends on your string column requirements in queries. Like in following table design I have created a varchar(1000) column and store 2 row in it. When we try to compare run query on this table to some value from String1 column it will compare as string and consume 100s of bytes of space and resources as well.

===================== Starts =======================
--- drop table if already exists
drop table Hashfunction
go

create table Hashfunction
(id int identity primary key,
            string1 varchar(1000),   
            checksumCol as CHECKSUM(STRING1) ,
            HashByteCol as HashBytes('SHA1',string1)  )

Insert into Hashfunction(string1)
values ('Returns the checksum value computed over a row of a table, or over a list of expressions. CHECKSUM is intended for use in building hash indexes.'),
('Returns the MD2, MD4, MD5, SHA, or SHA1 hash of its input.')
===================== Code Ends =======================

If you look at the CheckSumCol and HashByteCol. I have added two computed columns to add hash key values and if you look at their parameter passed is String1 column it means CheckSum and HashByte column will return hash values against String values for both rows as shown in following picture




CHECKSUM

Returns the checksum value computed over a row of a table, or over a list of expressions. Output of checksum value is integer. CHECKSUM is intended for use in building hash indexes. You can create Checksum value for more than one column like

select CHECKSUM('this is my test','this is checksum test')
returns -506357463

CHECKSUM returns an error if any column is of noncomparable data type. Noncomparable data types are text, ntext, image, XML, and cursor, and also sql_variant with any one of the preceding types as its base type.
CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. If one of the values in the expression list changes, the checksum of the list also generally changes. The order of expressions affects the resultant value of CHECKSUM.

The CHECKSUM value is dependent upon the collation. The same value stored with a different collation will return a different CHECKSUM value.

Hash Indexes

CHECKSUM computes a hash value, called the checksum, over its list of columns or arguments. The hash value is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an index is built over the computed CHECKSUM value, the result is a hash index. This can be used for equality searches over the columns.

---- Hash Indexe on CheckSum int Column
CREATE INDEX Hashfunction_index ON Hashfunction (CheckSumCol);

The checksum index can be used as a hash index, particularly to improve indexing speed when the column to be indexed is a long character column. The checksum index can be used for equality searches.

--- Comparision2 as Int
select * from Hashfunction where CheckSumCol =  CHECKSUM('Returns the checksum value computed over a row of a table, or over a list of expressions. CHECKSUM is intended for use in building hash indexes.')

Creating the index on the computed column materializes the checksum column, and any changes to the column value will effect to the computed column. Alternatively, an index could be built directly on the column indexed.

HASHBYTES

Despite of CheckSum function that returns int value hashbyte returns varbinary value max up to 8000 bytes. There are extreamly less chances of collision in hashbytes instead of checksum. Hash byte function use hash algorithms MD2 | MD4 | MD5 | SHA | SHA1
The output conforms to the algorithm standard: 128 bits (16 bytes) for MD2, MD4, and MD5; 160 bits (20 bytes) for SHA and SHA1.
In above created table you can see HashBytes computed column in Hashfunction table. You can use it as following
--- Comparision3 as VarBinary
select * from Hashfunction where HashByteCol = HASHBYTES('SHA1','Returns the checksum value computed over a row of a table, or over a list of expressions. CHECKSUM is intended for use in building hash indexes.')

Allowed input values are limited to 8000 bytes. If you pass values more than 8000 values it will generate hash value only for first 8000 characters like

DECLARE @v varchar(max);
SET @v = REPLICATE ('A', 9000); -- Input longer than 8,000 bytes  SELECT LEN(@v) AS VarcharLength, HashBytes('SHA1', @v) AS HashValue;SET @v = REPLICATE ('A', 8000); -- Input exactly 8,000 bytes  SELECT LEN(@v) AS VarcharLength, HashBytes('SHA1', @v) AS HashValue;

The output is as follows:
VarcharLength        HashValue 
8000                0xD2967D6425E56C18BA979EEFB4E0DBD1269D9BC9
8000                0xD2967D6425E56C18BA979EEFB4E0DBD1269D9BC9

No comments: