I am an extreme moderate

June 10, 2012

xxhash256 update

Filed under: Uncategorized — niezmierniespokojny @ 11:41 am

One of the drawbacks of the previous version of the hash was limited avalanching, each bit of the input could affect at most 64 bits of the output.
I decided to fix it by adding slight interactions between internal state variables. The process is slow and has high latency, but the function was designed for large inputs, so it doesn’t matter much. Also, I changed the main loop so that multiplications occur less often, so it’s something between the strong and the fast hash with a better avalanching.
To make the interactions really small, I unrolled the main loop and the code got faster in the process, clocking at 3024 MB/s (previously – the fast version did 3008 and the fast one – 2751 MB/s), so I obsoleted the fast hash. Well, I could unroll it too, but I don’t think it’s worthwhile.

/*
   xxHash256 - A fast checksum algorithm
   Copyright (C) 2012, Yann Collet & Maciej Adamczyk.
   BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:

       * Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
       * Redistributions in binary form must reproduce the above
   copyright notice, this list of conditions and the following disclaimer
   in the documentation and/or other materials provided with the
   distribution.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

************************************************************************
   This file contains a super-fast hash function for checksumming
   purposes, designed for large (1 KB++) inputs.
   Known limitations:
    * they use 64-bit math and will not be so fast on 32-bit machines
    * on architectures that disallow unaligned memory access, the input
      must be 8-byte aligned. Aligning it on other architectures
      is not needed, but will improve performance.
    * it produces different results on big and small endian.
  
   Changelog:
    v0: initial release
    v1: the strong hash is faster and stronger, it obsoletes the fast one
*/

//**************************************
// Includes
//**************************************
#include "stddef.h"
#include "stdint.h"
#include "string.h"


//**************************************
// Macros
//**************************************
#define _rotl(x,r) ((x << r) | (x >> (64 - r)))

//**************************************
// Constants
//**************************************
#define PRIME 11400714819323198393ULL

//******************************
// Hash functions
//******************************
void XXH_256(const void* input, size_t len, uint64_t* out)
{
    const uint8_t* p = (uint8_t*)input;
    const uint8_t* const bEnd = p + len;
    uint64_t v1 = len * PRIME;
    uint64_t v2 = v1;
    uint64_t v3 = v1;
    uint64_t v4 = v1;

    const size_t big_loop_step = 4 * 4 * sizeof(uint64_t);
    const size_t small_loop_step = 4 * sizeof(uint64_t);
    // Set the big loop limit early enough, so the well-mixing small loop can be executed twice after it
    const uint8_t* const big_loop_limit   = bEnd - big_loop_step - 2 * small_loop_step;
    const uint8_t* const small_loop_limit = bEnd - small_loop_step;

    while (p < big_loop_limit)
    {
        v1 = _rotl(v1, 29) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v2 = _rotl(v2, 31) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v3 = _rotl(v3, 33) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v4 = _rotl(v4, 35) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v1 += v2 *= PRIME;
        v1 = _rotl(v1, 29) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v2 = _rotl(v2, 31) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v3 = _rotl(v3, 33) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v4 = _rotl(v4, 35) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v2 += v3 *= PRIME;
        v1 = _rotl(v1, 29) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v2 = _rotl(v2, 31) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v3 = _rotl(v3, 33) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v4 = _rotl(v4, 35) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v3 += v4 *= PRIME;
        v1 = _rotl(v1, 29) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v2 = _rotl(v2, 31) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v3 = _rotl(v3, 33) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v4 = _rotl(v4, 35) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v4 += v1 *= PRIME;
    }

    while (p < small_loop_limit)
    {
        v1 = _rotl(v1, 29) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v2 += v1 *= PRIME;
        v2 = _rotl(v2, 31) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v3 += v2 *= PRIME;
        v3 = _rotl(v3, 33) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v4 += v3 *= PRIME;
        v4 = _rotl(v4, 35) + (*(uint64_t*)p); p+=sizeof(uint64_t);
        v1 += v4 *= PRIME;
    }
    memcpy(out, p, bEnd - p);

    out[0] += v1;
    out[1] += v2;
    out[2] += v3;
    out[3] += v4;
}

5 Comments »

  1. Hi Maciej,
    thanks to your testbench I obtained valuable info on how some fast 32/64 hashers behave on my laptop T7500 2200MHz (Everest says 5395MB/s Memory Read):

    I wrote revision 3 of FNV1A_Tesla as 64bit counterpart of my latest DIAMANTINE slasher FNV1A_Yoshimura.

    As console screenshots:

    As console text dumps:

    E:\DOUBLOON_hash_micro-package_r3>RUNME_64bit.BAT

    E:\DOUBLOON_hash_micro-package_r3>benchmark_Intel_12.1_O2.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
    memcpy: 108 ms, 209715202 bytes = 1851 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    CityHash64 1.0.3
    209715210 (x 1.000) 3333 MB/s 3389 MB/s 273e15 277e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 4081 MB/s 4081 MB/s 334e15 334e15
    fnv1a-jesteress v2
    209715206 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    fnv1a-yoshimura v2
    209715206 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    fnv1a-tesla3 v2
    209715210 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    xxhash-fast r3
    209715206 (x 1.000) 4000 MB/s 4000 MB/s 327e15 327e15
    xxhash-strong r3
    209715206 (x 1.000) 2816 MB/s 2816 MB/s 230e15 230e15
    xxhash256 r3
    209715234 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    E:\DOUBLOON_hash_micro-package_r3>benchmark_Intel_12.1_O3.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
    memcpy: 109 ms, 209715202 bytes = 1834 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 3278 MB/s 3333 MB/s 268e15 273e15
    CityHash64 1.0.3
    209715210 (x 1.000) 3333 MB/s 3278 MB/s 273e15 268e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 3278 MB/s 3278 MB/s 268e15 268e15
    fnv1a-jesteress v2
    209715206 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    fnv1a-yoshimura v2
    209715206 (x 1.000) 3921 MB/s 3921 MB/s 321e15 321e15
    fnv1a-tesla3 v2
    209715210 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    xxhash-fast r3
    209715206 (x 1.000) 3636 MB/s 3636 MB/s 297e15 297e15
    xxhash-strong r3
    209715206 (x 1.000) 2777 MB/s 2777 MB/s 227e15 227e15
    xxhash256 r3
    209715234 (x 1.000) 3773 MB/s 3773 MB/s 309e15 309e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    E:\DOUBLOON_hash_micro-package_r3>benchmark_Intel_12.1_fast.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
    memcpy: 110 ms, 209715202 bytes = 1818 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 2380 MB/s 2380 MB/s 195e15 195e15
    CityHash64 1.0.3
    209715210 (x 1.000) 2105 MB/s 2105 MB/s 172e15 172e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 3508 MB/s 3508 MB/s 287e15 287e15
    fnv1a-jesteress v2
    209715206 (x 1.000) 3389 MB/s 3389 MB/s 277e15 277e15
    fnv1a-yoshimura v2
    209715206 (x 1.000) 4000 MB/s 4000 MB/s 327e15 327e15
    fnv1a-tesla3 v2
    209715210 (x 1.000) 4255 MB/s 4255 MB/s 348e15 348e15
    xxhash-fast r3
    209715206 (x 1.000) 3773 MB/s 3773 MB/s 309e15 309e15
    xxhash-strong r3
    209715206 (x 1.000) 2777 MB/s 2777 MB/s 227e15 227e15
    xxhash256 r3
    209715234 (x 1.000) 3921 MB/s 3921 MB/s 321e15 321e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    E:\DOUBLOON_hash_micro-package_r3>benchmark_Microsoft_VS2010_Ox.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
    memcpy: 111 ms, 209715202 bytes = 1801 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 4444 MB/s 4444 MB/s 364e15 364e15
    CityHash64 1.0.3
    209715210 (x 1.000) 4255 MB/s 4255 MB/s 348e15 348e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 4081 MB/s 4081 MB/s 334e15 334e15
    fnv1a-jesteress v2
    209715206 (x 1.000) 3333 MB/s 3278 MB/s 273e15 268e15
    fnv1a-yoshimura v2
    209715206 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    fnv1a-tesla3 v2
    209715210 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    xxhash-fast r3
    209715206 (x 1.000) 4255 MB/s 4255 MB/s 348e15 348e15
    xxhash-strong r3
    209715206 (x 1.000) 2857 MB/s 2857 MB/s 234e15 234e15
    xxhash256 r3
    209715234 (x 1.000) 4255 MB/s 4255 MB/s 348e15 348e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    E:\DOUBLOON_hash_micro-package_r3>

    The benchmark:
    http://www.sanmayce.com/Fastest_Hash/DOUBLOON_hash_micro-package_r3.zip

    Thanks again
    Regards

    Comment by sanmayce — March 16, 2013 @ 4:59 pm

  2. Hi, allow me to provide some feedback:

    Test: fsbench-0.13.3 modified a little by Kaze
    OS: Windows 7 x64
    Machine: Laptop Core 2 T7500 2.2GHz
    Note: All the four executables compiled as 64bit
    Download: http://www.sanmayce.com/Downloads/fsbench-0.13.3_Kaze.7z

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>dir main*.exe
    Volume in drive D is S640_Vol5
    Volume Serial Number is F85D-148B

    Directory of D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src

    06/04/2013 09:21 PM 421,888 main_Intel_12.1_fast.exe
    06/04/2013 09:20 PM 549,888 main_Intel_12.1_O2.exe
    06/04/2013 09:20 PM 556,544 main_Intel_12.1_O3.exe
    06/04/2013 09:24 PM 343,552 main_Microsoft_VS2010_Ox.exe
    4 File(s) 1,871,872 bytes
    0 Dir(s) 31,017,021,440 bytes free

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>COMPILEME_RUNME_Microsoft.BAT

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>if exist main_Microsoft_VS2010_Ox.exe goto bumshakalaka

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>main_Microsoft_VS2010_Ox.exe CityHash128 CityHash64 SpookyHash xxhash xxhash256 murmur3_x86_32 murmur_x64_128 FNV1a-Jesteress FNV1a-Tesla FNV1a-Tesla3 FNV1a-YoshimitsuTRIADii FNV1a-YoshimitsuTRIADiiXMM FNV1a-Yoshimura -i77 200MB_as_one_line.TXT
    memcpy: 113 ms, 209715202 bytes = 1769 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 4081 MB/s 4081 MB/s 334e15 334e15
    CityHash64 1.0.3
    209715210 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 4081 MB/s 4081 MB/s 334e15 334e15
    xxhash r29
    209715206 (x 1.000) 3389 MB/s 3389 MB/s 277e15 277e15
    xxhash256 1
    209715234 (x 1.000) 4255 MB/s 4255 MB/s 348e15 348e15
    murmur3_x86_32 2012-02-29
    209715206 (x 1.000) 1851 MB/s 1851 MB/s 151e15 151e15
    murmur_x64_128 2012-02-29
    209715218 (x 1.000) 2816 MB/s 2816 MB/s 230e15 230e15
    FNV1a-Jesteress 2013-05-12
    209715206 (x 1.000) 3278 MB/s 3278 MB/s 268e15 268e15
    FNV1a-Tesla 2013-05-12
    209715210 (x 1.000) 4651 MB/s 4651 MB/s 381e15 381e15
    FNV1a-Tesla3 2013-05-12
    209715210 (x 1.000) 4347 MB/s 4444 MB/s 356e15 364e15
    FNV1a-YoshimitsuTRIADii 2013-05-12
    209715206 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    FNV1a-YoshimitsuTRIADiiXMM 2013-05-12
    209715206 (x 1.000) 4545 MB/s 4545 MB/s 372e15 372e15
    FNV1a-Yoshimura 2013-05-12
    209715210 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>COMPILEME_RUNME_Intel.BAT

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>if exist main_Intel_12.1_O2.exe goto bumshakalaka

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>main_Intel_12.1_O2.exe CityHash128 CityHash64 SpookyHash xxhash xxhash256 murmur3_x86_32 murmur_x64_128 FNV1a-Jesteress FNV1a-Tesla FNV1a-Tesla3 FNV1a-YoshimitsuTRIADii FNV1a-YoshimitsuTRIADiiXMM FNV1a-Yoshimura -i77 200MB_as_one_line.TXT
    memcpy: 108 ms, 209715202 bytes = 1851 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 3333 MB/s 3389 MB/s 273e15 277e15
    CityHash64 1.0.3
    209715210 (x 1.000) 3448 MB/s 3448 MB/s 282e15 282e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    xxhash r29
    209715206 (x 1.000) 3448 MB/s 3448 MB/s 282e15 282e15
    xxhash256 1
    209715234 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    murmur3_x86_32 2012-02-29
    209715206 (x 1.000) 1538 MB/s 1538 MB/s 126e15 126e15
    murmur_x64_128 2012-02-29
    209715218 (x 1.000) 2150 MB/s 2150 MB/s 176e15 176e15
    FNV1a-Jesteress 2013-05-12
    209715206 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    FNV1a-Tesla 2013-05-12
    209715210 (x 1.000) 4651 MB/s 4651 MB/s 381e15 381e15
    FNV1a-Tesla3 2013-05-12
    209715210 (x 1.000) 4444 MB/s 4444 MB/s 364e15 364e15
    FNV1a-YoshimitsuTRIADii 2013-05-12
    209715206 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    FNV1a-YoshimitsuTRIADiiXMM 2013-05-12
    209715206 (x 1.000) 4545 MB/s 4545 MB/s 372e15 372e15
    FNV1a-Yoshimura 2013-05-12
    209715210 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>main_Intel_12.1_O3.exe CityHash128 CityHash64 SpookyHash xxhash xxhash256 murmur3_x86_32 murmur_x64_128 FNV1a-Jesteress FNV1a-Tesla FNV1a-Tesla3 FNV1a-YoshimitsuTRIADii FNV1a-YoshimitsuTRIADiiXMM FNV1a-Yoshimura -i77 200MB_as_one_line.TXT
    memcpy: 108 ms, 209715202 bytes = 1851 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 3225 MB/s 3225 MB/s 264e15 264e15
    CityHash64 1.0.3
    209715210 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 3225 MB/s 3225 MB/s 264e15 264e15
    xxhash r29
    209715206 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    xxhash256 1
    209715234 (x 1.000) 3773 MB/s 3773 MB/s 309e15 309e15
    murmur3_x86_32 2012-02-29
    209715206 (x 1.000) 1515 MB/s 1515 MB/s 124e15 124e15
    murmur_x64_128 2012-02-29
    209715218 (x 1.000) 2105 MB/s 2105 MB/s 172e15 172e15
    FNV1a-Jesteress 2013-05-12
    209715206 (x 1.000) 3333 MB/s 3333 MB/s 273e15 273e15
    FNV1a-Tesla 2013-05-12
    209715210 (x 1.000) 4081 MB/s 4081 MB/s 334e15 334e15
    FNV1a-Tesla3 2013-05-12
    209715210 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    FNV1a-YoshimitsuTRIADii 2013-05-12
    209715206 (x 1.000) 3636 MB/s 3636 MB/s 297e15 297e15
    FNV1a-YoshimitsuTRIADiiXMM 2013-05-12
    209715206 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    FNV1a-Yoshimura 2013-05-12
    209715210 (x 1.000) 3921 MB/s 3921 MB/s 321e15 321e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>main_Intel_12.1_fast.exe CityHash128 CityHash64 SpookyHash xxhash xxhash256 murmur3_x86_32 murmur_x64_128 FNV1a-Jesteress FNV1a-Tesla FNV1a-Tesla3 FNV1a-YoshimitsuTRIADii FNV1a-YoshimitsuTRIADiiXMM FNV1a-Yoshimura -i77 200MB_as_one_line.TXT
    memcpy: 108 ms, 209715202 bytes = 1851 MB/s
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    CityHash128 1.0.3
    209715218 (x 1.000) 2409 MB/s 2409 MB/s 197e15 197e15
    CityHash64 1.0.3
    209715210 (x 1.000) 2127 MB/s 2150 MB/s 174e15 176e15
    SpookyHash 2012-03-30
    209715218 (x 1.000) 3636 MB/s 3636 MB/s 297e15 297e15
    xxhash r29
    209715206 (x 1.000) 3448 MB/s 3448 MB/s 282e15 282e15
    xxhash256 1
    209715234 (x 1.000) 4166 MB/s 4166 MB/s 341e15 341e15
    murmur3_x86_32 2012-02-29
    209715206 (x 1.000) 1538 MB/s 1538 MB/s 126e15 126e15
    murmur_x64_128 2012-02-29
    209715218 (x 1.000) 2150 MB/s 2150 MB/s 176e15 176e15
    FNV1a-Jesteress 2013-05-12
    209715206 (x 1.000) 3389 MB/s 3389 MB/s 277e15 277e15
    FNV1a-Tesla 2013-05-12
    209715210 (x 1.000) 4545 MB/s 4545 MB/s 372e15 372e15
    FNV1a-Tesla3 2013-05-12
    209715210 (x 1.000) 4347 MB/s 4347 MB/s 356e15 356e15
    FNV1a-YoshimitsuTRIADii 2013-05-12
    209715206 (x 1.000) 3773 MB/s 3773 MB/s 309e15 309e15
    FNV1a-YoshimitsuTRIADiiXMM 2013-05-12
    209715206 (x 1.000) 4444 MB/s 4444 MB/s 364e15 364e15
    FNV1a-Yoshimura 2013-05-12
    209715210 (x 1.000) 4081 MB/s 4081 MB/s 334e15 334e15
    Codec version args
    C.Size (C.Ratio) C.Speed D.Speed C.Eff. D.Eff.
    done… (77×1 iteration(s)).

    D:\Downloads\_2013_Jun-03\fsbench-0.13.3\src>

    I am confused, for some reason above speeds defy ‘logic’ (or better said ‘ignorance’), 32bit code (for some functions) behaves much better than their 64bit counterpart?! I spotted that ‘abnormality’ in my strstr() variants as well.
    Personally, my favorite is ‘FNV1a-YoshimitsuTRIADii’, it is the fastest 32bit hash I have ever seen, yet with /O3 and /fast it disappoints, grmbl.

    Maciek, I want to salute you with a song that made my 2012 year complete: http://www.youtube.com/watch?v=QyQahb7PXak
    What Ioanna did I call ‘immortal’: ‘… tell the salt marsh …’ – transcends all the trash.

    Comment by sanmayce — June 5, 2013 @ 12:21 pm

  3. Hi again, to make speed dumps more interesting I updated my micro-fork of your bench (hash part):

    Test: fsbench-0.13.3 modified a little by Kaze
    OS: Windows 7 x64
    Machine: Laptop Core 2 T7500 2.2GHz
    Note: All the four/four executables compiled as 32bit/64bit
    Download: http://www.sanmayce.com/Downloads/fsbench-0.13.3_Kaze.7z
    Results: http://www.sanmayce.com/Downloads/fsbench-0.13.3_Kaze_CONSOLE_T7500.txt
    Results: http://www.sanmayce.com/Downloads/fsbench-0.13.3_Kaze_CONSOLE_T7500_sorted.txt

    Below, I attempted to answer the dummy question ‘What function/compiler/option/width quad allows maximum linear speed?’:

    murmur_x64_128 200MB 0187 MB/s 0187 MB/s Intel_12.1_O3_32bit
    murmur_x64_128 200MB 0188 MB/s 0187 MB/s Intel_12.1_O2_32bit
    murmur_x64_128 200MB 0190 MB/s 0189 MB/s Intel_12.1_fast_32bit
    murmur_x64_128 200MB 0372 MB/s 0372 MB/s Microsoft_VS2010_Ox_32bit
    CityHash64 200MB 0421 MB/s 0421 MB/s Intel_12.1_fast_32bit
    CityHash128 200MB 0438 MB/s 0436 MB/s Intel_12.1_fast_32bit
    CityHash128 200MB 0643 MB/s 0643 MB/s Microsoft_VS2010_Ox_32bit
    CityHash64 200MB 0696 MB/s 0694 MB/s Intel_12.1_O3_32bit
    CityHash64 200MB 0699 MB/s 0696 MB/s Intel_12.1_O2_32bit
    CityHash64 200MB 0699 MB/s 0699 MB/s Microsoft_VS2010_Ox_32bit
    CityHash128 200MB 0706 MB/s 0709 MB/s Intel_12.1_O2_32bit
    CityHash128 200MB 0709 MB/s 0706 MB/s Intel_12.1_O3_32bit
    vhash 200MB 0727 MB/s 0763 MB/s Intel_12.1_O3_32bit
    vhash 200MB 0729 MB/s 0729 MB/s Microsoft_VS2010_Ox_32bit
    fletcher4 200MB 0790 MB/s 0796 MB/s Microsoft_VS2010_Ox_32bit
    vhash 200MB 0826 MB/s 0819 MB/s Intel_12.1_O2_32bit
    vhash 200MB 0858 MB/s 0862 MB/s Intel_12.1_fast_32bit
    uhash 200MB 0888 MB/s 0896 MB/s Intel_12.1_fast_32bit
    uhash 200MB 0909 MB/s 0909 MB/s Intel_12.1_O3_32bit
    uhash 200MB 0917 MB/s 0921 MB/s Intel_12.1_O2_32bit
    fletcher4 200MB 0930 MB/s 0938 MB/s Intel_12.1_O3_32bit
    fletcher4 200MB 0938 MB/s 0938 MB/s Intel_12.1_fast_32bit
    uhash 200MB 1041 MB/s 1041 MB/s Microsoft_VS2010_Ox_64bit
    fletcher4 200MB 1069 MB/s 1069 MB/s Intel_12.1_O2_32bit
    uhash 200MB 1081 MB/s 1081 MB/s Intel_12.1_fast_64bit
    uhash 200MB 1104 MB/s 1104 MB/s Intel_12.1_O2_64bit
    uhash 200MB 1104 MB/s 1104 MB/s Intel_12.1_O3_64bit
    uhash 200MB 1156 MB/s 1156 MB/s Microsoft_VS2010_Ox_32bit
    FNV1a-Tesla3 200MB 1257 MB/s 1257 MB/s Microsoft_VS2010_Ox_32bit
    FNV1a-Tesla3 200MB 1315 MB/s 1333 MB/s Intel_12.1_fast_32bit
    FNV1a-Tesla3 200MB 1324 MB/s 1315 MB/s Intel_12.1_O2_32bit
    FNV1a-Tesla3 200MB 1333 MB/s 1333 MB/s Intel_12.1_O3_32bit
    FNV1a-Tesla 200MB 1379 MB/s 1388 MB/s Intel_12.1_O3_32bit
    FNV1a-Tesla 200MB 1388 MB/s 1418 MB/s Intel_12.1_fast_32bit
    SpookyHash 200MB 1408 MB/s 1408 MB/s Microsoft_VS2010_Ox_32bit
    FNV1a-Tesla 200MB 1470 MB/s 1470 MB/s Microsoft_VS2010_Ox_32bit
    SpookyHash 200MB 1470 MB/s 1481 MB/s Intel_12.1_O3_32bit
    FNV1a-Tesla 200MB 1515 MB/s 1503 MB/s Intel_12.1_O2_32bit
    murmur3_x86_32 200MB 1515 MB/s 1515 MB/s Intel_12.1_O3_64bit
    murmur3_x86_32 200MB 1526 MB/s 1526 MB/s Intel_12.1_O2_64bit
    murmur3_x86_32 200MB 1550 MB/s 1550 MB/s Intel_12.1_fast_64bit
    murmur3_x86_32 200MB 1550 MB/s 1550 MB/s Intel_12.1_O3_32bit
    murmur3_x86_32 200MB 1562 MB/s 1562 MB/s Intel_12.1_fast_32bit
    murmur3_x86_32 200MB 1562 MB/s 1562 MB/s Intel_12.1_O2_32bit
    xxhash256 200MB 1587 MB/s 1587 MB/s Microsoft_VS2010_Ox_32bit
    SpookyHash 200MB 1612 MB/s 1612 MB/s Intel_12.1_fast_32bit
    SpookyHash 200MB 1724 MB/s 1724 MB/s Intel_12.1_O2_32bit
    murmur3_x86_32 200MB 1851 MB/s 1851 MB/s Microsoft_VS2010_Ox_64bit
    xxhash256 200MB 1851 MB/s 1869 MB/s Intel_12.1_O3_32bit
    xxhash256 200MB 1869 MB/s 1869 MB/s Intel_12.1_fast_32bit
    xxhash256 200MB 1886 MB/s 1886 MB/s Intel_12.1_O2_32bit
    murmur3_x86_32 200MB 1886 MB/s 1886 MB/s Microsoft_VS2010_Ox_32bit
    CrapWow 200MB 1904 MB/s 1904 MB/s Intel_12.1_O3_32bit
    CrapWow 200MB 1923 MB/s 1923 MB/s Intel_12.1_fast_32bit
    CrapWow 200MB 2083 MB/s 2083 MB/s Intel_12.1_fast_64bit
    murmur_x64_128 200MB 2150 MB/s 2150 MB/s Intel_12.1_O2_64bit
    murmur_x64_128 200MB 2173 MB/s 2173 MB/s Intel_12.1_fast_64bit
    murmur_x64_128 200MB 2173 MB/s 2173 MB/s Intel_12.1_O3_64bit
    CityHash64 200MB 2197 MB/s 2197 MB/s Intel_12.1_fast_64bit
    CrapWow 200MB 2197 MB/s 2197 MB/s Intel_12.1_O2_32bit
    CrapWow 200MB 2325 MB/s 2325 MB/s Microsoft_VS2010_Ox_32bit
    CityHash128 200MB 2439 MB/s 2439 MB/s Intel_12.1_fast_64bit
    CrapWow 200MB 2564 MB/s 2564 MB/s Intel_12.1_O2_64bit
    vhash 200MB 2631 MB/s 2631 MB/s Intel_12.1_O3_64bit
    vhash 200MB 2702 MB/s 2666 MB/s Intel_12.1_O2_64bit
    vhash 200MB 2739 MB/s 2739 MB/s Intel_12.1_fast_64bit
    fletcher4 200MB 2739 MB/s 2739 MB/s Intel_12.1_O2_64bit
    CrapWow 200MB 2739 MB/s 2739 MB/s Microsoft_VS2010_Ox_64bit
    CrapWow 200MB 2816 MB/s 2816 MB/s Intel_12.1_O3_64bit
    murmur_x64_128 200MB 2816 MB/s 2816 MB/s Microsoft_VS2010_Ox_64bit
    fletcher2 200MB 2857 MB/s 2857 MB/s Intel_12.1_O3_32bit
    fletcher2 200MB 2857 MB/s 2898 MB/s Intel_12.1_fast_32bit
    fletcher4 200MB 2985 MB/s 2985 MB/s Microsoft_VS2010_Ox_64bit
    fletcher4 200MB 3076 MB/s 3076 MB/s Intel_12.1_fast_64bit
    fletcher4 200MB 3076 MB/s 3076 MB/s Intel_12.1_O3_64bit
    vhash 200MB 3125 MB/s 3125 MB/s Microsoft_VS2010_Ox_64bit
    CityHash128 200MB 3125 MB/s 3225 MB/s Intel_12.1_O3_64bit
    fletcher2 200MB 3225 MB/s 3225 MB/s Microsoft_VS2010_Ox_32bit
    SpookyHash 200MB 3278 MB/s 3278 MB/s Intel_12.1_O3_64bit
    FNV1a-Jesteress 200MB 3278 MB/s 3278 MB/s Microsoft_VS2010_Ox_64bit
    CityHash128 200MB 3333 MB/s 3278 MB/s Intel_12.1_O2_64bit
    FNV1a-Jesteress 200MB 3333 MB/s 3333 MB/s Intel_12.1_O2_64bit
    CityHash64 200MB 3333 MB/s 3333 MB/s Intel_12.1_O3_64bit
    xxhash 200MB 3333 MB/s 3333 MB/s Intel_12.1_O3_64bit
    CityHash64 200MB 3389 MB/s 3389 MB/s Intel_12.1_O2_64bit
    FNV1a-Jesteress 200MB 3389 MB/s 3389 MB/s Intel_12.1_O3_32bit
    FNV1a-Jesteress 200MB 3389 MB/s 3389 MB/s Intel_12.1_O3_64bit
    FNV1a-Jesteress 200MB 3389 MB/s 3389 MB/s Microsoft_VS2010_Ox_32bit
    xxhash 200MB 3389 MB/s 3389 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-Jesteress 200MB 3448 MB/s 3448 MB/s Intel_12.1_fast_32bit
    FNV1a-Jesteress 200MB 3448 MB/s 3448 MB/s Intel_12.1_fast_64bit
    xxhash 200MB 3448 MB/s 3448 MB/s Intel_12.1_fast_64bit
    SpookyHash 200MB 3448 MB/s 3448 MB/s Intel_12.1_fast_64bit
    xxhash 200MB 3448 MB/s 3448 MB/s Intel_12.1_O2_64bit
    xxhash 200MB 3448 MB/s 3448 MB/s Intel_12.1_O3_32bit
    xxhash 200MB 3448 MB/s 3448 MB/s Microsoft_VS2010_Ox_32bit
    xxhash 200MB 3508 MB/s 3508 MB/s Intel_12.1_fast_32bit
    xxhash 200MB 3508 MB/s 3508 MB/s Intel_12.1_O2_32bit
    FNV1a-Jesteress 200MB 3508 MB/s 3508 MB/s Intel_12.1_O2_32bit
    fletcher2 200MB 3636 MB/s 3636 MB/s Intel_12.1_O2_32bit
    FNV1a-YoshimitsuTRIADii 200MB 3703 MB/s 3703 MB/s Intel_12.1_fast_64bit
    FNV1a-YoshimitsuTRIADii 200MB 3773 MB/s 3773 MB/s Intel_12.1_O3_32bit
    FNV1a-YoshimitsuTRIADii 200MB 3773 MB/s 3773 MB/s Intel_12.1_O3_64bit
    FNV1a-YoshimitsuTRIADii 200MB 3846 MB/s 3846 MB/s Intel_12.1_fast_32bit
    xxhash256 200MB 3846 MB/s 3846 MB/s Intel_12.1_O3_64bit
    xxhash256 200MB 4000 MB/s 4000 MB/s Intel_12.1_fast_64bit
    FNV1a-Yoshimura 200MB 4000 MB/s 4000 MB/s Intel_12.1_O3_32bit
    FNV1a-Yoshimura 200MB 4081 MB/s 4081 MB/s Intel_12.1_fast_32bit
    FNV1a-Yoshimura 200MB 4081 MB/s 4081 MB/s Intel_12.1_fast_64bit
    SpookyHash 200MB 4081 MB/s 4081 MB/s Intel_12.1_O2_64bit
    xxhash256 200MB 4081 MB/s 4081 MB/s Intel_12.1_O2_64bit
    FNV1a-Yoshimura 200MB 4081 MB/s 4081 MB/s Intel_12.1_O3_64bit
    FNV1a-YoshimitsuTRIADii 200MB 4081 MB/s 4081 MB/s Microsoft_VS2010_Ox_32bit
    CityHash128 200MB 4081 MB/s 4081 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-Yoshimura 200MB 4081 MB/s 4081 MB/s Microsoft_VS2010_Ox_64bit
    SpookyHash 200MB 4081 MB/s 4081 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-Yoshimura 200MB 4166 MB/s 4166 MB/s Intel_12.1_O2_32bit
    FNV1a-Yoshimura 200MB 4166 MB/s 4166 MB/s Intel_12.1_O2_64bit
    FNV1a-Yoshimura 200MB 4166 MB/s 4166 MB/s Microsoft_VS2010_Ox_32bit
    xxhash256 200MB 4166 MB/s 4166 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-YoshimitsuTRIADii 200MB 4255 MB/s 4255 MB/s Intel_12.1_O2_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4255 MB/s 4255 MB/s Intel_12.1_O3_32bit
    FNV1a-Tesla 200MB 4255 MB/s 4255 MB/s Intel_12.1_O3_64bit
    FNV1a-YoshimitsuTRIADii 200MB 4255 MB/s 4255 MB/s Microsoft_VS2010_Ox_64bit
    CityHash64 200MB 4255 MB/s 4255 MB/s Microsoft_VS2010_Ox_64bit
    fletcher2 200MB 4255 MB/s 4347 MB/s Intel_12.1_O3_64bit
    FNV1a-Tesla3 200MB 4347 MB/s 4347 MB/s Intel_12.1_fast_64bit
    FNV1a-Tesla 200MB 4347 MB/s 4347 MB/s Intel_12.1_fast_64bit
    fletcher2 200MB 4347 MB/s 4347 MB/s Intel_12.1_fast_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4347 MB/s 4347 MB/s Intel_12.1_fast_64bit
    FNV1a-YoshimitsuTRIADii 200MB 4347 MB/s 4347 MB/s Intel_12.1_O2_32bit
    FNV1a-Tesla3 200MB 4347 MB/s 4347 MB/s Intel_12.1_O2_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4347 MB/s 4347 MB/s Intel_12.1_O3_64bit
    FNV1a-Tesla3 200MB 4347 MB/s 4347 MB/s Intel_12.1_O3_64bit
    FNV1a-Tesla3 200MB 4347 MB/s 4347 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4444 MB/s 4444 MB/s Intel_12.1_fast_32bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4444 MB/s 4444 MB/s Intel_12.1_O2_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4444 MB/s 4444 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4545 MB/s 4545 MB/s Intel_12.1_O2_32bit
    fletcher2 200MB 4545 MB/s 4545 MB/s Intel_12.1_O2_64bit
    FNV1a-Tesla 200MB 4545 MB/s 4545 MB/s Intel_12.1_O2_64bit
    fletcher2 200MB 4545 MB/s 4545 MB/s Microsoft_VS2010_Ox_64bit
    FNV1a-YoshimitsuTRIADiiXMM 200MB 4545 MB/s 4651 MB/s Microsoft_VS2010_Ox_32bit
    FNV1a-Tesla 200MB 4651 MB/s 4651 MB/s Microsoft_VS2010_Ox_64bit

    On second thought, it is better not to answer.

    Heavy benchmarks, as always, indicate how an etude goes outwith the underkill-overkill zone.

    For all the 8 executables:
    FNV1a-YoshimitsuTRIADii kills as it should residing in range 3703-4347 MB/s.
    FNV1a-Yoshimura slashes in range 4000-4166 MB/s, but YoshimitsuTRIADii has better dispersion.

    Comment by sanmayce — June 7, 2013 @ 6:31 am

  4. Hi,

    You may want to consider the following changes for the sake of making the code prettier and (IMO) more readable:

    uint64_t *q = (uint64_t*)p;
    while ((uint8_t*)q < big_loop_limit)
    {
    v1 = _rotl(v1, 29) + *(q++);
    v2 = _rotl(v2, 31) + *(q++);
    v3 = _rotl(v3, 33) + *(q++);
    v4 = _rotl(v4, 35) + *(q++);
    v1 += v2 *= PRIME;
    v1 = _rotl(v1, 29) + *(q++);
    v2 = _rotl(v2, 31) + *(q++);
    v3 = _rotl(v3, 33) + *(q++);
    v4 = _rotl(v4, 35) + *(q++);
    v2 += v3 *= PRIME;
    v1 = _rotl(v1, 29) + *(q++);
    v2 = _rotl(v2, 31) + *(q++);
    v3 = _rotl(v3, 33) + *(q++);
    v4 = _rotl(v4, 35) + *(q++);
    v3 += v4 *= PRIME;
    v1 = _rotl(v1, 29) + *(q++);
    v2 = _rotl(v2, 31) + *(q++);
    v3 = _rotl(v3, 33) + *(q++);
    v4 = _rotl(v4, 35) + *(q++);
    v4 += v1 *= PRIME;
    }

    while ((uint8_t*)q < small_loop_limit)
    {
    v1 = _rotl(v1, 29) + *(q++);
    v2 += v1 *= PRIME;
    v2 = _rotl(v2, 31) + *(q++);
    v3 += v2 *= PRIME;
    v3 = _rotl(v3, 33) + *(q++);
    v4 += v3 *= PRIME;
    v4 = _rotl(v4, 35) + *(q++);
    v1 += v4 *= PRIME;
    }
    memcpy(out, q, bEnd – (uint8_t*)q);

    I tested this and it generated the same hashes. The code that comes out of the compiler should literally be the same as yours.

    Thanks.

    Comment by nburns — November 24, 2015 @ 12:52 am

    • Oi Paula! Em Palermo Soho é melhor do que ficar na Recoleta? Eu e meu marido estamos pensando em ir pela primeira vez em se2#8bro&emt30; e como gostaríamos de aproveitar para conhecer a maioria dos passeios a pé, onde vc indicaria? To adorando mto seu blog!!! Bjão

      Comment by Storm — March 15, 2017 @ 8:26 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: