little thread: 8월 2014

2014년 8월 26일 화요일

Lock-free는 과연 빠른가? (3)

이전 글의 테스트 소스를 좀 더 실전 상황에 맞게 수정해 보았다.

먼저 아래와 같이 각 단위테스트마다 do_something 함수를 호출하여 일반 코드와 유사한 구조를 가지도록 하였다. 참고로 일반 상황에서는 간헐적으로 thread가 blocking 되거나 wait/sleep 상태에 돌입할 수 있는데, 이 테스트에서는 배제하였다. Lock-Free 로직을 통한 별렬성을 최대한 발휘할 수 있도록 고려한 것이다.

LONG gSomeData[1024];
static void do_something() {
for (int i = 0; i < 1000; i ++) {
gSomeData[i] += 1;
}
}

DWORD WINAPI test_lock_free(LPVOID param) {
int cntTest = (int)param;
for (int i = 0; i < TEST_LOOP; i++) {
volatile LONG* p = gSharedData;
for (int j = 0; j < cntTest; j++) {
::InterlockedIncrement(p);
p++;
}
do_something();
}
return 0;
}

또한, SpinLock 의 lock() 함수는 아래와 같이 단순화하였고,

void lock() { int cnt = cntSpin; while (!tryLock()) { if (-- cnt <= 0) { Sleep(1); cnt = cntSpin; } } }

SlowSpinLock 과 FastSpinLock 은 각각 아래와 같이 초기화하였다.

static SpinLock gSlowSpinLock(20); // cntSpin=20
static SpinLock gFastSpinLock(20000); // cntSpin=20000

테스트 결과는 아래와 같다.

=========== Threads = 1, Test loop = 640*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 359 374 374 359
2 375 374 359 374
3 375 358 375 374
4 375 374 359 374
5 390 375 374 374

=========== Threads = 2, Test loop = 320*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 250 359 359 265
2 265 328 327 265
3 266 358 375 265
4 281 359 343 265
5 281 327 359 281

=========== Threads = 4, Test loop = 160*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 234 358 359 219
2 249 343 359 203
3 265 343 328 203
4 249 359 359 218
5 250 359 359 202

=========== Threads = 8, Test loop = 80*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 219 343 343 219
2 187 343 328 218
3 187 343 344 218
4 203 343 297 218
5 187 343 328 234

=========== Threads = 16, Test loop = 40*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 187 312 297 218
2 203 312 265 218
3 203 312 265 219
4 203 296 281 218
5 187 312 281 234

=========== Threads = 32, Test loop = 20*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 203 296 250 234
2 187 312 250 218
3 203 296 234 219
4 203 280 250 218
5 203 297 234 234

=========== Threads = 64, Test loop = 10*1000 ===============
atomic_update | lock_free | slow_lock | fast_lock | crit_sect
1 202 390 359 234
2 203 452 359 250
3 218 328 312 234
4 203 343 312 234
5 218 375 343 312

결론적으로 일반적인 상황에서 lock-free 와 critical-section 간의 차이는 거의 없으며, spin_lock의 효율성은 크지 않는 것으로 보인다.

한 편, 자바 코드의 실행결과는 어떻게 될까? 실은 그 의외의 결과 때문에 이 글을 올리게 되었다. 보시다시피 모든 실행결과가 C로 작성된 테스트코드보다 빠르다.
아래는 Linux 상에서 실행한 Java 코드 수행 결과이다. 굳이 리눅스 상에서 테스트를 한 이유는 pthread_mutex 테스트 결과가 매우 저조해서, 이를 내부적으로 사용하는 자바의 퍼포먼스를 직접 검증하기 위한 것이다. 편이를 위해 옆에 pthread_mutex 테스트 값을 추가하였다.

=========== Threads = 1, Test loop = 640*1000  ===============

atomic_update = 1 : java = 214   pthread = 1411

atomic_update = 2 : java = 214   pthread = 1402

atomic_update = 3 : java = 214   pthread = 1407

atomic_update = 4 : java = 216   pthread = 1402

atomic_update = 5 : java = 216   pthread = 1406

=========== Threads = 2, Test loop = 320*1000  ===============

atomic_update = 1 : java = 212   pthread = 2015

atomic_update = 2 : java = 214   pthread = 2176

atomic_update = 3 : java = 214   pthread = 1983

atomic_update = 4 : java = 213   pthread = 2079

atomic_update = 5 : java = 216   pthread = 2143

=========== Threads = 4, Test loop = 160*1000  ===============

atomic_update = 1 : java = 160   pthread = 2437

atomic_update = 2 : java = 172   pthread = 2455

atomic_update = 3 : java = 166   pthread = 2404

atomic_update = 4 : java = 172   pthread = 2415

atomic_update = 5 : java = 173   pthread = 2347

=========== Threads = 8, Test loop = 80*1000 ================

atomic_update = 1 : java = 158   pthread = 2498

atomic_update = 2 : java = 159   pthread = 2564

atomic_update = 3 : java = 153   pthread = 2518

atomic_update = 4 : java = 160   pthread = 2526

atomic_update = 5 : java = 156   pthread = 2438

=========== Threads = 16, Test loop = 40*1000  ===============

atomic_update = 1 : java = 169   pthread = 2634

atomic_update = 2 : java = 153   pthread = 2584

atomic_update = 3 : java = 159   pthread = 2543

atomic_update = 4 : java = 164   pthread = 2550

atomic_update = 5 : java = 161   pthread = 2536

=========== Threads = 32, Test loop = 20*1000  ===============

atomic_update = 1 : java = 164   pthread = 2650

atomic_update = 2 : java = 158   pthread = 2627

atomic_update = 3 : java = 163   pthread = 2483

atomic_update = 4 : java = 161   pthread = 2469

atomic_update = 5 : java = 163   pthread = 2535

=========== Threads = 64, Test loop = 10*1000 ===============

atomic_update = 1 : java = 164  pthread = 2668

atomic_update = 2 : java = 166  pthread = 2651

atomic_update = 3 : java = 164  pthread = 2711

atomic_update = 4 : java = 167  pthread = 2780

atomic_update = 5 : java = 163  pthread = 2741

테스트 결과, do_something 함수의 내부 루프가 100이하이고, 쓰레드 수가 32개 이하인 경우, 위에 사용된 spin-lock 코드가 가장 빠르며, 200~500 사이는 lock-free 가 가장 빠르다.
자바 코드는 내부 루프가 50 인 경우에도 쓰레드 수가 64개 이상이면 가장 빠르다. 이후 내부 루프가 800 이상이 되면 거의 모든 테스트 항목에서 Java 코드 수행 속도가 가장 빨라진다.
이는 단순히 Sleep()을 사용한 위의 spin-lock 코드보다, 자바 모니터의 락 검사 메카니즘과 Thread-scheduling이 훨씬 더 효율적임을 의미한다.

굳이 쓰레드간 충돌에 의해 스케쥴링이 발생해선 안되는 상황, 즉 병렬컴퓨팅이 필요한 상황이 아니더라도 변수 한 두개를 InterlockedXxx 함수를 써서 변경하는 것은 속도 여부를 떠나, 구현과 사후관리를 위해서도 권장할만한 사항일 것이다.
다만 그 이상의 복잡한 상황이라면 일단은 주저없이 CriticalSection과 java synchronized 를 쓰는 것이 바람직하다. 실행환경이 한정되고 고정적이지 않는 한, 시스템이 제공하는 동기화 API 이상의 퍼포먼스를 내는 것은 매우 어려울 뿐더러, 안정성도 심하게 떨어뜨린다. 관련 최적화는 시스템이 매우 안정화된 상태로 미루는 것이 맞을 것이다.

멀티쓰레드 쓰레드 코딩 시 가장 중요한 두 가지 이슈는 퍼포먼스와 안정성이라 하겠다. 위 테스트 결과가 관련 시스템 설계 및 구현 시에 도움이 되기를 바란다.

2014년 8월 22일 금요일

Lock-free는 과연 빠른가? (2)

수십개의 쓰레드가 1~n개의 변수를 각각 atomic하게 변경하는 함수를 매우 빈번하게 호출한다고 가정하자.
해당 함수를 lock-free 로 구현한 버젼이 있고, spin-lock 이나 mutex 로 구현한 버젼이 각각 있다. 어떤 함수가 더 빠를까? 게다가 쓰레드 수가 충분히 많다면? 그리고 C++ 로 구현한 lock-free 함수와 자바 언어로 구현한 함수를 비교하면?
아마도 많은 개발자들이 쉽게 아래와 같이 대답할 것 같다.
1) 병렬처리 기법 즉, Lock-Free를 이용한 함수가 더 빠를 것이다.
2) 쓰레드 수가 많아질수록 Lock-free 기법을 이용한 함수가 더 빠를 것이다.
3) 질문자 수준이 의심된다. 당연히 C언어 Lock-Free 가 월등히 빠를 것이다.

과연 그럴까?
결론부터 말하면 위의 세 개의 답은 모두 틀렸다.
아래는 테스트에 사용된 소스코드와 그 결과값이다.
물론, 아래 테스트 결과는 매우 제한된 조건에서만 유의미하다. 다만, 잘못된 상식을 벗어나나게 하는 용도로는 큰 효과가 있을 것이다.

include <Windows.h>
#include <stdio.h>

using namespace std;

CRITICAL_SECTION cs;

class SpinLock {
public:
SpinLock(int cntSpin) {
this->spin_lock = 0;
this->cntSpin = cntSpin;
}

void lock() {
int v;
while (::InterlockedCompareExchange(&spin_lock, (LONG)&v, 0) != 0) {
if (cntSpin <= 0) {
::Sleep(1);
}
else {
for (int i = 0; i < cntSpin; i ++) {
_asm nop;
}
}
}
}

void unlock() {
spin_lock = 0;
}

class Auto {
public:
Auto(SpinLock* spinLock) {
this->spinLock = spinLock;
spinLock->lock();
}

~Auto(){
spinLock->unlock();
}

private:
SpinLock* spinLock;
};

private:
volatile LONG spin_lock;
bool cntSpin;
};

static SpinLock gSlowSpinLock(0);
static SpinLock gFastSpinLock(1);
volatile LONG gSharedData[20] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, };
static int TEST_LOOP;

DWORD WINAPI test_slow_lock(LPVOID param) {
int cntTest = (int) param;
for (int i = 0; i < TEST_LOOP; i++) {
SpinLock::Auto lock(&gSlowSpinLock);
volatile LONG* p = (LONG*)gSharedData;
for (int j = 0; j < cntTest; j++) {
*p += 1;
p++;
}
}
return 0;
}

DWORD WINAPI test_fast_lock(LPVOID param) {
int cntTest = (int) param;
for (int i = 0; i < TEST_LOOP; i++) {
SpinLock::Auto lock(&gFastSpinLock);
volatile LONG* p = (LONG*)gSharedData;
for (int j = 0; j < cntTest; j++) {
*p += 1;
p++;
}
}
return 0;
}

DWORD WINAPI test_lock_free(LPVOID param) {
int cntTest = (int)param;
for (int i = 0; i < TEST_LOOP; i++) {
volatile LONG* p = gSharedData;
for (int j = 0; j < cntTest; j++) {
::InterlockedIncrement(p);
p++;
}
}
return 0;
}

DWORD WINAPI test_critical_section(LPVOID param) {
int cntTest = (int) param;
for (int i = 0; i < TEST_LOOP; i++) {
::EnterCriticalSection(&cs);
volatile LONG* p = (LONG*)gSharedData;
for (int j = 0; j < cntTest; j++) {
*p += 1;
p++;
}
::LeaveCriticalSection(&cs);
}
return 0;
}

static const int MAX_THREADS = 32;
int doLockTest(int cntThread, LPTHREAD_START_ROUTINE test_fn, int cntAtomicUpdate) {

HANDLE th[MAX_THREADS];
memset((void*)gSharedData, 0, cntAtomicUpdate*sizeof(int));

ULONGLONG t0 = ::GetTickCount64();
for (int i = 0; i < cntThread; ++i)
th[i] = CreateThread(NULL, NULL, test_fn, (void *)cntAtomicUpdate, NULL, NULL);

WaitForMultipleObjects(cntThread, th, TRUE, INFINITE);
ULONGLONG t1 = ::GetTickCount64();

for (int i = 0; i < cntAtomicUpdate; ++i) {
if (gSharedData[i] != TEST_LOOP * cntThread) {
printf("!!! Invalid test function");
exit(-1);
}
}

return (int)(t1 - t0);
}

int _tmain(int argc, _TCHAR* argv[])
{
::InitializeCriticalSection(&cs);
for (int cntThread = 1; cntThread <= MAX_THREADS; cntThread *= 2) {
TEST_LOOP = 12800 / cntThread * 1000;
printf("\n\n=========== Threads = %d, Test loop = %d*1000 ===============\n",
cntThread, TEST_LOOP/1000);
printf("atomic_update | lock_free | slow_spin | fast_spin | crit_sect \n");
for (int cntAtomicUpdate = 1; cntAtomicUpdate <= 5; cntAtomicUpdate ++) {
int t1 = doLockTest(cntThread, test_lock_free, cntAtomicUpdate);
int t2 = doLockTest(cntThread, test_slow_lock, cntAtomicUpdate);
int t3 = doLockTest(cntThread, test_fast_lock, cntAtomicUpdate);
int t4 = doLockTest(cntThread, test_critical_section, cntAtomicUpdate);
printf("% 7d % 14d % 11d % 11d % 11d\n", cntAtomicUpdate, t1, t2, t3, t4);
}
}
return 0;
}

아래가 위 소스코드를 실행한 결과이다.

=========== Threads = 1, Test loop = 12800*1000 ===============
atomic_update | lock_free | slow_spin | fast_spin | crit_sect
1 109 125 125 281
2 203 124 141 265
3 296 141 140 297
4 374 140 141 296
5 453 156 140 296

=========== Threads = 2, Test loop = 6400*1000 ===============
atomic_update | lock_free | slow_spin | fast_spin | crit_sect
1 359 125 452 749
2 609 140 437 733
3 827 140 531 592
4 1061 141 499 702
5 1139 140 936 686

=========== Threads = 4, Test loop = 3200*1000 ===============
atomic_update | lock_free | slow_spin | fast_spin | crit_sect
1 344 124 1014 874
2 608 141 1014 874
3 873 141 1076 593
4 1107 141 1045 874
5 1060 141 1310 624

=========== Threads = 8, Test loop = 1600*1000 ===============
atomic_update | lock_free | slow_spin | fast_spin | crit_sect
1 312 125 1747 874
2 546 125 1778 952
3 795 141 1778 593
4 1092 140 1794 1045
5 1092 156 1935 593

=========== Threads = 16, Test loop = 800*1000 ===============
atomic_update | lock_free | slow_spin | fast_spin | crit_sect
1 296 140 3573 920
2 531 140 2793 936
3 811 140 2808 593
4 1076 141 3167 1060
5 1092 156 3339 593

=========== Threads = 32, Test loop = 400*1000 ===============
atomic_update | lock_free | slow_spin | fast_spin | crit_sect
1 296 156 5881 921
2 530 156 4945 952
3 811 156 4883 577
4 1092 156 6209 998
5 1108 172 7004 593

아마 위 결과를 보고 믿지 못하는 분도 계실지 모르겠다. 그런 경우 소스코드를 직접 실행해 보시기 바란다.
보다시피 쓰레드 수가 2 이상인 경우, 단 한 개의 변수를 변경할 때도 SlowSpinLock 은 다른 모든 방식보다 보다 2~3배 이상 빠르다. (첨고로, FastSpinLock 은 다른 모든 방식보다 매우 큰 차이로 느리다.)
이러한 차이는 SlowSpinLock 이 매우 효과적으로 쓰레드간 경쟁상태를 최소화하기 때문에 발생하는 것으로 판단된다. CriticalSection은 FastSpinLock을 일부 포함하고 있어서 어정쩡한 결과가 나오는 것이라면, SlowSpinLock 과 유사하게 구현된 java synchronized 는 어떤 결과가 나올까?
아래가 그 결과이다. Windows 용 JDK 1.6.3 에서 테스트한 것이다.
놀랍게도 Thread 수에 관계없이 atomic 하게 변경하는 변수가 2 또는 3개 이상이면 java synchronized 가 C++ 로 구현한 Lock-free 코드보다 더 빨라지기 시작한다.

=========== Threads = 1, Test loop = 12800*1000 ===============
atomic_update = 1 : java_synchronized = 181
atomic_update = 2 : java_synchronized = 181
atomic_update = 3 : java_synchronized = 179
atomic_update = 4 : java_synchronized = 180
atomic_update = 5 : java_synchronized = 184

=========== Threads = 2, Test loop = 6400*1000 ===============
atomic_update = 1 : java_synchronized = 414
atomic_update = 2 : java_synchronized = 572
atomic_update = 3 : java_synchronized = 482
atomic_update = 4 : java_synchronized = 472
atomic_update = 5 : java_synchronized = 426

=========== Threads = 4, Test loop = 3200*1000 ===============
atomic_update = 1 : java_synchronized = 576
atomic_update = 2 : java_synchronized = 555
atomic_update = 3 : java_synchronized = 559
atomic_update = 4 : java_synchronized = 490
atomic_update = 5 : java_synchronized = 513

=========== Threads = 8, Test loop = 1600*1000 ===============
atomic_update = 1 : java_synchronized = 493
atomic_update = 2 : java_synchronized = 501
atomic_update = 3 : java_synchronized = 586
atomic_update = 4 : java_synchronized = 526
atomic_update = 5 : java_synchronized = 530

=========== Threads = 16, Test loop = 800*1000 ===============
atomic_update = 1 : java_synchronized = 530
atomic_update = 2 : java_synchronized = 538
atomic_update = 3 : java_synchronized = 524
atomic_update = 4 : java_synchronized = 522
atomic_update = 5 : java_synchronized = 569

=========== Threads = 32, Test loop = 400*1000 ===============
atomic_update = 1 : java_synchronized = 574
atomic_update = 2 : java_synchronized = 571
atomic_update = 3 : java_synchronized = 563
atomic_update = 4 : java_synchronized = 557
atomic_update = 5 : java_synchronized = 568

과연 자바는 최강의 멀티쓰레드 플랫폼이라 불릴만하다.

참고로, pthread_mutex 를 사용하면 CriticalSection 보다 2~3배 느려진다. Linux에서 Java 버젼도 이런 영향을 받아선지 성능이 뚝 떨어지지만 그래도 pthread_mutex 보다는 2배 이상 빠르다.

결론은 멀티쓰레드 코딩 방식에는 절대 강자는 없으며, 병렬컴퓨팅이 명확히 필요한 경우가 아니라면 lock-free 에 집착할 필요가 없다는 것이다. 위 테스트 결과처럼, 때로는 매우 단순한 CAS 함수와 Sleep 함수 하나 만으로도 매우 탁월한 퍼포먼스를 얻어낼 수 있다. Thread 간 충돌이 빈번히 발생하는 상황에서 Thread 블로킹을 최소화하려면 Lock-Free가 필요하지만 오히려 Lock을 적절히 사용하여 Thread 간 충돌을 최소화하여 퍼포먼스를 더 높일 수 있는 방법도 존재하는 것이다. 즉, 때로는 Lock-free 가, 때로는 FastSpinLock 이, 때로는 CriticalSection 이 가장 빠를 때가 있으며, 그때 그때 상황에 맞춰 해결하여야 한다. 다만, 자바 프로그래머라면 이런 고민에서 완전히 해방감을 느껴도 무방하다. ^^

누군가 C++11 의 mutex 관련 테스트자료도 올려주시면 고맙겠는데...
참고로, 아래에 자바 소스 코드도 첨부합니다.

public class TestMain {
static int gSharedData[] = new int[20];
static int TEST_LOOP;

static class JSync extends Thread {
JSync(int c) {
this.cntTest = c;
start();
}
public void run() {
for (int i = 0; i < TEST_LOOP; i++) {
synchronized (gSharedData) {
for (int j = 0; j < cntTest; j++) {
gSharedData[j] ++;
}
}
}
}
int cntTest;
}

static final int MAX_THREADS = 32;
static int doLockTest(int cntThread, int cntAtomicUpdate, boolean testSync) throws InterruptedException {
Thread th[] = new Thread[MAX_THREADS];
for (int i = 0; i < cntAtomicUpdate; i ++) {
gSharedData[i] = 0;
}

long t0 = System.currentTimeMillis();

for (int i = 0; i < cntThread; ++i) {
th[i] = new JSync(cntAtomicUpdate);
}
for (int i = 0; i < cntThread; ++i) {
th[i].join();
}

long t1 = System.currentTimeMillis();

for (int i = 0; i < cntAtomicUpdate; ++i) {
if (gSharedData[i] != TEST_LOOP * cntThread) {
System.out.print("!!! Invalid test function");
System.exit(-1);
}
}

return (int)(t1 - t0);
}

public static void main(String[] args) {

try {
for (int cntThread = 1; cntThread <= MAX_THREADS; cntThread *= 2) {
TEST_LOOP = 12800 / cntThread * 1000;
System.out.print("\n\n=========== Threads = " + cntThread + ", Test loop = " + TEST_LOOP/1000 + "*1000 ===============\n");

doLockTest(cntThread, 1, true);
for (int cntAtomicUpdate = 1; cntAtomicUpdate <= 5; cntAtomicUpdate += 1) {
int t1 = doLockTest(cntThread, cntAtomicUpdate, true);
System.out.print("atomic_update = " + cntAtomicUpdate
+ " : java_synchronized = " + t1 + "\n");
}
}
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

2014년 8월 15일 금요일

Lock-Free 는 과연 빠른가?

아래의 테스트 코드로 Lock을 사용한 알고리즘과 LockFree 알고리즘의 성능을 간단히 비교해보았다.
아래 코드는 n개의 변수를 atomic하게 변경할 때, 두 방식의 퍼포먼스를 비교하기 위한 것이다. Lock-free 구현 시 2개 이상의 변수를 atomic 하게 변경하는 범용적인 알고리즘은 존재하지 않으므로, 각각 독립된 n개 변수를 변경하는 것으로 비교하였다.

아래 코드는 쓰레드 간 충돌이 발생하지 않는 상황에서 테스트되었다. 일반적인 멀티쓰레드 프로그램 내에서 동일변수(또는 집합)을 동시에 참조하는 경우는 극히 드물게 발생하므로 그로 인한 실질적인 성능 감소는 무시할 수 있기 때문이다.

// Test code

CRITICAL_SECTION cs;

void compare_lock_free(int cntTest) {
const int TEST_LOOP = 1000 * 10000;
volatile LONG v[20] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, };

ULONGLONG t0 = ::GetTickCount64();
for (int i = 0; i < TEST_LOOP; i ++) {
::EnterCriticalSection(&cs);
volatile LONG* p = v;
for (int j = 0; j < cntTest; j ++) {
_asm mov eax, p;
_asm inc [eax];
p ++;
}
::LeaveCriticalSection(&cs);
}
ULONGLONG t1 = ::GetTickCount64();
for (int i = 0; i < TEST_LOOP; i ++) {
volatile LONG* p = v;
for (int j = 0; j < cntTest; j ++) {
_asm mov eax, p;
_asm lock inc [eax];
p ++;
}
}
ULONGLONG t2 = ::GetTickCount64();
for (int i = 0; i < TEST_LOOP; i ++) {
volatile LONG* p = v;
for (int j = 0; j < cntTest; j ++) {
::InterlockedIncrement(p);
p++;
}
}
ULONGLONG t3 = ::GetTickCount64();
printf("test = %d : lock = %d, lock_free 1 = %d, lock_free 2 = %d\n",
cntTest, (int)(t1-t0), (int)(t2-t1), (int)(t3-t2));
}

// 실행 결과.
test = 1 : lock = 202, lock_free 1 = 110, lock_free 2 = 78
test = 2 : lock = 234, lock_free 1 = 171, lock_free 2 = 172
test = 3 : lock = 218, lock_free 1 = 250, lock_free 2 = 234
test = 4 : lock = 234, lock_free 1 = 343, lock_free 2 = 281
test = 5 : lock = 234, lock_free 1 = 437, lock_free 2 = 358
test = 6 : lock = 234, lock_free 1 = 515, lock_free 2 = 421
test = 7 : lock = 250, lock_free 1 = 593, lock_free 2 = 499
test = 8 : lock = 250, lock_free 1 = 686, lock_free 2 = 562
test = 9 : lock = 265, lock_free 1 = 733, lock_free 2 = 624
test = 10 : lock = 281, lock_free 1 = 811, lock_free 2 = 702

보다시피 3번 이상 atomic 연산이 필요한 경우, CriticalSection을 사용한 알고리즘이 더욱 빨리 실행된다. 일부러 함께 비교한 InterlockedIncreement() 함수가 assembly 코드보다 더 효율적인 것으로 나오는 것도 참고할 필요가 있을 것이다. 일부러 assembly 쓸 필요가 없다는. ^^

BeginCriticalSection() 과 EnterCriticalSection() 내부에서 _asm lock 이 각각 1회 이상 사용되므로, 결국 _asm lock 호출 회수에 의해 성능차이가 발생하는 것이라 하겠다. 위 코드를 pthread mutex 를 사용하는 것으로 바꾸어도 결과는 마찬가지일 것이다. x86의 _asm lock 오퍼레이션은 수백 사이클 이상 소모되는 매우 고비용 명령어임을 위 결과로 확인할 수 있다.

단 하나의 변수만을 치환할 때 CriticalSection 과 lock-free 알고리즘의 차이는 100ms / 천만, 즉 10ns 정도 차이가 발생한다. 서버 성능에 따라 좀 더 차이가 있을 수는 있겠으나, 일반적인 상황에서 이 정도 차이는 무시해도 무방할 정도이다.

다음 글에서는 쓰레드간 충돌이 매우 빈번하게 발생하는 경우에 대해 다뤄보도록 하겠다.