How To Help Apps Keep Up In A Multicore World

Updating your source code for speed on multicore systems is no longer optional. To stay competitive, solution providers must rewrite or optimize for today's parallel systems. Whether you're modifying your existing single-core applications or starting over from scratch, there are some critical issues you need to consider to keep your apps from going critical.

First is the myth that an app written for a single-core system will automatically run faster on a multicore system. It won't. Even if the target operating system is multicore-savvy, applications written for single-core processors do not typically run faster unless they are optimized for the multithreading and other parallel capabilities the multicore host system offers.

In fact, they can sometimes run slower, or freeze altogether, on a multicore system and operating system due to race conditions or deadlock, even if the program is recompiled with a multicore-aware compiler. Explained simply, a deadlock can occur when two threads each hold a lock the other needs and wait forever for it to be released, so neither can proceed.
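A minimal sketch of the classic case, with hypothetical lock and class names: two threads acquire the same two locks in opposite order, so each ends up holding the lock the other is waiting for.

    // Hypothetical sketch: t1 takes lockA then lockB; t2 takes lockB then
    // lockA. Each ends up holding one lock and waiting forever for the other.
    public class DeadlockDemo {
        private static final Object lockA = new Object();
        private static final Object lockB = new Object();

        public static void main(String[] args) {
            Thread t1 = new Thread(() -> {
                synchronized (lockA) {
                    pause();                      // give t2 time to grab lockB
                    synchronized (lockB) { /* shared work would go here */ }
                }
            });
            Thread t2 = new Thread(() -> {
                synchronized (lockB) {
                    pause();                      // give t1 time to grab lockA
                    synchronized (lockA) { /* shared work would go here */ }
                }
            });
            t1.start();
            t2.start();                           // both threads typically hang here
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        }
    }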

Consider Amdahl's Law, which holds that a program's overall speedup is limited by whatever portion of its code still runs serially. For example, if only 90 percent of a program is optimized for parallel execution, then the serial code in the remaining 10 percent restricts the program's ultimate performance gain to a maximum factor of 10. The actual gain will likely be far less.
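In formula form, with P the fraction of the program that runs in parallel and N the number of cores, the achievable speedup is:

    speedup(N) = 1 / ((1 - P) + P / N)

With P = 0.90, the limit as N grows without bound is 1 / 0.10 = 10; on an eight-core system the figure is only 1 / (0.10 + 0.90 / 8), or roughly 4.7x.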

In addressing deadlocks, Java developers have been helped by the concurrency libraries added in Java 5. But deadlocks and other related issues can still surface in any application, and are best tracked down with an application profiler that's multicore-aware, such as AMD's CodeAnalyst or Intel's VTune Amplifier XE.
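As a sketch of the kind of help those libraries provide (the class and method names here are hypothetical, but ReentrantLock and tryLock are real java.util.concurrent APIs), a thread can try for a second lock and back off rather than block forever:

    import java.util.concurrent.locks.ReentrantLock;

    // Sketch: acquire two locks without deadlock by releasing and retrying
    // whenever the second lock isn't immediately available, so no thread
    // ever holds one lock while blocking indefinitely on the other.
    public class BackoffLocking {
        static void runWithBothLocks(ReentrantLock first, ReentrantLock second,
                                     Runnable task) throws InterruptedException {
            while (true) {
                first.lock();
                try {
                    if (second.tryLock()) {       // succeed or give up at once
                        try {
                            task.run();
                            return;
                        } finally {
                            second.unlock();
                        }
                    }
                } finally {
                    first.unlock();               // back off before retrying
                }
                Thread.sleep(1);
            }
        }
    }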

Application debugging on single-core and multicore systems differs primarily in the way program tasks are divided up. On single-core systems, where there's a single execution pipeline, tasks execute sequentially. When building and debugging apps for multicore systems, developers must keep in mind which parts of their program are executing in which thread on which processor or processor core, and whether the cores share the same memory map. And if your apps take advantage of processor cache, then keeping track of which data is resident in each core's L1 and L2 caches becomes critical.

If that weren't enough, there's much more in store when debugging multicore. In addition to executing tasks in threads, which are managed by the application and are therefore under the developer's complete control, applications can also accomplish parallel execution with processes, which are managed by the operating system. An advantage of using processes is that each carries its own program state in its own address space, in essence giving it some persistence and control over its own destiny.
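A minimal sketch of process-based parallelism, assuming a hypothetical external program named worker on the path:

    import java.io.IOException;

    // Sketch: launch two OS-managed worker processes in parallel and wait
    // for both. Each process gets its own address space and state,
    // isolated from the parent and from its sibling.
    public class ProcessDemo {
        public static void main(String[] args)
                throws IOException, InterruptedException {
            Process p1 = new ProcessBuilder("worker", "--chunk", "1")
                    .inheritIO().start();
            Process p2 = new ProcessBuilder("worker", "--chunk", "2")
                    .inheritIO().start();
            p1.waitFor();                 // the OS schedules the processes;
            p2.waitFor();                 // the parent simply waits for both
        }
    }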

Unlike threading, task-based parallelism cannot be "added on" to an application. The ability to split a program's functionality into one or more tasks must be part of the program's original design. Also, tasks can't communicate with each other directly; they must instead rely on whatever interprocess communication the operating system provides.

Adding Multithreading Capabilities

Short of re-engineering your applications, adding multithreading capabilities -- parceling longer tasks that are not sensitive to sequence into smaller ones -- can often be done without too much extra coding. The trick is to avoid breaking the work into so many small threads that the time and hardware overhead of setting up and tearing down the threads exceeds the processor cycles saved.
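A sketch of that parceling, assuming a hypothetical process(i) step whose iterations are independent of one another: the work is split into one coarse chunk per core, rather than one thread per item, precisely so that thread setup and teardown stay cheap relative to the cycles saved.

    // Sketch: divide n independent iterations among one thread per core.
    public class ChunkedWork {
        static void process(int i) { /* hypothetical, order-insensitive work */ }

        public static void main(String[] args) throws InterruptedException {
            final int n = 1_000_000;
            int cores = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[cores];
            for (int t = 0; t < cores; t++) {
                final int from = t * n / cores;
                final int to = (t + 1) * n / cores;
                workers[t] = new Thread(() -> {
                    for (int i = from; i < to; i++) process(i);
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();    // wait for every chunk
        }
    }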

Intel's Shameem Akhter and Jason Roberts, in an EE Times article, suggest three techniques for adding threading to an application.

First is to "let OpenMP do the work." The Open Multiprocessing API includes a set of compiler directives, library routines and environment variables that Unix and Windows developers can add to their C/C++ and Fortran applications to influence run-time behavior and guide parallel execution. According to Akhter and Roberts, OpenMP lets the programmer specify loop iterations instead of threads, determines the optimal number of threads, and handles the management of those threads.
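OpenMP itself targets C, C++ and Fortran, so there is no direct Java form; purely as a rough analog, Java's parallel streams express the same idea of marking the loop and letting the runtime choose and manage the threads:

    import java.util.stream.IntStream;

    // Rough analog of an OpenMP "parallel for" (not OpenMP itself): the
    // programmer marks the loop; the runtime picks the thread count and
    // manages the threads. Each index is written by exactly one thread.
    public class ParallelLoop {
        public static void main(String[] args) {
            double[] results = new double[1_000_000];
            IntStream.range(0, results.length)
                     .parallel()                  // let the runtime split it
                     .forEach(i -> results[i] = Math.sqrt(i));
        }
    }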

A second technique is to implement a thread pool, a program construct that maintains a set of long-lived software threads and does away with the resource-sapping initialization of short-term ones. Rather than creating and killing a thread for each task, thread pools keep threads alive to service whatever tasks come in, always finishing one task before taking on a new one. Thread pool implementations include QueueUserWorkItem in Windows, .NET's ThreadPool and Java's Executor.
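A minimal sketch using the Executor framework the authors cite: a fixed pool of long-lived threads services many short tasks with no per-task thread creation.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch: a fixed pool of long-lived threads services 100 short tasks,
    // avoiding the cost of creating and destroying a thread per task.
    public class PoolDemo {
        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (int i = 0; i < 100; i++) {
                final int taskId = i;
                pool.submit(() -> System.out.println("task " + taskId
                        + " ran on " + Thread.currentThread().getName()));
            }
            pool.shutdown();              // accept no new tasks; drain the queue
        }
    }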

A third option is for developers to build their own task scheduler, with the method of choice called "work stealing," according to Akhter and Roberts. With this method, each thread is assigned its own list of tasks but can steal from the lists of other threads if its own runs out. This method is said to be among the more efficient because it "yields good cache usage" by reusing data that is "hot in its cache," and because the stealing itself acts as a natural load balancer. The trick, they say, is to "bias the stealing" toward large tasks "so that the thief can stay busy for a while."
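Developers who would rather not start from scratch can look at Java's ForkJoinPool (added in Java 7), which implements exactly this work-stealing scheme: each worker thread keeps its own task deque, and idle workers steal from the others. In this recursive-sum sketch, the bias toward large tasks falls out naturally, because the biggest ranges are forked first:

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Sketch: sum a large array with work stealing. Big ranges are split
    // and queued first, so a thief tends to grab a large chunk of work
    // and "stay busy for a while."
    public class StealingSum extends RecursiveTask<Long> {
        private static final int THRESHOLD = 10_000;
        private final long[] data;
        private final int from, to;

        StealingSum(long[] data, int from, int to) {
            this.data = data; this.from = from; this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {         // small enough: just add it up
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) >>> 1;          // big task: split in half
            StealingSum left = new StealingSum(data, from, mid);
            left.fork();                          // queue one half for stealing
            long rightSum = new StealingSum(data, mid, to).compute();
            return rightSum + left.join();        // combine both halves
        }

        public static void main(String[] args) {
            long[] data = new long[10_000_000];
            java.util.Arrays.fill(data, 1L);      // fill with ones so the sum is non-trivial
            long total = new ForkJoinPool().invoke(
                    new StealingSum(data, 0, data.length));
            System.out.println(total);            // prints 10000000
        }
    }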

Also staying busy will be developers as they adapt their older applications for the realities of the new world of multicore processors and multiprocessor computing platforms.

As part of that new world, according to a 2009 study, single-core processors will have all but disappeared by 2013. Will the same be true of your apps?