Distributed Performance Analysis for R
R is the most widely used programming language for statisticians in research, especially in biostatistics and bioinformatics, where data sets are high-dimensional and analyses often demand large amounts of compute time and memory. Our existing tool traceR lets the user profile the resource usage of an application in order to locate bottlenecks and develop new optimizations. traceR was previously limited to non-parallelized R applications. Parallel computing, however, is becoming an increasingly popular way to reduce the effective runtime of compute-bound R applications. We have therefore extended traceR to profile such applications.
Compared to existing profiling tools such as Rprof, traceR is directly integrated with the R interpreter. This enables it to generate more detailed and accurate data about the memory behavior and runtime of an R application. For example, it reports the size and number of memory allocations made during execution. Since the gain from parallel execution can be negated if the combined memory requirements of all parallel processes exceed the capacity of the system, this data can serve as a constraint that determines the maximum degree of parallelism. The information gathered with traceR can also guide scheduling decisions so that resources are used efficiently. Such decisions are especially important if the hardware is heterogeneous or if the resource requirements of jobs vary with the input data.
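The memory constraint described above can be illustrated with a small calculation. The following is a minimal sketch, not part of traceR: the function name `max_parallel_workers` and its parameters are hypothetical, it assumes every worker has the same peak memory footprint, and it is written in Python purely for illustration (the same arithmetic applies in R). The peak footprint is assumed to come from a profiling run.

```python
def max_parallel_workers(peak_bytes_per_worker, available_bytes, cpu_cores):
    """Return the largest worker count whose combined peak memory fits.

    All names here are hypothetical. The result is capped by the number of
    CPU cores; a result of 0 means that not even one worker fits in memory.
    """
    if peak_bytes_per_worker <= 0:
        raise ValueError("peak_bytes_per_worker must be positive")
    if available_bytes < 0:
        raise ValueError("available_bytes must not be negative")
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")

    # Number of workers that fit side by side into the available memory.
    fit_in_memory = available_bytes // peak_bytes_per_worker

    # Running more workers than cores brings no further speedup.
    return min(cpu_cores, fit_in_memory)
```

For instance, with a measured peak of 3 GiB per worker, 16 GiB of available memory, and 8 cores, `max_parallel_workers(3 * 2**30, 16 * 2**30, 8)` yields 5: memory, not the core count, limits the degree of parallelism.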
In this talk, we present our profiling tool traceR and show how to apply it to the analysis of parallel R programs.