Balavarun Pedapudi, supervised by Prof. Kishore Kothapalli, received his Master of Science in Computer Science and Engineering (CSE). Here is a summary of his research work on Minimum Spanning Tree Computation using Parallel Heterogeneous External Memory:
Massive graph datasets underpin applications in web search, social networks, transportation, and biomedicine, but their size increasingly exceeds the on-board memory of modern GPUs. While GPUs offer substantial parallelism and memory bandwidth, many real-world inputs fit in CPU memory but not in GPU memory, making naive GPU-only methods inapplicable and unified-memory or zero-copy approaches unpredictable on irregular workloads. This thesis introduces the Parallel Heterogeneous External Memory (PHEM) model, a principled framework for processing graphs that exceed GPU memory while leveraging both CPU and GPU effectively. PHEM targets the regime where the full edge set must be streamed to the device, but the vertex set and modest auxiliary state fit on the GPU. The model partitions an input edge stream into a CPU portion and GPU batches, and overlaps communication with computation so that the GPU processes batch i while batch i+1 is transferred. A tunable parameter controls the CPU–GPU work split, enabling adaptation to hardware and dataset characteristics. We present a generic algorithmic template with three customizable components (CPUCompute, GPUCompute, and GPUMerge) that maintains incremental state across batches and minimizes device-space footprint, requiring only one or two passes over the input.

We instantiate the model for minimum spanning tree computation (PHEM-MST). On the GPU, we design a memory-efficient Borůvka variant on an edge-list representation that avoids symmetric duplicates and uses a two-phase atomic procedure to find best neighbors, followed by iterative shortcutting and filtering. In parallel, the CPU executes a filter-based Kruskal to prune non-MST edges from its partition. A final GPU-side merge combines the CPU and GPU forests; correctness follows from the optimal substructure of the MST.
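The data flow of the template above can be sketched in plain Python. This is a hedged, CPU-only stand-in: the function names mirror the CPUCompute/GPUCompute/GPUMerge components described in the summary, but both sides here use a simple Kruskal with union-find, whereas the actual thesis runs a Borůvka variant with atomics on the GPU. The `cpu_fraction` and `num_batches` parameters are illustrative stand-ins for the tunable work split and batch count.

```python
class DSU:
    """Union-find with path halving, used by Kruskal."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def kruskal(n, edges):
    """MST (forest) of an edge list [(u, v, w), ...] via Kruskal."""
    dsu = DSU(n)
    return [e for e in sorted(edges, key=lambda e: e[2])
            if dsu.union(e[0], e[1])]


def phem_mst(n, edges, cpu_fraction=0.25, num_batches=4):
    """PHEM-style split: the CPU prunes non-MST edges from its partition,
    the 'GPU' side processes batches incrementally (keeping only a forest
    as state), and a final merge combines the two partial forests.
    Correctness relies on the cycle property: an edge that is not in the
    MST of any subset containing it cannot be in the global MST."""
    k = int(len(edges) * cpu_fraction)
    cpu_part, gpu_part = edges[:k], edges[k:]

    # CPUCompute: filter non-MST edges from the CPU partition.
    cpu_forest = kruskal(n, cpu_part)

    # GPUCompute: stream batches; only the incremental forest stays resident,
    # modeling the small device-side footprint (batch i computes while batch
    # i+1 would be in transfer on real hardware).
    gpu_forest = []
    step = max(1, -(-len(gpu_part) // num_batches))  # ceil division
    for i in range(0, len(gpu_part), step):
        batch = gpu_part[i:i + step]
        gpu_forest = kruskal(n, gpu_forest + batch)

    # GPUMerge: MST of the union of the surviving edges is the global MST.
    return kruskal(n, cpu_forest + gpu_forest)
```

For example, on a small weighted graph, `phem_mst` returns a forest with the same total weight as a one-shot Kruskal over all edges, regardless of how the edges are split across the CPU portion and the batches.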
We evaluate on eleven graphs spanning road networks, web crawls, social networks, synthetic generators, and biomedical literature, with up to 11.6B edges, on two heterogeneous systems (NVIDIA L4, 24 GB; NVIDIA A100, 40 GB) paired with AMD EPYC CPUs. Against state-of-the-art CPU and GPU baselines (PBBS, ECL-MST) and a monolithic GPU variant, PHEM-MST processes all datasets, including those that exceed GPU memory for the baselines, and achieves speedups of up to 3.5×. Performance improves further on the A100, underscoring the role of device memory bandwidth. We analyze the impact of batch count and show dataset-specific optima that balance compute, contention, and transfer overheads; explicit batching with overlapped transfers yields more predictable performance than unified-memory mechanisms on irregular access patterns. Contributions include the PHEM model and a reusable framework for heterogeneous, out-of-memory graph processing; an efficient PHEM-MST design and implementation; and a comprehensive empirical study with guidance for tuning and portability. We discuss limitations (parameter sensitivity, structure-aware partitioning) and extensions (multi-GPU scaling, disk-backed execution). Beyond MST, the model naturally extends to BFS, connected components, and PageRank on graphs that outgrow GPU memory.
December 2025

