Benchmarking OpenGL Geometry Performance: A Practical Guide

How to Build an OpenGL Geometry Benchmark — Tests, Metrics, and Results

Building a robust OpenGL geometry benchmark lets you measure how efficiently a GPU and driver handle geometric workloads: vertex processing, tessellation, culling, draw submission, and the throughput of vertex/index buffers. This guide walks through goals, test design, implementation details, metrics to collect, how to run experiments consistently, and how to present and interpret results.


Goals and scope

  • Primary goal: measure geometry-stage performance (vertex fetch, vertex shading, primitive assembly, tessellation, culling) independently of fragment-heavy workloads.
  • Secondary goals: compare drivers/GPU architectures, evaluate effects of API usage patterns (draw calls, instancing, buffer usage), and reveal bottlenecks (CPU submission, memory bandwidth, shader ALU limits).
  • Scope decisions: test only OpenGL (up to a target version, e.g., 4.6); optionally include tessellation and indirect/compute-driven draws; and avoid heavy fragment shaders or high-resolution render targets that shift the bottleneck to rasterization.

High-level test types

Design multiple complementary tests to isolate different subsystems:

  1. Microbenchmarks — isolate single behaviors:
    • Vertex fetch throughput: large vertex buffers, simple passthrough vertex shader.
    • Attribute count/stride tests: varying vertex formats (position only → many attributes).
    • Index buffer vs non-indexed draws.
    • Draw call overhead: many small draws vs few large draws.
    • Instancing: single mesh drawn with many instances.
  2. Tessellation tests — vary tessellation levels and evaluation shader complexity to stress tessellation control/eval stages.
  3. Culling & CPU-bound tests — perform CPU frustum culling or software LOD selection to measure CPU vs GPU balance.
  4. Real-world scene tests — a few representative geometry-heavy scenes (city, vegetation, meshes with high vertex counts) to measure practical performance.
  5. Stress tests — extreme counts of vertices/primitives to find throughput limits and driver/hardware failure points.

Testbed and reproducibility

  • Target specific OpenGL version (recommendation: OpenGL 4.6 if available). Document required extensions (ARB_vertex_attrib_binding, ARB_draw_indirect, ARB_multi_draw_indirect, ARB_buffer_storage, etc.).
  • Use stable, well-known drivers and record driver versions, OS, GPU model, and CPU. Save full hardware/software configuration with each run.
  • Run with consistent OS power settings (disable power-saving features), GPU power profiles set to “performance” where available, and run tests multiple times to capture variance.
  • Use a dedicated benchmark mode in your app that disables vsync, overlays, OS compositor, and other background tasks where possible.

Implementation details

Framework:

  • Create a small, self-contained OpenGL application in C++ (or Rust) using a cross-platform window/context API (GLFW, SDL2). Use glad or GLEW for function loading.
  • Use high-resolution timer APIs: std::chrono::steady_clock (guaranteed monotonic, unlike high_resolution_clock on some platforms) or platform-specific high-resolution timers.

Rendering pipeline:

  • Minimal fragment work: use a simple fragment shader that writes a constant color to avoid a fragment bottleneck. To reduce rasterization cost further, use a very small viewport/render target or glPolygonMode(GL_FRONT_AND_BACK, GL_POINT); to eliminate rasterization entirely, enable GL_RASTERIZER_DISCARD (but verify the driver does not then cull the geometry work you intend to measure).
  • Use separable shader programs for vertex/tessellation stages, and provide shader permutations to toggle complexity (e.g., number of arithmetic ops, texture fetches).
  • Avoid blending, multisampling, or expensive state changes unless testing those specifically.

Buffers and memory:

  • Use persistent mapped buffers (ARB_buffer_storage) for high-throughput streaming tests and compare with classic glBufferSubData for CPU-bound tests.
  • Test different index sizes (GL_UNSIGNED_SHORT vs GL_UNSIGNED_INT).
  • For static geometry, place vertex data in STATIC_DRAW buffers; for streaming, use STREAM_DRAW or buffer storage with coherent mapping.

Draw call patterns:

  • Single large draw: one glDrawElements call with huge index count.
  • Many small draws: thousands of glDrawElements calls each with small primitive counts.
  • Instanced draws: glDrawElementsInstanced to stress instance attribute processing.
  • Indirect draws: glMultiDrawElementsIndirect to measure driver-side overhead.
  • Multi-draw and bindless (where available) — include ARB_multi_draw_indirect and vendor bindless extensions (e.g., NV_bindless_multi_draw_indirect) in optional tests.

Shaders:

  • Vertex shader permutations:
    • Passthrough: transform position by MVP only.
    • ALU-heavy: add many operations (mix, dot, sin) to increase vertex stage ALU usage.
    • Fetch-heavy: reference many vertex attributes/texel fetches in VS (if supported).
  • Tessellation shaders: vary outer/inner tessellation levels and evaluation complexity.

Timing measurements:

  • GPU timings: use glQueryCounter with GL_TIMESTAMP to measure GPU time for a sequence of draws. Use two timestamps (start/end) and glGetQueryObjectui64v for precise GPU time. For older drivers, fall back to glFinish plus CPU timers (less accurate).
  • CPU timings: measure time to issue draw calls (submission time) excluding GPU sync with CPU timers.
  • Pipeline breakdown: use GL_TIME_ELAPSED queries (glBeginQuery/glEndQuery) or timestamps around specific draw sequences to attribute time to individual workloads. Note that OpenGL timer queries measure elapsed time of command sequences, not individual pipeline stages; per-stage breakdowns require vendor profiling tools.
  • Synchronization: avoid glFinish except when measuring full frame latency explicitly; use fences (glFenceSync / glClientWaitSync) when required for accurate partial timing.

Data to record each run:

  • GPU time (ns or ms)
  • CPU submission time (ms)
  • Number of vertices and primitives processed
  • Draw call count, instance count
  • Peak/average GPU memory bandwidth used (estimate from buffer sizes & streaming behavior)
  • Timestamp / machine state / driver version / power state

Metrics and derived values

Core measured metrics:

  • Frame time (ms) — GPU only (timestamp-based) and CPU submission time.
  • Vertices processed per second (VPS) = total_vertices / GPU_time.
  • Primitives processed per second (PPS) = total_primitives / GPU_time.
  • Draw calls per second (DPS) = draw_calls / CPU_submission_time.
  • Instances per second (IPS) = total_instances / GPU_time for instanced tests.

Derived throughput metrics:

  • Vertex throughput (vertices/sec) and vertex shader ALU utilization (proxy via varying shader complexity).
  • Index throughput (indices/sec).
  • Bandwidth usage (bytes/sec) — deduced from buffer upload patterns and mapped memory operations.
  • CPU overhead per draw (ms/draw) — CPU_submission_time / draw_calls.

Error bars and variance:

  • Run each test N times (recommend 10–20) and report mean ± standard deviation or 95% confidence interval.
  • Report minimum, median, and maximum to surface outliers (driver/OS interruptions).

Test matrix examples

Create a matrix combining variables to ensure coverage. Example:

  • Draw call count: {1, 10, 100, 1k, 10k}
  • Vertices per draw: {3, 100, 1k, 10k}
  • Shader complexity: {passthrough, medium, heavy}
  • Index type: {none, 16-bit, 32-bit}
  • Instancing: {1, 10, 1000}
  • Tessellation level: {0, 1, 4, 16, 64}

This results in many permutations — prioritize ones likely to show differences between GPUs/drivers.


Running experiments

  • Warm-up: run each test a few times before recording to ensure driver JIT/compilation is done and caches are populated.
  • Randomize test order between full runs to avoid thermal drift bias across tests.
  • Thermals: monitor GPU temperature and, if possible, run tests in a thermally controlled environment. Record temperatures with each run.
  • Power states: ensure consistent GPU clocks (use vendor tools to lock clocks if comparing across devices).
  • Background load: run tests on a clean system; close unnecessary apps and disable overlays (Steam, Discord).

Presenting results

Visualizations:

  • Line charts of VPS/PPS vs. vertices-per-draw or draw-call count.
  • Bar charts comparing GPUs/drivers for a single test scenario.
  • Heatmaps for large test matrix (axes = draw count vs vertices-per-draw, color = VPS).
  • Boxplots for variance across runs.

Include tables with raw numbers and metadata (GPU, driver, OS). Use logarithmic axes where throughput spans orders of magnitude.

Example table layout:

Test                     | GPU time (ms) | Vertices    | VPS (M) | Draw calls | CPU ms/draw
Small-draws, passthrough | 120.4         | 120,000,000 | 996.7   | 10,000     | 0.012

Interpreting results and common patterns

  • High VPS but low DPS indicates GPU-heavy workload with few draws; CPU is not the bottleneck.
  • Low VPS with many small draws suggests CPU draw-call submission overhead or driver inefficiency.
  • Tessellation sensitivity: some GPUs excel at tessellation; measure with and without tessellation to isolate its cost.
  • Instancing helps reduce CPU overhead — look for scaling when instancing increases.
  • Vertex attribute format matters: many attributes or large strides reduce memory locality and vertex fetch throughput.
  • Driver/extension behavior: vendor drivers may optimize specific patterns (multi-draw, bindless), producing large differences. Include those in analysis.

Example pseudo-code snippets

Vertex passthrough shader (GLSL):

#version 460 core
layout(location = 0) in vec3 inPosition;
uniform mat4 uMVP;
void main() {
    gl_Position = uMVP * vec4(inPosition, 1.0);
}

Timestamp query pattern:

GLuint queries[2];
glGenQueries(2, queries);
glQueryCounter(queries[0], GL_TIMESTAMP);
// issue draw calls here
glQueryCounter(queries[1], GL_TIMESTAMP);
GLint64 startTime = 0, endTime = 0;
// GL_QUERY_RESULT blocks until the GPU has reached each timestamp.
glGetQueryObjecti64v(queries[0], GL_QUERY_RESULT, &startTime);
glGetQueryObjecti64v(queries[1], GL_QUERY_RESULT, &endTime);
double gpuMs = (endTime - startTime) / 1e6;  // timestamps are in nanoseconds

Many-small-draws pattern (conceptual):

for (int i = 0; i < drawCount; ++i) {
    // bind VAO for small mesh
    // offset is a byte offset into the bound index buffer
    glDrawElements(GL_TRIANGLES, indicesPerSmallMesh, GL_UNSIGNED_INT,
                   (const void*)(uintptr_t)(i * offset));
}

Instanced draw:

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount); 

Pitfalls and gotchas

  • Vsync/compositor: always disable vsync for throughput measurements. Compositors can introduce variance.
  • Buffer streaming path: different drivers optimize buffer updates differently; test both mapping strategies and glBufferSubData.
  • GPU timers accuracy: some drivers may delay or batch timestamp queries — ensure usage pattern is supported and validated.
  • Thermal throttling: long runs can reduce clocks; monitor and control GPU clocks or present results with thermal state documented.
  • Driver optimizations: driver may eliminate work if outputs are not observed (dead-code elimination). Avoid this by ensuring results are consumed (readback or using results in subsequent visible pass) or use glMemoryBarrier and explicit synchronization where needed.
  • Comparing across APIs: results are specific to OpenGL semantics; do not assume parity with Vulkan/DirectX.

Example conclusions you might draw

  • GPU A processes more vertices per second in a single large draw, but GPU B handles many small draws better due to lower driver overhead — choose GPU based on expected workload.
  • Instancing dramatically reduces CPU overhead for many-object scenes; enabling instancing improved draw calls/sec by 10–50× in tests.
  • Tessellation levels above X cause a steep drop in throughput on GPU C, indicating a tessellation unit bottleneck.

Next steps and extensions

  • Add Vulkan and Direct3D 12 counterparts to compare API overhead and driver efficiency.
  • Add shader profiling (instrument ALU vs memory stalls) using vendor tools (Nsight, Radeon GPU Profiler).
  • Automate runs and result collection (JSON logs, CI integration).
  • Provide a downloadable dataset and scripts for reproducibility.

