How to Build an OpenGL Geometry Benchmark — Tests, Metrics, and Results

Building a robust OpenGL geometry benchmark lets you measure how efficiently a GPU and driver handle geometric workloads: vertex processing, tessellation, culling, draw submission, and the throughput of vertex/index buffers. This guide walks through goals, test design, implementation details, metrics to collect, how to run experiments consistently, and how to present and interpret results.
Goals and scope
- Primary goal: measure geometry-stage performance (vertex fetch, vertex shading, primitive assembly, tessellation, culling) independently of fragment-heavy workloads.
- Secondary goals: compare drivers/GPU architectures, evaluate effects of API usage patterns (draw calls, instancing, buffer usage), and reveal bottlenecks (CPU submission, memory bandwidth, shader ALU limits).
- Scope decisions: test only OpenGL (up to a target version, e.g., 4.6), optionally include tessellation and indirect/compute-driven draws, and avoid heavy fragment shaders or high-resolution render targets that shift the bottleneck to rasterization.
High-level test types
Design multiple complementary tests to isolate different subsystems:
- Microbenchmarks — isolate single behaviors:
  - Vertex fetch throughput: large vertex buffers, simple passthrough vertex shader.
  - Attribute count/stride tests: varying vertex formats (position only → many attributes).
  - Indexed vs. non-indexed draws.
  - Draw call overhead: many small draws vs. few large draws.
  - Instancing: a single mesh drawn with many instances.
- Tessellation tests — vary tessellation levels and evaluation shader complexity to stress tessellation control/eval stages.
- Culling & CPU-bound tests — perform CPU frustum culling or software LOD selection to measure CPU vs GPU balance.
- Real-world scene tests — a few representative geometry-heavy scenes (city, vegetation, meshes with high vertex counts) to measure practical performance.
- Stress tests — extreme counts of vertices/primitives to find throughput limits and driver/hardware failure points.
Testbed and reproducibility
- Target a specific OpenGL version (recommendation: OpenGL 4.6 if available). Document required extensions (ARB_vertex_attrib_binding, ARB_draw_indirect, ARB_multi_draw_indirect, ARB_buffer_storage, etc.).
- Use stable, well-known drivers and record driver versions, OS, GPU model, and CPU. Save full hardware/software configuration with each run.
- Run with consistent OS power settings (disable power-saving features), GPU power profiles set to “performance” where available, and run tests multiple times to capture variance.
- Use a dedicated benchmark mode in your app that disables vsync, overlays, OS compositor, and other background tasks where possible.
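As a minimal sketch of recording that configuration, the strings OpenGL itself exposes can be logged at startup (the logRunConfig name is illustrative; a GL context must already be current):

#include <cstdio>
#include <glad/glad.h>

// Log the identity of the GPU/driver under test alongside every result file.
void logRunConfig() {
    std::printf("GL_VENDOR:   %s\n", (const char*)glGetString(GL_VENDOR));
    std::printf("GL_RENDERER: %s\n", (const char*)glGetString(GL_RENDERER));
    std::printf("GL_VERSION:  %s\n", (const char*)glGetString(GL_VERSION));
    std::printf("GLSL:        %s\n", (const char*)glGetString(GL_SHADING_LANGUAGE_VERSION));
}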
Implementation details
Framework:
- Create a small, self-contained OpenGL application in C++ (or Rust) using a cross-platform window/context API (GLFW, SDL2). Use glad or GLEW for function loading.
- Use timer APIs with high resolution (std::chrono::high_resolution_clock or platform-specific high-res timers).
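A minimal harness skeleton under those choices (GLFW + glad; window size and title are arbitrary) might look like:

#include <glad/glad.h>
#include <GLFW/glfw3.h>

int main() {
    if (!glfwInit()) return 1;
    // Request a core-profile 4.6 context to match the benchmark target.
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 4);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 6);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    GLFWwindow* window = glfwCreateWindow(256, 256, "geom-bench", nullptr, nullptr);
    if (!window) { glfwTerminate(); return 1; }
    glfwMakeContextCurrent(window);
    if (!gladLoadGLLoader((GLADloadproc)glfwGetProcAddress)) return 1;
    glfwSwapInterval(0); // benchmark mode: vsync off
    // ... build shaders/buffers, run the test matrix, write results ...
    glfwDestroyWindow(window);
    glfwTerminate();
    return 0;
}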
Rendering pipeline:
- Minimal fragment work: use a trivial fragment shader that writes a constant color to avoid a fragment bottleneck (see the shader sketch after this list). To cut rasterization cost further, render to a very small viewport/target, use glPolygonMode(GL_FRONT_AND_BACK, GL_POINT), or enable GL_RASTERIZER_DISCARD to skip rasterization entirely (vertex and tessellation stages still run).
- Use separable shader programs for vertex/tessellation stages, and provide shader permutations to toggle complexity (e.g., number of arithmetic ops, texture fetches).
- Avoid blending, multisampling, or expensive state changes unless testing those specifically.
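The constant-color fragment shader mentioned above is essentially a one-liner (GLSL):

#version 460 core
out vec4 outColor;
void main() { outColor = vec4(1.0, 0.0, 1.0, 1.0); } // constant write, negligible fragment cost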
Buffers and memory:
- Use persistent mapped buffers (ARB_buffer_storage) for high-throughput streaming tests and compare with classic glBufferSubData for CPU-bound tests (see the mapping sketch after this list).
- Test different index sizes (GL_UNSIGNED_SHORT vs GL_UNSIGNED_INT).
- For static geometry, place vertex data in STATIC_DRAW buffers; for streaming, use STREAM_DRAW or buffer storage with coherent mapping.
Draw call patterns:
- Single large draw: one glDrawElements call with huge index count.
- Many small draws: thousands of glDrawElements calls each with small primitive counts.
- Instanced draws: glDrawElementsInstanced to stress instance attribute processing.
- Indirect draws: glMultiDrawElementsIndirect to measure driver-side overhead.
- Multi-draw and bindless (where available): include ARB_multi_draw_indirect and bindless paths such as NV_bindless_multi_draw_indirect in optional tests (see the command-layout sketch after this list).
Shaders:
- Vertex shader permutations:
  - Passthrough: transform position by MVP only.
  - ALU-heavy: add many operations (mix, dot, sin) to increase vertex-stage ALU usage.
  - Fetch-heavy: reference many vertex attributes/texel fetches in the VS (if supported).
- Tessellation shaders: vary outer/inner tessellation levels and evaluation complexity.
Timing measurements:
- GPU timings: use glQueryCounter + GL_TIMESTAMP to measure GPU time for a sequence of draws. Use two timestamps (start/end) and glGetQueryObjectui64v for precise GPU time. For older drivers, fall back to glFinish + CPU timers (less accurate).
- CPU timings: measure time to issue draw calls (submission time) excluding GPU sync with CPU timers.
- Pipeline breakdown: if available, use pipeline statistics queries (ARB_pipeline_statistics_query), or bracket specific passes with timer queries (glBeginQuery/glEndQuery with GL_TIME_ELAPSED) to attribute GPU time to individual dispatches.
- Synchronization: avoid glFinish except when measuring full frame latency explicitly; use fences (glFenceSync / glClientWaitSync) when required for accurate partial timing.
Data to record each run:
- GPU time (ns or ms)
- CPU submission time (ms)
- Number of vertices and primitives processed
- Draw call count, instance count
- Peak/average GPU memory bandwidth used (estimate from buffer sizes & streaming behavior)
- Timestamp / machine state / driver version / power state
Metrics and derived values
Core measured metrics:
- Frame time (ms) — GPU only (timestamp-based) and CPU submission time.
- Vertices processed per second (VPS) = total_vertices / GPU_time.
- Primitives processed per second (PPS) = total_primitives / GPU_time.
- Draw calls per second (DPS) = draw_calls / CPU_submission_time.
- Instances per second (IPS) = total_instances / GPU_time for instanced tests.
Derived throughput metrics:
- Vertex throughput (vertices/sec) and vertex shader ALU utilization (proxy via varying shader complexity).
- Index throughput (indices/sec).
- Bandwidth usage (bytes/sec) — deduced from buffer upload patterns and mapped memory operations.
- CPU overhead per draw (ms/draw) — CPU_submission_time / draw_calls.
Error bars and variance:
- Run each test N times (recommend 10–20) and report mean ± standard deviation or 95% confidence interval.
- Report minimum, median, and maximum to surface outliers (driver/OS interruptions).
Test matrix examples
Create a matrix combining variables to ensure coverage. Example:
- Draw call count: {1, 10, 100, 1k, 10k}
- Vertices per draw: {3, 100, 1k, 10k}
- Shader complexity: {passthrough, medium, heavy}
- Index type: {none, 16-bit, 32-bit}
- Instancing: {1, 10, 1000}
- Tessellation level: {0, 1, 4, 16, 64}
This results in many permutations — prioritize ones likely to show differences between GPUs/drivers.
Running experiments
- Warm-up: run each test a few times before recording to ensure driver JIT/compilation is done and caches are populated.
- Randomize test order between full runs to avoid thermal-drift bias across tests (see the harness sketch after this list).
- Thermals: monitor GPU temperature and, if possible, run tests in a thermally controlled environment. Record temperatures with each run.
- Power states: ensure consistent GPU clocks (use vendor tools to lock clocks if comparing across devices).
- Background load: run tests on a clean system; close unnecessary apps and disable overlays (Steam, Discord).
Presenting results
Visualizations:
- Line charts of VPS/PPS vs. vertices-per-draw or draw-call count.
- Bar charts comparing GPUs/drivers for a single test scenario.
- Heatmaps for large test matrix (axes = draw count vs vertices-per-draw, color = VPS).
- Boxplots for variance across runs.
Include tables with raw numbers and metadata (GPU, driver, OS). Use logarithmic axes where throughput spans orders of magnitude.
Example table layout:
Test | GPU time (ms) | Vertices | VPS (M) | Draw calls | CPU ms/draw |
---|---|---|---|---|---|
Small-draws, passthrough | 120.4 | 120,000,000 | 997 | 10,000 | 0.012 |
Interpreting results and common patterns
- High VPS with low DPS indicates a GPU-heavy workload with few large draws; the CPU is not the bottleneck.
- Low VPS with many small draws suggests CPU draw-call submission overhead or driver inefficiency.
- Tessellation sensitivity: some GPUs excel at tessellation; measure with and without tessellation to isolate its cost.
- Instancing helps reduce CPU overhead — look for scaling when instancing increases.
- Vertex attribute format matters: many attributes or large strides reduce memory locality and vertex fetch throughput.
- Driver/extension behavior: vendor drivers may optimize specific patterns (multi-draw, bindless), producing large differences. Include those in analysis.
Example pseudo-code snippets
Vertex passthrough shader (GLSL):
#version 460 core
layout(location = 0) in vec3 inPosition;
uniform mat4 uMVP;
void main() {
    gl_Position = uMVP * vec4(inPosition, 1.0);
}
Timestamp query pattern:
GLuint queries[2];
glGenQueries(2, queries);

glQueryCounter(queries[0], GL_TIMESTAMP);
// issue draw calls here
glQueryCounter(queries[1], GL_TIMESTAMP);

GLint64 startTime, endTime;
glGetQueryObjecti64v(queries[0], GL_QUERY_RESULT, &startTime); // blocks until available
glGetQueryObjecti64v(queries[1], GL_QUERY_RESULT, &endTime);
double gpuMs = (endTime - startTime) / 1e6; // timestamps are in nanoseconds
Many-small-draws pattern (conceptual):
for (int i = 0; i < drawCount; ++i) {
    // bind VAO for small mesh
    glDrawElements(GL_TRIANGLES, indicesPerSmallMesh, GL_UNSIGNED_INT,
                   (void*)(i * offset));
}
Instanced draw:
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount);
Pitfalls and gotchas
- Vsync/compositor: always disable vsync for throughput measurements. Compositors can introduce variance.
- Buffer streaming path: different drivers optimize buffer updates differently; test both mapping strategies and glBufferSubData.
- GPU timers accuracy: some drivers may delay or batch timestamp queries — ensure usage pattern is supported and validated.
- Thermal throttling: long runs can reduce clocks; monitor and control GPU clocks or present results with thermal state documented.
- Driver optimizations: a driver may eliminate work whose outputs are never observed (dead-code elimination). Avoid this by consuming the results, e.g., a small readback or using them in a subsequent visible pass (see the sketch after this list), or use glMemoryBarrier and explicit synchronization where needed.
- Comparing across APIs: results are specific to OpenGL semantics; do not assume parity with Vulkan/DirectX.
Example conclusions you might draw
- GPU A processes more vertices per second in a single large draw, but GPU B handles many small draws better due to lower driver overhead — choose GPU based on expected workload.
- Instancing dramatically reduces CPU overhead for many-object scenes; enabling instancing improved draw calls/sec by 10–50× in tests.
- Tessellation levels above X cause a steep drop in throughput on GPU C, indicating a tessellation unit bottleneck.
Next steps and extensions
- Add Vulkan and Direct3D 12 counterparts to compare API overhead and driver efficiency.
- Add shader profiling (instrument ALU vs memory stalls) using vendor tools (Nsight, Radeon GPU Profiler).
- Automate runs and result collection (JSON logs, CI integration).
- Provide a downloadable dataset and scripts for reproducibility.