How to Build an OpenGL Geometry Benchmark — Tests, Metrics, and Results

Building a robust OpenGL geometry benchmark lets you measure how efficiently a GPU and driver handle geometric workloads: vertex processing, tessellation, culling, draw submission, and the throughput of vertex/index buffers. This guide walks through goals, test design, implementation details, metrics to collect, how to run experiments consistently, and how to present and interpret results.
Goals and scope
- Primary goal: measure geometry-stage performance (vertex fetch, vertex shading, primitive assembly, tessellation, culling) independently of fragment-heavy workloads.
- Secondary goals: compare drivers/GPU architectures, evaluate effects of API usage patterns (draw calls, instancing, buffer usage), and reveal bottlenecks (CPU submission, memory bandwidth, shader ALU limits).
- Scope decisions: test only OpenGL (up to a target version, e.g., 4.6), optionally include tessellation and indirect/compute-driven draws, and avoid heavy fragment shaders or high-resolution render targets that shift the bottleneck to rasterization.
High-level test types
Design multiple complementary tests to isolate different subsystems:
- Microbenchmarks — isolate single behaviors:
  - Vertex fetch throughput: large vertex buffers, simple passthrough vertex shader.
  - Attribute count/stride tests: varying vertex formats (position only → many attributes).
  - Indexed vs. non-indexed draws.
  - Draw call overhead: many small draws vs. few large draws.
  - Instancing: a single mesh drawn with many instances.
- Tessellation tests — vary tessellation levels and evaluation shader complexity to stress tessellation control/eval stages.
- Culling & CPU-bound tests — perform CPU frustum culling or software LOD selection to measure CPU vs GPU balance.
- Real-world scene tests — a few representative geometry-heavy scenes (city, vegetation, meshes with high vertex counts) to measure practical performance.
- Stress tests — extreme counts of vertices/primitives to find throughput limits and driver/hardware failure points.
Testbed and reproducibility
- Target a specific OpenGL version (recommendation: OpenGL 4.6 if available). Document required extensions (ARB_vertex_attrib_binding, ARB_draw_indirect, ARB_multi_draw_indirect, ARB_buffer_storage, etc.).
- Use stable, well-known drivers and record driver versions, OS, GPU model, and CPU. Save full hardware/software configuration with each run.
- Run with consistent OS power settings (disable power-saving features), GPU power profiles set to “performance” where available, and run tests multiple times to capture variance.
- Use a dedicated benchmark mode in your app that disables vsync, overlays, OS compositor, and other background tasks where possible.
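As a minimal sketch of recording that configuration, the strings OpenGL itself exposes can be logged at startup (the logRunConfig name is illustrative; a GL context must already be current):

#include <cstdio>
#include <glad/glad.h>

// Log the identity of the GPU/driver under test alongside every result file.
void logRunConfig() {
    std::printf("GL_VENDOR:   %s\n", (const char*)glGetString(GL_VENDOR));
    std::printf("GL_RENDERER: %s\n", (const char*)glGetString(GL_RENDERER));
    std::printf("GL_VERSION:  %s\n", (const char*)glGetString(GL_VERSION));
    std::printf("GLSL:        %s\n", (const char*)glGetString(GL_SHADING_LANGUAGE_VERSION));
}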
Implementation details
Framework:
- Create a small, self-contained OpenGL application in C++ (or Rust) using a cross-platform window/context API (GLFW, SDL2). Use glad or GLEW for function loading.
- Use timer APIs with high resolution (std::chrono::high_resolution_clock or platform-specific high-res timers).
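A minimal harness skeleton under those choices (GLFW + glad; window size and title are arbitrary) might look like:

#include <glad/glad.h>
#include <GLFW/glfw3.h>

int main() {
    if (!glfwInit()) return 1;
    // Request a core-profile 4.6 context to match the benchmark target.
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 4);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 6);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    GLFWwindow* window = glfwCreateWindow(256, 256, "geom-bench", nullptr, nullptr);
    if (!window) { glfwTerminate(); return 1; }
    glfwMakeContextCurrent(window);
    if (!gladLoadGLLoader((GLADloadproc)glfwGetProcAddress)) return 1;
    glfwSwapInterval(0); // benchmark mode: vsync off
    // ... build shaders/buffers, run the test matrix, write results ...
    glfwDestroyWindow(window);
    glfwTerminate();
    return 0;
}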
Rendering pipeline:
- Minimal fragment work: use a trivial fragment shader that writes a constant color to avoid a fragment bottleneck (see the shader sketch after this list). To cut rasterization cost further, render to a very small viewport/target, use glPolygonMode(GL_FRONT_AND_BACK, GL_POINT), or enable GL_RASTERIZER_DISCARD to skip rasterization entirely (vertex and tessellation stages still run).
- Use separable shader programs for vertex/tessellation stages, and provide shader permutations to toggle complexity (e.g., number of arithmetic ops, texture fetches).
- Avoid blending, multisampling, or expensive state changes unless testing those specifically.
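The constant-color fragment shader mentioned above is essentially a one-liner (GLSL):

#version 460 core
out vec4 outColor;
void main() { outColor = vec4(1.0, 0.0, 1.0, 1.0); } // constant write, negligible fragment cost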
Buffers and memory:
- Use persistent mapped buffers (ARB_buffer_storage) for high-throughput streaming tests and compare with classic glBufferSubData for CPU-bound tests (see the mapping sketch after this list).
- Test different index sizes (GL_UNSIGNED_SHORT vs GL_UNSIGNED_INT).
- For static geometry, place vertex data in STATIC_DRAW buffers; for streaming, use STREAM_DRAW or buffer storage with coherent mapping.
Draw call patterns:
- Single large draw: one glDrawElements call with huge index count.
- Many small draws: thousands of glDrawElements calls each with small primitive counts.
- Instanced draws: glDrawElementsInstanced to stress instance attribute processing.
- Indirect draws: glMultiDrawElementsIndirect to measure driver-side overhead.
- Multi-draw and bindless (where available): include ARB_multi_draw_indirect and bindless paths such as NV_bindless_multi_draw_indirect in optional tests (see the command-layout sketch after this list).
Shaders:
- Vertex shader permutations:
  - Passthrough: transform position by MVP only.
  - ALU-heavy: add many operations (mix, dot, sin) to increase vertex-stage ALU usage.
  - Fetch-heavy: reference many vertex attributes/texel fetches in the VS (if supported).
- Tessellation shaders: vary outer/inner tessellation levels and evaluation complexity.
Timing measurements:
- GPU timings: use glQueryCounter + GL_TIMESTAMP to measure GPU time for a sequence of draws. Use two timestamps (start/end) and glGetQueryObjectui64v for precise GPU time. For older drivers, fall back to glFinish + CPU timers (less accurate).
- CPU timings: measure time to issue draw calls (submission time) excluding GPU sync with CPU timers.
- Pipeline breakdown: if available, use pipeline statistics queries (ARB_pipeline_statistics_query), or bracket specific passes with timer queries (glBeginQuery/glEndQuery with GL_TIME_ELAPSED) to attribute GPU time to individual dispatches.
- Synchronization: avoid glFinish except when measuring full frame latency explicitly; use fences (glFenceSync / glClientWaitSync) when required for accurate partial timing.
Data to record each run:
- GPU time (ns or ms)
- CPU submission time (ms)
- Number of vertices and primitives processed
- Draw call count, instance count
- Peak/average GPU memory bandwidth used (estimate from buffer sizes & streaming behavior)
- Timestamp / machine state / driver version / power state
Metrics and derived values
Core measured metrics:
- Frame time (ms) — GPU only (timestamp-based) and CPU submission time.
- Vertices processed per second (VPS) = total_vertices / GPU_time.
- Primitives processed per second (PPS) = total_primitives / GPU_time.
- Draw calls per second (DPS) = draw_calls / CPU_submission_time.
- Instances per second (IPS) = total_instances / GPU_time for instanced tests.
Derived throughput metrics:
- Vertex throughput (vertices/sec) and vertex shader ALU utilization (proxy via varying shader complexity).
- Index throughput (indices/sec).
- Bandwidth usage (bytes/sec) — deduced from buffer upload patterns and mapped memory operations.
- CPU overhead per draw (ms/draw) — CPU_submission_time / draw_calls.
Error bars and variance:
- Run each test N times (recommend 10–20) and report mean ± standard deviation or 95% confidence interval.
- Report minimum, median, and maximum to surface outliers (driver/OS interruptions).
Test matrix examples
Create a matrix combining variables to ensure coverage. Example:
- Draw call count: {1, 10, 100, 1k, 10k}
- Vertices per draw: {3, 100, 1k, 10k}
- Shader complexity: {passthrough, medium, heavy}
- Index type: {none, 16-bit, 32-bit}
- Instancing: {1, 10, 1000}
- Tessellation level: {0, 1, 4, 16, 64}
This results in many permutations — prioritize ones likely to show differences between GPUs/drivers.
Running experiments
- Warm-up: run each test a few times before recording to ensure driver JIT/compilation is done and caches are populated.
- Randomize test order between full runs to avoid thermal-drift bias across tests (see the harness sketch after this list).
- Thermals: monitor GPU temperature and, if possible, run tests in a thermally controlled environment. Record temperatures with each run.
- Power states: ensure consistent GPU clocks (use vendor tools to lock clocks if comparing across devices).
- Background load: run tests on a clean system; close unnecessary apps and disable overlays (Steam, Discord).
Presenting results
Visualizations:
- Line charts of VPS/PPS vs. vertices-per-draw or draw-call count.
- Bar charts comparing GPUs/drivers for a single test scenario.
- Heatmaps for large test matrix (axes = draw count vs vertices-per-draw, color = VPS).
- Boxplots for variance across runs.
Include tables with raw numbers and metadata (GPU, driver, OS). Use logarithmic axes where throughput spans orders of magnitude.
Example table layout:
Test | GPU time (ms) | Vertices | VPS (M) | Draw calls | CPU ms/draw |
---|---|---|---|---|---|
Small-draws, passthrough | 120.4 | 120,000,000 | 997 | 10,000 | 0.012 |
Interpreting results and common patterns
- High VPS with low DPS indicates a GPU-heavy workload with few large draws; the CPU is not the bottleneck.
- Low VPS with many small draws suggests CPU draw-call submission overhead or driver inefficiency.
- Tessellation sensitivity: some GPUs excel at tessellation; measure with and without tessellation to isolate its cost.
- Instancing helps reduce CPU overhead — look for scaling when instancing increases.
- Vertex attribute format matters: many attributes or large strides reduce memory locality and vertex fetch throughput.
- Driver/extension behavior: vendor drivers may optimize specific patterns (multi-draw, bindless), producing large differences. Include those in analysis.
Example pseudo-code snippets
Vertex passthrough shader (GLSL):
#version 460 core
layout(location = 0) in vec3 inPosition;
uniform mat4 uMVP;
void main() {
    gl_Position = uMVP * vec4(inPosition, 1.0);
}
Timestamp query pattern:
GLuint queries[2];
glGenQueries(2, queries);

glQueryCounter(queries[0], GL_TIMESTAMP);
// issue draw calls here
glQueryCounter(queries[1], GL_TIMESTAMP);

GLint64 startTime, endTime;
glGetQueryObjecti64v(queries[0], GL_QUERY_RESULT, &startTime); // blocks until available
glGetQueryObjecti64v(queries[1], GL_QUERY_RESULT, &endTime);
double gpuMs = (endTime - startTime) / 1e6; // timestamps are in nanoseconds
Many-small-draws pattern (conceptual):
for (int i = 0; i < drawCount; ++i) {
    // bind VAO for small mesh
    glDrawElements(GL_TRIANGLES, indicesPerSmallMesh, GL_UNSIGNED_INT,
                   (void*)(i * offset));
}
Instanced draw:
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount);
Pitfalls and gotchas
- Vsync/compositor: always disable vsync for throughput measurements. Compositors can introduce variance.
- Buffer streaming path: different drivers optimize buffer updates differently; test both mapping strategies and glBufferSubData.
- GPU timers accuracy: some drivers may delay or batch timestamp queries — ensure usage pattern is supported and validated.
- Thermal throttling: long runs can reduce clocks; monitor and control GPU clocks or present results with thermal state documented.
- Driver optimizations: a driver may eliminate work whose outputs are never observed (dead-code elimination). Avoid this by consuming the results, e.g., a small readback or using them in a subsequent visible pass (see the sketch after this list), or use glMemoryBarrier and explicit synchronization where needed.
- Comparing across APIs: results are specific to OpenGL semantics; do not assume parity with Vulkan/DirectX.
Example conclusions you might draw
- GPU A processes more vertices per second in a single large draw, but GPU B handles many small draws better due to lower driver overhead — choose GPU based on expected workload.
- Instancing dramatically reduces CPU overhead for many-object scenes; enabling instancing improved draw calls/sec by 10–50× in tests.
- Tessellation levels above X cause a steep drop in throughput on GPU C, indicating a tessellation unit bottleneck.
Next steps and extensions
- Add Vulkan and Direct3D 12 counterparts to compare API overhead and driver efficiency.
- Add shader profiling (instrument ALU vs memory stalls) using vendor tools (Nsight, Radeon GPU Profiler).
- Automate runs and result collection (JSON logs, CI integration).
- Provide a downloadable dataset and scripts for reproducibility.