Evaluates tests added in a PR for coverage, quality, edge cases, and test type appropriateness. Checks if tests cover the fix, finds gaps, and recommends lighter test types when possible. Prefer unit tests over device tests over UI tests. Triggers on: 'evaluate tests in PR', 'review test quality', 'are these tests good enough', 'check test coverage', 'is this test adequate', 'assess test coverage for PR'.
Evaluates the quality, coverage, and appropriateness of tests added in a PR. Produces a structured report with actionable findings.
```shell
# Auto-detect PR and base branch
pwsh .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1

# With explicit base branch
pwsh .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1 -BaseBranch "origin/main"
```
Run the script to get file categorization, convention checks, and anti-pattern detection:

```shell
pwsh .github/skills/evaluate-pr-tests/scripts/Gather-TestContext.ps1
```

This produces a report at `CustomAgentLogsTmp/TestEvaluation/context.md`. Then:

1. Read the fix files to understand what the fix changes and why.
2. Read each test file and evaluate it against all the criteria below. For each criterion, provide a verdict (✅ Pass, ⚠️ Concern, ❌ Fail) with an explanation.
3. Output a structured evaluation report (see Output Format below).
### 1. Fix Coverage

Question: Does the test exercise the actual code paths changed by the fix?
How to check:
Red flags:
Example — Good:

```csharp
// Fix: CollectionView.SelectedItem setter now clears selection when set to null
// Test: Sets SelectedItem to null and verifies selection is cleared
App.Tap("SelectItem");
App.Tap("ClearSelection"); // Sets SelectedItem = null
var text = App.FindElement("SelectionStatus").GetText();
Assert.That(text, Is.EqualTo("None")); // Directly tests the fix
```
Example — Bad:

```csharp
// Fix: CollectionView.SelectedItem setter
// Test: Just checks CollectionView renders (doesn't test selection clearing)
App.WaitForElement("MyCollectionView");
Assert.That(true); // Proves nothing about the fix
```
### 2. Edge Cases & Gaps

Question: Does the test cover boundary conditions, or only the happy path?
Check for these common gaps:
| Gap Type | What to Look For |
|---|---|
| Null/empty | Does the fix handle null? Is it tested? |
| Boundary values | Min, max, zero, negative, very large |
| Repeated actions | Does calling the action twice cause issues? |
| Platform-specific | Does the bug only occur on certain platforms? |
| Async/timing | Does the fix involve async code? Race conditions? |
| State transitions | Does the test cover before→after state changes? |
| Error paths | What happens when the operation fails? |
| Combination effects | Does the fix interact with other properties/features? |
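Several of these gaps can often be checked in one short test. A hypothetical NUnit/Appium-style sketch (the AutomationIds are made up, following the earlier example) covering the null path and a repeated action:

```csharp
[Test]
public void ClearingSelectionTwiceIsSafe()
{
    App.WaitForElement("SelectItem");
    App.Tap("SelectItem");
    App.Tap("ClearSelection");  // null path: SelectedItem = null
    App.Tap("ClearSelection");  // repeated action: clearing again must not throw
    var text = App.FindElement("SelectionStatus").GetText();
    Assert.That(text, Is.EqualTo("None"));
}
```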
How to suggest missing edge cases: look at the conditional logic in the fix — `if (x == null)`, `if (x <= 0)`, `try/catch` blocks — each branch suggests an edge case worth testing.

### 3. Test Type Appropriateness

Question: Is this the lightest test type that can verify the fix?
Preference order (lightest → heaviest):
| Priority | Type | When Appropriate | Project |
|---|---|---|---|
| ⭐ 1st | Unit Test | Pure logic, property changes, data transformations, binding behavior, event wiring | *.UnitTests.csproj |
| ⭐ 1st | XAML Test | XAML parsing, XamlC compilation, source generation, markup extensions | Controls.Xaml.UnitTests |
| ⭐⭐ 2nd | Device Test | Platform-specific rendering, native API interaction, handler mapping | *.DeviceTests.csproj |
| ⭐⭐⭐ 3rd | UI Test | User interaction flows, visual layout, screenshot comparison, end-to-end scenarios | TestCases.Shared.Tests |
Decision tree:

```
Does the test need to interact with visual UI elements?
├─ YES → Is it checking visual layout/appearance?
│        ├─ YES → UI test (VerifyScreenshot) ✅
│        └─ NO  → Could the interaction be tested via handler/control API?
│                 ├─ YES → Device test ⭐⭐
│                 └─ NO  → UI test ✅
└─ NO  → Does it need a platform/native context?
         ├─ YES → Device test ⭐⭐
         └─ NO  → Does it test XAML parsing/compilation?
                  ├─ YES → XAML test ⭐
                  └─ NO  → Unit test ⭐
```
Common "could be lighter" patterns:
| Current Test Does | Could Be Instead | Why |
|---|---|---|
| UI test: sets property, checks label text | Unit test | Property logic doesn't need UI |
| UI test: verifies event fires | Unit test | Event wiring is testable in isolation |
| UI test: checks control doesn't crash | Device test | Don't need Appium for crash testing |
| UI test: validates XAML binding | XAML test | Binding resolution is compile-time |
| Device test: checks property default | Unit test | Defaults don't need platform context |
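A sketch of the "property default" case as a plain xUnit unit test (the control and default value are illustrative, not taken from any particular PR):

```csharp
public class CollectionViewDefaultsTests
{
    [Fact]
    public void SelectedItem_IsNullByDefault()
    {
        var collectionView = new CollectionView();
        Assert.Null(collectionView.SelectedItem);  // no handler or platform context needed
    }
}
```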
### 4. Convention Compliance

Automated by the script. Review the script output for:
UI Tests:
- Test file named `IssueXXXXX.cs`
- `[Issue()]` attribute on the HostApp page
- `[Category()]` attribute — exactly ONE per test class (on the class or method, not both)
- Inherits from the `_IssuesUITest` base class
- `WaitForElement` before interactions
- No `Task.Delay`/`Thread.Sleep`
- No `#if ANDROID`/`#if IOS` conditionals
- No obsolete APIs (`Application.MainPage`, `Frame`, `Device.BeginInvokeOnMainThread`)
- `UITestEntry`/`UITestEditor` for screenshot tests

Unit Tests:
- `[Fact]` or `[Theory]` attributes (xUnit)

XAML Tests:
- `[Test]` with a `[Values]` `XamlInflator` parameter
- Issue repro classes named `MauiXXXXX`

### 5. Flakiness Risk

Question: Is this test likely to be flaky in CI?
| Risk Factor | Detection | Mitigation |
|---|---|---|
| Arbitrary delays | Task.Delay, Thread.Sleep | Use WaitForElement, retryTimeout |
| Missing waits | App.Tap without prior WaitForElement | Add explicit waits |
| Screenshot timing | VerifyScreenshot() without retryTimeout | Add retryTimeout: TimeSpan.FromSeconds(2) |
| Cursor blink | Entry/Editor in screenshot test | Use UITestEntry/UITestEditor |
| External URLs | WebView loading remote content | Use mock URLs or local content |
| Animation timing | Visual check after animation | Use retryTimeout |
| Global state | Test modifies Application.Current | Ensure cleanup in teardown |
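Several mitigations from the table combine into one pattern. A hedged sketch using the helpers named above (`WaitForElement`, `retryTimeout`); element names are hypothetical:

```csharp
// Avoid: Thread.Sleep(2000); App.Tap("Submit");
App.WaitForElement("Submit");                            // explicit wait instead of arbitrary delay
App.Tap("Submit");
App.WaitForElement("ResultLabel");                       // wait for the outcome, not a timer
VerifyScreenshot(retryTimeout: TimeSpan.FromSeconds(2)); // retry screenshot comparison
```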
### 6. Duplicate Coverage

Question: Does a similar test already exist?
Check the "Existing Similar Tests" section of the script output. If similar tests exist:
### 7. Platform Scope

Question: Does the test run on all platforms affected by the fix?

Check the "Platform Scope Analysis" from the script (e.g., whether the test is restricted by a platform-specific file suffix such as `.ios.cs`, or compiles for both).

### 8. Assertion Quality

Question: Are the assertions specific enough to catch regressions?
| Assertion Quality | Example | Verdict |
|---|---|---|
| ✅ Specific | Assert.That(label.Text, Is.EqualTo("Expected Value")) | Catches regression |
| ⚠️ Vague | Assert.That(label.Text, Is.Not.Null) | Too permissive |
| ❌ Meaningless | Assert.That(true) or no assertion | Proves nothing |
| ✅ Positional | Assert.That(rect.Y, Is.GreaterThan(safeAreaTop)) | Specific to layout fix |
| ⚠️ Brittle | Assert.That(rect.Y, Is.EqualTo(47)) | Magic number, will break |
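The last two rows, contrasted as code: a relational assertion survives device and screen differences where a magic number does not (element names are assumed, echoing the examples above):

```csharp
var rect = App.FindElement("HeaderLabel").GetRect();

// ✅ Specific but not brittle: ties the assertion to the invariant the fix guarantees
Assert.That(rect.Y, Is.GreaterThanOrEqualTo(safeAreaTop));

// ⚠️ Brittle: the exact pixel value varies by device, density, and OS version
// Assert.That(rect.Y, Is.EqualTo(47));
```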
### 9. Fix-Test Alignment

Question: Do the files changed by the fix align with what the test exercises?

Red flags:
- A test named `Issue12345` for a fix in `CollectionView` that only exercises `Label` rendering
- The fix changes `Shell.cs` but the test only navigates a `ContentPage`

## Output Format

Produce the evaluation report in this format:
```markdown
## PR Test Evaluation Report

**PR:** #XXXXX — [Title]
**Test files evaluated:** [count]
**Fix files:** [count]

---

### Overall Verdict

[One of: ✅ Tests are adequate | ⚠️ Tests need improvement | ❌ Tests are insufficient]

[1-2 sentence summary of the most important finding]

---

### 1. Fix Coverage — [✅/⚠️/❌]
[Does the test exercise the code paths changed by the fix?]

### 2. Edge Cases & Gaps — [✅/⚠️/❌]
**Covered:**
- [edge case 1]
- [edge case 2]

**Missing:**
- [gap 1 — describe what should be tested and why]
- [gap 2]

### 3. Test Type Appropriateness — [✅/⚠️/❌]
**Current:** [UI Test / Device Test / Unit Test / XAML Test]
**Recommendation:** [Same / Could be lighter — explain why]

### 4. Convention Compliance — [✅/⚠️/❌]
[Summary from automated checks — list only issues found]

### 5. Flakiness Risk — [✅ Low / ⚠️ Medium / ❌ High]
[Specific risk factors identified]

### 6. Duplicate Coverage — [✅ No duplicates / ⚠️ Potential overlap]
[Similar existing tests found, if any]

### 7. Platform Scope — [✅/⚠️/❌]
[Does test coverage match the platforms affected by the fix?]

### 8. Assertion Quality — [✅/⚠️/❌]
[Are assertions specific enough to catch the actual bug?]

### 9. Fix-Test Alignment — [✅/⚠️/❌]
[Do the test and fix target the same code paths?]

---

### Recommendations
1. [Most important actionable recommendation]
2. [Second recommendation]
3. [...]
```
| File | Description |
|---|---|
| `CustomAgentLogsTmp/TestEvaluation/context.md` | Automated context report from script |
| Problem | Cause | Solution |
|---|---|---|
| No changed files detected | Wrong base branch | Use -BaseBranch explicitly |
| No fix files detected | All changes are tests | Expected for test-only PRs |
| AutomationId mismatch | HostApp and test out of sync | Update one to match the other |
| Convention check false positive | Script regex too broad | Ignore and note in report |