TL; DR;
Finding the right “position” in the diff view for a GitHub Review Comment from a line number in static analysis report.
Current version of the GitHub Action is published. Help me test and improve it.
The Scene
You can read how this all started in my previous post. Now it’s a couple of months later and I’m finally trying to get this PR Review thing finished. I still don’t really want to get into some complicated scripts and API calls. Surely someone had done this already..
Well if they did, they weren’t bragging. Or maybe I’m not good with Google, both are possible.
Diff View Position
The position calculation is still twisting my head a little and looking at what the PR Diff looks in the API example is not encouraging me. But I can write a little script that maps the lines in a changed file to a position counter and replace the issue line with that. Shouldn’t be so hard. And I did find a couple of examples of people getting line numbers for git diff
in bash that I could use. That’s a promising route.
The position value equals the number of lines down from the first “@@” hunk header in the file you want to add a comment. The line just below the “@@” line is position 1, the next line is position 2, and so on. The position in the diff continues to increase through lines of whitespace and additional hunks until the beginning of a new file.
The most fitting script I found is this one from a guy called Michael Adams. Bit of voodoo to me, but I understand enough to trust it’s safe. The output format can be conquered simply by going line-by-line and using the string split function. I’m not confident in bash
though so I’ve decided to look at node
again. The last time I tried to use it to do some CI stuff a few years back it got a bit messy. This time I just need to read and write files though. And sure enough, it’s an absolute breeze. Finally some more familiar syntax.
Very soon I have a working version that uses the script from Michael to find the lines and then it just goes file by file, adding position++
in a map of changed line numbers. It increases the counter for each deleted line too and adds an offset for the start and end of each “hunk” because that’s how they are shown in the PR diff in GitHub.
Another little script then loads the JSON report and adjusts the line number based on the map. Any issues that don’t have a match in my map are outside the diff so I can discard them. Both issues solved!
function filterAndTranslatePositionReviewComments(allComments, positionMaps) {
let relevantComments = [];
allComments.forEach((comment) => {
let line = parseInt(comment.position);
let filename = comment.path;
if (!positionMaps.has(filename)) {
console.warn(`${filename} not in git diff`);
return;
}
let lineToPosition = positionMaps.get(filename);
if (!lineToPosition.has(line)) {
console.warn(`line ${line} is not in git diff for ${filename}`);
return;
}
comment.position = lineToPosition.get(line);
relevantComments.push(comment);
});
return relevantComments;
}
Except they’re not of course. The comments in the PR seem to be off by a line or two sometimes. That’s because the diff is not just showing the “new lines”, it also shows the lines those “new lines” replaced as “removed lines”. But they are not actually coming up in the “diff script” I have as deleted lines. Because they are not.
Dealing with Removed Lines
Brilliant idea alert! Use the same script I have but in the other “diff direction”; i.e. branch..base. This reverses the differences and the lines GitHub shows as “removed” now come up in my “new or changed” lines. I’m sitting with a notepad drawing PR diff boxes with line numbers and devising my algorithm for the position counting. Each time I think I have it, there is another “edge case” where it doesn’t add up. Surely there must be a better way.
Get Pull Request Endpoint
And there is. Who knows better what is or isn’t shown in the GitHub PR view than GitHub? I mean, I’m already reading files line by line. Maybe this won’t be so hard after all. Calling the Get Pull Request endpoint with this media type specified: application/vnd.github.v3.diff gets the following type of response. It contains all the lines from the PR Diff View including the “offset” lines. But it doesn’t have line numbers for the code lines.
diff --git a/invest-app/main/default/classes/InvestmentService.cls b/invest-app/main/default/classes/InvestmentService.cls
index 31c3212..1325796 100644
--- a/invest-app/main/default/classes/InvestmentService.cls
+++ b/invest-app/main/default/classes/InvestmentService.cls
@@ -4,6 +4,7 @@ public with sharing class InvestmentService {
public InvestmentService(Id investmentId) {
this.investmentId = investmentId;
+ System.debug('some test');
this.investment = [
SELECT
Id,
diff --git a/invest-app/main/default/classes/InvestmentStatisticsJob.cls b/invest-app/main/default/classes/InvestmentStatisticsJob.cls
index c283bb7..a33b825 100644
--- a/invest-app/main/default/classes/InvestmentStatisticsJob.cls
+++ b/invest-app/main/default/classes/InvestmentStatisticsJob.cls
@@ -10,12 +10,9 @@ public with sharing class InvestmentStatisticsJob implements Database.Batchable<
return [SELECT Id FROM Investment__c WHERE Id IN :this.investmentIds];
}
public void execute(Database.BatchableContext bc, List<Investment__c> scope) {
- for (Investment__c investment : scope) {
- InvestmentStatisticsService stats = new InvestmentStatisticsService(investment.Id, this.startDate);
- stats.process();
- stats.save();
- }
}
…
There are some examples online of how people tried to add those, but they seemed to lead me again into a vortex of complexity or using bash scripts I don’t really understand. But the hunk header contains the line number too. That should work.
I already have the script to replace issue line number with a calculated position offset. I just need to change the input of if and know when to skip a line. Let’s look at the format of the PR Diff again.
Parsing the Raw Diff
Each new diff hunk starts with the same couple of lines. I can work out the filename from the first one:
function getFileName(lineText) {
//diff --git a/force-app/main/default/classes/MyClass.cls b/force-app/main/default/classes/MyClass.cls
return lineText.split(' ')[3].substring(2);
}
And the line number to start at from the last. (see here what the numbers after @@ mean, I need the one with the +)
function getStartLineNumber(lineText) {
//@@ -35,16 +35,16 @@ private class MyClassTest {
return parseInt(lineText.split(' ')[2].split(',')[0].substring(1));
}
Then I can just go line by line and add the offset increase by 1 into my positions map. Whenever I see a “-” at the start, that’s a deleted line and I bump the offset up without recording the mapping. I also have to remember the last “position” in a hunk to use as a starting position in the next one from the same file.
function getPositionOffsetMap(diffData) {
let currentFileName;
let lines = diffData.split("\n");
let positionOffset = 1;
let currentLineNumber = 0;
let insideHunkHeader = false;
let lineToPositionMaps = new Map();
lines.forEach((oneLine) => {
if (isNewHunkStart(oneLine)) {
let newHunkFileName = getFileName(oneLine);
if (!currentFileName || newHunkFileName !== currentFileName) {
//new file
currentFileName = newHunkFileName;
positionOffset = 1;
insideHunkHeader = true;
console.log(`new file ${currentFileName}`);
return;
}
}
if (isHunkHeader(oneLine)) {
currentLineNumber = getStartLineNumber(oneLine);
insideHunkHeader = false;
console.log(`starting hunk at line ${currentLineNumber}`);
return;
}
if (insideHunkHeader) {
console.log('inside hunk header, not an interesting line');
return;
}
if (isRemovedLine(oneLine)) { //starts with -
positionOffset++;
console.log(`deleted line, not increasing line, but offset now ${positionOffset}`);
return;
}
let lineMap = lineToPositionMaps.has(currentFileName) ? lineToPositionMaps.get(currentFileName) : new Map();
lineMap.set(currentLineNumber++, positionOffset++);
console.log(`changed or new line ${currentLineNumber} ${positionOffset}`);
lineToPositionMaps.set(currentFileName, lineMap);
});
return lineToPositionMaps;
}
Actually Getting the Diff
With the script in place, I just need to add another API call to the Action and store the response body in a file it can process. For some strange reason I could not get that to work though. It seemed like the media type in request header was being ignored but for the life of me I could not work out why.
I did notice that the response contained a bunch of links though, one of which was a dedicated URL to get the PR diff. I don’t need to specify media type there and it works fine. Good..
https://patch-diff.githubusercontent.com/raw/USER/REPO/pull/PULL_NUMBER.diff
Until I try it in a private repo at the company I work with now. In the end, ditching octokit/request-action
and calling the API using curl
worked fine. I’m not happy to not know what went wrong with the original approach, but this works.. on with the job at hand.
- name: Get GitDiff from GitHub PR API
id: fetch_diff
shell: bash
run: |
curl --request GET \
--header 'authorization: Bearer ${{ inputs.github_token }}' \
--header 'content-type: application/json' \
--header 'Accept: application/vnd.github.v3.diff' \
--url https://api.github.com/repos/${{ github.repository }}/pulls/${{ github.event.number }} \
> diff.txt
Working Action
Awesome, action works and the comments in my PR Review are aligned perfectly. FINALLY!
Except It’s not the end of course. Running a bigger test fails again. This time it’s a 10s limit on the API processing time. The PR in question is getting around 70 comments and the processing takes too long. I’m not surprised.
The solution is obvious, but I remember reading about GitHub somehow not liking many API calls in very quick succession. I have to read up on this a little more. And of course I’ll have to add another script to split the comments and call the API in a loop. This is a task for another time (again). For now I’m just setting a hard limit on 40 comments. No-one is going to read more than that anyway.
I walk away beat one more time.. Except this time it’s actually of some use. So you can find it here including instructions on how to use it. It’s marked as not Prod ready of course 🙂
What’s Next
Configurability
Code Analyzer has moved on a bit since I started with this. I need to brush up on how to work with the rules best. For now the Action supports the -–category argument only and uses the -–normalize-severity flag. Proper configurability of the rules is needed.
Completeness
I need to support more than the 40 comments. Or do I? Well at least it should use 40 of the worst offences.
A big drawback of showing only issues from the PR Diff is that we can miss important problems. I could be adding some lines into an existing method making it too long, but if the start of the method doesn’t make it into the diff I miss the complexity warning. Same goes for adding a reasonable method to a border-line-too-big class. This one is going to need me to be a lot smarter about things.
Efficiency and Stability
It’d probably be a good idea to package this into a docker image. It wouldn’t have to download the CLI each time. I’d have it ready with a tested version.
The git diff filtering of the scanner is risky. There are many metadata types that are just not relevant for static analysis and if you do bit Pull Requests the list can sometimes grow to be too long to work as a bash command. Some basic filtering would be useful here.
Considering an API call to Get Pull Request and just reading the changed files from there is also an option. Although if the diff is working, why bother.
Atomicity
The big one! There is no “duplicate issue” protection if the Action runs twice on the same PR (like after rebase or added commit). Far from ideal.