WIP: Demo using Simple Case Folding#8515
WIP: Demo using Simple Case Folding#8515iSazonov wants to merge 45 commits intoPowerShell:masterfrom
Conversation
There was a problem hiding this comment.
This is the only place of new code injection.
0727391 to
c09b5cd
Compare
a98fe70 to
aff75e4
Compare
Use 32-bit sparsed array with BinarySearch Create init tests Got fast SpanFold and improve perf tests Improve CompareFolded Move files to new folders by role Enable perf test Add simple folded string comparer Use IgnoresAccessChecksToAttribute Configure visibility internal API Add comments. Fix for surrogates Update names and comments Add SimpleFoldedStringComparer constructor Use SimpleFoldedStringComparer in PSMemberInfoInternalCollection
Add AssertExtensions from CoreFX Use namespace System.Management.Automation.Unicode.Tests Add namespace in CharTests Remove duplicate tests from CharTests
Remove unneeded tests for comparer
Remove unneeded IndexOfFolded() tests
9939013 to
a03ec72
Compare
d1da272 to
311ccd7
Compare
311ccd7 to
57873c6
Compare
| /// <returns> | ||
| /// Returns folded string. | ||
| /// </returns> | ||
| //[MethodImpl(MethodImplOptions.AggressiveInlining)] |
There was a problem hiding this comment.
There is a lot of commented code and attributes in this file. Did you write this all yourself?
There was a problem hiding this comment.
I started with zero experience in Unicode internals and c# in-depth optimizations. For the last two months I have been moving in small steps so that the code contains a lot of commented code which reflects my attempts to find fastest code. I'll remove its after the code will become stable.
All this code is written by me except for gen.cs that come from @tarekgh's gist.
| internal static partial class SimpleCaseFolding | ||
| { | ||
| private static readonly ushort[] L1 = | ||
| { |
There was a problem hiding this comment.
Can you add some comments about what L1 and L2 represent?
There was a problem hiding this comment.
An idea is to use an array for mapping (simple case folding) a source char to target char. A problem is that SimpleCase.txt contains very small account (1.5Kb) of Unicode code points from Plane0 (SMP) and Plane1 - it is 128 Kb that's a lot for the mapping array. In related Dotnet CoreFXlab issue dotnet/corefxlab#2610 @tarekgh suggested to use the 3-level (8-4-4) mapping technique (and published a gen.cs code in gist). It is ~6Kb. This turned out to be much slower than a single-level array and I tried a two-level array. It is ~10Kb.
Yesterday I found that I can use 1-level mapping for chars < 0x5ff and 2-level mapping for other chars. It is ~10Kb too and fastest case I have found. I'll push the commit today.
I guess the current code still has a lot of bugs and I consider it as "alfa". Following I plan to fix surrogates.
|
What is the startup time difference you see? |
|
@TravisEz13 My first measurement with a single-level array showed a speedup of ~8 percent. That's why I continued working on it. Of cause this result is not reliable although the standard powershell (Pester) tests pass successfully (only my xUnit tests fail currently). |
|
Talking to @daxian-dbw his assumption is the actual folding code would go into DotNet. |
There is no plan or decision that this will happen. it depends on the case and DotNet team will approve it. My thoughts so far is to create this in a separate NuGet package and if we see a demand on this functionality we can consider moving it to the core. |
|
I agree that this should be in CoreFX. At best, it will be CoreFX 3.1 and that means that PowerShell will get this advantage only a year and a half or two years from today. This is too long to wait. So I think it is best to work here and in CoreFX at the same time. I hope this works correctly. |
|
What locale are those stats for? |
|
@TravisEz13 In the test I used Russian chars. Really the result is true for chars with codepoints < 0x5ff (1-level mapping is used) that is most of Latin and European codepoints. For codepoint above a 2-level mapping is used and result is ~68 ms (vs 37 ms) that is still faster than CoreFX OrdinalIgnoreCase. |
|
This PR has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs within 10 days. |
|
Now we have a PR in corefxlab repo and I hope we get the experimental package soon. |
|
This PR has been automatically closed because it is stale. If you wish to continue working on the PR, please first update the PR, then reopen it. |
PR Summary
Related #8120.
The PR is only demo how we could use Simple Case Folding to speed up PowerShell engine.
Current code uses 2-level translation table to fold Unicode strings. The table is compact: full table is 128 Kb, the compressed table is ~10Kb.
New string compare works in 3-4 faster than standard ignore case comparer.
With using the new comparer in only one place you can also see a marked improvement in startup time.
Perhaps we could use the comparer in other places.
I push alternative code too so you could experiment. Tools and temporary benchmarks is also pushed.
PR Checklist
.h,.cpp,.cs,.ps1and.psm1files have the correct copyright headerWIP:to the beginning of the title and remove the prefix when the PR is ready.[feature]if the change is significant or affects feature tests