Commit Briefs
tests: Fix numerous tests after recent changes in loader-core (master)
BaseLoader.load now returns a dict with an extra error field when loading fails.
tests: Fix mocking of sleep calls with tenacity 8.4.2
The latest tenacity release introduced internal changes that broke the mocking of sleep calls in tests. Fix it by mocking time.sleep directly (which did not work previously).
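As an illustration of mocking time.sleep directly, here is a minimal sketch that does not use tenacity at all. The retry_with_sleep helper and the flaky callable are hypothetical stand-ins for retrying code that waits between attempts; they are not taken from the loader's test suite.

```python
import time
from unittest import mock


def retry_with_sleep(func, attempts=3, delay=1.0):
    """Hypothetical retry helper that waits between attempts via time.sleep."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Looked up on the time module at call time, so a patch applies.
            time.sleep(delay)


def run_flaky_call_without_waiting():
    """Patch time.sleep so the failed attempts do not actually block."""
    calls = {"count": 0}

    def flaky():
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    with mock.patch("time.sleep") as mocked_sleep:
        result = retry_with_sleep(flaky, attempts=3, delay=10.0)
    return result, mocked_sleep.call_count
```

Patching the function on the time module itself avoids depending on where a retry library stores its own reference to sleep.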
Replace usage of (deprecated) dir_filter by path_filter in Directory.from_disk()
as well as in GitCheckoutLoader.
test_loader: Fix implementation of test_loader_with_ref_delta_in_pack
The previous implementation built an invalid pack file with REF_DELTA object types because it used the new object to deltify as the base of the delta. This led to errors and undefined behavior after building an index for such a pack file, as dulwich could not properly resolve the deltified objects (observed by stsp while working on git loader improvements). The bases for deltified objects are now objects that were previously loaded into the archive. Tag objects produced in that test are now also ensured to be valid.
loader: Ensure to fetch latest snapshot produced by a git visit type
The SWH data model allows an origin to have multiple visit types; in particular, a git origin can have both 'git' and 'git-checkout' visit types. The git loader must retrieve the latest snapshot produced by a 'git' visit type, otherwise incremental loading of a git origin having both visit types can break. Indeed, a 'git-checkout' visit type produces a snapshot with a single branch, while a 'git' visit type produces a snapshot containing all branches of the loaded repository. Previously, if the latest snapshot retrieved was produced by a 'git-checkout' visit type, the loader would refetch all branches and associated git objects even though most of them had already been archived. Related to swh/meta#5092.
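The selection logic can be sketched as follows. This is a simplified illustration using plain dicts, not the swh storage API: the visit records, their keys, and the function name are assumptions made for the example.

```python
def latest_snapshot_for_visit_type(visits, visit_type):
    """Return the snapshot of the most recent visit of the given type.

    `visits` is a list of dicts with hypothetical keys: "visit" (an
    increasing visit number), "type" and "snapshot" (None if the visit
    produced no snapshot).
    """
    candidates = [
        v for v in visits
        if v["type"] == visit_type and v.get("snapshot") is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["visit"])["snapshot"]


visits = [
    {"visit": 1, "type": "git", "snapshot": "snapshot-all-branches"},
    {"visit": 2, "type": "git-checkout", "snapshot": "snapshot-single-branch"},
]
# The most recent visit overall is a 'git-checkout' one, but incremental
# loading should start from the snapshot produced by the 'git' visit.
base_snapshot = latest_snapshot_for_visit_type(visits, "git")
```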
requirements-test: Add missing swh.loader.core[testing] dependency
Side effect of swh.loader.core v5.18.0 release.
dumb: Handle HEAD file legacy format
Some dumb git servers can send a HEAD file in a legacy format that contains a commit id instead of the string "ref: <ref_name>". Handle that edge case to avoid an error when loading such a repository.
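A minimal sketch of handling both HEAD formats, assuming the legacy form is a 40-character hexadecimal SHA-1 commit id. The parse_head function and its return convention are hypothetical and not the loader's actual code.

```python
import re

# Assumption: legacy HEAD files hold a 40-character hexadecimal SHA-1.
SHA1_RE = re.compile(r"[0-9a-f]{40}")


def parse_head(content: bytes):
    """Classify the content of a HEAD file fetched from a dumb git server.

    Returns ("ref", ref_name) for the symbolic format and
    ("commit", commit_id) for the legacy format.
    """
    text = content.decode("ascii").strip()
    if text.startswith("ref:"):
        return ("ref", text[len("ref:"):].strip())
    if SHA1_RE.fullmatch(text):
        return ("commit", text)
    raise ValueError(f"Unsupported HEAD file content: {text!r}")
```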
dumb: Synchronize fetch_pack behavior with smart loader
As with the smart git loader, restrict the maximum size of a pack file to download. Move the code writing pack data bytes and checking their size into a utility class to avoid code duplication. Add missing tests covering the cases where the pack size limit is reached.
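The kind of utility class described here could look like the sketch below. The class name, exception name, and constructor parameters are hypothetical; only the idea of writing pack bytes while enforcing a size limit comes from the commit message.

```python
import io


class PackSizeLimitExceeded(Exception):
    """Raised when more pack data is received than the configured limit."""


class LimitedPackWriter:
    """Write pack data to a buffer while enforcing a maximum total size."""

    def __init__(self, size_limit, buffer=None):
        self.size_limit = size_limit
        self.buffer = buffer if buffer is not None else io.BytesIO()
        self.size = 0

    def write(self, data: bytes) -> None:
        self.size += len(data)
        if self.size > self.size_limit:
            raise PackSizeLimitExceeded(
                f"Pack file too big: more than {self.size_limit} bytes received"
            )
        self.buffer.write(data)
```

Both loaders can then pass the same write method as the sink for incoming pack data, so the limit check lives in one place.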
dumb: Fix streaming of HTTP responses
When using the requests library to perform HTTP requests, the stream parameter must be set to True if responses need to be streamed, to ensure content is downloaded in chunks. Previously, a whole HTTP response was cached in memory, which could lead to OOM errors when dealing with a repository with large pack files.
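A short sketch of a streamed download with requests follows. The download_streamed function name and its parameters are hypothetical; stream=True, raise_for_status() and iter_content() are regular requests features.

```python
def download_streamed(session, url, dest, chunk_size=65536):
    """Copy an HTTP response body to `dest` chunk by chunk.

    `session` is expected to behave like a requests.Session and `dest`
    like a binary file object. Returns the number of bytes written.
    """
    total = 0
    # stream=True defers downloading the body until it is iterated over,
    # so only one chunk at a time is held in memory.
    with session.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            dest.write(chunk)
            total += len(chunk)
    return total
```

A call would typically pass requests.Session() as the session and a file opened in binary write mode as dest.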
test_directory: Fix failures after nar extid version bump
Related to swh/devel/swh-loader-core@c9b51f8.
tox: Bump mypy to 1.8.0
Related to swh/meta#5075.
Add INFO-level logging every few minutes while loading
Git loading tasks can take a pretty long time, and it's not easy to diagnose if it's stuck or if it's just taking a while. Instead of only logging at the end of the task, print a log line after each object type has been fully processed. Also print a log line every 3 minutes while objects are being processed.
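Time-based progress logging of this kind can be sketched as below. The ProgressLogger class, its method names, and the log messages are hypothetical; only the 3-minute interval and the final log line per object type come from the commit message.

```python
import logging
import time

logger = logging.getLogger(__name__)


class ProgressLogger:
    """Log an INFO line at most once per `interval` seconds while counting."""

    def __init__(self, label, interval=180.0, clock=time.monotonic):
        self.label = label
        self.interval = interval
        self.clock = clock  # injectable so tests do not have to wait
        self.count = 0
        self.last_log = clock()

    def tick(self, n=1) -> bool:
        """Record n processed objects; return True if a line was logged."""
        self.count += n
        now = self.clock()
        if now - self.last_log >= self.interval:
            logger.info("%s: %d objects processed so far", self.label, self.count)
            self.last_log = now
            return True
        return False

    def done(self) -> None:
        """Log the final count once the object type is fully processed."""
        logger.info("%s: done, %d objects processed", self.label, self.count)
```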
loader: add some logging during packfile fetching
The packfile fetching operation can take a long time. Send one log line every minute while it progresses.
loader: Push remote messages to a logger instead of stderr
Instead of dumping the dulwich remote communication stream to stderr, add a separate logger for remote messages, and handle the remote stream as proper log entries.
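One way to turn a raw remote byte stream into log entries is sketched below. The logger name, the RemoteMessageLogger class, and the "remote: " prefix are assumptions made for illustration, not the loader's actual implementation.

```python
import logging

# Hypothetical logger name dedicated to remote messages.
remote_logger = logging.getLogger("example.loader.git.remote")


class RemoteMessageLogger:
    """File-like sink that emits each complete remote line as a log entry."""

    def __init__(self, logger=remote_logger, level=logging.DEBUG):
        self.logger = logger
        self.level = level
        self._pending = b""

    def _emit(self, line: bytes) -> None:
        text = line.decode("utf-8", errors="replace").strip()
        if text:
            self.logger.log(self.level, "remote: %s", text)

    def write(self, data: bytes) -> int:
        # Git remotes separate progress updates with \r and final lines
        # with \n; treat both as line terminators.
        buffered = (self._pending + data).replace(b"\r", b"\n")
        *lines, self._pending = buffered.split(b"\n")
        for line in lines:
            self._emit(line)
        return len(data)

    def flush(self) -> None:
        """Emit any trailing partial line."""
        self._emit(self._pending)
        self._pending = b""
```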
loader: add option to skip certificate verification
This hooks into the right urllib3 and requests settings for both the smart and dumb loader.
loader: add shortcuts for the connect and read timeouts
This sets the connect and read timeout for both the smart loader (via urllib3/dulwich) and for the dumb loader (via requests).
dumb loader: add support for extra requests kwargs
This is useful to override the default settings of the requests Session, e.g. certificate verification or connect/read timeouts.
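For the requests side, such shortcuts and extra kwargs might be combined as in the sketch below. The function and parameter names are hypothetical; the timeout tuple of (connect, read) and verify=False are standard requests options.

```python
def build_requests_kwargs(
    connect_timeout=None, read_timeout=None, verify_certs=True, extra=None
):
    """Build keyword arguments to pass to requests calls.

    Explicit entries in `extra` take precedence over the shortcuts.
    """
    kwargs = {}
    if connect_timeout is not None or read_timeout is not None:
        # requests accepts a (connect, read) tuple; None means no timeout.
        kwargs["timeout"] = (connect_timeout, read_timeout)
    if not verify_certs:
        kwargs["verify"] = False
    kwargs.update(extra or {})
    return kwargs
```

The resulting dict can be unpacked into a call such as session.get(url, **kwargs).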
loader: add support for extra urllib3 kwargs
This is useful to override the default settings of the dulwich urllib3 adapter, e.g. certificate verification or connect/read timeouts.