This saves roughly 32-64 GB of GPU memory during the forward/backward pass, at the cost of about 1-2 s per step for CPU↔GPU transfers (negligible against 30 s+ step times). Activate it by setting policy.offload_optimizer_for_logprob: true in ...
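As a rough sketch, the flag would sit under the policy section of the training config (a YAML layout is assumed here; the enclosing file is the one elided in the truncated text above):

```yaml
policy:
  # Offload optimizer state to CPU while computing logprobs,
  # freeing GPU memory at the cost of per-step transfer time.
  offload_optimizer_for_logprob: true
```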
rank_prefix_list = list(range(0, total_workers, workers_per_group))

With multi-node TP=8 DP=2:
- total_workers = 2 Ray actors (1 per DP group, each managing 8 TP ...
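The prefix computation above can be sketched as a standalone function. The name compute_rank_prefixes and the example values below are illustrative, not taken from the source; the sketch assumes total_workers counts individual GPU workers and workers_per_group is the group (e.g. TP) size, so each prefix is the global rank of the first worker in a group:

```python
def compute_rank_prefixes(total_workers: int, workers_per_group: int) -> list[int]:
    """Return the global rank of the first worker in each group.

    Stepping through the worker ranks by group size yields one
    'prefix' rank per group, e.g. the rank each group's leader holds.
    """
    return list(range(0, total_workers, workers_per_group))


# Illustrative: 16 GPU workers split into 2 groups of 8 (TP=8, DP=2)
print(compute_rank_prefixes(16, 8))  # → [0, 8]

# If total_workers instead counts the 2 Ray actors themselves,
# with one worker slot per actor, each actor is its own prefix:
print(compute_rank_prefixes(2, 1))  # → [0, 1]
```

Either counting convention produces one prefix per DP group, which is what the surrounding setup needs to address each group's lead worker.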