Fixing cross-domain Active Directory authentication on RHEL 9 / SSSD
A multi-day forensic write-up: how two seemingly innocuous SSSD settings silently broke forest-wide login on a production cluster, why a fresh lab build didn't reproduce the bug, and the surgical fix we landed without taking down the application.
TL;DR
Two lines in /etc/sssd/sssd.conf on our production hosts — dns_discovery_domain and ldap_referrals = False — were preventing SSSD from discovering sibling domains in the Active Directory forest. Users in the child domain we joined (NOAM) could log in fine; users in sibling child domains (Europe, Asia) couldn't. Removing those two lines + clearing the SSSD cache + restarting SSSD restored forest visibility immediately. We then swapped the application's access gate from access_provider = ad to access_provider = simple + a simple_allow_groups list of forest universal groups to work around a separate SSSD subdomain initgroups routing bug we hit later in the auth path.
Net change per host:
- /etc/sssd/sssd.conf: 5 lines (2 removed in Phase 1; 1 changed, 1 added, 1 removed in Phase 2)
- /etc/rstudio/rserver.conf (the application's own auth gate): 1 line changed
Total user-facing outage during the cutover: ~30 seconds per node, masked by the load balancer.
The setup
We run a clustered web application (a hosted analytics workbench) for an organization with an Active Directory forest spanning three regions:
                     example.com  ← forest root
                          │
       ┌──────────────────┼──────────────────┐
noam.example.com  europe.example.com  asia.example.com
 (primary child)   (sibling child)    (sibling child)
The application servers (app-node-01, app-node-02) are joined to noam.example.com. The org wants users from all three child domains to be able to log in.
Access control in AD is built on two universal groups, each nesting the regional groups:
- app_users_forest@example.com: universal group at the forest root; nests the regional groups (app_users_noam@noam.example.com, app_users_eu@europe.example.com, app_users_apac@asia.example.com)
- app_admins_forest@example.com: same pattern for admins
In theory, putting a user in any of the regional groups should make them a transitive member of app_users_forest@example.com, and the application's auth-required-user-group = app_users_forest@example.com,app_admins_forest@example.com setting should let them in.
The problem
In practice:
- NOAM users could log in fine.
- EU/APAC users couldn't.
- Symptoms varied by where in the stack the user landed: sometimes "Incorrect or invalid username/password" from the application, sometimes the user wasn't even resolvable via id.
Standard troubleshooting (clearing the SSSD cache, restarting SSSD, leaving and rejoining the realm) produced no change.
The "this looks architectural" hypothesis (wrong)
Initial assumption: cross-domain AD authentication is just hard, maybe SSSD has a fundamental limitation when the joined domain is a child rather than the forest root.
We opened a Red Hat case, expecting either a known bug or a "you're holding it wrong" answer.
The disprove-it experiment
We built a fresh single-host lab (lab-node), joined it to the same noam.example.com child domain using the same realmd join command, and tested cross-domain login.
It worked. First try. EU and APAC canary users authenticated cleanly, resolved via id, and could open sessions.
That ruled out "architectural" — the bug is state-specific to the production hosts, not a general SSSD limitation.
The investigation: comparing working vs broken
We pulled the SSSD and Kerberos config from both production hosts and the lab and diffed.
The lab's sssd.conf (working):
[domain/noam.example.com]
default_shell = /bin/bash
ad_server = noamdc01.noam.example.com
cache_credentials = True
krb5_realm = NOAM.EXAMPLE.COM
id_provider = ad
override_homedir = /nfs/home/%u
fallback_homedir = /nfs/home/%u
ad_domain = noam.example.com
use_fully_qualified_names = False
ldap_id_mapping = True
access_provider = ad
The production sssd.conf (broken — extra lines we believed were perf tuning):
[domain/noam.example.com]
# ... same base settings ...
access_provider = ad
auth_provider = ad
chpass_provider = ad
ldap_schema = ad
enumerate = False
ldap_use_tokengroups = True
ad_gpo_access_control = permissive
entry_cache_timeout = 3600
account_cache_expiration = 7
# ! suspect:
dns_discovery_domain = noam.example.com
# ! suspect:
ldap_referrals = False
Two settings caught our eye:
dns_discovery_domain = noam.example.com
When set, this scopes SSSD's DNS-based AD discovery to a single domain. SSSD looks for DCs only in noam.example.com and doesn't try to enumerate sibling domains via SRV records and trust referrals.
When not set, SSSD's default behavior is to follow the AD trust topology and discover all reachable domains in the forest.
ldap_referrals = False
LDAP referrals are how AD points clients from one domain controller to another — e.g., when querying for a user object that lives in a sibling domain. With referrals disabled, an LDAP query against the local DC won't be redirected to the DC that actually has the data.
Disabling referrals is sometimes recommended for performance with AD because Microsoft's referrals can be chatty, and the recommended replacement (tokenGroups, also enabled in prod) handles group membership lookup without needing to chase referrals.
But tokenGroups returns SIDs only, not names. To resolve those SIDs to names for the access check (and to enumerate forest-root universal groups for visibility), SSSD still needs to traverse the forest — and ldap_referrals = False blocks that path.
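To audit other hosts for the same two settings, a case-insensitive grep over the active (uncommented) lines of sssd.conf is enough. The following is a minimal sketch; the function name restrictive_settings is made up for this example, and it flags any dns_discovery_domain value, not only the one we had in production.

```shell
# restrictive_settings FILE: print every active (uncommented) line that sets
# dns_discovery_domain or disables ldap_referrals, prefixed with its line
# number. Exit status is 0 when at least one such line exists, 1 when the
# file is clean (this is plain grep behavior).
restrictive_settings() {
    grep -nEi '^[[:space:]]*(dns_discovery_domain[[:space:]]*=|ldap_referrals[[:space:]]*=[[:space:]]*false)' "$1"
}
```

Run it from a root shell (sssd.conf is normally readable only by root), for example `restrictive_settings /etc/sssd/sssd.conf`. Commented-out lines are ignored on purpose, so a setting that was disabled earlier does not raise a false alarm.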
The smoking gun
We ran the same one-line diagnostic on both hosts:
sudo sssctl domain-list
On the lab:
noam.example.com
example.com
asia.example.com
europe.example.com
On production:
noam.example.com
example.com
Production's SSSD knew about the joined child and the forest root — but never discovered the sibling child domains. Every cross-domain query was hitting a domain SSSD didn't know existed.
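The comparison we did by eye can also be scripted, which is handy when checking more than two hosts. This is a sketch, not the script we ran: missing_domains is a hypothetical helper, and it assumes sssctl domain-list prints one bare domain name per line, as in the output above.

```shell
# missing_domains EXPECTED...: read `sssctl domain-list` output on stdin and
# print each expected domain that is absent from it, one per line.
# Exit status is 0 when nothing is missing, 1 otherwise.
missing_domains() {
    local seen status=0 d
    seen=$(cat)
    for d in "$@"; do
        # -F: literal match, -x: whole line, so example.com does not
        # accidentally match noam.example.com
        if ! printf '%s\n' "$seen" | grep -Fxq -- "$d"; then
            printf '%s\n' "$d"
            status=1
        fi
    done
    return "$status"
}
```

Usage: `sudo sssctl domain-list | missing_domains noam.example.com example.com europe.example.com asia.example.com`. Fed the production output shown above, it prints europe.example.com and asia.example.com and returns 1.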
The two-step fix
We split the fix into two phases so we could verify the topology layer before touching the auth-policy layer.
Phase 1 — Restore forest discovery (low risk, no policy change)
A pre-flight test, run on production during a low-traffic period. The only mutations are the removal of the two lines and the cache wipe:
# Backup
sudo cp -a /etc/sssd/sssd.conf /etc/sssd/sssd.conf.pre-discovery-test-$(date +%Y%m%d-%H%M%S)
# Stop sssd, remove the two restrictive lines, wipe cache
sudo systemctl stop sssd
sudo sed -i \
    -e '/^dns_discovery_domain = noam\.example\.com$/d' \
    -e '/^ldap_referrals = False$/d' \
    /etc/sssd/sssd.conf
sudo sssctl config-check   # must report 0 issues
# Expand the globs inside a root shell: /var/lib/sss/db is typically not
# readable by an unprivileged user, so a glob expanded before sudo matches nothing
sudo sh -c 'rm -rf /var/lib/sss/db/* /var/lib/sss/mc/*'
sudo systemctl start sssd
sleep 60   # subdomain trust discovery warmup
Then five verification queries:
sudo sssctl domain-list
getent group app_users_forest@example.com
id alice_admin # NOAM admin
id canary_eu@europe.example.com
id canary_apac@asia.example.com
Before the fix:
- domain-list: 2 domains
- getent group app_users_forest@example.com: returns nothing
- id alice_admin: works (NOAM, no forest groups in the output)
- id canary_eu@...: "no such user"
- id canary_apac@...: "no such user"
After the fix:
- domain-list: 4 domains
- getent group app_users_forest@example.com: returns 37 members, including users from all three child domains
- id alice_admin: now includes app_users_forest@example.com and app_admins_forest@example.com in the group list (transitive membership via app_users_noam / app_admins_noam)
- id canary_eu@...: resolves with the EU child-domain UID
- id canary_apac@...: resolves with the APAC child-domain UID
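The four lookups can be wrapped in a small pass/fail harness so the before and after runs are directly comparable. This is an illustrative sketch: run_check and verify_forest are hypothetical names, the user and group names are the ones used in this write-up, and the domain-list count is left out because it needs its output compared, not just an exit status.

```shell
# run_check LABEL CMD [ARGS...]: run CMD silently, print PASS or FAIL with
# the label, and bump the global FAILED counter on failure.
FAILED=0
run_check() {
    local label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        FAILED=$((FAILED + 1))
    fi
}

# verify_forest: the NSS lookups from the verification list, as checks.
# getent and id exit non-zero when the name cannot be resolved.
verify_forest() {
    FAILED=0
    run_check 'forest group resolves' getent group app_users_forest@example.com
    run_check 'NOAM admin resolves'   id alice_admin
    run_check 'EU canary resolves'    id canary_eu@europe.example.com
    run_check 'APAC canary resolves'  id canary_apac@asia.example.com
    [ "$FAILED" -eq 0 ]
}
```

Calling `verify_forest` prints one line per check and returns non-zero if any lookup failed. It only proves that names resolve; whether alice_admin picked up the forest groups still needs a look at the actual id output.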
Critically, this Phase 1 change is auth-policy-neutral: access_provider = ad is unchanged, so current users see no policy difference. NOAM users continue working exactly as before; cross-domain users now resolve but still can't log in to the application (the application's own auth gate is still NOAM-only). It's a safe, reversible-in-30-seconds prerequisite.
Phase 2 — Switch the access gate (the actual cutover)
Two coordinated edits in the maintenance window.
SSSD-level: flip access_provider from ad to simple + add simple_allow_groups:
-access_provider = ad
+access_provider = simple
+simple_allow_groups = app_users_forest@example.com, app_admins_forest@example.com
-ad_gpo_access_control = permissive # only meaningful with access_provider = ad
Why simple instead of just leaving ad? Because the AD provider's own access check has a separate bug in our SSSD version — its subdomain initgroups routing fails for users not in the joined child domain, even after forest discovery is working. The "simple" provider sidesteps that by doing a straightforward "is the user a member of one of these groups?" check against the group list SSSD already builds correctly via tokenGroups. We've kept auth_provider = ad — Kerberos authentication still goes through AD; only the access check is delegated to simple.
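The SSSD-side edit can be expressed as an idempotent rewrite, so running it twice is harmless. The sketch below is an assumption-laden illustration, not our apply script: apply_access_gate is a hypothetical name, and it assumes sssd.conf contains a single [domain/...] section, since it rewrites every access_provider line it finds.

```shell
# apply_access_gate FILE: rewrite FILE so that
#   - the access_provider line becomes "access_provider = simple"
#   - simple_allow_groups is written right after it (an older value is dropped)
#   - ad_gpo_access_control is removed
# Assumes one [domain/...] section. A second run leaves the file unchanged.
apply_access_gate() {
    local file=$1 tmp rc
    tmp=$(mktemp) || return 1
    awk '
        /^[ \t]*(simple_allow_groups|ad_gpo_access_control)[ \t]*=/ { next }
        /^[ \t]*access_provider[ \t]*=/ {
            print "access_provider = simple"
            print "simple_allow_groups = app_users_forest@example.com, app_admins_forest@example.com"
            next
        }
        { print }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Overwrite the contents instead of moving the temp file into place,
    # so the original owner and mode of FILE are kept.
    cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Writing through `cat` rather than `mv` is a deliberate choice here: SSSD is strict about the ownership and permissions of sssd.conf, and replacing the file with a fresh temp file would change them.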
Application-level: the application's own auth gate is in rserver.conf:
-auth-required-user-group=app_users,app_admins
+auth-required-user-group=app_users_forest@example.com,app_admins_forest@example.com
Then restart sssd, reload sshd, and restart the application's main daemon. Total downtime per node was about 30 seconds, masked by the cluster's load balancer.
The execution model
What made this safe to run on a live production cluster was a few discipline choices:
1. Pre-flight script with explicit gates
Before any change, a single read-only script audited the host:
- All services healthy
- Current configs match the expected baseline (= safe to edit)
- NFS mounts working, idmapd domain correct
- Last 90 days of successful logins: would any user be locked out by the new simple_allow_groups?
- sAMAccountName collisions among forest members (if jsmith exists in NOAM + EU + APAC, all three try to share /nfs/home/jsmith; known issue, AD-side fix)
- Canary users resolve via NSS
- Local application healthy
- Apply + backout scripts pre-staged
The script exits 0 if all gates pass, 1 otherwise. We wouldn't open a maintenance window until both nodes returned green.
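Of these gates, the sAMAccountName collision check is the least obvious to script. One possible approach, sketched here with a hypothetical helper named find_collisions, is to reduce a member list to short names and report any that appear more than once. It assumes one entry per line in user@domain form; bare names without a domain suffix are treated as short names already.

```shell
# find_collisions: read one "user@domain" entry per line on stdin and print
# every short name (the part before "@") that occurs under more than one
# entry. Names are lowercased first because sAMAccountName comparison is
# case-insensitive; exact duplicate entries are collapsed before counting.
find_collisions() {
    tr '[:upper:]' '[:lower:]' | sort -u | cut -d@ -f1 | sort | uniq -d
}
```

A member list could be fed in with something like `getent group app_users_forest@example.com | cut -d: -f4 | tr ',' '\n' | find_collisions`; empty output means no two members would share a home directory under /nfs/home/%u.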
2. Apply script with mid-flight verification gate
The cutover script itself didn't just "edit files and pray." Between editing sssd.conf and restarting the application, it had a hard gate:
[4/7] Wipe SSSD cache + restart sssd + wait for forest discovery
sssd restarted; waiting 60 sec for forest trust discovery...
sssctl domain-list output:
noam.example.com
example.com
asia.example.com
europe.example.com
OK — all 4 forest domains visible to SSSD
If SSSD didn't come back showing all four domains within the 60-second warmup, the script aborted before touching the application config. The maximum half-applied state was: SSSD with new auth settings + verified forest discovery + application still using old config. Rollback from that state is a single file restore + sssd cache wipe.
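Our script used a fixed 60-second wait before checking. A variation worth considering is to poll until all expected domains appear or a deadline passes, which returns sooner on a healthy host. The sketch below shows that variation; wait_for_domains, LIST_CMD and POLL_INTERVAL are names invented for this example, and LIST_CMD defaults to sssctl domain-list run as root.

```shell
# LIST_CMD prints one domain per line; override it for a dry run.
LIST_CMD=${LIST_CMD:-sssctl domain-list}
POLL_INTERVAL=${POLL_INTERVAL:-5}

# wait_for_domains TIMEOUT_SECONDS DOMAIN...: poll LIST_CMD until every
# DOMAIN is listed (return 0) or TIMEOUT_SECONDS have elapsed (return 1).
wait_for_domains() {
    local timeout=$1 waited=0 seen d ok
    shift
    while :; do
        seen=$($LIST_CMD 2>/dev/null) || seen=''
        ok=1
        for d in "$@"; do
            printf '%s\n' "$seen" | grep -Fxq -- "$d" || ok=0
        done
        [ "$ok" -eq 1 ] && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$POLL_INTERVAL"
        waited=$((waited + POLL_INTERVAL))
    done
}
```

As a gate it would read `wait_for_domains 60 noam.example.com example.com europe.example.com asia.example.com || exit 1`, placed before any step that touches the application config.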
3. Backout script staged before any mutation
Step 1 of the apply script: snapshot sssd.conf, sshd_config, sshd_config.d/, the application's own config, baseline id output for canaries, into a timestamped /root/<change>/ directory. Step 2: write an on-host rollback script that reads that directory. Only then does step 3 start editing.
That way, every failure mode from step 3 onward is "run one command, prod is restored":
sudo /root/ad_forest_backout.sh
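The snapshot-then-generate-rollback pattern is simple enough to sketch. The version below is a simplified illustration, not our actual script: stage_snapshot is a hypothetical name, it handles regular files given by absolute path only (no directories such as sshd_config.d/), and the generated backout script restores files without the cache wipe and service restarts a real rollback would also need.

```shell
# stage_snapshot DIR FILE...: copy each FILE (absolute path) into DIR/files,
# record it in DIR/manifest, and write DIR/backout.sh, which restores every
# recorded file from the snapshot.
stage_snapshot() {
    local dir=$1 f
    shift
    mkdir -p "$dir/files" || return 1
    : > "$dir/manifest"
    for f in "$@"; do
        mkdir -p "$dir/files$(dirname "$f")" || return 1
        cp -a "$f" "$dir/files$f" || return 1
        printf '%s\n' "$f" >> "$dir/manifest"
    done
    # $dir is expanded now; \$f is left for the generated script to expand.
    cat > "$dir/backout.sh" <<EOF
#!/bin/sh
# Restore every file listed in the manifest from the snapshot.
while IFS= read -r f; do
    cp -a "$dir/files\$f" "\$f" || exit 1
done < "$dir/manifest"
EOF
    chmod +x "$dir/backout.sh"
}
```

Because the backout reads the manifest instead of knowing which lines changed, the apply side can gain new edits without the rollback needing any update.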
4. Parallel rollout, but with paired terminals
Cluster consistency was more important to us than incremental rollout. Mismatched configs across two cluster members are worse than a brief synchronized outage, especially when the launcher in our case can route a user's session to either node regardless of which one their browser hit. So we ran the apply scripts on both nodes simultaneously, in two side-by-side SSH sessions, each tailing the output to a log file. Two more SSH sessions sat at root prompts on each node, pre-staged with the backout command, ready to fire if needed.
5. Six per-region browser canaries after apply
Per-node URLs (not the LB URL — so we knew exactly which node we were testing), each region:
- Primary admin user (NOAM equivalent)
- EU canary
- APAC canary
For each login: open Terminal in the session, id; touch ~/cutover-test-<node>-<timestamp>.txt; ls -la. The file must land in $HOME with the user's AD UID, not as nobody:nogroup (which would indicate idmapd mapping failure).
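The ownership check on the test file can be scripted rather than read off ls by eye. A minimal sketch, with hypothetical helper names owner_of and check_canary_file, assuming user names contain no whitespace:

```shell
# owner_of FILE: print the owner of FILE as shown in the third column of
# `ls -ld` (a numeric UID if the name cannot be resolved).
owner_of() {
    ls -ld -- "$1" | awk '{ print $3 }'
}

# check_canary_file FILE USER: succeed only if FILE exists and is owned by
# USER. An owner of "nobody" would point at an idmapd mapping failure.
check_canary_file() {
    local owner
    owner=$(owner_of "$1")
    [ -n "$owner" ] && [ "$owner" = "$2" ]
}
```

Inside a canary session this could be run as `check_canary_file ~/cutover-test-<node>-<timestamp>.txt "$(id -un)"`, with the placeholders filled in to match the file created by touch.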
Things we learned
Multi-domain AD discovery on SSSD silently fails when restrictive flags are set. There's no error log entry saying "I would have discovered sibling child domains, but dns_discovery_domain is constraining me." sssctl domain-list will tell you what SSSD thinks the topology is — make that one of your first diagnostic commands when cross-domain auth misbehaves.
A fresh AD-joined host is the cheapest way to disprove "it's the protocol's fault." If a fresh, identically-joined host works and your in-place install doesn't, the bug is in your config or accumulated state. We spent less than a day building the lab and it saved us probably a week of guessing.
SSSD's access_provider = simple is a legitimate workaround for AD-provider access-check bugs in multi-domain forests. It loses GPO-based access control (which is rarely used on Linux anyway) but keeps auth_provider = ad so Kerberos still does its job. The combination is well-documented and supported.
The forest universal group + nested regional groups model is the right structure. Putting app_users_<region> groups into a single forest-wide app_users_forest universal group lets simple_allow_groups be a stable two-entry list rather than a maintenance burden that grows every time you add a region. Pay the AD-side cost of getting the nesting right; the Linux side becomes trivial.
File-level backups plus idempotent edits are much easier to reason about than line-level patching. The apply script edits 5 lines but backs up the whole sssd.conf. The backout doesn't need to know which lines changed; it just restores the file. That meant we could update the apply script multiple times during the project (adding new edits) without ever updating the backout.
If you're running into similar symptoms — child-domain-joined Linux host, cross-domain users not resolving or not authenticating, sssctl domain-list showing fewer domains than your forest actually has — start with dns_discovery_domain and ldap_referrals in your sssd.conf. Remove them, wipe the cache, restart sssd. You may find your forest was always there, just hidden.