How did the Duolingo leak happen?
In March, Ivano Somaini wrote a tweet disclosing an unauthenticated Duolingo API as part of his Open Source Intelligence (OSINT) work.
The issue is pretty straightforward. A simple API call to the https://www.duolingo.com/2017-06-30/users?email endpoint reveals several private details about users and allows attackers to enumerate registered emails. Below is an example output:
{
"users": [
{
"joinedClassroomIds": [],
"streak": 0,
"motivation": "none",
"acquisitionSurveyReason": "none",
"shouldForceConnectPhoneNumber": false,
"picture": "//simg-ssl.duolingo.com/avatar/default_2",
"learningLanguage": "ru",
"hasFacebookId": false,
"shakeToReportEnabled": null,
"liveOpsFeatures": [
{
"startTimestamp": 1693007940,
"type": "TIMED_PRACTICE",
"endTimestamp": 1693180740
}
],
"canUseModerationTools": false,
"id": 184078602543312,
"betaStatus": "INELIGIBLE",
"hasGoogleId": false,
"privacySettings": [],
"fromLanguage": "en",
"hasRecentActivity15": false,
"_achievements": [],
"observedClassroomIds": [],
"username": "example",
"bio": "",
"profileCountry": "US",
"chinaUserModerationRecords": [],
"globalAmbassadorStatus": {},
"currentCourseId": "DUOLINGO_RU_EN",
"hasPhoneNumber": false,
"creationDate": 146229322008,
"achievements": [],
"hasPlus": false,
"name": "o",
"roles": ["users"],
"classroomLeaderboardsEnabled": false,
"emailVerified": false,
"courses": [
{
"preload": false,
"placementTestAvailable": false,
"authorId": "duolingo",
"title": "Russian",
"learningLanguage": "ru",
"xp": 370,
"healthEnabled": true,
"fromLanguage": "en",
"crowns": 7,
"id": "DUOLINGO_RU_EN"
}
],
"totalXp": 370,
"streakData": {
"currentStreak": null
}
}
]
}
Armed with this API, an attacker published a dump of 2.6 million user records on VX-Underground.
This kind of incident is far from isolated, and Duolingo is just one of many examples. In a similar incident in 2021, the “Add Friend” API allowed linking phone numbers to user accounts, costing Facebook over $275 million in fines from the Irish Data Protection Commission.
Introducing Gate
At SlashID, we believe that security begins with Identity. Gate is our identity-aware edge authorizer to protect APIs and workloads.
Gate can be used to monitor or enforce authentication, authorization and identity-based rate limiting on APIs and workloads, as well as to detect, anonymize, or block personally identifiable information (PII) exposed through your APIs or workloads.
Read on to learn how to deploy Gate to prevent data breaches like the ones mentioned above.
Deploying Gate
Gate can be deployed in multiple ways: as a sidecar for your service, as an external authorizer for Envoy, as an ingress proxy, or as a plugin for your favorite API Gateway. See more in the Gate configuration docs.
For this toy example we chose a simple Docker Compose deployment, which looks like this:
version: '3.7'
services:
  backend:
    build: backend
    ports:
      - 8000:8000
    environment:
      - PORT=8000
    env_file:
      - envs/env.env
    restart: on-failure
  gate:
    image: slashid/gate:latest
    volumes:
      - ./gate.yaml:/gate/gate.yaml
    ports:
      - 8080:8080
    env_file:
      - envs/env.env
    command: --yaml /gate/gate.yaml
    restart: on-failure
    depends_on:
      - backend
The Docker Compose file spawns two services: Gate and a toy backend.
Simulating the leaky API
Our toy backend contains a REST API that behaves similarly to the Duolingo one:
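from typing import Optional

from fastapi import FastAPI, HTTPException, Query

app = FastAPI()

# In-memory user store standing in for a real database
users = [
    {'email': 'test@example.com', 'name': 'Test User', 'id': 1},
    {'email': 'john@example.com', 'name': 'John Doe', 'id': 2},
    # ... add more users if needed
]


def get_user_by_email(email: str) -> Optional[dict]:
    """Return the user record with the given email, or None if absent."""
    for user in users:
        if user['email'] == email:
            return user
    return None


# Unauthenticated lookup: anyone can query a user record by email
@app.get("/get_user/", tags=["business"])
async def read_user(email: str = Query(..., description="The email of the user to search for")):
    user = get_user_by_email(email)
    if user:
        return user
    else:
        raise HTTPException(status_code=404, detail="User not found")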
Let’s test it:
curl 'http://gate:8080/get_user/?email=test@example.com' | jq
{
"email": "test@example.com",
"name": "Test User",
"id": 1
}
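To make the enumeration risk concrete, here is a minimal sketch of what an attacker gains from such an endpoint. It is self-contained and does not touch the network: the `lookup` function is a hypothetical stand-in for the unauthenticated HTTP call, and the candidate list is invented for illustration.

```python
from typing import Optional

# Hypothetical in-memory user store mirroring the toy backend above.
USERS = [
    {"email": "test@example.com", "name": "Test User", "id": 1},
    {"email": "john@example.com", "name": "John Doe", "id": 2},
]


def lookup(email: str) -> Optional[dict]:
    """Stand-in for an unauthenticated GET /get_user/?email=... call."""
    return next((u for u in USERS if u["email"] == email), None)


def enumerate_accounts(candidates: list[str]) -> dict[str, dict]:
    """Return the profile of every candidate email that is registered."""
    found = {}
    for email in candidates:
        profile = lookup(email)
        if profile is not None:  # a hit confirms the email is registered
            found[email] = profile
    return found


candidates = ["test@example.com", "nobody@example.com", "john@example.com"]
hits = enumerate_accounts(candidates)
print(sorted(hits))  # ['john@example.com', 'test@example.com']
```

Because every hit both confirms that an address is registered and returns the associated profile, a list of candidate emails is all that is needed to build a dump.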
Detecting PII data through Gate
Gate has a plugin-based architecture and we expose several built-in plugins. In particular, the PII Anonymizer plugin allows the detection and anonymization of PII or other sensitive data.
The PII Anonymizer plugin can be configured to exclusively monitor PII (as opposed to editing the traffic) by setting the anonymizers rule to keep. We’ll show an example in the next section.
Let’s see a simple Gate configuration that detects email addresses and rewrites the HTTP response to anonymize the field with a hash of the email address:
gate:
  port: 8080
  log:
    format: text
    level: info
  plugins_http_cache:
    - pattern: '*'
      cache_control_override: private, max-age=600, stale-while-revalidate=300
  plugins:
    - id: pii_anonymizer
      type: anonymizer
      enabled: false
      intercept: request_response
      parameters:
        anonymizers: |
          EMAIL_ADDRESS:
            type: hash
  urls:
    - pattern: '*/get_user'
      target: http://backend:8000
      plugins:
        pii_anonymizer:
          enabled: true
Let’s test it:
curl 'http://gate:8080/api/get_user/?email=test@example.com' | jq
{
"email": "973dfe463ec85785f5f95af5ba3906eedb2d931c24e69824a89ea65dba4e813b",
"id": 1,
"name": "Test User"
}
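The rewriting step can be pictured with a short sketch. This is not Gate’s implementation: the regular expression is a deliberately simplified email matcher, and SHA-256 is an assumption based only on the 64-character hex value in the output above, so the digest produced here may differ from what Gate returns.

```python
import hashlib
import json
import re

# Simplified email pattern for illustration only; real PII detectors are more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_email(match: re.Match) -> str:
    """Replace a matched email with its hex SHA-256 digest (assumed scheme)."""
    return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()


def anonymize(value):
    """Recursively hash email addresses found in string values of a JSON document."""
    if isinstance(value, str):
        return EMAIL_RE.sub(hash_email, value)
    if isinstance(value, list):
        return [anonymize(item) for item in value]
    if isinstance(value, dict):
        return {key: anonymize(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


body = {"email": "test@example.com", "name": "Test User", "id": 1}
print(json.dumps(anonymize(body), indent=2))
```

Hashing rather than deleting the field keeps the response shape stable for clients while still hiding the clear-text address.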
Detecting PII and blocking the request with OPA
Note: similarly to the PII detection plugin, the OPA plugin can also be run in monitoring mode. See the end of the blogpost to find out more.
Sometimes hashing the PII is not enough and you want to block the request entirely. Let’s see how to combine the PII detection plugin with the OPA plugin to detect and block requests containing PII data.
Note: In the examples below we embed the OPA policies directly in the Gate config, but they can also be served through a bundle. Please check out our documentation to learn more about the plugin.
gate:
  port: 8080
  log:
    format: text
    level: info
  plugins_http_cache:
    - pattern: '*'
      cache_control_override: private, max-age=600, stale-while-revalidate=300
  plugins:
    - id: authz_deny_pii
      type: opa
      enabled: false
      intercept: response
      parameters:
        <<: *slashid_config
        policy_decision_path: /authz/allow
        policy: |
          package authz
          import future.keywords.if
          default allow := false
          no_key_found(obj, key) {
            not obj[key]
          }
          allow if no_key_found(input.response.http.headers, "X-Gate-Anonymize-1")
    - id: pii_anonymizer
      type: anonymizer
      enabled: false
      intercept: request_response
      parameters:
        anonymizers: |
          DEFAULT:
            type: keep
  urls:
    - pattern: '*/get_user'
      target: http://backend:8000
      plugins:
        pii_anonymizer:
          enabled: true
        authz_deny_pii:
          enabled: true
The authz_deny_pii instance of the OPA plugin enforces an OPA policy that blocks a request if the response contains an X-Gate-Anonymize-1 header. This is a header added by the PII detection plugin to signal the presence of PII.
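For readers less familiar with Rego, the decision made by this policy can be sketched in a few lines of Python. The function name is hypothetical and the sketch only mirrors the logic, not how Gate evaluates policies. One deliberate difference: the sketch compares header names case-insensitively, whereas the Rego lookup above is an exact key match.

```python
PII_HEADER = "X-Gate-Anonymize-1"


def allow_without_pii(response_headers: dict[str, str]) -> bool:
    """Allow the response only when the PII marker header is absent."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    names = {name.lower() for name in response_headers}
    return PII_HEADER.lower() not in names


clean = {"Content-Type": "application/json"}
leaky = {
    "Content-Type": "application/json",
    "X-Gate-Anonymize-1": "$.body.email 0 64 EMAIL_ADDRESS",
}
print(allow_without_pii(clean))  # True
print(allow_without_pii(leaky))  # False
```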
Let’s see it in action:
/usr/server/app $ curl --verbose 'http://gate:8080/api/get_user/?email=test@example.com' | jq
* processing: http://gate:8080/api/get_user/?email=test@example.com
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 172.27.0.5:8080...
* Connected to gate (172.27.0.5) port 8080
> GET /api/get_user/?email=test@example.com HTTP/1.1
> Host: gate:8080
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Cache-Control: private, max-age=600, stale-while-revalidate=300
< Content-Length: 0
< Content-Type: application/json
< Date: Sat, 02 Sep 2023 13:58:00 GMT
< Server: uvicorn
< Via: 1.0 gate
< X-Gate-Anonymize-1: $.body.email 0 64 EMAIL_ADDRESS
<
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
* Connection #0 to host gate left intact
Note that in this example pii_anonymizer is set to monitoring mode: type: keep for all PII types (DEFAULT). The plugin allows PII to pass through unchanged, without replacing it with an anonymized version of the data or changing the traffic in any way.
- id: pii_anonymizer
  type: anonymizer
  enabled: false
  intercept: request_response
  parameters:
    anonymizers: |
      DEFAULT:
        type: keep
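In monitoring mode the X-Gate-Anonymize-1 header is the only signal that PII was found, so downstream tooling may want to parse it. The sketch below assumes the value has the form "path start end TYPE", which is inferred from the example responses in this post rather than taken from the Gate documentation; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiFinding:
    path: str         # JSONPath-like location, e.g. "$.body.email"
    start: int        # assumed: start offset of the match within the field
    end: int          # assumed: end offset of the match within the field
    entity_type: str  # detected PII type, e.g. "EMAIL_ADDRESS"


def parse_finding(header_value: str) -> PiiFinding:
    """Parse one header value of the assumed form 'path start end TYPE'."""
    path, start, end, entity_type = header_value.split()
    return PiiFinding(path, int(start), int(end), entity_type)


finding = parse_finding("$.body.email 0 64 EMAIL_ADDRESS")
print(finding.path, finding.entity_type)  # $.body.email EMAIL_ADDRESS
```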
Differential policy enforcement for authenticated users
Let’s now enforce a new OPA policy that blocks requests containing PII only if the user is not authenticated, while allowing PII in requests of authenticated users.
For simplicity, in this example we’ll use SlashID Access to handle authentication, but any Identity Provider (IdP) would be suitable.
gate:
  port: 8080
  log:
    format: text
    level: info
  plugins_http_cache:
    - pattern: '*'
      cache_control_override: private, max-age=600, stale-while-revalidate=300
  plugins:
    - id: authz_allow_if_authed_pii
      type: opa
      enabled: false
      intercept: response
      parameters:
        <<: *slashid_config
        policy_decision_path: /authz/allow
        policy: |
          package authz
          import future.keywords.if
          default allow := false
          key_found(obj, key) if { obj[key] }
          jwks_request := http.send({
            "cache": true,
            "method": "GET",
            "url": "https://api.slashid.com/.well-known/jwks.json"
          })
          valid_signature if io.jwt.verify_rs256(input.request.token, jwks_request.raw_body)
          allow if not key_found(input.response.http.headers, "X-Gate-Anonymize-1")
          allow if valid_signature
    - id: pii_anonymizer
      type: anonymizer
      enabled: false
      intercept: request_response
      parameters:
        anonymizers: |
          DEFAULT:
            type: keep
  urls:
    - pattern: '*/get_user'
      target: http://backend:8000
      plugins:
        pii_anonymizer:
          enabled: true
        authz_allow_if_authed_pii:
          enabled: true
This rule is a bit more complicated, so let’s see what happens step by step.
- First, we retrieve the JSON Web Key Set (JWKS) from https://api.slashid.com/.well-known/jwks.json.
- Next, we check that either the incoming authorization token has a valid RS256 signature signed by SlashID or that X-Gate-Anonymize-1 is not present.
- If either condition is true, the request is allowed.
Let’s see this in action:
curl --verbose -L 'http://gate:8080/api/get_user/?email=test@example.com' | jq
* processing: http://gate:8080/api/get_user/?email=test@example.com
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 172.27.0.5:8080...
* Connected to gate (172.27.0.5) port 8080
> GET /api/get_user/?email=test@example.com HTTP/1.1
> Host: gate:8080
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Cache-Control: private, max-age=600, stale-while-revalidate=300
< Content-Length: 0
< Content-Type: application/json
< Date: Sat, 02 Sep 2023 16:04:24 GMT
< Server: uvicorn
< Via: 1.0 gate
< X-Gate-Anonymize-1: $.body.email 0 64 EMAIL_ADDRESS
<
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Connection #0 to host gate left intact
The request above is blocked because there is PII in the response and no valid JWT has been provided.
Let’s send a request with a valid token:
curl -H "Authorization: Bearer <TOKEN>" 'http://gate:8080/api/get_user/?email=test@example.com' | jq
{
"email": "test@example.com",
"id": 1,
"name": "Test User"
}
Note that in this case we configured the PII plugin to alert on the presence of PII but not to replace or obfuscate it in any way, which is why we see the original clear-text response.
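The two outcomes above follow from a simple rule: allow when no PII was detected, or when the caller presents a valid token. Here is a minimal Python sketch of that rule. The `verify` callable is a hypothetical stub; a real verifier would check the RS256 signature against the JWKS, which is omitted here to keep the example dependency-free.

```python
from typing import Callable, Mapping, Optional

PII_HEADER = "X-Gate-Anonymize-1"


def allow_if_authed(
    response_headers: Mapping[str, str],
    token: Optional[str],
    verify: Callable[[str], bool],
) -> bool:
    """Allow when the response has no PII marker or the caller's token verifies."""
    pii_detected = PII_HEADER in response_headers
    token_valid = token is not None and verify(token)
    return (not pii_detected) or token_valid


# Hypothetical verifier stub standing in for RS256 signature verification.
def fake_verify(token: str) -> bool:
    return token == "valid-token"


pii = {PII_HEADER: "$.body.email 0 16 EMAIL_ADDRESS"}
print(allow_if_authed(pii, None, fake_verify))           # False: PII, no token
print(allow_if_authed(pii, "valid-token", fake_verify))  # True: PII, valid token
print(allow_if_authed({}, None, fake_verify))            # True: no PII
```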
Depending on the IdP you are using, it is also possible to create more complex policies that not only check the validity of the identity token, but also examine specific properties of the token. (Look out for our next Gate blogpost for a deeper dive into this topic!)
Blocking requests to unknown URLs
More often than not, companies don’t really know which APIs are exposed to begin with. Gate can help in this scenario too.
Gate plugin instances can be applied to all routes, or you can select specific routes. In the example config below we enable the PII and OPA plugin instances on all routes and selectively disable them on specific routes:
gate:
  port: 8080
  log:
    format: text
    level: info
  plugins_http_cache:
    - pattern: "*"
      cache_control_override: private, max-age=600, stale-while-revalidate=300
  plugins:
    - id: authz_allow_if_authed_pii
      type: opa
      enabled: true
      intercept: response
      parameters:
        <<: *slashid_config
        policy_decision_path: /authz/allow
        policy: |
          package authz
          import future.keywords.if
          default allow := false
          key_found(obj, key) if { obj[key] }
          jwks_request := http.send({
            "cache": true,
            "method": "GET",
            "url": "https://api.slashid.com/.well-known/jwks.json"
          })
          valid_signature if io.jwt.verify_rs256(input.request.token, jwks_request.raw_body)
          allow if not key_found(input.response.http.headers, "X-Gate-Anonymize-1")
          allow if valid_signature
    - id: pii_anonymizer
      type: anonymizer
      enabled: true
      intercept: request_response
      parameters:
        anonymizers: |
          DEFAULT:
            type: keep
  urls:
    - pattern: "*/api/echo"
      target: http://backend:8000
      plugins:
        authz_allow_if_authed_pii:
          enabled: false
        pii_anonymizer:
          enabled: false
    - pattern: "*"
      target: http://backend:8000
Note how the plugins are defined as enabled by default and how in the URLs we explicitly disable the plugins on selected paths (e.g. "*/api/echo").
/usr/server/app $ curl --verbose -X POST 'http://gate:8080/api/echo' -d "email=abc@abc.com" | jq
Note: Unnecessary use of -X or --request, POST is already inferred.
* processing: http://gate:8080/api/echo
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 172.27.0.5:8080...
* Connected to gate (172.27.0.5) port 8080
> POST /api/echo HTTP/1.1
> Host: gate:8080
> User-Agent: curl/8.2.1
> Accept: */*
> Content-Length: 17
> Content-Type: application/x-www-form-urlencoded
>
} [17 bytes data]
< HTTP/1.1 200 OK
< Cache-Control: private, max-age=600, stale-while-revalidate=300
< Content-Length: 360
< Content-Type: application/json
< Date: Sun, 03 Sep 2023 09:30:38 GMT
< Server: uvicorn
< Via: 1.0 gate
<
{ [360 bytes data]
100 377 100 360 100 17 32933 1555 --:--:-- --:--:-- --:--:-- 37700
* Connection #0 to host gate left intact
{
"method": "POST",
"headers": {
"host": "backend:8000",
"user-agent": "curl/8.2.1",
"content-length": "17",
"accept": "*/*",
"content-type": "application/x-www-form-urlencoded",
"x-b3-sampled": "1",
"x-b3-spanid": "39b9a26c103c6b5d",
"x-b3-traceid": "ce0b56fc209ec47fbe0496606595c06b",
"accept-encoding": "gzip"
},
"url": "http://backend:8000/api/echo",
"body": {
"email": "abc@abc.com"
}
}
/usr/server/app $ curl --verbose 'http://gate:8080/api/get_user/?email=test@example.com' | jq
* processing: http://gate:8080/api/get_user/?email=test@example.com
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 172.27.0.5:8080...
* Connected to gate (172.27.0.5) port 8080
> GET /api/get_user/?email=test@example.com HTTP/1.1
> Host: gate:8080
> User-Agent: curl/8.2.1
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Cache-Control: private, max-age=600, stale-while-revalidate=300
< Content-Length: 0
< Content-Type: application/json
< Date: Sun, 03 Sep 2023 09:31:37 GMT
< Server: uvicorn
< Via: 1.0 gate
< X-Gate-Anonymize-1: $.body.email 0 16 EMAIL_ADDRESS
<
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
* Connection #0 to host gate left intact
/usr/server/app $
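The default-on, opt-out behavior shown in this section can be modeled with a small sketch. It assumes glob-style patterns evaluated in order with the first match winning, which is consistent with the results above but is our simplification, not a statement of Gate’s matching algorithm; the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Plugin defaults and per-route overrides, mirroring the config above.
DEFAULTS = {"authz_allow_if_authed_pii": True, "pii_anonymizer": True}
ROUTES = [
    ("*/api/echo", {"authz_allow_if_authed_pii": False, "pii_anonymizer": False}),
    ("*", {}),  # catch-all route: keep the defaults
]


def enabled_plugins(url: str) -> list[str]:
    """Return the plugins active for a URL, assuming the first matching pattern wins."""
    for pattern, overrides in ROUTES:
        if fnmatchcase(url, pattern):
            effective = {**DEFAULTS, **overrides}
            return sorted(name for name, on in effective.items() if on)
    return []  # no route matched


print(enabled_plugins("gate:8080/api/echo"))       # []
print(enabled_plugins("gate:8080/api/get_user/"))  # both plugins
```

Keeping the catch-all route last means any endpoint nobody remembered to list is still covered by both plugins.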
Running in monitoring mode
Just like the PII detection plugin, the OPA plugin also supports monitoring mode by adding monitoring_mode: true in its parameters as shown below:
- id: authz_allow_if_authed_pii
  type: opa
  enabled: true
  intercept: response
  parameters:
    <<: *slashid_config
    monitoring_mode: true
    policy_decision_path: /authz/allow
    policy: |
      package authz
      import future.keywords.if
      default allow := false
      key_found(obj, key) if { obj[key] }
      jwks_request := http.send({
        "cache": true,
        "method": "GET",
        "url": "https://api.slashid.com/.well-known/jwks.json"
      })
      valid_signature if io.jwt.verify_rs256(input.request.token, jwks_request.raw_body)
      allow if not key_found(input.response.http.headers, "X-Gate-Anonymize-1")
      allow if valid_signature
Let’s send a request with an invalid token:
curl -H "Authorization: Bearer abc" 'http://gate:8080/api/get_user/?email=test@example.com' | jq
{
"email": "test@example.com",
"id": 1,
"name": "Test User"
}
The request passes but Gate logs the policy violation:
gate-demo-gate-1 | time=2023-09-04T13:37:06Z level=info msg=OPA decision: false decision_id=d9b20a8d-da43-4786-ae15-1ec91199786d decision_provenance={0.55.0 19fc439d01c8d667b128606390ad2cb9ded04b16-dirty 2023-09-02T15:18:29Z map[gate:{}]} plugin=opa req_path=/api/get_user/ request_host=gate:8080 request_url=/api/get_user/?email=test%40example.com
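The difference between enforcing and monitoring can be summarized in a few lines. This is a conceptual sketch with hypothetical names; in particular, logging the violation in enforcing mode as well is an assumption on our part.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("gate-sketch")  # hypothetical logger name


def should_forward(policy_allows: bool, monitoring_mode: bool) -> bool:
    """Decide whether the response is forwarded to the client."""
    if policy_allows:
        return True
    # Assumption: the violation is logged in both modes; only enforcement differs.
    logger.info("OPA decision: false (monitoring_mode=%s)", monitoring_mode)
    return monitoring_mode


print(should_forward(policy_allows=False, monitoring_mode=True))   # True: logged, not blocked
print(should_forward(policy_allows=False, monitoring_mode=False))  # False: blocked
```

Running a new policy in monitoring mode first lets you review the logged violations before switching enforcement on.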
Performance
Performance is key when intercepting and modifying network traffic, so our plugins were built with high performance in mind. For instance, we embed an optimized version of a Rego interpreter rather than standing up a separate OPA server.
Let’s look at a simple benchmark to see the impact of the two plugins on the network traffic.
Here’s a simple benchmarking script:
#!/bin/sh
# Usage: ./benchmark.sh <iterations> <url>
iterations=$1
url=$2
echo "Running $iterations iterations for curl $url"
totaltime=0.0
for run in $(seq 1 "$iterations")
do
    # Time a single request, discarding the response body
    time=$(curl "$url" \
        -s -o /dev/null -w "%{time_total}")
    totaltime=$(echo "$totaltime + $time" | bc)
done
# Convert the accumulated seconds into an average in milliseconds
avgtimeMs=$(echo "scale=4; 1000*$totaltime/$iterations" | bc)
echo "Averaged $avgtimeMs ms in $iterations iterations"
In our demo, a request without any interception results in the following:
/usr/server/app $ ./benchmark.sh 10000 'http://gate:8080/api/get_user/?email=test@example.com'
Running 10000 iterations for curl http://gate:8080/api/get_user/?email=test@example.com
Averaged 1.1820 ms in 10000 iterations
/usr/server/app $
When we enable PII detection and rewriting (hashing of the email address) coupled with our caching plugin, we get:
/usr/server/app $ ./benchmark.sh 10000 'http://gate:8080/api/get_user/?email=test@example.com'
Running 10000 iterations for curl http://gate:8080/api/get_user/?email=test@example.com
Averaged 1.5955 ms in 10000 iterations
/usr/server/app $
Next, we test PII detection in monitoring mode:
/usr/server/app $ ./benchmark.sh 10000 'http://gate:8080/api/get_user/?email=test@example.com'
Running 10000 iterations for curl http://gate:8080/api/get_user/?email=test@example.com
Averaged 1.5176 ms in 10000 iterations
/usr/server/app $
Last, let’s run PII detection in monitoring mode coupled with OPA like we did in the example in the previous section:
/usr/server/app $ ./benchmark.sh 10000 'http://gate:8080/api/get_user/?email=test@example.com'
Running 10000 iterations for curl http://gate:8080/api/get_user/?email=test@example.com
Averaged 1.8532 ms in 10000 iterations
/usr/server/app $
Thanks to a combination of our caching plugin and Gate’s own architecture, the average overhead in our toy application is 0.6712 ms (1.8532 ms versus the 1.1820 ms baseline) when both OPA and PII detection are turned on.
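For completeness, the overhead of each configuration relative to the baseline can be derived from the averages reported above. The short script below only redoes that arithmetic; the labels are ours.

```python
BASELINE_MS = 1.1820  # no interception

# Averages reported by the benchmark runs above, in milliseconds.
RESULTS_MS = {
    "PII hashing": 1.5955,
    "PII monitoring": 1.5176,
    "PII monitoring + OPA": 1.8532,
}

# Absolute overhead of each configuration over the baseline.
overheads = {name: avg - BASELINE_MS for name, avg in RESULTS_MS.items()}

for name, overhead in overheads.items():
    relative = overhead / BASELINE_MS
    print(f"{name}: +{overhead:.4f} ms ({relative:.0%} over baseline)")
```

In relative terms the overhead is noticeable because the toy backend answers in about a millisecond; against a backend with realistic latency, the same sub-millisecond cost would be a much smaller fraction of the total.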
Conclusion
In this blogpost we’ve shown how you can combine the Gate PII and OPA plugins to easily detect and prevent PII leakage.
We’d love to hear any feedback you may have! Try out Gate with a free account. If you’d like to use the PII detection plugin, please contact us at contact@slashid.dev!