Programmatic web page login with cookies
.NET makes it really easy to scrape some data from a public website. You can use HttpClient to download the web page:
public async Task<string> GetHtmlAsync(string username)
{
    // Escape the user name so special characters cannot break the query string.
    var uri = new Uri(
        $"https://backloggery.com/ajax_moregames.php?alpha=1&user={Uri.EscapeDataString(username)}");
    return await httpClient.GetStringAsync(uri).ConfigureAwait(false);
}
The best library for parsing the HTML is probably AngleSharp, but that's not the topic of this post. Instead, I'll focus on what to do if the web page is not public and you need to log in first. Typically, you will then need to submit the login form programmatically and use the cookies from the response in future requests.
You should do the login in a private browser window first, opening the Network tab of Developer Tools to see the interaction required.
Figure: Web page login request in the browser Developer Tools
I have highlighted the parts you should pay attention to (and blurred out the sensitive data). Let us go through them in order:
- Request URL is the URL you should send the credentials to.
- Request Method is the HTTP verb to use. In most cases, it should be POST.
- Status Code is often 3xx, which means a redirect. If this is the case, you will need to disable automatic redirection in HttpClient to access the actual login response.
- set-cookie headers contain the cookies that the server expects to receive in future requests from the logged-in client (in this case, your application).
- Form Data contains the data that must be submitted to the login URL. This includes the values of all input fields (with the value of their name attribute as the key), and it's not uncommon for it to include several hidden input fields from the login page.
Based on all this information, you should be able to implement a login method similar to the following:
private async Task<IEnumerable<KeyValuePair<string, string>>> LoginAsync(
    string username, string password)
{
    var loginUri = new Uri("https://backloggery.com/login.php");
    // Form fields to submit; the keys match the names seen under Form Data.
    var loginParams = new KeyValuePair<string?, string?>[]
    {
        new ("username", username),
        new ("password", password),
    };
    using var content = new FormUrlEncodedContent(loginParams);
    // Dispose the response once the cookies have been read from it.
    using var response = await httpClient.PostAsync(loginUri, content)
        .ConfigureAwait(false);
    var loginCookies = response.GetCookies();
    if (!loginCookies.Any())
    {
        throw new ArgumentException("Login failed. Invalid username or password.");
    }
    return loginCookies;
}
The data that needs to be entered by the user is passed in via parameters. The rest is hard-coded. A few details to keep in mind:
- The form data is specified as a sequence of KeyValuePair<string?, string?> instances. In my case, only user-supplied data is required. If a site requires you to include hidden fields from the login page, you need to request that page first and parse the HTML using AngleSharp or some other method.
- The resulting sequence is then passed to the FormUrlEncodedContent constructor. The keys and values in the sequence are nullable strings, because that's what FormUrlEncodedContent expects. The created FormUrlEncodedContent instance is used as the content of the POST request in the PostAsync call.
The next step is to parse the set-cookie headers from the HTTP response. I do not think there are any helper methods for this in the .NET base class library. Since I could not find any NuGet packages either, I used a regular expression to do the parsing. That worked well enough for me:
public static IDictionary<string, string> GetCookies(
    this HttpResponseMessage response)
{
    var cookies = new Dictionary<string, string>();
    // GetValues throws if the header is missing, so use TryGetValues instead.
    if (response == null
        || !response.Headers.TryGetValues("Set-Cookie", out var setCookieStrings))
    {
        return cookies;
    }
    foreach (var setCookieString in setCookieStrings)
    {
        // The cookie is the leading name=value pair; anchoring the pattern
        // keeps attributes such as Max-Age from being picked up instead.
        var match = Regex.Match(
            setCookieString, @"^\s*(?<key>[^=;\s]+)=(?<value>[^;]*)");
        if (match.Success)
        {
            cookies[match.Groups["key"].Value] = match.Groups["value"].Value;
        }
    }
    return cookies;
}
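As an alternative to a regular expression, the CookieContainer class in System.Net can also parse set-cookie header values through its SetCookies method. The following is only a minimal sketch of that approach, not the code used elsewhere in this post. The SetCookieParsing class and its Parse method are hypothetical names, and the sketch assumes each header value describes a cookie that is valid for the URI the response came from.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper; the class and method names are illustrative only.
public static class SetCookieParsing
{
    public static IDictionary<string, string> Parse(
        Uri responseUri, IEnumerable<string> setCookieHeaders)
    {
        var container = new CookieContainer();
        foreach (var header in setCookieHeaders)
        {
            // SetCookies parses the name, value and attributes of the header.
            // It can throw a CookieException for headers it rejects.
            container.SetCookies(responseUri, header);
        }

        var result = new Dictionary<string, string>();
        // GetCookies returns only the cookies that apply to the given URI.
        foreach (Cookie cookie in container.GetCookies(responseUri))
        {
            result[cookie.Name] = cookie.Value;
        }
        return result;
    }
}
```

The header values could come from response.Headers.TryGetValues("Set-Cookie", ...), with the login URL passed as responseUri. Because the container applies its own domain, path and expiry rules, it may drop cookies that a plain regular expression would keep.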
To detect a failed login, I only needed to check if there were cookies in the response. For other sites, you might need to look for a specific cookie, check the status code of the response, or even parse the returned HTML content.
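For sites where the mere presence of cookies is not enough, a stricter check could combine the status code with a specific cookie. This is a hedged sketch: the LoginResultCheck class, the LooksSuccessful method and the requiredCookieName parameter are all hypothetical, and the rule it applies (a 2xx or 3xx status plus one named cookie) is an assumption that will not fit every site.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Http;

// Hypothetical helper; adjust the rule to what the target site actually returns.
public static class LoginResultCheck
{
    public static bool LooksSuccessful(
        HttpResponseMessage response,
        IDictionary<string, string> cookies,
        string requiredCookieName)
    {
        // With automatic redirection disabled, a successful login often
        // answers with a 3xx status, so accept both 2xx and 3xx here.
        var status = (int)response.StatusCode;
        var statusOk = status >= 200 && status < 400;

        // Assumption: the site sets one well-known cookie only after a
        // successful login.
        return statusOk && cookies.ContainsKey(requiredCookieName);
    }
}
```

The cookie name to pass in is the one that shows up in the set-cookie headers in the browser's Network tab after a successful login.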
The LoginAsync method needs to be called before downloading the requested web page, so that the returned login cookies can be used for the request:
public async Task<string> GetHtmlAsync(
    string username, string? password = null)
{
    // Log in only when a password is supplied; otherwise request the public page.
    IEnumerable<KeyValuePair<string, string>>? loginCookies = null;
    if (!string.IsNullOrEmpty(password))
    {
        loginCookies = await LoginAsync(username, password)
            .ConfigureAwait(false);
    }
    // Escape the user name so special characters cannot break the query string.
    var uri = new Uri(
        $"https://backloggery.com/ajax_moregames.php?alpha=1&user={Uri.EscapeDataString(username)}");
    return await httpClient.GetStringAsync(uri, loginCookies)
        .ConfigureAwait(false);
}
Unfortunately, there is no built-in GetStringAsync method overload with a parameter for cookies. Therefore, I had to create an extension method myself:
public static async Task<string> GetStringAsync(
    this HttpClient httpClient,
    Uri uri,
    IEnumerable<KeyValuePair<string, string>>? cookies)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, uri);
    if (cookies != null)
    {
        // Combine all cookies into a single "name=value; name=value" header.
        var cookieValue = string.Join(
            "; ",
            cookies.Select(cookie => $"{cookie.Key}={cookie.Value}"));
        request.Headers.Add("cookie", cookieValue);
    }
    // Dispose the response once its content has been read.
    using var response = await httpClient.SendAsync(request).ConfigureAwait(false);
    return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
}
I had to combine all the cookies myself into a single cookie header and then use the SendAsync method to send the generated HTTP request.
There is one last code change needed. If the login response status code requires redirection, the automatic redirection feature of the HttpClient must be disabled. This must be done when the HttpClient instance is created. Since you should do this with IHttpClientFactory, you can pass the option when registering the HttpClient for dependency injection:
serviceCollection.AddHttpClient<BackloggeryClient>()
    .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
    {
        AllowAutoRedirect = false,
    });
If you have problems with the above, you can check out the full code in my GitHub repository.
There is no simple built-in functionality in HttpClient or .NET in general to programmatically log in to websites and download web pages using that login. You have to perform all the steps manually: submit the login form data, parse the returned cookies, and include those cookies in future requests. The basic building blocks for each of these steps are available.