Java Regex Recipes for Everyday Text Cleanup
Regular expressions (regex) are a powerful tool for textual data transformation and validation. In Java, the java.util.regex package provides robust support for regex. Whether you’re parsing user input, scrubbing logs, or cleaning text files, mastering regex can greatly simplify your Java development. In this article, we’ll walk through common text-cleaning scenarios using practical regex recipes that every Java developer should have ready.
1. Removing HTML Tags
HTML often finds its way into text—from scraped web data to user-generated input. If you need to sanitize such content for plain text use, regex can help strip out tags efficiently.
public static String removeHtmlTags(String input) {
    // Drop anything that looks like <...>; the text between tags is kept.
    return input.replaceAll("<[^>]+>", "");
}
Explanation: The pattern <[^>]+> matches a <, followed by one or more characters that aren’t >, and then a closing >. This simple approach works for well-formed tags, but it breaks when a > appears inside an attribute value or comment, and it keeps the text inside script and style elements.
Use Case: Cleaning quick blog snippets.
Tip: For more complex or untrusted HTML, use a parser such as jsoup; the regex is best kept for lightweight cleanup of markup you control.
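To make the limits concrete, here is a small sketch that runs the same replaceAll call on a well-formed snippet and on one with a > inside an attribute value. The class name HtmlTagDemo and the sample strings are illustrative only.

```java
class HtmlTagDemo {
    static String removeHtmlTags(String input) {
        return input.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // Well-formed markup: the tags go, the text stays.
        System.out.println(removeHtmlTags("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // A '>' inside an attribute ends the match early and leaves debris behind.
        System.out.println(removeHtmlTags("<a title=\"a > b\">link</a>"));
        // prints:  b">link
    }
}
```

The second result is the kind of leftover that suggests it is time to switch to a real HTML parser.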
2. Normalizing Whitespace
Excessive or irregular whitespace can cause unexpected behavior in data processing or UI rendering. Normalizing it can clean up the text for downstream tasks or display.
public static String normalizeWhitespace(String input) {
    // Trim the ends, then collapse each run of whitespace to one space.
    return input.trim().replaceAll("\\s+", " ");
}
Explanation: The regex \s+ (written "\\s+" in a Java string literal) matches any run of one or more whitespace characters (spaces, tabs, newlines). The trim() call first removes leading and trailing whitespace.
Use Case: Pre-processing user input or formatting strings for consistent logging or output.
Performance Tip: String.replaceAll() compiles its regex on every call. For large-scale processing, compile once with Pattern.compile() and reuse the resulting Pattern.
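One way to apply that tip is to hold the compiled pattern in a constant, as in the sketch below. The class name WhitespaceNormalizer and the constant name WHITESPACE_RUN are made up for this example; the behavior is meant to match normalizeWhitespace above.

```java
import java.util.regex.Pattern;

class WhitespaceNormalizer {
    // Compiled once when the class loads; Pattern instances are immutable and thread-safe.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String normalizeWhitespace(String input) {
        // A fresh Matcher per call is cheap; only the compilation is shared.
        return WHITESPACE_RUN.matcher(input.trim()).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalizeWhitespace("  too   many\t\tspaces\nhere  "));
        // prints: too many spaces here
    }
}
```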
3. Validating Email Addresses
Email validation is one of the most common uses of regex—but also one of the trickiest. Here’s a balanced pattern suitable for most practical use cases:
public static boolean isValidEmail(String email) {
    // matches() already tests the whole string; ^ and $ just make that explicit.
    String pattern = "^[A-Za-z0-9+_.-]+@[A-Za-z0-9.-]+$";
    return email.matches(pattern);
}
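Explanation: This pattern checks for a local part made of letters, digits, and the characters +_.- followed by @ and then a domain made of letters, digits, dots, and dashes. It avoids over-engineering and accepts most everyday addresses, but it is permissive: it does not require a dot in the domain, so a string like user@localhost passes.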
Edge Considerations: For RFC-compliant validation, consider libraries like Apache Commons Validator.
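If the permissive domain part is a problem, a slightly stricter variant is easy to sketch. The STRICTER pattern below is an assumption for illustration, not part of the original recipe and not RFC-compliant: it requires dot-separated domain labels ending in an alphabetic top-level label of at least two letters. The class name EmailCheck is likewise hypothetical.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // The pattern from the recipe above, compiled once.
    static final Pattern BASIC =
            Pattern.compile("^[A-Za-z0-9+_.-]+@[A-Za-z0-9.-]+$");

    // Hypothetical stricter variant: labels separated by dots, then a 2+ letter TLD.
    static final Pattern STRICTER = Pattern.compile(
            "^[A-Za-z0-9+_.-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}$");

    static boolean isValid(Pattern pattern, String email) {
        return email != null && pattern.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(BASIC, "user@example.com"));     // true
        System.out.println(isValid(BASIC, "user@localhost"));       // true: no dot required
        System.out.println(isValid(STRICTER, "user@localhost"));    // false
        System.out.println(isValid(STRICTER, "user@example.com"));  // true
    }
}
```

The null check is a small addition; the original method would throw a NullPointerException on a null argument.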
4. Extracting Phone Numbers
Suppose you need to pull phone numbers from a block of text, such as logs or free-form user input. Here’s a simple way for North American numbers:
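import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> extractPhoneNumbers(String text) {
    List<String> phones = new ArrayList<>();
    Matcher m = Pattern.compile("(\\(\\d{3}\\) \\d{3}-\\d{4}|\\d{3}-\\d{3}-\\d{4})").matcher(text);
    while (m.find()) {
        phones.add(m.group()); // group() returns the whole match
    }
    return phones;
}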
Explanation: This pattern captures phone numbers in formats like (123) 456-7890 or 123-456-7890, using alternation (|). It has no boundary checks, so it will also match a phone-shaped slice inside a longer run of digits.
Use Case: Data extraction from resumes or user-submitted forms.
Automation Tip: Use with Java Streams and I/O for bulk document processing.
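As a rough sketch of that tip, the extractor can be mapped over a stream of lines. The class name PhoneScan and the method extractAll are hypothetical; the example feeds in an in-memory Stream so it runs as-is, but a stream from Files.lines(path), opened in a try-with-resources block, should work the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PhoneScan {
    // Same alternation as above, compiled once for reuse across lines.
    private static final Pattern PHONE =
            Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}|\\d{3}-\\d{3}-\\d{4}");

    static List<String> extractPhoneNumbers(String text) {
        List<String> phones = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            phones.add(m.group());
        }
        return phones;
    }

    // Flattens the per-line matches into one list, preserving encounter order.
    static List<String> extractAll(Stream<String> lines) {
        return lines.flatMap(line -> extractPhoneNumbers(line).stream())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of(
                "Call (123) 456-7890 before noon",
                "no number here",
                "fax 555-123-4567 or 555-987-6543");
        System.out.println(extractAll(lines));
        // prints: [(123) 456-7890, 555-123-4567, 555-987-6543]
    }
}
```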
5. Masking Sensitive Data
You might need to sanitize logs or data dumps by masking content like credit card numbers or social security numbers.
public static String maskCreditCard(String input) {
    // Keep the first and last four digits; the middle becomes eight asterisks.
    return input.replaceAll("\\b(\\d{4})\\d{8,}(\\d{4})\\b", "$1********$2");
}
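Explanation: We capture the first and last four digits using groups (\\d{4}) and replace the middle part with asterisks. The replacement always writes eight asterisks, so a number longer than 16 digits comes out shorter than it went in. Numbers written with spaces or dashes, or with fewer than 16 digits, are not matched at all.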
Use Case: Secure log reporting, data anonymization workflows.
Regex Fact: Word boundaries (\\b) ensure we don’t grab partial matches.
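Where the masked value should keep its original length and layout, one possible variant masks digit by digit instead of using a fixed replacement string. Everything here is an assumption for illustration: the class name CardMasker, the choice of 13 to 19 digits, and the rule that digits may be separated by single spaces or dashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CardMasker {
    // Assumed format: 13 to 19 digits, optionally separated by single spaces or dashes.
    private static final Pattern CARD =
            Pattern.compile("\\b\\d(?:[ -]?\\d){12,18}\\b");

    static String maskCards(String input) {
        Matcher m = CARD.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String card = m.group();
            int totalDigits = card.replaceAll("\\D", "").length();
            StringBuilder masked = new StringBuilder();
            int seen = 0;
            for (char c : card.toCharArray()) {
                if (Character.isDigit(c)) {
                    seen++;
                    // Keep the first four and last four digits, star the rest.
                    boolean keep = (seen <= 4) || (seen > totalDigits - 4);
                    masked.append(keep ? c : '*');
                } else {
                    masked.append(c); // keep separators so the layout is unchanged
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(masked.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskCards("card 4111111111111111 ok"));
        // prints: card 4111********1111 ok
        System.out.println(maskCards("card 4111-1111-1111-1111 ok"));
        // prints: card 4111-****-****-1111 ok
    }
}
```

A pattern this loose will also match other long digit sequences, so it suits log scrubbing where over-masking is acceptable better than it suits validation.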
Conclusion
Regex is a must-have tool in your Java arsenal when it comes to text transformation and validation. By mastering common patterns and strategically applying them, you can automate boring cleanup tasks, ensure cleaner data pipelines, and develop more robust input handlers. For advanced tasks, always test your patterns carefully and use precompiled Pattern objects for better performance in large loops or services.
Whether it’s scraping, parsing, or cleaning—keep these Java regex recipes in your back pocket!