IPv6 validation (and caveats)

By crisp on Friday 12 June 2009 01:23 - Comments (23)
Categories: Internet, PHP, Tweakers.net, Views: 19.120

Recently we got a request to also match IPv6 addresses as the host part of our auto-links. This seemed pretty straightforward at first, but it turned out that actually validating an IPv6 address is quite difficult.

I started out with the RFC mentioned in the title of the request, RFC-2732, which in turn refers to RFC-2373 for the ABNF syntax of IPv6:
code:
IPv6address = hexpart [ ":" IPv4address ]
IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT

IPv6prefix  = hexpart "/" 1*2DIGIT

hexpart = hexseq | hexseq "::" [ hexseq ] | "::" [ hexseq ]
hexseq  = hex4 *( ":" hex4)
hex4    = 1*4HEXDIG

Now that seemed easy, and a regexp was quickly made:
PHP:

<?php
preg_match('/^
        (?:
                [a-f0-9]{1,4}(?::[a-f0-9]{1,4})*
            |
                [a-f0-9]{1,4}(?::[a-f0-9]{1,4})*::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4})*)?
            |
                ::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4})*)?
        )
        (?::\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})?
    $/ix',
    $IP);
?>

However, I quickly noticed that this expression (and thus the ABNF in the RFC) couldn't be correct: it doesn't limit the number of hex4 groups, and the way it checks for an IPv4-formatted part is totally wrong.
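Both problems are easy to reproduce. The following is a small sketch that wraps the RFC-2373 pattern in a helper; the name `matchesRfc2373` and the three sample addresses are illustrative assumptions, not part of the original test set.

```php
<?php
// Hypothetical helper: wraps the pattern transcribed from the RFC-2373 ABNF.
function matchesRfc2373($IP)
{
    return preg_match('/^
            (?:
                    [a-f0-9]{1,4}(?::[a-f0-9]{1,4})*
                |
                    [a-f0-9]{1,4}(?::[a-f0-9]{1,4})*::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4})*)?
                |
                    ::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4})*)?
            )
            (?::\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})?
        $/ix', $IP) === 1;
}

// Nine groups: too many for a 128-bit address, yet accepted.
var_dump(matchesRfc2373('1:2:3:4:5:6:7:8:9'));    // bool(true)

// Octets above 255 are accepted in the IPv4 part.
var_dump(matchesRfc2373('1::2:999.999.999.999')); // bool(true)

// A form that RFC-3986 allows is rejected, because this grammar
// demands an extra ":" in front of the IPv4 part.
var_dump(matchesRfc2373('::1.2.3.4'));            // bool(false)
```

The first case shows the missing limit on hex4 groups; the other two show why the IPv4 handling cannot be trusted.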

Luckily this has been fixed in RFC-3986 which mentions the following ABNF:
code:
IPv6address   =                            6( h16 ":" ) ls32
                 /                       "::" 5( h16 ":" ) ls32
                 / [h16] "::" 4( h16 ":" ) ls32
                 / [*1( h16 ":" ) h16] "::" 3( h16 ":" ) ls32
                 / [*2( h16 ":" ) h16] "::" 2( h16 ":" ) ls32
                 / [*3( h16 ":" ) h16] "::"    h16 ":"   ls32
                 / [*4( h16 ":" ) h16] "::"              ls32
                 / [*5( h16 ":" ) h16] "::"              h16
                 / [*6( h16 ":" ) h16] "::"

h16           = 1*4HEXDIG
ls32          = ( h16 ":" h16 ) / IPv4address
IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet

dec-octet     = DIGIT                 ; 0-9
                 / %x31-39 DIGIT         ; 10-99
                 / "1" 2DIGIT            ; 100-199
                 / "2" %x30-34 DIGIT     ; 200-249
                 / "25" %x30-35          ; 250-255

Now that's something else... Also note how strictly this defines the format of an IPv4 address. This called for either a monstrous regular expression or a tokenizing approach. I started out with the latter but quickly abandoned it because it became rather chaotic and didn't perform very well. Only then did it occur to me that reinventing the wheel probably wasn't the best way to go about it, so I started looking for existing IPv6 validation scripts :P
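To illustrate how much stricter the dec-octet rule is, here is a minimal sketch that compares the loose `1*3DIGIT` rule from RFC-2373 with the dec-octet rule from RFC-3986. The variable names and sample addresses are assumptions made for this example.

```php
<?php
// Loose RFC-2373 rule: any one to three digits per octet.
$looseIPv4 = '/^\d{1,3}(?:\.\d{1,3}){3}$/';

// Strict RFC-3986 rule: dec-octet is 0-255 without leading zeros.
$decOctet   = '(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])';
$strictIPv4 = '/^' . $decOctet . '(?:\.' . $decOctet . '){3}$/';

foreach (array('192.168.0.1', '999.1.1.1', '01.02.03.04') as $candidate) {
    printf("%-12s loose=%d strict=%d\n", $candidate,
        preg_match($looseIPv4, $candidate),
        preg_match($strictIPv4, $candidate));
}
```

All three candidates pass the loose rule, but only 192.168.0.1 passes the strict one.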

I first stumbled across this ticket for CakePHP, which contained a patch with a reasonable-looking regular expression to match IPv6, but on closer inspection it didn't quite match up with RFC-3986 and it failed miserably on a large number of testcases I made.

Then I found PHP's own filter_var function (PHP >= 5.2), which I previously didn't know about :o. It also has a validation filter specifically for IP addresses, accepting either both address families or only IPv4 or only IPv6. Hooray! :)

To test for a valid IPv6 address, you simply use:
PHP:

<?php
$result = filter_var($IP, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6);
?>

It either returns the IP address, or false when it doesn't validate *O*
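A short sketch of what that return value looks like in practice; the sample addresses are assumptions chosen for illustration.

```php
<?php
// With FILTER_FLAG_IPV6, only IPv6 notation is accepted.
var_dump(filter_var('2001:db8::1', FILTER_VALIDATE_IP, FILTER_FLAG_IPV6)); // string(11) "2001:db8::1"
var_dump(filter_var('192.168.0.1', FILTER_VALIDATE_IP, FILTER_FLAG_IPV6)); // bool(false)
var_dump(filter_var('not-an-ip',   FILTER_VALIDATE_IP, FILTER_FLAG_IPV6)); // bool(false)

// Because either a string or false comes back, compare strictly against false.
$isValid = filter_var('::1', FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
```

The strict comparison keeps the check independent of how the returned string would be coerced in a boolean context.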

And then it failed 3 of my testcases... -O-

These are 3 examples of invalid addresses which pass using PHP's filter_var:

::01.02.03.04 (leading zeros not allowed in IPv4 part digits)
0:0:0:255.255.255.255 (not enough parts and no compression)
1fff::a88:85a3::172.31.128.1 (only one part may be compressed)
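The results above were observed on the PHP version in use at the time; other versions may behave differently. A minimal sketch for re-checking the three addresses on your own installation (the helper name `recheckWithFilterVar` is hypothetical, and no particular output is implied):

```php
<?php
// Hypothetical helper: reports what filter_var() says about each address.
function recheckWithFilterVar(array $addresses)
{
    $results = array();
    foreach ($addresses as $address) {
        $results[$address] =
            filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
    }
    return $results;
}

$results = recheckWithFilterVar(array(
    '::01.02.03.04',
    '0:0:0:255.255.255.255',
    '1fff::a88:85a3::172.31.128.1',
));

foreach ($results as $address => $accepted) {
    echo $address, ' => ', $accepted ? 'accepted' : 'rejected', "\n";
}
```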

I checked some of the other validation filters of filter_var, such as FILTER_VALIDATE_EMAIL and FILTER_VALIDATE_URL, and based on those results I completely wrote off filter_var as useless. The e-mail validation passes on faulty domains (it also passes domain-parts that would be valid as an internal hostname, but there is no flag to specifically check for internet e-mail addresses), and the URL validation is internally based on parse_url, which specifically (and with reason) is not meant for validation - as also stated in the manual itself!

Since even PHP couldn't get it right, I was forced to write my own validation function. I ended up with 4 functions that all passed my 65 testcases, including the tokenized version I wrote earlier. I made the monstrous single-regexp version, but also came up with two versions that first check some general characteristics and then use much simpler expressions to validate the overall syntax.

Here's the function that I finally decided on using:
PHP:
<?php

// Validate a dotted-quad IPv4 address: a well-formed address survives the
// ip2long()/long2ip() round trip unchanged.
function validateIPv4($IP)
{
    return $IP == long2ip(ip2long($IP));
}

function validateIPv6($IP)
{
    // fast exit for very short input: only '::' (the unspecified address) is valid
    if (strlen($IP) < 3)
        return $IP == '::';

    // Check if part is in IPv4 format
    if (strpos($IP, '.'))
    {
        $lastcolon = strrpos($IP, ':');
        if (!($lastcolon && validateIPv4(substr($IP, $lastcolon + 1))))
            return false;

        // replace IPv4 part with dummy
        $IP = substr($IP, 0, $lastcolon) . ':0:0';
    }

    // check uncompressed
    if (strpos($IP, '::') === false)
    {
        return preg_match('/^(?:[a-f0-9]{1,4}:){7}[a-f0-9]{1,4}$/i', $IP);
    }

    // check colon-count for compressed format
    if (substr_count($IP, ':') < 8)
    {
        return preg_match('/^(?::|(?:[a-f0-9]{1,4}:)+):(?:(?:[a-f0-9]{1,4}:)*[a-f0-9]{1,4})?$/i', $IP);
    }

    return false;
}

?>

This wasn't the fastest performer (though only by a small margin), but imho it is the most elegant solution, as it re-uses the check we already had for IPv4 addresses: the long2ip/ip2long trick (instead of a regexp).
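As a usage sketch, the snippet below runs a condensed copy of the two functions above against a handful of addresses, including the three that filter_var let through. Two details differ from the original and are assumptions of this sketch: validateIPv4() checks the ip2long() result explicitly, and validateIPv6() casts the preg_match() result so it always returns a bool. The expected values in `$cases` follow from the function's own logic.

```php
<?php
function validateIPv4($IP)
{
    // Round-trip trick: a well-formed address survives ip2long()/long2ip() unchanged.
    $long = ip2long($IP);
    return $long !== false && $IP === long2ip($long);
}

function validateIPv6($IP)
{
    if (strlen($IP) < 3)
        return $IP == '::';

    if (strpos($IP, '.'))
    {
        $lastcolon = strrpos($IP, ':');
        if (!($lastcolon && validateIPv4(substr($IP, $lastcolon + 1))))
            return false;

        // An IPv4 tail occupies the space of two 16-bit groups.
        $IP = substr($IP, 0, $lastcolon) . ':0:0';
    }

    if (strpos($IP, '::') === false)
        return (bool) preg_match('/^(?:[a-f0-9]{1,4}:){7}[a-f0-9]{1,4}$/i', $IP);

    if (substr_count($IP, ':') < 8)
        return (bool) preg_match('/^(?::|(?:[a-f0-9]{1,4}:)+):(?:(?:[a-f0-9]{1,4}:)*[a-f0-9]{1,4})?$/i', $IP);

    return false;
}

// address => expected result
$cases = array(
    '2001:db8::1'                  => true,
    '::ffff:192.168.1.1'           => true,
    '::01.02.03.04'                => false,
    '0:0:0:255.255.255.255'        => false,
    '1fff::a88:85a3::172.31.128.1' => false,
);

foreach ($cases as $address => $expected) {
    $actual = validateIPv6($address);
    echo $address, ': ', $actual === $expected ? 'ok' : 'MISMATCH', "\n";
}
```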

I put all of my experiments into a single script with benchmark facilities parsing all of my testcases; you can find the source here.


Comments



By Tweakers user Haijo, Friday 12 June 2009 08:19

For curiosity's sake, if the PHP function filter_var works very well for all but three cases why not check for those 3, rather unique, cases afterwards?

By Tweakers user crisp, Friday 12 June 2009 09:05

>For curiosity's sake, if the PHP function filter_var works very well for all but three cases why not check for those 3, rather unique, cases afterwards?
Those cases aren't unique; they're just 3 out of the 65 testcases I constructed. It does however appear that it only involves addresses with an IPv4 part.

However, failing *any* testcase leads to mistrust. I have only briefly skimmed the source code of the validation functions for filter_var (which also confirmed that they're using parse_url internally to validate URLs), so I cannot really comment on what exactly is wrong with this particular filter. The function looks somewhat like my tokenizing function, but I left that path early, so my function is less than optimal as well.

Besides that, filter_var is only available as of PHP version 5.2, whereas my method will work in earlier PHP versions as well.

By Tweakers user Herko_ter_Horst, Friday 12 June 2009 09:09

Or better yet: submit a patch to PHP rather than roll your own and let the rest of the world figure it out for themselves (with the exception of a handful of Tweakers who just happened to read your blog)...

By Tweakers user Johnny, Friday 12 June 2009 11:36

Any developer faced with such an issue would rather quickly enter a phrase like "ipv6 validation php" in a search engine. This page currently ranks 4th on Google, so it won't be a big problem to find it.

By Tweakers user Tyrian, Friday 12 June 2009 12:01

For now there is absolutely no 'need' for IPv6. IPv6 is by now more than 10 years old, and the problem people feared back then (a shortage of IP addresses) has long since been solved by the rise of Network Address Translation (NAT). On top of that, almost all of our switches and routers only work with IPv4 and simply drop IPv6 packets because they are not recognised or supported. Replacing all that hardware is a monumental investment.

Moreover, only half of the IPv4 addresses are in use. Entire class A networks have been assigned to a handful of organisations that do almost nothing with them. If necessary we can put those addresses to use as well, which would free up millions of IPv4 addresses again.

That is why I don't see IPv6 arriving in the next 10 years, and it may eventually be dropped altogether.

[Comment edited on Friday 12 June 2009 18:31]


By Tweakers user Yggdrasil, Friday 12 June 2009 12:58

@Tyrian,

I find that a very naive and uninformed statement.
NAT was indeed a nice solution that bought many years of delay.
Reclaiming unused IP ranges is no longer worthwhile either. All the easily reclaimable blocks have already been reclaimed in recent years.

First, you cannot reclaim all unused addresses, because the subnetting system of IPv4 simply isn't flexible enough for that. If I get 8 addresses and only use 4, I can't give the rest to you. The core routers also wouldn't be able to cope with such complex routing tables (which are already enormous).

Second, it turns out (yes, this has been considered) that reclaiming the large blocks takes almost as long as using them up. The rate at which addresses are consumed is simply high. It would buy three months or so at most.

A switchover is therefore unavoidable. If your equipment isn't ready for it, that is short-sighted on your part. It has been announced for years, so you should have pressured your vendor for firmware updates for your routers. Switches are no problem at all, since they switch frames (layer 2) and so don't notice which IP protocol is in use.

Your opinion is therefore unfounded and poorly informed.

By Arno, Friday 12 June 2009 13:01

>For now there is absolutely no 'need' for IPv6. IPv6 is by now more than
>10 years old, and the problem people feared back then (a shortage of
>IP addresses) has long since been solved by the rise of Network
>Address Translation

Oh please. NAT is not a solution, it is a workaround. The problem is that ISPs do not bear the burden of that workaround, so they can happily go on ignoring IPv6 for another year or two, which is when the pool of IPv4 addresses will have dried up. Only then will they start to feel the same pain that consumers and application writers feel today. Some examples:

- UPnP networking. It's only required to punch holes in a NAT'ing firewall.
- Opening ports on a modem just to play an online game? How many gamers were forced to learn about networking because they couldn't play their game like they should?
- P2P networking: how many users of Kazaa and BitTorrent show up as leeches because they know nothing about the NAT'ing Internet device they are forced to use?
- do you know that it took Microsoft until version 8 of MSN Messenger before you could successfully use a webcam when you were both behind a NAT?

Of course the Internet "still works". ISPs wouldn't get away with anything less. But the reason why it still works is not because of those ISPs. It is because modern applications have been written with workarounds to circumvent NAT.

side note: the term "class A" has been obsolete since the introduction of CIDR in 1993. The correct term is "/8 network".

By Tweakers user Yggdrasil, Friday 12 June 2009 13:35

Thanks crisp,

Exactly what I needed!

I also noticed the filter_var method wasn't good enough. There are a lot of 'good enough' solutions out there, but I kept running into the edge cases.

Your research and testcases give me a lot more confidence in your solution.

P.S. You may want to submit a bug report to PHP, on their built-in methods.

By Tweakers user Tyrian, Friday 12 June 2009 15:01

@Yggdrasil:

My opinion unfounded and poorly informed? Have a listen to the podcast Security Now! by the respected IT security expert Steve Gibson.

Security Now episode 199:
The Geek Atlas, IPv6 & a non-VPN

Steve and Leo explore three topics this week: A terrific new book for geeks and non-geeks alike, the uncertain future of IPv6 (and a few cautions about rushing to adoption) and an idea Steve has been mulling around for a "lightweight" means for making secure Internet connections with a VPN tunnel.
MP3 download
HTML transcript

In 10 years' time I'd like to see where we are with IPv6. I think most of us will simply still be on IPv4.

[Comment edited on Friday 12 June 2009 15:02]


By Tweakers user ari3, Friday 12 June 2009 15:41

A VPN tunnel doesn't solve the problem. At most it is an alternative to NAT. The core of the problem is that there aren't enough network addresses. You don't solve that with a VPN tunnel.

By Tweakers user Tyrian, Friday 12 June 2009 16:02

@ari3:

That VPN tunnel has nothing to do with the IPv6 topic. The podcast isn't only about IPv6; other, unrelated topics are discussed as well, including the VPN.


By Tweakers user Tyrian, Friday 12 June 2009 18:28

@Yggdrasil:

Update to my previous reply: I have to admit that a switch indeed has nothing to do with IP addresses. Steve Gibson had said this during his podcast and has since corrected himself in the just-released episode 200. So: a switch works on layer 2, with MAC addresses, and has nothing to do with IP addresses.

By Tweakers user crisp, Friday 12 June 2009 22:28

Besides the fact that I already see some issues with those expressions, it suggests that you should check the IP against all of them, and if one matches you'd have a valid address. It's a simple approach, but also a more costly one ;)

By Tweakers user Jaap-Jan, Tuesday 16 June 2009 12:37

What the heck? I didn't think that my 'simple' feature request would turn out to be that hard to implement. :P But indeed a quite elegant solution. :)

@Tyrian. I don't understand why you make it out to be so costly? You don't have to switch to IPv6 all at once. When you have to replace your router at some point (I'm going to do that soon too, because my WRT54G will soon no longer be able to keep up with Ziggo's speed increase), you can always look for a model that supports IPv6 or has it in the pipeline. Many consumer routers need replacing due to age, but in corporate networks people can also already keep half an eye on it when buying new routers/layer 3 switches. I think you are parroting Steve Gibson a bit too readily. :)

[Comment edited on Tuesday 16 June 2009 12:39]


By Tweakers user JayVee, Friday 03 July 2009 15:29

I get a timeout on your xs4all page! (www.xs4all.nl does work)

By Martin, Sunday 23 August 2009 23:01

Thanks for the code, this is a very nice function to add to my library!

By WCP, Monday 05 October 2009 17:40

Thank you for the helpful information.

I think there are still two bugs in your validation and the test cases. You have marked the test case 0:a:b:c:d:e:f:: as an invalid address, and your validation agrees. Yet a 1-to-1 implementation of the ABNF from RFC-3986 regards that test case as fully valid. RFC-4291 states that "one or more groups of 16 bits of zeros" can be compressed, which is misleading, because a group consists of at least 2 members. Yet as I see it, it is possible to compress a single zero as well. The RFC also states that "the "::" can also be used to compress leading or trailing zeros in an address". If single zeros are allowed to be compressed at all, then this clearly says that it is done by replacing them with "::". Therefore the above test case is a valid IPv6 address, replacing a single trailing zero with "::". The RFCs may be a bit unclear; nevertheless the ABNF in RFC-3986 favors that interpretation. The RFC also says that zeros "can" be compressed, not "have to", so it's my interpretation that the test cases ::0:a:b:c:d:e:f and a:b:c:d:e:f:0::, which you also marked as invalid, are actually valid examples. Again, the 1-to-1 implementation of the ABNF favors that view... unless I made a mistake implementing it, of course. I guess I'll go with the official ABNF version until proven wrong.

Here's my version:
$dec_octet = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])";
$ipv4address = "($dec_octet\.){3}$dec_octet";
$h16 = "[0-9a-fA-F]{1,4}";
$ls32 = "($h16:$h16|$ipv4address)";
$ipv6address = "(" .
"(" .
"($h16:){6}" .
"|" .
"::($h16:){5}" .
"|" .
"($h16)?::($h16:){4}" .
"|" .
"(($h16:){0,1}$h16)?::($h16:){3}" .
"|" .
"(($h16:){0,2}$h16)?::($h16:){2}" .
"|" .
"(($h16:){0,3}$h16)?::$h16:" .
"|" .
"(($h16:){0,4}$h16)?::" .
")" .
"$ls32" .
"|" .
"(" .
"(($h16:){0,5}$h16)?::$h16" .
"|" .
"(($h16:){0,6}$h16)?::" .
")" .
")";
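The fragments above only define the pattern; they still need delimiters and anchors before they can be used. The following is a compact sketch of how the same RFC-3986 alternatives might be assembled and exercised with preg_match. The non-capturing groups, the reordered dec-octet alternatives, and the variable names are assumptions of this sketch rather than WCP's exact code.

```php
<?php
$decOctet = '(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])';
$ipv4     = "(?:$decOctet\\.){3}$decOctet";
$h16      = '[0-9a-fA-F]{1,4}';
$ls32     = "(?:$h16:$h16|$ipv4)";

// One alternative per line of the IPv6address rule in RFC-3986.
$pattern = '/^(?:'
    . "(?:$h16:){6}$ls32"
    . "|::(?:$h16:){5}$ls32"
    . "|(?:$h16)?::(?:$h16:){4}$ls32"
    . "|(?:(?:$h16:){0,1}$h16)?::(?:$h16:){3}$ls32"
    . "|(?:(?:$h16:){0,2}$h16)?::(?:$h16:){2}$ls32"
    . "|(?:(?:$h16:){0,3}$h16)?::$h16:$ls32"
    . "|(?:(?:$h16:){0,4}$h16)?::$ls32"
    . "|(?:(?:$h16:){0,5}$h16)?::$h16"
    . "|(?:(?:$h16:){0,6}$h16)?::"
    . ')$/';

// The three disputed test cases all match the ABNF.
var_dump(preg_match($pattern, '0:a:b:c:d:e:f::')); // int(1)
var_dump(preg_match($pattern, '::0:a:b:c:d:e:f')); // int(1)
var_dump(preg_match($pattern, 'a:b:c:d:e:f:0::')); // int(1)

// Leading zeros in the IPv4 part do not.
var_dump(preg_match($pattern, '::01.02.03.04'));   // int(0)
```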

By Tweakers user crisp, Wednesday 07 October 2009 00:04

WCP: interesting. Maybe more interesting is the fact that the prose "one or more groups of 16 bits of zeros" has been changed from "multiple groups of 16 bits of zeros" in RFC-2373.

Anyway, the current prose does seem to support the ABNF (although you may argue that substituting a single leading or trailing (group of 16 bits of) zero(s) is not really 'compression'), so most validation routines out there - including mine and PHP's filter_var - are incorrect here :o

By Tweakers user s.stok, Sunday 16 May 2010 16:02

http://www.faqs.org/rfcs/rfc3513.html
>The RFC also says that zeros "can" be compressed, not "have to", so it's my interpretation that the test cases ::0:a:b:c:d:e:f and a:b:c:d:e:f:0:: that you also marked as invalid are actually valid examples.
The use of "::" indicates one or more groups of 16 bits of zeros.
The "::" can only appear once in an address.
So you made a mistake WCP, it can only exist once ;)

By Tweakers user s.stok, Monday 17 May 2010 13:08

http://forums.dartware.com/viewtopic.php?t=452

Your testcases have some bugs; filter_var() works perfectly, by the way ;)
As long as you trim spaces from the beginning and end.

By Tweakers user crisp, Monday 17 May 2010 22:13

@s.stok: would you care to point out which of my testcases are incorrect and why?

And did you read my follow-up post as well?

[Comment edited on Monday 17 May 2010 22:13]


Comments are closed