
Extracting substring


smandape


Hello Seniors, I am trying to extract substrings from the following XML. I want to extract the AltName Full and Short values and process them in a loop, but I am not sure how to do it. I was playing with functions like functx:substring-after-last-match but couldn't get it to work. Could you please help me?

<GBSeq_definition>RecName: Full=Solute carrier family 2, facilitated glucose transporter member 1; AltName: Full=Glucose transporter type 1, erythrocyte/brain; Short=GLUT-1; AltName: Full=HepG2 glucose transporter</GBSeq_definition>

I want the output as:

<aliases><full_name>Glucose transporter type 1, erythrocyte/brain</full_name><short_name>GLUT-1</short_name><full_name>HepG2 glucose transporter</full_name></aliases>

I also have another XML element in the same style:

<GBSeq_comment>On or before Feb 16, 2007 this sequence version replaced gi:121945493, gi:121751.; [FUNCTION] Facilitative glucose transporter. This isoform may be responsible for constitutive or basal glucose uptake. Has a very broad substrate specificity; can transport a wide range of aldoses including both pentoses and hexoses.; [sUBCELLULAR LOCATION] Cell membrane; Multi-pass membrane protein (By similarity). Melanosome. Note=Localizes primarily at the cell surface (By similarity). Identified by mass spectrometry in melanosome fractions from stage I to stage IV.; [TISSUE SPECIFICITY] Expressed at variable levels in many human tissues.; [PTM] Phosphorylated upon DNA damage, probably by ATM or ATR.; [DISEASE] Defects in SLC2A1 are the cause of glucose transporter type 1 deficiency syndrome (GLUT1DS) [MIM:606777]; also known as blood-brain barrier glucose transport defect. This disease causes a defect in glucose transport across the blood-brain barrier. It is characterized by infantile seizures, delayed development, and acquired microcephaly.; [DISEASE] Defects in SLC2A1 are the cause of dystonia type 18 (DYT18) [MIM:612126]. DYT18 is an exercise-induced paroxysmal dystonia/dyskinesia. Dystonia is defined by the presence of sustained involuntary muscle contraction, often leading to abnormal postures. DYT18 is characterized by attacks of involuntary movements triggered by certain stimuli such as sudden movement or prolonged exercise. In some patients involuntary exertion-induced dystonic, choreoathetotic, and ballistic movements may be associated with macrocytic hemolytic anemia.; [sIMILARITY] Belongs to the major facilitator superfamily. Sugar transporter (TC 2.A.1.1) family. Glucose transporter subfamily.; [WEB RESOURCE] Name=GeneReviews; URL="http://www.ncbi.nlm.nih.gov/sites/GeneTests/lab/gene/SLC2A1"; [WEB RESOURCE] Name=Wikipedia; Note=GLUT1 entry; URL="http://en.wikipedia.org/wiki/GLUT1"</GBSeq_comment>

I am trying to extract the part of this called [FUNCTION], which ends before "; [sUBCELLULAR LOCATION]". I cannot do it with substring-after/substring-before. Is there another function that returns just the middle part?

Thank you for your time and help,
Sammed


Hello seniors, I was searching for code that would extract the data above and found that it is easier in XSLT 2.0 than in 1.0. On one of the forums I read a reply from Martin to a similar problem, but I didn't understand it. I am using XSLT 1.0. Can any of you please help me?

Thank you,
Sammed


The substring-before() and substring-after() functions are also present in XSLT 1.0. Just use them.

The only problem with them is that they aren't as flexible as regular expression matches, which are only available in XSLT 2.0.

You could try the EXSLT extensions if basic XSLT 1.0 is not enough.

Which XSLT processor are you using, and in what environment?


Thank you boen_robot for your reply. I am currently using the Saxon 6.5.5 processor in the Oxygen XML editor. I'm not sure about the EXSLT functions, but I am trying to get the hang of them. I tried using substring-before and substring-after, but can I use them together? Also, is there any way to loop so that I can extract the AltName Full and Short values separately from the code below:

<GBSeq_definition>RecName: Full=Solute carrier family 2, facilitated glucose transporter member 1; AltName: Full=Glucose transporter type 1, erythrocyte/brain; Short=GLUT-1; AltName: Full=HepG2 glucose transporter</GBSeq_definition>

and I want to get output like this:

<aliases><full_name>Glucose transporter type 1, erythrocyte/brain</full_name><short_name>GLUT-1</short_name><full_name>HepG2 glucose transporter</full_name></aliases>

Thank you,
Sammed


Is Oxygen XML editor going to be your final environment, or only a testing one? If it's going to be your final one, you might as well switch to Saxon 8, 9, or whatever higher version of Saxon you have. Those versions support XSLT 2.0, and AFAIK Oxygen supports them.

Yes, you can use substring-before() and substring-after() together, or more precisely, within one another, like:

substring-before(substring-after(GBSeq_definition, '='), ';')

(outputs "Solute carrier family 2, facilitated glucose transporter member 1")

But that's not exactly flexible either, and no, you can't really loop in the normal way you would in imperative languages.

If you really must limit yourself to XSLT 1.0, you can use the str:tokenize function/template (Saxon 6 supports func:function, so you can safely use the str:tokenize() function from that page). With it, you can split your string into several nodes at various delimiters (I'm guessing you'll need to first tokenize by ":", then tokenize each second result element by ";", and then tokenize each result node of that by "="). Once you have all the tokens in place, you can manipulate them any way you want.
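The same nesting also answers the earlier question about pulling the [FUNCTION] part out of GBSeq_comment. The following is only a minimal XSLT 1.0 sketch, not a tested solution: it assumes every bracketed section is separated from the next one by "; [" and that this sequence never occurs inside the section text itself, which holds for the sample above but is not confirmed by any format specification.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Everything that follows the [FUNCTION] label
         (empty string if the label is absent) -->
    <xsl:variable name="tail"
                  select="substring-after(//GBSeq_comment, '[FUNCTION]')"/>
    <xsl:choose>
      <!-- Assumption: the next section starts with "; [" -->
      <xsl:when test="contains($tail, '; [')">
        <xsl:value-of select="normalize-space(substring-before($tail, '; ['))"/>
      </xsl:when>
      <!-- [FUNCTION] is the last section: keep the whole remainder -->
      <xsl:otherwise>
        <xsl:value-of select="normalize-space($tail)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

The xsl:choose is there because substring-before() returns an empty string when the delimiter is not found, which would otherwise silently drop a [FUNCTION] section that happens to come last.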


Thank you for your reply boen_robot. I think I will switch to XSLT 2.0. I have the Saxon EE 9.3.0.4, Saxon PE 9.3.0.4, and Saxon HE 9.3.0.4 processors, and I am not sure which would be most appropriate for my research. How can I do the extraction with them, and how can I loop over the results? Can you please let me know?

Thank you,
Sammed


Any one of them will do. "HE" is the free version that only includes XSLT 2.0 support, but doesn't include a schema validator and other similar goodies.

How you do it in XSLT 2.0 is basically the same as with XSLT 1.0, except that the tokenize() function is built in, and you can loop over the results immediately "as is" instead of iterating over temporary "custom" elements.
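As a rough illustration of looping over the built-in tokenize() results, here is a minimal XSLT 2.0 sketch that produces the element-style output asked for at the top of the thread. It is a guess at the format rather than a definitive implementation: it assumes items are separated by ";", that every alias item starts with "Full=" or "Short=" (optionally preceded by "AltName:"), and that values never contain ";". Only the first GBSeq_definition in the document is processed.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <aliases>
      <!-- Drop everything up to the first "AltName:", then split on ";".
           If there is no AltName, the loop body simply never runs. -->
      <xsl:for-each select="tokenize(substring-after((//GBSeq_definition)[1], 'AltName:'), ';')">
        <!-- Strip a leading "AltName:" from later groups and trim whitespace -->
        <xsl:variable name="item"
                      select="normalize-space(replace(., '^\s*AltName:', ''))"/>
        <xsl:choose>
          <xsl:when test="starts-with($item, 'Full=')">
            <full_name>
              <xsl:value-of select="substring-after($item, 'Full=')"/>
            </full_name>
          </xsl:when>
          <xsl:when test="starts-with($item, 'Short=')">
            <short_name>
              <xsl:value-of select="substring-after($item, 'Short=')"/>
            </short_name>
          </xsl:when>
        </xsl:choose>
      </xsl:for-each>
    </aliases>
  </xsl:template>
</xsl:stylesheet>
```

Inside the xsl:for-each the context item is a plain string token, so "." can be passed straight to replace() and normalize-space() without building any temporary elements.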


Thank you once again for your reply boen_robot. When I use the tokenize function, it gives me the output without the pattern. I am confused about how to get the data and how to loop in order to get the output shown in the codebox.

<GBSeq_definition>RecName: Full=Solute carrier family 2, facilitated glucose transporter member 1; AltName: Full=Glucose transporter type 1, erythrocyte/brain; Short=GLUT-1; AltName: Full=HepG2 glucose transporter</GBSeq_definition>

and I want to get output like this:

<aliases> Glucose transporter type 1, erythrocyte/brain;GLUT-1;HepG2 glucose transporter.</aliases>

whereas it gives me the following output:

<field name="alises">RecName: Full=Solute carrier family 2, facilitated glucose transporter member 8; Full=Glucose transporter type 8; Short=GLUT-8; Full=Glucose transporter type X1</field>

I am really sorry to pester you with questions, but I tried it many different ways and also tried to google other approaches, and I was unable to get it working. I really appreciate your help.

Thank you,
Sammed


Do you have a specification of the format? I ask because it seems somewhat awkward.

The basic idea is to find common "tokens" you can split it into using XSLT 2.0's tokenize() function, but the pattern you'd use depends on the format's specs.


That's the hard part: the code and specs are the same as I mentioned. The XML is:

<GBSet>
<GBSeq>
  <GBSeq_locus>GTR8_HUMAN</GBSeq_locus>
  <GBSeq_length>477</GBSeq_length>
  <GBSeq_moltype>AA</GBSeq_moltype>
  <GBSeq_topology>linear</GBSeq_topology>
  <GBSeq_division>PRI</GBSeq_division>
  <GBSeq_update-date>30-NOV-2010</GBSeq_update-date>
  <GBSeq_create-date>16-NOV-2001</GBSeq_create-date>
  <GBSeq_definition>RecName: Full=Solute carrier family 2, facilitated glucose transporter member 8; AltName: Full=Glucose transporter type 8; Short=GLUT-8; AltName: Full=Glucose transporter type X1</GBSeq_definition>
  <GBSeq_primary-accession>Q9NY64</GBSeq_primary-accession>

</GBSeq>
</GBSet>

I am only giving you the initial part; this is the important part of the XML. The output I want for the aliases is as follows:

<aliases>Glucose transporter type 1, erythrocyte/brain;GLUT-1;HepG2 glucose transporter.</aliases>

Thank you,Sammed


No, I meant the formal definition of the format of the data in the "GBSeq_definition" element. You know, just like JSON has a spec, like XML itself has a spec, etc.... a document which says in one way or another "These things are separated by a ';', followed by optional spaces...".


Thank you for your reply boen_robot. It took me almost a day to work out what "specs" means for my file. If I am not wrong, that would be the DTD file that says what GBSeq_definition means. I tried my best to include whatever it has that would help explain the format of GBSeq_definition. The following link provides the format for the data (I tried to put what it says in the codebox, but the link is here for your reference):

http://www.ncbi.nlm.nih.gov/data_specs/dtd...I_GBSeq.mod.dtd

It has something like this:

<!ELEMENT GBSeq (GBSeq_definition,>)
<!ELEMENT GBSeq_definition (#PCDATA)>

Basically, I want to separate the alternate names (AltName) from the recommended names (RecName). The following link is an example that includes the format of GBSeq_definition, in case you need it:

http://www.ncbi.nlm.nih.gov/protein/Q9NY64

And this is the link to the XML file, for your reference:

http://eutils.ncbi.nlm.nih.gov/entrez/euti...amp;retmode=xml

I really thank you for considering this and for your time. I really appreciate your help.

Thank you,
Sammed


I'm talking about the text in the element, not the element itself.

Here's a sample of the kind of description I'm looking for, though it's possible I have it wrong (which is why I'm looking for the formal source):

<Contents> ::= <Group>+
<Group> ::= <Group Identifier>? <Group Content> ";"
<Group Identifier> ::= <Whitespace>* <Group Name> <Whitespace>* ":"
<Whitespace> ::= \s
<Group Name> ::= [A-Za-z]+
<Group Content> ::= <Item>+
<Item> ::= <Whitespace>* <Item Name> <Whitespace>* "=" <Whitespace>* <Item Value> <Whitespace>* ";"
<Item Name> ::= [A-Za-z]+
<Item Value> ::= [^;]+

The notation I'm using here is a sort of BNF kind of notation, but there are a lot of other ways to describe the pattern behind

RecName: Full=Solute carrier family 2, facilitated glucose transporter member 8; AltName: Full=Glucose transporter type 8; Short=GLUT-8; AltName: Full=Glucose transporter type X1


Hey boen_robot, thank you for your reply. I got the solution, something like this:

<field name="aliases">
  <!-- Split everything after the first "AltName:" on ";" -->
  <xsl:for-each select="tokenize(substring-after(GBSeq_definition,'AltName:'),';')">
    <!-- Split each item on "=" and keep only the value part -->
    <xsl:for-each select="tokenize(.,'=')">
      <xsl:if test="position() > 1">
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text> ; </xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:for-each>
</field>

This seems to work. Thank you for your concern, and for your time and help so far.

Thank you,
Sammed
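For anyone who wants the single ";"-separated string without a trailing separator, one possible variant (a sketch only, not the poster's method) uses XPath 2.0's string-join() with a "for" expression instead of nested xsl:for-each. It makes the same assumptions about the format as the solution above: items are separated by ";" and the value is whatever follows the first "=". It reads only the first GBSeq_definition in the document.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <field name="aliases">
      <!-- Take the value after "=" in each ";"-separated item that
           follows the first "AltName:", then join the values with ";" -->
      <xsl:value-of select="string-join(
          for $t in tokenize(substring-after((//GBSeq_definition)[1], 'AltName:'), ';')
          return normalize-space(substring-after($t, '=')),
          ';')"/>
    </field>
  </xsl:template>
</xsl:stylesheet>
```

Because string-join() only places the separator between items, nothing is appended after the last alias.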

